Discussion:
Trouble with Prolog scripts
Lee, Ian
2014-10-08 00:49:33 UTC
Looking for some help with / understanding of how the Prolog scripts work in Slurm.

I have a Slurm 14.03.7 installation on a cluster that I am administering. We want to add a check that a node has enough free space on a particular device (/dev/shm in this case) and, if it does not, set that node to DRAIN with a Reason of "Insufficient diskspace on /dev/shm". For simplicity, imagine I currently have only one node, "n0", in the cluster.

Preparing for this I wrote a simple test prolog script:

#! /bin/bash
exit 10

As I understand it, when I then call srun / sbatch, the prolog script will return a non-zero exit code (10), the node where it failed will go into the "drain" state, and the job should get rescheduled on a different node. However, this does not seem to be happening.

$ srun hostname
n0
# All nodes are idle, srun prints n0


I added the explicit command I would want to run to drain the node to my script before it exits:

#! /bin/bash
scontrol update NodeName=n0 State=drain Reason="Insufficient diskspace on /dev/shm"
exit 10

Then, when I call `srun hostname`, the job still runs and prints the hostname, and the node ends up in the drain state.

$ srun hostname
n0
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 1 drain n0

So I know that the prolog script is running, and if I try to run again the job does not start (it is queued and waits for resources), as I would expect:

$ srun hostname
srun: Required node not available (down or drained)
srun: job 235 queued and waiting for resources

Relevant settings from my slurm.conf file:

Prolog=/etc/slurm/slurm.prolog.sh
PrologFlags=Alloc


My question really is: after the Prolog script fails (exit 10), why does the job continue along?

Alternatively, how would I configure Slurm to do what I really want, which is to drain a node if the free space on a particular disk is insufficient?
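
For illustration, the disk-space check described above could be sketched as a Prolog script along the following lines. This is only a sketch under stated assumptions, not a confirmed fix for the behaviour in question: the 1 GiB threshold, the function names, and the use of `hostname -s` as the Slurm NodeName are all hypothetical choices, and the script drains the node explicitly with scontrol (as in the second test script) rather than relying on the exit code alone.

```shell
#!/bin/bash
# Sketch of a Prolog that drains the node when a mount point is low on space.
# Assumptions (not from the original post): the threshold, the function
# names, and that `hostname -s` matches the node's NodeName in slurm.conf.

MOUNT_POINT=/dev/shm
MIN_FREE_KB=1048576   # hypothetical threshold: 1 GiB, expressed in KiB

# Print the available space, in KiB, on the filesystem holding "$1".
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed only if "$1" has at least "$2" KiB available.
has_enough_space() {
    local avail
    avail=$(free_kb "$1")
    [ -n "$avail" ] && [ "$avail" -ge "$2" ]
}

# Drain this node with the same Reason text used earlier in the post.
drain_node() {
    scontrol update NodeName="$(hostname -s)" State=drain \
        Reason="Insufficient diskspace on $1"
}

# Only act when a job ID is present in the environment, as it is when
# slurmd runs the Prolog for a job.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    if ! has_enough_space "$MOUNT_POINT" "$MIN_FREE_KB"; then
        drain_node "$MOUNT_POINT"
        exit 1
    fi
    exit 0
fi
```

Using `df -P` keeps the output in the fixed POSIX column layout, so the fourth field of the second line is the available space regardless of how long the device name is.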

Thank you,


~ Ian Lee
Lawrence Livermore National Laboratory
(W) 925-423-4941
Kilian Cavalotti
2014-10-08 04:32:31 UTC
Hi Ian,

That doesn't answer your question about prolog scripts, but for that
sort of check, we use NHC
(http://warewulf.lbl.gov/trac/wiki/Node%20Health%20Check). It
integrates very well with Slurm and provides all sorts of ready-to-use
checks.

Cheers,
--
Kilian
Lee, Ian
2014-10-09 16:48:42 UTC
Thanks Kilian,

This looks like exactly what I was looking for.

~ Ian Lee
Lawrence Livermore National Laboratory
(W) 925-423-4941

