Lee, Ian
2014-10-08 00:49:33 UTC
I'm looking for some help with, and a better understanding of, how Prolog scripts work in Slurm.
I have a Slurm 14.03.7 installation on a cluster that I am administering. We want to add a process that checks whether a node has enough disk space on a particular device (/dev/shm in this case) and, if not, sets that node to DRAIN with a Reason of "Insufficient diskspace on /dev/shm". For simplicity, imagine I currently have only one node, "n0", in the cluster.
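For reference, the eventual prolog I have in mind would look roughly like the following. This is only a sketch: the 1 GB threshold is a placeholder, and the df/awk/hostname invocations are just one way of getting the free space and node name.
#! /bin/bash
# Sketch: drain this node and fail the prolog if /dev/shm is low on space.
REQUIRED_KB=1048576                                  # placeholder threshold (~1 GB)
AVAIL_KB=$(df -Pk /dev/shm | awk 'NR==2 {print $4}') # available space in KB
if [ "$AVAIL_KB" -lt "$REQUIRED_KB" ]; then
    scontrol update NodeName="$(hostname -s)" State=drain \
        Reason="Insufficient diskspace on /dev/shm"
    exit 10
fi
exit 0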
Preparing for this, I first wrote a simple test prolog script:
#! /bin/bash
# Test prolog: always fail with a non-zero exit code
exit 10
As I understand it, when I then call srun/sbatch, the prolog script returns a non-zero exit code (10), the node where it failed should go into the "drain" state, and the job should be rescheduled on a different node. However, this does not seem to be happening.
$ srun hostname
n0
# All nodes are idle, srun prints n0
I then added the explicit scontrol command I would want to run to drain the node to my script, just before it exits:
#! /bin/bash
scontrol update NodeName=n0 State=drain Reason="Insufficient diskspace on /dev/shm"
exit 10
When I then call `srun hostname`, the node prints its hostname as before, and it ends up in the drain state.
$ srun hostname
n0
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 1 drain n0
So I know that the prolog script is running, and if I try to run again the job cannot start and just sits in the queue, as I would expect:
$ srun hostname
srun: Required node not available (down or drained)
srun: job 235 queued and waiting for resources
Relevant settings from my slurm.conf file:
Prolog=/etc/slurm/slurm.prolog.sh
PrologFlags=Alloc
My real question is: after the Prolog script fails (exit 10), why does the job run anyway?
Alternatively, how would I configure Slurm to do what I really want, which is to drain a node when the free space on a particular device is insufficient?
Thank you,
~ Ian Lee
Lawrence Livermore National Laboratory
(W) 925-423-4941