Lee, Ian
2014-10-08 00:49:33 UTC
I'm looking for some help with, and a better understanding of, how Prolog scripts work in Slurm.
I have a Slurm 14.03.7 installation on a cluster that I am administering. We want to add a process that checks whether a node has enough disk space on a particular device (/dev/shm in this case) and, if not, sets that node to DRAIN with a Reason of "Insufficient diskspace on /dev/shm". For simplicity, imagine I currently have only one node, "n0", in the cluster.
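For reference, the eventual prolog I have in mind would look roughly like the following. This is only a sketch: the 1 GB threshold is a placeholder, and the df/awk/hostname invocations are just one way of getting the free space and node name.
#! /bin/bash
# Sketch: drain this node and fail the prolog if /dev/shm is low on space.
REQUIRED_KB=1048576                                  # placeholder threshold (~1 GB)
AVAIL_KB=$(df -Pk /dev/shm | awk 'NR==2 {print $4}') # available space in KB
if [ "$AVAIL_KB" -lt "$REQUIRED_KB" ]; then
    scontrol update NodeName="$(hostname -s)" State=drain \
        Reason="Insufficient diskspace on /dev/shm"
    exit 10
fi
exit 0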
Preparing for this, I first wrote a simple test prolog script:
#! /bin/bash
# Test prolog: always fail with a non-zero exit code
exit 10
As I understand it, when I then call srun/sbatch, the prolog script returns a non-zero exit code (10), the node where it failed should go into the "drain" state, and the job should be rescheduled on a different node. However, this does not seem to be happening.
$ srun hostname
n0
# All nodes are idle, srun prints n0
I then added the explicit scontrol command I would want to run to drain the node to my script, just before it exits:
#! /bin/bash
scontrol update NodeName=n0 State=drain Reason="Insufficient diskspace on /dev/shm"
exit 10
When I then call `srun hostname`, the node prints its hostname as before, and it ends up in the drain state.
$ srun hostname
n0
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 1 drain n0
So I know that the prolog script is running, and if I try to run again the job cannot start and just sits in the queue, as I would expect:
$ srun hostname
srun: Required node not available (down or drained)
srun: job 235 queued and waiting for resources
Relevant settings from my slurm.conf file:
Prolog=/etc/slurm/slurm.prolog.sh
PrologFlags=Alloc
My real question is: after the Prolog script fails (exit 10), why does the job run anyway?
Alternatively, how would I configure Slurm to do what I really want, which is to drain a node when the free space on a particular device is insufficient?
Thank you,
~ Ian Lee
Lawrence Livermore National Laboratory
(W) 925-423-4941