Discussion:
starting slurmd only after GPUs are fully initialized
Lev Givon
2014-08-29 16:13:35 UTC
Permalink
I recently set up slurm 2.6.5 on a cluster of Ubuntu 14.04.1 systems hosting several
NVIDIA GPUs configured as generic resources. When the compute nodes are rebooted, they
attempt to start slurmd before the device files created by the nvidia kernel module
appear, i.e., the following message appears in syslog some number of lines before the
GPU kernel driver load messages.

slurmd[1453]: fatal: can't stat gres.conf file /dev/nvidia0: No such file or directory

Is there a recommended way (on Ubuntu, at least) to ensure that slurmd isn't
started before any GPU device files appear?
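For context, the gres.conf on these nodes is essentially of this form, with one File=
line per GPU device (shown here only as a sketch):

Name=gpu File=/dev/nvidia0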
--
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
Andy Riebs
2014-08-29 17:48:36 UTC
Permalink
One way to work around this is to set the node definition(s) in
slurm.conf with "State=DOWN". That way, manual intervention will be
required when a node is rebooted, allowing the rest of the system to
finish coming up.
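For example, a node definition along these lines in slurm.conf (the node name and GRES
count here are only illustrative):

NodeName=node[01-04] Gres=gpu:2 State=DOWN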

Andy
Post by Lev Givon
I recently set up slurm 2.6.5 on a cluster of Ubuntu 14.04.1 systems hosting several
NVIDIA GPUs set up as generic resources. When the compute nodes are rebooted, I
noticed that they attempt to start slurmd before the device files initialized by
the nvidia kernel module appear, i.e., the following message appears in syslog
some number of lines before the GPU kernel driver load messages.
slurmd[1453]: fatal: can't stat gres.conf file /dev/nvidia0: No such file or directory
Is there a recommended way (on Ubuntu, at least) to ensure that slurmd isn't
started before any GPU device files appear?
Lev Givon
2014-08-31 16:32:34 UTC
Permalink
Post by Andy Riebs
Post by Lev Givon
I recently set up slurm 2.6.5 on a cluster of Ubuntu 14.04.1 systems hosting several
NVIDIA GPUs set up as generic resources. When the compute nodes are rebooted, I
noticed that they attempt to start slurmd before the device files initialized by
the nvidia kernel module appear, i.e., the following message appears in syslog
some number of lines before the GPU kernel driver load messages.
slurmd[1453]: fatal: can't stat gres.conf file /dev/nvidia0: No such file or directory
Is there a recommended way (on Ubuntu, at least) to ensure that slurmd isn't
started before any GPU device files appear?
One way to work around this is to set the node definition(s) in
slurm.conf with "State=DOWN". That way, manual intervention will be
required when a node is rebooted, allowing the rest of the system to
finish coming up.
Not sure how the above suggestion remedies the problem; as things stand,
I already need to manually start slurmd on the compute nodes after a
reboot because the absence of the device files prevents the daemon from starting.

Perhaps I should have phrased my question differently: is there a
recommended method on Ubuntu for ensuring that slurmd starts only after the GPU
device files appear if a GPU generic resource has been defined in a node's SLURM
configuration? One possibility that I'll try if no other solutions present
themselves involves modifying the init.d startup script to poll for the device
files if a GPU resource exists, but I'm curious whether there are any existing
fixes given that SLURM packages for Ubuntu have already existed for several years.
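Roughly what I have in mind for the init.d route is the following (a sketch only; the
config path, the grep test, and the timeout are assumptions that would need adjusting
for the actual install):

--------------------------------------------------------------------------------
# guard to add near the top of the slurm init script's start action:
# wait for the nvidia module to create its device files before launching slurmd
if grep -qi 'Gres=gpu' /etc/slurm-llnl/slurm.conf 2>/dev/null ; then
    tries=0
    # poll for up to ~60 seconds
    while [ ! -e /dev/nvidia0 ] && [ $tries -lt 30 ] ; do
        sleep 2
        tries=`expr $tries + 1`
    done
fi
--------------------------------------------------------------------------------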
--
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
Antony Cleave
2014-09-01 15:49:36 UTC
Permalink
I think modifying the init scripts is likely to be the only way:

When I built my own version of slurm 14.03 on Ubuntu 10.04 I installed
both slurm and munge on an NFS filesystem to be sure that slurm.conf was
identical across the cluster. This meant that the default init.d scripts
would fail, as they would always try to start before the
/store/cluster/apps filesystem had been mounted. The way I fixed this
was to create an upstart script for munge (which I then use to trigger
slurm) which is started by the "remote-filesystems" event and then polls
to see whether the directory exists yet, only starting munge once it is a
valid path. You can do exactly the same thing to test for /dev/nvidia0.

Since all of the polling is done in my munge upstart script and not in
slurm, here is my /etc/init/munge.conf.
Please note that this is the first upstart script I ever wrote, so
I'm not claiming this is the best way, only that it works; I haven't even
gone back and cleaned it up.

--------------------------------------------------------------------------------
# Munge (My custom build)
#

description "Munge (My custom build for slurm)"

start on remote-filesystems
stop on runlevel [06S]

respawn

pre-start script
    prefix="/store/cluster/apps/munge/gcc"
    exec_prefix="${prefix}"
    sbindir="${exec_prefix}/sbin"
    sysconfdir="${prefix}/etc"
    localstatedir="${prefix}/var"
    DAEMON="$sbindir/munged"
    RETRYCOUNT=10
    RETRYDELAY=10

    logger -is -t "$UPSTART_JOB" "checking prefix ${prefix}"
    mkdir -p /var/run/munge

    # wait for each required remote directory to appear before starting munged
    for dir in /home/share /store/cluster/apps/munge ; do
        mycount=0
        logger -is -t "$UPSTART_JOB" "checking dir \"$dir\" exists"
        logger -is -t "$UPSTART_JOB" "RETRYCOUNT=$RETRYCOUNT and mycount=$mycount"
        while [ $mycount -lt ${RETRYCOUNT} ] ; do
            mycount=`expr $mycount + 1`
            if [ -d "$dir" ]
            then
                logger -is -t "$UPSTART_JOB" "$dir exists! let's go!"
                break
            else
                logger -is -t "$UPSTART_JOB" "WARNING: required remote dir \"$dir\" not yet mounted; waiting ${RETRYDELAY} seconds to retry (attempt ${mycount} of ${RETRYCOUNT})"
                sleep $RETRYDELAY
            fi
        done
        # if the directory still isn't there after all retries, give up
        if [ ! -d "$dir" ]
        then
            logger -is -t "$UPSTART_JOB" "$dir does not exist; giving up!"
            stop
        fi
    done
end script

expect daemon
exec /store/cluster/apps/munge/gcc/sbin/munged 2>&1
--------------------------------------------------------------------------------------------------------------

I start slurm once munged has started, using the "start on started munge"
upstart directive.
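The slurm job itself is then trivial; something along these lines (again only a sketch,
and the slurmd path is just an example rather than my actual install prefix):

--------------------------------------------------------------------------------
# /etc/init/slurm.conf (sketch; adjust the slurmd path for the local install)
description "slurmd (custom build)"

start on started munge
stop on runlevel [06S]

respawn

# run slurmd in the foreground (-D) so upstart can supervise it directly
exec /store/cluster/apps/slurm/gcc/sbin/slurmd -D
--------------------------------------------------------------------------------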

Hopefully this is a useful example

Antony
Post by Lev Givon
Post by Andy Riebs
Post by Lev Givon
I recently set up slurm 2.6.5 on a cluster of Ubuntu 14.04.1 systems hosting several
NVIDIA GPUs set up as generic resources. When the compute nodes are rebooted, I
noticed that they attempt to start slurmd before the device files initialized by
the nvidia kernel module appear, i.e., the following message appears in syslog
some number of lines before the GPU kernel driver load messages.
slurmd[1453]: fatal: can't stat gres.conf file /dev/nvidia0: No such file or directory
Is there a recommended way (on Ubuntu, at least) to ensure that slurmd isn't
started before any GPU device files appear?
One way to work around this is to set the node definition(s) in
slurm.conf with "State=DOWN". That way, manual intervention will be
required when a node is rebooted, allowing the rest of the system to
finish coming up.
Not sure how the above suggestion remedies the problem; as things stand,
I already need to manually start slurmd on the compute nodes after a
reboot because the absence of the device files prevents the daemon from starting.
Perhaps I should have phrased my question differently: is there a
recommended method on Ubuntu for ensuring that slurmd starts only after the GPU
device files appear if a GPU generic resource has been defined in a node's SLURM
configuration? One possibility that I'll try if no other solutions present
themselves involves modifying the init.d startup script to poll for the device
files if a GPU resource exists, but I'm curious whether there are any existing
fixes given that SLURM packages for Ubuntu have already existed for several years.
Christopher Samuel
2014-08-31 23:45:33 UTC
Permalink
Post by Lev Givon
Is there a recommended way (on Ubuntu, at least) to ensure that slurmd isn't
started before any GPU device files appear?
To be honest, my policy for many years has been never to start queuing
system daemons on boot; it's too easy to have a node go bad, reboot,
come back up, take a job, go bad, reboot, take a job, go bad, reboot,
and repeat until there are no jobs left.

DIMMs go bad, and IB & accelerator cards go bad and cause NMIs; for us
it's not worth the risk.

We rarely reboot nodes other than for hardware failure or a software
upgrade, so if one does go bad we want to find out why before we
let it back into the cluster.

All the best,
Chris
--
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel-***@public.gmane.org Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
Brown George Andrew
2014-09-01 07:26:34 UTC
Permalink
We make use of the node health check (HealthCheckProgram in slurm.conf) to automatically
put nodes online or offline depending on whether things like mount points are available.
If something fails, our script executes an scontrol command to drain the node and updates
the reason with something like "$SCRATCH is not mounted". We don't have nodes start as
online, so if the health check runs and everything looks OK the node gets put online;
otherwise the node remains unavailable for job submission.

As an aside, we have logic in our check so that if the keyword "ADMIN" appears in the
reason field the node health check takes no action.
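In rough outline the check does something like the following (a sketch only, not our
production script; the mount point, the sinfo/scontrol invocations and the ADMIN
handling here are illustrative):

--------------------------------------------------------------------------------
#!/bin/sh
# illustrative node health check, run on each node via HealthCheckProgram

NODE=$(hostname -s)

# leave nodes alone that an admin has flagged manually in the reason field
REASON=$(sinfo -h -n "$NODE" -o %E)
case "$REASON" in
    *ADMIN*) exit 0 ;;
esac

# example check: the scratch filesystem must be mounted
if ! mountpoint -q /scratch ; then
    scontrol update NodeName="$NODE" State=DRAIN Reason="/scratch is not mounted"
    exit 0
fi

# all checks passed: if the node is currently drained, return it to service
STATE=$(sinfo -h -n "$NODE" -o %t)
case "$STATE" in
    drain*|drng*) scontrol update NodeName="$NODE" State=RESUME ;;
esac
--------------------------------------------------------------------------------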

Kind regards,
George