Discussion:
BLCR does not checkpoint jobs
Arjun J Rao
2014-07-04 07:05:36 UTC
I have managed to get SLURM running with checkpointing enabled on my
cluster of two machines, named qdr3 and qdr4. However, I have a problem
when I run the command:
srun -N2 -n24 --checkpoint 1 --checkpoint-dir /home/arjun/ACIM/Ctrl ./MPIJob

The MPIJob code itself runs, but all of the checkpoint requests fail.
slurmctld shows the following messages:

slurmctld: debug3: checkpoint/blcr: sending checkpoint tasks request 3 to 81.0
slurmctld: debug2: Tree head got back 0 looking for 2
slurmctld: debug3: Tree sending to qdr4
slurmctld: debug3: Tree sending to qdr3
slurmctld: debug: _slurm_recv_timeout at 0 of 4, timeout
slurmctld: error: slurm_receive_msgs: Socket timed out on send/recv operation
slurmctld: debug: _slurm_recv_timeout at 0 of 4, timeout
slurmctld: error: slurm_receive_msgs: Socket timed out on send/recv operation
slurmctld: debug3: problems with qdr3
slurmctld: debug3: problems with qdr4
slurmctld: debug2: Tree head got back 2
slurmctld: debug2: Tree head got back 2
slurmctld: error: checkpoint/blcr: error on checkpoint request 3 to 81.0: Communication connection failure
slurmctld: debug: checkpoint/blcr: file /usr/local/sbin/scch not found


What could be causing the checkpoint requests to fail?
Also, is the missing /usr/local/sbin/scch an integral part of the problem?
Christopher Samuel
2014-07-06 23:57:33 UTC
Post by Arjun J Rao
Also, is the missing /usr/local/sbin/scch an integral part of the problem?
I think that's a red herring; the man page for slurm.conf says:

checkpoint/blcr
       Berkeley Lab Checkpoint Restart (BLCR). NOTE: If a file is
       found at sbin/scch (relative to the SLURM installation
       location), it will be executed upon completion of the
       checkpoint. This can be a script used for managing the
       checkpoint files. NOTE: SLURM’s BLCR logic only supports
       batch jobs.

*However*, I think the NOTE at the end may explain it. You say you are running:

srun -N2 -n24 --checkpoint 1 --checkpoint-dir /home/arjun/ACIM/Ctrl ./MPIJob

I think you'll need to do that inside an sbatch script for
this to work.
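
Something along these lines might be a starting point. It is an untested
sketch: it assumes that sbatch accepts the same --checkpoint and
--checkpoint-dir options as the srun command above, and the file name
checkpoint_job.sh is just a placeholder. The snippet only writes the
batch script; submitting it is left as a separate step.

```shell
#!/bin/sh
# Write a batch script that wraps the original srun invocation.
# checkpoint_job.sh is a hypothetical file name chosen for this example.
cat > checkpoint_job.sh <<'EOF'
#!/bin/bash
# Same resources as the interactive run: 2 nodes, 24 tasks.
#SBATCH --nodes=2
#SBATCH --ntasks=24
# Assumption: sbatch takes the same checkpoint options as srun.
#SBATCH --checkpoint=1
#SBATCH --checkpoint-dir=/home/arjun/ACIM/Ctrl

# Launch the MPI program as a job step inside the batch allocation.
srun ./MPIJob
EOF

# Submit it on the cluster with:
#   sbatch checkpoint_job.sh
```

The idea is simply to turn the interactive srun into a batch job, since
the man page says the BLCR logic only supports those.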

Caveat: We've never used this, so YMMV.

All the best,
Chris
--
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel-***@public.gmane.org Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
