Arjun J Rao
2014-07-04 07:05:36 UTC
I have managed to get SLURM running with checkpointing enabled on my
cluster of two machines, named qdr3 and qdr4. However, when I run the
command
srun -N2 -n24 --checkpoint 1 --checkpoint-dir /home/arjun/ACIM/Ctrl ./MPIJob
The MPIJob code does run, but every checkpoint request fails.
slurmctld logs the following messages:
slurmctld: debug3: checkpoint/blcr: sending checkpoint tasks request 3 to 81.0
slurmctld: debug2: Tree head got back 0 looking for 2
slurmctld: debug3: Tree sending to qdr4
slurmctld: debug3: Tree sending to qdr3
slurmctld: debug: _slurm_recv_timeout at 0 of 4, timeout
slurmctld: error: slurm_receive_msgs: Socket timed out on send/recv operation
slurmctld: debug: _slurm_recv_timeout at 0 of 4, timeout
slurmctld: error: slurm_receive_msgs: Socket timed out on send/recv operation
slurmctld: debug3: problems with qdr3
slurmctld: debug3: problems with qdr4
slurmctld: debug2: Tree head got back 2
slurmctld: debug2: Tree head got back 2
slurmctld: error: checkpoint/blcr: error on checkpoint request 3 to 81.0: Communication connection failure
slurmctld: debug: checkpoint/blcr: file /usr/local/sbin/scch not found
What could be causing the checkpoint requests to fail?
Also, is the missing /usr/local/sbin/scch file part of the problem?