Arjun J Rao
2014-06-23 13:21:09 UTC
I have installed SLURM and BLCR with MVAPICH2 on two nodes (named qdr3 and
qdr4).
When I run the command
srun -N2 -n24 --checkpoint 1 --checkpoint-dir /home/arjun/ACIM/Ctrl ./MPIJob
The code runs fine, but after some time I get these messages from the
slurmctld daemon:
checkpoint/blcr : Sending checkpoint tasks request 3 to 76.0 (job id.step id)
...
checkpoint/blcr : error on checkpoint request 3 to 76.0 : Communication connection failure
I have attached the full output of slurmctld and slurmd. Why is the
checkpoint request failing? Munge is running fine on both machines.