Paul Edmon
2014-04-21 20:00:32 UTC
Occasionally when we reset the master some of our nodes go into an
unknown state or take a while to get back in contact with the master. If
srun is being launched on the nodes at that time it tends to hang,
which causes the mpirun that depends on that srun to fail. Even
stranger, the sbatch that originally launched the srun keeps running
instead of failing outright.
Is there a way to have srun wait until the master comes back instead of
failing? Or is the timeout the only way to control this? If neither is
possible, can we have the parent sbatch die with an error rather than
have srun just hang?
Thanks for any insight.
-Paul Edmon-