Discussion:
srun and node unknown state
Paul Edmon
2014-04-21 20:00:32 UTC
Occasionally, when we restart the master, some of our nodes go into an
unknown state or take a while to get back in contact with it. If srun
is launched on the nodes at that time, it tends to hang, which causes
the mpirun that depends on that srun to fail. Even stranger, the sbatch
that originally launched the srun keeps running rather than failing
outright.
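
For context, this is how we normally spot and recover the affected nodes; a
short sketch using standard sinfo/scontrol commands (the node name is just an
example):

# list nodes that are down or drained, with the recorded reason
sinfo -R
# inspect a specific node's state (itc041 is only an example)
scontrol show node itc041 | grep -i State
# return the node to service once slurmd has re-registered with the controller
scontrol update NodeName=itc041 State=RESUME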

Is there a way to prevent srun from failing and instead have it wait
until the master comes back? Or is adjusting the timeout the only way
to handle this? Or, if that isn't possible, can we have the parent
sbatch die with an error rather than have srun just hang?
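
For reference, the timeouts I'm aware of live in slurm.conf; a rough sketch
(the values are purely illustrative, not what we actually run):

# slurm.conf fragment (illustrative values only)
MessageTimeout=60        # seconds allowed for a single RPC round trip
SlurmdTimeout=300        # how long slurmctld waits for slurmd before marking a node DOWN
SlurmctldTimeout=120     # how long a backup controller waits before taking over
BackupController=ctl-backup   # hypothetical backup controller hostname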

Thanks for any insight.

-Paul Edmon-
Paul Edmon
2014-04-21 20:16:30 UTC
For a relevant error:

Apr 20 17:45:19 itc041 slurmd[50099]: launch task 9285091.51 request
from 56441.33234-Fg5fAaHyjJnrJ5Amzi+***@public.gmane.org (port 704)
Apr 20 17:45:19 itc041 slurmstepd[9850]: switch NONE plugin loaded
Apr 20 17:45:19 itc041 slurmstepd[9850]: AcctGatherProfile NONE plugin
loaded
Apr 20 17:45:19 itc041 slurmstepd[9850]: AcctGatherEnergy NONE plugin loaded
Apr 20 17:45:19 itc041 slurmstepd[9850]: AcctGatherInfiniband NONE
plugin loaded
Apr 20 17:45:19 itc041 slurmstepd[9850]: AcctGatherFilesystem NONE
plugin loaded
Apr 20 17:45:19 itc041 slurmstepd[9850]: Job accounting gather LINUX
plugin loaded
Apr 20 17:45:19 itc041 slurmstepd[9850]: task NONE plugin loaded
Apr 20 17:45:19 itc041 slurmstepd[9850]: Checkpoint plugin loaded:
checkpoint/none
Apr 20 17:45:19 itc041 slurmd[itc041][9850]: debug level = 2
Apr 20 17:45:19 itc041 slurmd[itc041][9850]: task 0 (9855) started
2014-04-20T17:45:19
Apr 20 17:45:19 itc041 slurmd[itc041][9850]: auth plugin for Munge
(http://code.google.com/p/munge/) loaded
Apr 20 17:51:54 itc041 slurmd[itc041][9850]: task 0 (9855) exited with
exit code 0.
Apr 20 17:53:35 itc041 slurmd[itc041][9850]: error: slurm_receive_msg:
Socket timed out on send/recv operation
Apr 20 17:53:35 itc041 slurmd[itc041][9850]: error: Rank 0 failed
sending step completion message directly to slurmctld (0.0.0.0:0), retrying
Apr 20 17:53:59 itc041 slurmd[itc041][9850]: error: Abandoning IO 60
secs after job shutdown initiated
Apr 20 17:54:35 itc041 slurmd[itc041][9850]: Rank 0 sent step completion
message directly to slurmctld (0.0.0.0:0)
Apr 20 17:54:35 itc041 slurmd[itc041][9850]: done with job

Is there any way to prevent this? When this fails it creates a zombie
task that holds the job open. I think part of the reason is that the
user is looping over mpirun calls in the batch script, roughly:

for i in $(seq 1 1000); do
    mpirun -np 64 ./executable
done

Each run lasts about five minutes. If one of the mpirun calls fails to
launch, the whole thing hangs. It would be better if srun kept trying
instead of just failing.
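
As a stopgap, the batch script itself could retry a failed launch; a rough
bash sketch of what I have in mind (the retry count and sleep interval are
arbitrary):

for i in $(seq 1 1000); do
    for attempt in 1 2 3; do
        # stop retrying as soon as a launch succeeds
        mpirun -np 64 ./executable && break
        # note: this also retries genuine application failures, not just launch failures
        echo "run $i attempt $attempt failed, retrying" >&2
        sleep 60
    done
done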

-Paul Edmon-
Post by Paul Edmon
Occasionally, when we restart the master, some of our nodes go into an
unknown state or take a while to get back in contact with it. If srun
is launched on the nodes at that time, it tends to hang, which causes
the mpirun that depends on that srun to fail. Even stranger, the sbatch
that originally launched the srun keeps running rather than failing
outright.
Is there a way to prevent srun from failing and instead have it wait
until the master comes back? Or is adjusting the timeout the only way
to handle this? Or, if that isn't possible, can we have the parent
sbatch die with an error rather than have srun just hang?
Thanks for any insight.
-Paul Edmon-
Loris Bennett
2014-06-26 09:13:34 UTC
Hi Paul,
Post by Paul Edmon
Is there any way to prevent this? When this fails it creates a zombie
task that holds the job open. ... It would be better if srun kept
trying instead of just failing.
-Paul Edmon-
Did you ever get to the bottom of this? We are seeing something similar
with Slurm 2.4.5 and a user running a script which generates batch
scripts and submits them within a loop.
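
Roughly, the submission pattern looks like the following (simplified; the
generator script name is made up):

for n in $(seq 1 500); do
    # make_job_script.sh stands in for the user's actual generator
    ./make_job_script.sh "$n" > "job_${n}.sbatch"
    sbatch "job_${n}.sbatch" || echo "sbatch failed for job_${n}" >&2
done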

Cheers,

Loris
--
This signature is currently under construction.