Discussion:
Nodes in a perpetual "drain" state.
Arjun J Rao
2014-06-27 07:25:33 UTC
I have SLURM set up on a cluster of 2 nodes, qdr[3-4].
Running sinfo shows the two nodes to be in a perpetual drain state.

sinfo -R yields the following:
REASON        USER  TIMESTAMP            NODELIST
Epilog error  root  2014-02-03T15:53:40  qdr3
Epilog error  root  2014-02-03T15:52:42  qdr4

The epilog error occurred on 3rd February! (More than 4 months ago)

Why is this happening?
Paddy Doyle
2014-06-27 09:04:36 UTC
Hi Arjun,
Maybe an obvious question, but have you set the nodes to be 'resume' or 'idle'
using scontrol since then? In our setup at least, once a node is marked 'down',
we have to manually clear it to either 'resume' or 'idle'.
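
A minimal sketch of that manual clearing step, assuming the node list qdr[3-4] from this thread. NODES, DRY_RUN and resume_cmd are illustrative names only, not part of SLURM; by default the script just prints the commands it would run.

```shell
#!/bin/sh
# Sketch: clear a stale drain/down flag on SLURM nodes.
# NODES, DRY_RUN and resume_cmd are hypothetical helper names.
NODES="${NODES:-qdr[3-4]}"   # node list taken from this thread
DRY_RUN="${DRY_RUN:-1}"      # 1 = only print the commands

# Print the scontrol command that returns the given nodes to service.
resume_cmd() {
    printf 'scontrol update NodeName=%s State=RESUME\n' "$1"
}

if [ "$DRY_RUN" = "1" ]; then
    echo "sinfo -R"          # would list the recorded drain reasons
    resume_cmd "$NODES"
else
    sinfo -R                 # confirm the recorded reason is stale
    # Quote the node list so the shell does not glob the brackets.
    scontrol update "NodeName=$NODES" State=RESUME
    sinfo -N -l              # check that the nodes now report idle
fi
```

Running it with DRY_RUN=0 on the controller (as root or the SLURM admin user) would apply the change. This only clears the flag: if the epilog script still exits non-zero, the nodes will most likely drain again after the next job, so the epilog itself is worth checking too.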

Paddy
--
Paddy Doyle
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
Phone: +353-1-896-3725
http://www.tchpc.tcd.ie/
Arjun J Rao
2014-06-27 13:25:34 UTC
I didn't mark the node as "drained".
But after issuing the command scontrol update NodeName="qdr3" State="IDLE",
sinfo showed both nodes to be idle and usable.
I was also able to execute MPI jobs.

Thanks.