Discussion:
slurm reconnect on unscheduled reboot
Brian B
2014-09-09 13:11:37 UTC
Permalink
Greetings,

Is it possible to have slurm compute nodes added back into a partition after they perform an unscheduled reboot? That is, we have some machines that fail during brownouts (we are working on solving that problem) and reboot when this occurs. They come back fine, but slurm doesn’t add them back into the partition. I am able to do so with scontrol by updating their state to IDLE. Can this be automated?
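
For reference, the manual fix looks roughly like this (the node name is just an example):

  scontrol update NodeName=node01 State=IDLE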

Regards,
Brian
Brian Schwark
2014-09-09 13:37:31 UTC
Permalink
This is controlled by the slurm.conf directive 'ReturnToService'.

From https://computing.llnl.gov/linux/slurm/slurm.conf.html

*ReturnToService* Controls when a DOWN node will be returned to service.
The default value is 0. Supported values include:

*0* A node will remain in the DOWN state until a system administrator
explicitly changes its state (even if the slurmd daemon registers and
resumes communications).

*1* A DOWN node will become available for use upon registration with a
valid configuration only if it was set DOWN due to being non-responsive.
If the node was set DOWN for any other reason (low memory, prolog failure,
epilog failure, unexpected reboot, etc.), its state will not automatically
be changed.

*2* A DOWN node will become available for use upon registration with a
valid configuration. The node could have been set DOWN for any reason.
(Disabled on Cray systems.)
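
So for the brown-out case here (an unexpected reboot), value 2 is the one that applies. A minimal slurm.conf sketch, assuming the change is followed by an 'scontrol reconfigure' or a restart of slurmctld:

  # Allow a DOWN node to return to service as soon as slurmd registers
  # with a valid configuration, regardless of why it was set DOWN
  # (including an unexpected reboot).
  ReturnToService=2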


Brian
Post by Brian B
Greetings,
Is it possible to have slurm compute nodes added back into a partition
after they perform an unscheduled reboot? That is, we have some machines
that fail during brownouts (we are working on solving that problem) and
reboot when this occurs. They come back fine, but slurm doesn’t add them
back into the partition. I am able to do so with scontrol by updating their
state to IDLE. Can this be automated?
Regards,
Brian
j***@public.gmane.org
2014-09-09 15:01:54 UTC
Permalink
Note that all of the documentation at LLNL is three years old. The current
documentation is at
http://slurm.schedmd.com/slurm.conf.html
Post by Brian Schwark
This is controlled by the slurm.conf directive 'ReturnToService'.
From https://computing.llnl.gov/linux/slurm/slurm.conf.html
*ReturnToService* Controls when a DOWN node will be returned to service.
The default value is 0. Supported values include:

*0* A node will remain in the DOWN state until a system administrator
explicitly changes its state (even if the slurmd daemon registers and
resumes communications).

*1* A DOWN node will become available for use upon registration with a
valid configuration only if it was set DOWN due to being non-responsive.
If the node was set DOWN for any other reason (low memory, prolog failure,
epilog failure, unexpected reboot, etc.), its state will not automatically
be changed.

*2* A DOWN node will become available for use upon registration with a
valid configuration. The node could have been set DOWN for any reason.
(Disabled on Cray systems.)
Brian
Post by Brian B
Greetings,
Is it possible to have slurm compute nodes added back into a partition
after they perform an unscheduled reboot? That is, we have some machines
that fail during brownouts (we are working on solving that problem) and
reboot when this occurs. They come back fine, but slurm doesn’t add them
back into the partition. I am able to do so with scontrol by updating their
state to IDLE. Can this be automated?
Regards,
Brian
--
Morris "Moe" Jette
CTO, SchedMD LLC

Slurm User Group Meeting
September 23-24, Lugano, Switzerland
Find out more http://slurm.schedmd.com/slurm_ug_agenda.html