Uwe Sauter
2014-08-28 11:18:34 UTC
Hi all,
(configuration and scripts below text)
I have configured SLURM to power down idle nodes, but the setup is
probably misconfigured. I am aiming for a configuration where idle nodes
are powered down after a certain period (say 10 min).
As you can see from the configuration below, I have SLURM call either
"node_poweroff.slurm" or "node_poweron.slurm". These are wrapper scripts
that expand SLURM's nodelist syntax and call "node_poweroff" or
"node_poweron" once for each node.
"node_power{off,on}" log their actions to /var/log/slurm/powermgmt.log
so that I can follow, and later analyze, which nodes were turned off
and on.
The current situation is that, although I see 36 out of 54 nodes in an
IDLE+POWER state, all nodes are powered on and accessible via SSH.
Output from "grep -i power /var/log/slurm/slurmctld.log | tail"
[2014-08-28T12:01:24.975] Power save mode: 30 nodes
[2014-08-28T12:11:44.080] Power save mode: 30 nodes
[2014-08-28T12:22:44.194] Power save mode: 30 nodes
[2014-08-28T12:33:44.306] Power save mode: 30 nodes
[2014-08-28T12:44:01.425] Power save mode: 30 nodes
[2014-08-28T12:51:44.514] power_save: suspending nodes n[510301,510601,511901]
[2014-08-28T12:54:26.547] Power save mode: 33 nodes
[2014-08-28T12:54:26.547] power_save: suspending nodes n[511101,512501]
[2014-08-28T12:57:08.581] power_save: suspending nodes n510901
[2014-08-28T13:05:10.666] Power save mode: 36 nodes
Output from "tail /var/log/slurm/powermgmt.log"
2014-08-27 16:39:36 power on n512501
2014-08-27 16:51:17 power on n512601
2014-08-27 17:59:38 power on n512601
2014-08-28 09:05:54 power on n511101
2014-08-28 09:06:05 power on n511201
2014-08-28 09:06:11 power on n512001
2014-08-28 09:06:19 power on n512201
2014-08-28 10:41:51 power on n510501
2014-08-28 10:41:51 power on n510701
2014-08-28 11:31:41 power on n511101
grep does not find "down" in /var/log/slurm/powermgmt.log which it
should if "node_poweroff" has been executed.
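A minimal sketch of how this check could be made more explicit, assuming every log line has the form "<date> <time> power <on|down> <node>" as written by the scripts below. The function name count_power_events is made up for illustration and is not part of my setup:

```shell
#!/bin/bash
# Hypothetical helper: count "power on" and "power down" entries in a
# power management log. Assumes the third field is the literal word
# "power" and the fourth field is either "on" or "down".
count_power_events() {
    local logfile=$1
    awk '$3 == "power" { n[$4]++ }
         END { printf "on=%d down=%d\n", n["on"], n["down"] }' "${logfile}"
}
```

Called as `count_power_events /var/log/slurm/powermgmt.log`, it prints both counters on one line. A "down" count of zero would match what grep reports here.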
My impression is that something (a misconfiguration? a bad sudo
configuration? some other permissions issue?) prevents SLURM from
executing one of the mentioned scripts.
Can someone check my configuration and give some advice on how to debug
this issue further?
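One way to narrow this down might be to record what actually happens when the wrapper runs. The following is only a sketch: the helper name run_logged and the debug log path are assumptions for illustration, not something that exists in my setup.

```shell
#!/bin/bash
# Hypothetical helper: run a command, append its combined stdout/stderr
# and its exit status to a debug log, and pass the exit status through.
# The default log path is an assumption; any file writable by the
# calling user would do.
DEBUG_LOG=${DEBUG_LOG:-/tmp/powermgmt-debug.log}

run_logged() {
    local rc=0
    echo "$(date +'%F %T') uid=$(id -u) running: $*" >> "${DEBUG_LOG}"
    "$@" >> "${DEBUG_LOG}" 2>&1 || rc=$?
    echo "$(date +'%F %T') exit status ${rc}: $*" >> "${DEBUG_LOG}"
    return "${rc}"
}
```

In "node_poweroff.slurm" the sudo line could then be written as `run_logged sudo -n /opt/system/slurm/etc/node_poweroff "${NODE}"`. With `-n`, sudo fails instead of prompting for a password, so any message it prints (for example about a missing tty or a required password) would end up in the debug log together with the user ID the script runs as.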
Thank you,
Uwe
### slurm.conf excerpt ###
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
SuspendTime=600
SuspendRate=30
ResumeRate=10
SuspendProgram=/opt/system/slurm/etc/node_poweroff.slurm
ResumeProgram=/opt/system/slurm/etc/node_poweron.slurm
SuspendTimeout=120
ResumeTimeout=300
#SuspendExcNodes=n51[03,04,29,30][01],n52[04,05][01]
#SuspendExcParts=
BatchStartTimeout=60
##########################
### /opt/system/slurm/etc/node_poweroff.slurm ###
#!/bin/bash
set -o nounset
# Expand SLURM's nodelist expression (e.g. n[510301,510601]) into one hostname per line
NODES=$(/opt/system/slurm/default/bin/scontrol show hostnames "$1")
for NODE in ${NODES}; do
    sudo /opt/system/slurm/etc/node_poweroff "${NODE}"
done
exit 0
#################################################
### /opt/system/slurm/etc/node_poweron.slurm ###
#!/bin/bash
set -o nounset
# Expand SLURM's nodelist expression into one hostname per line
NODES=$(/opt/system/slurm/default/bin/scontrol show hostnames "$1")
for NODE in ${NODES}; do
    /opt/system/slurm/etc/node_poweron "${NODE}"
done
#################################################
### /opt/system/slurm/etc/node_poweroff ###
#!/bin/bash
set -o nounset
NODE=$1
echo "$(date +'%F %T') power down ${NODE}" >> /var/log/slurm/powermgmt.log
ssh ${NODE} "/etc/init.d/lustre_client stop"
ssh ${NODE} "umount /localscratch /nfs/*"
ssh ${NODE} "service slurm stop"
ssh ${NODE} "service munge stop"
ssh ${NODE} "poweroff"
sleep 10
ping -c1 ${NODE} >/dev/null 2>&1
[ $? -eq 0 ] && /usr/bin/ipmitool -Ilanplus -UADMIN -Pxxxxx -H ${NODE}-bmc power off
exit 0
#############################################
### /opt/system/slurm/etc/node_poweron ###
#!/bin/bash
set -o nounset
NODE=${1}
echo "$(date +'%F %T') power on ${NODE}" >> /var/log/slurm/powermgmt.log
/usr/bin/ipmitool -Ilanplus -UADMIN -Pxxxxx -H ${NODE}-bmc power on
exit 0
##########################################
### /etc/sudoers excerpt ###
slurm ALL=NOPASSWD: /opt/system/slurm/etc/node_poweron
slurm ALL=NOPASSWD: /opt/system/slurm/etc/node_poweroff
############################