Kevin Abbey
2014-10-13 23:48:44 UTC
Hi,
I upgraded from 2.5.3 to 14.03.8 a few days ago. We have nodes shared
by multiple jobs. After the upgrade we noticed that as soon as one of
the jobs on a node ended, all the other jobs on that node were killed
as well. I'm not sure whether I broke the system during the upgrade; I
did not examine the database. I checked the epilog, since that is a
likely source, and noticed that the epilog shipped with the new version
still uses the $SLURM_UID variable, the same as the script I had been
using with 2.5.3. However, that variable is no longer present when I do
salloc and then env | grep SLURM. I pasted the output below.
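To see what the epilog itself receives (which is not necessarily the
same environment that salloc shows), you can temporarily add a line
like this near the top of the epilog; the log path here is only an
example:

# Temporary debugging: record the SLURM_* variables that slurmd passes
# to the epilog (the /tmp path is just an example, use any writable spot).
env | grep '^SLURM' >> "/tmp/epilog_env.${SLURM_JOB_ID:-unknown}"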
The epilog file provided with 14.03.8 is the same as the one from
2.5.3, so since $SLURM_UID is no longer present I modified the script
to use $SLURM_JOB_UID instead. It appears to be working as expected
now. Also, if you set the path to the slurm bin in your environment
without a trailing slash, be sure to add the / before squeue on the
job list line.
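Alternatively, to make the script tolerant of SLURM_BIN with or
without a trailing slash, something like this near the top should work
(an untested sketch, not part of the distributed script):

# Normalize SLURM_BIN to end in exactly one slash whenever it is set,
# so ${SLURM_BIN}squeue resolves either way.
if [ -n "$SLURM_BIN" ] ; then
    SLURM_BIN="${SLURM_BIN%/}/"
fi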
I have marked my changes in the diff below; they are the lines
prefixed with +. Can anyone confirm this problem and the solution? If
there is something wrong in my database, I'd like to find it.
Thank you,
Kevin
--- slurm.epilog.clean.default 2014-10-13 18:39:29.146111716 -0400
+++ slurm.epilog.clean 2014-10-13 18:51:51.192922992 -0400
@@ -8,7 +8,7 @@
# SLURM_BIN can be used for testing with private version of SLURM
#SLURM_BIN="/usr/bin/"
#
-if [ x$SLURM_UID = "x" ] ; then
+if [ x$SLURM_JOB_UID = "x" ] ; then
exit 0
fi
if [ x$SLURM_JOB_ID = "x" ] ; then
@@ -18,20 +18,19 @@
#
# Don't try to kill user root or system daemon jobs
#
-if [ $SLURM_UID -lt 100 ] ; then
+if [ $SLURM_JOB_UID -lt 100 ] ; then
exit 0
fi
-job_list=`${SLURM_BIN}squeue --noheader --format=%A --user=$SLURM_UID --node=localhost`
+job_list=`${SLURM_BIN}/squeue --noheader --format=%i --user=$SLURM_JOB_UID --node=localhost`
for job_id in $job_list
do
if [ $job_id -ne $SLURM_JOB_ID ] ; then
exit 0
fi
done
-
#
# No other SLURM jobs, purge all remaining processes of this user
#
-pkill -KILL -U $SLURM_UID
+pkill -KILL -U $SLURM_JOB_UID
exit 0
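For reference, the new job list line can be sanity-checked by hand.
With the uid and node from my test session below, it would look like
this (example values):

# List this user's job ids on this node; more than one id means other
# jobs are still running, so the epilog must not pkill anything.
squeue --noheader --format=%i --user=12901 --node=node12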
==============================================================
[***@kestrel slurm]$ salloc --partition=batch --nodelist=node12 --mem=30G
salloc: Granted job allocation 16010
[***@node12 slurm]$ env | grep SLURM
SLURM_CHECKPOINT_IMAGE_DIR=/g1/home/kabbey/software_tests/slurm
SLURM_NODELIST=node12
SLURMD_NODENAME=node12
SLURM_TOPOLOGY_ADDR=node12
SLURM_PRIO_PROCESS=0
SLURM_SRUN_COMM_PORT=48272
SLURM_PTY_WIN_ROW=40
SLURM_TOPOLOGY_ADDR_PATTERN=node
SLURM_NNODES=1
SLURM_STEP_NUM_NODES=1
SLURM_JOBID=16010
SLURM_NTASKS=1
SLURM_LAUNCH_NODE_IPADDR=192.168.0.169
SLURM_STEP_ID=0
SLURM_STEP_LAUNCHER_PORT=48272
SLURM_TASKS_PER_NODE=1
SLURM_JOB_ID=16010
SLURM_JOB_USER=kabbey
SLURM_STEPID=0
SLURM_SRUN_COMM_HOST=192.168.0.169
SLURM_PTY_WIN_COL=209
SLURM_JOB_UID=12901
SLURM_NODEID=0
SLURM_SUBMIT_DIR=/g1/home/kabbey/software_tests/slurm
SLURM_TASK_PID=85622
SLURM_NPROCS=1
SLURM=/g0/opt/slurm/14.03.8
SLURM_CPUS_ON_NODE=2
SLURM_DISTRIBUTION=cyclic
SLURM_PROCID=0
SLURM_JOB_NODELIST=node12
SLURM_PTY_PORT=52017
SLURM_LOCALID=0
SLURM_JOB_CPUS_PER_NODE=2
SLURM_GTIDS=0
SLURM_SUBMIT_HOST=kestrel.ccib.rutgers.edu
SLURM_JOB_PARTITION=batch
SLURM_STEP_NUM_TASKS=1
SLURM_JOB_NUM_NODES=1
SLURM_STEP_TASKS_PER_NODE=1
SLURM_STEP_NODELIST=node12
SLURM_MEM_PER_NODE=30720
[***@node12 slurm]$
--
Kevin Abbey
Systems Administrator
Center for Computational and Integrative Biology (CCIB)
http://ccib.camden.rutgers.edu/
Rutgers University - Science Building
315 Penn St.
Camden, NJ 08102
Telephone: (856) 225-6770
Fax: (856) 225-6312
Email: kevin.abbey-***@public.gmane.org