Discussion:
upgrade 2.5.3 to 14.03.8 documentation
Kevin Abbey
2014-10-13 23:48:44 UTC
Hi,

I upgraded from 2.5.3 to 14.03.8 a few days ago. We share nodes among
multiple jobs. After the upgrade we noticed that as soon as one of the
jobs on a node ended, all of the other jobs on that node were killed.
I'm not sure whether I broke the system during the upgrade; I did not
examine the database. I checked the epilog, since that is a likely
source, and noticed that the variable $SLURM_UID, which the script I
had been using from 2.5.3 relies on, is still present in the version
shipped with the new release. However, that variable no longer appears
when I do salloc and then env | grep SLURM. I pasted the output below.

The epilog file provided with 14.03.8 is the same as the one from
2.5.3, but since $SLURM_UID is not set I changed the script to use
$SLURM_JOB_UID instead. It appears to be working as expected now.
Also, if you set the path to the Slurm bin directory via SLURM_BIN in
your environment, be sure to add the / before squeue on the job-list
line.
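To illustrate the slash issue (the paths here are made up): the default
#SLURM_BIN="/usr/bin/" ends in a slash, so ${SLURM_BIN}squeue happens to
work, but a value set without the trailing slash silently concatenates
into a bogus command name.

```shell
#!/bin/sh
# Hypothetical SLURM_BIN set without a trailing slash:
SLURM_BIN="/opt/slurm/bin"
echo "${SLURM_BIN}squeue"     # /opt/slurm/binsqueue  -- not a real command
echo "${SLURM_BIN}/squeue"    # /opt/slurm/bin/squeue -- what was intended
```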

I have marked the changes in the diff below. Can anyone confirm this
problem and the fix? If there is something wrong in my database, I'd
like to find it.

Thank you,
Kevin




--- slurm.epilog.clean.default 2014-10-13 18:39:29.146111716 -0400
+++ slurm.epilog.clean 2014-10-13 18:51:51.192922992 -0400
@@ -8,7 +8,7 @@
 # SLURM_BIN can be used for testing with private version of SLURM
 #SLURM_BIN="/usr/bin/"
 #
-if [ x$SLURM_UID = "x" ] ; then
+if [ x$SLURM_JOB_UID = "x" ] ; then
 exit 0
 fi
 if [ x$SLURM_JOB_ID = "x" ] ; then
@@ -18,20 +18,19 @@
 #
 # Don't try to kill user root or system daemon jobs
 #
-if [ $SLURM_UID -lt 100 ] ; then
+if [ $SLURM_JOB_UID -lt 100 ] ; then
 exit 0
 fi

-job_list=`${SLURM_BIN}squeue --noheader --format=%A --user=$SLURM_UID --node=localhost`
+job_list=`${SLURM_BIN}/squeue --noheader --format=%i --user=$SLURM_JOB_UID --node=localhost`
 for job_id in $job_list
 do
 if [ $job_id -ne $SLURM_JOB_ID ] ; then
 exit 0
 fi
 done
-
 #
 # No other SLURM jobs, purge all remaining processes of this user
 #
-pkill -KILL -U $SLURM_UID
+pkill -KILL -U $SLURM_JOB_UID
 exit 0
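For what it's worth, instead of renaming the variable, a version-agnostic
guard could fall back from one name to the other so the same epilog runs
under both 2.5.x and 14.03.x. This is a sketch only, not the shipped
script; the echo stands in for the squeue/pkill logic:

```shell
#!/bin/sh
# Sketch: accept whichever uid variable the daemon exported.
# SLURM_UID (old name) wins if set; otherwise fall back to SLURM_JOB_UID.
uid="${SLURM_UID:-$SLURM_JOB_UID}"
if [ "x$uid" = "x" ] ; then
    exit 0
fi
if [ "$uid" -lt 100 ] ; then
    # Don't kill user root or system daemon jobs
    exit 0
fi
echo "would check remaining jobs for uid $uid"
```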






==============================================================

[***@kestrel slurm]$ salloc --partition=batch --nodelist=node12 --mem=30G
salloc: Granted job allocation 16010




[***@node12 slurm]$ env | grep SLURM
SLURM_CHECKPOINT_IMAGE_DIR=/g1/home/kabbey/software_tests/slurm
SLURM_NODELIST=node12
SLURMD_NODENAME=node12
SLURM_TOPOLOGY_ADDR=node12
SLURM_PRIO_PROCESS=0
SLURM_SRUN_COMM_PORT=48272
SLURM_PTY_WIN_ROW=40
SLURM_TOPOLOGY_ADDR_PATTERN=node
SLURM_NNODES=1
SLURM_STEP_NUM_NODES=1
SLURM_JOBID=16010
SLURM_NTASKS=1
SLURM_LAUNCH_NODE_IPADDR=192.168.0.169
SLURM_STEP_ID=0
SLURM_STEP_LAUNCHER_PORT=48272
SLURM_TASKS_PER_NODE=1
SLURM_JOB_ID=16010
SLURM_JOB_USER=kabbey
SLURM_STEPID=0
SLURM_SRUN_COMM_HOST=192.168.0.169
SLURM_PTY_WIN_COL=209
SLURM_JOB_UID=12901
SLURM_NODEID=0
SLURM_SUBMIT_DIR=/g1/home/kabbey/software_tests/slurm
SLURM_TASK_PID=85622
SLURM_NPROCS=1
SLURM=/g0/opt/slurm/14.03.8
SLURM_CPUS_ON_NODE=2
SLURM_DISTRIBUTION=cyclic
SLURM_PROCID=0
SLURM_JOB_NODELIST=node12
SLURM_PTY_PORT=52017
SLURM_LOCALID=0
SLURM_JOB_CPUS_PER_NODE=2
SLURM_GTIDS=0
SLURM_SUBMIT_HOST=kestrel.ccib.rutgers.edu
SLURM_JOB_PARTITION=batch
SLURM_STEP_NUM_TASKS=1
SLURM_JOB_NUM_NODES=1
SLURM_STEP_TASKS_PER_NODE=1
SLURM_STEP_NODELIST=node12
SLURM_MEM_PER_NODE=30720
[***@node12 slurm]$
--
Kevin Abbey
Systems Administrator
Center for Computational and Integrative Biology (CCIB)
http://ccib.camden.rutgers.edu/

Rutgers University - Science Building
315 Penn St.
Camden, NJ 08102
Telephone: (856) 225-6770
Fax:(856) 225-6312
Email: kevin.abbey-***@public.gmane.org
j***@public.gmane.org
2014-10-15 18:19:44 UTC
The original epilog works for me and both env vars should be set. See
src/slurmd/slurmd/req.c:
setenvf(&env, "SLURM_JOB_UID", "%u", job_env->uid);
setenvf(&env, "SLURM_UID", "%u", job_env->uid);
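A quick way to check which names the daemon actually exports to the
epilog (as opposed to the user's salloc environment) is to dump the
script's environment when it runs. This is a throwaway diagnostic, not
part of the distributed epilog, and the /tmp path is just an example:

```shell
#!/bin/sh
# One-off diagnostic: record every SLURM_* variable the epilog sees,
# keyed by job id, so SLURM_UID vs SLURM_JOB_UID can be compared later.
out="/tmp/epilog-env.${SLURM_JOB_ID:-unknown}"
env | grep '^SLURM' | sort > "$out"
```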
--
Morris "Moe" Jette
CTO, SchedMD LLC