Discussion:
Associations and DefaultQOS
John Desantis
2014-10-15 19:04:35 UTC
Hello all,

I am not sure if I've stumbled upon a bug (14.03.6) or if this is the
intended behavior for an account tied to an association with a
DefQOS.

Basically, I've configured our cluster to use associations with a
DefQOS other than "normal" for specific accounts (for priority,
preemption, etc.). When I add a new user via sacctmgr add user... or
via sacctmgr load file=... and that new user submits a job, the
default QOS I've configured is not respected. If I restart
slurmd/slurmctld (service slurm restart) across the cluster and submit
a job again, the default QOS is respected.
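
For reference, the per-account default was configured roughly along
these lines (a sketch only; the QOS name and priority value here are
illustrative, and the preemption settings are omitted):

# illustrative QOS with elevated priority
sacctmgr add qos faculty Priority=100
# attach it to the account's allowed QOS list and make it the default
sacctmgr modify account where name=faculty cluster=c_slurm set QOS+=faculty DefaultQOS=faculty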

Here is the configuration I've set:

sacctmgr show association where account=faculty
   Cluster    Account         User   Share  GrpCPUs                  QOS   Def QOS
---------- ---------- ------------ ------- -------- -------------------- ---------
   c_slurm    faculty                    1      512 elevated,faculty,no+   faculty
   c_slurm    faculty  davidrogers  parent      512 elevated,faculty,no+   faculty
   c_slurm    faculty          hlw  parent      512 elevated,faculty,no+   faculty
   c_slurm    faculty       sathan  parent      512 elevated,faculty,no+   faculty

Here is the user I've added:

sacctmgr add user pshudson defaultaccount=faculty fairshare=parent
Adding User(s)
pshudson
Settings =
Default Account = faculty
Associations =
U = pshudson A = faculty C = c_slur
Non Default Settings
Fairshare = 2147483647
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
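
For completeness, the sacctmgr load file=... route uses a flat file in
sacctmgr dump format; a minimal sketch of the relevant stanza (field
values illustrative, and the exact keywords should be checked against
a sacctmgr dump of the cluster) would look roughly like:

Cluster - 'c_slurm'
Parent - 'root'
Account - 'faculty':DefaultQOS='faculty':Fairshare=1
Parent - 'faculty'
User - 'pshudson':DefaultAccount='faculty':Fairshare=parent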

Double-check the new entry:

sacctmgr show association where account=faculty
   Cluster    Account         User   Share  GrpCPUs                  QOS   Def QOS
---------- ---------- ------------ ------- -------- -------------------- ---------
   c_slurm    faculty                    1      512 elevated,faculty,no+   faculty
   c_slurm    faculty  davidrogers  parent      512 elevated,faculty,no+   faculty
   c_slurm    faculty          hlw  parent      512 elevated,faculty,no+   faculty
   c_slurm    faculty     pshudson  parent      512 elevated,faculty,no+   faculty
   c_slurm    faculty       sathan  parent      512 elevated,faculty,no+   faculty

Here is the user immediately submitting a job:

[***@host ~]$ salloc -N 4 -n 4 -t 01:00:00
salloc: Granted job allocation 24150

Here is the squeue output:

JOBID  ST  PRIO  QOS    PARTI  NAME  USER      ACCOUNT  SUBMIT_TIME          START_TIME           TIME  TIMELIMIT  EXEC_HOST  CPUS  NODES  MIN_M  NODELIST(REASON)
24150  R   0.00  norma  satur  bash  pshudson  faculty  2014-10-15T14:40:46  2014-10-15T14:40:46  0:03  1:00:00    rcslurm    4     4      0      rack-5-[16-19]

As you can see, the QOS is "normal" (shown truncated as "norma"). I go
ahead and restart slurmd/slurmctld across the cluster and resubmit:

[***@host ~]$ salloc -N 4 -n 4 -t 01:00:00
salloc: Granted job allocation 24153

Here is the squeue output:

JOBID  ST  PRIO  QOS    PARTI  NAME  USER      ACCOUNT  SUBMIT_TIME          START_TIME           TIME  TIMELIMIT  EXEC_HOST  CPUS  NODES  MIN_M  NODELIST(REASON)
24153  R   0.00  facul  satur  bash  pshudson  faculty  2014-10-15T14:42:23  2014-10-15T14:42:23  0:03  1:00:00    rcslurm    4     4      0      rack-5-[16-19]

The default QOS is now respected; the job runs under "faculty" (shown
truncated as "facul").
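
For anyone reproducing this, the stored default and the QOS actually
assigned to a job can be compared directly, e.g. (a sketch; the format
fields may need adjusting for your sacctmgr/squeue version):

# what slurmdbd has stored for the user's association
sacctmgr show association where user=pshudson format=cluster,account,user,defaultqos
# what the controller assigned to the running job (%q = QOS)
squeue -u pshudson -o "%.8i %.10q %.10P %.10a %.10u"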

Is a restart of the slurmd/slurmctld daemons necessary and simply
undocumented, or is this a potential bug?

Thank you,
John DeSantis
John Desantis
2014-10-16 15:41:45 UTC
Hello all,

Just an update to this issue.

If I restart only the primary slurmctld, I can avoid a service restart
across the cluster.
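
That is, on the slurmctld host only (assuming the stock init script):

service slurm restart    # controller node only; slurmd on the compute nodes is left alone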

John DeSantis
Christopher Samuel
2014-10-16 22:13:40 UTC
Post by John Desantis
Basically, I've configured our cluster to use associations with a
DefQOS other than "normal" for specific accounts (for priority,
preemption, etc.). When I add a new user via sacctmgr add user... or
via sacctmgr load file=... and that new user submits a job, the
default QOS I've configured is not respected.
Hmm, could you try and mark a partition as UP with scontrol and see if
that helps? It's something we do here on Slurm 2.6 and (I believe)
resolves this for us.
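
Something along these lines (partition name is a placeholder):

scontrol update PartitionName=<partition> State=UP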

All the best,
Chris
--
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel-***@public.gmane.org Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
John Desantis
2014-10-17 12:52:34 UTC
Chris,
Post by Christopher Samuel
Hmm, could you try and mark a partition as UP with scontrol and see if
that helps? It's something we do here on Slurm 2.6 and (I believe)
resolves this for us.
Thanks for the suggestion!

I tried this and, unfortunately, there was no change.

John DeSantis