John DeSantis
2014-10-15 19:04:35 UTC
Hello all,
I am not sure whether I've stumbled upon a bug (Slurm 14.03.6) or whether
this is the intended behavior for an account tied to an association with
a DefQOS.
Basically, I've configured our cluster to use associations with a DefQOS
other than "normal" for specific accounts (priority, preemption, etc.).
When I add a new user via "sacctmgr add user ..." or via "sacctmgr load
file=...", and that new user submits a job, the default QOS I've
configured is not respected. If I restart slurmd/slurmctld ("service
slurm restart") across the cluster and submit a job again, the default
QOS is respected.
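For reference, the account-level defaults were set up roughly like this
(reconstructed from the association listing below, not the exact commands;
I'm assuming the truncated QOS list "elevated,faculty,no+" ends in
"normal"):

# Reconstruction of the setup; QOS names taken from the listing below.
sacctmgr add qos elevated
sacctmgr add qos faculty
sacctmgr modify account where name=faculty set QOS=elevated,faculty,normal DefaultQOS=faculty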
Here is the configuration I've set:
sacctmgr show association where account=faculty
   Cluster    Account        User     Share   ...                  QOS   Def QOS
---------- ---------- ----------- --------- ----- -------------------- ---------
   c_slurm    faculty                     1   512 elevated,faculty,no+   faculty
   c_slurm    faculty davidrogers    parent   512 elevated,faculty,no+   faculty
   c_slurm    faculty         hlw    parent   512 elevated,faculty,no+   faculty
   c_slurm    faculty      sathan    parent   512 elevated,faculty,no+   faculty
Here is the user I've added:
sacctmgr add user pshudson defaultaccount=faculty fairshare=parent
 Adding User(s)
  pshudson
 Settings =
  Default Account = faculty
 Associations =
  U = pshudson  A = faculty  C = c_slur
 Non Default Settings
  Fairshare = 2147483647
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
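Note that I did not set a DefaultQOS explicitly on the new user's
association; the account-level default should be inherited. If it had to
be set by hand, I'd expect it to look like this (untested):

sacctmgr modify user where user=pshudson account=faculty set DefaultQOS=faculty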
Double-checking the new entry:
sacctmgr show association where account=faculty
   Cluster    Account        User     Share   ...                  QOS   Def QOS
---------- ---------- ----------- --------- ----- -------------------- ---------
   c_slurm    faculty                     1   512 elevated,faculty,no+   faculty
   c_slurm    faculty davidrogers    parent   512 elevated,faculty,no+   faculty
   c_slurm    faculty         hlw    parent   512 elevated,faculty,no+   faculty
   c_slurm    faculty    pshudson    parent   512 elevated,faculty,no+   faculty
   c_slurm    faculty      sathan    parent   512 elevated,faculty,no+   faculty
Here is the user immediately submitting a job:
[***@host ~]$ salloc -N 4 -n 4 -t 01:00:00
salloc: Granted job allocation 24150
Here is the squeue output:
JOBID  ST  PRIO  QOS    PARTI  NAME  USER      ACCOUNT  SUBMIT_TIME          START_TIME           TIME  TIMELIMIT  EXEC_HOST  CPUS  NODES  MIN_M  NODELIST(REASON)
24150  R   0.00  norma  satur  bash  pshudson  faculty  2014-10-15T14:40:46  2014-10-15T14:40:46  0:03  1:00:00    rcslurm    4     4      0      rack-5-[16-19]
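To rule out simple column truncation in squeue, the assigned QOS can also
be checked directly against the controller and the database (same job ID
as above):

scontrol show job 24150 | grep -i qos
sacctmgr show assoc where user=pshudson format=cluster,account,user,defaultqos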
As you can see, the QOS is "normal" rather than the configured "faculty"
default. I go ahead and restart slurmd/slurmctld across the cluster and
resubmit:
[***@host ~]$ salloc -N 4 -n 4 -t 01:00:00
salloc: Granted job allocation 24153
Here is the squeue output:
JOBID  ST  PRIO  QOS    PARTI  NAME  USER      ACCOUNT  SUBMIT_TIME          START_TIME           TIME  TIMELIMIT  EXEC_HOST  CPUS  NODES  MIN_M  NODELIST(REASON)
24153  R   0.00  facul  satur  bash  pshudson  faculty  2014-10-15T14:42:23  2014-10-15T14:42:23  0:03  1:00:00    rcslurm    4     4      0      rack-5-[16-19]
The default QOS ("faculty", shown truncated as "facul") is now respected.
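As a stopgap until the daemons are restarted, users could presumably
request the QOS explicitly, since the association's QOS list already
allows it:

salloc --qos=faculty -N 4 -n 4 -t 01:00:00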
Is a restart of the slurmd/slurmctld daemons necessary and simply
undocumented, or is this a potential bug?
Thank you,
John DeSantis