Schmidtmann, Carl
2014-08-11 14:55:38 UTC
We are seeing unexpected behavior from our scheduler. All of the nodes have 24 cores. If we ask for 60 CPUs we get something less than that. It appears that the scheduler allocates enough nodes to cover the 60 CPUs, but if any of those CPUs are already allocated to another job, they are subtracted from the total granted to the new job.
In the example below I asked for 5 tasks of 12 CPUs each. I was allocated three 24-core nodes, but the first node already had an 8-core job running on it. Below are the allocation request and the Slurm environment reported to my job. SLURM_JOB_CPUS_PER_NODE is 16,24,12 instead of the expected 24,24,12. We have also confirmed that if we then try to use all 60 cores, we get an error once 52 cores are in use. If no other jobs are running on the nodes, I do get the full number of CPUs I asked for.
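For reference, the arithmetic works out exactly that way: 5 tasks x 12 CPUs = 60 CPUs requested, but 16 + 24 + 12 = 52 CPUs reported, which is the 60 minus the 8 CPUs already busy on the first node. A rough way to confirm what else is on that node (standard Slurm commands; our actual output is not reproduced here):

[***@bh-sn]$ squeue -w bhc0001            # jobs currently running on the first node
[***@bh-sn]$ scontrol show node bhc0001   # shows CPUAlloc/CPUTot for that node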
Attached is our slurm.conf. We are running Slurm 14.03.0.
Is this a bug in the scheduler, or are we missing something? There is an error about oversubscription in the log file, but I don't know what it means:
[2014-08-11T10:09:37.494] job submit for user cschmid7_local(5199): max node changed 4294967294 -> 16 because of qos limit
[2014-08-11T10:09:37.494] error: cons_res: _compute_c_b_task_dist oversubscribe for job 269189
[2014-08-11T10:09:37.494] sched: _slurm_rpc_allocate_resources JobId=269189 NodeList=bhc[0001-0003] usec=641
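In case it is relevant, since the error comes from cons_res, the select plugin settings can be pulled with the command below; the actual values are in the attached slurm.conf:

[***@bh-sn]$ scontrol show config | grep -i Select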
Thanks for any enlightenment,
Carl
[***@bh-sn]$ salloc -p debug -n 5 -c 12
salloc: Granted job allocation 269189
08/11/2014 10:09:37 AM
[/var/home/cschmid7_local]
[cschmid7_local@]$ env | grep SLURM
SLURM_NODELIST=bhc[0001-0003]
SLURM_NODE_ALIASES=(null)
SLURM_NNODES=3
SLURM_JOBID=269189
SLURM_NTASKS=5
SLURM_TASKS_PER_NODE=2(x2),1
SLURM_CPUS_PER_TASK=12
SLURM_JOB_ID=269189
SLURM_SUBMIT_DIR=/var/home/cschmid7_local
SLURM_NPROCS=5
SLURM_JOB_NODELIST=bhc[0001-0003]
SLURM_JOB_CPUS_PER_NODE=16,24,12
SLURM_SUBMIT_HOST=bh-sn.bluehive.circ.private
SLURM_JOB_PARTITION=debug
SLURM_JOB_NUM_NODES=3
[***@bh-sn]$ scontrol sho part debug
PartitionName=debug
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO
DefaultTime=00:01:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=01:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=bhc[0001-0010]
Priority=100 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=OFF
State=UP TotalCPUs=240 TotalNodes=10 SelectTypeParameters=N/A
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED