Discussion:
Bizarre number of CPUs calculated after updating to 14.03.06
Kevin M. Hildebrand
2014-07-23 12:50:41 UTC
Hi, we were running 14.03.03 and updated to 14.03.06 yesterday, and since then I've been seeing bizarre figures for NumCPUs in submitted jobs.
sbatch -n 30 slurmtest.script
JobId=431695 Name=slurmtest.script
scontrol show job 431695
JobId=431695 Name=slurmtest.script
UserId=kevin(7260) GroupId=glue-staff(8675)
Priority=40133 Nice=0 Account=bubba QOS=wide-short
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=00:00:10 TimeLimit=00:30:00 TimeMin=N/A
SubmitTime=2014-07-23T08:34:48 EligibleTime=2014-07-23T08:34:48
StartTime=2014-07-23T08:34:49 EndTime=2014-07-23T09:04:49
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=standard AllocNode:Sid=deepthought2:21227
ReqNodeList=(null) ExcNodeList=(null)
NodeList=compute-b25-[24,37]
BatchHost=compute-b25-24
NumNodes=2 NumCPUs=280 CPUs/Task=1 ReqB:S:C:T=0:0:*:* <---- NOTICE NumCPUs here
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=0
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/export/home/dt2-admin/kevin/slurmtest.script
WorkDir=/home/dt2-admin/kevin
StdErr=/home/dt2-admin/kevin/slurm-431695.out
StdIn=/dev/null
StdOut=/home/dt2-admin/kevin/slurm-431695.out

This cluster is made up of nodes with 20 cores each, so I'd expect NumCPUs to be 40, since the job is exclusive.
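One extra cross-check that might be useful here (a sketch; I'm assuming scontrol's -d/--details output, which should list the allocated CPU_IDs per node):
scontrol -d show job 431695 | grep -E 'NumCPUs|CPU_IDs'
# if the allocation itself is sane, this should show 20 CPU_IDs on each of the
# two nodes (40 in total), with only the aggregate NumCPUs figure inflated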
scontrol show node "compute-b25-[24,37]"
NodeName=compute-b25-24 Arch=x86_64 CoresPerSocket=10
CPUAlloc=20 CPUErr=0 CPUTot=20 CPULoad=2.97 Features=(null)
Gres=(null)
NodeAddr=compute-b25-24 NodeHostName=compute-b25-24 Version=14.03
OS=Linux RealMemory=128000 AllocMem=0 Sockets=2 Boards=1
State=ALLOCATED ThreadsPerCore=1 TmpDisk=750000 Weight=1
BootTime=2014-07-01T09:03:34 SlurmdStartTime=2014-07-23T08:24:44
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


NodeName=compute-b25-37 Arch=x86_64 CoresPerSocket=10
CPUAlloc=20 CPUErr=0 CPUTot=20 CPULoad=0.01 Features=(null)
Gres=(null)
NodeAddr=compute-b25-37 NodeHostName=compute-b25-37 Version=14.03
OS=Linux RealMemory=128000 AllocMem=0 Sockets=2 Boards=1
State=ALLOCATED ThreadsPerCore=1 TmpDisk=750000 Weight=1
BootTime=2014-07-01T09:03:34 SlurmdStartTime=2014-07-23T08:24:47
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


I'm seeing this behavior on two different clusters, both of which were updated to 14.03.06. Was something changed recently that could explain this?

Thanks,
Kevin
Kevin M. Hildebrand
2014-07-23 13:41:48 UTC
Ok, I see what's happening, but I don't know why yet. If the job is assigned non-contiguous nodes, the CPUs for the intervening nodes are being counted as well, for some reason. For example, if I'm assigned compute-b25-[0-2], NumCPUs is correct at 60 (three nodes' worth of CPUs).
However, if I'm assigned compute-b25-[0,4], NumCPUs is incorrect and is shown as 100 (five nodes' worth of CPUs instead of two).
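The arithmetic backs that up, if what's being summed is the full node index range rather than just the allocated set (a quick back-of-the-envelope check, assuming 20 CPUs per node throughout):
# compute-b25-[0,4]: two allocated nodes, indices 0 and 4
echo $(( 2 * 20 ))             # allocated-set count -> 40 (expected)
echo $(( (4 - 0 + 1) * 20 ))   # index-range count   -> 100 (what I'm seeing)
# the original job fits the same pattern:
# compute-b25-[24,37] -> (37 - 24 + 1) * 20 = 280, matching its NumCPUs=280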

Kevin


Morris Jette
2014-07-23 14:04:42 UTC
There is a fix in GitHub from about a week ago.
--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
Kevin M. Hildebrand
2014-07-23 14:17:19 UTC
Looks like it's commit abb65255968d8cdafd90e2337131d53b9578cd82.

I've just grabbed it and verified that it indeed fixes the problem.
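For anyone else who needs this before the next maintenance release, pulling the fix into a local 14.03 source tree should look roughly like the following (a sketch only; the branch name and rebuild steps are assumptions about your own setup):
git clone https://github.com/SchedMD/slurm.git
cd slurm
git checkout slurm-14.03    # assumed name of the 14.03 maintenance branch
git cherry-pick abb65255968d8cdafd90e2337131d53b9578cd82
# then rebuild and reinstall as usual, and restart slurmctld/slurmd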

Thanks for the quick reply!

Kevin
