Kevin M. Hildebrand
2014-07-23 12:50:41 UTC
Hi, we were running 14.03.03 and updated to 14.03.06 yesterday, and since then I've been seeing bizarre values for NumCPUs on submitted jobs.
sbatch -n 30 slurmtest.script
scontrol show job 431695
JobId=431695 Name=slurmtest.script
UserId=kevin(7260) GroupId=glue-staff(8675)
Priority=40133 Nice=0 Account=bubba QOS=wide-short
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=00:00:10 TimeLimit=00:30:00 TimeMin=N/A
SubmitTime=2014-07-23T08:34:48 EligibleTime=2014-07-23T08:34:48
StartTime=2014-07-23T08:34:49 EndTime=2014-07-23T09:04:49
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=standard AllocNode:Sid=deepthought2:21227
ReqNodeList=(null) ExcNodeList=(null)
NodeList=compute-b25-[24,37]
BatchHost=compute-b25-24
NumNodes=2 NumCPUs=280 CPUs/Task=1 ReqB:S:C:T=0:0:*:* <---- NOTICE NumCPUs here
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=0
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/export/home/dt2-admin/kevin/slurmtest.script
WorkDir=/home/dt2-admin/kevin
StdErr=/home/dt2-admin/kevin/slurm-431695.out
StdIn=/dev/null
StdOut=/home/dt2-admin/kevin/slurm-431695.out
This cluster is made up of nodes with 20 cores each, so I'd expect NumCPUs to be 40 since the job is exclusive.
scontrol show node "compute-b25-[24,37]"
NodeName=compute-b25-24 Arch=x86_64 CoresPerSocket=10
CPUAlloc=20 CPUErr=0 CPUTot=20 CPULoad=2.97 Features=(null)
Gres=(null)
NodeAddr=compute-b25-24 NodeHostName=compute-b25-24 Version=14.03
OS=Linux RealMemory=128000 AllocMem=0 Sockets=2 Boards=1
State=ALLOCATED ThreadsPerCore=1 TmpDisk=750000 Weight=1
BootTime=2014-07-01T09:03:34 SlurmdStartTime=2014-07-23T08:24:44
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
NodeName=compute-b25-37 Arch=x86_64 CoresPerSocket=10
CPUAlloc=20 CPUErr=0 CPUTot=20 CPULoad=0.01 Features=(null)
Gres=(null)
NodeAddr=compute-b25-37 NodeHostName=compute-b25-37 Version=14.03
OS=Linux RealMemory=128000 AllocMem=0 Sockets=2 Boards=1
State=ALLOCATED ThreadsPerCore=1 TmpDisk=750000 Weight=1
BootTime=2014-07-01T09:03:34 SlurmdStartTime=2014-07-23T08:24:47
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
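For what it's worth, a quick parse of the output above makes the mismatch explicit (a minimal sketch; the strings are copied from the scontrol output in this report, and the comparison simply assumes that for an exclusive job NumCPUs should equal the sum of CPUAlloc over its nodes):

```python
import re

# Fragments copied from the "scontrol show job" and "scontrol show node"
# output pasted above.
job_output = "NumNodes=2 NumCPUs=280 CPUs/Task=1"
node_output = """
NodeName=compute-b25-24 CPUAlloc=20 CPUErr=0 CPUTot=20
NodeName=compute-b25-37 CPUAlloc=20 CPUErr=0 CPUTot=20
"""

# NumCPUs as reported for the job.
num_cpus = int(re.search(r"NumCPUs=(\d+)", job_output).group(1))

# Sum of CPUAlloc across the job's two allocated nodes: 20 + 20 = 40.
alloc_total = sum(int(m) for m in re.findall(r"CPUAlloc=(\d+)", node_output))

print(num_cpus, alloc_total)  # prints: 280 40
```

So the node records agree with the expected 40 CPUs (2 nodes x 20 cores); only the job's NumCPUs field is off.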
I'm seeing this behavior on two different clusters, both of which were updated to 14.03.06. Was something changed recently that could explain this?
Thanks,
Kevin