Trey Dockendorf
2014-09-02 21:19:39 UTC
I'm running small HPL benchmarks for testing, using 2 nodes with 32 cores each. I've compiled HPL against MVAPICH2 (built with --with-pm=no --with-pmi=slurm) and OpenBLAS. I've noticed that when I run a job that is supposed to have 1 task per node and 32 CPUs per task, only 1 CPU has any load. The other 31 CPUs are idle (as seen by top, mpstat, etc.).
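For reference, the idle cores are just what the usual tools show on a compute node while xhpl is running; roughly:
# On one of the two allocated nodes while xhpl is running:
mpstat -P ALL 5 1      # per-CPU utilization; only one CPU shows any load
top                    # press '1' for the per-CPU view; same picture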
My sbatch script:
#!/bin/bash
#SBATCH -J HPL_2x32_openblas_mvapich2
#SBATCH -o logs/HPL_2x32_openblas_mvapich2-%J.out
#SBATCH -p mpi-core32
#SBATCH --time=48:00:00
#SBATCH -N2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --hint=multithread
export OPENBLAS_NUM_THREADS=32
export PATH=$PATH:$HOME/hpl
srun --cpu_bind=none xhpl_openblas_mvapich2
Below are the relevant configs. I've tried removing "--cpu_bind" entirely as well as setting it to "threads". In each case the node has the correct number of threads running, but only 1 of the 32 cores has a load above 0%. It appears as though something in the way srun launches the task is binding that process to a single core. I saw mention in the sbatch and srun docs of multithreaded tasks inheriting the CPU binding of the parent process, but I'm not sure how to bind the parent process so it can use all of the CPUs.
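In case it's useful for reproducing this, a quick way to see what CPU mask the launched task actually ends up with should be something like the following (a sketch using /proc and util-linux's taskset, nothing Slurm-specific):
# Run as a job step inside the sbatch allocation; prints one line per node.
srun bash -c 'echo "$(hostname): $(grep Cpus_allowed_list /proc/self/status)"'
# Equivalent view via taskset ($$ is the remote bash, which inherits the step's binding):
srun bash -c 'taskset -cp $$'
If only a single CPU shows up there, that would confirm the suspicion that the step is being bound to one core.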
Thanks,
- Trey
slurm.conf:
JobAcctGatherType=jobacct_gather/linux
MaxMemPerCPU=1960
MpiDefault=pmi2
MpiParams=ports=30000-39999
PreemptMode=SUSPEND,GANG
PreemptType=preempt/partition_prio
ProctrackType=proctrack/cgroup
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory,CR_CORE_DEFAULT_DIST_BLOCK
TaskPlugin=task/cgroup
TaskPluginParam=Sched
VSizeFactor=101
NodeName=c0237 NodeAddr=192.168.200.87 CPUs=32 Sockets=4 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=129000 TmpDisk=16000 Feature=core32,mem128gb,ib_ddr,bulldozer,interlagos State=UNKNOWN
NodeName=c0238 NodeAddr=192.168.200.88 CPUs=32 Sockets=4 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=129000 TmpDisk=16000 Feature=core32,mem128gb,ib_ddr,bulldozer,interlagos State=UNKNOWN
NodeName=c0133 NodeAddr=192.168.200.42 CPUs=32 Sockets=4 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=129000 TmpDisk=16000 Feature=core32,mem128gb,ib_ddr,piledriver,abu_dhabi State=UNKNOWN
NodeName=c0134 NodeAddr=192.168.200.43 CPUs=32 Sockets=4 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=129000 TmpDisk=16000 Feature=core32,mem128gb,ib_ddr,piledriver,abu_dhabi State=UNKNOWN
NodeName=c0933 NodeAddr=192.168.201.95 CPUs=32 Sockets=4 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=64300 TmpDisk=16000 Feature=core32,mem64gb,ib_ddr,piledriver,abu_dhabi State=UNKNOWN
NodeName=c0934 NodeAddr=192.168.201.96 CPUs=32 Sockets=4 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=64300 TmpDisk=16000 Feature=core32,mem64gb,ib_ddr,piledriver,abu_dhabi State=UNKNOWN
NodeName=c0935 NodeAddr=192.168.201.97 CPUs=32 Sockets=4 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=64300 TmpDisk=16000 Feature=core32,mem64gb,ib_ddr,piledriver,abu_dhabi State=UNKNOWN
NodeName=c0936 NodeAddr=192.168.201.98 CPUs=32 Sockets=4 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=64300 TmpDisk=16000 Feature=core32,mem64gb,ib_ddr,piledriver,abu_dhabi State=UNKNOWN
PartitionName=mpi-core32 Nodes=c[0133-0134],c[0237-0238],c[0933-0936] Priority=100 AllowQOS=mpi MinNodes=2 MaxTime=48:00:00 State=UP
cgroup.conf:
CgroupMountpoint=/cgroup
CgroupAutomount=yes
CgroupReleaseAgentDir="/home/slurm/cgroup"
ConstrainCores=yes
TaskAffinity=yes
AllowedRAMSpace=100
AllowedSwapSpace=0
ConstrainRAMSpace=yes
ConstrainSwapSpace=no
MaxRAMPercent=100
MaxSwapPercent=100
MinRAMSpace=30
ConstrainDevices=no
AllowedDevicesFile=/home/slurm/conf/cgroup_allowed_devices_file.conf
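Since task/cgroup with ConstrainCores=yes and TaskAffinity=yes is in play, the cpuset Slurm creates for the step can presumably be checked directly as well; something along these lines (the path layout is from memory, so treat it as a sketch):
# On the compute node while the job is running; <jobid>/<stepid> from squeue -s.
# CgroupMountpoint=/cgroup as configured above; adjust the path if the layout differs.
cat /cgroup/cpuset/slurm/uid_$(id -u)/job_<jobid>/step_<stepid>/cpuset.cpus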
=============================
Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: treydock-mRW4Vj+***@public.gmane.org
Jabber: treydock-mRW4Vj+***@public.gmane.org