Bill Wichser
2014-07-11 15:14:34 UTC
This morning, one of our users questioned another user's allocation,
mainly asking how to do the same thing. The request looks like this:
#SBATCH --ntasks=256
#SBATCH --ntasks-per-socket=16
Our nodes have 16 cores each (dual socket, 8 cores per socket), so this
raised an eyebrow. The actual allocation is all over the place; here are
a few lines from scontrol show job:
Socks/Node=* NtasksPerN:B:S:C=0:0:16:* CoreSpec=0
Nodes=tiger-r1c1n16 CPU_IDs=0-7 Mem=24000
Nodes=tiger-r1c2n11 CPU_IDs=8-15 Mem=24000
Nodes=tiger-r1c3n1 CPU_IDs=15 Mem=3000
Nodes=tiger-r1c3n2 CPU_IDs=12 Mem=3000
Nodes=tiger-r1c3n[6,10] CPU_IDs=8-15 Mem=24000
Nodes=tiger-r1c4n2 CPU_IDs=4,15 Mem=6000
Nodes=tiger-r1c4n3 CPU_IDs=13-14 Mem=6000
Nodes=tiger-r2c1n2 CPU_IDs=8-15 Mem=24000
Nodes=tiger-r2c1n3 CPU_IDs=3 Mem=3000
and on and on and on, using a total of 43 different nodes.
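To see how fragmented the allocation is, the per-node lines can be tallied with a short script. This is only a sketch: the helper names are made up for illustration, the sample holds just the lines quoted above (not the full 43-node listing), and the node-list expansion handles only a single bracket group without zero padding. On a real system, scontrol show hostnames is the more robust way to expand a node list.

```python
import re

# The per-node lines quoted above from "scontrol show job" (a partial listing).
SAMPLE = [
    "Nodes=tiger-r1c1n16 CPU_IDs=0-7 Mem=24000",
    "Nodes=tiger-r1c2n11 CPU_IDs=8-15 Mem=24000",
    "Nodes=tiger-r1c3n1 CPU_IDs=15 Mem=3000",
    "Nodes=tiger-r1c3n2 CPU_IDs=12 Mem=3000",
    "Nodes=tiger-r1c3n[6,10] CPU_IDs=8-15 Mem=24000",
    "Nodes=tiger-r1c4n2 CPU_IDs=4,15 Mem=6000",
    "Nodes=tiger-r1c4n3 CPU_IDs=13-14 Mem=6000",
    "Nodes=tiger-r2c1n2 CPU_IDs=8-15 Mem=24000",
    "Nodes=tiger-r2c1n3 CPU_IDs=3 Mem=3000",
]


def expand_ids(spec):
    """Expand a spec such as '0-7' or '4,15' into a list of integers."""
    ids = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            ids.extend(range(int(lo), int(hi) + 1))
        else:
            ids.append(int(part))
    return ids


def expand_nodes(spec):
    """Expand 'name[6,10]' into individual node names (simplified)."""
    m = re.fullmatch(r"(.*)\[(.*)\](.*)", spec)
    if not m:
        return [spec]
    prefix, inner, suffix = m.groups()
    return [f"{prefix}{i}{suffix}" for i in expand_ids(inner)]


def cpus_per_node(lines):
    """Map each node name to the number of CPUs allocated on it."""
    counts = {}
    for line in lines:
        m = re.search(r"Nodes=(\S+)\s+CPU_IDs=(\S+)", line)
        if not m:
            continue
        ncpus = len(expand_ids(m.group(2)))
        for node in expand_nodes(m.group(1)):
            counts[node] = counts.get(node, 0) + ncpus
    return counts


counts = cpus_per_node(SAMPLE)
print(len(counts), sum(counts.values()))  # 10 47
```

For the lines shown, that is 47 CPUs spread over 10 nodes, with anywhere from 1 to 8 CPUs per node.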
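Off to the man pages. What I find is that --ntasks-per-socket specifies
the maximum number of tasks per socket, not an exact count. Okay, this
is interesting, and now I understand why this worked: with only 8 cores
per socket and one core per task, a cap of 16 is never reached.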
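The arithmetic behind that can be sketched as follows. The function name is hypothetical, and the sketch assumes one core per task with no oversubscription; it only illustrates why a per-socket maximum of 16 constrains nothing on 8-core sockets, and says nothing about how the scheduler actually picks nodes.

```python
import math


def min_nodes(ntasks, sockets_per_node, cores_per_socket, ntasks_per_socket):
    """Smallest node count that could hold ntasks under a per-socket cap.

    Assumes one core per task, so a socket can never hold more tasks
    than it has cores, whatever the requested cap is.
    """
    per_socket = min(ntasks_per_socket, cores_per_socket)
    per_node = per_socket * sockets_per_node
    return math.ceil(ntasks / per_node)


# Our hardware: 2 sockets x 8 cores. A cap of 16 behaves like a cap of 8.
print(min_nodes(256, 2, 8, 16))  # 16
print(min_nodes(256, 2, 8, 8))   # 16
# A cap below the core count is the only kind that changes anything.
print(min_nodes(256, 2, 8, 4))   # 32
```

Since the value is only an upper bound, nothing forces the job onto those 16 full nodes, which is consistent with the 43-node allocation above.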
But this isn't my question.
We tell users to allocate using
#SBATCH -N 4
#SBATCH --ntasks-per-node=16
and this gets exactly that -- 64 cores. Why? When I look at the man
page for --ntasks-per-node, I find that this is also described as a
maximum value. So I'm not sure why this one behaves as an exact count
(thankfully) while --ntasks-per-socket is treated as a maximum. Off to
the source code, and in there I find that when -N is set, there is a
MAX() call which effectively takes this value as absolute and allocates
the correct number of tasks.
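One way to picture the behavior described above is the sketch below. It is a guess at the shape of the logic, not a transcription of the Slurm source: the function and parameter names are invented, and the only thing taken from the report is that a MAX() is applied when -N is given.

```python
def effective_ntasks(ntasks=1, nodes=None, ntasks_per_node=None):
    """Hypothetical model of the task count a request ends up with.

    When both a node count and a per-node task count are given, the
    product acts as a floor on the total, so the per-node "maximum"
    is effectively filled on every node.
    """
    if nodes is not None and ntasks_per_node is not None:
        return max(ntasks, nodes * ntasks_per_node)
    return ntasks


# -N 4 --ntasks-per-node=16: the product wins, giving exactly 64 tasks.
print(effective_ntasks(nodes=4, ntasks_per_node=16))  # 64
# --ntasks=256 alone: no node count, so nothing pins the layout.
print(effective_ntasks(ntasks=256))  # 256
```

Under this reading, --ntasks-per-socket has no comparable floor, which would explain why it stays a pure upper bound.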
I have no clue how best to word this in the documentation, but the
current description of --ntasks-per-node doesn't spell this out
clearly at all.
Bill