Kevin M. Hildebrand
2014-06-04 18:40:38 UTC
Hi, we've got a brand new cluster and a new SLURM installation (14.03.3), and I'm trying to set it up so that users can share nodes if desired.
At the moment, I'm seeing two unexpected behaviors that I could use some help figuring out. Each of our nodes has 20 cores, and I'm assuming that cores are the smallest unit of allocation, i.e., that a single core is never shared between jobs.
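(The attached slurm.conf has the actual details; purely as an illustration, the core-level-allocation assumption comes from settings along these lines, with made-up node and partition names:)

    SelectType=select/cons_res
    SelectTypeParameters=CR_Core     # allocate whole cores; never split a core
    NodeName=compute[001-040] CPUs=20 State=UNKNOWN
    PartitionName=standard Nodes=compute[001-040] Shared=YES Default=YES State=UP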
When submitting small jobs, as long as all of them have shared=1, everything works fine. However, if there are a bunch of shared=1 jobs running on a node and a shared=0 job comes along, it gets incorrectly packed onto that node alongside the shared=1 jobs. Once a single shared=0 job is running on a node, subsequent jobs get assigned new nodes, regardless of their shared status. This seems like a bug to me, as I'd expect shared=0 jobs to never share a node.
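To make the sequence concrete, here's roughly how to reproduce what I'm seeing (test.sh is just a placeholder batch script):

    # several small jobs willing to share a node (shared=1)
    sbatch --share -n1 test.sh
    sbatch --share -n1 test.sh
    # a job that should get a node to itself (shared=0)...
    sbatch --exclusive -n1 test.sh   # ...lands on the same node as the shared jobs
    # anything submitted after that gets a fresh node, shared or not
    sbatch --share -n1 test.sh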
The second issue is that if the first shared=0 job to come along asks for more cores than remain available on a node, it gets packed onto the node anyway, overallocating it. For example, if there are 18 single-core shared=1 jobs on a node and I submit a 20-core shared=0 job, it ends up on the same node, and I end up with 38 tasks competing for 20 cores.
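Again as a rough sketch (same placeholder test.sh):

    # fill most of a 20-core node with single-core shared jobs
    for i in $(seq 1 18); do sbatch --share -n1 test.sh; done
    # a 20-core exclusive job still lands on that node,
    # so 38 tasks end up competing for 20 cores
    sbatch --exclusive -n20 test.sh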
I've attached my slurm.conf; please let me know if I can provide any other useful info.
Thanks,
Kevin
--
Kevin Hildebrand
Division of IT
University of Maryland, College Park