Discussion:
Unexpected behavior with shared jobs
Kevin M. Hildebrand
2014-06-04 18:40:38 UTC
Hi, we've got a brand new cluster and a new SLURM installation (14.03.3), and I'm trying to set it up so that users can share nodes if desired.
At the moment, I'm seeing two unexpected behaviors that I could use some help figuring out. Each of our nodes has 20 cores, and I'm assuming that cores are the smallest unit of division, i.e., cores themselves are never shared between jobs.
When submitting small jobs, everything works fine as long as they all have shared=1. However, if a bunch of shared=1 jobs are running on a node and a shared=0 job comes along, it gets incorrectly packed onto that node alongside the shared=1 jobs. Once a single shared=0 job is running on a node, subsequent jobs get assigned new nodes, regardless of their shared status. This seems like a bug to me, as I'd expect shared=0 jobs to never share a node.
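For reference, the relevant pieces look roughly like this (representative lines rather than my exact slurm.conf; the node and partition names are made up):

    SelectType=select/cons_res
    SelectTypeParameters=CR_Core
    NodeName=compute[001-020] Sockets=2 CoresPerSocket=10 ThreadsPerCore=1
    PartitionName=standard Nodes=compute[001-020] Default=YES Shared=YES State=UP

and jobs are submitted with either --share or --exclusive, along the lines of

    sbatch --share -n 1 small.sh           # shared=1, may be packed with other jobs
    sbatch --exclusive -N 1 -n 20 big.sh   # shared=0, should get a node to itself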

The second issue is that if the first shared=0 job to come along asks for more than the cores remaining on a node, it gets packed onto that node anyway, overallocating it. For example, if there are 18 single-core shared=1 jobs on a node and I submit a 20-core shared=0 job, it ends up on the same node, and I'm left with 38 tasks competing for 20 cores.
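In case it's useful, a minimal sequence that reproduces this for me looks roughly like the following (hypothetical job script names, starting from an otherwise idle 20-core node):

    # 18 single-core shared jobs, nearly filling one node
    for i in $(seq 1 18); do sbatch --share -n 1 small.sh; done
    # a 20-core exclusive job -- I'd expect it to start on an empty node,
    # but it lands on the partly-full node, giving 38 tasks on 20 cores
    sbatch --exclusive -N 1 -n 20 big.sh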

I've attached my slurm.conf; please let me know if I can provide any other useful info.

Thanks,
Kevin

--
Kevin Hildebrand
Division of IT
University of Maryland, College Park
Kevin M. Hildebrand
2014-06-05 14:00:37 UTC
More on this: the problem appears to be caused by something in the preemption logic. With PreemptType=preempt/qos, I get the erratic behavior described in my previous message. With PreemptType=preempt/none, things behave as they should (see the sketch after the log excerpt below).
Also, in slurmctld.log I'm seeing the following for the misdirected jobs:
[2014-06-05T09:40:18.197] error: cons_res: ERROR: job overflow: could not find idle resources for job 16384
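The only thing I'm changing between the broken and working cases is the preemption setup in slurm.conf, i.e. something like this (the PreemptMode value here is just illustrative):

    # misbehaves: shared=0 jobs get packed with shared=1 jobs, nodes get overallocated
    PreemptType=preempt/qos
    PreemptMode=CANCEL

    # behaves as expected
    PreemptType=preempt/none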

Kevin

