Discussion:
Unexpected behavior with shared jobs
Kevin M. Hildebrand
2014-06-04 18:40:38 UTC
Hi, we've got a brand new cluster and a new SLURM installation (14.03.3), and I'm trying to set it up so that users can share nodes if desired.
At the moment, I'm seeing two unexpected behaviors that I could use some help figuring out. Each of our nodes has 20 cores, and I'm assuming that cores are the smallest unit of division, i.e., cores themselves are never shared between jobs.
When submitting small jobs, everything works fine as long as they all have shared=1. However, if a bunch of shared=1 jobs are running on a node and a shared=0 job comes along, it gets incorrectly packed onto that node alongside the shared=1 jobs. Once a single shared=0 job is running on a node, subsequent jobs get assigned new nodes, regardless of their shared status. This seems like a bug to me, as I'd expect shared=0 jobs to never share a node.
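For reference, the relevant pieces look roughly like this (representative lines rather than my exact slurm.conf; the node and partition names are made up):

    SelectType=select/cons_res
    SelectTypeParameters=CR_Core
    NodeName=compute[001-020] Sockets=2 CoresPerSocket=10 ThreadsPerCore=1
    PartitionName=standard Nodes=compute[001-020] Default=YES Shared=YES State=UP

and jobs are submitted with either --share or --exclusive, along the lines of

    sbatch --share -n 1 small.sh           # shared=1, may be packed with other jobs
    sbatch --exclusive -N 1 -n 20 big.sh   # shared=0, should get a node to itself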

The second issue is that if the first shared=0 job to come along asks for more than the cores remaining on a node, it gets packed onto that node anyway, overallocating it. For example, if there are 18 single-core shared=1 jobs on a node and I submit a 20-core shared=0 job, it ends up on the same node, and I'm left with 38 tasks competing for 20 cores.
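In case it's useful, a minimal sequence that reproduces this for me looks roughly like the following (hypothetical job script names, starting from an otherwise idle 20-core node):

    # 18 single-core shared jobs, nearly filling one node
    for i in $(seq 1 18); do sbatch --share -n 1 small.sh; done
    # a 20-core exclusive job -- I'd expect it to start on an empty node,
    # but it lands on the partly-full node, giving 38 tasks on 20 cores
    sbatch --exclusive -N 1 -n 20 big.sh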

I've attached my slurm.conf; please let me know if I can provide any other useful info.

Thanks,
Kevin

--
Kevin Hildebrand
Division of IT
University of Maryland, College Park
Kevin M. Hildebrand
2014-06-05 14:00:37 UTC
More on this: the problem appears to be caused by something in the preemption logic. With PreemptType=preempt/qos, I get the erratic behavior described in my previous message. With PreemptType=preempt/none, things behave as they should (see the sketch after the log excerpt below).
Also, in slurmctld.log I'm seeing the following for the misdirected jobs:
[2014-06-05T09:40:18.197] error: cons_res: ERROR: job overflow: could not find idle resources for job 16384
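The only thing I'm changing between the broken and working cases is the preemption setup in slurm.conf, i.e. something like this (the PreemptMode value here is just illustrative):

    # misbehaves: shared=0 jobs get packed with shared=1 jobs, nodes get overallocated
    PreemptType=preempt/qos
    PreemptMode=CANCEL

    # behaves as expected
    PreemptType=preempt/none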

Kevin

