Discussion:
Odd (ReqNodeNotAvail) and (PartitionNodeLimit) with multiple partitions
Mikael Johansson
2014-10-21 20:53:41 UTC
Hello All,

I had a problem with jobs being stuck in the queue and not being scheduled
even though there are unused cores on the cluster. The system has four
partitions: three "high priority" ones and one lower-priority "backfill"
partition. A concise description of the setup in slurm.conf, SLURM 2.2.7:


PartitionName=backfill Nodes=node[001-026] Default=NO MaxNodes=10 MaxTime=168:00:00 AllowGroups=ALL Priority=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO PreemptMode=requeue
PartitionName=short Nodes=node[005-026] Default=YES MaxNodes=6 MaxTime=002:00:00 AllowGroups=ALL Priority=2 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO PreemptMode=off
PartitionName=medium Nodes=node[009-026] Default=NO MaxNodes=4 MaxTime=168:00:00 AllowGroups=ALL Priority=2 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO PreemptMode=off
PartitionName=long Nodes=node[001-004] Default=NO MaxNodes=4 MaxTime=744:00:00 AllowGroups=ALL Priority=2 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO PreemptMode=off

SchedulerType=sched/builtin
PreemptType=preempt/partition_prio
PreemptMode=requeue


(I'll of course send more of the configuration if needed.)
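
To make the overlap between the partitions easier to see at a glance, the
same layout can be summarized with sinfo, e.g. (I believe these format
options are correct, but check against the sinfo man page):

sinfo -o "%P %N"

which lists, per partition and node state, the nodes belonging to each
partition, so the overlapping node ranges above are easy to verify.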


The problem is that the backfill jobs will only be scheduled to run on
nodes node[001-008]; they never start on nodes node[009-026]. I tested
this by submitting a job explicitly to a specific node (node020) in two
ways; both left the job stuck in a different, odd state:

#SBATCH -w node020:
the job gets status (ReqNodeNotAvail), and the log shows "debug2: sched:
JobId=NNNNNN allocated resources: NodeList=(null)" and "debug3:
JobId=NNNNNN required nodes not avail"

#SBATCH -x node[001-019,021-026]:
the job gets status (PartitionNodeLimit), and the log shows "debug3:
JobId=NNNNNN not runnable with present config"
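
For reference, the test job itself was essentially a trivial batch script
along these lines (the core count, time limit and command are just
illustrative placeholders here, not the exact script):

#!/bin/bash
#SBATCH --partition=backfill
#SBATCH -n 1
#SBATCH -t 00:10:00
#SBATCH -w node020
srun hostname

with the -w line swapped for the -x node[001-019,021-026] line in the
second test.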


I have no idea how SLURM arrives at these conclusions. While trying to
figure out what's going on, I found that either of the following _does_
get the jobs started (but breaks the configuration, of course):

1. Increasing the priority of PartitionName=backfill to 2, the same as
the other partitions (see the scontrol line after this list)

2. Removing node020 from all other partitions
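
As mentioned under point 1, the priority change can, if I understand
scontrol correctly, even be tried on the fly, without editing slurm.conf,
with something like:

scontrol update PartitionName=backfill Priority=2

(and reverted with Priority=1 afterwards), which at least makes the effect
easy to reproduce.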


I also thought it might somehow be related to the fact that nodes
node[009-026] are shared by three partitions (instead of just two, like
the other nodes), which perhaps confuses SLURM 2.2.7. Removing node020
from, for example, the short partition, leaving it in only the medium and
backfill partitions, does not help, though.

However, removing node020 from the medium partition, leaving it only in
the short and backfill partitions, does work, and the job starts in
backfill without problems.
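
Concretely, the configuration that works corresponds to changing only the
medium line to something like the following (plus a reconfigure/restart of
slurmctld, of course), which is obviously not something we want to keep
permanently:

PartitionName=medium Nodes=node[009-019,021-026] Default=NO MaxNodes=4 MaxTime=168:00:00 AllowGroups=ALL Priority=2 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO PreemptMode=off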

To me this sounds like an odd bug, but perhaps I'm missing something. If
it is a bug that is known to be fixed in later versions, that would be a
good reason to finally upgrade our SLURM to something a bit more modern.
At the same time, if someone can come up with a workaround, that would be
a much easier solution to implement, at least in the short term.


So again, all ideas and suggestions, or even just explanations of the odd
job states, are most welcome!


Cheers,
Mikael J.
http://www.iki.fi/~mpjohans/
