Discussion:
jobs killed on controller restart
Michael Gutteridge
2014-06-10 18:02:37 UTC
We've had some trouble with curious job failures; the jobs aren't even
assigned nodes:

       JobID        NodeList      State ExitCode
------------ --------------- ---------- --------
     7229124   None assigned     FAILED      0:1

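(For reference, that's the sort of thing sacct shows for these jobs, e.g.

sacct -j 7229124 --format=JobID,NodeList,State,ExitCode

though the exact field list above is only approximate.)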

We finally got some better log data (I'd had the log level turned down way
too low), which suggests that restarting and/or reconfiguring the
controller is at the root of it. After some preliminaries (purging job
records, recovering active jobs) there will be these sorts of messages:


[2014-06-09T23:10:15.920] No nodes satisfy job 7228909 requirements in
partition full
[2014-06-09T23:10:15.920] sched: schedule: JobId=7228909 non-runnable:
Requested node configuration is not available
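
(In case it's useful to anyone chasing the same thing, the controller log
level can be bumped on the fly with something like

scontrol setdebug debug2

rather than editing SlurmctldDebug and reconfiguring; the level name there
is only an example.)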

The indicated job specified --mem and --tmp, but the values are within the
capacities of all nodes in that "full" partition. Typically, if a user
requests resources exceeding those available on nodes in a partition, the
submission fails outright. It appears that this failure only occurs for
jobs with memory and/or disk constraints. Worse yet, it's not consistent:
it only seems to happen sometimes. I also cannot reproduce this in our
test environment.
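
To make that concrete (the script name and numbers below are invented, not
the actual job), the failing submissions look roughly like

sbatch --partition=full --mem=4000 --tmp=10000 job.sh

i.e. requests well inside the 48000 MB of RealMemory configured on these
nodes.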

A typical node configuration line looks like this:

NodeName=gizmod[51-60] Sockets=2 CoresPerSocket=6 RealMemory=48000
Weight=10 Feature=full,restart,rx200,ssd

though I've got FastSchedule=0. Honestly, it *feels* like there's a window
after restart where the node data hasn't yet been fully loaded from the
slurmds, so the scheduler doesn't see any nodes that satisfy the
requirements.
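
One way to check that hunch is to compare what the controller currently
believes about a node against slurm.conf, e.g.

scontrol show node gizmod51 | grep -E 'RealMemory|TmpDisk|State'

(node name picked arbitrarily from the range above); if the registration
race is real, those values should look wrong for a short window after a
restart.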

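If that is what's happening, a possible workaround (a sketch only; the
TmpDisk figure is invented) would be to pin the values in slurm.conf and
switch to FastSchedule=1, so the scheduler trusts the configured numbers
rather than whatever the slurmds have reported so far:

FastSchedule=1
NodeName=gizmod[51-60] Sockets=2 CoresPerSocket=6 RealMemory=48000
TmpDisk=100000 Weight=10 Feature=full,restart,rx200,ssd

I haven't verified that this avoids the problem, though.
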
Thanks all...

Michael
