Michael Gutteridge
2014-06-05 18:16:57 UTC
I'm running Slurm 2.6.9. I've got the backfill scheduler set up with
some pretty ridiculous parameters, as we have a large number of queued
jobs of various dimensions:
SchedulerParameters=default_queue_depth=10000,bf_continue,bf_interval=120,bf_max_job_user=10000,bf_resolution=600,bf_window=4320,bf_max_job_part=10000
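For anyone wanting to compare this against another controller's settings, here is a minimal Python sketch that splits the line above into individual options. The function name `parse_scheduler_parameters` is my own and not part of Slurm; it only does string handling on the slurm.conf line and does not query the controller.

```python
def parse_scheduler_parameters(line):
    """Split a 'SchedulerParameters=...' line into a dict of options.

    Hypothetical helper, not a Slurm API. Options without a value
    (such as bf_continue) are stored as True.
    """
    key, _, value = line.strip().partition("=")
    if key != "SchedulerParameters":
        raise ValueError("not a SchedulerParameters line")
    params = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, val = item.partition("=")
        if not sep:
            params[name] = True          # bare flag
        elif val.isdigit():
            params[name] = int(val)      # numeric option
        else:
            params[name] = val           # anything else kept as text
    return params


conf_line = (
    "SchedulerParameters=default_queue_depth=10000,bf_continue,"
    "bf_interval=120,bf_max_job_user=10000,bf_resolution=600,"
    "bf_window=4320,bf_max_job_part=10000"
)
params = parse_scheduler_parameters(conf_line)
```

Running the same function on the test setup's line and diffing the two dicts would show whether the configurations really are identical.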
This configuration has been working fine, with backfill effectively
going through the full queue, but today it appears to have stopped:
jobs which should be backfilled onto idle resources aren't being run.
The scheduler log shows:
[2014-06-04T13:16:10.107] sched: Running job scheduler
[2014-06-04T13:16:10.111] sched: JobId=7060218. State=PENDING.
Reason=Resources. Priority=10850. Partition=campus.
[2014-06-04T13:16:10.111] sched: JobId=7060219. State=PENDING.
Reason=Priority(Priority), Priority=10850, Partition=campus.
[2014-06-04T13:16:10.111] sched: already tested 3 jobs, breaking out
My understanding is that the scheduler shouldn't break out until it has
tested default_queue_depth jobs. Has my controller lost its mind? I've
got a nearly identical test setup where this is working as I'd expect.
Any hints appreciated... thanks much
Michael