Discussion:
backfill breaking out too early
Michael Gutteridge
2014-06-05 18:16:57 UTC
Permalink
I'm running slurm 2.6.9. I've got the backfill scheduler set up with
some pretty ridiculous parameters, as we have a large number of queued
jobs of various dimensions:

SchedulerParameters=default_queue_depth=10000,bf_continue,bf_interval=120,bf_max_job_user=10000,bf_resolution=600,bf_window=4320,bf_max_job_part=10000
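
For reference, that's with the backfill plugin selected, so the
scheduling section of slurm.conf is roughly:

    SchedulerType=sched/backfill
    SchedulerParameters=default_queue_depth=10000,bf_continue,...   # the full line above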

This has been working fine (backfill was effectively going through the
full queue), but today it appears to have stopped: jobs which should be
backfilled onto idle resources aren't being run. The scheduler log
shows:

[2014-06-04T13:16:10.107] sched: Running job scheduler
[2014-06-04T13:16:10.111] sched: JobId=7060218. State=PENDING.
Reason=Resources. Priority=10850. Partition=campus.
[2014-06-04T13:16:10.111] sched: JobId=7060219. State=PENDING.
Reason=Priority(Priority), Priority=10850, Partition=campus.
[2014-06-04T13:16:10.111] sched: already tested 3 jobs, breaking out

My understanding is that it shouldn't hit that limit until
default_queue_depth. Has my controller lost its mind? I've got a
nearly identical test setup where this is working as I'd expect.

Any hints appreciated... thanks much

Michael
Paul Edmon
2014-06-05 18:33:32 UTC
Permalink
That's the main scheduler breaking out, not the backfill scheduler, if
I'm reading that correctly. The main scheduler will keep going until it
can't schedule any more jobs or until it hits a four-second runtime limit.

At least that's my read of that log.
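
If you want to confirm which loop it is, sdiag reports the two
separately (section names vary a little by version); something like

    sdiag | grep -A 10 -i backfill

should show whether the backfill cycle counters are still ticking up
even while the main scheduler prints that "breaking out" message.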

-Paul Edmon-
Danny Auble
2014-06-05 19:04:30 UTC
Permalink
Hey Michael,

A commit in 14.03.1 that may be related to what you are seeing is
e94f10b8a2f85936e487358a0da001a271898d4f. It is a partial revert of
commit 9b1dadea4eb823b5ef29d8b4ee56cb6b7c3be22f, which first appeared in
2.6.8. Try applying that commit (or upgrading to 14.03) and see if it
fixes your issue.
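
If you'd rather patch than upgrade, cherry-picking that commit onto
your 2.6.9 source tree should be enough; something like the following
(the repository URL and branch name here are from memory, so adjust to
wherever your source lives):

    git clone https://github.com/SchedMD/slurm.git
    cd slurm
    git checkout slurm-2.6    # or use your existing 2.6.9 tree
    git cherry-pick e94f10b8a2f85936e487358a0da001a271898d4f

then rebuild and restart slurmctld.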

I think Paul is correct though. I don't think this is the backfill
loop, but the normal scheduler.
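
It's also worth a quick sanity check that the controller is actually
running with the parameters you think it is, e.g.

    scontrol show config | grep SchedulerParameters

and comparing that against what's in slurm.conf.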

Danny
j***@public.gmane.org
2014-06-05 19:18:31 UTC
Permalink
The message is definitely from the main scheduling loop rather than
backfill. I would guess that three batch jobs were submitted since the
last time the scheduler ran, and it is only testing those three jobs for
scheduling at that point.
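
If you want to see what the backfill loop itself is doing, you can also
turn on its debug flag and watch the slurmctld log; backfill will then
log each cycle and how many jobs it considered (flag name and log text
are from memory, so treat this as a sketch):

    # in slurm.conf
    DebugFlags=Backfill
    # then have the controller re-read its config
    scontrol reconfigure

The "backfill:" lines in the slurmctld log come from the backfill loop;
the "sched:" lines you quoted come from the main scheduler.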