Discussion:
backfill breaking out too early
Michael Gutteridge
2014-06-05 18:16:57 UTC
Permalink
I'm running slurm 2.6.9. I've got the backfill scheduler set up with
some pretty ridiculous parameters, as we have a large number of queued
jobs of various dimensions:

SchedulerParameters=default_queue_depth=10000,bf_continue,bf_interval=120,bf_max_job_user=10000,bf_resolution=600,bf_window=4320,bf_max_job_part=10000
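
For reference, that's with the backfill plugin selected, so the
scheduling section of slurm.conf is roughly:

    SchedulerType=sched/backfill
    SchedulerParameters=default_queue_depth=10000,bf_continue,...   # the full line above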

This has been working fine (backfill was effectively going through the
full queue), but today it appears to have stopped: jobs which should be
backfilled onto idle resources aren't being run. The scheduler log
shows:

[2014-06-04T13:16:10.107] sched: Running job scheduler
[2014-06-04T13:16:10.111] sched: JobId=7060218. State=PENDING.
Reason=Resources. Priority=10850. Partition=campus.
[2014-06-04T13:16:10.111] sched: JobId=7060219. State=PENDING.
Reason=Priority(Priority), Priority=10850, Partition=campus.
[2014-06-04T13:16:10.111] sched: already tested 3 jobs, breaking out

My understanding is that it shouldn't hit that limit until
default_queue_depth. Has my controller lost its mind? I've got a
nearly identical test setup where this is working as I'd expect.

Any hints appreciated... thanks much

Michael
Paul Edmon
2014-06-05 18:33:32 UTC
Permalink
That's the main scheduler breaking out, not the backfill scheduler, if
I'm reading that correctly. The main scheduler will keep going until it
can't schedule any more jobs or until it hits a four-second runtime limit.

At least that's my read of that log.
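
If you want to confirm which loop it is, sdiag reports the two
separately (section names vary a little by version); something like

    sdiag | grep -A 10 -i backfill

should show whether the backfill cycle counters are still ticking up
even while the main scheduler prints that "breaking out" message.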

-Paul Edmon-
Danny Auble
2014-06-05 19:04:30 UTC
Permalink
Hey Michael,

A commit in 14.03.1 that may be related to what you are seeing is
e94f10b8a2f85936e487358a0da001a271898d4f. It is a partial revert of
commit 9b1dadea4eb823b5ef29d8b4ee56cb6b7c3be22f, which first appeared in
2.6.8. Try applying that commit (or upgrading to 14.03) and see if it
fixes your issue.
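
If you'd rather patch than upgrade, cherry-picking that commit onto
your 2.6.9 source tree should be enough; something like the following
(the repository URL and branch name here are from memory, so adjust to
wherever your source lives):

    git clone https://github.com/SchedMD/slurm.git
    cd slurm
    git checkout slurm-2.6    # or use your existing 2.6.9 tree
    git cherry-pick e94f10b8a2f85936e487358a0da001a271898d4f

then rebuild and restart slurmctld.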

I think Paul is correct though. I don't think this is the backfill
loop, but the normal scheduler.
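
It's also worth a quick sanity check that the controller is actually
running with the parameters you think it is, e.g.

    scontrol show config | grep SchedulerParameters

and comparing that against what's in slurm.conf.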

Danny
j***@public.gmane.org
2014-06-05 19:18:31 UTC
Permalink
The message is definitely from the main scheduling loop rather than
backfill. I would guess that three batch jobs were submitted since the
last time the scheduler ran, and it is only testing those three jobs for
scheduling at that point.
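
If you want to see what the backfill loop itself is doing, you can also
turn on its debug flag and watch the slurmctld log; backfill will then
log each cycle and how many jobs it considered (flag name and log text
are from memory, so treat this as a sketch):

    # in slurm.conf
    DebugFlags=Backfill
    # then have the controller re-read its config
    scontrol reconfigure

The "backfill:" lines in the slurmctld log come from the backfill loop;
the "sched:" lines you quoted come from the main scheduler.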