Discussion: Partition for unused resources until needed by any other partition
Mikael Johansson
2014-10-20 17:51:38 UTC
Hello All,

I've been scratching my head for a while now trying to figure this one
out, which I would think would be a rather common setup.

I would need to set up a partition (or whatever, maybe a partition is
actually not the way to go) with the following properties:

1. If there are any unused cores on the cluster, jobs submitted to this
one would use them, and immediately have access to them.

2. The jobs should only use these resources until _any_ other job in
another partition needs them. In this case, the jobs should be
preempted and requeued.

So this should be some sort of "shadow" queue/partition that shouldn't
affect the scheduling of other jobs on the cluster, but just use up any
free resources that momentarily happen to be available. SLURM should
continue scheduling everything else normally, treat the cores used by
this shadow queue as free resources, and then immediately cancel and
requeue any jobs there when a "real" job starts.

If anyone has something like this set up, example configs would be very
welcome, as would, of course, any other suggestions and ideas.


Cheers,
Mikael J.
http://www.iki.fi/~mpjohans/
j***@public.gmane.org
2014-10-20 18:02:41 UTC
This should help:
http://slurm.schedmd.com/preempt.html
--
Morris "Moe" Jette
CTO, SchedMD LLC
Paul Edmon
2014-10-20 18:16:33 UTC
Yes, we have this set up here. Here is an example:

# Serial Requeue
PartitionName=serial_requeue Priority=1 \
PreemptMode=REQUEUE MaxTime=7-0 Default=YES MaxNodes=1 \
AllowGroups=cluster_users \
Nodes=blah


# Priority
PartitionName=priority Priority=10 \
AllowGroups=important_people \
Nodes=blah


# JOB PREEMPTION
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE

Since serial_requeue has the lowest priority, it gets scheduled last, and
if any jobs come in from the higher-priority queue, they preempt and
requeue the lower-priority jobs.
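
For illustration, a job aimed at a partition like that just needs to be
safe to requeue; something along these lines (the script and program
names are made up, not our actual setup):

#!/bin/bash
# Hypothetical submission script for the low-priority partition above.
#SBATCH --partition=serial_requeue
#SBATCH --ntasks=1
#SBATCH --time=02:00:00
#SBATCH --requeue     # allow SLURM to put the job back in the queue on preemption
srun ./my_program

Submit it with "sbatch lowprio.sh"; when a higher-priority job needs the
cores, the job should be killed and go back to pending automatically.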

-Paul Edmon-
Mikael Johansson
2014-10-20 18:18:35 UTC
Hello,

Yeah, I looked at that, and now have four partitions defined like this:

PartitionName=short Nodes=node[005-026] Default=YES MaxNodes=6 MaxTime=02:00:00 AllowGroups=ALL Priority=2 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=no PreemptMode=off
PartitionName=medium Nodes=node[009-026] Default=NO MaxNodes=4 MaxTime=168:00:00 AllowGroups=ALL Priority=2 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=no PreemptMode=off
PartitionName=long Nodes=node[001-004] Default=NO MaxNodes=4 MaxTime=744:00:00 AllowGroups=ALL Priority=2 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=no PreemptMode=off
PartitionName=backfill Nodes=node[001-026] Default=NO MaxNodes=10 MaxTime=168:00:00 AllowGroups=ALL Priority=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=no PreemptMode=requeue


And I've set up:

PreemptType=preempt/partition_prio
PreemptMode=requeue
PriorityType=priority/multifactor
PriorityWeightFairshare=10000
PriorityWeightAge=2000
SelectType=select/cons_res


It works insofar as when a job in "backfill" gets running, it is
requeued when a job in one of the other partitions starts.

The problem is that there are plenty of free cores on the cluster that don't
get assigned jobs from "backfill". If I understand things correctly, this
is because there are jobs with a higher priority queueing in the other
partitions.
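
Something like the following (the squeue format string is just an
example) shows the pending jobs and their priorities per partition:

# Pending jobs: id, partition, priority, reason, state
squeue -t PENDING -o "%.10i %.10P %.10Q %.12r %.8T"

# Per-factor priority breakdown of pending jobs
sprio -l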

So I would maybe need a mechanism that increases the priority of the
backfill jobs while queueing, but then immediately decreases it when the
jobs start?


Cheers,
Mikael J.
http://www.iki.fi/~mpjohans/
Paul Edmon
2014-10-20 18:23:42 UTC
I advise using the following SchedulerParameters, in particular
partition_job_depth and bf_max_job_part. These will force the scheduler to
schedule jobs for each partition; otherwise it takes a strictly top-down
approach.


This is what we run:

# default_queue_depth should be some multiple of the partition_job_depth,
# ideally number_of_partitions * partition_job_depth.
SchedulerParameters=default_queue_depth=5700,partition_job_depth=100,bf_interval=1,bf_continue,bf_window=2880,bf_resolution=3600,bf_max_job_test=50000,bf_max_job_part=50000,bf_max_job_user=1,bf_max_job_start=100,max_rpc_cnt=8

These parameters work well for a cluster of 50,000 cores, 57 queues, and
about 40,000 jobs per day. We are running 14.03.8.
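
Scaled down to a setup with only four partitions, a rough starting point
might look like the line below (the numbers are only a guess, and some of
these options require a newer SLURM release):

# 4 partitions * partition_job_depth of 100
SchedulerParameters=default_queue_depth=400,partition_job_depth=100,bf_continue,bf_max_job_part=100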

-Paul Edmon-
Mikael Johansson
2014-10-21 20:24:35 UTC
Thanks!

That looks like something that could be useful indeed. For the moment we
are stuck with version 2.2.7, though, and if I understood the docs
correctly, most of the partition-based parameters were introduced in later
versions. We might upgrade at some point in the future.

It also seems like the scheduling problems are not directly related to the
setup of the partitions, but rather to some odd behaviour on the part of
SLURM. I'll write a separate post about that, as it seems it might be a
bug.

Cheers, and thanks again,
Mikael J.
http://www.iki.fi/~mpjohans/