Discussion:
Implementing fair-share policy using BLCR
Eyal Privman
2014-09-22 21:36:39 UTC
Permalink
Hi,

We're setting up a new cluster in our faculty and want to use a SLURM +
BLCR combination. We couldn't figure out how BLCR checkpointing could best
be used to improve our fair-share policy.

This is a general scenario that's relevant to many academic research
institutes: each research group should get a certain percentage of the
nodes; when some groups aren't running jobs, their resources should be
distributed among the others; when they start running jobs again, they
should be given their share back. The problem with classical scheduling
policies without checkpointing is that very long jobs run by one group can
monopolize the share of other groups for a very long time (days). We cannot
split long jobs into several shorter jobs.

This is how I would imagine the ideal solution: all resources are
distributed between the groups running jobs at any given moment; when
another group submits jobs, its share is immediately freed by
automatically checkpointing and evicting the excess jobs of the running
groups. When nodes become available again, the checkpointed jobs are
automatically resumed.

I know some places implement a policy where each group owns a high-priority
queue for its share of nodes and a public low-priority queue allows anyone
to run on the unused nodes. This public queue is checkpointed to allow jobs
to be evicted by the node owner and later resumed. However, this solution
requires each user to manually split their jobs between the private and
public queues, to monitor their progress, and to redistribute jobs between
the queues if one queue is slower than expected. It would be desirable for
SLURM to manage this automatically, without the complication of having two
queues: everybody submits to a single queue, and the scheduler makes sure
every group gets its fair share at any moment, instantaneously thanks to
checkpointing. Is such a solution possible?

Many thanks,
Eyal
Yann Sagon
2014-09-23 14:18:34 UTC
Permalink
Post by Eyal Privman
Hi,
I know some places implement a policy where each group owns a
high-priority queue for its share of nodes and a public low-priority queue
allows anyone to run on the unused nodes. This public queue is checkpointed
to allow jobs to be evicted by the node owner and later resumed. However,
this solution requires each user to manually split their jobs between the
private and public queues, to monitor their progress, and to redistribute
jobs between the queues if one queue is slower than expected. It's
desirable that SLURM would automatically manage this without the
complication of having two queues. I.e. everybody submits to one queue and
the smart scheduler manages everything so that every group gets its fair
share at any moment, instantaneously thanks to checkpointing. Is such a
solution possible?
To reduce the problem of having to deal with two queues, you can specify
both queues when you submit a job, like this: --partition=queue1,queue2.
The first one that is free is selected.
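
For example (queue1, queue2 and job.sh are just placeholders here):

    sbatch --partition=queue1,queue2 job.sh

The job is then eligible for both partitions and starts in whichever one
can schedule it first.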
Eyal Privman
2014-09-23 14:22:32 UTC
Permalink
Sounds good, thanks!

Eyal
Post by Yann Sagon
Post by Eyal Privman
Hi,
I know some places implement a policy where each group owns a
high-priority queue for its share of nodes and a public low-priority queue
allows anyone to run on the unused nodes. This public queue is checkpointed
to allow jobs to be evicted by the node owner and later resumed. However,
this solution requires each user to manually split their jobs between the
private and public queues, to monitor their progress, and to redistribute
jobs between the queues if one queue is slower than expected. It's
desirable that SLURM would automatically manage this without the
complication of having two queues. I.e. everybody submits to one queue and
the smart scheduler manages everything so that every group gets its fair
share at any moment, instantaneously thanks to checkpointing. Is such a
solution possible?
To lower the problem of having to deal with two queues, you can specify
the two queues like that when you submit a job : --partition=queue1,queue2
and the first one that is free is selected.
Kilian Cavalotti
2014-09-23 16:52:32 UTC
Permalink
Hi,
To lower the problem of having to deal with two queues, you can specify the two queues like that when you submit a job : --partition=queue1,queue2 and the first one that is free is selected.
You can even define an env variable in users' environment so they
don't have to type anything. "export SLURM_PARTITION=queue1,queue2"
would do the same. Note that for sbatch, it's SBATCH_PARTITION, and
SALLOC_PARTITION for salloc.
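For example, something you could drop into a shared profile script (the
partition names are only an example):

    export SBATCH_PARTITION=queue1,queue2   # picked up by sbatch
    export SALLOC_PARTITION=queue1,queue2   # picked up by salloc
    export SLURM_PARTITION=queue1,queue2    # picked up by srun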

Cheers,
--
Kilian
Danny Auble
2014-09-23 16:54:31 UTC
Permalink
Or just use the all_partitions job_submit plugin.
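That is, in slurm.conf:

    JobSubmitPlugins=all_partitions

which makes a job that doesn't name a partition eligible for all
partitions, so users don't have to list them at all.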
Post by Kilian Cavalotti
Hi,
To lower the problem of having to deal with two queues, you can specify the two queues like that when you submit a job : --partition=queue1,queue2 and the first one that is free is selected.
You can even define an env variable in users' environment so they
don't have to type anything. "export SLURM_PARTITION=queue1,queue2"
would do the same. Note that for sbatch, it's SBATCH_PARTITION, and
SALLOC_PARTITION for salloc.
Cheers,
Trey Dockendorf
2014-09-23 17:26:33 UTC
Permalink
To avoid hijacking the previous thread I'm starting a new one, as my question is different from the OP's but relates to the suggestions given.

Submitting with multiple partitions is a very useful feature I hadn't realized existed. I gave it a quick test on my system (still in a "BETA" phase) and found that my job_submit plugin prevents those submissions from working as expected.

My job_submit plugin chooses a QOS based on the partition and the default account of the user submitting a job. We have decided not to require users to specify a QOS, as some of our users run software that cannot handle the --qos flag (jobs coming from Grid / OSG / etc.).

Has anyone used the Lua job_submit plugin while also allowing multiple partitions? I'm not even sure what the partition value would be in the Lua code when a job is submitted with "--partition=general,background", for example.

A broader question: would a QOS have to be allowed on all the partitions listed if the --qos flag is used?

My job_submit.lua - https://gist.github.com/treydock/b964c5599fd057b0aa6a

Thanks,
- Trey

=============================

Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: treydock-mRW4Vj+***@public.gmane.org
Jabber: treydock-mRW4Vj+***@public.gmane.org

----- Original Message -----
Sent: Tuesday, September 23, 2014 11:55:14 AM
Subject: [slurm-dev] Re: Implementing fair-share policy using BLCR
Or just use the all_partitions job_submit plugin.
Post by Kilian Cavalotti
Hi,
Post by Yann Sagon
To lower the problem of having to deal with two queues, you can specify
the two queues like that when you submit a job: --partition=queue1,queue2
and the first one that is free is selected.
You can even define an env variable in users' environment so they
don't have to type anything. "export SLURM_PARTITION=queue1,queue2"
would do the same. Note that for sbatch, it's SBATCH_PARTITION, and
SALLOC_PARTITION for salloc.
Cheers,
Ryan Cox
2014-09-29 15:49:33 UTC
Permalink
Post by Trey Dockendorf
Has anyone used the Lua job_submit plugin while also allowing multiple partitions? I'm not even sure what the partition value would be in the Lua code when a job is submitted with "--partition=general,background", for example.
We do. We use the all_partitions plugin together with our own Lua plugin
for job submission. In the Lua script, we remove partitions from the list
that the job shouldn't have access to for whatever reason. Reasons include:
the job didn't request enough memory to "need" a bigmem node, the job
didn't request a GPU and this is a GPU partition, etc. The partition string
is comma-separated, so you can split it into an array.
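
As a rough illustration, a trimmed-down, untested sketch of that filtering
(not our production script; the "bigmem" partition and the 64 GB threshold
are made up, and it ignores --mem-per-cpu for brevity):

    -- job_submit.lua (sketch): drop partitions the job shouldn't land in
    function slurm_job_submit(job_desc, part_list, submit_uid)
       if job_desc.partition ~= nil then
          local kept = {}
          -- the partition field is a comma-separated string; split it
          for p in string.gmatch(job_desc.partition, "[^,]+") do
             -- example rule: only keep the hypothetical "bigmem" partition
             -- if the job asked for at least ~64 GB per node
             if p ~= "bigmem" or (job_desc.pn_min_memory or 0) >= 64*1024 then
                table.insert(kept, p)
             end
          end
          if #kept > 0 then
             job_desc.partition = table.concat(kept, ",")
          end
       end
       return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
       return slurm.SUCCESS
    end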


Ryan
Trey Dockendorf
2014-09-29 18:50:35 UTC
Permalink
Ryan,

Thanks for the information. Is your Lua script something you would be
willing to share with me, either via the mailing list or privately? I can
stumble my way around Lua and am curious how others define available
resources, conditions, allowed partitions, etc., in Lua. So far I've
resorted to hardcoded key-value pairs in the Lua script, as attempts to
call Slurm CLI commands via popen caused job submissions to time out.
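
For reference, the hardcoding amounts to something like the sketch below,
simplified to a partition-only lookup (the partition and QOS names are
placeholders, not our real configuration):

    -- static lookup used instead of shelling out to Slurm commands via io.popen
    local qos_for_partition = {
       general    = "general_qos",
       background = "background_qos",
    }

    function slurm_job_submit(job_desc, part_list, submit_uid)
       -- pick a QOS from the table if the user didn't specify one
       if job_desc.qos == nil and job_desc.partition ~= nil then
          -- note: with --partition=general,background the key lookup misses,
          -- which is exactly the open question above
          job_desc.qos = qos_for_partition[job_desc.partition]
       end
       return slurm.SUCCESS
    end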

Thanks,
- Trey

=============================

Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Post by Ryan Cox
Post by Trey Dockendorf
Has anyone used the Lua job_submit plugin while also allowing multiple
partitions? I'm not even sure what the partition value would be in the Lua
code when a job is submitted with "--partition=general,background", for
example.
We do. We use the all_partitions plugin and our own Lua plugin for job
submission. In the Lua script, we remove partitions from the array that
they shouldn't have access to for whatever reason. Reasons include: the
job didn't request enough memory to "need" a bigmem node, the job didn't
request a GPU and this is a GPU partition, etc. The partition string has
commas so you can explode() it into an array.
Ryan
Eyal Privman
2014-09-24 06:01:35 UTC
Permalink
Thanks to all for the tips!
They will definitely make our lives easier. However, this isn't really the
ideal solution. One drawback is that a job that started running in the
checkpointed low-priority queue will stay there until it finishes, even if
a slot becomes available in the high-priority queue. In principle, there is
no need for two queues, so I still think the ideal scheduler would do this
with one queue.
Nevertheless, thanks for the great development and support!

Eyal
Post by Danny Auble
Or just use the all_partitions job_submit plugin.
Post by Kilian Cavalotti
Hi,
Post by Yann Sagon
To lower the problem of having to deal with two queues, you can specify
the two queues like that when you submit a job : --partition=queue1,queue2
and the first one that is free is selected.
You can even define an env variable in users' environment so they
don't have to type anything. "export SLURM_PARTITION=queue1,queue2"
would do the same. Note that for sbatch, it's SBATCH_PARTITION, and
SALLOC_PARTITION for salloc.
Cheers,