Eyal Privman
2014-09-22 21:36:39 UTC
Hi,
We're setting up a new cluster in our faculty and we want to use a SLURM +
BLCR combination. We couldn't figure out how BLCR checkpointing is best
used to improve fair-share scheduling.
This is a general scenario that's relevant to many academic research
institutes: each research group should get a certain percentage of the
nodes; when some groups aren't running jobs, their resources should be
distributed among the others; when they suddenly start running jobs again,
they should get their share back. The problem with classical scheduling
policies without checkpointing is that very long jobs from one group can
monopolize other groups' shares for a very long time (days), and we cannot
split long jobs into several shorter ones.
This is how I would imagine the ideal solution: all resources are
distributed among the groups that are running jobs at any given moment;
when another group submits jobs, its share is immediately freed by
automatically checkpointing and evicting the excess jobs of the currently
running groups. When nodes become available again, the checkpointed jobs
are automatically resumed.
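For concreteness, this is roughly the kind of configuration I have in mind
(just an illustrative sketch; the group names and share values are made up,
and I'm not at all sure this combination actually produces the behaviour
described above):

    # slurm.conf (fragment)
    PriorityType=priority/multifactor
    PriorityWeightFairshare=100000      # let fair-share dominate job priority
    PreemptType=preempt/qos             # jobs in a higher-priority QOS may preempt lower ones
    PreemptMode=CHECKPOINT              # preempted jobs are checkpointed rather than killed
    CheckpointType=checkpoint/blcr      # BLCR checkpoint plugin
    JobCheckpointDir=/shared/checkpoints

    # Each group gets an account with its share of the cluster, e.g.:
    #   sacctmgr add account group_a fairshare=40
    #   sacctmgr add account group_b fairshare=60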
I know some places implement a policy where each group owns a high-priority
queue for its share of the nodes, while a public low-priority queue lets
anyone run on the unused nodes. Jobs in the public queue are checkpointed
so that they can be evicted by the node owner and resumed later (roughly as
in the sketch below).
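My understanding of that setup in SLURM terms is roughly the following
(again only a sketch; partition names, node ranges and priorities are made
up):

    # slurm.conf (fragment)
    PreemptType=preempt/partition_prio  # higher-priority partitions preempt lower-priority ones
    PreemptMode=CHECKPOINT
    CheckpointType=checkpoint/blcr

    # Each group owns a high-priority partition over its own nodes ...
    PartitionName=group_a Nodes=node[001-010] Priority=10 PreemptMode=OFF
    PartitionName=group_b Nodes=node[011-030] Priority=10 PreemptMode=OFF
    # ... plus a shared low-priority partition spanning all nodes, whose jobs
    # are checkpointed and pushed aside when an owning group needs its nodes.
    PartitionName=public  Nodes=node[001-030] Priority=1  PreemptMode=CHECKPOINT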
However, this kind of setup requires each user to manually split their jobs
between the private and public queues, monitor their progress, and
redistribute jobs between the queues if one queue turns out to be slower
than expected. It would be much better if SLURM managed this automatically,
without the complication of two queues: everybody submits to a single queue
and the scheduler ensures that every group gets its fair share at any
moment, essentially instantaneously thanks to checkpointing. Is such a
solution possible?
Many thanks,
Eyal