Discussion: Dynamic partitions on Linux cluster
Uwe Sauter
2014-08-14 09:11:31 UTC
Hi all,

I have a question about a configuration detail: "dynamic partitions".

Situation:
I operate a Linux cluster of currently 54 nodes for a cooperation between
two different institutes at the university. To reflect the ratio of
funding each institute invested, I configured SLURM with two partitions,
one per institute. Each partition has a fixed set of nodes assigned to
it, e.g.

PartitionName=InstA Nodes=n[01-20]
PartitionName=InstB Nodes=n[21-54]

To improve availability when nodes fail (and perhaps save some power),
I'd like to configure SLURM so that jobs can be assigned nodes from the
whole pool while still respecting the number of nodes each institute
bought.


Research so far:
There is a partition configuration option called "MaxNodes", but the man
page states that this restricts the maximum number of nodes PER JOB.
It is probably possible to get something similar working through limit
enforcement in accounting, but I don't fully understand that part of
SLURM yet.
BlueGene systems seem to offer something along these lines, but that is
for IBM systems only.
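To illustrate, the closest option I found only caps each individual job,
not the partition as a whole:

# limits every single job to 20 nodes, but not the sum of all running jobs
PartitionName=InstA Nodes=n[01-54] MaxNodes=20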


Question:
Is it possible to configure SLURM so that both partitions can utilize
all nodes while respecting a maximum number of nodes that may be in use
at the same time? Something like:

PartitionName=InstA Nodes=n[01-54] MaxPartNodes=20
PartitionName=InstB Nodes=n[01-54] MaxPartNodes=34


So is there a way to achieve this using the config file? Do I have to use
accounting to enforce the limits? Or is there another way that I don't see?


Best regards,

Uwe Sauter
Bill Barth
2014-08-14 12:33:32 UTC
Why not make one partition and use fairshare to balance the usage over
time? That way both institutes can run large jobs that span the whole
machine when others are not using it.
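Roughly, in slurm.conf it could be as simple as this (just a sketch; the
partition name is a placeholder and it assumes accounting via slurmdbd is
already in place):

# one partition spanning the whole machine
PartitionName=normal Nodes=n[01-54] Default=YES State=UP

# weight historical usage into job priority
PriorityType=priority/multifactor
PriorityWeightFairshare=100000
PriorityDecayHalfLife=7-0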

Bill.
--
Bill Barth, Ph.D., Director, HPC
bbarth-***@public.gmane.org | Phone: (512) 232-7069
Office: ROC 1.435 | Fax: (512) 475-9445
Uwe Sauter
2014-08-14 12:48:32 UTC
Hi Bill,

if I understand the concept of fairshare correctly, it could result in a
situation where one institute uses all of the resources.

Because of that, fairshare is out of the question: I have to enforce the
ratio between the institutes and cannot allow one institute to use more
than what it paid for. If an institute doesn't use its resources, those
nodes have to sit idle (or power down).

You could compare my situation to running two clusters that share the
same base infrastructure. What I want is to let users of both institutes
use both clusters - but, at any point in time, each institute may use at
most the number of nodes that belong to "its" cluster.


Regards,

Uwe
Bill Barth
2014-08-14 12:58:33 UTC
Yes, yes it does. I don't mean to be harsh, but doing it their way is a
potentially huge waste of resources. Why not get each institute to agree
to share the whole machine in proportion to what they paid? Each institute
gets an allocation of time (through accounting) and a fairshare fraction
in the ratio of their contribution, but is allowed to use the whole
machine. If both institutes have periods of down time, then the machine
will be less likely to sit idle and more work will get done.
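Concretely, the accounting side might look roughly like this (account
names and numbers are only placeholders):

# one account per institute, with shares in the 20:34 ratio of the contributions
sacctmgr add account inst_a
sacctmgr add account inst_b
sacctmgr modify account where name=inst_a set Fairshare=20
sacctmgr modify account where name=inst_b set Fairshare=34

# optionally grant each institute a CPU-minute allocation as well
# (enforced only if AccountingStorageEnforce in slurm.conf includes "limits")
sacctmgr modify account where name=inst_a set GrpCPUMins=20000000
sacctmgr modify account where name=inst_b set GrpCPUMins=34000000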

I'll get off my soapbox now.

Best,
Bill.
--
Bill Barth, Ph.D., Director, HPC
bbarth-***@public.gmane.org | Phone: (512) 232-7069
Office: ROC 1.435 | Fax: (512) 475-9445
Uwe Sauter
2014-08-14 13:25:08 UTC
I would totally agree with you, but the university administration has to
justify the first institute's share (because it was paid for with federal
money), while the other institute paid for its share itself and can do
with it whatever it wants.

This is the reason for the current inflexible mapping between partitions
and nodes. To get away from that, for better availability, I'm looking
for a dynamic mapping that enforces the ratio between the institutes
while allocating nodes flexibly from the whole pool.

I know it's a waste of resources, but I am bound to this decision...

Regards,

Uwe
Paul Edmon
2014-08-14 13:47:35 UTC
We have a somewhat similar situation here. A possible solution that may
work for you is QOS. A QOS behaves like a synthetic partition, so you can
have a single partition but multiple QOSs that can flex around down
nodes.

From the experimentation I have done with them, this may be a good
solution for you.
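As a rough sketch (the QOS names are just examples), a single partition
plus one node-capped QOS per institute would look like:

# slurm.conf: one partition over the whole pool
PartitionName=all Nodes=n[01-54] Default=YES State=UP

# accounting: one QOS per institute, capped at its share of the nodes
sacctmgr add qos inst_a
sacctmgr add qos inst_b
sacctmgr modify qos where name=inst_a set GrpNodes=20
sacctmgr modify qos where name=inst_b set GrpNodes=34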

-Paul Edmon-
Ryan Cox
2014-08-14 14:26:33 UTC
I would also recommend QOS if you absolutely can't use fairshare. Set up
a QOS per institute with a GrpNodes limit in the correct ratio and allow
institute members access only to their own QOS (make it their default,
too).

Alternatively, you can create one account per institute and set GrpNodes
there, though that is less flexible than a QOS.
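For example (a sketch only; the account names are made up, and it assumes
the inst_a/inst_b QOSs from the sketch above already exist):

# tie each institute's account to its QOS and make it the default
sacctmgr modify account where name=inst_a set QOS=inst_a DefaultQOS=inst_a
sacctmgr modify account where name=inst_b set QOS=inst_b DefaultQOS=inst_b

# alternative without QOS: cap the association itself
sacctmgr modify account where name=inst_a set GrpNodes=20
sacctmgr modify account where name=inst_b set GrpNodes=34

# slurm.conf must be told to enforce these limits
AccountingStorageEnforce=limits,qos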

Ryan
--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University