Discussion:
fairshare - memory resource allocation
Bill Wichser
2014-07-25 13:13:31 UTC
I'd like to revisit this...


After struggling with memory allocation in some flavor of PBS for over
20 years, it was a wonderful thing to have cgroup support right out of
the box with Slurm. No longer do we have one job on a shared node
eating all the memory and killing everything else running there. But
we have found that this comes at a cost: memory usage is not adequately
fed back into the fairshare mechanism.

Looking at running jobs over the past 4 months, we found that we could
reduce the DefMemPerCPU allocation in slurm.conf to a value about 1G
less than the actual G/core available. This meant notifying the users
close to this new maximum so that they could adjust their scripts. We
also told users that if the default was more than they needed, they
would do best to reduce the limit to exactly what they require. That
part has proven much less successful.

So our default is 3G/core on nodes that actually have 4G/core
available. This lets a mix of bigger-memory and smaller-memory jobs
share a node when there are cores available but not enough memory for
the default request.
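For anyone wanting to set up something similar, a minimal slurm.conf
sketch could look like this (node count and sizes are illustrative
rather than our actual config; memory values are in MB):

# 12-core nodes with 48 GB (4 GB/core)
NodeName=node[001-100] CPUs=12 RealMemory=49152
# default allocation of 3 GB per core leaves headroom for bigger-memory jobs
DefMemPerCPU=3072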

Now that is good. It allows higher utilization of the nodes while
protecting each job's memory from the others. But the fairshare problem
shows up quickly with jobs that require, say, half the node's memory.
These are mostly serial jobs requesting a single core, which leaves
about 11 cores with only about 2G/core remaining. Worse, in the
fairshare calculations these jobs appear to be using only a single core
when in fact they are tying up half a node. You can see where this
causes issues.

Fairshare has a number of other issues as well, which I will raise in a
separate email.

Now maybe this is just a matter of constantly monitoring user jobs and
proactively going after users with small memory-per-core requirements.
We have attempted this in the past and found that the first job that
crashes due to insufficient memory prompts the user to raise the limit
in all of their scripts, so the process is never ending.

Another solution is to simply trust the users and just keep reminding
them about allocations. They are usually a smart bunch and are quite
creative when it comes to getting jobs to run! So maybe I am concerned
over nothing at all and things will just work out.

Bill
Lipari, Don
2014-07-25 16:04:34 UTC
Bill,

As I understand the dilemma you presented, you want to maximize the utilization of node resources when running with Slurm configured for SelectType=select/cons_res. To do this, you would like to nudge users into requesting only the amount of memory they will need for their jobs. The nudge would be in the form of decreased fair-share priority for users' jobs that request only one core but lots of memory.

I don't know of a way for Slurm to do this as it exists. I can only offer alternatives that have their pros and cons.

One alternative would be to add memory usage support to the multifactor priority plugin. This would be a substantial undertaking as it touches code not just in multifactor/priority_multifactor.c but also in structures that are defined in common/assoc_mgr.h as well as sshare itself.

A second less invasive option would be to redefine the multifactor/priority_multifactor.c's raw_usage to make it a configurable blend of cpu and memory usage. These changes could be more localized to the multifactor/priority_multifactor.c module. However, you would have a harder time justifying a user's sshare report because the usage numbers would no longer track jobs' historical cpu usage. Your response to a user who asked you to justify their sshare usage report would be, "trust me, it's right".

A third alternative (as I'm sure you know) is to give up on perfectly packed nodes and make every 4G of memory requested cost 1 cpu of allocation.
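For what it's worth, one way to approximate that third alternative with existing knobs might be MaxMemPerCPU, which bumps up a job's CPU allocation when its per-CPU memory request exceeds the limit. A rough sketch, values illustrative:

# slurm.conf: requests above 4 GB per CPU get extra CPUs allocated to cover the memory
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
MaxMemPerCPU=4096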

Perhaps there are other options, but those are the ones that immediately come to mind.

Don Lipari
Ryan Cox
2014-07-25 16:30:37 UTC
Bill and Don,

We have wondered about this ourselves. I just came up with this idea
and haven't thought it through completely, but option two seems like the
easiest. For example, you could modify lines like
https://github.com/SchedMD/slurm/blob/8a1e1384bacf690aed4c1f384da77a0cd978a63f/src/plugins/priority/multifactor/priority_multifactor.c#L952
to have a MAX() of a few different types.

I seem to recall seeing this on the list or in a bug report somewhere
already, but you could have different charge rates for memory or GPUs
compared to a CPU, maybe on a per partition basis. You could give each
of them a charge rate like:
PartitionName=p1 ChargePerCPU=1.0 ChargePerGB=0.5 ChargePerGPU=2.0 ......

So the line I referenced would be something like the following (except
using real code and real struct members, etc):
real_decay = run_decay * MAX(CPUs*ChargePerCPU,
TotalJobMemory*ChargePerGB, GPUs*ChargePerGPU);

In this case, each CPU is 1.0 but each GB of RAM is 0.5. Assuming no
GPUs are used, if the user requests 1 CPU and 2 GB of RAM the resulting
usage is 1.0. But if they use 4 GB of RAM and 1 CPU, it is 2.0, just as
if they had been using 2 CPUs. Essentially you define every 2 GB of
RAM to be equal to 1 CPU, so raw_usage could be redefined to deal with
"cpu equivalents".

It might be harder to explain to users but I don't think it would be too
bad.
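
As a rough illustration (the function and parameter names here are
invented for the example, not actual Slurm struct members), the
calculation could look something like:

#include <stdint.h>

/* Hypothetical "CPU equivalents" for one decay iteration: the largest of
 * the per-resource charges wins.  The charge rates would come from the
 * partition configuration. */
static double cpu_equivalents(uint32_t cpus, uint64_t mem_mb, uint32_t gpus,
                              double charge_per_cpu, double charge_per_gb,
                              double charge_per_gpu)
{
	double charge = cpus * charge_per_cpu;
	double mem_charge = (mem_mb / 1024.0) * charge_per_gb;
	double gpu_charge = gpus * charge_per_gpu;

	if (mem_charge > charge)
		charge = mem_charge;
	if (gpu_charge > charge)
		charge = gpu_charge;
	return charge;
}

/* ...and the referenced line would then be roughly:
 *   real_decay = run_decay * cpu_equivalents(...);
 */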

Ryan
--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University
Bill Wichser
2014-07-25 23:57:36 UTC
Thank you, Ryan. I'm not sure how we will proceed here.

Bill
Ryan Cox
2014-07-29 15:47:50 UTC
I'm interested in hearing opinions on this, if any. Basically, I think
there is an easy solution to the problem of a user using few CPUs but a
lot of memory without that being reflected well in the CPU-centric
usage stats.

Below is my proposal. There are likely some other good approaches out
there too (Don and Janne presented some) so feel free to tell me that
you don't like this idea :)


Short version

I propose that the Raw Usage be modified to *optionally* be ("CPU
equivalents" * time) instead of just (CPUs * time). The "CPU
equivalent" would be a MAX() over CPUs, memory, nodes, GPUs, energy for
that time period, and so on, each multiplied by a corresponding charge
rate that an admin can configure on a per-partition basis.

I wrote a simple proof of concept patch to demonstrate this (see "Proof
of Concept" below for details).


Longer version

The CPU equivalent would be used in place of total_cpus for calculating
usage_raw. I propose that the default charge rate be 1.0 for each CPU
in a job and 0.0 for everything else. This matches the current
behavior, so nothing changes if you choose not to define a different
charge rate. The reason I think this should be done on a partition
basis is that different partitions may have nodes with different
memory/core ratios: one partition may have 2 GB/core nodes and another
8 GB/core nodes, and you may want to charge differently on each.

If you define the charge rate for each CPU to be 1.0 and the charge rate
per GB of memory to be 0.5, that is saying that 2 GB of memory will be
equivalent to the charge rate for 1 CPU. 4 GB of memory would be
equivalent to 2 CPUs (4 GB * 0.5/GB). Since it is a MAX() of all the
available (resource * charge_rate) combinations, the largest value is
chosen. If a user uses 1 CPU and 1 TB of RAM out of a 1 TB node, the
user gets charged for using all the RAM. If a user uses 16 CPUs and 1
MB, the user gets charged for 16 CPUs.


Downsides

The problem that is not completely solved is a user who uses 1 CPU but
3/4 of the memory on a node. They only get billed for 3/4 of the node
but might make it unusable for others who need a whole or half node.
I'm not sure of a great way to solve that besides modifying the request
in a job submit plugin or requiring exclusive node access.

One other complication is resources that are measured by a counter
rather than a static allocation value, such as network bandwidth or
energy. This is a problem because the current approach is to
immediately begin decaying the cputime (aka usage) as it accumulates.
This means you would have to keep a delta value for each counter-based
resource, meaning you track that 5 GB have been transmitted since the
last decay thread iteration and then add only that 5 GB. This could get
messy when comparing MAX(total_cpus * charge_per_cpu, total_bw *
charge_bw_per_gb) each iteration, since the bandwidth may never reach a
high enough value to matter between iterations but might when
considered over the entire job.
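
To make that bookkeeping concrete, a minimal sketch of the delta
tracking could be (the names are made up for illustration):

#include <stdint.h>

/* Remember the counter reading from the previous decay pass so that only
 * the amount accumulated since then is charged in this iteration. */
struct counter_state {
	uint64_t last_value;
};

static double counter_charge(struct counter_state *cs, uint64_t current_value,
			     double charge_rate)
{
	uint64_t delta = current_value - cs->last_value;

	cs->last_value = current_value;
	return delta * charge_rate;	/* candidate value for this iteration's MAX() */
}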

I don't think this proposal would be too bad for something like energy.
You could define a charge rate per joule (or kilojoule or whatever) that
would equal the node's minimum power divided by core count. Then you
look at the delta of that time period. If they were allocated all cores
and used minimum power, they get charged 1.0 * core count. If they were
allocated all cores and used maximum power, they effectively get charged
for the difference in the node's max energy and min energy times the
energy charge rate. This calculation, as with others, would occur once
per decay thread iteration.


User Education

The reason I like this approach is that it is incredibly simple to
implement and I don't think it takes much effort to explain to users.
It would be easy to add other resources you want to charge for (it would
require a code addition, though it would be pretty simple if the data is
available in the right structs). It doesn't require any RPC changes.
sshare, etc. only need manpage clarifications to say that the usage
data is in "CPU equivalents". No new fields are required.

As for user education, you just need to explain the concept of "CPU
equivalents", something that can be easily done in the documentation.
The slurm.conf partition lines would be relatively easy to read too. If
you don't need to change the behavior, no slurm.conf changes or
explanations to users are required.


Proof of Concept

I did a really quick proof of concept (attached) based on the master
branch. It is very simple to charge for most things as long as the data
is there in the existing structs. One caveat for the test patch is that
I didn't see a float handler in the config parser, so I skipped that
for the test. Instead, each config parameter in slurm.conf should be
set to (desired_value * 1000) for now. Proper float handling can be
added if this is the route people want to take. The patch currently
implements charging for CPUs, memory (GB), and nodes.
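
With the test patch and the earlier example values, a partition line
would therefore look roughly like this for now (the parameter names are
the illustrative ones from earlier in the thread, scaled by 1000 until
proper float parsing exists):

# 1.0 per CPU and 0.5 per GB, expressed as value * 1000
PartitionName=p1 ChargePerCPU=1000 ChargePerGB=500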

Note: I saw a similar idea in a bug report from the University of
Chicago: http://bugs.schedmd.com/show_bug.cgi?id=858.

Ryan
--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University
Blomqvist Janne
2014-07-30 21:29:45 UTC
Hi,

If I understand it correctly, this is actually very close to Dominant Resource Fairness (DRF), which I mentioned previously, with the difference that in DRF the charge rates are determined automatically from the available resources (in a partition) rather than being specified explicitly by the administrator. For example, say we have a partition with 100 cores and 400 GB of memory. For a job requesting (10 CPUs, 20 GB), the domination calculation proceeds as follows:

1) Calculate the "domination vector" by dividing each element in the request vector (here, CPU & MEM) by the available resources. That is (10/100, 20/400) = (0.1, 0.05).

2) The MAX element in the domination vector is chosen (it "dominates" the others, hence the name of the algorithm) as the one to use in fairshare calculations, accounting etc. In this case, the CPU element (0.1).

Now for another job request, (1 CPU, 20 GB), the domination vector is (0.01, 0.05) and the MAX element is the memory element (0.05), so in this case the memory part of the request dominates.
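
For the curious, the same calculation as a tiny C snippet, using the
example numbers above:

#include <stdio.h>

int main(void)
{
	/* Example partition: 100 cores, 400 GB of memory. */
	double part_cpus = 100.0, part_mem_gb = 400.0;

	/* Job request: 10 CPUs, 20 GB. */
	double cpu_share = 10.0 / part_cpus;	/* 0.10 */
	double mem_share = 20.0 / part_mem_gb;	/* 0.05 */
	double dominant = cpu_share > mem_share ? cpu_share : mem_share;

	printf("dominant share = %.2f\n", dominant);	/* 0.10, so CPU dominates */
	return 0;
}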

In your patch you have used "cpu-sec equivalents" rather than "dominant share secs", but that's just a difference of a scaling factor. From a backwards compatibility and user education point of view cpu-sec equivalents seem like a better choice to me, actually.

So while your patch is more flexible than DRF in that it allows arbitrary charge rates to be specified, I'm not sure it makes sense to specify rates different from the DRF ones. And if one does specify different rates, it might end up breaking some of the fairness properties described in the DRF paper and open the algorithm up to gaming.

--
Janne Blomqvist

Bjørn-Helge Mevik
2014-07-31 08:27:31 UTC
Just a short note about terminology. I believe "processor equivalents"
(PE) is a widely used term for this. It is at least what Maui and Moab
use, if I recall correctly. The "resource*time" would then be PE
seconds (or hours, or whatever).
--
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
Ryan Cox
2014-07-31 16:09:30 UTC
Thanks. I can certainly call it that. My understanding is that this
would be a slightly different implementation from Moab/Maui, but I don't
know those as well so I could be wrong. Either way, the concept is
similar enough that a more recognizable term might be good.

Does anyone else have thoughts on this? I called it "CPU equivalents"
because the calculation in the code is currently ("total_cpus" * time)
so I stuck with CPUs. Slurm seems to use lots of terms somewhat
interchangeably so I couldn't really decide. I don't really have an
opinion on the name so I'll just accept what others decide.

Ryan
Ryan Cox
2014-07-31 15:44:49 UTC
Janne,

I appreciate the feedback. I agree that it makes the most sense to
specify rates like DRF most of the time. However, there are some use
cases that I'm aware of and others that are probably out there that
would make a DRF imitation difficult or less desirable if it's the only
option.

We happen to have one partition with mixed memory amounts per node,
32 GB and 64 GB. Besides the memory differences (long story), the nodes
are homogeneous and each has 16 cores. I'm not sure I would like the
DRF approach for this particular scenario. In this case we would like
to set the charge rate to be .5/GB, or 1 core == 2 GB RAM. If someone
needs 64 GB per node, they are contending for a more limited resource
and we would be happy to double the charge rate for the 64 GB nodes. If
they need all 64 GB, they would end up being charged for 32
CPU/processor equivalents instead of 16. With DRF that wouldn't be
possible if I understand correctly.

One other feature that could be interesting is to have a "baseline"
standard for a CPU charge on a per-partition basis. Let's say that you
have three partitions: old_hardware, new_hardware, and
super_cooled_overclocked_awesomeness. You could set the per CPU charges
to be 0.8, 1.0, and 20.0. That would reflect that a cpu-hour on one
partition doesn't result in the same amount of computation as in another
partition. You could accomplish the same thing automatically by using a
QOS (and maybe some other parameter I'm not aware of) and maybe a job
submit plugin but this would make it easier. I don't know that we would
do this in our setup but it would be possible.
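
Expressed with the hypothetical per-partition parameters from earlier
in the thread, that could look like:

PartitionName=old_hardware ChargePerCPU=0.8
PartitionName=new_hardware ChargePerCPU=1.0
PartitionName=super_cooled_overclocked_awesomeness ChargePerCPU=20.0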

It would be possible to add a config parameter that is something like
Mem=DRF that would auto-configure it to match. The one question I have
about that approach is what to do about partitions with non-homogeneous
nodes. Does it make sense to sum the total cores and memory, etc., or
should it default to a charge rate that is the min() of the node
configurations? Of course, partitions with mixed node types could be
difficult to support no matter what method is used for picking charge rates.

So yes, having a DRF-like auto-configuration could be nice and we might
even use it for most of our partitions. I don't think I'll attempt it
for the initial implementation but we'll see.

Thanks,
Ryan
--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University
Ryan Cox
2014-07-31 16:19:30 UTC
All,

There has been more conversation on
http://bugs.schedmd.com/show_bug.cgi?id=858. It might be good to post
future comments there so we have just one central location for
everything. No worries if you'd rather reply on the list.

Once a solution is ready I'll post something to the list so everyone is
aware.

Ryan
Ulf Markwardt
2014-08-20 09:28:36 UTC
Hi all,
this is a very interesting approach.
I hope we find a chance to discuss it in Lugano.
Ulf
--
___________________________________________________________________
Dr. Ulf Markwardt

Technische Universität Dresden
Center for Information Services and High Performance Computing (ZIH)
01062 Dresden, Germany

Phone: (+49) 351/463-33640 WWW: http://www.tu-dresden.de/zih
Blomqvist Janne
2014-07-27 08:08:33 UTC
Hi,

As a variation on the second option you propose, take a look at the concept of Dominant Resource Fairness [1], which is an algorithm for achieving multi-resource (e.g. CPUs, memory, disk/network bandwidth, ...) fairness. By using "dominant share"-secs instead of cpu-secs in the current accounting code, the changes would similarly be limited in scope.

[1] http://static.usenix.org/event/nsdi11/tech/full_papers/Ghodsi.pdf
https://www.usenix.org/legacy/events/nsdi11/tech/slides/ghodsi.pdf

--
Janne Blomqvist
