Hi,
if I understand it correctly, this is actually very close to Dominant Resource Fairness (DRF), which I mentioned previously, with the difference that in DRF the charge rates are determined automatically from the available resources (in a partition) rather than being specified explicitly by the administrator. For example, say we have a partition with 100 cores and 400 GB of memory. For a job requesting (10 CPUs, 20 GB), the domination calculation proceeds as follows:
1) Calculate the "domination vector" by dividing each element of the request vector (here, CPU & MEM) by the corresponding available resource. That is, (10/100, 20/400) = (0.1, 0.05).
2) The MAX element of the domination vector is chosen (it "dominates" the others, hence the name of the algorithm) as the one to use in fairshare calculations, accounting, etc. In this case, that is the CPU element (0.1).
For another job request, (1 CPU, 20 GB), the domination vector is (0.01, 0.05) and the MAX element is the memory element (0.05), so in this case the memory part of the request dominates.
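In code form, the dominant-share step is just a max over the per-resource fractions. Here is a minimal sketch using the example numbers above; nothing in it is taken from Slurm itself:

#include <stdio.h>

int main(void)
{
    /* Example partition: 100 cores, 400 GB memory. */
    const double capacity[2] = { 100.0, 400.0 };
    /* Example request: (10 CPUs, 20 GB). */
    const double request[2]  = { 10.0, 20.0 };

    double dominant = 0.0;
    for (int i = 0; i < 2; i++) {
        double share = request[i] / capacity[i]; /* domination vector element */
        if (share > dominant)
            dominant = share;
    }
    /* Prints 0.10: the CPU share dominates for this request. */
    printf("dominant share = %.2f\n", dominant);
    return 0;
}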
In your patch you have used "cpu-sec equivalents" rather than "dominant share secs", but that's just a difference of a scaling factor. From a backwards compatibility and user education point of view cpu-sec equivalents seem like a better choice to me, actually.
So while your patch is more flexible than DRF in that it allows arbitrary charge rates to be specified, I'm not sure it makes sense to specify rates different from the DRF ones. If one does specify different rates, it might end up breaking some of the fairness properties described in the DRF paper and open the algorithm up to gaming.
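To make that concrete (using the ChargePerCPU/ChargePerGB names from your message below): for the example partition above (100 cores, 400 GB), the DRF-implied rates in cpu-sec equivalents would be ChargePerCPU = 1.0 and ChargePerGB = 100/400 = 0.25, i.e. 4 GB of memory (one core's share of the partition's memory) counts the same as one CPU. Setting, say, ChargePerGB = 0.5 in such a partition would charge memory-dominated jobs twice what DRF would, and correspondingly shifts which requests users are incentivized to make.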
--
Janne Blomqvist
________________________________________
From: Ryan Cox [ryan_cox-8Bzd4dk9+***@public.gmane.org]
Sent: Tuesday, July 29, 2014 18:47
To: slurm-dev
Subject: [slurm-dev] RE: fairshare - memory resource allocation
I'm interested in hearing opinions on this, if any. Basically, I think
there is an easy solution to the problem of a user using few CPUs but a
lot of memory, with that usage not being reflected well in the
CPU-centric usage stats.
Below is my proposal. There are likely some other good approaches out
there too (Don and Janne presented some) so feel free to tell me that
you don't like this idea :)
Short version
I propose that the Raw Usage be modified to *optionally* be ("CPU
equivalents" * time) instead of just (CPUs * time). The "CPU
equivalent" would be the MAX() over the job's resources (CPUs, memory,
nodes, GPUs, energy over that time period, and so on), each multiplied
by a corresponding charge rate that an admin can configure on a
per-partition basis.
I wrote a simple proof of concept patch to demonstrate this (see "Proof
of Concept" below for details).
Longer version
The CPU equivalent would be used in place of total_cpus for calculating
usage_raw. I propose that the default charge rate be 1.0 for each CPU
in a job and 0.0 for everything else. This matches the current
behavior, so nothing changes if you choose not to define a different
charge rate. The reason I think this should be done on a per-partition
basis is that different partitions may have nodes with different
memory/core ratios: one partition may have 2 GB/core nodes and another
8 GB/core nodes, and you may want to charge differently on each.
If you define the charge rate for each CPU to be 1.0 and the charge rate
per GB of memory to be 0.5, that is saying that 2 GB of memory will be
equivalent to the charge rate for 1 CPU. 4 GB of memory would be
equivalent to 2 CPUs (4 GB * 0.5/GB). Since it is a MAX() of all the
available (resource * charge_rate) combinations, the largest value is
chosen. If a user uses 1 CPU and 1 TB of RAM out of a 1 TB node, the
user gets charged for using all the RAM. If a user uses 16 CPUs and 1
MB, the user gets charged for 16 CPUs.
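A minimal sketch of what that per-job calculation would look like, using the example rates above; the function and variable names here are just for illustration, not what is in the patch:

#include <stdio.h>

/* Illustration only: names and rates are the examples from the text,
 * not the actual patch. */
static double cpu_equivalents(double cpus, double mem_gb,
                              double charge_per_cpu, double charge_per_gb)
{
    double by_cpu = cpus * charge_per_cpu;
    double by_mem = mem_gb * charge_per_gb;
    /* MAX() over all the (resource * charge_rate) combinations. */
    return (by_cpu > by_mem) ? by_cpu : by_mem;
}

int main(void)
{
    /* 1 CPU + 4 GB at ChargePerCPU=1.0, ChargePerGB=0.5 -> 2.0,
     * i.e. charged as if 2 CPUs were allocated. */
    printf("%.1f\n", cpu_equivalents(1, 4, 1.0, 0.5));
    /* 16 CPUs + 1 MB (~0.001 GB) -> 16.0, the CPU term dominates. */
    printf("%.1f\n", cpu_equivalents(16, 0.001, 1.0, 0.5));
    return 0;
}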
Downsides
The problem that is not completely solved is if a user uses 1 CPU but
3/4 of the memory on a node. Then they only get billed for 3/4 of the
node but might make it unusable for others who need a whole or half
node. I'm not sure of a great way to solve that besides modifying the
request in a job submit plugin or requiring exclusive node access.
One other complication is for resources that include a counter rather
than a static allocation value, such as network bandwidth or energy.
This is a problem because the current approach is to immediately begin
decaying the cputime (aka usage) as it accumulates. This means you
would have to keep a delta value for each resource with a counter,
meaning you track that 5 GB have been transmitted since the last decay
thread iteration and then only add that 5 GB. This could get messy when
comparing MAX(total_cpus * charge_per_cpu, total_bw * charge_bw_per_gb)
each iteration since the bandwidth may never reach a high enough value
to matter between iterations but might when considered as an entire job.
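Roughly, the bookkeeping for a counter-style resource would have to look something like the following sketch; the struct and function names are hypothetical and do not come from the Slurm structs:

/* Hypothetical bookkeeping for a counter-style resource such as bytes
 * transmitted; none of these names come from Slurm itself. */
struct counter_usage {
    double last_total;  /* counter value seen at the previous decay iteration */
};

/* Return only what accumulated since the last iteration, so the decay
 * thread charges deltas rather than re-charging the running total. */
double counter_delta(struct counter_usage *u, double current_total)
{
    double delta = current_total - u->last_total;
    u->last_total = current_total;
    return (delta > 0.0) ? delta : 0.0;
}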
I don't think this proposal would be too bad for something like energy.
You could define a charge rate per joule (or kilojoule or whatever)
derived from the node's minimum power and its core count. Then you
look at the delta over that time period. If they were allocated all cores
and used minimum power, they get charged 1.0 * core count. If they were
allocated all cores and used maximum power, they effectively get charged,
on top of that, for the difference between the node's max and min energy
times the energy charge rate. This calculation, as with others, would
occur once per decay thread iteration.
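To put made-up numbers on that: take a 16-core node with a minimum power draw of 160 W, and pick the per-joule rate so that 160 W across all cores is worth 16 CPU equivalents, i.e. 0.1 CPU-equivalent-seconds per joule. Over a 300-second decay iteration at minimum power the node uses 48,000 J, which charges 4,800 CPU-equivalent-seconds, exactly what 16 allocated CPUs would have been charged for 300 seconds. If the node instead draws 300 W, the energy term is 90,000 J * 0.1 = 9,000, so the MAX() picks energy and the user is charged 4,200 more than the plain CPU charge, which is the 140 W difference times 300 seconds times the rate.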
User Education
The reason I like this approach is that it is incredibly simple to
implement and I don't think it takes much effort to explain to users.
It would be easy to add other resources you want to charge for (it would
require a code addition, though it would be pretty simple if the data is
available in the right structs). It doesn't require any RPC changes.
sshare, etc. only need man page clarifications to say that the usage data
is "CPU equivalents". No new fields are required.
As for user education, you just need to explain the concept of "CPU
equivalents", something that can be easily done in the documentation.
The slurm.conf partition lines would be relatively easy to read too. If
you don't need to change the behavior, no slurm.conf changes or
explanations to users are required.
Proof of Concept
I did a really quick proof of concept (attached) based on the master
branch. It is very simple to charge for most things as long as the data
is there in the existing structs. One caveat for the test patch is that
I didn't see a float handler in the config parser so I skipped over that
for the test. Instead, each config parameter in slurm.conf should be
set to (desired_value * 1000) for now. Proper float handling can be
added if this is the route people want to take. The patch currently
implements charging for CPUs, memory (GB), and nodes.
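For example, with parameter names along the lines of those in my earlier message below (the test patch's actual names may differ), a partition meant to charge 1.0 per CPU and 0.5 per GB would currently be written roughly as

PartitionName=p1 ChargePerCPU=1000 ChargePerGB=500

until proper float handling is in place.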
Note: I saw a similar idea in a bug report from the University of
Chicago: http://bugs.schedmd.com/show_bug.cgi?id=858.
Ryan
Post by Ryan Cox
Bill and Don,
We have wondered about this ourselves. I just came up with this idea
and haven't thought it through completely, but option two seems like
the easiest. For example, you could modify lines like
https://github.com/SchedMD/slurm/blob/8a1e1384bacf690aed4c1f384da77a0cd978a63f/src/plugins/priority/multifactor/priority_multifactor.c#L952
to have a MAX() of a few different types.
I seem to recall seeing this on the list or in a bug report somewhere
already, but you could have different charge rates for memory or GPUs
compared to a CPU, maybe on a per-partition basis. You could give each
partition a line like:
PartitionName=p1 ChargePerCPU=1.0 ChargePerGB=0.5 ChargePerGPU=2.0
...
So the line I referenced would become something like the following:
real_decay = run_decay * MAX(CPUs*ChargePerCPU,
TotalJobMemory*ChargePerGB, GPUs*ChargePerGPU);
In this case, each CPU is 1.0 but each GB of RAM is 0.5. Assuming no
GPUs used, if the user requests 1 CPU and 2 GB of RAM the resulting
usage is 1.0. But if they use 4 GB of RAM and 1 CPU, it is 2.0 just
like they had been using 2 CPUs. Essentially you define every 2 GB of
RAM to be equal to 1 CPU, so raw_usage could be redefined to deal with
"cpu equivalents".
It might be harder to explain to users but I don't think it would be
too bad.
Ryan
Post by Lipari, Don
Bill,
As I understand the dilemma you presented, you want to maximize the
utilization of node resources when running with Slurm configured for
SelectType=select/cons_res. To do this, you would like to nudge
users into requesting only the amount of memory they will need for
their jobs. The nudge would be in the form of decreased fair-share
priority for users' jobs that request only one core but lots of memory.
I don't know of a way for Slurm to do this as it exists. I can only
offer alternatives that have their pros and cons.
One alternative would be to add memory usage support to the
multifactor priority plugin. This would be a substantial undertaking
as it touches code not just in multifactor/priority_multifactor.c but
also in structures that are defined in common/assoc_mgr.h as well as
sshare itself.
A second less invasive option would be to redefine the
multifactor/priority_multifactor.c's raw_usage to make it a
configurable blend of cpu and memory usage. These changes could be
more localized to the multifactor/priority_multifactor.c module.
However you would have a harder time justifying a user's sshare
report because the usage numbers would no longer track jobs'
historical cpu usage. Your response to a user who asked you to
justify their sshare usage report would be, "trust me, it's right".
A third alternative (as I'm sure you know) is to give up on perfectly
packed nodes and make every 4G of memory requested cost 1 cpu of
allocation.
Perhaps there are other options, but those are the ones that
immediately come to mind.
Don Lipari
-----Original Message-----
Sent: Friday, July 25, 2014 6:14 AM
To: slurm-dev
Subject: [slurm-dev] fairshare - memory resource allocation
I'd like to revisit this...
After struggling with memory allocations in some flavor of PBS for over
20 years, it was certainly a wonderful thing to have cgroup support
right out of the box with Slurm. No longer do we have a shared node's
jobs eating all the memory and killing everything running there.
But we have found that there is a cost to this, namely a failure to
adequately feed this information back to the fairshare mechanism.
In looking at running jobs over the past 4 months, we found that we had
room to reduce the DefMemPerCPU allocation in slurm.conf to a value
about 1 GB less than the actual GB/core available. This meant that we had
to notify the users close to this max value so that they could adjust
their scripts. We also told users that if this value was higher than they
actually need, they'd do best to reduce the limit to exactly what they require.
This has proven much less successful.
So our default is 3 GB/core, with an actual node having 4 GB/core available.
This allows some bigger-memory jobs and some smaller-memory jobs to
share a node when there are cores available but not enough memory
to cover the default for every core.
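Concretely, the relevant slurm.conf pieces look roughly like this (names and numbers are illustrative rather than our exact config):

# default of ~3 GB per core on nodes that actually have ~4 GB per core
DefMemPerCPU=3072
NodeName=node[001-100] CPUs=12 RealMemory=49152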
Now that is good. It allows higher utilization of nodes while
protecting each job's memory from the others. But the fairshare problem
shows up pretty quickly when there are jobs requiring, say, half the
node's memory. These are mostly serial jobs requesting a single core,
which leaves about 11 cores with only about 2 GB/core remaining.
Worse, when it comes to fairshare calculations it appears that these
jobs are only using a single core when in fact they are using half a
node. You can see where this is causing issues.
Fairshare has a number of other issues as well, which I will send under
a different email.
Now maybe this is just a matter of constantly monitoring user jobs and
proactively going after those users with small memory-per-core
requirements. We have attempted this in the past and have found that
the first job they run that crashes due to insufficient memory results
in all their scripts being increased, and so the process is never-ending.
Another solution is to simply trust the users and just keep reminding
them about allocations. They are usually a smart bunch and are quite
creative when it comes to getting jobs to run! So maybe I am concerned
over nothing at all and things will just work out.
Bill
--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University