Discussion:
fairshare
Bill Wichser
2014-07-14 20:14:35 UTC
Is there any way to get a better view of fairshare than the "sshare"
command?

Under PBS, there was the diagnose -f command, which showed the breakdown
of this value for each of the fixed time periods used to calculate it.
What was nice about this was that I could point a group to this command,
or cut and paste its output, to show that they had been using 20% over
the last 30 days even though they hadn't run anything in the last three
days.

It's a much more difficult problem now when I'm asked. I have no tool
that shows the value, and its decay, over time. So I'm wondering if
anyone has a method to demonstrate that, yes, this fairshare value is
correct, and here is why. Or do I just need to figure out a database
query to cull this information?
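For the database route, one rough way to approximate what diagnose -f showed is to split per-job CPU-seconds into fixed windows, report an account's share in each window, and then weight older windows by a half-life. This is a sketch under stated assumptions, not an existing tool: the JobRecord layout (account, start, end, CPUs) is a hypothetical shape for rows pulled from the accounting data, and the per-window weighting is a coarse approximation that will not match Slurm's incrementally decayed value exactly.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    # Hypothetical record layout: one row per job, e.g. assembled from
    # sacct output or the accounting database. Times are epoch seconds.
    account: str
    start: float
    end: float
    cpus: int


def overlap(a0, a1, b0, b1):
    """Length of the intersection of [a0, a1) and [b0, b1), or 0."""
    return max(0.0, min(a1, b1) - max(a0, b0))


def window_breakdown(jobs, account, now, window, n_windows):
    """Return (window_index, account_cpu_seconds, all_cpu_seconds) rows.

    Window 0 is [now - window, now), window 1 the one before it, and so on.
    A job spanning several windows is split across them by overlap.
    """
    rows = []
    for i in range(n_windows):
        w_end = now - i * window
        w_start = w_end - window
        mine = total = 0.0
        for job in jobs:
            used = job.cpus * overlap(job.start, job.end, w_start, w_end)
            total += used
            if job.account == account:
                mine += used
        rows.append((i, mine, total))
    return rows


def decayed_share(rows, window, half_life):
    """Collapse per-window rows into a single decayed usage share.

    Each window is weighted by 0.5 ** (age / half_life), with age measured
    to the window's end. Assumption: this is only a coarse stand-in for a
    half-life decay such as PriorityDecayHalfLife, not Slurm's arithmetic.
    """
    mine = total = 0.0
    for i, used, all_used in rows:
        weight = 0.5 ** (i * window / half_life)
        mine += used * weight
        total += all_used * weight
    return mine / total if total > 0 else 0.0


if __name__ == "__main__":
    DAY = 86400
    jobs = [
        JobRecord("physics", 0, DAY, 4),     # ran only in the oldest day
        JobRecord("chem", DAY, 3 * DAY, 2),  # ran in the two most recent days
    ]
    rows = window_breakdown(jobs, "physics", now=3 * DAY, window=DAY, n_windows=3)
    for i, mine, total in rows:
        share = mine / total if total else 0.0
        print(f"window {i}: {mine:.0f} of {total:.0f} CPU-s ({share:.0%})")
    print(f"decayed share: {decayed_share(rows, DAY, half_life=DAY):.0%}")
```

With a one-day half-life, physics shows 0% in the two most recent windows and 100% in the oldest one, which collapses to a decayed share of 25%: the "you haven't run anything lately, but your usage is still high" case.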

Thanks,
Bill
Ryan Cox
2014-07-15 14:52:32 UTC
Bill,

I may be wrong (corrections welcome), but I'm pretty sure you'll have
to use a database query. My understanding is that the decayed usage is
stored as a single usage_raw value per association
(https://github.com/SchedMD/slurm/blob/f8025c1484838ecbe3e690fa565452d990123361/src/plugins/priority/multifactor/priority_multifactor.c#L1119).
There is no history of any kind.

You would have to write a fairly complex query to get an accurate
representation, or write some code to recreate the way Slurm does it. If
you look at _apply_decay() and _apply_new_usage() in
src/plugins/priority/multifactor/priority_multifactor.c, you can see
everything that happens. Basically, once per decay thread iteration,
each association's usage_raw and each job's cputime for that time period
are calculated and decayed accordingly. This can happen many, many times
over the length of a job. If a job terminates before reaching its
timelimit, the remaining allocated cputime is added all at once
(https://github.com/SchedMD/slurm/blob/f8025c1484838ecbe3e690fa565452d990123361/src/plugins/priority/multifactor/priority_multifactor.c#L1036).

Those are some of the issues that you may run into while creating a
database tool for this.
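To make the decay loop described above concrete, here is a small replay of it. This is not Slurm's code: it assumes fixed-length iterations, an exact half-life factor of 0.5 ** (step / half_life), and that the old total is decayed before the step's new usage is added. The Association and Job classes and the simulate() function are hypothetical names for illustration, and the early-termination case is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Association:
    # Only one running total is kept per association, mirroring the single
    # usage_raw value; each step overwrites it, so no history survives.
    name: str
    usage_raw: float = 0.0


@dataclass
class Job:
    # Hypothetical job record: owning association, run interval
    # (epoch seconds), and allocated CPUs.
    assoc: str
    start: float
    end: float
    cpus: int


def simulate(assocs, jobs, t0, t1, interval, half_life):
    """Replay fixed-length "decay thread iterations" from t0 to t1.

    Each step first decays every usage_raw by 0.5 ** (step / half_life),
    then adds the CPU-seconds each job consumed during that step.
    Assumption: this ordering and exact exponential factor simplify
    whatever Slurm actually does; details may differ.
    """
    by_name = {a.name: a for a in assocs}
    t = t0
    while t < t1:
        step_end = min(t + interval, t1)
        factor = 0.5 ** ((step_end - t) / half_life)
        # Decay the existing totals for the elapsed step.
        for a in by_name.values():
            a.usage_raw *= factor
        # Add only the part of each job that overlaps this step.
        for job in jobs:
            ran = max(0.0, min(job.end, step_end) - max(job.start, t))
            by_name[job.assoc].usage_raw += job.cpus * ran
        t = step_end
    return by_name


if __name__ == "__main__":
    assocs = [Association("physics"), Association("chem")]
    jobs = [Job("physics", 0, 100, 1), Job("chem", 200, 300, 1)]
    result = simulate(assocs, jobs, t0=0, t1=300, interval=100, half_life=100)
    for name, a in result.items():
        print(name, a.usage_raw)
```

With a 100-second interval and a 100-second half-life, physics (which stopped at t=100) ends at 25.0 while chem (which ran during the last step) ends at 100.0. Because the only state is the current usage_raw, reproducing a past value means replaying every relevant job record from some starting point, which is the work a database tool would have to take on.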

I could be mistaken on some of the details, but that is my understanding
of the code (we looked at it recently for an unrelated reason).

Ryan
--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University