I ended up stopping the slurmctld daemons and rewriting the assoc_usage checkpoint file with "corrected" usage values based on what sacct reports as having run. After I restarted slurmctld, it appears to have picked up the updated usage.
Probably not a great general solution, since the only documentation I found for the file format was the source code, and the format is likely to change without notice. It did get things back in sync though.
I'll have to watch more closely to see if I can tell where things begin to go wrong.
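
One way to watch for drift is to compare the two reports periodically. The following is a minimal sketch, not a tested tool: it assumes pipe-delimited text with a header row, such as what `sacct -a -X -P -o Account,User,CPUTimeRAW` and `sshare -a -P` might produce, and it assumes RawUsage is in CPU-seconds and directly comparable to summed CPUTimeRAW (no decay, no reset). The function names and the sample data are made up for illustration.

```python
from collections import defaultdict


def parse_parsable(text):
    """Parse pipe-delimited output with a header row into a list of dicts."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = [h.strip() for h in lines[0].split("|")]
    rows = []
    for ln in lines[1:]:
        fields = [f.strip() for f in ln.split("|")]
        rows.append(dict(zip(header, fields)))
    return rows


def sacct_totals(text):
    """Sum CPUTimeRAW per (account, user) over all job records."""
    totals = defaultdict(int)
    for row in parse_parsable(text):
        totals[(row["Account"], row["User"])] += int(row["CPUTimeRAW"])
    return dict(totals)


def sshare_usage(text):
    """Collect RawUsage per (account, user), skipping account-level rows."""
    usage = {}
    for row in parse_parsable(text):
        if row["User"]:  # account-level rows have an empty User field
            usage[(row["Account"], row["User"])] = int(row["RawUsage"])
    return usage


def discrepancies(sacct_text, sshare_text):
    """Return {(account, user): sacct_total - sshare_usage} where they differ."""
    acct = sacct_totals(sacct_text)
    share = sshare_usage(sshare_text)
    out = {}
    for key in sorted(set(acct) | set(share)):
        diff = acct.get(key, 0) - share.get(key, 0)
        if diff != 0:
            out[key] = diff
    return out


# Hypothetical sample data, not taken from a real cluster.
SACCT_SAMPLE = """Account|User|CPUTimeRAW
proj_a|alice|3600
proj_a|alice|1800
proj_a|bob|7200
"""

SSHARE_SAMPLE = """Account|User|RawShares|NormShares|RawUsage|EffectvUsage|FairShare
proj_a||1|1.000000|9000|1.000000|0.500000
proj_a|alice|1|0.500000|5400|0.600000|0.400000
proj_a|bob|1|0.500000|3600|0.400000|0.600000
"""

if __name__ == "__main__":
    print(discrepancies(SACCT_SAMPLE, SSHARE_SAMPLE))
```

Logging the result at regular intervals would show which associations diverge and roughly when the divergence starts.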
-----
Gary Skouson
-----Original Message-----
From: Lipari, Don [mailto:***@llnl.gov]
Sent: Monday, March 17, 2014 12:24 PM
To: slurm-dev
Subject: [slurm-dev] RE: sshare and sacct
-----Original Message-----
Sent: Monday, March 17, 2014 11:48 AM
To: slurm-dev
Subject: [slurm-dev] RE: sshare and sacct
Thanks.
We did start with 0 usage on the accounts I'm looking at and we have the
share set to not decay or reset. For some user/account associations, we
have identical usage between sacct and sshare. For others, sshare shows
significantly less usage than sacct info. I'm not sure what caused the
difference.
I can't think of a reason for the associations that show a discrepancy. Perhaps increasing the debug levels and adding Priority to your DebugFlags would shed light.
I was hoping I could update the sshare info to match what sacct says, but
I could only see how to reset share usage to 0 from the info I could find.
Resetting RawUsage to zero is all that is currently possible. Support for non-zero values was never implemented.
Don
-----
Gary Skouson
-----Original Message-----
Sent: Monday, March 17, 2014 8:47 AM
To: slurm-dev
Subject: [slurm-dev] RE: sshare and sacct
Gary,
The sacct command retrieves job and job step records from the slurmdb and
reports the statistics for the requested job(s).
The sshare command provides the basis for the fair-scheduling component of
the multi-factor plugin. sshare lists the two components (shares and
usage) which are used to calculate the fair share factor for each user and
account. By default, one of the slurm.conf parameters which affect this
calculation (PriorityDecayHalfLife) is set to a 7 day decay. That means
that whatever raw usage appears in the sshare report, it is bound to be
less over time (in the absence of any more running jobs).
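
To make the effect of the half-life concrete, here is a minimal sketch of the arithmetic. It uses an idealized closed form (usage halves every PriorityDecayHalfLife); the scheduler applies decay incrementally, so real numbers may differ slightly. The function name and the starting value are made up for illustration.

```python
def decayed_usage(raw_usage, elapsed_seconds, half_life_seconds):
    """Idealized half-life decay; a half-life of 0 is treated as 'no decay'."""
    if half_life_seconds <= 0:
        return float(raw_usage)
    return raw_usage * 0.5 ** (elapsed_seconds / half_life_seconds)


DAY = 24 * 3600
HALF_LIFE = 7 * DAY  # the 7 day default mentioned above

if __name__ == "__main__":
    # Hypothetical starting usage of 1,000,000 CPU-seconds, no new jobs.
    for days in (0, 7, 14, 28):
        print(days, decayed_usage(1_000_000, days * DAY, HALF_LIFE))
```

With no new jobs, the reported raw usage falls to half after 7 days and to a quarter after 14, while the sacct totals for the same jobs stay unchanged.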
So, it is not a surprise that there would be a discrepancy between the
usages reported by sacct and sshare. If you set PriorityDecayHalfLife to
not decay (zero), and if you started with zero usage, the usage numbers of
sacct and sshare should track until the PriorityUsageResetPeriod limit
was reached. At that point, the raw usage value would be reset to zero.
Don
-----Original Message-----
Sent: Friday, March 14, 2014 3:52 PM
To: slurm-dev
Subject: [slurm-dev] sshare and sacct
We started using sshare to enforce limits on usage, and it seems that
sshare is getting confused about actual usage.
If I use sacct to check the usage for an account, I get different numbers
than sshare reports for the same account.
Is there a way to "fix" sshare to reflect the usage found from sacct?
I can see that I can reset the share usage to 0, but that's the only value
allowed at the moment. Is there some other way to set the rawusage to fix
sshare to reflect reality?