Discussion:
slurmdbd: more time than is possible
Jeff Tan
2014-09-09 07:27:32 UTC
Hello, folks!

Although this topic was considered resolved last year, and although I
tried what was suggested in those posts, we still see this message in
slurmdbd.log on Slurm 2.6.5.

I followed Don's suggestion to resolve errors in job records in
<cluster>_job_table, i.e., records where time_start was 0 although the job
actually did start running and ended one way or another. This seems to have
worked for two of our x86 clusters: one was resolved only a few weeks back,
and the other has not produced the "more time than is possible" complaint
since July.

Our Blue Gene/Q is another matter, where the rollup still makes this
complaint sporadically, up to yesterday, in fact. Are there other Slurm
users out there with a Blue Gene who see these slurmdbd complaints? I was
wondering if it had to do with reservations and/or node failures. The
problem with overlapping reservations is mentioned in a 2012 post here as
well as in the source code. Looking at the source code in
as_mysql_rollup.c, it occurred to me that perhaps outages mess up the
CPU count, which affects c_usage->d_cpu. I have logs where the reported
d_cpu matches the total number of CPU-seconds for the hour during the hourly
rollup, but sometimes the number is higher and sometimes lower.

Has anyone ever noticed these complaints? I've already resolved (1) jobs
that ran and ended but time_start was 0, and (2) jobs that were marked
running despite having long been terminated. I'm not sure what else we are
missing. Any suggestions would be appreciated.
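For what it's worth, the two cleanup checks above can be sketched as a filter
over rows fetched from <cluster>_job_table. Column names (state, time_start,
time_end) follow what I described; the numeric "running" state code (1) and
the 30-day staleness threshold are my own assumptions for illustration only:

```python
# Sketch of the two job-table checks: (1) jobs that ran and ended but whose
# time_start was lost, and (2) jobs still marked running despite having long
# been terminated. State code 1 for "running" and the 30-day cutoff are
# illustrative assumptions, not taken from the Slurm source.

def find_suspect_jobs(rows, now):
    """rows: dicts with id_job, state, time_start, time_end (epoch seconds)."""
    # (1) ended one way or another, but start time is 0
    no_start = [r for r in rows
                if r["time_end"] > 0 and r["time_start"] == 0]
    # (2) still "running" with no end time, long past any plausible runtime
    stuck = [r for r in rows
             if r["state"] == 1 and r["time_end"] == 0
             and now - r["time_start"] > 30 * 86400]
    return no_start, stuck
```

In our case the offending rows were then patched by hand in the job table.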

Regards
Jeff

--
Jeff Tan

High Performance Computing Specialist
IBM Research Collaboratory for Life Sciences, Melbourne, Australia
Jeff Tan
2014-09-10 01:54:35 UTC
Post by Jeff Tan
I have logs where the reported
d_cpu matches the total number of CPU-seconds for an hour during the hourly
rollup, but sometimes the number is higher and sometimes lower.
No, d_cpu did not quite match the total number of CPU-seconds on our
resource after all. What it matched was total_time, which is correct and
does not change even when nodes are down. There were times when d_cpu did
equal total_time, however, e.g.,

(921600+235929600+0)(236851200) > 235929600

where 235929600 is the correct number of CPU-seconds for our system in an
hour.
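For concreteness, 235929600 CPU-seconds over a 3600-second window works out
to 65536 CPUs. A minimal sketch of the comparison as I read it from the log
format (variable names are mine, not Slurm's):

```python
# Assumed reading of the rollup message "(x+y+z)(sum) > limit": the
# CPU-seconds accounted for in one hour should not exceed cpu_count * 3600.
# Names here are illustrative, not taken from as_mysql_rollup.c.

HOUR = 3600                    # one hourly rollup window, in seconds
CPU_COUNT = 65536              # 235929600 / 3600 for this system
TOTAL_TIME = CPU_COUNT * HOUR  # the correct CPU-seconds per hour

def more_time_than_possible(x, y, z):
    """Mirrors the logged comparison: (x + y + z) > total_time."""
    return (x + y + z) > TOTAL_TIME
```

Plugging in the example line above, (921600 + 235929600 + 0) = 236851200,
which indeed exceeds 235929600.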

After I resolved the problematic job records, the complaints seemed to go
away, but they came back a few days later:

[2014-09-08T13:00:02.350] (223572256+14745600+0)(238317856) > 235929600
[2014-09-08T15:00:02.191] (229063568+14745600+0)(243809168) > 235929600
[2014-09-08T18:00:02.389] (221913328+14745600+0)(236658928) > 235929600
[2014-09-08T20:00:02.007] (227854160+14745600+0)(242599760) > 235929600
[2014-09-09T15:00:03.017] (225833760+14745600+0)(240579360) > 235929600
[2014-09-09T16:00:02.270] (225181472+14745600+0)(239927072) > 235929600
[2014-09-09T17:00:02.660] (222904176+14745600+0)(237649776) > 235929600
[2014-09-09T18:00:02.057] (231175072+14745600+0)(245920672) > 235929600
[2014-09-09T19:00:02.440] (225419680+14745600+0)(240165280) > 235929600
[2014-09-09T21:00:01.985] (222188384+14745600+0)(236933984) > 235929600
[2014-09-09T22:00:02.480] (222496064+14745600+0)(237241664) > 235929600
[2014-09-09T23:00:02.827] (225304240+14745600+0)(240049840) > 235929600
[2014-09-10T00:00:02.130] (230062480+14745600+0)(244808080) > 235929600
[2014-09-10T01:00:02.737] (231246000+14745600+0)(245991600) > 235929600
[2014-09-10T02:00:02.442] (226591488+14745600+0)(241337088) > 235929600
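
Parsing a couple of those lines (a sketch, based on my reading of the
message format as (x+y+z)(sum) > total_time) shows the parenthesised sum is
internally consistent and the overshoot varies from hour to hour, while the
middle term stays fixed at 14745600 CPU-seconds, i.e. 4096 CPUs for the
whole hour:

```python
import re

# Two of the log lines above. Interpreting the middle term as the quantity
# that feeds c_usage->d_cpu is my reading, not something I have confirmed.
LOG_LINES = [
    "[2014-09-08T13:00:02.350] (223572256+14745600+0)(238317856) > 235929600",
    "[2014-09-10T02:00:02.442] (226591488+14745600+0)(241337088) > 235929600",
]

PAT = re.compile(r"\((\d+)\+(\d+)\+(\d+)\)\((\d+)\) > (\d+)")

def overshoot(line):
    """CPU-seconds by which the accounted total exceeds what is possible."""
    x, y, z, total, limit = map(int, PAT.search(line).groups())
    assert x + y + z == total  # the parenthesised sum is self-consistent
    return total - limit

excesses = [overshoot(line) for line in LOG_LINES]
```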

All jobs with state>1 have time_start values, and none have time_end of 0,
so I don't know what else could be amiss in the job table.

Regards
Jeff
