Discussion:
slurmdbd: more time than is possible
Jeff Tan
2014-09-09 07:27:32 UTC
Hello, folks!

Although this topic was considered resolved last year, and although I
tried what was suggested in those posts, we still see this message in
slurmdbd.log on Slurm 2.6.5.

I followed Don's suggestion to resolve errors in job records in
<cluster>_job_table, i.e., records where time_start was 0 although the job
actually did start running and ended one way or another. This seems to have
worked for two of our x86 clusters: one was resolved only a few weeks back,
and the other has not produced the "more time than is possible" complaint
since July.

Our Blue Gene/Q is another matter, where the rollup still makes this
complaint sporadically, up to yesterday, in fact. Are there other Slurm
users out there with a Blue Gene who see these slurmdbd complaints? I was
wondering if it had to do with reservations and/or node failures. The
problem with overlapping reservations is mentioned in a 2012 post here as
well as in the source code. Looking at the source code in
as_mysql_rollup.c, it occurred to me that perhaps outages mess up the
CPU count, which affects c_usage->d_cpu. I have logs where the reported
d_cpu matches the total number of CPU-seconds for the hour during the hourly
rollup, but sometimes the number is higher and sometimes lower.

Has anyone ever noticed these complaints? I've already resolved (1) jobs
that ran and ended but time_start was 0, and (2) jobs that were marked
running despite having long been terminated. I'm not sure what else we are
missing. Any suggestions would be appreciated.
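For what it's worth, the two cleanup checks above can be sketched as a filter
over rows fetched from <cluster>_job_table. Column names (state, time_start,
time_end) follow what I described; the numeric "running" state code (1) and
the 30-day staleness threshold are my own assumptions for illustration only:

```python
# Sketch of the two job-table checks: (1) jobs that ran and ended but whose
# time_start was lost, and (2) jobs still marked running despite having long
# been terminated. State code 1 for "running" and the 30-day cutoff are
# illustrative assumptions, not taken from the Slurm source.

def find_suspect_jobs(rows, now):
    """rows: dicts with id_job, state, time_start, time_end (epoch seconds)."""
    # (1) ended one way or another, but start time is 0
    no_start = [r for r in rows
                if r["time_end"] > 0 and r["time_start"] == 0]
    # (2) still "running" with no end time, long past any plausible runtime
    stuck = [r for r in rows
             if r["state"] == 1 and r["time_end"] == 0
             and now - r["time_start"] > 30 * 86400]
    return no_start, stuck
```

In our case the offending rows were then patched by hand in the job table.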

Regards
Jeff

--
Jeff Tan

High Performance Computing Specialist
IBM Research Collaboratory for Life Sciences, Melbourne, Australia
Jeff Tan
2014-09-10 01:54:35 UTC
Post by Jeff Tan
I have logs where the reported
d_cpu matches the total number of CPU-seconds for an hour during the hourly
rollup, but sometimes the number is higher and sometimes lower.
No, d_cpu did not quite match the total number of CPU-seconds on our
resource after all. What it matched was total_time, which is correct and
does not change even when nodes are down. There were times when d_cpu did
equal total_time, however, e.g.,

(921600+235929600+0)(236851200) > 235929600

where 235929600 is the correct number of CPU-seconds for our system in an
hour.
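For concreteness, 235929600 CPU-seconds over a 3600-second window works out
to 65536 CPUs. A minimal sketch of the comparison as I read it from the log
format (variable names are mine, not Slurm's):

```python
# Assumed reading of the rollup message "(x+y+z)(sum) > limit": the
# CPU-seconds accounted for in one hour should not exceed cpu_count * 3600.
# Names here are illustrative, not taken from as_mysql_rollup.c.

HOUR = 3600                    # one hourly rollup window, in seconds
CPU_COUNT = 65536              # 235929600 / 3600 for this system
TOTAL_TIME = CPU_COUNT * HOUR  # the correct CPU-seconds per hour

def more_time_than_possible(x, y, z):
    """Mirrors the logged comparison: (x + y + z) > total_time."""
    return (x + y + z) > TOTAL_TIME
```

Plugging in the example line above, (921600 + 235929600 + 0) = 236851200,
which indeed exceeds 235929600.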

After I resolved the problematic job records, the complaints seemed to go
away, but they came back a few days later:

[2014-09-08T13:00:02.350] (223572256+14745600+0)(238317856) > 235929600
[2014-09-08T15:00:02.191] (229063568+14745600+0)(243809168) > 235929600
[2014-09-08T18:00:02.389] (221913328+14745600+0)(236658928) > 235929600
[2014-09-08T20:00:02.007] (227854160+14745600+0)(242599760) > 235929600
[2014-09-09T15:00:03.017] (225833760+14745600+0)(240579360) > 235929600
[2014-09-09T16:00:02.270] (225181472+14745600+0)(239927072) > 235929600
[2014-09-09T17:00:02.660] (222904176+14745600+0)(237649776) > 235929600
[2014-09-09T18:00:02.057] (231175072+14745600+0)(245920672) > 235929600
[2014-09-09T19:00:02.440] (225419680+14745600+0)(240165280) > 235929600
[2014-09-09T21:00:01.985] (222188384+14745600+0)(236933984) > 235929600
[2014-09-09T22:00:02.480] (222496064+14745600+0)(237241664) > 235929600
[2014-09-09T23:00:02.827] (225304240+14745600+0)(240049840) > 235929600
[2014-09-10T00:00:02.130] (230062480+14745600+0)(244808080) > 235929600
[2014-09-10T01:00:02.737] (231246000+14745600+0)(245991600) > 235929600
[2014-09-10T02:00:02.442] (226591488+14745600+0)(241337088) > 235929600
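
Parsing a couple of those lines (a sketch, based on my reading of the
message format as (x+y+z)(sum) > total_time) shows the parenthesised sum is
internally consistent and the overshoot varies from hour to hour, while the
middle term stays fixed at 14745600 CPU-seconds, i.e. 4096 CPUs for the
whole hour:

```python
import re

# Two of the log lines above. Interpreting the middle term as the quantity
# that feeds c_usage->d_cpu is my reading, not something I have confirmed.
LOG_LINES = [
    "[2014-09-08T13:00:02.350] (223572256+14745600+0)(238317856) > 235929600",
    "[2014-09-10T02:00:02.442] (226591488+14745600+0)(241337088) > 235929600",
]

PAT = re.compile(r"\((\d+)\+(\d+)\+(\d+)\)\((\d+)\) > (\d+)")

def overshoot(line):
    """CPU-seconds by which the accounted total exceeds what is possible."""
    x, y, z, total, limit = map(int, PAT.search(line).groups())
    assert x + y + z == total  # the parenthesised sum is self-consistent
    return total - limit

excesses = [overshoot(line) for line in LOG_LINES]
```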

All jobs with state>1 have time_start values, and none have time_end of 0,
so I don't know what else could be amiss in the job table.

Regards
Jeff
