Jeff Tan
2014-09-09 07:27:32 UTC
Hello, folks!
Although this topic was considered resolved last year, and although I
tried what was suggested in those posts, we still see the "more time than
is possible" complaints in slurmdbd.log on Slurm 2.6.5.
Following Don's suggestion, I fixed the errors in job records in
<cluster>_job_table, i.e., jobs where time_start was 0 even though the job
actually did start running and ended one way or another. This worked for
two of our x86 clusters, or so it seems: one was only resolved a few weeks
back, but the other hasn't produced such "more time than is possible"
complaints since July.
Our Blue Gene/Q is another matter: the rollup still makes this complaint
sporadically, as recently as yesterday, in fact. Are there other Slurm
users out there with a Blue Gene who see these slurmdbd complaints? I was
wondering whether it has to do with reservations and/or node failures. The
problem with overlapping reservations is mentioned in a 2012 post here as
well as in the source code. Looking at the source in as_mysql_rollup.c, it
occurred to me that perhaps outages throw off the CPU count, which in turn
affects c_usage->d_cpu. I have logs where the reported d_cpu matches the
total number of CPU-seconds for the hour during the hourly rollup, but
sometimes the number is higher and sometimes lower.
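As a rough sanity check on that hourly figure, one could sum the allocated CPU-seconds charged to jobs within a single hour window straight from the job table and compare the result against cpu_count * 3600. This is only a sketch: the table name is a stand-in for your site's <cluster>_job_table, and the column names (time_start, time_end, alloc_cpus) are assumptions based on the 2.6-era accounting schema, so adjust as needed.

```sql
-- Hypothetical sanity check: CPU-seconds charged to jobs overlapping one
-- hour window, to compare against the theoretical cap of cpu_count * 3600.
-- Table/column names assumed from the Slurm 2.6-era schema; adjust for
-- your site.
SET @hour_start = UNIX_TIMESTAMP('2014-09-08 00:00:00');
SET @hour_end   = @hour_start + 3600;

SELECT SUM(alloc_cpus *
           (LEAST(time_end, @hour_end) -
            GREATEST(time_start, @hour_start))) AS cpu_seconds
FROM mycluster_job_table            -- stand-in for <cluster>_job_table
WHERE time_start < @hour_end
  AND time_end   > @hour_start
  AND time_start > 0;               -- skip the broken time_start=0 records
```

If this sum already exceeds cpu_count * 3600 for an hour where the rollup complained, the bad records are in the job table itself rather than in the rollup's reservation/down-node accounting.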
Has anyone else noticed these complaints? I've already resolved (1) jobs
that ran and ended but whose time_start was 0, and (2) jobs that were still
marked running despite having long since terminated. I'm not sure what else
we're missing. Any suggestions would be appreciated.
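For reference, the two classes of bad records above could be located with queries along these lines. Again a sketch only: the table name stands in for <cluster>_job_table, the column names are assumptions from the 2.6-era schema, and state = 1 assumes Slurm's JOB_RUNNING enum value.

```sql
-- (1) Jobs that ended but never recorded a start time.
SELECT id_job, time_submit, time_start, time_end
FROM mycluster_job_table            -- stand-in for <cluster>_job_table
WHERE time_start = 0
  AND time_end   > 0;

-- (2) Jobs still marked running (state = 1, assumed JOB_RUNNING) with no
--     end time, long after they should have finished; the one-week cutoff
--     here is arbitrary.
SELECT id_job, time_start, state
FROM mycluster_job_table
WHERE state    = 1
  AND time_end = 0
  AND time_start < UNIX_TIMESTAMP() - 7 * 86400;
```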
Regards
Jeff
--
Jeff Tan
High Performance Computing Specialist
IBM Research Collaboratory for Life Sciences, Melbourne, Australia