Checking on array jobs within slurm accounting DB and via sacct
John Desantis
2014-09-26 17:19:35 UTC
Hello all,

First and foremost since this is my first post to the list, I'd like
to thank the Slurm developers for a great and gratis product!

Anyways, to the point.

We have users submitting array jobs via sbatch and using
"-a/--array=n-n" without an issue. When these jobs are running, we
can use 'squeue' to see tasks under the form of "jobnumber_task".
When we try to query these jobs via the accounting database (checking
on job_table, step_table, and jobcomp_table) and via sacct -j
"jobnumber", we're not getting the complete set of information
associated with the job (batch and exec hosts, etc.). If the job is
currently running, we can use scontrol to see the job and its steps,
and the full set of information we're looking for.
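
As a minimal sketch of the kind of sacct query described above, the command line could be assembled like this in Python. The helper name sacct_argv and the chosen field list are illustrative assumptions, not something taken from this thread; the code only builds the argument list and does not run sacct.

```python
def sacct_argv(job_id, fields=("JobID", "JobName", "NodeList", "State")):
    """Build (but do not execute) an sacct command line for one job.

    The default field list is an assumption chosen for illustration;
    NodeList is the field that reports the nodes a job ran on.
    """
    return [
        "sacct",
        "-j", str(job_id),          # job to look up
        "--parsable2",              # pipe-delimited output, easier to parse
        "--format=" + ",".join(fields),
    ]


# Example: the array job discussed in this thread.
command = sacct_argv(23383)
print(" ".join(command))
```

The resulting list can be handed to subprocess.run on a host where sacct is installed.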

When I used scontrol to view an array job, I saw that "JobId" for each
of the array tasks incremented with each task, e.g.:

JobId=23383 ArrayJobId=23383 ArrayTaskId=1
JobId=23384 ArrayJobId=23383 ArrayTaskId=2
JobId=23385 ArrayJobId=23383 ArrayTaskId=3

When I tried to query any of the successive JobIds via sacct or the
DB itself, I didn't get any information. Only the real JobId "23383"
returned a result within sacct and the DB. I was able to glean node
information from the scheduler and control daemon logs by looking for
the JobIds listed above.
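
To make the log lookup above less manual, the scontrol lines can be turned into a lookup table from the squeue-style "jobnumber_task" label to the per-task JobId. This is a minimal Python sketch, assuming the input looks like the three sample lines above; the function name map_array_tasks is hypothetical.

```python
import re

# Matches the three fields shown in the sample scontrol output.
# The \b keeps "JobId" from matching inside "ArrayJobId".
FIELD_RE = re.compile(r"\b(JobId|ArrayJobId|ArrayTaskId)=(\d+)")


def map_array_tasks(scontrol_text):
    """Return {"<ArrayJobId>_<ArrayTaskId>": JobId} for each record line."""
    mapping = {}
    for line in scontrol_text.splitlines():
        fields = dict(FIELD_RE.findall(line))
        # Skip lines that do not carry all three fields.
        if {"JobId", "ArrayJobId", "ArrayTaskId"} <= fields.keys():
            key = f'{fields["ArrayJobId"]}_{fields["ArrayTaskId"]}'
            mapping[key] = int(fields["JobId"])
    return mapping


sample = """\
JobId=23383 ArrayJobId=23383 ArrayTaskId=1
JobId=23384 ArrayJobId=23383 ArrayTaskId=2
JobId=23385 ArrayJobId=23383 ArrayTaskId=3
"""

task_to_jobid = map_array_tasks(sample)
print(task_to_jobid["23383_2"])
```

The per-task JobId found this way is the number to search for in the scheduler and control daemon logs.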

I did find a previous post
https://www.mail-archive.com/slurm-dev-***@public.gmane.org/msg03344.html which
seems to be my question as well.

Thanks for any insight which can be provided,

John DeSantis
Danny Auble
2014-09-26 17:45:33 UTC
John, this was fixed in 14.11 (commit
d23590dbc94e40a0963fc8d1cee0e6145f782f5c). Since structures had to
change, it wasn't possible to fix previous versions. The patch might
apply cleanly to 14.03, but will probably need some massaging with the
packs and unpacks. Using this patch will also break backwards
compatibility, which you may or may not care about.

Danny
John Desantis
2014-09-26 18:08:32 UTC
Danny,

Thank you for your response. We'll schedule an upgrade to address the issue.

Could you tell me if commit 6aadcf15355dfe (introduced in 14.03.4)
will still be present?

John DeSantis
Danny Auble
2014-09-26 18:27:36 UTC
Yes, depending on which commit you upgrade to; anything in 14.03 is in
14.11. Right now I wouldn't suggest running 14.11 in production,
since it is still under development. If this feature is something you
really need, I would suggest getting the 14.03.8 tag, cherry-picking the
14.11 commit, massaging it, and running it that way.
John Desantis
2014-09-26 18:41:31 UTC
Danny,

We can wait for the production version. Thanks!

John DeSantis