Discussion:
job exit codes
Bill Wichser
2014-07-25 22:06:32 UTC
Permalink
I cannot find a clear explanation of job exit codes in the
documentation. I have a user seeing exit codes of 137 and 139.
Can anyone help me find out what this 8-bit unsigned integer
represents?

Thanks,
Bill
Paul Hargrove
2014-07-25 22:18:44 UTC
Permalink
Bill,

Even in the absence of SLURM, the normal behavior on most *nix platforms is
that exit codes over 128 are due to fatal signals. The signal value is the
lower 7 bits (or equivalently signo = exitcode - 128). So your user's code
(or possibly some portion of SLURM) has encountered signals 9 and 11.
On Linux and many other platforms, those are SIGKILL and SIGSEGV,
respectively.
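
As a minimal sketch of the arithmetic described above (not part of the original reply), the following shell snippet subtracts 128 from a status and asks the shell's `kill -l` builtin for the matching signal name. The function name `signal_from_status` is made up for illustration, and the names printed depend on the platform's signal numbering.

```shell
# Map a shell-style exit status to the signal that probably caused it.
# signal_from_status is a hypothetical helper, not a Slurm or POSIX utility.
signal_from_status() {
  code=$1
  if [ "$code" -gt 128 ]; then
    # kill -l N prints the name of signal number N (without the SIG prefix).
    kill -l $((code - 128))
  else
    # Statuses of 128 or less are treated here as ordinary exit codes.
    echo none
  fi
}

# The two statuses reported in this thread.
for code in 137 139; do
  echo "$code: $(signal_from_status "$code")"
done
```

Note that a program can also call exit with a value above 128 on its own, so the mapping is a strong hint rather than proof.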

-Paul
--
Paul H. Hargrove PHHargrove-/***@public.gmane.org
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
Danny Auble
2014-07-25 22:22:33 UTC
Permalink
Paul is correct,

Before 14.03.5 Slurm didn't follow the POSIX convention, but now it does.

Basically, if the job was signaled in some fashion, the exit code is
increased by 128 to show this is the case.

As an example on the command line, if I run a simple sleep and Ctrl-C it,
the exit code is 130:

sleep 1000
^C
echo $?
130
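
The same effect can be reproduced without a keyboard. This is only a sketch: a child shell sends a signal to itself and the parent prints the status it observed. The helper name `status_after_signal` is hypothetical, and the exact numbers depend on the platform's signal numbering.

```shell
# Run a child shell that kills itself with the named signal, then print
# the exit status the parent shell sees for it.
# status_after_signal is a hypothetical helper used only for this demo.
status_after_signal() {
  observed=0
  # \$\$ is expanded by the child shell, so the child signals itself.
  sh -c "kill -$1 \$\$" || observed=$?
  echo "$observed"
}

for sig in INT TERM KILL; do
  echo "SIG$sig -> exit status $(status_after_signal "$sig")"
done
```

On Linux SIGINT is signal 2, so the INT line should agree with the 130 shown in the session above.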

Before 14.03.5 srun would return just 15 in this case, but we wanted to
be POSIX compliant, so we modified it to increase the exit_code by 128.

What does sacct tell you about the jobs? For the exit code of 137 I would
expect an ExitCode of 0:9, meaning an exit code of 0 but the job was
signaled with SIGKILL. For the 139 I would expect 0:11, meaning a
segmentation fault happened, just as Paul said.

Danny
Bill Wichser
2014-07-26 00:12:32 UTC
Permalink
Thanks. I knew that with our implementation of PBS it was always this
way. But there was no indication in the Slurm docs that the lower-7-bits
rule (exit code minus 128) also applied to Slurm.

My exit codes from sacct are always 137:0 and 139:0 from these jobs.
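
The two forms mentioned in this thread can be read side by side with a small helper. This is a sketch under assumptions: it takes the ExitCode pair as a plain string (it does not call sacct), treats the part after the colon as the signal number, and falls back to the minus-128 rule when the signal part is 0. The name `describe_exitcode` is made up for illustration.

```shell
# Interpret an "exit:signal" pair such as 0:9 or 137:0.
# describe_exitcode is a hypothetical helper, not part of Slurm.
describe_exitcode() {
  exit_part=${1%%:*}   # text before the colon
  sig_part=${1##*:}    # text after the colon
  if [ "$sig_part" -gt 0 ]; then
    # The signal was recorded explicitly in the second field.
    echo "terminated by SIG$(kill -l "$sig_part")"
  elif [ "$exit_part" -gt 128 ]; then
    # No signal recorded, but the exit code looks like 128 + signo.
    echo "probably terminated by SIG$(kill -l $((exit_part - 128)))"
  else
    echo "exited with status $exit_part"
  fi
}

for pair in 0:9 0:11 137:0 139:0; do
  echo "$pair -> $(describe_exitcode "$pair")"
done
```

The "probably" in the fallback branch is deliberate, since a program may also exit with a value above 128 on its own.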

Bill
Danny Auble
2014-07-26 01:44:39 UTC
Permalink
What version are you using?
Bill Wichser
2014-07-29 15:41:35 UTC
Permalink
Version currently demonstrating this is: 14.03

Bill
Danny Auble
2014-07-29 18:02:00 UTC
Permalink
14.03.05?
Bill Wichser
2014-07-29 19:26:37 UTC
Permalink
Lol. Missed that! 14.03.04
Danny Auble
2014-07-30 00:41:32 UTC
Permalink
Upgrade and see if you get different behavior, as this was fixed in
14.03.05 ;).
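
Because the thread writes the release both as 14.03.5 and 14.03.05, a plain string comparison is not enough to tell whether an installation includes the fix. The sketch below compares the components numerically. It assumes a three-part major.minor.micro version string supplied by hand, and the helper names `version_key` and `has_posix_exit_codes` are hypothetical.

```shell
# Turn "14.03.5" or "14.03.05" into one comparable integer (1400030005).
# A missing third component counts as 0.
version_key() {
  echo "$1" | awk -F. '{ printf "%d%04d%04d\n", $1, $2, $3 }'
}

# Succeeds when the given version is 14.03.5 or later, the release that
# this thread says changed the exit code behavior.
has_posix_exit_codes() {
  [ "$(version_key "$1")" -ge "$(version_key 14.03.5)" ]
}

for version in 14.03 14.03.04 14.03.05; do
  if has_posix_exit_codes "$version"; then
    echo "$version: includes the fix"
  else
    echo "$version: predates the fix"
  fi
done
```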