Discussion:
what happens after a prolog failure
Alessandro Italiano
2012-11-19 14:44:03 UTC
Permalink
Hi

we are going to evaluate Slurm as the batch system for our computing
farm [14k computing slots].

I've done some tests using the prolog scripts and I've noticed that:

1. when the "Prolog" script fails, the host where it failed is flagged
as DOWN and the job stays stuck in PENDING status.
2. when the "PrologSlurmctld" script fails, the job is CANCELLED.


First of all, can someone confirm that this is the expected behavior?

Is there a way to configure Slurm so that it automatically dispatches the
job to a new host when the "Prolog" script fails?

Unfortunately, I didn't find an answer to my questions in the "Prolog
and Epilog Scripts" section of the slurm.conf man page.

thanks in advance

Alessandro
Moe Jette
2012-11-19 16:47:04 UTC
Permalink
Hi Alessandro,

I will update the documentation to explain this. The thinking is that
if the Prolog fails, that indicates some problem with a particular
node, and the job can be requeued to run on another node. If the
PrologSlurmctld fails, the job is not going to be able to run on any
node(s). These are default behaviors, and either script can do
something different by executing the appropriate command (e.g.
"scontrol requeue $SLURM_JOBID" or "scancel $SLURM_JOBID").

Moe Jette
SchedMD
Moe Jette
2012-11-19 17:18:04 UTC
Permalink
I've looked at the code and it is somewhat different from what I
thought. If the PrologSlurmctld fails then batch jobs get requeued.
Interactive jobs (salloc and srun) will be killed.
diff --git a/doc/man/man5/slurm.conf.5 b/doc/man/man5/slurm.conf.5
index f45e483..96016bd 100644
--- a/doc/man/man5/slurm.conf.5
+++ b/doc/man/man5/slurm.conf.5
@@ -1229,7 +1229,10 @@ also be used to specify more than one program to run (e.g.
 the first job step. The prolog script or scripts may be used to purge files,
 enable user login, etc. By default there is no prolog. Any configured script
 is expected to complete execution quickly (in less time than
-\fBMessageTimeout\fR). See \fBProlog and Epilog Scripts\fR for more information.
+\fBMessageTimeout\fR).
+If the prolog fails (returns a non\-zero exit code), this will result in the
+node being set to a DOWN state and the job requeued to execute on another node.
+See \fBProlog and Epilog Scripts\fR for more information.
 
 .TP
 \fBPrologSlurmctld\fR
@@ -1250,7 +1253,7 @@ If some node can not be made available for use, the program should drain
 the node (typically using the scontrol command) and terminate with a non\-zero
 exit code.
 A non\-zero exit code will result in the job being requeued (where possible)
-or killed.
+or killed. Note that only batch jobs can be requeued.
 See \fBProlog and Epilog Scripts\fR for more information.
 
 .TP
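To make the wording of the second hunk concrete, a PrologSlurmctld following
that convention could look roughly like the sketch below; the per-node
preparation step (prepare_node) and the SLURM_JOB_NODELIST variable are
assumptions here, only the drain-the-node-and-exit-non-zero pattern comes
from the man page text:

""""""
#!/bin/bash
# Hypothetical PrologSlurmctld sketch (runs on the controller host).
# prepare_node is a site-specific placeholder, not a Slurm command.

for node in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
    if ! prepare_node "$node"; then
        # Per the man page: drain the unusable node and exit non-zero so
        # the job is requeued where possible (batch jobs) or killed.
        scontrol update NodeName="$node" State=DRAIN Reason="prolog setup failed"
        exit 1
    fi
done
exit 0
""""""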
Alessandro Italiano
2012-11-20 11:35:04 UTC
Permalink
Hi,

I've done several tests, and it seems that after one or two PrologSlurmctld
failures the batch job [submitted as: sbatch -p debug ale.sh] is canceled.

This is an example of the job status reported by the sacct command:

""""""""""""""""""""""""""""""""""""""""""""""
[***@pccms60 ~]# sacct -j 77
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
77 ale.sh debug root 1 NODE_FAIL 0:0


[***@pccms60 ~]# scontrol sho config | grep JobRequeue
JobRequeue = 1

[***@pccms60 ~]# scontrol sho node| grep State
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1

[***@pccms60 ~]# scontrol -V
slurm 2.4.4
""""""""""""""""""""


Is it possible to always requeue the job upon a PrologSlurmctld failure?

thanks in advance

Ale
Moe Jette
2012-11-20 16:13:03 UTC
Permalink
I'm not sure that you want to requeue the job an indefinite number of
times, but if that's the case, look in src/slurmctld/job_scheduler.c
around line 2135. Just comment out the "kill_job = true" line that is
executed on repeated failures of the PrologSlurmctld.
Alessandro Italiano
2012-11-26 10:07:04 UTC
Permalink
Hi,

I commented it out in the following way, and it works:


""""""
if (status != 0) {
bool kill_job = false;
slurmctld_lock_t job_write_lock = {
NO_LOCK, WRITE_LOCK, WRITE_LOCK, NO_LOCK };
error("prolog_slurmctld job %u prolog exit status %u:%u",
job_id, WEXITSTATUS(status), WTERMSIG(status));
lock_slurmctld(job_write_lock);
/*if (last_job_requeue == job_id) {
info("prolog_slurmctld failed again for job %u",
job_id);
kill_job = true;
} else if ((rc = job_requeue(0, job_id, -1,
*/
if ((rc = job_requeue(0, job_id, -1,
(uint16_t)NO_VAL, false))) {
info("unable to requeue job %u: %m", job_id);
kill_job = true;
} else
last_job_requeue = job_id;
if (kill_job) {
srun_user_message(job_ptr,
"PrologSlurmctld failed, job
killed");
(void) job_signal(job_id, SIGKILL, 0, 0, false);
}
unlock_slurmctld(job_write_lock);
} else
debug2("prolog_slurmctld job %u prolog completed", job_id);

"""""""

Let us know whether or not this is the correct way to achieve our goal.

thanks

Ale
Alessandro Italiano
2012-11-29 08:53:04 UTC
Permalink
Hi,

I applied the same kind of change for the Prolog script, modifying the
following files:

1. src/slurmctld/job_mgr.c
   line 3394: /*job_ptr->batch_flag++; only one retry */
2. src/slurmctld/node_mgr.c
   line 1835: /*set_node_down(reg_msg->node_name, "Prolog failed");*/

We use the prolog script to check the user's environment before the job
starts. In a multi-user computing farm it can be useful to keep the job
pending and let it land on another node that may have the correct user
environment. On the other hand, setting a node down reduces the available
computing slots, even though the node might still provide the right
environment for other users.


thanks for the quick support

Ale
Your patch is correct. After some discussion we decided to make this
the behavior of Slurm version 2.5, which will be released within a few
days.
Moe Jette
2012-11-29 17:53:04 UTC
Permalink
For what you want to do, having the Prolog cancel the job when it detects
a bad environment, and then return an exit code of 0, may be a better
solution.
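
A minimal sketch of that approach, assuming a site-specific environment
check (the check_user_env function and the SLURM_JOB_UID variable are
assumptions, not part of Slurm):

""""""
#!/bin/bash
# Hypothetical Prolog sketch of the suggestion above: if the user's
# environment on this node is unusable, cancel the job and still return 0
# so the node is neither set DOWN nor drained.

if ! check_user_env "$SLURM_JOB_UID"; then   # site-specific placeholder
    scancel "$SLURM_JOBID"
fi
exit 0
""""""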