Discussion:
slurmd could not execve job
Krishna Teja
2014-10-10 19:11:32 UTC
Hi All,

I am having an issue running jobs with Slurm on our cluster. Slurm
was working fine until I rebooted the head node; now when I run "srun -l
/bin/hostname" I get this error:

srun: error: task 0 launch failed: Slurmd could not execve job

Slurm is running on the head node and on all the compute nodes. I couldn't
find anything concrete when I searched for the error online.

I am willing to provide any additional details to troubleshoot this.

Any help appreciated!

Regards
Krishna Teja
2014-10-14 14:21:39 UTC
Hi All,

I am having an issue running jobs with Slurm on our cluster. Slurm
was working fine until I rebooted the head node; now when I run "srun -l
/bin/hostname" I get this error:

srun: error: task 0 launch failed: Slurmd could not execve job

Slurm is running on the head node and on all the compute nodes. I couldn't
find anything concrete when I searched for the error online.

I am willing to provide any additional details to troubleshoot this.

Any help appreciated!

Regards
Krishna
Kevin Abbey
2014-10-15 01:14:42 UTC
Hi Krishna,

1. Review the logs and increase the debug level if needed.

2. If the Slurm config is not exactly the same on every node, the compute
nodes will not be able to communicate with the head node. The logs will
report this.

3. Is munge running on the head node? If it didn't start, communication
will fail.


These are the first items I'd check. While I was upgrading I ran into
these issues intermittently.
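
Item 2 above (identical config on every node) can be checked mechanically. The following is only a minimal sketch, not something posted in the thread: it assumes each node's copy of the config file has already been fetched to a local path, and the function names and node names are hypothetical. It groups nodes by the SHA-256 digest of their copy, so any node whose file differs ends up in its own group.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of the file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def group_by_digest(copies):
    """Group node names by the digest of their copy of the config file.

    `copies` maps a (hypothetical) node name to a local path that holds
    that node's copy of the file.
    """
    groups = defaultdict(list)
    for node, path in sorted(copies.items()):
        groups[file_digest(path)].append(node)
    return dict(groups)


def configs_match(copies):
    """Return True when every node's copy has the same contents."""
    return len(group_by_digest(copies)) <= 1
```

If configs_match() returns False, the output of group_by_digest() shows which nodes share a version of the file and which ones differ.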

Kevin
Krishna Teja
2014-10-15 18:46:56 UTC
Hi Kevin, thanks for the reply.

I have already solved this problem. There was a slurm file in /etc/pam.d
that needed to be copied to all the nodes. I did that and everything is
working as it's supposed to. I guess I didn't have the right permissions
before doing that.
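
For anyone hitting the same error, copying the file to every node could be scripted roughly as follows. This is a hedged sketch rather than the exact commands used here: the node names are hypothetical, it assumes scp access to each node as a user allowed to write to /etc/pam.d, and it defaults to a dry run that only prints the commands.

```python
import subprocess


def build_copy_commands(nodes, path="/etc/pam.d/slurm"):
    """Return one scp argument list per node, copying `path` to the same
    location on that node. The -p flag preserves modes and timestamps."""
    return [["scp", "-p", path, f"{node}:{path}"] for node in nodes]


def push_file(nodes, path="/etc/pam.d/slurm", dry_run=True):
    """Print the copy commands (dry run) or execute them one by one.

    Node names are placeholders; writing under /etc/pam.d normally
    requires root on the target node.
    """
    commands = build_copy_commands(nodes, path)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # raises if a copy fails
    return commands
```

Calling push_file(["node01", "node02"]) prints the two scp commands without running them; pass dry_run=False only after checking the output.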

Regards
Krishna