Discussion:
strange state
Yann Sagon
2014-06-11 15:43:29 UTC
Hello,

I have upgraded Slurm to the latest version, 14.03.3-2.

Now I have two nodes in a strange state: alloc~

The job that is submitted to those nodes is in state "configuring" and stays
like that for hours.

In the error log of sbatch I see only this line: srun: Job step creation
temporarily disabled, retrying

I have tried to set the node to state down and then idle, with no change.

If there is no job on the node, the state is idle~

I'm using the power save feature of Slurm, but right now I have commented out
the actual power off and power on in my Resume and Suspend scripts due to other
problems.
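
For readers puzzled by the trailing character: in sinfo output, a "~" suffix marks a node that Slurm considers to be in power saving mode, so "alloc~" is an allocated node that Slurm still believes is powered down. A minimal sketch of splitting such a compact state string is below. The function name and the flag table are illustrative only, and just three suffixes are listed; the sinfo documentation for your version is the authority on the full set.

```python
from typing import List, Tuple

# A few suffixes described in the sinfo documentation (not an exhaustive list).
STATE_FLAGS = {
    "*": "not responding",
    "~": "power saving mode",
    "#": "powering up or being configured",
}


def split_node_state(state: str) -> Tuple[str, List[str]]:
    """Split a compact state such as 'alloc~' into (base state, flag descriptions)."""
    # Strip every trailing flag character to recover the base state.
    base = state.rstrip("".join(STATE_FLAGS))
    # Whatever was stripped is the sequence of flag characters.
    flags = [STATE_FLAGS[ch] for ch in state[len(base):]]
    return base, flags


for s in ("alloc~", "idle~", "idle"):
    print(s, split_node_state(s))
```

Seen this way, a job stuck in "configuring" on an "alloc~" node suggests Slurm is waiting for the node to be resumed.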
Yann Sagon
2014-06-18 13:34:29 UTC
Hello,

Sorry to bump this thread, but I still have my two nodes in this strange
state, and I can't use them. Any idea how I can reset the nodes to an idle
state?

I have updated slurm and slurmdbd to the latest version (14.03.4-2), but
with no change. All the other nodes are fine.

I'm using slurmdbd as well.

Any ideas are welcome.
Christopher B Coffey
2014-06-18 15:20:29 UTC
Hi Yann,

Did you try rebooting the node? I’ve seen a node get stuck in a completing
state that never cleared until the node was rebooted, at which point
the state cleared.

Best,
Chris
Yann Sagon
2014-06-18 15:25:34 UTC
Yes, I have restarted the node, restarted slurm (on the node), restarted
slurmdbd, restarted slurmctld, and tried to change the node state using scontrol.
Yann Sagon
2014-06-18 15:49:35 UTC
OK, I figured out what was going on.

I'm using power saving, and those two nodes were probably powered off
during the upgrade of the cluster.
During the upgrade, I also changed the Unix permissions of the
SuspendProgram and ResumeProgram by mistake. The user "slurm" no longer had
the right to execute those scripts.
I didn't see any message about that in the logs. Restoring the correct
permissions did the trick.

Thanks
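
Since nothing showed up in the logs, a quick permission check on the two scripts may save time in a similar situation. The sketch below is only an illustration under stated assumptions: the function names are made up here, the check looks only at the classic owner/group/other mode bits of the file itself, and it ignores ACLs and the search permission on parent directories.

```python
import os
import pwd
import stat
from typing import Iterable, List


def mode_allows_execute(path: str, uid: int, gids: Iterable[int]) -> bool:
    """Return True if the file's mode bits grant execute to the given uid/gids."""
    st = os.stat(path)
    if uid == st.st_uid:
        # The owner is judged by the owner bits only.
        return bool(st.st_mode & stat.S_IXUSR)
    if st.st_gid in set(gids):
        # Group members are judged by the group bits only.
        return bool(st.st_mode & stat.S_IXGRP)
    # Everyone else is judged by the "other" bits.
    return bool(st.st_mode & stat.S_IXOTH)


def scripts_not_executable_by(username: str, paths: Iterable[str]) -> List[str]:
    """Return the paths that the named user cannot execute (mode bits only)."""
    user = pwd.getpwnam(username)
    gids = os.getgrouplist(username, user.pw_gid)
    return [p for p in paths if not mode_allows_execute(p, user.pw_uid, gids)]
```

It could be called as scripts_not_executable_by("slurm", [...]) with the paths configured as SuspendProgram and ResumeProgram in slurm.conf; an empty list means the mode bits look fine for that user.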