srun not executing command on node
Uwe Sauter
2014-07-10 14:40:32 UTC
Hi all,

I have a cluster which is set up as follows:

hostname      description
cl4intern     admin server, running slurmctld and slurmdbd
cl4fr1        frontend, not running any slurm service but has slurm installed
n01 -- n54    compute nodes

slurm.conf is shared across all hosts via NFS. cl4fr1 does not appear in
slurm.conf; it should only act as a submit node.
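
For reference, here is a rough sketch of the relevant parts of our
slurm.conf; the values are simplified (the 40 CPUs per node match the job
record further below) and it is not the exact file, but it shows that
cl4fr1 is deliberately absent:

# slurm.conf (shared via NFS) -- simplified sketch, not the complete file
ControlMachine=cl4intern
# compute nodes, 40 CPUs each
NodeName=n[01-54] CPUs=40 State=UNKNOWN
PartitionName=mypartition Nodes=n[01-54] Default=YES MaxTime=INFINITE State=UP
# cl4fr1 is intentionally not listed; it only has the slurm client tools installed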

Queuing a job from cl4fr1 using sbatch works as expected. But if I
run "srun -N1 -n1 -p mypartition hostname" I do not get the output of
"hostname", just a terminal that sits there waiting for something to
happen:

srun -N1 -n1 -p mypartition hostname

With "scontrol show jobs" I can see that a node was allocated, but the
job just sits there doing nothing:

JobId=2681 Name=df
UserId=myuser(15001) GroupId=mygroup(15000)
Priority=4294901684 Nice=0 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 ExitCode=0:0
RunTime=00:00:04 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2014-07-10T16:30:26 EligibleTime=2014-07-10T16:30:26
StartTime=2014-07-10T16:30:26 EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=mypartition AllocNode:Sid=cl4fr1:18795
ReqNodeList=(null) ExcNodeList=(null)
NodeList=n52
BatchHost=n52
NumNodes=1 NumCPUs=40 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=0
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/bin/hostname
WorkDir=/home/myuser

I can cancel this job with Ctrl+C without problems:

^Csrun: interrupt (one more within 1 sec to abort)
srun: task 0: unknown
^Csrun: sending Ctrl-C to job 2681.0
srun: Job step 2681.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

A combination of salloc, env | grep SLURM_JOB_NODELIST, ssh to the
allocated node and then srun -p mypartition -N1 -n1 hostname does execute,
but on a different node (the sequence is sketched below).
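
A minimal sketch of that sequence, with n52 standing in for whatever node
the allocation happens to land on:

# on cl4fr1: request an allocation of one node in the partition
salloc -p mypartition -N1 -n1
# inside the allocation shell: see which node was granted
env | grep SLURM_JOB_NODELIST    # e.g. SLURM_JOB_NODELIST=n52
# log into that node and run srun from there
ssh n52
srun -p mypartition -N1 -n1 hostname   # executes, but on a different node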

Can someone point me in a direction to investigate further? Has anyone
run into this issue before? Is it even possible to have a submit node
that does not appear in the Slurm configuration?


Thanks in advance,

Uwe Sauter
Trey Dockendorf
2014-07-10 17:54:44 UTC
Permalink
Uwe,

I ran into something similar. Our submit host (login node) could not execute srun commands; they would simply hang. I found that it was due to my iptables rules being "too strict": I had to allow all incoming traffic from our private network to the login node. Once that was done, all srun commands began working. My controller host is also on this private network, so I'm not sure whether it was the compute nodes or the slurm controller that needed access through the firewall. If you're attempting to set up iptables as restrictively as possible, start by allowing incoming traffic from the controller and see if that fixes the issue (a rough sketch of such ACCEPT rules follows the logging rules below).

You can use something like the following to log all iptables DROPs, which can then be viewed in dmesg:

<ACCEPT rules>
# rate-limited logging of anything that was not accepted above
-A INPUT -m limit --limit 2/min -j LOG --log-prefix "IPTABLES: INPUT DROPPED " --log-level 7
# final catch-all: drop whatever was not explicitly allowed
-A INPUT -j DROP
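
As a concrete starting point, here is a minimal sketch of the kind of
ACCEPT rules I mean, placed in front of the LOG/DROP rules above. The
subnet 10.1.0.0/24 is just a placeholder for your own private network,
and 6817/6818 are Slurm's default slurmctld/slurmd ports; note that srun
on the submit host also listens on dynamically chosen ports for the
daemons to call back, which is why simply allowing the whole private
subnet is the easiest fix:

# keep replies to connections the login node initiated working
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
# placeholder subnet: allow everything from the cluster's private network
-A INPUT -s 10.1.0.0/24 -j ACCEPT
# stricter alternative: only the default Slurm daemon ports from that subnet
# (srun's dynamically chosen ports would still need to be reachable)
#-A INPUT -s 10.1.0.0/24 -p tcp --dport 6817:6818 -j ACCEPT

While testing, the LOG rule lets you watch what is being dropped, e.g. with
dmesg | grep "IPTABLES: INPUT DROPPED".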

- Trey

=============================

Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396

Uwe Sauter
2014-07-10 18:19:36 UTC
Trey,

That was the right hint. As our cluster is pretty much closed off from
the outside, I don't have a problem opening the firewall for requests
coming from the nodes.

Now that I have allowed that, it works like a charm.

Thanks a lot,

Uwe