Discussion:
Error: Unable to contact slurm controller
Gerry Creager - NOAA Affiliate
2014-08-20 16:31:33 UTC
I'm trying to learn how to use and administer slurm on a new Cray system,
and started seeing this yesterday:
squeue
slurm_load_jobs error: Unable to contact slurm controller (connect failure)

I'm at a loss as to how to proceed.

Thanks, Gerry
--
Gerry Creager
NSSL/CIMMS
405.325.6371
++++++++++++++++++++++
“Big whorls have little whorls,
That feed on their velocity;
And little whorls have lesser whorls,
And so on to viscosity.”
Lewis Fry Richardson (1881-1953)
j***@public.gmane.org
2014-08-20 16:39:37 UTC
Try this:
http://slurm.schedmd.com/troubleshoot.html
--
Morris "Moe" Jette
CTO, SchedMD LLC

Slurm User Group Meeting
September 23-24, Lugano, Switzerland
Find out more http://slurm.schedmd.com/slurm_ug_agenda.html
Gerry Creager - NOAA Affiliate
2014-08-20 21:08:36 UTC
Moe,

Thanks. I've tried. I'm noting a pair of errors in the slurmctld.log file:

[2014-08-20T15:58:58.458] debug: No DownNodes
[2014-08-20T15:58:58.458] fatal: No PartitionName information available!

So far, Google hasn't helped me much in this regard.

gerry
Trey Dockendorf
2014-08-20 21:25:31 UTC
What's your slurm.conf look like? Do you have valid Nodes and Partitions defined?

For example:

egrep '^(PartitionName|NodeName)' /etc/slurm/slurm.conf

Sounds like an invalid slurm.conf is preventing slurmctld from starting.

- Trey

=============================

Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: treydock-mRW4Vj+***@public.gmane.org
Jabber: treydock-mRW4Vj+***@public.gmane.org

Gerry Creager - NOAA Affiliate
2014-08-20 21:39:41 UTC
Hi, Trey

That's what I am intuiting, as well, but:

***@loki:~/software/wrf/NME/DART_Lanai/models/wrf/work> egrep '^(PartitionName|NodeName)' /opt/slurm/default/etc/slurm.conf
NodeName=nid00[002-007,024-029,040-043,046-049,052-055,064-071,088-099,100-103,120-127,136-151,160-167,184-199,216-223,232-247,256-263,280-287] Sockets=4 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=65536
PartitionName=DEFAULT Shared=EXCLUSIVE State=UP DefaultTime=60 Nodes=nid00[002-007,024-029,040-043,046-049,052-055,064-071,088-099,100-103,120-127,136-151,160-167,184-199,216-223,232-247,256-263,280-287] MaxNodes=12

looks pretty normal.

gerry
Trey Dockendorf
2014-08-21 03:11:32 UTC
Is slurmctld running? My guess is that you need at least one partition defined in addition to the DEFAULT partition. Try creating a partition with any name, which will inherit everything from DEFAULT.

- Trey
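
A minimal sketch of that check, assuming a PartitionName=DEFAULT line only sets defaults for the partitions defined after it and does not create a partition on its own. The function name has_named_partition and the partition name "batch" are made up for illustration, and the match is case-sensitive like the egrep earlier in the thread:

```shell
#!/bin/sh
# Succeeds if the given slurm.conf defines at least one partition whose
# name is not DEFAULT (hypothetical helper, not part of Slurm).
has_named_partition() {
    grep -E '^PartitionName=' "$1" |
        grep -Evq '^PartitionName=DEFAULT([[:space:]]|$)'
}

# Demo on a throwaway file rather than the real slurm.conf.
conf=$(mktemp)
cat > "$conf" <<'EOF'
PartitionName=DEFAULT Shared=EXCLUSIVE State=UP DefaultTime=60 MaxNodes=12
EOF
has_named_partition "$conf" || echo "only DEFAULT is defined"

# "batch" is an invented name; it inherits its settings from DEFAULT.
echo 'PartitionName=batch Default=YES' >> "$conf"
has_named_partition "$conf" && echo "named partition found"
rm -f "$conf"
```

Pointing the function at the real file (for example /opt/slurm/default/etc/slurm.conf) would show whether a line like the second one is missing.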

Gerry Creager - NOAA Affiliate
2014-08-21 14:47:48 UTC
No, slurmctld isn't running now. It was when I started, but I suspect I
made at least one mod too many to slurm.conf. When I try to start
slurmctld, I get these in slurmctld.log:
[2014-08-21T09:30:09.626] debug2: No ApbasilTimeout configured (65534)
[2014-08-21T09:30:09.630] debug2: No ApbasilTimeout configured (65534)
[2014-08-21T09:30:09.673] fatal: system has no usable batch compute nodes


I've just modified slurm.conf to make sure there's a default
partition. I'd had named partitions defined previously, but got some errors
and warnings when trying to get the partition naming right in #SBATCH, so
I'd gone back to the default config.

This appears to have started with a reboot several days ago. I'm now making
sure it's not something deeper causing a Gemini network problem.

Thanks, Trey!
gerry
Andrew Elwell
2014-08-24 13:21:32 UTC
Hi Gerry,
Post by Gerry Creager - NOAA Affiliate
[2014-08-21T09:30:09.673] fatal: system has no usable batch compute nodes
We see this on our systems (running Slurm + Alps/basil rather than native)
when the slurmctld starts before the sdb has a list of batch nodes. It's
bitten us when we've set the nodes to interactive rather than batch, and
more regularly when we've restarted the sdb and slurmctld has started too
early in the boot process. (A quick 'service slurm restart' sorts that out, though.)

Andrew
Jeff Falgout
2014-08-25 15:10:35 UTC
Post by Gerry Creager - NOAA Affiliate
NodeName=nid00[002-007,024-029,040-043,046-049,052-055,064-071,088-099,100-103,120-127,136-151,160-167,184-199,216-223,232-247,256-263,280-287] Sockets=4 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=65536
PartitionName=DEFAULT Shared=EXCLUSIVE State=UP DefaultTime=60 Nodes=nid00[002-007,024-029,040-043,046-049,052-055,064-071,088-099,100-103,120-127,136-151,160-167,184-199,216-223,232-247,256-263,280-287] MaxNodes=12
Basic question here ... just covering the bases:

Do you intend for your node names to be nid00002, nid00003, nid00004...? or
nid002, nid003, nid004 ...?

Jeff
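
To make the difference concrete: a bracket expression such as nid00[002-004] keeps the literal prefix nid00 and zero-pads the numbers to the width written inside the brackets, so it names nid00002 and so on, not nid002. The sketch below mimics that expansion for a single range; expand_range is a made-up helper, not a Slurm tool. On a machine with Slurm installed, `scontrol show hostnames 'nid00[002-004]'` should give the same list.

```shell
#!/bin/sh
# expand_range PREFIX FIRST LAST (hypothetical helper): print PREFIX followed
# by each number from FIRST to LAST, zero-padded to the width of FIRST.
expand_range() {
    awk -v prefix="$1" -v first="$2" -v last="$3" 'BEGIN {
        # Build a format such as "%s%03d\n" from the width of FIRST.
        fmt = "%s%0" length(first) "d\n"
        for (i = first + 0; i <= last + 0; i++) printf fmt, prefix, i
    }'
}

# With the prefix used in the slurm.conf above: five digits after "nid".
expand_range nid00 002 004

# What the config would need to say to get three-digit names instead.
expand_range nid 002 004
```

If the compute nodes actually identify themselves as nid002, nid003, and so on, the NodeName and Nodes entries would need the shorter prefix.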
