Post by Barbara Krasovec
Post by j***@public.gmane.org
Pending and running jobs should be preserved across major releases too.
When we upgraded Slurm from 2.5 to 2.6, we had tested it beforehand on a
working test cluster, and all jobs were killed.
So, if I do an upgrade of Slurm from 2.6.5 to 14.03, it should work on a
running cluster and it is not necessary to drain it? I just stop new jobs,
and those that are already in the queue (running or pending) should be
preserved?
It ought to work, yes, but if something goes wrong... Some issues we
have seen over the past few years:
1) Jobs killed on upgrade, with some complaints in the logs about protocol
incompatibility between the new slurmctld and the older slurmds. IIRC this
might have been the 2.5 -> 2.6.0 upgrade; a fix was included in 2.6.1(?).
2) Jobs killed due to slurmd timeout. This was due to an upgrade
procedure where (for some reason?) the slurmds were first stopped, then
the new rpm packages installed, and only then the slurmds restarted. With
enough nodes, upgrading the packages everywhere took long enough that
slurmctld decided all the nodes were down and killed the jobs, even
though the jobs themselves were running fine. (This issue is of course
trivial to avoid with a saner upgrade procedure and/or a larger
SlurmdTimeout parameter; see the sketch after this list. Would have been
nice to think of it before the "OH F***" moment.. ;) )
3) slurmdbd hanging for 45 minutes during "service slurmdbd restart",
due to updating the MySQL tables. Our job IDs are at ~11M, and
/var/lib/mysql is ~10GB, so I guess it's simply a lot of work to do.
4) The libslurm .so version is bumped with every release, so things like
MPI libraries with Slurm integration ought to be recompiled. Sometimes it
works to just symlink the old .so name(s) to the new one, but this is of
course a giant kludge with no guarantee of working. Some kind of ABI
stability with symbol versioning etc. would be nice..
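
For the record, the symlink kludge from (4) amounts to roughly the
following; the library path and the .so version numbers here are only an
illustration (check what your MPI library was actually linked against),
not the real values for any particular release:

    # See which libslurm soname the MPI library was linked against
    # (path and version numbers below are hypothetical examples)
    ldd /opt/mpi/lib/libmpi.so | grep libslurm

    # Point the old soname at the new library -- no ABI guarantee at all
    ln -s /usr/lib64/libslurm.so.27 /usr/lib64/libslurm.so.26
    ldconfig

Recompiling against the new libslurm is of course the proper fix; the
symlink is only a stopgap that happens to work when the ABI hasn't
actually changed for the symbols your library uses.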
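
And the sketch mentioned in (2): the idea is just to give the nodes
enough slack before the node daemons are touched, and to install the new
packages before restarting slurmd instead of stopping it first. The
timeout value and package handling below are only an example, adapt to
your own setup:

    # On the controller: raise SlurmdTimeout in slurm.conf (e.g.
    # SlurmdTimeout=3600) so slow package installs don't get nodes
    # marked DOWN, then push the change out:
    scontrol reconfigure

    # On each compute node: install the new packages *before* touching
    # slurmd, then restart it so the daemon is only down for a moment:
    yum -y update slurm\*        # or rpm -Uvh the new packages
    service slurmd restart

    # Afterwards, restore the original SlurmdTimeout in slurm.conf and
    # run "scontrol reconfigure" again.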
Issues (2) and (3) are unfortunately the kind you tend to run into when
upgrading your production system rather than some test cluster.. :( But
generally, on-the-fly upgrades have worked fine for us. Still, we try
to do major upgrades at the same time as other maintenance when
possible.
Post by Barbara Krasovec
Thanks,
Barbara
Post by j***@public.gmane.org
Post by Barbara Krasovec
Post by José Manuel Molero
Dear Slurm user,
Maybe these are silly questions, but I can't find the answers in the manual.
We have recently installed Slurm 14.03 on a cluster, in a Red Hat /
Scientific Linux environment.
In order to tune the configuration, we want to test different
parameters in slurm.conf.
But several users are running important jobs that take several days.
How can I change the Slurm configuration and restart slurmctld without
affecting the users and their jobs?
Is it also necessary to restart the Slurm daemons?
Is it also possible to upgrade or change the Slurm version while there
are jobs running?
Thanks in advance.
Hello!
We apply new configuration parameters with "scontrol reconfigure"
(first I distribute the new slurm.conf to all nodes).
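For example, something like this; pdcp and the host range are just one
way to do it (the hostnames and config path here are illustrative), use
whatever you normally use for pushing files around:

    # Push the edited slurm.conf to all nodes (host list is an example)
    pdcp -w 'node[001-100]' /etc/slurm/slurm.conf /etc/slurm/slurm.conf
    # Then have slurmctld and the slurmds re-read the configuration;
    # this does not disturb running or pending jobs
    scontrol reconfigure

Note that a few parameters cannot be picked up by a reconfigure and
still need a restart of the daemons; the slurm.conf man page says which.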
Upgrading Slurm: in my experience, when upgrading to a minor release
(e.g. from 2.6.4 to 2.6.X), it is not a problem to do it on a
running cluster; jobs are preserved. But when upgrading to a major
release (e.g. from 2.5 to 2.6), the cluster has to be drained first,
otherwise jobs are killed.
Cheers,
Barbara
--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & BECS
+358503841576 || janne.blomqvist-***@public.gmane.org