Post by Barbara Krasovec
Post by j***@public.gmane.org
Pending and running jobs should be preserved across major releases too.
When we upgraded Slurm from 2.5 to 2.6, we had tested it beforehand on a
working test cluster, and all jobs were killed.
So, if I do an upgrade of Slurm from 2.6.5 to 14.03, it should work on a
running cluster and it is not necessary to drain it? I just stop new jobs,
and those that are already in the queue (running or pending) should be
preserved?
It ought to work, yes, but if something goes wrong... Some issues we
have seen over the past few years:
1) Jobs killed on upgrade, with some complaints in the logs about protocol
incompatibility between the new slurmctld and the older slurmds. IIRC this
might have been the 2.5 -> 2.6.0 upgrade; a fix was included in 2.6.1(?).
2) Jobs killed due to slurmd timeout. This was due to an upgrade
procedure where (for some reason?) the slurmds were first stopped, then
the new rpm packages installed, and only then the slurmds restarted. With
enough nodes, upgrading the packages everywhere took long enough that
slurmctld decided all the nodes were down and killed the jobs, even
though the jobs themselves were running fine. (This issue is of course
trivial to avoid with a saner upgrade procedure and/or a larger
SlurmdTimeout parameter; see the sketch after this list. Would have been
nice to think of it before the "OH F***" moment.. ;) )
3) slurmdbd hanging for 45 minutes during "service slurmdbd restart",
due to updating the MySQL tables. Our job IDs are at ~11M, and
/var/lib/mysql is ~10GB, so I guess it's simply a lot of work to do.
4) The libslurm .so version is bumped with every release, so things like
MPI libraries with Slurm integration ought to be recompiled. Sometimes it
works to just symlink the old .so name(s) to the new one, but this is of
course a giant kludge with no guarantee of working. Some kind of ABI
stability with symbol versioning etc. would be nice..
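
For the record, the symlink kludge from (4) amounts to roughly the
following; the library path and the .so version numbers here are only an
illustration (check what your MPI library was actually linked against),
not the real values for any particular release:

    # See which libslurm soname the MPI library was linked against
    # (path and version numbers below are hypothetical examples)
    ldd /opt/mpi/lib/libmpi.so | grep libslurm

    # Point the old soname at the new library -- no ABI guarantee at all
    ln -s /usr/lib64/libslurm.so.27 /usr/lib64/libslurm.so.26
    ldconfig

Recompiling against the new libslurm is of course the proper fix; the
symlink is only a stopgap that happens to work when the ABI hasn't
actually changed for the symbols your library uses.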
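
And the sketch mentioned in (2): the idea is just to give the nodes
enough slack before the node daemons are touched, and to install the new
packages before restarting slurmd instead of stopping it first. The
timeout value and package handling below are only an example, adapt to
your own setup:

    # On the controller: raise SlurmdTimeout in slurm.conf (e.g.
    # SlurmdTimeout=3600) so slow package installs don't get nodes
    # marked DOWN, then push the change out:
    scontrol reconfigure

    # On each compute node: install the new packages *before* touching
    # slurmd, then restart it so the daemon is only down for a moment:
    yum -y update slurm\*        # or rpm -Uvh the new packages
    service slurmd restart

    # Afterwards, restore the original SlurmdTimeout in slurm.conf and
    # run "scontrol reconfigure" again.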
Issues (2) and (3) are unfortunately the kind you tend to run into when
upgrading your production system rather than some test cluster.. :( But
generally, on-the-fly upgrades have worked fine for us. Still, we try
to do major upgrades at the same time as other maintenance when
possible.
Post by Barbara Krasovec
Thanks,
Barbara
Post by j***@public.gmane.org
Post by Barbara Krasovec
Post by José Manuel Molero
Dear Slurm user,
Maybe these are silly questions, but I can't find the answers in the manual.
We have recently installed Slurm 14.03 on a cluster, in a Red Hat /
Scientific Linux environment.
In order to tune the configuration, we want to test different
parameters in slurm.conf.
But several users are running important jobs that take several days.
How can I change the Slurm configuration and restart slurmctld without
affecting the users and their jobs?
Is it also necessary to restart the Slurm daemons?
Is it also possible to upgrade or change the Slurm version while there
are jobs running?
Thanks in advance.
Hello!
We apply new configuration parameters with "scontrol reconfigure"
(first I distribute the new slurm.conf to all nodes).
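For example, something like this; pdcp and the host range are just one
way to do it (the hostnames and config path here are illustrative), use
whatever you normally use for pushing files around:

    # Push the edited slurm.conf to all nodes (host list is an example)
    pdcp -w 'node[001-100]' /etc/slurm/slurm.conf /etc/slurm/slurm.conf
    # Then have slurmctld and the slurmds re-read the configuration;
    # this does not disturb running or pending jobs
    scontrol reconfigure

Note that a few parameters cannot be picked up by a reconfigure and
still need a restart of the daemons; the slurm.conf man page says which.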
Upgrading Slurm: in my experience, when upgrading to a minor release
(e.g. from 2.6.4 to 2.6.X), it is not a problem to do it on a
running cluster; jobs are preserved. But when upgrading to a major
release (e.g. from 2.5 to 2.6), the cluster has to be drained first,
otherwise jobs are killed.
Cheers,
Barbara
--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & BECS
+358503841576 || janne.blomqvist-***@public.gmane.org