Discussion:
Protocol mismatch and side effects(?)
Всеволод Никоноров
2014-09-01 10:26:30 UTC
Permalink
Hello,

I tried testing slurm-14.11 on some of my nodes while other nodes ran slurm-2.5.7, and the nodes running 14.11 were not excluded from the 2.5.7 controller's config. Something seems to have confused the 2.5.7 controller: tasks were doubled for some time (each task was visible twice in the smap list), and after I excluded the 14.11 nodes from the 2.5.7 controller's config those tasks restarted and the doubling stopped.

Could the protocol mismatch (which was clearly visible in the log) be related to the task doubling and hanging? Are there any safety measures other than cross-excluding foreign-version nodes from the controllers? I don't want to make our polite users sad again :)

Thanks in advance!
j***@public.gmane.org
2014-09-02 03:05:36 UTC
Permalink
See:
http://slurm.schedmd.com/quickstart_admin.html#upgrade
--
Morris "Moe" Jette
CTO, SchedMD LLC

Slurm User Group Meeting
September 23-24, Lugano, Switzerland
Find out more http://slurm.schedmd.com/slurm_ug_agenda.html
Всеволод Никоноров
2014-09-02 07:53:31 UTC
Permalink
Thank you very much, but what I am looking for is not exactly the upgrade procedure; I am rather trying to understand what happened in my environment and how to avoid such problems in the future. We are testing two installations of Slurm on adjacent nodes, so that users who test the new version can have all the network-mounted filesystems (NFS, Lustre) from the main installation. It seems that the 2.5.7 slurmctld addressed a node that was simultaneously running slurmctld 14.11 and slurmd 14.11, and then some of the nodes controlled by the 2.5.7 slurmctld got confused and lost jobs.

Could the interaction of a 2.5.7 slurmctld with a Slurm daemon of a different version (which was not supposed to consider it its master) confuse it so much that it lost the jobs?

Can such things still happen if I exclude the nodes running the newer Slurm daemons from the older Slurm master node's config? Is there anything else I should do so that two independent sets of Slurm nodes can coexist without issues?
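For what it's worth, a minimal sketch of keeping the two installations disjoint is below. The cluster names, hostnames, ports, and node ranges are hypothetical; the key point is that the NodeName lists and the SlurmctldPort/SlurmdPort values must not overlap between the two configs, so neither controller can ever address the other cluster's daemons:

```
# slurm.conf for the production (2.5.7) cluster -- illustrative values
ClusterName=prod
ControlMachine=master-old
SlurmctldPort=6817
SlurmdPort=6818
NodeName=n[001-100] CPUs=16 State=UNKNOWN

# slurm.conf for the test (14.11) cluster -- note the disjoint node
# list and different ports
ClusterName=test
ControlMachine=master-new
SlurmctldPort=7817
SlurmdPort=7818
NodeName=n[101-110] CPUs=16 State=UNKNOWN
```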

Thank you!
Michael Jennings
2014-09-02 17:56:55 UTC
Permalink
If you read the first paragraph of the page Moe linked to carefully, it
explains the problem: 2.5.7 can only interact with 2.6.x and 14.03. To
upgrade to 14.11, you'll need to upgrade to a version somewhere in
between first.
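The compatibility rule Michael describes can be sketched as follows: a daemon can communicate with daemons up to two major releases apart, but no further. The release list below is illustrative, not exhaustive (Slurm switched from 2.x numbering to year.month numbering at 14.03):

```python
# Sketch of the version-compatibility rule: releases within two major
# releases of each other can communicate; anything further apart cannot.
RELEASES = ["2.5", "2.6", "14.03", "14.11"]  # chronological order

def compatible(a: str, b: str) -> bool:
    """True if releases a and b are at most two major releases apart."""
    return abs(RELEASES.index(a) - RELEASES.index(b)) <= 2

print(compatible("2.5", "14.03"))  # True: 2.5.7 can talk to 14.03
print(compatible("2.5", "14.11"))  # False: three releases apart
```

This is why a 2.5.7 controller addressing a 14.11 slurmd is undefined territory: the RPC protocol versions are too far apart to negotiate.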

You also intermix references to slurmctld (which runs on the master)
with slurmd (which runs on the nodes), so it's not clear you followed
the proper upgrade procedure in terms of order of operations; this is
also documented on the page Moe provided.
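The order of operations referred to above can be summarized as: upgrade the accounting daemon first, then the controller, then the compute-node daemons, so that a newer daemon is never reporting to an older one. A trivial sketch:

```python
# Documented upgrade order for Slurm daemons (see the upgrade page
# linked earlier): database daemon, then controller, then node daemons.
UPGRADE_ORDER = ["slurmdbd", "slurmctld", "slurmd"]

for daemon in UPGRADE_ORDER:
    print(f"stop {daemon}, install the new version, start {daemon}")
```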

Any time you're upgrading more than a single major version step, we
strongly recommend you try the upgrade on a non-production system
first. This will help to avoid any surprises (like job loss) during
the actual upgrade.

HTH,
Michael
--
Michael Jennings <mej-/***@public.gmane.org>
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E W: 510-495-2687
MS 050B-3209 F: 510-486-8615
Всеволод Никоноров
2014-09-05 13:13:32 UTC
Permalink
Our upgrade was successful after all; thanks, everybody!