Discussion: Adding new nodes to slurm.conf
Paul Edmon
2013-01-30 15:02:02 UTC
Perhaps I missed the documentation on this, but what is the proper order
of operations for adding new nodes to slurm.conf? Currently, if we start
slurmd on the new nodes before they are in the conf, it simply fails on
those nodes. However, if we then add them to the conf and do a
reconfigure on the master, the master process falls over and we have to
restart it. At that point the new nodes show up as unknown, waiting for
the slurmd daemons on them to connect. Ideally this wouldn't happen; the
master shouldn't tip over just because new hosts were added to the conf.
Once those hosts are in the conf, though, simply restarting slurmd on
them works fine.

So what is the proper order? Do you put the new hosts in the conf and
start their slurmds before you reconfigure the master?

-Paul Edmon-
David Bigagli
2013-01-30 17:50:05 UTC
Do you have the slurmctld log from when the master failed? It should be
enough to add the hostname to slurm.conf (NodeName and PartitionName)
and then run 'scontrol reconfigure'.
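For example, something roughly like the following, where the hostnames,
hardware values, and partition name are placeholders rather than anything
from this cluster:

    NodeName=newnode[01-04] CPUs=16 RealMemory=64000 State=UNKNOWN
    PartitionName=compute Nodes=newnode[01-04] Default=YES State=UP

The bracketed range is Slurm's hostlist syntax, so one NodeName line can
cover all four new hosts.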

/David
Moe Jette
2013-01-30 20:45:04 UTC
You need to restart slurmctld to add nodes. Adding nodes causes a
multitude of bitmaps to be rebuilt, which does not happen when
"scontrol reconfig" is executed. Also you probably want to maintain a
single slurm.conf file on all nodes.

I would recommend:
1. Stop slurmctld
2. Update slurm.conf on all nodes
3. Restart slurmctld
4. Start slurmd on the new nodes
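As rough shell commands (the systemctl service names, the pdsh/pdcp
fan-out, and the /etc/slurm path below are assumptions for illustration,
not part of the recommendation itself):

    # hypothetical host lists; substitute your own
    ALL_NODES="oldnode[01-28],newnode[01-04]"
    NEW_NODES="newnode[01-04]"

    # 1. stop the controller
    systemctl stop slurmctld        # or the slurm init script on non-systemd hosts
    # 2. push the updated slurm.conf to every node (pdcp assumed available)
    pdcp -w "$ALL_NODES" /etc/slurm/slurm.conf /etc/slurm/slurm.conf
    # 3. bring the controller back up
    systemctl start slurmctld
    # 4. start slurmd on the new nodes only
    pdsh -w "$NEW_NODES" systemctl start slurmd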
Paul Edmon
2013-01-30 22:30:07 UTC
So during that period the master would cease managing everything and you
wouldn't be able to submit? Are those the only dangers of shutting down
the master?

We tend to be in an environment where things are in production but also
in flux.

-Paul Edmon-
Paul Edmon
2013-01-30 21:20:06 UTC
Here is what I have:

Jan 29 15:45:29 iliadserv2 slurmctld[18705]: sched:
_slurm_rpc_step_complete StepId=753.0 usec=19
Jan 29 15:56:47 iliadserv2 slurmctld[18705]: Processing RPC:
REQUEST_RECONFIGURE from uid=0
Jan 29 15:56:47 iliadserv2 slurmctld[18705]: error: Unable to create
NodeAddr list from west[6-7][1-2][1-8]
Jan 29 15:56:47 iliadserv2 slurmctld[18705]: fatal: Unable to create
NodeAddr list from west[6-7][1-2][1-8]
Jan 29 15:57:08 iliadserv2 slurmctld[7193]: error: Job accounting
information gathered, but not stored
Jan 29 15:57:08 iliadserv2 slurmctld[7193]: slurmctld version 2.5.1
started on cluster cluster
Jan 29 15:57:08 iliadserv2 slurmctld[7193]: error: WARNING: Even though
we are collecting accounting information you have asked for it not to be
stored (accounting_storage/none) if this is not what you have in mind
you will need to change it.
Jan 29 15:57:08 iliadserv2 slurmctld[7193]: error: Unable to create
NodeAddr list from west[6-7][1-2][1-8]
Jan 29 15:57:08 iliadserv2 slurmctld[7193]: fatal: Unable to create
NodeAddr list from west[6-7][1-2][1-8]
Jan 29 15:58:36 iliadserv2 slurmctld[7258]: error: Job accounting
information gathered, but not stored
Jan 29 15:58:36 iliadserv2 slurmctld[7258]: slurmctld version 2.5.1
started on cluster cluster
Jan 29 15:58:36 iliadserv2 slurmctld[7258]: error: WARNING: Even though
we are collecting accounting information you have asked for it not to be
stored (accounting_storage/none) if this is not what you have in mind
you will need to change it.
Jan 29 15:58:36 iliadserv2 slurmctld[7258]: Recovered state of 28 nodes

I fixed the slurm.conf in between to be just
west61[1-8],west62[1-8],west71[1-8],west72[1-8]. However, I wouldn't
expect the master to go down due to not being able to make a NodeAddr
list. I would expect it to refuse the new conf and spit out an error
message.
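As a side note, one way to check whether Slurm can parse a hostlist
expression, without involving the running controller, is to have scontrol
expand it; for example, with the corrected expression from this message:

    $ scontrol show hostnames west61[1-8],west62[1-8],west71[1-8],west72[1-8]
    west611
    west612
    ...
    west728

If scontrol cannot expand the expression it should report an error, which
makes for a cheap sanity check before reconfiguring or restarting slurmctld.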

-Paul Edmon-
Post by David Bigagli
Do you have the slurmctld log from when the master failed? It should be
enough to add the hostname to slurm.conf (NodeName and PartitionName)
and then run 'scontrol reconfigure'.
/David