Discussion:
Using the same (mounted) slurm installation on all nodes
Bastian Krüger
2014-07-25 08:33:34 UTC
Permalink
I recently began working with a cluster that consists of one control node
and several computation nodes; it was set up a couple of years ago by
someone else. In this setup, there is only one actual slurm installation,
which is located on the control node in /usr/local/slurm. All the other
nodes just mount that directory to their /usr/local/slurm. The only thing
that is copied between the nodes is the service startup script in
/etc/init.d.

The question is whether that is a good idea or not. I realize that if the
control node fails, all the other nodes lose the mounted slurm
directory. But how crucial is that?

Also, I'm thinking about adding a backup control node. This node has to
share a directory with the first control node. Is there any advice on
where this directory should be located? Could it live on the backup control
node, or would it be better to use a separate server?
Jason Bacon
2014-07-25 13:18:31 UTC
Permalink
Our CentOS cluster uses a shared installation for all the compute nodes,
but separate local installations for the head node and backup head
node. The compute nodes share binaries and configuration files via NFS,
but keep separate logs in their own local /var/log and the startup
script in their local init.d.
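As a sketch of the server side of that arrangement (the network range is a placeholder, not something stated in this thread), the head node's /etc/exports might contain a line like:

```
# /etc/exports on the head node: share the slurm tree read-only with the
# compute nodes; 10.0.0.0/24 is a placeholder for the cluster network.
# Logs stay local on each node, so read-only is sufficient here.
/usr/local/slurm  10.0.0.0/24(ro,sync,no_subtree_check)
```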

The head node and backup head node are independent of each other except
for shared state information. See "High Availability" in the SLURM docs:

http://slurm.schedmd.com/quickstart_admin.html#Config
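For reference, the relevant slurm.conf settings for a backup controller look roughly like this (the hostnames and the state path are illustrative placeholders, not values from this thread):

```
# slurm.conf excerpt: primary and backup controllers sharing state.
# "head1"/"head2" and the path below are placeholders.
ControlMachine=head1
BackupController=head2
# Directory that BOTH controllers must be able to read and write --
# this is the shared state information discussed here.
StateSaveLocation=/usr/local/slurm/state
```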

If NFS is properly configured, clients will wait indefinitely and
continue where they left off, so an NFS server failure should not result
in loss of data as long as the server comes back online while the client
is still trying to complete its operations.
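That "wait indefinitely and continue" behavior corresponds to NFS hard mounts. A hypothetical client-side fstab entry (hostname is a placeholder):

```
# /etc/fstab on each compute node ("controlnode" is a placeholder).
# "hard" makes the client retry forever instead of returning an error;
# "intr" lets a signal interrupt a hung operation. A "soft" mount, by
# contrast, can time out and return errors to applications.
controlnode:/usr/local/slurm  /usr/local/slurm  nfs  hard,intr  0  0
```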

There are pros and cons to a separate server for the head node and
backup head node state information. With a separate server, either head
node can operate normally while the other is down. However, if the
separate server goes down, neither head node can operate normally until it
comes back up. And a single server failure is more likely with three
servers than with two.

If state information is kept on the primary head node, the backup head
node will be blocked from updating state information while the primary
is down, and vice versa. This shouldn't be a problem as long as the
outage is brief, such as a reboot required for system updates. I
routinely reboot our primary head node for yum updates (after verifying
that the backup head node is running normally).

In any case, the server where the state information is kept should be
*very* reliable. We keep ours on the primary head node, which uses a
hardware RAID1 for the boot disk and has very strict limits to keep the
load to a minimum. Memory use and processes are both limited via
/etc/security/limits.d/ and the head node has no access to the
computational software installed on the cluster, so users aren't tempted
to run "quick" jobs on the head node outside the scheduler.
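The limits mentioned can be expressed with a drop-in file; a sketch with illustrative values (the actual numbers are not from this thread):

```
# /etc/security/limits.d/99-headnode.conf -- illustrative values only.
# Cap address space (in KB) and process count for all non-root users so
# interactive work on the head node can't starve the scheduler daemons.
*  hard  as     4194304
*  hard  nproc  100
```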

It would be a nice feature if the head node and backup head node could
be completely independent of each other, but I imagine that keeping them
synchronized would require some challenging coding and the real benefit
would be minimal.

Regards,

Jason
Bastian Krüger
2014-07-28 08:49:38 UTC
Permalink
Hi Jason! That sounds like a good overall concept. I just don't understand
one thing: if you keep your state information on the primary head node, how
can the backup node possibly resume the last state after a failure of the
primary node, if it can't read the state information? How does that work?
I think I will try something very similar, though, with the backup node
hosting the state information and the installations for the compute
nodes. So thank you for the advice!
Best,
Bastian
Chris Samuel
2014-07-25 14:06:31 UTC
Permalink
Post by Bastian Krüger
I recently began working with a cluster that consists of 1 control node and
several computation node and it was set up a couple of years ago by someone
else. In this current setup, there is only one actual slurm installation,
which is located on the control node in /usr/local/slurm. All the other
nodes just mount that directory to their /usr/local/slurm. The only thing
that is copied between the nodes is the service startup script in
/etc/init.d.
That's almost exactly how we run all our Intel clusters and our BlueGene/Q.

Works very well for us. We don't have a backup node for slurmctld.

All the best,
Chris
--
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel-***@public.gmane.org Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci