Rémi Palancher
2014-08-28 16:17:47 UTC
Hi developers,
You probably already know that systemd[1] is the fast-growing init
alternative that will be the new default on all major GNU/Linux
distributions, including RHEL 7, CentOS, Fedora, Debian, Ubuntu and so on.
Among other things, systemd notably puts all processes into cgroups.
This includes all system services, and therefore the Slurm daemons.
Since slurmd is also able to manage cgroups, we (with workmates at EDF)
were curious to see how systemd and Slurm could work together.
My testing environment is:
- Debian Wheezy 7.6
- Linux kernel 3.2.60
- systemd 204
- slurm 14.11.0-0pre3
systemd
=======
Here is a short explanation of how systemd works (at least AFAIU!).
At boot time, systemd mounts the following cgroup filesystems:
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup
(rw,nosuid,nodev,noexec,relatime,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/cpuset type cgroup
(rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup
(rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/memory type cgroup
(rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup
(rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup
(rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls type cgroup
(rw,nosuid,nodev,noexec,relatime,net_cls)
cgroup on /sys/fs/cgroup/blkio type cgroup
(rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup
(rw,nosuid,nodev,noexec,relatime,perf_event)
Inside a tmpfs, it mounts a first cgroup filesystem named 'systemd':
- without any resource controller associated to it
- with notify_on_release set to 1
- with /lib/systemd/systemd-cgroups-agent as the release agent
Then systemd looks for all resource controllers available in the running
kernel and mounts one filesystem for each of them (except cpu and
cpuacct, which are mounted together). None of these controller cgroup
filesystems has a release_agent set.
systemd actually manages all processes running on the system (user
sessions, kernel threads, services and their forks) in dedicated cgroups
inside the 'systemd' hierarchy. Additionally, if limits are configured
in so-called unit files, it also creates cgroups in the appropriate
controller filesystems. For example, if you set a memory usage limit on
the slurmd service, then in addition to the /system/slurmd.service
cgroup in the systemd fs, it will also create a /system/slurmd.service
cgroup in the memory fs with the appropriate memory limits.
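For illustration, such a limit could be declared in the service's unit
file; here is a hypothetical excerpt (the ExecStart path is an
assumption, and MemoryLimit is the systemd 204-era directive backed by
the memory controller):

```ini
# Hypothetical excerpt of slurmd.service; setting MemoryLimit makes systemd
# create /system/slurmd.service in the memory controller fs, as described above.
[Service]
ExecStart=/usr/sbin/slurmd
MemoryLimit=2G
```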
By default for services, it simply creates a cgroup in the cpu,cpuacct
controller. For example with slurmd:
# cat /proc/`pgrep slurmd`/cgroup
9:perf_event:/
8:blkio:/
7:net_cls:/
6:freezer:/
5:devices:/
4:memory:/
3:cpuacct,cpu:/system/slurmd.service
2:cpuset:/
1:name=systemd:/system/slurmd.service
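Each line of this file has the form "hierarchy-id:controllers:path", so
the placement of a process can be inspected programmatically. A quick
awk sketch, with two of the slurmd entries above hard-coded as sample
input rather than read from /proc:

```shell
# Split /proc/<pid>/cgroup lines (hierarchy-id:controllers:path) on ':'
# and print which path each controller set uses.
sample='3:cpuacct,cpu:/system/slurmd.service
1:name=systemd:/system/slurmd.service'
echo "$sample" | awk -F: '{ printf "%s -> %s\n", $2, $3 }'
# prints:
# cpuacct,cpu -> /system/slurmd.service
# name=systemd -> /system/slurmd.service
```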
When all processes of a cgroup end, systemd is notified via the
execution of the release agent /lib/systemd/systemd-cgroups-agent in the
'systemd' fs. This program basically sends a D-Bus notification to the
systemd core daemon with the path of the empty cgroup as a parameter.
When the core daemon receives this notification, it looks through its
internal data structures for all associated cgroups in all controller
filesystems and deletes them. This is how the cgroup controller
filesystems are kept clean as they become empty.
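To make the mechanism concrete, here is a minimal, hypothetical
release-agent sketch (not systemd's agent nor one of Slurm's actual
release_* scripts): the kernel invokes the agent with the path of the
now-empty cgroup, relative to the hierarchy root, as its only argument,
and the usual job is to rmdir that directory. A temp directory stands in
for a mounted hierarchy so the demo needs no root privileges.

```shell
#!/bin/sh
# HIERARCHY_ROOT stands in for a cgroup mountpoint; a plain temp dir for the demo.
HIERARCHY_ROOT="${HIERARCHY_ROOT:-$(mktemp -d)}"

release_cgroup() {
    # rmdir succeeds only on an empty directory, which is precisely the
    # condition under which the kernel runs the release agent
    rmdir "${HIERARCHY_ROOT}$1" && echo "released $1"
}

mkdir -p "${HIERARCHY_ROOT}/slurm/uid_1000/job_27"
release_cgroup "/slurm/uid_1000/job_27"    # prints: released /slurm/uid_1000/job_27
```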
slurm
=====
Well, the question then is: how can slurmd and its cgroup plugins work
on top of that?
First, here is an excerpt of cgroups.txt in the Linux documentation[2]:
"If an active hierarchy with exactly the same set of subsystems already
exists, it will be reused for the new mount. If no existing hierarchy
matches, and any of the requested subsystems are in use in an existing
hierarchy, the mount will fail with -EBUSY. Otherwise, a new hierarchy
is activated, associated with the requested subsystems."
Therefore, if we configure slurmd to mount the already existing cpuset,
memory and freezer controller filesystems itself (on other mountpoints)
and to set its own release_agent for emptiness notification, it works.
The cleanup at the end of jobs is correctly done by the Slurm release
agent, and systemd does not complain.
Here is the corresponding cgroup.conf:
CgroupMountpoint=/cgroup
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm-llnl/cgroup"
(In slurm.conf, I enable the proctrack/cgroup and task/cgroup plugins,
but I avoided jobacct_gather/cgroup since it is still flagged as
"experimental" in the docs.)
Then:
# mkdir /cgroup
# mount -t tmpfs tmpfs /cgroup
# slurmd
My only source of sadness with this solution is the number of mounts:
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup
(rw,nosuid,nodev,noexec,relatime,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/cpuset type cgroup
(rw,nosuid,nodev,noexec,relatime,cpuset,release_agent=/etc/slurm-llnl/cgroup/release_cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup
(rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/memory type cgroup
(rw,nosuid,nodev,noexec,relatime,memory,release_agent=/etc/slurm-llnl/cgroup/release_memory)
cgroup on /sys/fs/cgroup/devices type cgroup
(rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup
(rw,nosuid,nodev,noexec,relatime,freezer,release_agent=/etc/slurm-llnl/cgroup/release_freezer)
cgroup on /sys/fs/cgroup/net_cls type cgroup
(rw,nosuid,nodev,noexec,relatime,net_cls)
cgroup on /sys/fs/cgroup/blkio type cgroup
(rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup
(rw,nosuid,nodev,noexec,relatime,perf_event)
tmpfs on /cgroup type tmpfs (rw,relatime)
cgroup on /cgroup/freezer type cgroup
(rw,nosuid,nodev,noexec,relatime,freezer,release_agent=/etc/slurm-llnl/cgroup/release_freezer)
cgroup on /cgroup/cpuset type cgroup
(rw,nosuid,nodev,noexec,relatime,cpuset,release_agent=/etc/slurm-llnl/cgroup/release_cpuset)
cgroup on /cgroup/memory type cgroup
(rw,nosuid,nodev,noexec,relatime,memory,release_agent=/etc/slurm-llnl/cgroup/release_memory)
Therefore, I tried to make Slurm use the controller filesystems already
mounted by systemd. In cgroup.conf, that looks like this:
CgroupMountpoint=/sys/fs/cgroup
CgroupAutomount=no
#CgroupReleaseAgentDir="/etc/slurm-llnl/cgroup"
With this configuration, slurmd does not set the release_agent in the
root directory of the controller filesystems (and commenting out the
CgroupReleaseAgentDir parameter does not change anything):
# for controller in perf_event net_cls freezer devices memory \
      cpu,cpuacct cpuset systemd; do
      echo "${controller}: $(cat /sys/fs/cgroup/${controller}/release_agent)"
  done
perf_event:
net_cls:
freezer:
devices:
memory:
cpu,cpuacct:
cpuset:
systemd: /lib/systemd/systemd-cgroups-agent
When a job is launched, the cgroups are properly created by slurmd:
# find /sys/fs/cgroup/ -path '*slurm*tasks'
/sys/fs/cgroup/freezer/slurm/uid_1000/job_27/step_batch/tasks
/sys/fs/cgroup/freezer/slurm/uid_1000/job_27/tasks
/sys/fs/cgroup/freezer/slurm/uid_1000/tasks
/sys/fs/cgroup/freezer/slurm/tasks
/sys/fs/cgroup/memory/slurm/uid_1000/job_27/step_batch/tasks
/sys/fs/cgroup/memory/slurm/uid_1000/job_27/tasks
/sys/fs/cgroup/memory/slurm/uid_1000/tasks
/sys/fs/cgroup/memory/slurm/tasks
/sys/fs/cgroup/cpu,cpuacct/system/slurmd.service/tasks
/sys/fs/cgroup/cpuset/slurm/uid_1000/job_27/step_batch/tasks
/sys/fs/cgroup/cpuset/slurm/uid_1000/job_27/tasks
/sys/fs/cgroup/cpuset/slurm/uid_1000/tasks
/sys/fs/cgroup/cpuset/slurm/tasks
/sys/fs/cgroup/systemd/system/slurmd.service/tasks
(my job 27 has one batch step running `sleep 600`)
# cat /proc/`pgrep sleep`/cgroup
9:perf_event:/
8:blkio:/
7:net_cls:/
6:freezer:/slurm/uid_1000/job_27/step_batch
5:devices:/
4:memory:/slurm/uid_1000/job_27/step_batch
3:cpuacct,cpu:/system/slurmd.service
2:cpuset:/slurm/uid_1000/job_27/step_batch
1:name=systemd:/system/slurmd.service
But when I cancel the job, some garbage is left in the cgroup
filesystems:
# find /sys/fs/cgroup/ -path '*slurm*tasks'
/sys/fs/cgroup/freezer/slurm/uid_1000/job_27/tasks
/sys/fs/cgroup/freezer/slurm/uid_1000/tasks
/sys/fs/cgroup/freezer/slurm/tasks
/sys/fs/cgroup/memory/slurm/tasks
/sys/fs/cgroup/cpu,cpuacct/system/slurmd.service/tasks
/sys/fs/cgroup/cpuset/slurm/uid_1000/job_27/step_batch/tasks
/sys/fs/cgroup/cpuset/slurm/uid_1000/job_27/tasks
/sys/fs/cgroup/cpuset/slurm/uid_1000/tasks
/sys/fs/cgroup/cpuset/slurm/tasks
/sys/fs/cgroup/systemd/system/slurmd.service/tasks
The systemd release agent was not called by the kernel, since the
cgroups were not present in the 'systemd' fs. The Slurm release script
was not called either, since it was not set as the release_agent in the
controller filesystems.
But strangely, the memory controller has been totally cleaned, and the
step_batch cgroup in the freezer controller has vanished. Actually, I
figured out that there is some cleanup logic which explains that result in:
- _slurm_cgroup_destroy() called by fini() in
src/plugins/proctrack/cgroup/proctrack_cgroup.c
- task_cgroup_memory_fini() in
src/plugins/task/cgroup/task_cgroup_memory.c
But there is none in task_cgroup_cpuset_fini() in
src/plugins/task/cgroup/task_cgroup_cpuset.c.
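The missing cpuset cleanup would presumably follow the same bottom-up
pattern as the existing fini() logic: a cgroup directory can only be
rmdir'd once it is empty, so the deepest directories must go first. A
hedged sketch, with plain directories standing in for a real cgroup
hierarchy and cleanup_slurm_tree being a name I made up, not a Slurm
function:

```shell
#!/bin/sh
# Remove an entire slurm cgroup subtree bottom-up; -depth makes find
# visit children before parents, so each rmdir sees an empty directory.
cleanup_slurm_tree() {
    find "$1" -mindepth 1 -depth -type d -exec rmdir {} \; 2>/dev/null
}

# demo: a temp dir plays the role of /sys/fs/cgroup/cpuset
root=$(mktemp -d)
mkdir -p "${root}/slurm/uid_1000/job_27/step_batch"
cleanup_slurm_tree "${root}"
[ -z "$(ls -A "${root}")" ] && echo "clean"    # prints: clean
```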
So finally, here are my questions:
- Is the cleanup logic in the plugins supposed to work, and is it simply
unfinished for some controllers, with the release_agent script being
just a workaround?
- Or is slurmd supposed to rely only on the release_agent for cleanup,
making the cleanup logic triggered by fini() in the plugins irrelevant?
- Or is it some mix of both that I just don't understand?
I would be glad to have your insights on this matter :) I would also
appreciate feedback from other people who have run other tests with
Slurm and systemd!
The funny thing about all of this is that it will become totally
irrelevant with upcoming Linux kernel releases (3.16+) and the ongoing
effort on the cgroup unified hierarchy[3][4]! So if modifications to
Slurm's cgroup management are planned, it would be wise to take this
into account.
[1] http://www.freedesktop.org/wiki/Software/systemd/
[2] https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt
[3] http://lwn.net/Articles/601840/
[4]
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/cgroups/unified-hierarchy.txt
Thank you for having taken the time to read this!
Regards,
--
Rémi Palancher
<remi-externe.palancher-***@public.gmane.org>