Rémi Palancher
2014-08-28 16:17:47 UTC
Hi developers,
You probably already know that systemd[1] is the fast-growing init
alternative that will be the new default on all major GNU/Linux
distributions, including RHEL 7, CentOS, Fedora, Debian, Ubuntu and so on.
Among other things, systemd notably puts all processes into cgroups.
This includes all system services, and therefore the Slurm daemons.
Since slurmd is also able to manage cgroups, we (with workmates at EDF)
were curious to see how systemd and Slurm could work together.
My testing environment is:
- Debian Wheezy 7.6
- Linux kernel 3.2.60
- systemd 204
- slurm 14.11.0-0pre3
systemd
=======
Here is a short explanation of how systemd works (at least AFAIU!).
At boot time, systemd mounts the following cgroup filesystems:
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup
(rw,nosuid,nodev,noexec,relatime,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/cpuset type cgroup
(rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup
(rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/memory type cgroup
(rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup
(rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup
(rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls type cgroup
(rw,nosuid,nodev,noexec,relatime,net_cls)
cgroup on /sys/fs/cgroup/blkio type cgroup
(rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup
(rw,nosuid,nodev,noexec,relatime,perf_event)
Inside a tmpfs, it mounts a first cgroup filesystem named 'systemd':
- without any resource controller associated to it
- with notify_on_release set to 1
- with /lib/systemd/systemd-cgroups-agent as the release agent
Then systemd looks for all resource controllers available in the running
kernel and mounts one filesystem for each of them (except cpu and
cpuacct, which are mounted together). None of these controller cgroup
filesystems has a release_agent set.
systemd actually manages all processes running on the system (user
sessions, kernel threads, services and their forks) in dedicated cgroups
inside the 'systemd' hierarchy. Additionally, if limits are configured
in so-called unit files, it also creates cgroups in the appropriate
controller filesystems. For example, if you set a memory usage limit on
the slurmd service, then in addition to the /system/slurmd.service
cgroup in the systemd fs, it will also create a /system/slurmd.service
cgroup in the memory fs with the appropriate memory limits.
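For illustration, such a limit could be declared in the service's unit
file; here is a hypothetical excerpt (the ExecStart path is an
assumption, and MemoryLimit is the systemd 204-era directive backed by
the memory controller):

```ini
# Hypothetical excerpt of slurmd.service; setting MemoryLimit makes systemd
# create /system/slurmd.service in the memory controller fs, as described above.
[Service]
ExecStart=/usr/sbin/slurmd
MemoryLimit=2G
```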
By default for services, it simply creates a cgroup in the cpu,cpuacct
controller. For example with slurmd:
# cat /proc/`pgrep slurmd`/cgroup
9:perf_event:/
8:blkio:/
7:net_cls:/
6:freezer:/
5:devices:/
4:memory:/
3:cpuacct,cpu:/system/slurmd.service
2:cpuset:/
1:name=systemd:/system/slurmd.service
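Each line of this file has the form "hierarchy-id:controllers:path", so
the placement of a process can be inspected programmatically. A quick
awk sketch, with two of the slurmd entries above hard-coded as sample
input rather than read from /proc:

```shell
# Split /proc/<pid>/cgroup lines (hierarchy-id:controllers:path) on ':'
# and print which path each controller set uses.
sample='3:cpuacct,cpu:/system/slurmd.service
1:name=systemd:/system/slurmd.service'
echo "$sample" | awk -F: '{ printf "%s -> %s\n", $2, $3 }'
# prints:
# cpuacct,cpu -> /system/slurmd.service
# name=systemd -> /system/slurmd.service
```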
When all processes of a cgroup end, systemd is notified via the
execution of the release agent /lib/systemd/systemd-cgroups-agent in the
'systemd' fs. This program basically sends a D-Bus notification to the
systemd core daemon with the path of the empty cgroup as a parameter.
When the core daemon receives this notification, it looks through its
internal data structures for all associated cgroups in all controller
filesystems and deletes them. This is how the cgroup controller
filesystems are kept clean as they become empty.
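To make the mechanism concrete, here is a minimal, hypothetical
release-agent sketch (not systemd's agent nor one of Slurm's actual
release_* scripts): the kernel invokes the agent with the path of the
now-empty cgroup, relative to the hierarchy root, as its only argument,
and the usual job is to rmdir that directory. A temp directory stands in
for a mounted hierarchy so the demo needs no root privileges.

```shell
#!/bin/sh
# HIERARCHY_ROOT stands in for a cgroup mountpoint; a plain temp dir for the demo.
HIERARCHY_ROOT="${HIERARCHY_ROOT:-$(mktemp -d)}"

release_cgroup() {
    # rmdir succeeds only on an empty directory, which is precisely the
    # condition under which the kernel runs the release agent
    rmdir "${HIERARCHY_ROOT}$1" && echo "released $1"
}

mkdir -p "${HIERARCHY_ROOT}/slurm/uid_1000/job_27"
release_cgroup "/slurm/uid_1000/job_27"    # prints: released /slurm/uid_1000/job_27
```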
slurm
=====
Well, the question then is: how can slurmd and its cgroup plugins work
on top of that?
First, here is an excerpt of cgroups.txt in the Linux documentation[2]:
"If an active hierarchy with exactly the same set of subsystems already
exists, it will be reused for the new mount. If no existing hierarchy
matches, and any of the requested subsystems are in use in an existing
hierarchy, the mount will fail with -EBUSY. Otherwise, a new hierarchy
is activated, associated with the requested subsystems."
Therefore, if we configure slurmd to mount the already existing cpuset,
memory and freezer controller filesystems itself (on other mountpoints)
and to set its own release_agent for emptiness notification, it works.
The cleanup at the end of jobs is correctly done by the Slurm release
agent, and systemd does not complain.
Here is the corresponding cgroup.conf:
CgroupMountpoint=/cgroup
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm-llnl/cgroup"
(In slurm.conf, I enable the proctrack/cgroup and task/cgroup plugins,
but I avoided jobacct_gather/cgroup since it is still flagged as
"experimental" in the docs.)
Then:
# mkdir /cgroup
# mount -t tmpfs tmpfs /cgroup
# slurmd
My only source of sadness with this solution is the number of mounts:
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup
(rw,nosuid,nodev,noexec,relatime,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/cpuset type cgroup
(rw,nosuid,nodev,noexec,relatime,cpuset,release_agent=/etc/slurm-llnl/cgroup/release_cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup
(rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/memory type cgroup
(rw,nosuid,nodev,noexec,relatime,memory,release_agent=/etc/slurm-llnl/cgroup/release_memory)
cgroup on /sys/fs/cgroup/devices type cgroup
(rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup
(rw,nosuid,nodev,noexec,relatime,freezer,release_agent=/etc/slurm-llnl/cgroup/release_freezer)
cgroup on /sys/fs/cgroup/net_cls type cgroup
(rw,nosuid,nodev,noexec,relatime,net_cls)
cgroup on /sys/fs/cgroup/blkio type cgroup
(rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup
(rw,nosuid,nodev,noexec,relatime,perf_event)
tmpfs on /cgroup type tmpfs (rw,relatime)
cgroup on /cgroup/freezer type cgroup
(rw,nosuid,nodev,noexec,relatime,freezer,release_agent=/etc/slurm-llnl/cgroup/release_freezer)
cgroup on /cgroup/cpuset type cgroup
(rw,nosuid,nodev,noexec,relatime,cpuset,release_agent=/etc/slurm-llnl/cgroup/release_cpuset)
cgroup on /cgroup/memory type cgroup
(rw,nosuid,nodev,noexec,relatime,memory,release_agent=/etc/slurm-llnl/cgroup/release_memory)
Therefore, I tried to make Slurm use the controller filesystems already
mounted by systemd. In cgroup.conf, that looks like this:
CgroupMountpoint=/sys/fs/cgroup
CgroupAutomount=no
#CgroupReleaseAgentDir="/etc/slurm-llnl/cgroup"
With this configuration, slurmd does not set the release_agent in the
root directory of the controller filesystems (and commenting out the
CgroupReleaseAgentDir parameter does not change anything):
# for controller in perf_event net_cls freezer devices memory \
      cpu,cpuacct cpuset systemd; do
      echo "${controller}: $(cat /sys/fs/cgroup/${controller}/release_agent)"
  done
perf_event:
net_cls:
freezer:
devices:
memory:
cpu,cpuacct:
cpuset:
systemd: /lib/systemd/systemd-cgroups-agent
When a job is launched, the cgroups are properly created by slurmd:
# find /sys/fs/cgroup/ -path '*slurm*tasks'
/sys/fs/cgroup/freezer/slurm/uid_1000/job_27/step_batch/tasks
/sys/fs/cgroup/freezer/slurm/uid_1000/job_27/tasks
/sys/fs/cgroup/freezer/slurm/uid_1000/tasks
/sys/fs/cgroup/freezer/slurm/tasks
/sys/fs/cgroup/memory/slurm/uid_1000/job_27/step_batch/tasks
/sys/fs/cgroup/memory/slurm/uid_1000/job_27/tasks
/sys/fs/cgroup/memory/slurm/uid_1000/tasks
/sys/fs/cgroup/memory/slurm/tasks
/sys/fs/cgroup/cpu,cpuacct/system/slurmd.service/tasks
/sys/fs/cgroup/cpuset/slurm/uid_1000/job_27/step_batch/tasks
/sys/fs/cgroup/cpuset/slurm/uid_1000/job_27/tasks
/sys/fs/cgroup/cpuset/slurm/uid_1000/tasks
/sys/fs/cgroup/cpuset/slurm/tasks
/sys/fs/cgroup/systemd/system/slurmd.service/tasks
(my job 27 has one batch step running `sleep 600`)
# cat /proc/`pgrep sleep`/cgroup
9:perf_event:/
8:blkio:/
7:net_cls:/
6:freezer:/slurm/uid_1000/job_27/step_batch
5:devices:/
4:memory:/slurm/uid_1000/job_27/step_batch
3:cpuacct,cpu:/system/slurmd.service
2:cpuset:/slurm/uid_1000/job_27/step_batch
1:name=systemd:/system/slurmd.service
But when I cancel the job, some garbage is left in the cgroup
filesystems:
# find /sys/fs/cgroup/ -path '*slurm*tasks'
/sys/fs/cgroup/freezer/slurm/uid_1000/job_27/tasks
/sys/fs/cgroup/freezer/slurm/uid_1000/tasks
/sys/fs/cgroup/freezer/slurm/tasks
/sys/fs/cgroup/memory/slurm/tasks
/sys/fs/cgroup/cpu,cpuacct/system/slurmd.service/tasks
/sys/fs/cgroup/cpuset/slurm/uid_1000/job_27/step_batch/tasks
/sys/fs/cgroup/cpuset/slurm/uid_1000/job_27/tasks
/sys/fs/cgroup/cpuset/slurm/uid_1000/tasks
/sys/fs/cgroup/cpuset/slurm/tasks
/sys/fs/cgroup/systemd/system/slurmd.service/tasks
The systemd release agent was not called by the kernel, since the
cgroups were not present in the 'systemd' fs. The Slurm release script
was not called either, since it was not set as the release_agent in the
controller filesystems.
But strangely, the memory controller has been totally cleaned, and the
step_batch cgroup in the freezer controller has vanished. Actually, I
figured out that there is some cleanup logic which explains that result in:
- _slurm_cgroup_destroy() called by fini() in
src/plugins/proctrack/cgroup/proctrack_cgroup.c
- task_cgroup_memory_fini() in
src/plugins/task/cgroup/task_cgroup_memory.c
But there is none in task_cgroup_cpuset_fini() in
src/plugins/task/cgroup/task_cgroup_cpuset.c.
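The missing cpuset cleanup would presumably follow the same bottom-up
pattern as the existing fini() logic: a cgroup directory can only be
rmdir'd once it is empty, so the deepest directories must go first. A
hedged sketch, with plain directories standing in for a real cgroup
hierarchy and cleanup_slurm_tree being a name I made up, not a Slurm
function:

```shell
#!/bin/sh
# Remove an entire slurm cgroup subtree bottom-up; -depth makes find
# visit children before parents, so each rmdir sees an empty directory.
cleanup_slurm_tree() {
    find "$1" -mindepth 1 -depth -type d -exec rmdir {} \; 2>/dev/null
}

# demo: a temp dir plays the role of /sys/fs/cgroup/cpuset
root=$(mktemp -d)
mkdir -p "${root}/slurm/uid_1000/job_27/step_batch"
cleanup_slurm_tree "${root}"
[ -z "$(ls -A "${root}")" ] && echo "clean"    # prints: clean
```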
So finally, here are my questions:
- Is the cleanup logic in the plugins supposed to work, and is it simply
unfinished for some controllers, with the release_agent script being
just a workaround?
- Or is slurmd supposed to rely only on the release_agent for cleanup,
making the cleanup logic triggered by fini() in the plugins irrelevant?
- Or is it some mix of both that I just don't understand?
I would be glad to have your insights on this matter :) I would also
appreciate feedback from other people who have run other tests with
Slurm and systemd!
The funny thing about all of this is that it will become totally
irrelevant with upcoming Linux kernel releases (3.16+) and the ongoing
effort on the cgroup unified hierarchy[3][4]! So if modifications to
Slurm's cgroup management are planned, it would be wise to take this
into account.
[1] http://www.freedesktop.org/wiki/Software/systemd/
[2] https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt
[3] http://lwn.net/Articles/601840/
[4]
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/cgroups/unified-hierarchy.txt
Thank you for having taken the time to read this!
Regards,
--
Rémi Palancher
<remi-externe.palancher-***@public.gmane.org>