Discussion:
Running batch jobs to handle change management via Puppet
Trey Dockendorf
2014-10-16 04:48:30 UTC
I have begun using batch jobs to apply Puppet changes across nodes and
have noticed that some changes made by Puppet do not persist. For
example, Puppet ran and restarted Zabbix, but after the restart I found
that zabbix was not running on any node. As a debug measure I added
"service zabbix-agent status" after the Puppet apply command, and the
agent is running after the Puppet changes and before the job finishes.
This leads me to believe that Slurm is somehow killing any processes
started during the job. That is simply a hunch at this point, and I am
unsure how to dig deeper.

The batch script is this:

#!/bin/bash
#SBATCH -p admin
#SBATCH --qos=admin
#SBATCH -N 1
#SBATCH --exclusive
#SBATCH --time=00:15:00
#SBATCH --begin=now+20
#SBATCH -o /home/admin/puppet_batch/slurm-%j.out
#SBATCH --workdir=/home/admin/puppet_batch

# Apply the Puppet catalog.
/usr/local/sbin/papply

# Debug: print the agent status and start it again if it is not running.
if ! /etc/init.d/zabbix-agent status; then
    /etc/init.d/zabbix-agent start
fi

exit 0

The papply script executes:
https://gist.github.com/treydock/2bf5e08caf4284c19c14

The output I got from that batch job above was:

< puppet output about changes, and restarting Service[zabbix-agent]>
Notice: Finished catalog run in 30.24 seconds
zabbix_agentd (pid 16347) is running...
slurmstepd: _slurm_cgroup_destroy: problem deleting step cgroup path /cgroup/freezer/slurm/uid_0/job_27858/step_batch: Device or resource busy

I then logged into the node, and 'service zabbix-agent status' returned
"zabbix_agentd dead but pid file exists". I am using cgroups for
TaskPlugin and ProctrackType, and am unsure whether cgroups or
something internal to Slurm would kill a service started during the
job.
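
One possible way to dig deeper is to check which cgroup the restarted
daemon ends up in. The following is only a minimal sketch, not something
taken from the thread: the in_slurm_cgroup helper name is hypothetical,
and it assumes the Slurm hierarchy is named "slurm", as in the
slurmstepd message above. It reads a /proc/<pid>/cgroup style file and
reports whether any entry lives under /slurm/.

```shell
#!/bin/bash
# Report whether a process sits inside a Slurm-managed cgroup.
# Assumption: the hierarchy is named "slurm", as in the path
# /cgroup/freezer/slurm/uid_0/job_27858/step_batch quoted above.

# in_slurm_cgroup FILE  (hypothetical helper)
# FILE is a /proc/<pid>/cgroup style file whose lines look like
# "id:controllers:path". Returns 0 if any path starts with /slurm/,
# 1 otherwise (including when FILE cannot be read).
in_slurm_cgroup() {
    local file=$1 id controllers path
    [ -r "$file" ] || return 1
    while IFS=: read -r id controllers path; do
        case "$path" in
            /slurm/*) return 0 ;;
        esac
    done < "$file"
    return 1
}

# Default to the current shell; pass a PID to inspect a daemon instead.
pid=${1:-self}
if in_slurm_cgroup "/proc/$pid/cgroup"; then
    echo "process $pid is inside a Slurm cgroup"
else
    echo "process $pid is not inside a Slurm cgroup"
fi
```

Called from the batch script with the PID reported by the status command
(16347 in the output above), this would show whether the agent restarted
by Puppet is still tracked as part of the job.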

Any experience at other sites in doing administrative tasks (restart
service, run config changes, etc) as jobs? My goal with this approach
was to make changes that were quick and would not interfere with
running jobs (hence not using clush or other similar tools) and by
making the administrative tasks a job that's exclusive, it would
essentially block users from the node while the task ran.

Thanks,
- Trey

=============================

Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: treydock-mRW4Vj+***@public.gmane.org
Jabber: treydock-mRW4Vj+***@public.gmane.org
Christopher Samuel
2014-10-16 04:59:39 UTC
Post by Trey Dockendorf
I have begun using batch jobs to apply Puppet changes across nodes and
have noticed that some changes made by Puppet are not working. For
example Puppet ran and restarted Zabbix. I noticed after the restart
that all nodes had zabbix not running.
[...]
Post by Trey Dockendorf
I am using cgroups for TaskPlugin and ProctrackType, and am unsure
if the cgroups or something internal to slurm would kill a service
started during the job.
Slurm is working as designed. Generally a job should not leave processes
lying around, so Slurm will kill anything you start up there. Cgroups
are just the way it tracks what was started (processes cannot escape
unless they are running as root and take actions to do so).

You'll probably want to try unconfiguring cgroups for your use case, and
you may also need to run the init script with 'nohup'.
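
As a rough sketch of the 'nohup' part of that suggestion (the
run_detached helper name is hypothetical and does not come from the
thread): nohup only makes the command ignore SIGHUP and does not move it
out of the job's cgroup, so on its own it would not be expected to help
while cgroup-based process tracking is still configured.

```shell
#!/bin/bash
# run_detached CMD [ARGS...]  (hypothetical helper)
# Start CMD in the background under nohup, detached from the job's
# stdin/stdout/stderr, and print the PID of the background process.
# Note: this only guards against SIGHUP; it does not change which
# cgroup the process belongs to.
run_detached() {
    nohup "$@" </dev/null >/dev/null 2>&1 &
    echo "$!"
}

# Example use inside the batch script (path taken from the script above):
#   run_detached /etc/init.d/zabbix-agent start
if [ "$#" -gt 0 ]; then
    run_detached "$@"
fi
```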

cheers,
Chris
--
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel-***@public.gmane.org Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
Trey Dockendorf
2014-10-16 05:16:38 UTC
Chris,

Thanks for the response. I figured cgroups were the reason the service
was being killed if started during a job, but it's good to have
confirmation. I think this is just a use case I have to abandon, as
cgroups are too useful in our environment to unconfigure them for
sysadmin-specific jobs.

Thanks,
- Trey