Discussion:
shellshock patch uses a different function export format, causing some errors on our Slurm cluster
John Brunelle
2014-09-26 22:29:33 UTC
Permalink
Though I hope everyone is putting the bash shellshock patching in
their rearview mirror, it might still help to be aware of a change to
function exports that the latest version introduced. Instead of the
corresponding environment variable being named "myfunction", it's now
"BASH_FUNC_myfunction()".

This caused a bit of trouble for us when we patched some head nodes
before compute nodes. Since job environments are created on the
submission host, but run on the compute host, the compute hosts didn't
understand/accept the environment variable definition. Along with the
error message, our jobs lost the ability to load software environment
modules (which are implemented as bash functions).
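A quick way to check whether the function makes it into a job's environment
is something like:

srun bash -c 'type module'   # reports "module is a function" when the import
                             # worked; complains it's not found otherwise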

Though not specific to Slurm, I think it's relevant here because of
the sharing of environments across hosts that comes up in this
context. We wrote up a bit more detail here:

https://rc.fas.harvard.edu/shellshock-update-issue-temporarily-affecting-slurm-jobs-software-modules/

Hope that helps someone,

John

John Brunelle
Harvard University FAS Research Computing, Informatics and Scientific
Applications
john_brunelle-***@public.gmane.org @jabrcx
Marcin Stolarek
2014-09-28 09:53:34 UTC
Permalink
Post by John Brunelle
Though I hope everyone is putting the bash shellshock patching in
their rearview mirror, it might still help to be aware of a change to
function exports that the latest version introduced. Instead of the
corresponding environment variable being named "myfunction", it's now
"BASH_FUNC_myfunction()".
This caused a bit of trouble for us when we patched some head nodes
before compute nodes. Since job environments are created on the
submission host, but run on the compute host, the compute hosts didn't
understand/accept the environment variable definition. Along with the
error message, our jobs lost the ability to load software environment
modules (which are implemented as bash functions).
Though not specific to Slurm, I think it's relevant here because of
the sharing of environments across hosts that comes up in this
context. We wrote up a bit more detail here:
https://rc.fas.harvard.edu/shellshock-update-issue-temporarily-affecting-slurm-jobs-software-modules/
Hope that helps someone,
John
John Brunelle
Harvard University FAS Research Computing, Informatics and Scientific
Applications
Do I understand you correctly that in your configuration it is possible to
start an interactive shell with "srun --pty bash", and that because this is a
non-login shell the environment has to be set on the submit host?

We force our users to always start bash in login (-l) mode; in that case the
environment is set up on the worker nodes. I believe that's common.
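That is, roughly:

srun --pty bash -l   # login shell, so /etc/profile and friends are read on
                     # the compute node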

cheers,
marcin
John Brunelle
2014-09-29 16:21:50 UTC
Permalink
Post by Marcin Stolarek
Do I understand you correctly that in your configuration it is possible to
start an interactive shell with "srun --pty bash", and that because this is a
non-login shell the environment has to be set on the submit host?
We force our users to always start bash in login (-l) mode; in that case the
environment is set up on the worker nodes. I believe that's common.
This affects even non-interactive jobs submitted with sbatch. Environment
modifications done on the submission host in the submitting shell are
re-created in the job's environment on the compute node. It's arguably
better practice to capture all of that in the job script, but many people
(where I work, at least) do it before submitting the job.

I'm not aware of anything we're setting to force bash login mode one way or
the other. I would guess that might affect how the base environment is
built, but I believe Slurm will still propagate the relevant submission
environment after that in either case.
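For what it's worth, sbatch's --export option (ALL by default) controls this
propagation; something like the following, with a placeholder script name,
would skip copying the submitting shell's environment:

sbatch --export=NONE myjob.sh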

Best,

John
Christopher Samuel
2014-09-29 05:27:33 UTC
Permalink
Post by John Brunelle
This caused a bit of trouble for us when we patched some head nodes
before compute nodes.
We did some testing to confirm that:

A) If you update a login node before the compute nodes, jobs will fail as
John describes.

B) If you update a compute node when there are jobs queued under the
previous bash then they will fail when they run there (also cannot find
modules, even though a prologue of ours sets BASH_ENV to force the env
vars to get set).


Our way to (hopefully safely) upgrade our x86-64 clusters was the following
(a rough sketch of the matching scontrol commands appears after the list):

0) Note that our slurmctld runs on the cluster management node which is
separate to the login nodes and not accessible to users.

1) Kick all the users off the login nodes, update bash, reboot them
(ours come back with nologin enabled to stop users getting back on
before we're ready).

2) Set all partitions down to stop new jobs starting

3) Move all compute nodes to an "old" partition

4) Move all queued (pending) jobs to the "old" partition

5) Update bash on any idle nodes and move them back to our "main"
(default) partition

6) Set an AllowGroups on the "old" partition so users can't submit jobs
to it by accident.

7) Let users back onto the login nodes.

8) Set partitions back to "up" to start jobs going again.
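For those who want the concrete commands, steps 2-6 map onto scontrol roughly
like this (partition names, node ranges, the job ID and the group are all just
placeholders, and scontrol changes like these don't survive a slurmctld
restart unless slurm.conf is updated as well):

scontrol update PartitionName=main State=DOWN            # 2) repeat per partition
scontrol update PartitionName=old Nodes=node[001-100]    # 3) unpatched nodes into "old"
scontrol update JobId=12345 Partition=old                # 4) repeat per pending job
scontrol update PartitionName=main Nodes=node[101-200]   # 5) patched idle nodes back in the default partition
scontrol update PartitionName=old AllowGroups=sysadmin   # 6) block accidental submissions
scontrol update PartitionName=main State=UP              # 8) start jobs going again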


Hope this helps folks..

cheers!
Chris
--
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel-***@public.gmane.org Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
Alan Orth
2014-09-29 09:09:31 UTC
Permalink
Wow, well spotted. I came here to see if anyone had reported this same
issue with environment modules, as I noticed several of my jobs failing
on our cluster this morning. Turns out, I'm probably the only one who
had failed jobs, as I have a long-running tmux session open on the head
node, and therefore old bash. ;)
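For anyone else in the same boat, comparing the shell you're sitting in with
the freshly installed binary makes the mismatch obvious:

echo $BASH_VERSION          # the long-running shell, started before the update
bash --version | head -1    # the bash binary installed now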

Other users wouldn't have noticed because we updated all of our
infrastructure in one go using ansible[0] last Friday.

In any case, glad to be in good company. Cheers!

Alan

[0]
http://mjanja.co.ke/2014/09/update-hosts-via-ansible-to-mitigate-bash-shellshock-vulnerability/
Post by Christopher Samuel
Post by John Brunelle
This caused a bit of trouble for us when we patched some head nodes
before compute nodes.
A) If you update a login node before the compute nodes, jobs will fail as
John describes.
B) If you update a compute node when there are jobs queued under the
previous bash then they will fail when they run there (also cannot find
modules, even though a prologue of ours sets BASH_ENV to force the env
vars to get set).
0) Note that our slurmctld runs on the cluster management node which is
separate to the login nodes and not accessible to users.
1) Kick all the users off the login nodes, update bash, reboot them
(ours come back with nologin enabled to stop users getting back on
before we're ready).
2) Set all partitions down to stop new jobs starting
3) Move all compute nodes to an "old" partition
4) Move all queued (pending) jobs to the "old" partition
5) Update bash on any idle nodes and move them back to our "main"
(default) partition
6) Set an AllowGroups on the "old" partition so users can't submit jobs
to it by accident.
7) Let users back onto the login nodes.
8) Set partitions back to "up" to start jobs going again.
Hope this helps folks..
cheers!
Chris
--
Alan Orth
alan.orth-***@public.gmane.org
http://alaninkenya.org
http://mjanja.co.ke
"I have always wished for my computer to be as easy to use as my telephone; my wish has come true because I can no longer figure out how to use my telephone." -Bjarne Stroustrup, inventor of C++
GPG public key ID: 0x8cb0d0acb5cd81ec209c6cdfbd1a0e09c2f836c0
Marcin Stolarek
2014-09-29 10:17:35 UTC
Permalink
Post by Alan Orth
Wow, well spotted. I came here to see if anyone had reported this same
issue with environment modules, as I noticed several of my jobs failing
on our cluster this morning. Turns out, I'm probably the only one who
had failed jobs, as I have a long-running tmux session open on the head
node, and therefore old bash. ;)
Other users wouldn't have noticed because we updated all of our
infrastructure in one go using ansible[0]
^^^^^^^^^ +1 :)
Post by Alan Orth
last Friday.
In any case, glad to be in good company. Cheers!
Alan
[0]
http://mjanja.co.ke/2014/09/update-hosts-via-ansible-to-mitigate-bash-shellshock-vulnerability/
Chris Samuel
2014-09-29 12:30:29 UTC
Permalink
Post by Alan Orth
Other users wouldn't have noticed because we updated all of our
infrastructure in one go using ansible[0] last Friday.
We use xCAT to manage our clusters, and whilst we could have done that if we
had wished, it would have caused any jobs queued before the bash upgrade to
fail when they finally got onto a compute node.

I don't think that would have been a popular move. :-)

All the best,
Chris
--
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel-***@public.gmane.org Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
Alan Orth
2014-09-29 13:20:33 UTC
Permalink
True. We're lucky, our queue is very short! Also, to be honest, I was
mainly thinking of my web servers, etc., when I ran the updates, as the
list of shellshock vectors is quite expansive and covers bash releases
from 1994 to 2014! I didn't realize until afterwards that modules were
implemented as:

module() { blah; }
export -f module

!
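For the record, the classic environment-modules init defines it along these
lines (the modulecmd path varies between installs):

module() { eval `/usr/bin/modulecmd bash $*`; }
export -f module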

Alan
Post by Chris Samuel
Post by Alan Orth
Other users wouldn't have noticed because we updated all of our
infrastructure in one go using ansible[0] last Friday.
We use xCAT to manage our clusters, and whilst we could have done that if we
had wished, it would have caused any jobs queued before the bash upgrade to
fail when they finally got onto a compute node.
I don't think that would have been a popular move. :-)
All the best,
Chris
--
Alan Orth
alan.orth-***@public.gmane.org
http://alaninkenya.org
http://mjanja.co.ke
"I have always wished for my computer to be as easy to use as my telephone; my wish has come true because I can no longer figure out how to use my telephone." -Bjarne Stroustrup, inventor of C++
GPG public key ID: 0x8cb0d0acb5cd81ec209c6cdfbd1a0e09c2f836c0
John Brunelle
2014-09-29 16:27:30 UTC
Permalink
Post by Christopher Samuel
B) If you update a compute node when there are jobs queued under the
previous bash then they will fail when they run there (also cannot find
modules, even though a prologue of ours sets BASH_ENV to force the env
vars to get set).
Thanks for making this point more clear. I thought this was the case,
but then had trouble reproducing it, chalking it up to BASH_ENV or
some other thing we're doing special in our transition from legacy
environment modules to Lmod. I'm going to take a closer look at
our pending jobs to see if any will fail because of this.
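Probably just something like:

squeue -t PD -o '%i %u %V'   # pending job IDs, users, and submission times

and checking the submission times against when we patched.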

John
Christopher Samuel
2014-10-02 01:22:32 UTC
Permalink
Post by Christopher Samuel
B) If you update a compute node when there are jobs queued under the
previous bash then they will fail when they run there (also cannot find
modules, even though a prologue of ours sets BASH_ENV to force the env
vars to get set).
Well, embarrassingly, it turns out that the reason BASH_ENV didn't work
was a typo on our part. :-(

-echo export BASH_ENV=/etc/profile.d/modules.sh
+echo export BASH_ENV=/etc/profile.d/module.sh

Found when trying to debug why a user who had tcsh as his default shell
couldn't run jobs after the bash upgrade (his job script had #!/bin/sh).
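For reference, BASH_ENV names a file that a non-interactive bash (invoked as
bash) reads at startup, which is what the prolog line above relies on. An easy
way to test it by hand, with a placeholder path for wherever the modules init
file lives:

BASH_ENV=/path/to/modules-init.sh bash -c 'type module'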

I'll see if I can find a way to retest with an older bash.

cheers,
Chris
--
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel-***@public.gmane.org Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci