Discussion:
shellshock patch uses a different function export format, causing some errors on our Slurm cluster
John Brunelle
2014-09-26 22:29:33 UTC
Permalink
Though I hope everyone is putting the bash shellshock patching in
their rearview mirror, it might still help to be aware of a change to
function exports that the latest version introduced. Instead of the
corresponding environment variable being named "myfunction", it's now
"BASH_FUNC_myfunction()".

This caused a bit of trouble for us when we patched some head nodes
before compute nodes. Since job environments are created on the
submission host, but run on the compute host, the compute hosts didn't
understand/accept the environment variable definition. Along with the
error message, our jobs lost the ability to load software environment
modules (which are implemented as bash functions).
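A quick way to check whether the function makes it into a job's environment
is something like:

srun bash -c 'type module'   # reports "module is a function" when the import
                             # worked; complains it's not found otherwise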

Though not specific to Slurm, I think it's relevant here because of
the sharing of environments across hosts that comes up in this
context. We wrote up a bit more detail here:

https://rc.fas.harvard.edu/shellshock-update-issue-temporarily-affecting-slurm-jobs-software-modules/

Hope that helps someone,

John

John Brunelle
Harvard University FAS Research Computing, Informatics and Scientific
Applications
john_brunelle-***@public.gmane.org @jabrcx
Marcin Stolarek
2014-09-28 09:53:34 UTC
Permalink
Post by John Brunelle
Though I hope everyone is putting the bash shellshock patching in
their rearview mirror, it might still help to be aware of a change to
function exports that the latest version introduced. Instead of the
corresponding environment variable being named "myfunction", it's now
"BASH_FUNC_myfunction()".
This caused a bit of trouble for us when we patched some head nodes
before compute nodes. Since job environments are created on the
submission host, but run on the compute host, the compute hosts didn't
understand/accept the environment variable definition. Along with the
error message, our jobs lost the ability to load software environment
modules (which are implemented as bash functions).
Though not specific to Slurm, I think it's relevant here because of
the sharing of environments across hosts that comes up in this
context. We wrote up a bit more detail here:
https://rc.fas.harvard.edu/shellshock-update-issue-temporarily-affecting-slurm-jobs-software-modules/
Hope that helps someone,
John
John Brunelle
Harvard University FAS Research Computing, Informatics and Scientific
Applications
Do I understand you correctly that in your configuration it is possible to
start an interactive shell with "srun --pty bash", and that because this is a
non-login shell the environment has to be set on the submit host?

We force our users to always start bash in login (-l) mode; in that case the
environment is set up on the worker nodes. I believe that's common.
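That is, roughly:

srun --pty bash -l   # login shell, so /etc/profile and friends are read on
                     # the compute node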

cheers,
marcin
John Brunelle
2014-09-29 16:21:50 UTC
Permalink
Post by Marcin Stolarek
Do I understand you correctly that in your configuration it is possible to
start an interactive shell with "srun --pty bash", and that because this is a
non-login shell the environment has to be set on the submit host?
We force our users to always start bash in login (-l) mode; in that case the
environment is set up on the worker nodes. I believe that's common.
This affects even non-interactive jobs submitted with sbatch. Environment
modifications done on the submission host in the submitting shell are
re-created in the job's environment on the compute node. It's arguably
better practice to capture all of that in the job script, but many people
(where I work, at least) do it before submitting the job.

I'm not aware of anything we're setting to force bash login mode one way or
the other. I would guess that might affect how the base environment is
built, but I believe Slurm will still propagate the relevant submission
environment after that in either case.
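For what it's worth, sbatch's --export option (ALL by default) controls this
propagation; something like the following, with a placeholder script name,
would skip copying the submitting shell's environment:

sbatch --export=NONE myjob.sh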

Best,

John
Christopher Samuel
2014-09-29 05:27:33 UTC
Permalink
Post by John Brunelle
This caused a bit of trouble for us when we patched some head nodes
before compute nodes.
We did some testing to confirm that:

A) If you update a login node before the compute nodes, jobs will fail as
John describes.

B) If you update a compute node when there are jobs queued under the
previous bash then they will fail when they run there (also cannot find
modules, even though a prologue of ours sets BASH_ENV to force the env
vars to get set).


Our way to (hopefully safely) upgrade our x86-64 clusters was the following
(a rough sketch of the matching scontrol commands appears after the list):

0) Note that our slurmctld runs on the cluster management node which is
separate to the login nodes and not accessible to users.

1) Kick all the users off the login nodes, update bash, reboot them
(ours come back with nologin enabled to stop users getting back on
before we're ready).

2) Set all partitions down to stop new jobs starting

3) Move all compute nodes to an "old" partition

4) Move all queued (pending) jobs to the "old" partition

5) Update bash on any idle nodes and move them back to our "main"
(default) partition

6) Set an AllowGroups on the "old" partition so users can't submit jobs
to it by accident.

7) Let users back onto the login nodes.

8) Set partitions back to "up" to start jobs going again.
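For those who want the concrete commands, steps 2-6 map onto scontrol roughly
like this (partition names, node ranges, the job ID and the group are all just
placeholders, and scontrol changes like these don't survive a slurmctld
restart unless slurm.conf is updated as well):

scontrol update PartitionName=main State=DOWN            # 2) repeat per partition
scontrol update PartitionName=old Nodes=node[001-100]    # 3) unpatched nodes into "old"
scontrol update JobId=12345 Partition=old                # 4) repeat per pending job
scontrol update PartitionName=main Nodes=node[101-200]   # 5) patched idle nodes back in the default partition
scontrol update PartitionName=old AllowGroups=sysadmin   # 6) block accidental submissions
scontrol update PartitionName=main State=UP              # 8) start jobs going again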


Hope this helps folks..

cheers!
Chris
--
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel-***@public.gmane.org Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
Alan Orth
2014-09-29 09:09:31 UTC
Permalink
Wow, well spotted. I came here to see if anyone had reported this same
issue with environment modules, as I noticed several of my jobs failing
on our cluster this morning. Turns out, I'm probably the only one who
had failed jobs, as I have a long-running tmux session open on the head
node, and therefore old bash. ;)
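For anyone else in the same boat, comparing the shell you're sitting in with
the freshly installed binary makes the mismatch obvious:

echo $BASH_VERSION          # the long-running shell, started before the update
bash --version | head -1    # the bash binary installed now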

Other users wouldn't have noticed because we updated all of our
infrastructure in one go using ansible[0] last Friday.

In any case, glad to be in good company. Cheers!

Alan

[0]
http://mjanja.co.ke/2014/09/update-hosts-via-ansible-to-mitigate-bash-shellshock-vulnerability/
Post by Christopher Samuel
Post by John Brunelle
This caused a bit of trouble for us when we patched some head nodes
before compute nodes.
A) If you update a login node before the compute nodes, jobs will fail as
John describes.
B) If you update a compute node when there are jobs queued under the
previous bash then they will fail when they run there (also cannot find
modules, even though a prologue of ours sets BASH_ENV to force the env
vars to get set).
0) Note that our slurmctld runs on the cluster management node which is
separate to the login nodes and not accessible to users.
1) Kick all the users off the login nodes, update bash, reboot them
(ours come back with nologin enabled to stop users getting back on
before we're ready).
2) Set all partitions down to stop new jobs starting
3) Move all compute nodes to an "old" partition
4) Move all queued (pending) jobs to the "old" partition
5) Update bash on any idle nodes and move them back to our "main"
(default) partition
6) Set an AllowGroups on the "old" partition so users can't submit jobs
to it by accident.
7) Let users back onto the login nodes.
8) Set partitions back to "up" to start jobs going again.
Hope this helps folks..
cheers!
Chris
--
Alan Orth
alan.orth-***@public.gmane.org
http://alaninkenya.org
http://mjanja.co.ke
"I have always wished for my computer to be as easy to use as my telephone; my wish has come true because I can no longer figure out how to use my telephone." -Bjarne Stroustrup, inventor of C++
GPG public key ID: 0x8cb0d0acb5cd81ec209c6cdfbd1a0e09c2f836c0
Marcin Stolarek
2014-09-29 10:17:35 UTC
Permalink
Post by Alan Orth
Wow, well spotted. I came here to see if anyone had reported this same
issue with environment modules, as I noticed several of my jobs failing
on our cluster this morning. Turns out, I'm probably the only one who
had failed jobs, as I have a long-running tmux session open on the head
node, and therefore old bash. ;)
Other users wouldn't have noticed because we updated all of our
infrastructure in one go using ansible[0]
^^^^^^^^^ +1 :)
Post by Alan Orth
last Friday.
In any case, glad to be in good company. Cheers!
Alan
[0]
http://mjanja.co.ke/2014/09/update-hosts-via-ansible-to-mitigate-bash-shellshock-vulnerability/
Chris Samuel
2014-09-29 12:30:29 UTC
Permalink
Post by Alan Orth
Other users wouldn't have noticed because we updated all of our
infrastructure in one go using ansible[0] last Friday.
We use xCAT to manage our clusters, and whilst we could have done that if we
had wished, it would have caused any jobs queued before the bash upgrade to
fail when they finally got onto a compute node.

I don't think that would have been a popular move. :-)

All the best,
Chris
--
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel-***@public.gmane.org Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
Alan Orth
2014-09-29 13:20:33 UTC
Permalink
True. We're lucky, our queue is very short! Also, to be honest, I was
mainly thinking of my web servers, etc., when I ran the updates, as the
list of shellshock vectors is quite expansive and covers bash releases
from 1994 to 2014! I didn't realize until afterwards that modules were
implemented as:

module() { blah; }
export -f module

!
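For the record, the classic environment-modules init defines it along these
lines (the modulecmd path varies between installs):

module() { eval `/usr/bin/modulecmd bash $*`; }
export -f module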

Alan
Post by Chris Samuel
Post by Alan Orth
Other users wouldn't have noticed because we updated all of our
infrastructure in one go using ansible[0] last Friday.
We use xCAT to manage our clusters, and whilst we could have done that if we
had wished, it would have caused any jobs queued before the bash upgrade to
fail when they finally got onto a compute node.
I don't think that would have been a popular move. :-)
All the best,
Chris
--
Alan Orth
alan.orth-***@public.gmane.org
http://alaninkenya.org
http://mjanja.co.ke
"I have always wished for my computer to be as easy to use as my telephone; my wish has come true because I can no longer figure out how to use my telephone." -Bjarne Stroustrup, inventor of C++
GPG public key ID: 0x8cb0d0acb5cd81ec209c6cdfbd1a0e09c2f836c0
John Brunelle
2014-09-29 16:27:30 UTC
Permalink
Post by Christopher Samuel
B) If you update a compute node when there are jobs queued under the
previous bash then they will fail when they run there (also cannot find
modules, even though a prologue of ours sets BASH_ENV to force the env
vars to get set).
Thanks for making this point more clear. I thought this was the case,
but then had trouble reproducing it, chalking it up to BASH_ENV or
some other thing we're doing special in our transition from legacy
environment modules to Lmod. I'm going to take a closer look at
our pending jobs to see if any will fail because of this.
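Probably just something like:

squeue -t PD -o '%i %u %V'   # pending job IDs, users, and submission times

and checking the submission times against when we patched.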

John
Christopher Samuel
2014-10-02 01:22:32 UTC
Permalink
Post by Christopher Samuel
B) If you update a compute node when there are jobs queued under the
previous bash then they will fail when they run there (also cannot find
modules, even though a prologue of ours sets BASH_ENV to force the env
vars to get set).
Well, embarrassingly, it turns out that the reason BASH_ENV didn't work
was a typo on our part. :-(

-echo export BASH_ENV=/etc/profile.d/modules.sh
+echo export BASH_ENV=/etc/profile.d/module.sh

Found when trying to debug why a user who had tcsh as his default shell
couldn't run jobs after the bash upgrade (his job script had #!/bin/sh).
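For reference, BASH_ENV names a file that a non-interactive bash (invoked as
bash) reads at startup, which is what the prolog line above relies on. An easy
way to test it by hand, with a placeholder path for wherever the modules init
file lives:

BASH_ENV=/path/to/modules-init.sh bash -c 'type module'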

I'll see if I can find a way to retest with an older bash.

cheers,
Chris
--
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel-***@public.gmane.org Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci