Discussion:
starting OpenMPI job directly with srun
Lev Givon
2014-09-23 18:49:31 UTC
I have OpenMPI 1.8.2 compiled with PMI support enabled and slurm 2.6.5 installed on an
8-CPU machine running Ubuntu 14.04.1. I noticed that attempting to run any
program compiled against said OpenMPI installation via srun using

srun -n X mpiexec program

with X > 1 effectively is equivalent to running

mpiexec -np X program

X times. Is this behavior expected? Running the program via sbatch results in only
a single run with X MPI processes.
--
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
Andy Riebs
2014-09-23 18:57:33 UTC
Lev, if you drop "mpiexec" from your command line, you should see the
desired behaviour, i.e.,

$ srun -n X program

(Also, be sure to recognize the difference between "-n" and "-N"!)
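
For example (the task and node counts below are only illustrative for a single
8-CPU machine):

$ srun -n 4 program        # -n/--ntasks: launch 4 MPI tasks
$ srun -N 1 -n 4 program   # -N/--nodes: the same 4 tasks, explicitly on 1 node

In short, -n sets how many tasks are launched, while -N sets how many nodes are
allocated for them.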

Andy
Post by Lev Givon
I have OpenMPI 1.8.2 compiled with PMI support enabled and slurm 2.6.5 installed on an
8-CPU machine running Ubuntu 14.04.1. I noticed that attempting to run any
program compiled against said OpenMPI installation via srun using
srun -n X mpiexec program
with X > 1 effectively is equivalent to running
mpiexec -np X program
X times. Is this behavior expected? Running the program via sbatch only causes 1
run over X MPI processes.
Lev Givon
2014-09-23 19:07:34 UTC
Post by Andy Riebs
Post by Lev Givon
I have OpenMPI 1.8.2 compiled with PMI support enabled and slurm 2.6.5 installed on an
8-CPU machine running Ubuntu 14.04.1. I noticed that attempting to run any
program compiled against said OpenMPI installation via srun using
srun -n X mpiexec program
with X > 1 effectively is equivalent to running
mpiexec -np X program
X times. Is this behavior expected? Running the program via sbatch only causes 1
run over X MPI processes.
Lev, if you drop "mpiexec" from your command line, you should see
the desired behaviour, i.e.,
$ srun -n X program
Doing so does launch the program only X times, but the communicator size seen by each
instance is 1, e.g., for the proverbial "Hello world" program, the output

Hello, world, I am 0 of 1 (myhost)

is generated X times.
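
For reference, the test program is essentially the textbook MPI hello world; a
minimal sketch of it (the exact output formatting and the use of
MPI_Get_processor_name are my assumptions here) is:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process' rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* should equal the srun -n count */
    MPI_Get_processor_name(host, &len);

    printf("Hello, world, I am %d of %d (%s)\n", rank, size, host);

    MPI_Finalize();
    return 0;
}

Built with mpicc and launched via a working PMI-aware "srun -n 4", it should print
"... of 4" from four distinct ranks rather than "... of 1" four times.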

Incidentally, I verified that OpenMPI was built against PMI successfully:

$ ldd /opt/openmpi-1.8.2/bin/mpiexec | grep pmi
libpmi.so.0 => /usr/lib/libpmi.so.0 (0x00002aed18f66000)
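
A complementary check from the OpenMPI side (assuming a --with-pmi build registers
PMI-flavored MCA components, which is my understanding for 1.8.x) would be:

$ /opt/openmpi-1.8.2/bin/ompi_info | grep -i pmi

which should list components such as "MCA ess: pmi" in addition to the library
dependency shown above.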
Post by Andy Riebs
(Also, be sure to recognize the difference between "-n" and "-N"!)
--
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
Andy Riebs
2014-09-23 19:57:31 UTC
Ahhh... try adding "--mpi=pmi" or "--mpi=pmi2" to your srun command.

Andy

p.s. If this fixes it, you might want to set the mpi default in
slurm.conf appropriately.
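
As a sketch (assuming your Slurm build ships the pmi2 plugin; the task count below
is arbitrary), that would look like:

$ srun --mpi=list            # list the MPI plugin types this srun supports
$ srun --mpi=pmi2 -n 4 program

and, to make it the default, in slurm.conf:

MpiDefault=pmi2

followed by re-reading the configuration (e.g. with "scontrol reconfigure").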
Post by Lev Givon
Post by Andy Riebs
Post by Lev Givon
I have OpenMPI 1.8.2 compiled with PMI support enabled and slurm 2.6.5 installed on an
8-CPU machine running Ubuntu 14.04.1. I noticed that attempting to run any
program compiled against said OpenMPI installation via srun using
srun -n X mpiexec program
with X > 1 effectively is equivalent to running
mpiexec -np X program
X times. Is this behavior expected? Running the program via sbatch only causes 1
run over X MPI processes.
Lev, if you drop "mpiexec" from your command line, you should see
the desired behaviour, i.e.,
$ srun -n X program
Doing so does launch the program only X times, but the communicator size seen by each
instance is 1, e.g., for the proverbial "Hello world" program, the output
Hello, world, I am 0 of 1 (myhost)
is generated X times.
$ ldd /opt/openmpi-1.8.2/bin/mpiexec | grep pmi
libpmi.so.0 => /usr/lib/libpmi.so.0 (0x00002aed18f66000)
Post by Andy Riebs
(Also, be sure to recognize the difference between "-n" and "-N"!)
Lev Givon
2014-09-23 20:45:30 UTC
(snip)
Post by Andy Riebs
Post by Lev Givon
Post by Andy Riebs
Lev, if you drop "mpiexec" from your command line, you should see
the desired behaviour, i.e.,
$ srun -n X program
Doing so does launch the program only X times, but the communicator size seen by each
instance is 1, e.g., for the proverbial "Hello world" program, the output
Hello, world, I am 0 of 1 (myhost)
is generated X times.
$ ldd /opt/openmpi-1.8.2/bin/mpiexec | grep pmi
libpmi.so.0 => /usr/lib/libpmi.so.0 (0x00002aed18f66000)
Ahhh... try adding "--mpi=pmi" or "--mpi=pmi2" to your srun command.
Andy
That did the trick - thanks!
Post by Andy Riebs
p.s. If this fixes it, you might want to set the mpi default in
slurm.conf appropriately.
--
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
Kiran Thyagaraja
2014-09-23 19:29:34 UTC
You should be doing:

srun -n X program

instead of srun -n X mpiexec ...

Thanks,
Kiran
Post by Lev Givon
I have OpenMPI 1.8.2 compiled with PMI support enabled and slurm 2.6.5 installed on an
8-CPU machine running Ubuntu 14.04.1. I noticed that attempting to run any
program compiled against said OpenMPI installation via srun using
srun -n X mpiexec program
with X > 1 effectively is equivalent to running
mpiexec -np X program
X times. Is this behavior expected? Running the program via sbatch only causes 1
run over X MPI processes.