Discussion:
srun: error: slurm_send_recv_rc_msg_only_one to : Connection refused
Fabricio Kyt
2014-07-09 15:49:46 UTC
Hi All - I've been running a parallel application using OpenMPI and SLURM
and getting the error messages below. The same application runs fine on
another cluster with Torque, so I suspect I'm missing some kind of SLURM
configuration setting.
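
For context, the failing job is launched with a plain srun invocation roughly like the following (the binary name here is only a placeholder; the log below shows the job running on 16 nodes with one task per node):

srun -N 16 -n 16 ./my_mpi_app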

P.S. Simple commands like srun -N 128 hostname work fine.

Any help is greatly appreciated. Thanks!

=======================================================

srun: auth plugin for Munge (http://code.google.com/p/munge/) loaded

srun: jobid 8232: nodes(16):`n[0242-0257]', cpu counts: 16(x16)

srun: launching 8232.0 on host n0242, 1 tasks: 0

srun: launching 8232.0 on host n0243, 1 tasks: 1

srun: launching 8232.0 on host n0244, 1 tasks: 2

srun: launching 8232.0 on host n0245, 1 tasks: 3

srun: launching 8232.0 on host n0246, 1 tasks: 4

srun: launching 8232.0 on host n0247, 1 tasks: 5

srun: launching 8232.0 on host n0248, 1 tasks: 6

srun: launching 8232.0 on host n0249, 1 tasks: 7

srun: launching 8232.0 on host n0250, 1 tasks: 8

srun: launching 8232.0 on host n0251, 1 tasks: 9

srun: launching 8232.0 on host n0252, 1 tasks: 10

srun: launching 8232.0 on host n0253, 1 tasks: 11

srun: launching 8232.0 on host n0254, 1 tasks: 12

srun: launching 8232.0 on host n0255, 1 tasks: 13

srun: launching 8232.0 on host n0256, 1 tasks: 14

srun: launching 8232.0 on host n0257, 1 tasks: 15

srun: Node n0244, 1 tasks started

srun: Node n0243, 1 tasks started

srun: Node n0245, 1 tasks started

srun: Node n0242, 1 tasks started

srun: Node n0247, 1 tasks started

srun: Node n0246, 1 tasks started

srun: Node n0249, 1 tasks started

srun: Node n0248, 1 tasks started

srun: Node n0251, 1 tasks started

srun: Node n0250, 1 tasks started

srun: Node n0254, 1 tasks started

srun: Node n0252, 1 tasks started

srun: Node n0256, 1 tasks started

srun: Node n0253, 1 tasks started

srun: Node n0257, 1 tasks started

srun: Node n0255, 1 tasks started

nmf: error: slurm_accept_msg_conn: Interrupted system call

[n0244:26366] [[8232,1],2][grpcomm_pmi_module.c:398:modex] PMI_KVS_Commit
failed: Operation failed

--------------------------------------------------------------------------

It looks like MPI_INIT failed for some reason; your parallel process is

likely to abort. There are many reasons that a parallel process can

fail during MPI_INIT; some of which are due to configuration or environment

problems. This failure appears to be an internal failure; here's some

additional information (which may only be relevant to an Open MPI

developer):

orte_grpcomm_modex failed

--> Returned "Error" (-1) instead of "Success" (0)

--------------------------------------------------------------------------

[n0244:26366] *** An error occurred in MPI_Init_thread

[n0244:26366] *** on a NULL communicator

[n0244:26366] *** Unknown error

[n0244:26366] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort

--------------------------------------------------------------------------

An MPI process is aborting at a time when it cannot guarantee that all

of its peer processes in the job will be killed properly. You should

double check that everything has shut down cleanly.

Reason: Before MPI_INIT completed

Local host: n0244

PID: 26366

--------------------------------------------------------------------------

nmf: error: slurm_accept_msg_conn: Interrupted system call

[n0243:22413] [[8232,1],1][grpcomm_pmi_module.c:398:modex] PMI_KVS_Commit
failed: Operation failed

--------------------------------------------------------------------------

It looks like MPI_INIT failed for some reason; your parallel process is

likely to abort. There are many reasons that a parallel process can

fail during MPI_INIT; some of which are due to configuration or environment

problems. This failure appears to be an internal failure; here's some

additional information (which may only be relevant to an Open MPI

developer):

orte_grpcomm_modex failed

--> Returned "Error" (-1) instead of "Success" (0)

--------------------------------------------------------------------------

[n0243:22413] *** An error occurred in MPI_Init_thread

[n0243:22413] *** on a NULL communicator

[n0243:22413] *** Unknown error

[n0243:22413] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort

--------------------------------------------------------------------------

An MPI process is aborting at a time when it cannot guarantee that all

of its peer processes in the job will be killed properly. You should

double check that everything has shut down cleanly.

Reason: Before MPI_INIT completed

Local host: n0243

PID: 22413

--------------------------------------------------------------------------

nmf: error: slurm_accept_msg_conn: Interrupted system call

[n0245:17194] [[8232,1],3][grpcomm_pmi_module.c:398:modex] PMI_KVS_Commit
failed: Operation failed

nmf: error: slurm_accept_msg_conn: Interrupted system call

[n0242:32029] [[8232,1],0][grpcomm_pmi_module.c:398:modex] PMI_KVS_Commit
failed: Operation failed

--------------------------------------------------------------------------

It looks like MPI_INIT failed for some reason; your parallel process is

likely to abort. There are many reasons that a parallel process can

fail during MPI_INIT; some of which are due to configuration or environment

problems. This failure appears to be an internal failure; here's some

additional information (which may only be relevant to an Open MPI

developer):

orte_grpcomm_modex failed

--> Returned "Error" (-1) instead of "Success" (0)

--------------------------------------------------------------------------

[n0245:17194] *** An error occurred in MPI_Init_thread

[n0245:17194] *** on a NULL communicator

[n0245:17194] *** Unknown error

[n0245:17194] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort

--------------------------------------------------------------------------

It looks like MPI_INIT failed for some reason; your parallel process is

likely to abort. There are many reasons that a parallel process can

fail during MPI_INIT; some of which are due to configuration or environment

problems. This failure appears to be an internal failure; here's some

additional information (which may only be relevant to an Open MPI

developer):

orte_grpcomm_modex failed

--> Returned "Error" (-1) instead of "Success" (0)

--------------------------------------------------------------------------

--------------------------------------------------------------------------

An MPI process is aborting at a time when it cannot guarantee that all

of its peer processes in the job will be killed properly. You should

double check that everything has shut down cleanly.

Reason: Before MPI_INIT completed

Local host: n0245

PID: 17194

--------------------------------------------------------------------------

[n0242:32029] *** An error occurred in MPI_Init_thread

[n0242:32029] *** on a NULL communicator

[n0242:32029] *** Unknown error

[n0242:32029] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort

--------------------------------------------------------------------------

An MPI process is aborting at a time when it cannot guarantee that all

of its peer processes in the job will be killed properly. You should

double check that everything has shut down cleanly.

Reason: Before MPI_INIT completed

Local host: n0242

PID: 32029

--------------------------------------------------------------------------

nmf: error: slurm_accept_msg_conn: Interrupted system call

[n0247:08603] [[8232,1],5][grpcomm_pmi_module.c:398:modex] PMI_KVS_Commit
failed: Operation failed

--------------------------------------------------------------------------

It looks like MPI_INIT failed for some reason; your parallel process is

likely to abort. There are many reasons that a parallel process can

fail during MPI_INIT; some of which are due to configuration or environment

problems. This failure appears to be an internal failure; here's some

additional information (which may only be relevant to an Open MPI

developer):

orte_grpcomm_modex failed

--> Returned "Error" (-1) instead of "Success" (0)

--------------------------------------------------------------------------

[n0247:8603] *** An error occurred in MPI_Init_thread

[n0247:8603] *** on a NULL communicator

[n0247:8603] *** Unknown error

[n0247:8603] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort

--------------------------------------------------------------------------

An MPI process is aborting at a time when it cannot guarantee that all

of its peer processes in the job will be killed properly. You should

double check that everything has shut down cleanly.

Reason: Before MPI_INIT completed

Local host: n0247

PID: 8603

--------------------------------------------------------------------------

nmf: error: slurm_accept_msg_conn: Interrupted system call

[n0249:01391] [[8232,1],7][grpcomm_pmi_module.c:398:modex] PMI_KVS_Commit
failed: Operation failed

--------------------------------------------------------------------------

It looks like MPI_INIT failed for some reason; your parallel process is

likely to abort. There are many reasons that a parallel process can

fail during MPI_INIT; some of which are due to configuration or environment

problems. This failure appears to be an internal failure; here's some

additional information (which may only be relevant to an Open MPI

developer):

orte_grpcomm_modex failed

--> Returned "Error" (-1) instead of "Success" (0)

--------------------------------------------------------------------------

[n0249:1391] *** An error occurred in MPI_Init_thread

[n0249:1391] *** on a NULL communicator

[n0249:1391] *** Unknown error

[n0249:1391] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort

nmf: error: slurm_accept_msg_conn: Interrupted system call

--------------------------------------------------------------------------

An MPI process is aborting at a time when it cannot guarantee that all

of its peer processes in the job will be killed properly. You should

double check that everything has shut down cleanly.

Reason: Before MPI_INIT completed

Local host: n0249

PID: 1391

--------------------------------------------------------------------------

[n0246:19634] [[8232,1],4][grpcomm_pmi_module.c:398:modex] PMI_KVS_Commit
failed: Operation failed

--------------------------------------------------------------------------

It looks like MPI_INIT failed for some reason; your parallel process is

likely to abort. There are many reasons that a parallel process can

fail during MPI_INIT; some of which are due to configuration or environment

problems. This failure appears to be an internal failure; here's some

additional information (which may only be relevant to an Open MPI

developer):

orte_grpcomm_modex failed

--> Returned "Error" (-1) instead of "Success" (0)

--------------------------------------------------------------------------

[n0246:19634] *** An error occurred in MPI_Init_thread

[n0246:19634] *** on a NULL communicator

[n0246:19634] *** Unknown error

[n0246:19634] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort

--------------------------------------------------------------------------

An MPI process is aborting at a time when it cannot guarantee that all

of its peer processes in the job will be killed properly. You should

double check that everything has shut down cleanly.

Reason: Before MPI_INIT completed

Local host: n0246

PID: 19634

--------------------------------------------------------------------------

srun: Received task exit notification for 1 task (status=0x0100).

nmf: error: slurm_accept_msg_conn: Interrupted system call

[n0251:20401] [[8232,1],9][grpcomm_pmi_module.c:398:modex] PMI_KVS_Commit
failed: Operation failed

nmf: error: slurm_accept_msg_conn: Interrupted system call

--------------------------------------------------------------------------

It looks like MPI_INIT failed for some reason; your parallel process is

likely to abort. There are many reasons that a parallel process can

fail during MPI_INIT; some of which are due to configuration or environment

problems. This failure appears to be an internal failure; here's some

additional information (which may only be relevant to an Open MPI

developer):

orte_grpcomm_modex failed

--> Returned "Error" (-1) instead of "Success" (0)

--------------------------------------------------------------------------

[n0251:20401] *** An error occurred in MPI_Init_thread

[n0251:20401] *** on a NULL communicator

[n0251:20401] *** Unknown error

[n0251:20401] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort

--------------------------------------------------------------------------

An MPI process is aborting at a time when it cannot guarantee that all

of its peer processes in the job will be killed properly. You should

double check that everything has shut down cleanly.

Reason: Before MPI_INIT completed

Local host: n0251

PID: 20401

--------------------------------------------------------------------------

[n0250:08714] [[8232,1],8][grpcomm_pmi_module.c:398:modex] PMI_KVS_Commit
failed: Operation failed

--------------------------------------------------------------------------

It looks like MPI_INIT failed for some reason; your parallel process is

likely to abort. There are many reasons that a parallel process can

fail during MPI_INIT; some of which are due to configuration or environment

problems. This failure appears to be an internal failure; here's some

additional information (which may only be relevant to an Open MPI

developer):

orte_grpcomm_modex failed

--> Returned "Error" (-1) instead of "Success" (0)

--------------------------------------------------------------------------

[n0250:8714] *** An error occurred in MPI_Init_thread

[n0250:8714] *** on a NULL communicator

[n0250:8714] *** Unknown error

[n0250:8714] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort

--------------------------------------------------------------------------

An MPI process is aborting at a time when it cannot guarantee that all

of its peer processes in the job will be killed properly. You should

double check that everything has shut down cleanly.

Reason: Before MPI_INIT completed

Local host: n0250

PID: 8714

--------------------------------------------------------------------------

srun: error: n0244: task 2: Exited with exit code 1

srun: Received task exit notification for 1 task (status=0x0100).

srun: error: n0242: task 0: Exited with exit code 1

srun: Received task exit notification for 1 task (status=0x0100).

srun: error: n0245: task 3: Exited with exit code 1

srun: Received task exit notification for 1 task (status=0x0100).

srun: error: n0243: task 1: Exited with exit code 1

srun: Received task exit notification for 1 task (status=0x0100).

srun: error: n0247: task 5: Exited with exit code 1

srun: Received task exit notification for 1 task (status=0x0100).

srun: error: n0249: task 7: Exited with exit code 1

srun: Received task exit notification for 1 task (status=0x0100).

srun: error: n0246: task 4: Exited with exit code 1

srun: Received task exit notification for 1 task (status=0x0100).

srun: error: n0251: task 9: Exited with exit code 1

srun: Received task exit notification for 1 task (status=0x0100).

srun: Sent KVS info to 16 nodes, up to 1 tasks per node

srun: error: slurm_send_recv_rc_msg_only_one to n0242:46095 : Connection
refused

srun: error: slurm_send_recv_rc_msg_only_one to n0243:56307 : Connection
refused

srun: error: slurm_send_recv_rc_msg_only_one to n0244:42705 : Connection
refused

srun: error: n0250: task 8: Exited with exit code 1

srun: error: slurm_send_recv_rc_msg_only_one to n0245:57746 : Connection
refused

srun: error: slurm_send_recv_rc_msg_only_one to n0246:57496 : Connection
refused

srun: error: slurm_send_recv_rc_msg_only_one to n0250:50673 : Connection
refused

srun: error: slurm_send_recv_rc_msg_only_one to n0249:36371 : Connection
refused

srun: error: slurm_send_recv_rc_msg_only_one to n0251:47692 : Connection
refused

srun: error: slurm_send_recv_rc_msg_only_one to n0247:57347 : Connection
refused

nmf: error: slurm_accept_msg_conn: Interrupted system call

[n0254:18359] [[8232,1],12][grpcomm_pmi_module.c:195:pmi_barrier]
PMI_Barrier: Operation failed

--------------------------------------------------------------------------

It looks like MPI_INIT failed for some reason; your parallel process is

likely to abort. There are many reasons that a parallel process can

fail during MPI_INIT; some of which are due to configuration or environment

problems. This failure appears to be an internal failure; here's some

additional information (which may only be relevant to an Open MPI

developer):

orte_grpcomm_barrier failed

--> Returned "Error" (-1) instead of "Success" (0)

--------------------------------------------------------------------------

[n0254:18359] *** An error occurred in MPI_Init_thread

[n0254:18359] *** on a NULL communicator

[n0254:18359] *** Unknown error

[n0254:18359] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort

--------------------------------------------------------------------------

An MPI process is aborting at a time when it cannot guarantee that all

of its peer processes in the job will be killed properly. You should

double check that everything has shut down cleanly.

Reason: Before MPI_INIT completed

Local host: n0254

PID: 18359

--------------------------------------------------------------------------

nmf: error: slurm_accept_msg_conn: Interrupted system call

nmf: error: slurm_accept_msg_conn: Interrupted system call

[n0252:05124] [[8232,1],10][grpcomm_pmi_module.c:195:pmi_barrier]
PMI_Barrier: Operation failed

[n0256:27363] [[8232,1],14][grpcomm_pmi_module.c:195:pmi_barrier]
PMI_Barrier: Operation failed

--------------------------------------------------------------------------

It looks like MPI_INIT failed for some reason; your parallel process is

likely to abort. There are many reasons that a parallel process can

fail during MPI_INIT; some of which are due to configuration or environment

problems. This failure appears to be an internal failure; here's some

additional information (which may only be relevant to an Open MPI

developer):

orte_grpcomm_barrier failed

--> Returned "Error" (-1) instead of "Success" (0)

--------------------------------------------------------------------------

[n0256:27363] *** An error occurred in MPI_Init_thread

[n0256:27363] *** on a NULL communicator

[n0256:27363] *** Unknown error

[n0256:27363] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort

--------------------------------------------------------------------------

An MPI process is aborting at a time when it cannot guarantee that all

of its peer processes in the job will be killed properly. You should

double check that everything has shut down cleanly.

Reason: Before MPI_INIT completed

Local host: n0256

PID: 27363

--------------------------------------------------------------------------

--------------------------------------------------------------------------

It looks like MPI_INIT failed for some reason; your parallel process is

likely to abort. There are many reasons that a parallel process can

fail during MPI_INIT; some of which are due to configuration or environment

problems. This failure appears to be an internal failure; here's some

additional information (which may only be relevant to an Open MPI

developer):

orte_grpcomm_barrier failed

--> Returned "Error" (-1) instead of "Success" (0)

--------------------------------------------------------------------------

[n0252:5124] *** An error occurred in MPI_Init_thread

[n0252:5124] *** on a NULL communicator

[n0252:5124] *** Unknown error

[n0252:5124] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort

--------------------------------------------------------------------------

An MPI process is aborting at a time when it cannot guarantee that all

of its peer processes in the job will be killed properly. You should

double check that everything has shut down cleanly.

Reason: Before MPI_INIT completed

Local host: n0252

PID: 5124

--------------------------------------------------------------------------

nmf: error: slurm_accept_msg_conn: Interrupted system call

[n0253:00668] [[8232,1],11][grpcomm_pmi_module.c:195:pmi_barrier]
PMI_Barrier: Operation failed

--------------------------------------------------------------------------

It looks like MPI_INIT failed for some reason; your parallel process is

likely to abort. There are many reasons that a parallel process can

fail during MPI_INIT; some of which are due to configuration or environment

problems. This failure appears to be an internal failure; here's some

additional information (which may only be relevant to an Open MPI

developer):

orte_grpcomm_barrier failed

--> Returned "Error" (-1) instead of "Success" (0)

--------------------------------------------------------------------------

[n0253:668] *** An error occurred in MPI_Init_thread

[n0253:668] *** on a NULL communicator

[n0253:668] *** Unknown error

[n0253:668] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort

--------------------------------------------------------------------------

An MPI process is aborting at a time when it cannot guarantee that all

of its peer processes in the job will be killed properly. You should

double check that everything has shut down cleanly.

Reason: Before MPI_INIT completed

Local host: n0253

PID: 668

--------------------------------------------------------------------------

nmf: error: slurm_accept_msg_conn: Interrupted system call

[n0257:26458] [[8232,1],15][grpcomm_pmi_module.c:195:pmi_barrier]
PMI_Barrier: Operation failed

--------------------------------------------------------------------------

It looks like MPI_INIT failed for some reason; your parallel process is

likely to abort. There are many reasons that a parallel process can

fail during MPI_INIT; some of which are due to configuration or environment

problems. This failure appears to be an internal failure; here's some

additional information (which may only be relevant to an Open MPI

developer):

orte_grpcomm_barrier failed

--> Returned "Error" (-1) instead of "Success" (0)

--------------------------------------------------------------------------

[n0257:26458] *** An error occurred in MPI_Init_thread

[n0257:26458] *** on a NULL communicator

[n0257:26458] *** Unknown error

[n0257:26458] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort

--------------------------------------------------------------------------

An MPI process is aborting at a time when it cannot guarantee that all

of its peer processes in the job will be killed properly. You should

double check that everything has shut down cleanly.

Reason: Before MPI_INIT completed

Local host: n0257

PID: 26458

--------------------------------------------------------------------------

nmf: error: slurm_accept_msg_conn: Interrupted system call

[n0255:32396] [[8232,1],13][grpcomm_pmi_module.c:195:pmi_barrier]
PMI_Barrier: Operation failed

--------------------------------------------------------------------------

It looks like MPI_INIT failed for some reason; your parallel process is

likely to abort. There are many reasons that a parallel process can

fail during MPI_INIT; some of which are due to configuration or environment

problems. This failure appears to be an internal failure; here's some

additional information (which may only be relevant to an Open MPI

developer):

orte_grpcomm_barrier failed

--> Returned "Error" (-1) instead of "Success" (0)

--------------------------------------------------------------------------

[n0255:32396] *** An error occurred in MPI_Init_thread

[n0255:32396] *** on a NULL communicator

[n0255:32396] *** Unknown error

[n0255:32396] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort

--------------------------------------------------------------------------

An MPI process is aborting at a time when it cannot guarantee that all

of its peer processes in the job will be killed properly. You should

double check that everything has shut down cleanly.

Reason: Before MPI_INIT completed

Local host: n0255

PID: 32396

--------------------------------------------------------------------------

srun: Received task exit notification for 1 task (status=0x0100).

srun: error: n0252: task 10: Exited with exit code 1

srun: Received task exit notification for 1 task (status=0x0100).

srun: error: n0256: task 14: Exited with exit code 1

srun: Received task exit notification for 1 task (status=0x0100).

srun: error: n0253: task 11: Exited with exit code 1

srun: Received task exit notification for 1 task (status=0x0100).

srun: error: n0254: task 12: Exited with exit code 1

srun: Received task exit notification for 1 task (status=0x0100).

srun: error: n0257: task 15: Exited with exit code 1

srun: Received task exit notification for 1 task (status=0x0100).

srun: error: n0255: task 13: Exited with exit code 1

nmf: error: slurm_accept_msg_conn: Interrupted system call

[n0248:07754] [[8232,1],6][grpcomm_pmi_module.c:195:pmi_barrier]
PMI_Barrier: Operation failed

--------------------------------------------------------------------------

It looks like MPI_INIT failed for some reason; your parallel process is

likely to abort. There are many reasons that a parallel process can

fail during MPI_INIT; some of which are due to configuration or environment

problems. This failure appears to be an internal failure; here's some

additional information (which may only be relevant to an Open MPI

developer):

orte_grpcomm_barrier failed

--> Returned "Error" (-1) instead of "Success" (0)

--------------------------------------------------------------------------

[n0248:7754] *** An error occurred in MPI_Init_thread

[n0248:7754] *** on a NULL communicator

[n0248:7754] *** Unknown error

[n0248:7754] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort

--------------------------------------------------------------------------

An MPI process is aborting at a time when it cannot guarantee that all

of its peer processes in the job will be killed properly. You should

double check that everything has shut down cleanly.

Reason: Before MPI_INIT completed

Local host: n0248

PID: 7754

--------------------------------------------------------------------------

srun: Received task exit notification for 1 task (status=0x0100).

srun: error: n0248: task 6: Exited with exit code 1
--
Abraços \ Regards \ Saludos
-----------------------------
Fabricio Silva Kyt
Michael Robbert
2014-07-09 19:28:13 UTC
Fabricio,
I think that I've seen this error if OpenMPI isn't compiled with support for Slurm's PMI library. See question #2 on this page: http://www.open-mpi.org/faq/?category=slurm
Do you know if that has been done?
I'm not sure if this is a perfect check, but does the following command return anything:

ompi_info | grep -i pmi
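
If it doesn't return anything, the FAQ above describes rebuilding Open MPI against Slurm's PMI library, roughly along these lines (the Slurm install prefix below is just a placeholder and varies by site):

./configure --with-slurm --with-pmi=/usr/local/slurm
make && make install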

Mike
Fabricio Kyt
2014-07-09 21:07:42 UTC
Michael - Thanks for responding. Yes, our OpenMPI has been compiled with
support for SLURM's PMI. The output from the command is:

[ ~]$ ompi_info | grep -i pmi
MCA pubsub: pmi (MCA v2.0, API v2.0, Component v1.6.5)
MCA ess: pmi (MCA v2.0, API v2.0, Component v1.6.5)
MCA grpcomm: pmi (MCA v2.0, API v2.0, Component v1.6.5)
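
For completeness, the corresponding check on the Slurm side would be something along these lines (it simply lists the MPI plugin types this Slurm installation supports):

srun --mpi=list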
Post by Michael Robbert
Fabricio,
I think that I've seen this error if OpenMPI isn't compiled with support
for Slurm's PMI library. See question #2 on this page:
http://www.open-mpi.org/faq/?category=slurm
Do you know if that has been done?
ompi_info | grep -i pmi
Mike
Hi All - I've been running a parallel application using OpenMPI and
SLURM and getting the following error messages. The same application runs
fine in another cluster with Torque so I'm suspecting I'm missing some kind
of SLURM configuration setting.
Ps. Simple commands like srun -N 128 hostname work fine.
Any help is greatly appreciated. Thanks!
[... full srun / MPI_INIT error log from the original message quoted here; snipped ...]
--
Abraços \ Regards \ Saludos
-----------------------------
Fabricio Silva Kyt