Discussion:
slurm cannot work with Infiniband after rebooting
Tingyang Xu
2014-10-20 16:46:48 UTC
To whom it may concern,
Hello. I am new to Slurm, and I am running into a problem using it with InfiniBand. When I run MPI jobs on a node that has just been rebooted, I get fabric errors. For example, here is a simple "hello world" run with Intel MPI:
$ salloc -N1 -n12 -w cn117   # cn117 is the node that was just rebooted
salloc: Granted job allocation 1201
$ module list
Currently Loaded Modulefiles:
1) modules 2) null 3) intelics/2013.1.039
$ export I_MPI_PMI_LIBRARY=/gpfs/slurm/lib/libpmi.so
$ srun ./hello
[3] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[4] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[5] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[6] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[7] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[8] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[10] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[11] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[0] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[9] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[1] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[2] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
srun: error: cn117: tasks 0-11: Exited with exit code 254
srun: Terminating job step 1201.0
================================================================
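For what it is worth, my reading of the message is that Intel MPI aborts because the OFA fabric is unavailable and its fallback fabric is disabled. Presumably enabling the fallback would let the job run over TCP for debugging, though that would only mask the InfiniBand problem rather than fix it:

$ export I_MPI_FABRICS=shm:ofa   # the fabric selection that fails above
$ export I_MPI_FALLBACK=1        # allow a fallback to tcp instead of aborting
$ srun ./hello
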
However, as soon as I manually restart slurmd on cn117, the problem goes away. For example:
$ ssh ***@cn117
cn117# service slurm restart
stopping slurmd: [ OK ]
slurmd is stopped
starting slurmd: [ OK ]
# exit
$ salloc -N1 -n12 -w cn117
salloc: Granted job allocation 1203
$ export I_MPI_PMI_LIBRARY=/gpfs/slurm/lib/libpmi.so
$ srun ./hello
This is Process 9 out of 12 running on host cn117
This is Process 3 out of 12 running on host cn117
This is Process 2 out of 12 running on host cn117
This is Process 7 out of 12 running on host cn117
This is Process 6 out of 12 running on host cn117
This is Process 0 out of 12 running on host cn117
This is Process 5 out of 12 running on host cn117
This is Process 1 out of 12 running on host cn117
This is Process 4 out of 12 running on host cn117
This is Process 10 out of 12 running on host cn117
This is Process 8 out of 12 running on host cn117
This is Process 11 out of 12 running on host cn117
=============================================================
Although I can do this manually, I would like the node to recover on its own. I tried adding "sleep 10s; /etc/init.d/slurm restart" at the end of /etc/rc.local, but the issue is still there. Can anyone help me with this? A sketch of what I am considering next is below.
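A fixed sleep is only a guess at how long the fabric takes to come up, so here is a minimal sketch of what I am considering instead, assuming the root cause is slurmd starting before the IB port is ACTIVE (the device name mlx4_0 and the sysfs path are taken from this node; adjust as needed):

# appended to /etc/rc.local: wait up to 60 seconds for port 1 of mlx4_0
# to report ACTIVE, then restart slurmd; backgrounded so boot is not blocked
(
  for i in $(seq 60); do
    grep -q ACTIVE /sys/class/infiniband/mlx4_0/ports/1/state && break
    sleep 1
  done
  service slurm restart
) &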

Sincerely,
Tingyang Xu
HPC Administrator
University of Connecticut


PS: some information about the InfiniBand setup:
$ slurmd -V
slurm 14.03.0

cn117$ ofed_info|head -n1
MLNX_OFED_LINUX-2.2-1.0.1 (OFED-2.2-1.0.0):

cn117$ ibv_devinfo
hca_id: mlx4_0
transport: InfiniBand (0)
fw_ver: 2.11.550
node_guid:
sys_image_guid: ##########
vendor_id: ##########
vendor_part_id: ########
hw_ver: 0x0
board_id: ########
phys_port_cnt: 2
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 131
port_lmc: 0x00
link_layer: InfiniBand

port: 2
state: PORT_DOWN (1)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: InfiniBand

cn117$ cat /etc/redhat-release
Red Hat Enterprise Linux Workstation release 6.5 (Santiago)
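One more thing I plan to check is whether slurm is ordered after the OFED stack at boot. On RHEL 6 the init-script ordering can be inspected with chkconfig (openibd being the MLNX_OFED service, if I am not mistaken):

cn117$ chkconfig --list | grep -E 'openibd|slurm'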
