Intel MPI issue with the usage of Slurm

To whom it may concern,

Hello. We use Slurm to manage our cluster and have run into a new issue with Intel MPI under Slurm. After a node reboots, Intel MPI jobs fail on that node until the Slurm daemon is restarted manually. I also tried adding "service slurm restart" to /etc/rc.local, which runs at the end of booting, but the issue persists.

I also reported this issue to the slurm-dev list, but they believe it is caused by the InfiniBand + Intel MPI configuration. They suggested configuring dat.conf and setting some Intel MPI environment variables. However, I don't know how to set them.
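
For context, the kind of settings being suggested would presumably look something like the sketch below. It is only an illustration, not a confirmed fix: it assumes the DAPL fabric is wanted instead of OFA, and the provider name is simply one of the mlx4_0 port 1 entries from the dat.conf shown further down.

```shell
# Assumed settings for a DAPL-based run; adjust to the local installation.
export I_MPI_DEBUG=5                          # verbose startup output, including fabric selection
export I_MPI_FABRICS=shm:dapl                 # shared memory within a node, DAPL between nodes
export I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u   # must match an <ia_name> in dat.conf
export I_MPI_FALLBACK=0                       # fail instead of silently using another fabric
```

With these exported in the allocation shell, "srun ./hello" would be run as in the example below, and the debug output should show which fabric and provider were actually picked.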

Here is an example:

$ salloc -N1 -n12 -w cn117 #cn117 is the node just rebooted
salloc: Granted job allocation 1201
$ module list
Currently Loaded Modulefiles:
  1) modules                    2) null                       3) intelics/2013.1.039
$ export I_MPI_PMI_LIBRARY=/gpfs/slurm/lib/libpmi.so
$ export I_MPI_FABRICS=shm:ofa
$ srun ./hello
[3] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[4] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[5] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[6] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[7] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[8] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[10] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[11] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[0] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[9] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[1] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[2] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
srun: error: cn117: tasks 0-11: Exited with exit code 254
srun: Terminating job step 1201.0

After restarting the slurm daemon:

$ ssh root@cn117
cn117$ service slurm restart
stopping slurmd:                                           [  OK  ]
slurmd is stopped
starting slurmd:                                           [  OK  ]
cn117$ exit
$ salloc -N1 -n12 -w cn117
salloc: Granted job allocation 1203
$ export I_MPI_PMI_LIBRARY=/gpfs/slurm/lib/libpmi.so
$ export I_MPI_FABRICS=shm:ofa
$ srun ./hello
This is Process  9 out of 12 running on host cn117
This is Process  3 out of 12 running on host cn117
This is Process  2 out of 12 running on host cn117
This is Process  7 out of 12 running on host cn117
This is Process  6 out of 12 running on host cn117
This is Process  0 out of 12 running on host cn117
This is Process  5 out of 12 running on host cn117
This is Process  1 out of 12 running on host cn117
This is Process  4 out of 12 running on host cn117
This is Process 10 out of 12 running on host cn117
This is Process  8 out of 12 running on host cn117
This is Process 11 out of 12 running on host cn117
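
Since the only difference between the two runs is how slurmd was started, one thing that may be worth comparing is the locked-memory limit that job steps inherit from it. A low memlock limit is a common reason for InfiniBand verbs fabrics to be unavailable, but whether it applies here is an assumption. The helper name below is made up for illustration; it just reads the standard Linux /proc limits file.

```shell
# Print the soft "Max locked memory" limit of a process.
# With no argument, it reports the limit inherited from the calling shell.
memlock_limit() {
    awk '/^Max locked memory/ { print $4 }' "/proc/${1:-self}/limits"
}

memlock_limit    # prints a byte count or "unlimited"
```

Run as root on cn117, memlock_limit "$(pgrep -o slurmd)" would show the limit of the running daemon, and "srun bash -c 'ulimit -l'" would show what a job step actually gets. If the values differ before and after the manual restart, the limit is a likely suspect.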

Here is the default dat.conf we have:

# DAT v2.0, v1.2 configuration file
#
# Each entry should have the following fields:
#
# <ia_name> <api_version> <threadsafety> <default> <lib_path> \
#           <provider_version> <ia_params> <platform_params>
#
# For uDAPL cma provder, <ia_params> is one of the following:
#       network address, network hostname, or netdev name and 0 for port
#
# For uDAPL scm provider, <ia_params> is device name and port
# For uDAPL ucm provider, <ia_params> is device name and port
# For uDAPL iWARP provider, <ia_params> is netdev device name and 0
# For uDAPL iWARP provider, <ia_params> is netdev device name and 0
# For uDAPL RoCE provider, <ia_params> is device name and 0
#
ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""
ofa-v2-ib1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib1 0" ""
ofa-v2-mthca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-mthca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 2" ""
ofa-v2-ipath0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 1" ""
ofa-v2-ipath0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 2" ""
ofa-v2-ehca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ehca0 1" ""
ofa-v2-iwarp u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-mthca0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-mthca0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 2" ""
ofa-v2-cma-roe-eth2 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
ofa-v2-cma-roe-eth3 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth3 0" ""
ofa-v2-scm-roe-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-scm-roe-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-mcm-1 u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mcm-2 u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-scif0 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "scif0 1" ""
ofa-v2-scif0-u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "scif0 1" ""
ofa-v2-mic0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "mic0:ib 1" ""
ofa-v2-mlx4_0-1s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-mlx4_1-1s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_1 1" ""
ofa-v2-mlx4_1-2s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_1 2" ""
ofa-v2-mlx4_1-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_1 1" ""
ofa-v2-mlx4_1-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_1 2" ""
ofa-v2-mlx4_0-1m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-mlx4_1-1m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_1 1" ""
ofa-v2-mlx4_1-2m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_1 2" ""
ofa-v2-mlx5_0-1s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx5_0 1" ""
ofa-v2-mlx5_0-2s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx5_0 2" ""
ofa-v2-mlx5_1-1s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx5_1 1" ""
ofa-v2-mlx5_1-2s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx5_1 2" ""
ofa-v2-mlx5_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx5_0 1" ""
ofa-v2-mlx5_0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx5_0 2" ""
ofa-v2-mlx5_1-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx5_1 1" ""
ofa-v2-mlx5_1-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx5_1 2" ""
ofa-v2-mlx5_0-1m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx5_0 1" ""
ofa-v2-mlx5_0-2m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx5_0 2" ""
ofa-v2-mlx5_1-1m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx5_1 1" ""
ofa-v2-mlx5_1-2m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx5_1 2" ""
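
Most of these entries refer to hardware the node does not have, so it may help to narrow the file down to the providers bound to the HCA and port that are actually up. The sketch below does that with awk. The function name is hypothetical and the default path /etc/dat.conf is an assumption; some installations keep the file elsewhere.

```shell
# List dat.conf provider names whose <ia_params> match a device and port.
# Usage: dat_providers <device> <port> [path-to-dat.conf]
dat_providers() {
    local device="$1" port="$2" conf="${3:-/etc/dat.conf}"
    awk -v dev="$device" -v port="$port" '
        /^[[:space:]]*#/ || NF < 8 { next }   # skip comments and incomplete lines
        # field 7 is the opening quote plus device, field 8 is the port plus closing quote
        $7 == "\"" dev && index($8, port "\"") == 1 { print $1 }
    ' "$conf"
}
```

For the file above, "dat_providers mlx4_0 1" selects the six entries tied to "mlx4_0 1", among them ofa-v2-mlx4_0-1 (scm) and ofa-v2-mlx4_0-1u (ucm). Those are the names that would be candidates for a DAPL provider setting.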

Some system information here:

$ slurmd -V
slurm 14.03.0

$ mpirun -V
Intel(R) MPI Library for Linux* OS, Version 4.1 Update 1 Build 20130522
Copyright (C) 2003-2013, Intel Corporation. All rights reserved.

cn117$ ofed_info|head -n1
MLNX_OFED_LINUX-2.2-1.0.1 (OFED-2.2-1.0.0):

cn117$ ibv_devinfo
hca_id: mlx4_0
transport:   InfiniBand (0)
fw_ver:    2.11.550
node_guid:
sys_image_guid:   ##########
vendor_id:   ##########
vendor_part_id:   ########
hw_ver:    0x0
board_id:   ########
phys_port_cnt:   2
  port: 1
   state:   PORT_ACTIVE (4)
   max_mtu:  4096 (5)
   active_mtu:  4096 (5)
   sm_lid:   1
   port_lid:  131
   port_lmc:  0x00
   link_layer:  InfiniBand

  port: 2
   state:   PORT_DOWN (1)
   max_mtu:  4096 (5)
   active_mtu:  4096 (5)
   sm_lid:   0
   port_lid:  0
   port_lmc:  0x00
   link_layer:  InfiniBand


cn117$ cat /etc/redhat-release
Red Hat Enterprise Linux Workstation release 6.5 (Santiago)
cn117$ uname -r
2.6.32-431.23.3.el6.x86_64

I wonder if anyone has faced a similar issue before and could help us figure out a solution.

Thanks,

Tingyang Xu

