
IMPI and DAPL fabrics on InfiniBand cluster


Hello, I have been trying to submit a job on our cluster for an Intel 17-compiled, Intel MPI-enabled code. I keep running into trouble at startup when launching through PBS.

This is the submission script:

#!/bin/bash
#PBS -N propane_XO2_ramp_dx_p3125cm(IMPI)
#PBS -W umask=0022
#PBS -e /home4/mnv/FIREMODELS_ISSUES/fds/Validation/UMD_Line_Burner/Test_Valgrind/propane_XO2_ramp_dx_p3125cm.err
#PBS -o /home4/mnv/FIREMODELS_ISSUES/fds/Validation/UMD_Line_Burner/Test_Valgrind/propane_XO2_ramp_dx_p3125cm.log
#PBS -l nodes=16:ppn=12
#PBS -l walltime=999:0:0
module purge
module load null modules torque-maui intel/17
export OMP_NUM_THREADS=1
export I_MPI_FABRICS=shm:dapl
export I_MPI_DAPL_PROVIDER=OpenIB-cma
export I_MPI_FALLBACK_DEVICE=0
export I_MPI_DEBUG=100
cd /home4/mnv/FIREMODELS_ISSUES/fds/Validation/UMD_Line_Burner/Test_Valgrind
echo
echo $PBS_O_HOME
echo `date`
echo "Input file: propane_XO2_ramp_dx_p3125cm.fds"
echo " Directory: `pwd`"
echo "      Host: `hostname`"
/opt/intel17/compilers_and_libraries/linux/mpi/bin64/mpiexec   -np 184 /home4/mnv/FIREMODELS_ISSUES/fds/Build/impi_intel_linux_64/fds_impi_intel_linux_64 propane_XO2_ramp_dx_p3125cm.fds

As you can see, I'm selecting DAPL and using OpenIB-cma as the DAPL provider. This is what I see in /etc/dat.conf on my login node:

OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib0 0" ""
OpenIB-cma-1 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib1 0" ""
OpenIB-cma-2 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib2 0" ""
OpenIB-cma-3 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib3 0" ""
OpenIB-bond u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "bond0 0" ""
ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""
ofa-v2-ib1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib1 0" ""
ofa-v2-ib2 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib2 0" ""
ofa-v2-ib3 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib3 0" ""
ofa-v2-bond u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "bond0 0" ""

Logging in to the actual compute nodes, however, I don't see an /etc/dat.conf on them. I don't know whether that is normal or whether it points to a problem.
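
If the missing file turns out to matter, one workaround I could try is staging a copy of the login node's dat.conf on the shared filesystem and pointing the DAT library at it. This relies on uDAPL honoring the DAT_OVERRIDE environment variable; the destination path below is hypothetical:

cp /etc/dat.conf /home4/mnv/dat.conf        # run once on the login node
export DAT_OVERRIDE=/home4/mnv/dat.conf     # add to the job script before mpiexec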

In any case, when I submit the job I get the attached stdout file, where it seems some of the nodes fail to load OpenIB-cma (and, since fallback is disabled, there is no alternative fabric to fall back on).
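
To take PBS out of the picture, a minimal two-rank smoke test along these lines should show, via the I_MPI_DEBUG banner, which fabric and DAPL provider each rank actually selects (node names are placeholders, and I'm assuming IMB-MPI1 sits next to mpiexec, as in a default Intel MPI 2017 install):

export I_MPI_FABRICS=shm:dapl
export I_MPI_DAPL_PROVIDER=OpenIB-cma
export I_MPI_DEBUG=5
/opt/intel17/compilers_and_libraries/linux/mpi/bin64/mpiexec -hosts node01,node02 -np 2 \
    /opt/intel17/compilers_and_libraries/linux/mpi/bin64/IMB-MPI1 PingPong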

I should also mention that some nodes on the cluster use QLogic InfiniBand cards and others use Mellanox.
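
A quick way to tell which vendor a given node has (assuming the standard OFED verbs utilities are installed on the compute nodes) is:

ibv_devinfo -l    # lists HCA names: mlx4_*/mlx5_* are Mellanox, qib* are QLogic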

At this point I've tried several combinations, both specifying the IB fabrics explicitly and leaving them unset, without success. I'd really appreciate any help troubleshooting this.
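
For illustration, the kinds of variations I mean look like the following (the fabric values are the ones documented for Intel MPI 2017; I list them as examples rather than an exact record of every run):

export I_MPI_FABRICS=shm:dapl
export I_MPI_DAPL_PROVIDER=ofa-v2-ib0    # DAT 2.0 provider instead of the 1.2 one

export I_MPI_FABRICS=shm:ofa             # native OFA/verbs, bypassing DAPL

export I_MPI_FABRICS=shm:tmi             # TMI/PSM, the path QLogic cards normally use

export I_MPI_FABRICS=shm:tcp             # TCP (e.g. over IPoIB), to rule the IB fabric in or out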

Thank you,

Marcos
