Hello, I have been trying to submit a job in our cluster for a intel17 compiled and impi enabled code. I keep getting trouble at startup when running through PBS.
This is the submission script:
#!/bin/bash #PBS -N propane_XO2_ramp_dx_p3125cm(IMPI) #PBS -W umask=0022 #PBS -e /home4/mnv/FIREMODELS_ISSUES/fds/Validation/UMD_Line_Burner/Test_Valgrind/propane_XO2_ramp_dx_p3125cm.err #PBS -o /home4/mnv/FIREMODELS_ISSUES/fds/Validation/UMD_Line_Burner/Test_Valgrind/propane_XO2_ramp_dx_p3125cm.log #PBS -l nodes=16:ppn=12 #PBS -l walltime=999:0:0 module purge module load null modules torque-maui intel/17 export OMP_NUM_THREADS=1 export I_MPI_FABRICS=shm:dapl export I_MPI_DAPL_PROVIDER=OpenIB-cma export I_MPI_FALLBACK_DEVICE=0 export I_MPI_DEBUG=100 cd /home4/mnv/FIREMODELS_ISSUES/fds/Validation/UMD_Line_Burner/Test_Valgrind echo echo $PBS_O_HOME echo `date` echo "Input file: propane_XO2_ramp_dx_p3125cm.fds" echo " Directory: `pwd`" echo " Host: `hostname`" /opt/intel17/compilers_and_libraries/linux/mpi/bin64/mpiexec -np 184 /home4/mnv/FIREMODELS_ISSUES/fds/Build/impi_intel_linux_64/fds_impi_intel_linux_64 propane_XO2_ramp_dx_p3125cm.fds
As you can see I'm invoking DAPL and OpenIB-cma as dapl provider. This is what I see on my login node /etc/dat.conf
OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib0 0""" OpenIB-cma-1 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib1 0""" OpenIB-cma-2 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib2 0""" OpenIB-cma-3 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib3 0""" OpenIB-bond u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "bond0 0""" ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0""" ofa-v2-ib1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib1 0""" ofa-v2-ib2 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib2 0""" ofa-v2-ib3 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib3 0""" ofa-v2-bond u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "bond0 0"""
Now logging in to the actual compute nodes I don't see an /etc/dat.conf on these. I don't know if this is normal or there is an issue there.
Anyways, when I submit the job I get the following attached stdout file, where it seems some of the nodes fail to load OpenIB-cma (with no fallback fabrics).
To be sure, some nodes on the cluster use Qlogic infiniband cards and others use Mellanox.
At this point I've tried several combinations, either specifying or not ib fabrics, without success. I'd really appreciate if you help me troubleshooting this.
Thank you,
Marcos