SLURM 14.11.9 with MPI_Comm_accept causes "Assertion failed" when communicating

Dear all,

I'd like to run different MPI processes in a server/client setup, as explained in:

https://software.intel.com/en-us/articles/using-the-intel-mpi-library-in-a-serverclient-setup

I have attached the two programs that I used as a test:

  • accept.c opens a port and calls MPI_Comm_accept
  • connect.c calls MPI_Comm_connect and expects the port name as an argument
  • once the connection is set up, MPI_Allreduce is used to sum some integers (a minimal sketch of both programs is shown below)
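
For reference, the following is a minimal sketch of what such an accept/connect pair can look like. It mirrors the behaviour visible in the output below (port publishing, Allreduce on the intercommunicator, merge, Allreduce on the merged intracommunicator), but the attached accept.c and connect.c may differ in detail.

/* Hypothetical minimal sketch -- the attached accept.c/connect.c may differ. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm intercomm, intracomm;
    int remote_size, my_value, sum;
    int is_server = (argc < 2);   /* act as server if no port argument is given */

    MPI_Init(&argc, &argv);

    if (is_server) {
        my_value = 8;
        MPI_Open_port(MPI_INFO_NULL, port);     /* e.g. "tag#0$rdma_port0#..." */
        printf("mpiport=%s\n", port);
        fflush(stdout);
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
    } else {
        my_value = 7;
        strncpy(port, argv[1], MPI_MAX_PORT_NAME);
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
    }

    MPI_Comm_remote_size(intercomm, &remote_size);
    printf("Size of intercommunicator: %d\n", remote_size);

    /* Inter-communicator reduction: each group receives the sum of the other
     * group's contributions (hence SUM=8 on the client, SUM=7 on the server). */
    MPI_Allreduce(&my_value, &sum, 1, MPI_INT, MPI_SUM, intercomm);
    printf("intercomm, my_value=%d SUM=%d\n", my_value, sum);

    /* Merge the two groups and reduce over all processes together. */
    MPI_Intercomm_merge(intercomm, is_server ? 0 : 1, &intracomm);
    MPI_Allreduce(&my_value, &sum, 1, MPI_INT, MPI_SUM, intracomm);
    printf("intracomm, my_value=%d SUM=%d\n", my_value, sum);

    if (is_server)
        MPI_Close_port(port);
    MPI_Comm_free(&intracomm);
    MPI_Comm_free(&intercomm);
    MPI_Finalize();
    return 0;
}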

Everything works fine when I start these programs interactively with mpirun:

[donners@int2 openport]$ mpirun -n 1 ./accept_c &
[3] 9575
[donners@int2 openport]$ ./accept_c: MPI_Open_port..
./accept_c: mpiport=tag#0$rdma_port0#9585$rdma_host0#0A000010000071F1FE800000000000000002C9030019455100000004$arch_code#6$
./accept_c: MPI_Comm_Accept..


[3]+  Stopped                 mpirun -n 1 ./accept_c
[donners@int2 openport]$ mpirun -n 1 ./connect_c 'tag#0$rdma_port0#9585$rdma_host0#0A000010000071F1FE800000000000000002C9030019455100000004$arch_code#6$'
./connect_c: Port name entered: tag#0$rdma_port0#9585$rdma_host0#0A000010000071F1FE800000000000000002C9030019455100000004$arch_code#6$
./connect_c: MPI_Comm_connect..
./connect_c: Size of intercommunicator: 1
./connect_c: intercomm, MPI_Allreduce..
./accept_c: Size of intercommunicator: 1
./accept_c: intercomm, MPI_Allreduce..
./connect_c: intercomm, my_value=7 SUM=8
./accept_c: intercomm, my_value=8 SUM=7
./accept_c: intracomm, MPI_Allreduce..
./accept_c: intracomm, my_value=8 SUM=15
Done
./accept_c: Done
./connect_c: intracomm, MPI_Allreduce..
./connect_c: intracomm, my_value=7 SUM=15
Done

However, it fails when started by SLURM. The job script looks like:

#!/bin/bash
#SBATCH -n 2

export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
tmp=$(mktemp)
# Start the server in the background and capture its output, to pick up the port name
srun -l -n 1 ./accept_c 2>&1 | tee $tmp &
# Poll the server's output until the port name appears
until [ "$port" != "" ];do
  port=$(cat $tmp|fgrep mpiport|cut -d= -f2-)
  echo "Found port: $port"
  sleep 1
done
# Start the client and hand it the port name
srun -l -n 1 ./connect_c "$port"<<EOF
$port
EOF

The output is:

Found port:
0: /nfs/home1/donners/Tests/mpi/openport/./accept_c: MPI_Open_port..
0: /nfs/home1/donners/Tests/mpi/openport/./accept_c: mpiport=tag#0$rdma_port0#19635$rdma_host0#0A00000700003F67FE800000000000000002C9030019453100000004$arch_code#0$
0: /nfs/home1/donners/Tests/mpi/openport/./accept_c: MPI_Comm_Accept..
Found port: tag#0$rdma_port0#19635$rdma_host0#0A00000700003F67FE800000000000000002C9030019453100000004$arch_code#0$
output connect_c: /scratch/nodespecific/srv4/donners.2300217/tmp.rQY9kHs8HS
0: /nfs/home1/donners/Tests/mpi/openport/./connect_c: Port name entered: tag#0$rdma_port0#19635$rdma_host0#0A00000700003F67FE800000000000000002C9030019453100000004$arch_code#0$
0: /nfs/home1/donners/Tests/mpi/openport/./connect_c: MPI_Comm_connect..
0: /nfs/home1/donners/Tests/mpi/openport/./accept_c: Size of intercommunicator: 1
0: /nfs/home1/donners/Tests/mpi/openport/./accept_c: intercomm, MPI_Allreduce..
0: Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_conn_rc.c at line 206: ptr && ptr == (char*) MPIDI_Process.my_pg->id
0: internal ABORT - process 0
0: In: PMI_Abort(1, internal ABORT - process 0)
0: slurmstepd: *** STEP 2300217.0 ON srv4 CANCELLED AT 2016-07-28T13:41:40 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
0: /nfs/home1/donners/Tests/mpi/openport/./connect_c: Size of intercommunicator: 1
0: /nfs/home1/donners/Tests/mpi/openport/./connect_c: intercomm, MPI_Allreduce..
0: Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_conn_rc.c at line 206: ptr && ptr == (char*) MPIDI_Process.my_pg->id
0: internal ABORT - process 0
0: In: PMI_Abort(1, internal ABORT - process 0)
0: slurmstepd: *** STEP 2300217.1 ON srv4 CANCELLED AT 2016-07-28T13:41:40 ***

The server and client do connect, but the run fails as soon as communication starts. This looks like a bug in the MPI library.
Could you let me know whether this is the case, or whether this usage is not supported by Intel MPI?

With regards,
John

Attachments:
  • accept.c (1 KB)
  • connect.c (959 bytes)
