Channel: Clusters and HPC Technology

Incorrect result of mpi_reduce over real(16) sums. (2019)


I have found that MPI_REDUCE does not correctly perform a sum reduction over real(16) variables.

Here is a simple code:

program testred16

use mpi_f08

implicit none

integer :: me,np
real(16) :: voltq,voltq0
real(8) :: voltd,voltd0
!
! initialize mpi and get the rank and total number of processes
!
call mpi_init
call mpi_comm_rank(mpi_comm_world,me)
call mpi_comm_size(mpi_comm_world,np)
!
! determine total volume of active computational domain and send to the master
!
voltq = 1.0q0
voltd = 1.0d0
write(*,*) 'voltq is',voltq,'in rank',me
write(*,*) 'voltd is',voltd,'in rank',me
voltq0 = 0.0q0
voltd0 = 0.0d0

call mpi_reduce(voltq,voltq0,1,mpi_real16,mpi_sum,0,mpi_comm_world)
call mpi_reduce(voltd,voltd0,1,mpi_real8, mpi_sum,0,mpi_comm_world)

if(me.eq.0) then
  write(*,*) 'voltq0 (16):',voltq0
  write(*,*) 'voltd0 ( 8):',voltd0
endif

call mpi_finalize

end program

I have compiled it by issuing the following command:

mpiifort -o  test-mpi-real-16 test-mpi-real-16.f90 -check all -traceback -O0 -debug -warn all

Here are some results:

$ mpiexec -np 2 ./test-mpi-real-16
 voltq is   1.00000000000000000000000000000000       in rank           1
 voltd is   1.00000000000000      in rank           1
 voltq is   1.00000000000000000000000000000000       in rank           0
 voltd is   1.00000000000000      in rank           0
 voltq0 (16):   1.00000000000000000000000000000000      
 voltd0 ( 8):   2.00000000000000     
$ mpiexec -np 4 ./test-mpi-real-16
 voltq is   1.00000000000000000000000000000000       in rank           1
 voltq is   1.00000000000000000000000000000000       in rank           2
 voltd is   1.00000000000000      in rank           2
 voltq is   1.00000000000000000000000000000000       in rank           3
 voltd is   1.00000000000000      in rank           3
 voltd is   1.00000000000000      in rank           1
 voltq is   1.00000000000000000000000000000000       in rank           0
 voltd is   1.00000000000000      in rank           0
 voltq0 (16):   1.00000000000000000000000000000000      
 voltd0 ( 8):   4.00000000000000     
$ mpiexec -np 8 ./test-mpi-real-16
 voltq is   1.00000000000000000000000000000000       in rank           1
 voltd is   1.00000000000000      in rank           1
 voltq is   1.00000000000000000000000000000000       in rank           2
 voltd is   1.00000000000000      in rank           2
 voltq is   1.00000000000000000000000000000000       in rank           4
 voltd is   1.00000000000000      in rank           4
 voltq is   1.00000000000000000000000000000000       in rank           6
 voltd is   1.00000000000000      in rank           6
 voltq is   1.00000000000000000000000000000000       in rank           7
 voltd is   1.00000000000000      in rank           7
 voltq is   1.00000000000000000000000000000000       in rank           3
 voltd is   1.00000000000000      in rank           3
 voltq is   1.00000000000000000000000000000000       in rank           5
 voltd is   1.00000000000000      in rank           5
 voltq is   1.00000000000000000000000000000000       in rank           0
 voltd is   1.00000000000000      in rank           0
 voltq0 (16):   1.00000000000000000000000000000000      
 voltd0 ( 8):   8.00000000000000

The reduction of the real(16) variable is wrong, whereas the real(8) reduction is right. I encountered the same error in previous versions (2017, 2018), but setting the environment variable I_MPI_ADJUST_REDUCE=1 fixed it. Now I cannot recover the exact result whatever value I set (or when leaving it unset).
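
Until the reduction itself behaves, one possible workaround is to avoid the built-in MPI_SUM over mpi_real16 altogether: gather the per-rank quad-precision values to the root and sum them there. This is only a sketch, and it assumes that mpi_real16 data movement itself is correct (the output above is consistent with that, but it is not verified exhaustively):

program testred16_gather

use mpi_f08

implicit none

integer :: me,np
real(16) :: voltq,voltq0
real(16), allocatable :: allq(:)

call mpi_init
call mpi_comm_rank(mpi_comm_world,me)
call mpi_comm_size(mpi_comm_world,np)

voltq = 1.0q0
allocate(allq(np))
!
! gather one quad-precision value per rank onto the root instead of reducing
!
call mpi_gather(voltq,1,mpi_real16,allq,1,mpi_real16,0,mpi_comm_world)

if(me.eq.0) then
  voltq0 = sum(allq)   ! summation done in quad precision on rank 0
  write(*,*) 'voltq0 (16):',voltq0
endif

call mpi_finalize

end program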


Openmpi compilation for omnipath


Hi Team,

Can someone guide me on how to configure Open MPI with Omni-Path? We have a 24-port Intel Omni-Path managed switch.

What is the correct process to configure Open MPI for Omni-Path?

Regards

Amit Kumar

PS XE 2019 MPI not reading from stdin on Windows


Hi,

This is a problem associated with the latest release of PS XE 2019 Cluster Edition. When running an MPI program from mpiexec, process 0 is unable to read stdin, at least on a Windows box (I have not installed the upgrade on CentOS yet). For instance, in this test program:

 program MPItesting
      use mpi
      implicit none
      integer myRank, mpierr, noProc, narg
      character(80) basefilename
      
      call MPI_INIT( mpierr )
      call MPI_COMM_RANK(MPI_COMM_WORLD, myRank, mpierr) ! get rank of this process in world
      call MPI_COMM_SIZE(MPI_COMM_WORLD, noProc, mpiErr)
      
      if (myRank == 0) write(*,'(a,i4)') 'Number of MPI processes: ', noProc
      narg = command_argument_count () ! see if a filename has been included in command line
      if (narg == 1) then
         call get_command_argument(narg,value=basefilename)
      else 
         if (myRank == 0) then
            write(*,'(a,$)') 'Enter parameter file base name, no extension: '
            read(*,*) basefilename
         end if
         call MPI_BCAST(basefilename,80,MPI_CHARACTER,0,MPI_COMM_WORLD,mpierr)
      end if
      
      write(*,'(a,i3,a)') 'Process ', myRank,' filename: '//trim(basefilename)
      call MPI_FINALIZE(mpierr)         
   end program MPItesting

Process 0 properly writes to the console, but any typing done at the prompt is not echoed, nor is stdin read. If I Ctrl-C out (which is the only way to exit), then what I typed at the program's prompt is instead passed to the Windows command prompt. This started occurring after upgrading to the official 2019 PS XE Cluster release. I have tried using the -s option for mpiexec, which should default to process 0, but this does not help regardless of which process -s is set to. Any ideas? Thanks!

Intel MPI Library Runtime Environment for Windows


Dear Intel Team

If one intends to create software (the "Software") on a Windows platform that utilizes the free version of the Intel MPI Library (the "MPILIB"), does a user of that Software have to register at

https://registrationcenter.intel.com/en/forms/?ProductID=1745

to acquire the required Intel MPI Library Runtime Environment for Windows (the "RUNENV")? Or does the Intel Simplified Software License, i.e.,

https://software.intel.com/en-us/license/intel-simplified-software-license

which by

https://software.intel.com/en-us/articles/end-user-license-agreement

governs the MPILIB/RUNENV, grant permission to redistribute the RUNENV?

TL;DR: May a developer of a Software that utilizes the MPILIB redistribute the RUNENV with his or her Software? Or is a user of that Software required to register and download the RUNENV by himself or herself?

Thanks in advance for your help.

How to map consecutive ranks to same node


Hi,

Intel Parallel Studio Cluster Edition, 2017 Update 5, on CentOS 7.3

I am trying to run a hybrid parallel NWChem job with 2 ranks per 24-core node, 12 threads per rank. The underlying ARMCI library seems to expect consecutive ranks to reside on the same node, i.e., ranks 0 and 1 on node 1, ranks 2 and 3 on node 2, etc. With the simple "mpirun ... -perhost 2", I see round-robin assignment, instead of the documented group-round-robin assignment (4 nodes):

[cchang@login1 03:38:06 /scratch/cchang/C6H6_CCSD_NWC]$ mpirun -h | grep perhost
-perhost place consecutive processes on each host

[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 16252 n1757 {0,1,2,3,4,5,12,13,14,15,16,17}
[0] MPI startup(): 1 3900 n1756 {0,1,2,3,4,5,12,13,14,15,16,17}
[0] MPI startup(): 2 28323 n1738 {0,1,2,3,4,5,12,13,14,15,16,17}
[0] MPI startup(): 3 13358 n1733 {0,1,2,3,4,5,12,13,14,15,16,17}
[0] MPI startup(): 4 16253 n1757 {6,7,8,9,10,11,18,19,20,21,22,23}
[0] MPI startup(): 5 3901 n1756 {6,7,8,9,10,11,18,19,20,21,22,23}
[0] MPI startup(): 6 28324 n1738 {6,7,8,9,10,11,18,19,20,21,22,23}
[0] MPI startup(): 7 13359 n1733 {6,7,8,9,10,11,18,19,20,21,22,23}

If I try I_MPI_PIN_DOMAIN, or using a hexmap in the nodefile, all ranks end up on the same node:

[cchang@login1 03:31:24 /scratch/cchang/C6H6_CCSD_NWC]$ cat nodefile
n2123:2 binding=map=[03F03F,FC0FC0]
n1942:2 binding=map=[03F03F,FC0FC0]
n1915:2 binding=map=[03F03F,FC0FC0]
n1876:2 binding=map=[03F03F,FC0FC0]
[cchang@login1 03:31:27 /scratch/cchang/C6H6_CCSD_NWC]$ head -20 proc8.log
[0] MPI startup(): Multi-threaded optimized library
[0] MPI startup(): shm data transfer mode
[1] MPI startup(): shm data transfer mode
[2] MPI startup(): shm data transfer mode
[3] MPI startup(): shm data transfer mode
[4] MPI startup(): shm data transfer mode
[5] MPI startup(): shm data transfer mode
[6] MPI startup(): shm data transfer mode
[7] MPI startup(): shm data transfer mode
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 8510 n2123 {0,1,2,3,4,5,12,13,14,15,16,17}
[0] MPI startup(): 1 8511 n2123 {6,7,8,9,10,11,18,19,20,21,22,23}
[0] MPI startup(): 2 8512 n2123 {0,1,2,3,4,5,12,13,14,15,16,17}
[0] MPI startup(): 3 8513 n2123 {6,7,8,9,10,11,18,19,20,21,22,23}
[0] MPI startup(): 4 8514 n2123 {0,1,2,3,4,5,12,13,14,15,16,17}
[0] MPI startup(): 5 8515 n2123 {6,7,8,9,10,11,18,19,20,21,22,23}
[0] MPI startup(): 6 8516 n2123 {0,1,2,3,4,5,12,13,14,15,16,17}
[0] MPI startup(): 7 8517 n2123 {6,7,8,9,10,11,18,19,20,21,22,23}
...

What is Intel's preferred mechanism to achieve paired consecutive ranks, with each multi-threaded rank bound to a socket?

Thanks; Chris

Extraordinarily Slow First AllToAllV Performance with Intel MPI Compared to MPT


Dear Intel MPI Gurus,

We've been trying to track down why a code that runs quite well with HPE MPT on our Haswell-based SGI/HPE InfiniBand cluster is just way too slow when we use Intel MPI. Eventually, we think we found that it was the first AllToAllV call inside this program where Intel MPI was just "halting" before it proceeded. We've created a reproducer that seems to support this theory, whose core is shown below (I can share the full reproducer):

   do iter = 1,3
      t0 = mpi_wtime()
      call MPI_AlltoAllV(s_buf, s_count, s_disp,MPI_INTEGER, &
            r_buf, r_count, r_disp, MPI_INTEGER, MPI_COMM_WORLD, ierror)
      t1 = mpi_wtime()

      if (rank == 0) then
         write(*,*)"Iter: ", iter, " T = ", t1 - t0
      end if
   end do

We are doing 3 iterations of an MPI_AllToAllV call. That's it. The reproducer can vary the size of the buffers, etc.
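
For anyone who wants to try this, here is a minimal self-contained variant of that loop. The uniform per-rank message count n is an assumption for illustration only; the actual reproducer varies the buffer sizes:

   program alltoallv_timing
      use mpi
      implicit none
      integer, parameter :: n = 10000       ! assumed per-rank count, illustrative only
      integer :: rank, nprocs, ierror, iter, i
      integer, allocatable :: s_buf(:), r_buf(:)
      integer, allocatable :: s_count(:), r_count(:), s_disp(:), r_disp(:)
      double precision :: t0, t1

      call MPI_Init(ierror)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierror)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierror)

      allocate(s_buf(n*nprocs), r_buf(n*nprocs))
      allocate(s_count(nprocs), r_count(nprocs), s_disp(nprocs), r_disp(nprocs))
      s_buf = rank
      s_count = n
      r_count = n
      do i = 1, nprocs
         s_disp(i) = (i-1)*n
         r_disp(i) = (i-1)*n
      end do

      do iter = 1,3
         t0 = MPI_Wtime()
         call MPI_AlltoAllV(s_buf, s_count, s_disp, MPI_INTEGER, &
               r_buf, r_count, r_disp, MPI_INTEGER, MPI_COMM_WORLD, ierror)
         t1 = MPI_Wtime()
         if (rank == 0) write(*,*) "Iter: ", iter, " T = ", t1 - t0
      end do

      call MPI_Finalize(ierror)
   end program alltoallv_timing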

So with HPE MPT 2.17 on this cluster, we see (for a "problem size" of 10000; it's a bit hard to figure out the size of what's happening in the actual code we first saw this in, but my guess is 10000 is smaller than reality):

# nprocs T1 T2 T3
 72 3.383068833500147E-003 1.581361982971430E-003 1.497713848948479E-003
192 8.767310064285994E-003 3.687836695462465E-003 3.472075797617435E-003
312 1.676454907283187E-002 8.718995843082666E-003 8.802385069429874E-003
432 1.770043326541781E-002 1.390126813203096E-002 1.413645874708891E-002
552 2.205356908962131E-002 1.850109826773405E-002 1.872858591377735E-002
672 3.307574009522796E-002 3.664174210280180E-002 3.548037912696600E-002

The first column is the number of processes and the next three are the MPI_Wtime numbers for each iteration of the loop.

Now let's look at Intel MPI 18.0.3.222:

# nprocs T1 T2 T3
 72 0.476876974105835 4.508972167968750E-003 5.246162414550781E-003
192 2.92623281478882 1.846385002136230E-002 1.933908462524414E-002
312 4.00109887123108 3.393721580505371E-002 3.367590904235840E-002
432 6.74378299713135 5.490398406982422E-002 5.541920661926270E-002
552 8.19235110282898 8.167219161987305E-002 8.110594749450684E-002
672 12.1262009143829 0.103807926177979 0.107892990112305

Well, that's not good. The first MPI_AllToAllV call is much slower, and the more processes, the worse it gets. At 672 processes it is nearly 4 orders of magnitude slower than HPE MPT. (Our cluster admins are working on getting Intel MPI 19 installed, but license server changes are making it fun, so I can't report those numbers yet. This is also the *best* I can do, by ignoring SLURM and running 10 cores per node. A straight mpirun is about 3x slower.)

Now, I do have access to a new cluster that is Skylake/OmniPath-based rather than Infiniband. If I run with Intel MPI 18.0.3.222 there:

# nprocs T1 T2 T3
 72 3.640890121459961E-003 2.669811248779297E-003 2.519130706787109E-003
192 9.490966796875000E-003 8.697032928466797E-003 8.977174758911133E-003
312 1.729822158813477E-002 1.571893692016602E-002 1.684498786926270E-002
432 2.593088150024414E-002 2.414894104003906E-002 2.196598052978516E-002
552 3.740596771240234E-002 3.293609619140625E-002 3.402209281921387E-002
672 5.194902420043945E-002 4.933309555053711E-002 5.183196067810059E-002

Better! So, on the plus side, OmniPath doesn't show this issue. The downside is that the OmniPath cluster isn't available for general use yet, and there will be far more HPE nodes for users to use even when it is.

My question for you is: are there some environment variables we can set to allow Intel MPI to have comparable performance on the HPE nodes? It would be nice to start transitioning users from MPT to Intel MPI, because the newer OmniPath cluster is not an HPE machine, so it can't have the HPE MPI stack on it. Thus, if we can start shaking out issues and making sure Intel MPI works and is performant now, the eventual move to Intel MPI for portability will be an easy transition.

Thanks,

Matt

OpenMP slower than no OpenMP


Here is a Friday post with a sufficient lack of information that it will probably be impossible to answer. I have some older Fortran code whose performance I'm trying to improve. VTune shows 75% of serial execution is consumed calculating the numerical Jacobian of an expensive function. It's easy to parallelize. I first used MPI, and that does show modest improvement when a few processes are added, but it does not scale very well, probably because of the large Jacobian matrix that must be broadcast to all the processes.

So, I also tried OpenMP, thinking that it might do slightly better since it does not need to broadcast the matrix. However, when I run the serial code with the OMP directives disabled, it runs 4 times faster than the code with the OMP directives enabled but using only one thread. If more threads are used, some improvement occurs, but it never gets better than the code without the OMP directives.

My question: Does OpenMP incur a large overhead even if only one thread is used, so there is no forking?
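
For context, here is a sketch of the kind of column-parallel numerical Jacobian loop being described, assuming a forward-difference scheme; func is a cheap stand-in for the expensive function, and all names and sizes are illustrative rather than taken from the original code:

program jacobian_omp
   implicit none
   integer, parameter :: n = 200
   double precision, parameter :: h = 1.0d-6
   double precision :: x(n), f0(n), jac(n,n), xp(n), fp(n)
   integer :: j

   call random_number(x)
   call func(x, f0)

   ! Each Jacobian column needs one independent, expensive call to func,
   ! so the columns can be computed in parallel.
   !$omp parallel do private(xp, fp) schedule(dynamic)
   do j = 1, n
      xp = x
      xp(j) = xp(j) + h
      call func(xp, fp)
      jac(:,j) = (fp - f0) / h
   end do
   !$omp end parallel do

   print *, 'jac(1,1) =', jac(1,1)

contains

   subroutine func(xin, fout)   ! stand-in for the expensive model function
      double precision, intent(in)  :: xin(n)
      double precision, intent(out) :: fout(n)
      fout = xin**2 + sum(xin)
   end subroutine func

end program jacobian_omp

Timing a loop like this compiled with and without the OpenMP flag, at a single thread, would at least separate pure directive overhead from other effects of enabling OpenMP compilation.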

Bug: I_MPI_VERSION vanished in the 2019 release and differs from MPI_Get_library_version


Hi,

I just noticed that the mpi.h file included with the 2019 release is missing any #define that helps detect the MPI flavor in use, such as the former I_MPI_VERSION #define.

The problem is that, when calling the MPI_Get_library_version function, the output is:

Intel(R) MPI Library 2019 for Linux* OS

which, I think, should be something like:

MPICH Version: 3.3b2

or close to...

 

I suggest two ways to fix this:

- Modify MPI_Get_library_version to return the mpich version string...

- Re-add the I_MPI_VERSION or any #define, to detect at compile-time that "Intel MPI" is used.
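
Until a compile-time define comes back, one hedged stopgap is a run-time check of the library version string (the same call is available from Fortran; shown here in Fortran for brevity, and the matched substring is an assumption based on the output quoted above):

program libver
   use mpi
   implicit none
   character(len=MPI_MAX_LIBRARY_VERSION_STRING) :: version
   integer :: resultlen, ierror

   call MPI_Init(ierror)
   call MPI_Get_library_version(version, resultlen, ierror)
   if (index(version(1:resultlen), 'Intel(R) MPI') > 0) then
      print *, 'Running on Intel MPI: ', version(1:resultlen)
   end if
   call MPI_Finalize(ierror)
end program libver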

Thanks,

Eric

What is the difference between "-genvall" and "-envall"?


In the man page for mpirun, it says:

-genvall

              Use this option to enable propagation of all environment variables to all MPI processes.

 

 -envall

              Use this option to propagate all environment variables in the current argument set.

 

May I have a plainer explanation of their meanings and the difference between them?

 

Thank you.

Troubles with Intel MPI library


Hello, I'm a student and have just started to learn HPC using the Intel MPI library. I created two virtual machines which run CentOS. First of all, I ran "mpirun -n 2 -ppn 1 ip1, ip2 hostname" and it worked well, and when I tried to run the test program on one node it worked. But when I tried to run the test program from the Intel MPI tests, I got this error. I also used MPICH, but without such troubles.

Bugged MPICH 3.3b2 used in Parallel Studio 2019 initial release


Hi,

I just realized that the Parallel Studio 2019 initial release is using MPICH 3.3b2 which is a buggy release as reported here:

https://lists.mpich.org/pipermail/discuss/2018-April/005447.html

I confirm that the tag upper-limit initialization is not fixed in this release (see MPICH commit c597c8d79deea22), and this is a problem for all PETSc users.
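
For anyone who wants to check what their own installation advertises, the tag upper bound can be queried at run time; a small sketch follows (this only shows the reported limit, and does not by itself prove or disprove the initialization bug in the linked report):

program check_tag_ub
   use mpi
   implicit none
   integer :: ierr
   integer(kind=MPI_ADDRESS_KIND) :: tag_ub
   logical :: flag

   call MPI_Init(ierr)
   call MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, tag_ub, flag, ierr)
   if (flag) then
      write(*,*) 'MPI_TAG_UB =', tag_ub
   else
      write(*,*) 'MPI_TAG_UB attribute not set'
   end if
   call MPI_Finalize(ierr)
end program check_tag_ub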

What a bad choice for an official release!!!

Please consider releasing with MPICH 3.3b3 instead, or no MPI at all!

Eric

Running coupled executables with different thread counts using LSF


Under LSF, how can I run multiple executables with different thread counts and still use the nodes efficiently?

Currently I have to do

#BSUB -R [ptile=7]

#BSUB -R affinity[core(4)]

mpirun  -n 8 -env OMP_NUM_THREADS=2 ./hellope : -n 12 -env OMP_NUM_THREADS=1 ./hellope : -n 8 -env OMP_NUM_THREADS=4 ./hellope

This will yield a total of 60 threads. For a node with 28 processors it would take 3 nodes, but since I have one executable with 4 threads, I have to tile it to allow for the maximum thread count so as to not overlap processes on cores. This means I need to use another node even though I don't need all of the cores. Is there a way I can pack this onto a node, or place the tasks where I want them? I thought maybe something with I_MPI_JOB_RESPECT_PROCESS_PLACEMENT, and maybe defining the affinity somewhere else besides the job card? Any thoughts would be appreciated.

ifort not reporting on outer loops re parallelisation capabilities & dependencies

MPI: I_MPI_NUMVERSION set to 0


Why is I_MPI_NUMVERSION set to 0 in mpi.h? The comments indicate it should be set to a non-zero value corresponding to the numerically expanded version string I_MPI_VERSION. I have checked Intel MPI 2018.0.3 and 2018.0.4 on Windows and they are both set to zero. This would be an incredibly useful compile-time preprocessor define if it were set properly.

Thanks, John

Assertion Failure, Intel MPI (Linux), 2019


Intel MPI 2019 on Linux was installed and tested with several MPI programs (gcc, g++, gfortran from GCC 8.2), with no issues, using the following environment setup.

export I_MPI_DEBUG=5
export I_MPI_LIBRARY_KIND=debug
export I_MPI_OFI_LIBRARY_INTERNAL=1
. ~/intel/compilers_and_libraries_2019.0.117/linux/mpi/intel64/bin/mpivars.sh

I did create a symlink 'mpifort' pointing to mpifc (for compatibility with the mpich/OpenMPI way of doing things).

 

I've been trying to get OpenCoarrays-2.2.0 (opencoarrays.org) working with Intel MPI 2019 on Linux, with gfortran (GCC 8.2), to provide a coarray Fortran (CAF) implementation for development work. Since OpenCoarrays is developed and tested against the MPICH MPI implementation, I was optimistic that Intel MPI could work too, based on MPICH ABI compatibility.

The install.sh script that can be used to build OpenCoarrays finds the expected ULFM routines (see fault-tolerance.org) and builds libcaf_mpi.so with -DUSE_FAILED_IMAGES defined.

-- Looking for signal.h - found
-- Looking for SIGKILL
-- Looking for SIGKILL - found
-- Looking for include files mpi.h, mpi-ext.h
-- Looking for include files mpi.h, mpi-ext.h - not found
-- Looking for MPIX_ERR_PROC_FAILED
-- Looking for MPIX_ERR_PROC_FAILED - found
-- Looking for MPIX_ERR_REVOKED
-- Looking for MPIX_ERR_REVOKED - found
-- Looking for MPIX_Comm_failure_ack
-- Looking for MPIX_Comm_failure_ack - found
-- Looking for MPIX_Comm_failure_get_acked
-- Looking for MPIX_Comm_failure_get_acked - found
-- Looking for MPIX_Comm_shrink
-- Looking for MPIX_Comm_shrink - found
-- Looking for MPIX_Comm_agree
-- Looking for MPIX_Comm_agree - found

However, when attempting to execute code compiled (with mpifc) from coarray Fortran under mpirun, there is a failed assertion, as shown below. This output is from the mpi_caf.o and caf_auxiliary.o object files that comprise libcaf_mpi.so, compiled with -g and linked to a coarray Fortran program (using -fcoarray=lib) also compiled with -g (and with other relevant settings obtained from caf -show, mpicc -show, and mpifc -show). See "Assertion failed in file ../../src/mpid/ch4/src/ch4_comm.h at line 89: 0".

[bmaggard@localhost oca]$ mpiexec.hydra -genv I_MPI_DEBUG=5 -gdb -n 1 ./a.out
mpigdb: attaching to 17651 ./a.out localhost.localdomain
[0] (mpigdb) start
[0]     The program being debugged has been started already.
[0]     Start it from the beginning? (y or n) [answered Y; input not from terminal]
[0]     Temporary breakpoint 1 at 0x402737: file pi_caf.f90, line 1.
[0]     Starting program: /home/bmaggard/oca/a.out
[bmaggard@localhost oca]$ [0]   [Thread debugging using libthread_db enabled]
[0]     Using host libthread_db library "/lib64/libthread_db.so.1".
[0]     [New Thread 0x7ffff42bf700 (LWP 17694)]
[0]     [New Thread 0x7ffff3abe700 (LWP 17695)]
[0]     Detaching after fork from child process 17696.
[0]     Assertion failed in file ../../src/mpid/ch4/src/ch4_comm.h at line 89: 0
[0]     /home/bmaggard/intel//compilers_and_libraries_2019.0.117/linux/mpi/intel64/lib/debug/libmpi.so.12(+0xbb298e) [0x7ffff6a9f98e]
[0]     /home/bmaggard/intel//compilers_and_libraries_2019.0.117/linux/mpi/intel64/lib/debug/libmpi.so.12(MPL_backtrace_show+0x18) [0x7ffff6a9fafd]
[0]     /home/bmaggard/intel//compilers_and_libraries_2019.0.117/linux/mpi/intel64/lib/debug/libmpi.so.12(MPIR_Assert_fail+0x5c) [0x7ffff6101e0b]
[0]     /home/bmaggard/intel//compilers_and_libraries_2019.0.117/linux/mpi/intel64/lib/debug/libmpi.so.12(+0x2fc72c) [0x7ffff61e972c]
[0]     /home/bmaggard/intel//compilers_and_libraries_2019.0.117/linux/mpi/intel64/lib/debug/libmpi.so.12(+0x2fc832) [0x7ffff61e9832]
[0]     /home/bmaggard/intel//compilers_and_libraries_2019.0.117/linux/mpi/intel64/lib/debug/libmpi.so.12(MPIX_Comm_agree+0x518) [0x7ffff61ea221]
[0]     /home/bmaggard/oca/a.out() [0x40399b]
[0]     /home/bmaggard/oca/a.out() [0x404083]
[0]     /home/bmaggard/oca/a.out() [0x402716]
[0]     /home/bmaggard/oca/a.out() [0x416fc5]
[0]     /lib64/libc.so.6(__libc_start_main+0x7a) [0x7ffff4f150aa]
[0]     /home/bmaggard/oca
[0]     Abort(1) on node 0: Internal error
[0]     [Thread 0x7ffff3abe700 (LWP 17695) exited]
[0]     [Thread 0x7ffff42bf700 (LWP 17694) exited]
[0]     [Inferior 1 (process 17680) exited with code 01]
[0] (mpigdb) mpigdb: ending..
mpigdb: kill 17651

The same assertion failure was observed under Win64, and there is a bit more information indicating where to look:

[0] MPI startup(): libfabric version: 1.6.1a1-impi
[0] MPI startup(): libfabric provider: sockets
[0] MPI startup(): Rank    Pid      Node name      Pin cpu
[0] MPI startup(): 0       8364     pe-mgr-laptop  {0,1,2,3,4,5,6,7}
Assertion failed in file c:\iusers\jenkins\workspace\ch4-build-windows\impi-ch4-build-windows-builder\\src\mpid\ch4\src\ch4_comm.h at line 89: 0
No backtrace info available
Abort(1) on node 0: Internal error

Inspecting the mpich source code (https://github.com/pmodels/mpich, tag v3.3b2) of src/mpid/ch4/src/ch4_comm.h shows the following (lines 88-97).

MPL_STATIC_INLINE_PREFIX int MPID_Comm_revoke(MPIR_Comm * comm_ptr, int is_remote)
{
    MPIR_FUNC_VERBOSE_STATE_DECL(MPID_STATE_MPID_COMM_REVOKE);
    MPIR_FUNC_VERBOSE_ENTER(MPID_STATE_MPID_COMM_REVOKE);

    MPIR_Assert(0);

    MPIR_FUNC_VERBOSE_EXIT(MPID_STATE_MPID_COMM_REVOKE);
    return 0;
}

If I comment out the part of the OpenCoarrays-2.2.0 build system (in src/mpi/CMakeFiles.txt) that adds the -DUSE_FAILED_IMAGES definition when building libcaf_mpi.so, then 44 of the first 51 OpenCoarrays-2.2.0 test cases pass with Intel MPI 2019 (none of which use failed images), proving the concept that Intel MPI could work. All 78 tests (including those using failed images) pass with mpich-3.3b3, but mpich is the OpenCoarrays development MPI.

I would like to learn more about this assertion failure, and about how the '#ifdef USE_FAILED_IMAGES' sections in OpenCoarrays-2.2.0/src/mpi/mpi_caf.c interact with Intel MPI to cause this assertion to fail. I also wanted to bring this to the Intel MPI developers' attention as they work toward the release of 2019 Update 1.

Debugging 'Too many communicators'-Error


I have a large code that fails with the error:

Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(532)................: MPI_Comm_split(comm=0xc4027cf0, color=0, key=0, new_comm=0x7ffdb50f2bd0) failed
PMPI_Comm_split(508)................: fail failed
MPIR_Comm_split_impl(260)...........: fail failed
MPIR_Get_contextid_sparse_group(676): Too many communicators (0/16384 free on this process; ignore_id=0)
Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(532)................: MPI_Comm_split(comm=0xc401bcf1, color=1, key=0, new_comm=0x7ffed5aa4fd0) failed
PMPI_Comm_split(508)................: fail failed
MPIR_Comm_split_impl(260)...........: fail failed
MPIR_Get_contextid_sparse_group(676): Too many communicators (0/16384 free on this process; ignore_id=0)
Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(532)................: MPI_Comm_split(comm=0xc4027ce9, color=0, key=0, new_comm=0x7ffe37e477d0) failed
PMPI_Comm_split(508)................: fail failed
MPIR_Comm_split_impl(260)...........: fail failed
MPIR_Get_contextid_sparse_group(676): Too many communicators (0/16384 free on this process; ignore_id=0)
Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(532)................: MPI_Comm_split(comm=0xc401bcf1, color=1, key=0, new_comm=0x7ffd511ac4d0) failed
PMPI_Comm_split(508)................: fail failed
MPIR_Comm_split_impl(260)...........: fail failed
MPIR_Get_contextid_sparse_group(676): Too many communicators (0/16384 free on this process; ignore_id=0)

I would like to debug it. I can reproduce this error in TotalView.

My first idea is to look at the stack trace at the point of the error. If I set a breakpoint at the call to "Get_contextid_sparse_group" or "Comm_split_impl", the error occurs before the breakpoint and TotalView just closes.

If I set it to "Comm_split" i have so many breakpoint, that I can't find the correct one. How can I set a breakpoint in IntelMPI's errorhandeling routine. Some routine must print this "Too many communicators" error-message. Can I set my break-point there?

My second idea is to monitor the number of communicators somehow. The line

Too many communicators (0/16384 free on this process; ignore_id=0)

indicates that MPI knows how many communicators are free at any given time. How can I, as a developer, monitor this number? Is there a function I can call that returns the number of current communicators?

I am open to other ideas on how to track down this "communicator leak".
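
One low-tech option, sketched below, is to route the application's communicator creation and free calls through thin counting wrappers (for example by search and replace); MPI itself does not expose the number of remaining context IDs to applications. Names here are illustrative:

module comm_counter
   use mpi
   implicit none
   integer :: live_comms = 0
contains
   subroutine counted_comm_split(comm, color, key, newcomm, ierror)
      integer, intent(in)  :: comm, color, key
      integer, intent(out) :: newcomm, ierror
      call MPI_Comm_split(comm, color, key, newcomm, ierror)
      live_comms = live_comms + 1
      if (mod(live_comms, 1000) == 0) print *, 'live communicators (approx):', live_comms
   end subroutine counted_comm_split

   subroutine counted_comm_free(comm, ierror)
      integer, intent(inout) :: comm
      integer, intent(out)   :: ierror
      call MPI_Comm_free(comm, ierror)
      live_comms = live_comms - 1
   end subroutine counted_comm_free
end module comm_counter

Wrapping MPI_Comm_dup and MPI_Comm_create the same way, and printing the count together with a location string, usually points at the loop that creates communicators without freeing them.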

IntelMPI DAPL Question


Dear MPI team,

I started receiving these messages from a node after I restarted a slowly moving MPI job.

I can tell these originate from Intel MPI. Do you have any suggestions as to what may be triggering them?

gl0396:SCM:4a7f:aaae7d40: 18 us(18 us):  open_hca: device mlx4_0 not found
gl0396:SCM:4a7f:aaae7d40: 16 us(16 us):  open_hca: device mlx4_0 not found
gl0397:UCM:493a:aaae7d40: 48102 us(48102 us):  create_ah: ERR Invalid argument
[359:gl0397][../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_conn_rc.c:247] error(0x30000): ofa-v2-mlx5_0-1u: could not connect DAPL endpoints: DAT_INSUFFICIENT_RESOURCES()
gl0397:UCM:493a:aaae7d40: 48130 us(28 us): UCM connect: snd ERR -> cm_lid 0 cm_qpn ac1009c0 r_psp 4a7f p_sz=24
[356:gl0394][../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_conn_rc.c:247] error(0x30000): ofa-v2-mlx5_0-1u: could not connect DAPL endpoints: DAT_INSUFFICIENT_RESOURCES()

 

 

Thank you!

Michael

Intel MPI benchmark fails when # bytes > 128: IMB-EXT


Hi Guys,

I just installed Linux and Intel MPI on two machines:

(1) Quite old (~8 years old) SuperMicro server, which has 24 cores (Intel Xeon X7542 X 4). 32 GB memory. OS: CentOS 7.5

(2) New HP ProLiant DL380 server, which has 32 cores (Intel Xeon Gold 6130 X 2). 64 GB memory. OS: OpenSUSE Leap 15

After installing the OS and Intel MPI, I compiled the Intel MPI benchmark and ran it:

$ mpirun -np 4 ./IMB-EXT

It is quite surprising to me that I see the same failure when running IMB-EXT and IMB-RMA, even though the two machines have different OSes and everything else (even the GCC version used to compile the Intel MPI benchmark is different -- on CentOS I used GCC 6.5.0, and on OpenSUSE I used GCC 7.3.1).

On the CentOS machine, I get:

#---------------------------------------------------
# Benchmarking Unidir_Put
# #processes = 2
# ( 2 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
#
#    MODE: AGGREGATE
#
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000         0.05         0.00
            4         1000        30.56         0.13
            8         1000        31.53         0.25
           16         1000        30.99         0.52
           32         1000        30.93         1.03
           64         1000        30.30         2.11
          128         1000        30.31         4.22

and on the OpenSUSE machine, I get

#---------------------------------------------------
# Benchmarking Unidir_Put
# #processes = 2
# ( 2 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
#
#    MODE: AGGREGATE
#
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000         0.04         0.00
            4         1000        14.40         0.28
            8         1000        14.04         0.57
           16         1000        14.10         1.13
           32         1000        13.96         2.29
           64         1000        13.98         4.58
          128         1000        14.08         9.09

When I don't use mpirun (which means there is only one process running IMB-EXT), the benchmark runs through, but Unidir_Put needs >= 2 processes, so that doesn't help much. I also find that the functions with MPI_Put and MPI_Get are much slower than I expected (from my experience). Also, using MVAPICH on the OpenSUSE machine did not help.
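
To help separate an installation problem from a genuinely slow one-sided path, here is a minimal stand-alone timing sketch for a single MPI_Put between two ranks (buffer size and names are assumptions, purely for illustration):

program put_timing
   use mpi
   implicit none
   integer, parameter :: n = 1000
   integer :: rank, np, ierr, win, sizeofint
   integer(kind=MPI_ADDRESS_KIND) :: winsize, disp
   integer :: buf(n), local(n)
   double precision :: t0, t1

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, np, ierr)

   call MPI_Type_size(MPI_INTEGER, sizeofint, ierr)
   winsize = int(n, MPI_ADDRESS_KIND) * sizeofint
   buf = 0
   local = rank
   call MPI_Win_create(buf, winsize, sizeofint, MPI_INFO_NULL, MPI_COMM_WORLD, win, ierr)

   call MPI_Win_fence(0, win, ierr)
   t0 = MPI_Wtime()
   if (rank == 1 .and. np >= 2) then
      disp = 0
      call MPI_Put(local, n, MPI_INTEGER, 0, disp, n, MPI_INTEGER, win, ierr)
   end if
   call MPI_Win_fence(0, win, ierr)          ! completes the put on both sides
   t1 = MPI_Wtime()

   if (rank == 0) write(*,*) 'put + fence time [s]:', t1 - t0
   call MPI_Win_free(win, ierr)
   call MPI_Finalize(ierr)
end program put_timing

If even this simple case shows the poor timings, the issue is more likely in the MPI installation or fabric selection than in IMB itself.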

Is there any chance that I am missing something in installing MPI, or in configuring and installing the OS on these machines?

Thanks a lot in advance,

Jae

MPI without MPIRUN ;point to point using Multiple EndPoints


Hello,

I want to create a cluster dynamically, with say 5 nodes. I want to have members join with connect and accept.

(something like - https://stackoverflow.com/questions/43858346/trying-to-start-another-pro...)

Sample code is attached (it's the code from the Stack Overflow page).

Now, instead of creating one point-to-point communication, I want to use multiple threads, each having its own communicator using an endpoint.

But I am unable to understand how it can be done.

Can someone please provide an example?
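
Not an authoritative answer, but here is a minimal sketch of the basic MPI_Open_port / MPI_Comm_accept / MPI_Comm_connect pattern for joining processes dynamically. The port name is exchanged through a plain file purely for illustration (MPI_Publish_name/MPI_Lookup_name is the standard alternative), and note that per-thread "endpoints" are not part of the MPI standard, so each thread would instead need its own communicator, for example via MPI_Comm_dup of the intercommunicator obtained here:

program dyn_join
   use mpi
   implicit none
   character(len=MPI_MAX_PORT_NAME) :: port_name
   character(len=16) :: role
   integer :: ierror, intercomm

   call MPI_Init(ierror)
   call get_command_argument(1, role)      ! start the 'server' first, then a client

   if (trim(role) == 'server') then
      call MPI_Open_port(MPI_INFO_NULL, port_name, ierror)
      open(10, file='port.txt')
      write(10,'(a)') port_name
      close(10)
      ! Blocks until a client connects; intercomm spans both sides.
      call MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, intercomm, ierror)
   else
      open(10, file='port.txt')
      read(10,'(a)') port_name
      close(10)
      call MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, intercomm, ierror)
   end if

   ! ... exchange data over intercomm here ...

   call MPI_Comm_disconnect(intercomm, ierror)
   call MPI_Finalize(ierror)
end program dyn_join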

regards

New MPI error with Intel 2019.1, unable to run MPI hello world


After upgrading to Update 1 of Intel 2019 we are not able to run even an MPI hello world example. This is new behavior; e.g., a Spack-installed GCC 8.2.0 and Open MPI have no trouble on this system. This is a single workstation and only shm needs to work. For non-MPI use the compilers work correctly. Presumably dependencies have changed slightly in this new update?

$ cat /etc/redhat-release
Red Hat Enterprise Linux Workstation release 7.5 (Maipo)
$ source /opt/intel2019/bin/compilervars.sh intel64
$ mpiicc -v
mpiicc for the Intel(R) MPI Library 2019 Update 1 for Linux*
Copyright 2003-2018, Intel Corporation.
icc version 19.0.1.144 (gcc version 4.8.5 compatibility)
$ cat mpi_hello_world.c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
  // Initialize the MPI environment
  MPI_Init(NULL, NULL);

  // Get the number of processes
  int world_size;
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);

  // Get the rank of the process
  int world_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  // Get the name of the processor
  char processor_name[MPI_MAX_PROCESSOR_NAME];
  int name_len;
  MPI_Get_processor_name(processor_name, &name_len);

  // Print off a hello world message
  printf("Hello world from processor %s, rank %d out of %d processors\n",
	 processor_name, world_rank, world_size);

  // Finalize the MPI environment.
  MPI_Finalize();
}
$ mpiicc ./mpi_hello_world.c
$ ./a.out
Abort(1094543) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(639)......:
MPID_Init(860).............:
MPIDI_NM_mpi_init_hook(689): OFI addrinfo() failed (ofi_init.h:689:MPIDI_NM_mpi_init_hook:No data available)
$ export I_MPI_FABRICS=shm:ofi
$ export I_MPI_DEBUG=666
$ ./a.out
[0] MPI startup(): Imported environment partly inaccesible. Map=0 Info=0
[0] MPI startup(): libfabric version: 1.7.0a1-impi
Abort(1094543) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(639)......:
MPID_Init(860).............:
MPIDI_NM_mpi_init_hook(689): OFI addrinfo() failed (ofi_init.h:689:MPIDI_NM_mpi_init_hook:No data available)

 
