I have a couple of users who are seeing intermittent failures with Intel MPI (typically versions 4.1.0 or 4.1.3) on Red Hat 6.4 and 6.7 systems running Mellanox OFED 2.0 and 3.1, respectively.
The error messages are as follows:
[80:node1] unexpected reject event from 16:node2
Assertion failed in file ../../dapl_conn_rc.c at line 992: 0
or
[0:node1] unexpected DAPL event 0x4003
Assertion failed in file ../../dapl_init_rc.c at line 1332: 0
These errors occur only rarely, and on both systems. I believe the jobs rely on the default values for I_MPI_FABRICS (shm:dapl) and I_MPI_DAPL_PROVIDER (which should be ofa-v2-mlx4_0-1 on both systems).
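For reference, pinning those two settings explicitly instead of relying on the defaults would look roughly like the sketch below. The I_MPI_DEBUG line is an addition that is not part of the users' jobs; it should make the startup output report which fabric and provider were actually selected. The rank count and ./my_app in the commented mpirun line are placeholders.

```shell
# Pin the fabric and DAPL provider rather than relying on the defaults.
export I_MPI_FABRICS=shm:dapl
export I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1

# Assumption: raise debug verbosity so the selected fabric/provider is logged at startup.
export I_MPI_DEBUG=5

# Placeholder launch line; the rank count and binary are hypothetical.
# mpirun -n 2 ./my_app
```

If the debug output on a failing run shows a different provider than expected, that would point at the defaults rather than the DAPL layer itself.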
These look like DAPL-layer errors. Any ideas on what might cause this sort of intermittent failure?
Thanks!