Channel: Clusters and HPC Technology

Intel MPI intermittent failures


I have a couple of users who are experiencing intermittent failures with Intel MPI (typically versions 4.1.0 or 4.1.3) on RedHat 6.4 and 6.7 systems running Mellanox OFED 2.0 and 3.1, respectively.

The error messages being seen are as follows:

[80:node1] unexpected reject event from 16:node2
Assertion failed in file ../../dapl_conn_rc.c at line 992: 0

or

[0:node1] unexpected DAPL event 0x4003
Assertion failed in file ../../dapl_init_rc.c at line 1332: 0
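
One way to check whether the failures cluster on particular nodes is to tally the node names across collected job logs. The following is a minimal sketch, assuming the messages always have the two shapes quoted above; the function name `tally_dapl_errors` and the regular expressions are illustrative and not part of Intel MPI.

```python
import re
from collections import Counter

# Patterns modeled only on the two messages quoted above (assumption:
# other DAPL error formats exist and are simply not matched here).
REJECT_RE = re.compile(
    r"\[(\d+):([^\]]+)\] unexpected reject event from (\d+):(\S+)"
)
EVENT_RE = re.compile(
    r"\[(\d+):([^\]]+)\] unexpected DAPL event (0x[0-9a-fA-F]+)"
)


def tally_dapl_errors(lines):
    """Return (nodes, events): how often each node and event code appears."""
    nodes = Counter()
    events = Counter()
    for line in lines:
        match = REJECT_RE.search(line)
        if match:
            nodes[match.group(2)] += 1  # node reporting the reject
            nodes[match.group(4)] += 1  # peer node named in the message
            continue
        match = EVENT_RE.search(line)
        if match:
            nodes[match.group(2)] += 1
            events[match.group(3)] += 1
    return nodes, events


if __name__ == "__main__":
    sample = [
        "[80:node1] unexpected reject event from 16:node2",
        "[0:node1] unexpected DAPL event 0x4003",
    ]
    node_counts, event_counts = tally_dapl_errors(sample)
    print(node_counts.most_common())
    print(event_counts.most_common())
```

If one node or node pair dominates the tally over many failed jobs, that would point toward a host or link problem rather than the MPI library itself.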

These errors occur very intermittently on both systems. I believe the jobs rely on the default values for I_MPI_FABRICS (shm:dapl) and I_MPI_DAPL_PROVIDER (which should be ofa-v2-mlx4_0-1 on both systems).
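
To rule out the defaults as a variable, the two settings could be pinned explicitly for a test run. This is a minimal sketch, assuming the values named above are the intended ones; the helper name `pinned_mpi_env` is hypothetical, and the I_MPI_DEBUG level is an assumption chosen to make the library report its fabric selection at startup.

```python
import os


def pinned_mpi_env(base_env=None,
                   fabrics="shm:dapl",
                   provider="ofa-v2-mlx4_0-1",
                   debug_level="2"):
    """Return a copy of the environment with the Intel MPI settings pinned."""
    env = dict(os.environ if base_env is None else base_env)
    env["I_MPI_FABRICS"] = fabrics         # value believed to be the default
    env["I_MPI_DAPL_PROVIDER"] = provider  # provider expected on both systems
    env["I_MPI_DEBUG"] = debug_level       # assumed level; raise for more detail
    return env


# Usage sketch (the command line and ./app are placeholders):
#   import subprocess
#   subprocess.run(["mpirun", "-n", "4", "./app"], env=pinned_mpi_env())
```

Comparing the debug output of a failing job against this pinned run should show whether both are really using the same fabric and provider.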

These look like DAPL-layer errors. Any ideas on what might cause these sorts of intermittent failures?

Thanks!

 

