For one specific code, which makes a range of Intel MPI (impi) calls, there is a probability of about 0.1% of obtaining incorrect results when using impi versions later than 5.0.2.044 on machines with AVX-512 registers (Intel Xeon Gold 6130, 2.10 GHz). The problem is highly reproducible, although because it is a low-probability error it takes 1K or more repeat runs to see it, and quantifying the exact rate is difficult. (It seems to be slightly more common when there is other activity on the node.) It is not a catastrophic error: the run completes, but the results it produces are slightly incorrect.
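To give a feel for why so many repeat runs are needed, here is a minimal sketch of the arithmetic, assuming each run fails independently with the same probability (an assumption, since the rate appears to depend on node activity). The function names are my own and purely illustrative, and the 0.1% figure is only the rough estimate quoted above.

```python
import math


def prob_at_least_one_failure(p: float, runs: int) -> float:
    """Probability of seeing one or more bad results in `runs` independent runs."""
    return 1.0 - (1.0 - p) ** runs


def runs_needed(p: float, confidence: float) -> int:
    """Smallest number of independent runs that shows at least one
    failure with probability >= `confidence`."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


if __name__ == "__main__":
    p = 0.001  # assumed per-run failure rate (about 0.1%)
    for n in (100, 1000, 3000):
        print(n, round(prob_at_least_one_failure(p, n), 3))
    print("runs for 95% chance of one failure:", runs_needed(p, 0.95))
```

Under these assumptions, 1K runs give only about a 63% chance of seeing the error at all, and roughly 3K runs are needed for a 95% chance, so a clean batch of a few hundred runs says very little.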
The problem does not occur on Xeon E5-2650 machines (AVX), or on the older E5410 machines that we have.
There are zero indications that this has anything to do with errors on our InfiniBand (ib) network. We have spent significant time on this with the cluster vendor and found no indicators of a fabric problem.
I regressed the ifort compiler, and even removed the mkl scalapack entirely (using the Netlib version instead), so I am 99.9999% certain that it is impi related: only regressing the impi version removed the problem.
Of course, because it is a low-probability error, it may well have gone undetected until now.
N.B., I originally posted this to one of the Intel support links, but for some reason they wanted it to go to a forum rather than providing me with direct support.