I have a problem where, in about 1-2% of mpiruns (0.1-0.2% of the ssh processes launched by mpiexec.hydra), one of the ssh processes fails to launch and becomes a zombie. As a consequence, the whole run hangs forever.
With setenv I_MPI_DEBUG 1 and -verbose added to the mpirun command I get some information (see Bug.txt attached). The node that in this case failed to start is qnode0708, and if you wade through the file you will see no "Start PMI_proxy 5".
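To avoid wading through the file by hand, something like the following could list the proxies that never started. It is only a rough sketch: the function name `missing_proxies` is made up for illustration, and it assumes that each proxy that launches produces a line containing "Start PMI_proxy N", that N counts from 0, and that you know how many proxies (nodes) the run should have started.

```python
import re
import sys

# Assumed log format: one "Start PMI_proxy <N>" line per proxy that launched.
PROXY_START = re.compile(r"Start PMI_proxy (\d+)")


def missing_proxies(log_text, expected_count):
    """Return the proxy indices in 0..expected_count-1 with no start line."""
    started = {int(m.group(1)) for m in PROXY_START.finditer(log_text)}
    return [i for i in range(expected_count) if i not in started]


if __name__ == "__main__":
    # Hypothetical usage: python find_missing.py Bug.txt <number_of_proxies>
    path, count = sys.argv[1], int(sys.argv[2])
    with open(path, errors="replace") as handle:
        print(missing_proxies(handle.read(), count))
```

If the numbering in the log turns out to start from 1 rather than 0, the range in `missing_proxies` would need to be shifted accordingly.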
At the moment I do not know whether this is an impi issue (version 4.1 is being used), an ssh race condition (this appears to be possible), something to do with the large cluster I am using, or something else. Two specific questions:
a) Has anyone seen anything like this?
b) Is there a way to launch with "ssh -v", which might be informative? I cannot find anything about how to do this.
N.B., I am 99.99% certain that this has nothing to do with the code being run, its compilation or anything else of that kind. In fact, the failure occurs equally often for three very different MPI executables.