Hey Forum,
I'm trying to run an MPI job on Azure using Intel's MPI implementation, but I'm hitting a problem. Everything works as expected when run on one Agent (8 processors); however, anything with 2 or more Agents fails in MPI_Init() roughly 25% of the time. The failure is instantaneous (see output below). I was also able to reproduce this crash with a simple point-to-point send between all processors. I'm unable to reproduce the issue on my local system.
Are there any known issues with Intel MPI on Azure's virtual machines? Any idea why it may crash on initialization only some of the time?
The current workaround has been simply to use Microsoft's MPI library, but I would really like to figure out what the source of the problem is.
Thank you kindly.
Error output:
Master Agent: 10
Information
Secondary Agent: 2
Information
Secondary Agent: 16
Information
Secondary Agent: 12
Information
Fatal error in MPI_Init: Other MPI error, error stack:
Error
job aborted:
Debug
rank: node: exit code[: error message]
Debug
MPIR_Init_thread(658)......................:
Error
MPID_Init(195).............................: channel initialization failed
Error
0: Agent10: 123
Debug
MPIDI_CH3_Init(104)........................:
Error
1: Agent10: 1
Debug
2: Agent10: 1
Debug
MPID_nem_tcp_post_init(345)................:
Error
MPID_nem_newtcp_module_connpoll(3102)......:
Error
3: Agent10: 1
Debug
recv_id_or_tmpvc_info_success_handler(1330): read from socket failed - No error
Error
4: Agent10: 1: process 4 exited without calling finalize
Debug
Fatal error in MPI_Init: Other MPI error, error stack:
Error
5: Agent10: 1: process 5 exited without calling finalize
Debug
6: Agent10: 1: process 6 exited without calling finalize
Debug
MPIR_Init_thread(658)................:
Error
7: Agent10: 1: process 7 exited without calling finalize
Debug
MPID_Init(195).......................: channel initialization failed
Error
MPIDI_CH3_Init(104)..................:
Error
8: Agent2: 123
Debug
MPID_nem_tcp_post_init(345)..........:
Error
9: Agent2: 1: process 9 exited without calling finalize
Debug
MPID_nem_newtcp_module_connpoll(3102):
Error
10: Agent2: 123
Debug
gen_read_fail_handler(1196)..........: read from socket failed - The specified network name is no longer available.
Error
11: Agent2: 123
Debug
Fatal error in MPI_Init: Other MPI error, error stack:
Error
12: Agent2: 123
Debug
MPIR_Init_thread(658)................:
Error
13: Agent2: 1: process 13 exited without calling finalize
Debug
MPID_Init(195).......................: channel initialization failed
Error
14: Agent2: 123
Debug
MPIDI_CH3_Init(104)..................:
Error
15: Agent2: 123
Debug
MPID_nem_tcp_post_init(345)..........:
Error
16: Agent12: 123
Debug
MPID_nem_newtcp_module_connpoll(3102):
Error
17: Agent12: 123
Debug
gen_read_fail_handler(1196)..........: read from socket failed - The specified network name is no longer available.
Error
18: Agent12: 123
Debug
19: Agent12: 123
Debug
Fatal error in MPI_Init: Other MPI error, error stack:
Error
MPIR_Init_thread(658)................:
Error
20: Agent12: 123
Debug
MPID_Init(195).......................: channel initialization failed
Error
21: Agent12: 123
Debug
MPIDI_CH3_Init(104)..................:
Error
22: Agent12: 123
Debug
23: Agent12: 123
Debug
MPID_nem_tcp_post_init(345)..........:
Error
MPID_nem_newtcp_module_connpoll(3102):
Error
24: Agent16: 1: process 24 exited without calling finalize
Debug
25: Agent16: 1: process 25 exited without calling finalize
Debug
gen_read_fail_handler(1196)..........: read from socket failed - The specified network name is no longer available.
Error
Fatal error in MPI_Init: Other MPI error, error stack:
Error
26: Agent16: 1: process 26 exited without calling finalize
Debug
MPIR_Init_thread(658)................:
Error
27: Agent16: 1: process 27 exited without calling finalize
Debug
MPID_Init(195).......................: channel initialization failed
Error
28: Agent16: 1: process 28 exited without calling finalize
Debug
MPIDI_CH3_Init(104)..................:
Error
29: Agent16: 1: process 29 exited without calling finalize
Debug
MPID_nem_tcp_post_init(345)..........:
Error
30: Agent16: 1: process 30 exited without calling finalize
Debug
MPID_nem_newtcp_module_connpoll(3102):
Error
31: Agent16: 1: process 31 exited without calling finalize
Debug
gen_read_fail_handler(1196)..........: read from socket failed - The specified network name is no longer available.
Error
Fatal error in MPI_Init: Other MPI error, error stack:
Error
MPIR_Init_thread(658)................:
Error
MPID_Init(195).......................: channel initialization failed
Error
MPIDI_CH3_Init(104)..................:
Error
MPID_nem_tcp_post_init(345)..........:
Error
MPID_nem_newtcp_module_connpoll(3102):
Error
gen_read_fail_handler(1196)..........: read from socket failed - The specified network name is no longer available.
Error
Fatal error in MPI_Init: Other MPI error, error stack:
Error
MPIR_Init_thread(658)................:
Error
MPID_Init(195).......................: channel initialization failed
Error
MPIDI_CH3_Init(104)..................:
Error
MPID_nem_tcp_post_init(345)..........:
Error
MPID_nem_newtcp_module_connpoll(3102):
Error
gen_read_fail_handler(1196)..........: read from socket failed - The specified network name is no longer available.
Error
Fatal error in MPI_Init: Other MPI error, error stack:
Error
MPIR_Init_thread(658)................:
Error
MPID_Init(195).......................: channel initialization failed
Error
MPIDI_CH3_Init(104)..................:
Error
MPID_nem_tcp_post_init(345)..........:
Error
MPID_nem_newtcp_module_connpoll(3102):
Error
gen_read_fail_handler(1196)..........: read from socket failed - The specified network name is no longer available.
Error
Fatal error in MPI_Init: Other MPI error, error stack:
Error
MPIR_Init_thread(658)................:
Error
MPID_Init(195).......................: channel initialization failed
Error
MPIDI_CH3_Init(104)..................:
Error
MPID_nem_tcp_post_init(345)..........:
Error
MPID_nem_newtcp_module_connpoll(3102):
Error
gen_read_fail_handler(1196)..........: read from socket failed - The specified network name is no longer available.
Error
Fatal error in MPI_Init: Other MPI error, error stack:
Error
MPIR_Init_thread(658)................:
Error
MPID_Init(195).......................: channel initialization failed
Error
MPIDI_CH3_Init(104)..................:
Error
MPID_nem_tcp_post_init(345)..........:
Error
MPID_nem_newtcp_module_connpoll(3102):
Error
gen_read_fail_handler(1196)..........: read from socket failed - The specified network name is no longer available.
Error
Fatal error in MPI_Init: Other MPI error, error stack:
Error
MPIR_Init_thread(658)................:
Error
MPID_Init(195).......................: channel initialization failed
Error
MPIDI_CH3_Init(104)..................:
Error
MPID_nem_tcp_post_init(345)..........:
Error
MPID_nem_newtcp_module_connpoll(3102):
Error
gen_read_fail_handler(1196)..........: read from socket failed - The specified network name is no longer available.
Error
Fatal error in MPI_Init: Other MPI error, error stack:
Error
MPIR_Init_thread(658)................:
Error
MPID_Init(195).......................: channel initialization failed
Error
MPIDI_CH3_Init(104)..................:
Error
MPID_nem_tcp_post_init(345)..........:
Error
MPID_nem_newtcp_module_connpoll(3102):
Error
gen_read_fail_handler(1196)..........: read from socket failed - The specified network name is no longer available.
Error
Fatal error in MPI_Init: Other MPI error, error stack:
Error
MPIR_Init_thread(658)................:
Error
MPID_Init(195).......................: channel initialization failed
Error
MPIDI_CH3_Init(104)..................:
Error
MPID_nem_tcp_post_init(345)..........:
Error
MPID_nem_newtcp_module_connpoll(3102):
Error
gen_read_fail_handler(1196)..........: read from socket failed - The specified network name is no longer available.
Error
Fatal error in MPI_Init: Other MPI error, error stack:
Error
MPIR_Init_thread(658)................:
Error
MPID_Init(195).......................: channel initialization failed
Error
MPIDI_CH3_Init(104)..................:
Error
MPID_nem_tcp_post_init(345)..........:
Error
MPID_nem_newtcp_module_connpoll(3102):
Error
gen_read_fail_handler(1196)..........: read from socket failed - The specified network name is no longer available.
Error
Fatal error in MPI_Init: Other MPI error, error stack:
Error
MPIR_Init_thread(658)................:
Error
MPID_Init(195).......................: channel initialization failed
Error
MPIDI_CH3_Init(104)..................:
Error
MPID_nem_tcp_post_init(345)..........:
Error
MPID_nem_newtcp_module_connpoll(3102):
Error
gen_read_fail_handler(1196)..........: read from socket failed - The specified network name is no longer available.
Error
Fatal error in MPI_Init: Other MPI error, error stack:
Error
MPIR_Init_thread(658)................:
Error
MPID_Init(195).......................: channel initialization failed
Error
MPIDI_CH3_Init(104)..................:
Error
MPID_nem_tcp_post_init(345)..........:
Error
MPID_nem_newtcp_module_connpoll(3102):
Error
gen_read_fail_handler(1196)..........: read from socket failed - The specified network name is no longer available.
Error
Fatal error in PMPI_Isend: Other MPI error, error stack:
Error
PMPI_Isend(161).................: MPI_Isend(buf=000000D8B7D7F89C, count=1, MPI_INT, dest=1, tag=0, MPI_COMM_WORLD, request=000000D8B7D7F8A0) failed
Error
MPIDI_CH3_EagerContigIsend(554).: failure occurred while attempting to send an eager message
Error
MPID_nem_newtcp_iSendContig(440):
Error
MPIU_SOCKW_Writev(454)..........: Unable to write to a socket, An existing connection was forcibly closed by the remote host.
Error
(errno 10054)
Error
Fatal error in PMPI_Isend: Other MPI error, error stack:
Error
PMPI_Isend(161).................: MPI_Isend(buf=0000001C452DFA5C, count=1, MPI_INT, dest=1, tag=0, MPI_COMM_WORLD, request=0000001C452DFA60) failed
Error
MPIDI_CH3_EagerContigIsend(554).: failure occurred while attempting to send an eager message
Error
MPID_nem_newtcp_iSendContig(440):
Error
MPIU_SOCKW_Writev(454)..........: Unable to write to a socket, An existing connection was forcibly closed by the remote host.
Error
(errno 10054)
Error
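Since the job summary lines are interleaved with the error stacks, it can help to tally them separately. The summary lines follow the `rank: node: exit code[: error message]` format shown in the output, so a short script can count exit codes per node. This is only a minimal sketch in Python, assuming the output has been saved as a list of text lines; `parse_summary` and `exit_codes_by_node` are hypothetical helper names for illustration, not part of any MPI tooling.

```python
import re
from collections import Counter, defaultdict

# Matches job-summary lines of the form "rank: node: exit code[: error message]".
# Error-stack lines such as "MPID_Init(195)...: ..." do not start with a rank
# number, so they are skipped.
SUMMARY_LINE = re.compile(r"^(\d+): (\S+): (-?\d+)(?:: (.*))?$")


def parse_summary(lines):
    """Return (rank, node, exit_code, message) tuples for summary lines."""
    records = []
    for line in lines:
        match = SUMMARY_LINE.match(line.strip())
        if match:
            rank, node, code, message = match.groups()
            records.append((int(rank), node, int(code), message or ""))
    return records


def exit_codes_by_node(records):
    """Count how many ranks on each node finished with each exit code."""
    counts = defaultdict(Counter)
    for _rank, node, code, _message in records:
        counts[node][code] += 1
    return counts


if __name__ == "__main__":
    # A few lines copied from the output above, including non-summary lines.
    sample = [
        "0: Agent10: 123",
        "Debug",
        "1: Agent10: 1",
        "4: Agent10: 1: process 4 exited without calling finalize",
        "8: Agent2: 123",
        "MPID_Init(195).......................: channel initialization failed",
    ]
    for node, codes in exit_codes_by_node(parse_summary(sample)).items():
        print(node, dict(codes))
```

Grouping by node makes it easier to see which Agents report which exit codes, which may help narrow down where the first connection was dropped.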