I have a large code base that fails with the following error:
Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(532)................: MPI_Comm_split(comm=0xc4027cf0, color=0, key=0, new_comm=0x7ffdb50f2bd0) failed
PMPI_Comm_split(508)................: fail failed
MPIR_Comm_split_impl(260)...........: fail failed
MPIR_Get_contextid_sparse_group(676): Too many communicators (0/16384 free on this process; ignore_id=0)
Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(532)................: MPI_Comm_split(comm=0xc401bcf1, color=1, key=0, new_comm=0x7ffed5aa4fd0) failed
PMPI_Comm_split(508)................: fail failed
MPIR_Comm_split_impl(260)...........: fail failed
MPIR_Get_contextid_sparse_group(676): Too many communicators (0/16384 free on this process; ignore_id=0)
Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(532)................: MPI_Comm_split(comm=0xc4027ce9, color=0, key=0, new_comm=0x7ffe37e477d0) failed
PMPI_Comm_split(508)................: fail failed
MPIR_Comm_split_impl(260)...........: fail failed
MPIR_Get_contextid_sparse_group(676): Too many communicators (0/16384 free on this process; ignore_id=0)
Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(532)................: MPI_Comm_split(comm=0xc401bcf1, color=1, key=0, new_comm=0x7ffd511ac4d0) failed
PMPI_Comm_split(508)................: fail failed
MPIR_Comm_split_impl(260)...........: fail failed
MPIR_Get_contextid_sparse_group(676): Too many communicators (0/16384 free on this process; ignore_id=0)
I would like to debug it. I can reproduce this error in TotalView.
My first idea is to look at the stack trace at the point of the error. If I set a breakpoint on "Get_contextid_sparse_group" or "Comm_split_impl", the error occurs before the breakpoint is hit and TotalView just closes.
If I set it on "Comm_split", I get so many breakpoint hits that I can't find the correct one. How can I set a breakpoint in Intel MPI's error-handling routine? Some routine must print this "Too many communicators" error message. Can I set my breakpoint there?
My second idea is to monitor the number of communicators somehow. The line
Too many communicators (0/16384 free on this process; ignore_id=0)
indicates that MPI knows how many communicators are free at any given time. How can I, as a developer, monitor this number? Is there a function I can call that returns the current number of communicators?
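I have not found a standard MPI call that returns this number, so one fallback I can imagine is counting by hand: intercept the communicator-creating and communicator-freeing calls through the PMPI profiling interface and keep a table of live handles. Below is a minimal sketch of only the bookkeeping part. All names in it (comm_track_create, comm_track_free, comm_track_live, comm_track_report, MAX_TRACKED) are hypothetical, and the wrappers in the comment assume the application goes through the C bindings; Fortran callers would presumably need their own wrappers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical table size; chosen to match the limit quoted in the error. */
#define MAX_TRACKED 16384

typedef struct {
    uintptr_t handle;   /* communicator handle converted to an integer */
    const char *site;   /* label for where the communicator was created */
    int in_use;         /* 1 while the communicator has not been freed */
} comm_record;

static comm_record records[MAX_TRACKED];
static int live_count = 0;

/* Record a newly created communicator. Returns 0, or -1 if the table is full. */
int comm_track_create(uintptr_t handle, const char *site)
{
    for (int i = 0; i < MAX_TRACKED; ++i) {
        if (!records[i].in_use) {
            records[i].handle = handle;
            records[i].site = site;
            records[i].in_use = 1;
            ++live_count;
            return 0;
        }
    }
    return -1;
}

/* Forget a freed communicator. Returns 0 if it was tracked, -1 otherwise. */
int comm_track_free(uintptr_t handle)
{
    for (int i = 0; i < MAX_TRACKED; ++i) {
        if (records[i].in_use && records[i].handle == handle) {
            assert(live_count > 0);
            records[i].in_use = 0;
            --live_count;
            return 0;
        }
    }
    return -1;
}

/* Number of communicators created but not yet freed. */
int comm_track_live(void)
{
    return live_count;
}

/* Print every communicator that is still alive, with its creation label. */
void comm_track_report(FILE *out)
{
    fprintf(out, "%d communicator(s) still alive\n", live_count);
    for (int i = 0; i < MAX_TRACKED; ++i) {
        if (records[i].in_use) {
            fprintf(out, "  handle=0x%lx created at %s\n",
                    (unsigned long)records[i].handle, records[i].site);
        }
    }
}

/*
 * Intended use from PMPI wrappers linked into the application. Shown only
 * as a comment because it needs <mpi.h> and an MPI library:
 *
 *   int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)
 *   {
 *       int rc = PMPI_Comm_split(comm, color, key, newcomm);
 *       if (rc == MPI_SUCCESS && *newcomm != MPI_COMM_NULL)
 *           comm_track_create((uintptr_t)*newcomm, "MPI_Comm_split");
 *       return rc;
 *   }
 *
 *   int MPI_Comm_free(MPI_Comm *comm)
 *   {
 *       comm_track_free((uintptr_t)*comm);
 *       return PMPI_Comm_free(comm);
 *   }
 */
```

With something like this in place, I could call comm_track_report(stderr) periodically, or put a conditional breakpoint on comm_track_create once comm_track_live() passes some threshold, instead of stopping at every Comm_split. Other creating calls such as MPI_Comm_dup or MPI_Comm_create would need the same kind of wrapper.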
I am open to other ideas on how to track down this "communicator leak".