Channel: Clusters and HPC Technology

intel mpi crash at many ranks


Hi,

We're testing Intel MPI (Intel 19, patch 1) on CentOS 7.5; it is a Linux cluster with an InfiniBand network.

Testing the Intel MPI benchmark, we found that it works fine at small scale (400 MPI ranks on 10 nodes), but at larger scales like 100 nodes (100 * 40 = 4000 MPI ranks) it crashes with the message shown at the bottom. I recompiled libfabric, but that doesn't improve the situation, and I_MPI_DEBUG=5 doesn't give us the details either. Would there be any way to track down the cause of the crash? fi_info results are shown below for reference. Any comments are appreciated.
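A sketch of how to get more detail out of a run like this; I_MPI_DEBUG, I_MPI_HYDRA_DEBUG, and FI_LOG_LEVEL are documented Intel MPI / libfabric controls, though the exact output varies by build:

# Raise MPI-level and fabric-level verbosity and capture everything to a file.
export I_MPI_DEBUG=6          # more detail than level 5
export I_MPI_HYDRA_DEBUG=1    # trace the hydra launcher as well
export FI_LOG_LEVEL=debug     # libfabric provider tracing
mpirun -np 4000 -machinefile hosts ./IMB-EXT 2>&1 | tee imb-ext.log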

Thanks,

BJ

PS1.

$ fi_info 
provider: verbs;ofi_rxm
    fabric: IB-0xfe80000000000000
    domain: mlx5_0
    version: 1.0
    type: FI_EP_RDM
    protocol: FI_PROTO_RXM
provider: verbs
    fabric: IB-0xfe80000000000000
    domain: mlx5_0
    version: 1.0
    type: FI_EP_MSG
    protocol: FI_PROTO_RDMA_CM_IB_RC
provider: verbs
    fabric: IB-0xfe80000000000000
    domain: mlx5_0-dgram
    version: 1.0
    type: FI_EP_DGRAM
    protocol: FI_PROTO_IB_UD
 

PS2. Crash message ( mpirun  -np 4000 -genv I_MPI_DEBUG 5  -machinefile hosts ./IMB-EXT ) 

[proxy:0:

# Bidir_Get
# Bidir_Put
# Accumulate
Abort(743005711) on node 3856 (rank 3856 in comm 0): Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-32766, key=3856, new_comm=0x27f9f44) failed
PMPI_Comm_split(489)...................:
MPIR_Comm_split_impl(167)..............:
MPIDI_SHMGR_Gather_generic(1195).......:
MPIDI_NM_mpi_allgather(352)............:
MPIR_Allgather_intra_knomial(216)......:
MPIC_Isend(525)........................:
MPID_Isend(345)........................:
MPIDI_OFI_send_lightweight_request(110):
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:726:MPIDI_OFI_send_handler:Invalid argument)
[cli_3856]: readline failed
Abort(273243663) on node 1664 (rank 1664 in comm 0): Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-32766, key=1664, new_comm=0xb14a534) failed
PMPI_Comm_split(489)...................:
MPIR_Comm_split_impl(167)..............:
MPIDI_SHMGR_Gather_generic(1195).......:
MPIDI_NM_mpi_allgather(352)............:
MPIR_Allgather_intra_knomial(216)......:
MPIC_Isend(525)........................:
MPID_Isend(345)........................:
MPIDI_OFI_send_lightweight_request(110):
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:726:MPIDI_OFI_send_handler:Connection timed out)
[cli_1664]: readline failed
[... the same "OFI tagged inject failed ... Invalid argument" stack repeats for ranks 3872, 2782, 3906, 3907, 3306, 2542, 3380, 3782, 1072, 2942, 2958, 3552, 3630, 3634, 2822, 2704, 2100, 3348, 3446, 3450, 3824, 3937, 3979, 3826, 3982, 3975, and 3572 ...]
[proxy:0:82@atom84] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydra_sock_intel.c:353): write error (Bad file descriptor)
[proxy:0:82@atom84] cmd_bcast_non_root (../../../../../src/pm/i_hydra/proxy/proxy_cb.c:33): error reading command
[proxy:0:82@atom84] proxy_upstream_control_cb (../../../../../src/pm/i_hydra/proxy/proxy_cb.c:155): error forwarding cmd downstream
[proxy:0:82@atom84] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:79): callback returned error status
[proxy:0:82@atom84] main (../../../../../src/pm/i_hydra/proxy/proxy.c:1035): error waiting for event
[... identical proxy error blocks follow from atom90, atom72, atom23, atom42, atom66, atom60, atom36, atom48, atom16, atom30, and atom9 (exiting at proxy.c:1035 or proxy.c:989), interleaved with 21 repetitions of the following line ...]
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydra_sock_intel.c:353): write error (Bad file descriptor)


intel mpi at 4000 ranks


Hi, we're testing Intel MPI on CentOS 7.5 with InfiniBand interconnect.

Using the Intel MPI benchmark, small-scale tests (10 nodes, 400 MPI ranks) look OK, while a 100-node (4000-rank) job crashes. FI_LOG_LEVEL=debug yielded the following messages:

libfabric:verbs:fabric:fi_ibv_create_ep():173<info> rdma_create_ep: Invalid argument(22)
libfabric:ofi_rxm:ep_ctrl:rxm_eq_sread():575<warn> fi_eq_readerr: err: 111, prov_err: Unknown error -28 (-28)
libfabric:verbs:fabric:fi_ibv_set_default_attr():1085<info> Ignoring provider default value for tx rma_iov_limit as it is greater than the value supported by domain: mlx5_0

Would there be any way to trace the cause of these issues? Any comments are appreciated.
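One way to narrow this down, as a sketch assuming Intel MPI 2019 over libfabric: pin the OFI provider explicitly to see whether the failure is specific to the verbs;ofi_rxm path (FI_PROVIDER is a standard libfabric variable):

# Force a specific libfabric provider for an otherwise identical run.
export I_MPI_FABRICS=ofi
export FI_PROVIDER=verbs      # explicit IB path; compare against "sockets"
mpirun -np 4000 -machinefile hosts ./IMB-MPI1 allreduce

If the sockets provider survives at 4000 ranks, the problem is likely in the verbs/rxm path rather than in the benchmark or the launcher.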

Thanks,

BJ

Where can I download MPI runtime redistributable as a separate package


Hi, I am having difficulty locating the runtime redistributables package (.tz/.tar.gz) on your website. Can anyone point me to the download location?

How to use MCDRAM in Hybrid Mode on Theta


Hi,

I have a question about how to use MCDRAM in hybrid mode. When using MCDRAM in hybrid mode, call the part of MCDRAM used as a cache the "cache path" and the part used as addressable memory the "HBM path". Can I allocate data only on the cache path, or only on the HBM path, using numactl -m as in flat mode? I assume that by default in hybrid mode data is allocated on the cache path only, and that by adding numactl -m the data can be allocated on the HBM path only. I don't know if my guess is right or not. Any suggestions or commands are welcome.
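For reference, a sketch of how the addressable part is usually reached on a KNL-style node, where the flat portion of MCDRAM shows up as its own NUMA node (node 1 below is an assumption; check numactl -H first). The cache portion is managed by hardware and is not addressable, so numactl can only target the flat portion:

numactl -H            # list NUMA nodes; the flat MCDRAM partition appears as a separate node
numactl -m 1 ./a.out  # bind all allocations to the flat MCDRAM node (fails when it is full)
numactl -p 1 ./a.out  # prefer MCDRAM, fall back to DDR when MCDRAM is exhausted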

I appreciate all your feedback, Thank you! 

HPCC benchmark "Begin of MPIRandomAccess section" hangs


Hello,

I am trying to run the HPCC-1.5.0 benchmark on the cluster using the Intel 2019 compilers and MPI. I was able to compile the hpcc code successfully and run it on up to 4 cores on the head node. But if I increase the number of cores, the benchmark seems to hang at the "Begin of MPIRandomAccess section" line (the very first benchmark test). I can run the same code successfully using the Intel 2013 compilers and MPI. Has anyone else seen anything similar, or does anyone have pointers to what could be happening and how to fix it? Any help is greatly appreciated!
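As a first isolation step, a sketch (assuming Intel MPI 2019 defaults; I_MPI_DEBUG and I_MPI_FABRICS are documented controls):

# Print which fabric/provider gets selected at the failing core count.
I_MPI_DEBUG=5 mpirun -np 8 ./hpcc
# For a single-node run, restrict to shared memory to rule out the network stack.
I_MPI_FABRICS=shm mpirun -np 8 ./hpcc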

Thank you

Krishna

Intel MPI - Unable to run on Microsoft Server 2016


We are trying to run in parallel on a single node using Intel MPI 2018.0.124 and are getting the following error.

..\hydra\pm\pmiserv\pmiserv_cb.c (834): connection to proxy 0 at host XXX-NNNN failed
..\hydra\tools\demux\demux_select.c (103): callback returned error status
..\hydra\pm\pmiserv\pmiserv_pmci.c (507): error waiting for event
..\hydra\ui\mpich\mpiexec.c (1148): process manager error waiting for completion

 We have checked hydra-service status and found that to be working. 

mpiexec also seems to be working ok. 

mpiexec -n 2 hostname - returns the localhost name

mpiexec -validate - returns success

We have also checked that the hydra service is running the version we want, and it is the only version on the machine.

Is there anything we can do to check why the runs fail?

Thanks!

 

IMB Alltoall hang with Intel Parallel Studio 2018.0.3


Hi,

When running IMB Alltoall at 32 ranks/node on 100 nodes, the job stalls before printing the 0-byte data. Processes appear to be in sched_yield() when traced. With 2, 4, 8, or 16 ranks/node, the job runs fine.

The cluster is dual-socket Skylake, 18 cores/socket, running CentOS 7.4; ibv_devinfo output is shown below. We've been having reproducible trouble with Intel MPI at high rank counts on our system, but are still troubleshooting whether it's a fabric or an MPI issue.

The job is launched with:

srun -n 3200 --cpu-bind=verbose --ntasks-per-socket=16 src/IMB-MPI1 -npmin 3200 Alltoall
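One knob worth trying, as a sketch: I_MPI_ADJUST_ALLTOALL is a documented Intel MPI control that selects among the Alltoall algorithm implementations, which can sidestep a hang in one particular algorithm:

# Try each Alltoall algorithm in turn (values 1-4 select different
# implementations, e.g. Bruck's, isend/irecv, pairwise exchange).
for alg in 1 2 3 4; do
  I_MPI_ADJUST_ALLTOALL=$alg srun -n 3200 --ntasks-per-socket=16 \
    src/IMB-MPI1 -npmin 3200 Alltoall
done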

 

Thanks; Chris

 

[cchang@r4i2n26 ~]$ ibv_devinfo -v
hca_id: mlx5_0
    transport:            InfiniBand (0)
    fw_ver:               12.21.1000
    node_guid:            506b:4b03:002b:e41e
    sys_image_guid:       506b:4b03:002b:e41e
    vendor_id:            0x02c9
    vendor_part_id:       4115
    hw_ver:               0x0
    board_id:             SGI_P0001721_X
    phys_port_cnt:        1
    max_mr_size:          0xffffffffffffffff
    page_size_cap:        0xfffffffffffff000
    max_qp:               262144
    max_qp_wr:            32768
    device_cap_flags:     0xe17e1c36  BAD_PKEY_CNTR BAD_QKEY_CNTR AUTO_PATH_MIG CHANGE_PHY_PORT PORT_ACTIVE_EVENT SYS_IMAGE_GUID RC_RNR_NAK_GEN XRC  Unknown flags: 0xe16e0000
    device_cap_exp_flags: 0x5048F8F100000000  EXP_DC_TRANSPORT EXP_CROSS_CHANNEL EXP_MR_ALLOCATE EXT_ATOMICS EXT_SEND NOP EXP_UMR EXP_ODP EXP_RX_CSUM_TCP_UDP_PKT EXP_RX_CSUM_IP_PKT EXP_DC_INFO EXP_MASKED_ATOMICS EXP_RX_TCP_UDP_PKT_TYPE EXP_PHYSICAL_RANGE_MR  Unknown flags: 0x200000000000
    max_sge:              30
    max_sge_rd:           30
    max_cq:               16777216
    max_cqe:              4194303
    max_mr:               16777216
    max_pd:               16777216
    max_qp_rd_atom:       16
    max_ee_rd_atom:       0
    max_res_rd_atom:      4194304
    max_qp_init_rd_atom:  16
    max_ee_init_rd_atom:  0
    atomic_cap:           ATOMIC_HCA (1)
    log atomic arg sizes (mask)                             0x8
    masked_log_atomic_arg_sizes (mask)                      0x3c
    masked_log_atomic_arg_sizes_network_endianness (mask)   0x34
    max fetch and add bit boundary   64
    log max atomic inline            5
    max_ee:               0
    max_rdd:              0
    max_mw:               16777216
    max_raw_ipv6_qp:      0
    max_raw_ethy_qp:      0
    max_mcast_grp:        2097152
    max_mcast_qp_attach:  240
    max_total_mcast_qp_attach: 503316480
    max_ah:               2147483647
    max_fmr:              0
    max_srq:              8388608
    max_srq_wr:           32767
    max_srq_sge:          31
    max_pkeys:            128
    local_ca_ack_delay:   16
    hca_core_clock:       156250
    max_klm_list_size:    65536
    max_send_wqe_inline_klms: 20
    max_umr_recursion_depth:  4
    max_umr_stride_dimension: 1
    general_odp_caps:     ODP_SUPPORT ODP_SUPPORT_IMPLICIT
    max_size:             0xFFFFFFFFFFFFFFFF
    rc_odp_caps:          SUPPORT_SEND SUPPORT_RECV SUPPORT_WRITE SUPPORT_READ
    uc_odp_caps:          NO SUPPORT
    ud_odp_caps:          SUPPORT_SEND
    dc_odp_caps:          SUPPORT_SEND SUPPORT_WRITE SUPPORT_READ
    xrc_odp_caps:         NO SUPPORT
    raw_eth_odp_caps:     NO SUPPORT
    max_dct:              262144
    max_device_ctx:       1020
    Multi-Packet RQ is not supported
    rx_pad_end_addr_align: 64
    tso_caps:
        max_tso:          0
    packet_pacing_caps:
        qp_rate_limit_min: 0kbps
        qp_rate_limit_max: 0kbps
    ooo_caps:
        ooo_rc_caps  = 0x0
        ooo_xrc_caps = 0x0
        ooo_dc_caps  = 0x0
        ooo_ud_caps  = 0x0
    sw_parsing_caps:
        supported_qp:
    tag matching not supported
    tunnel_offloads_caps:
Device ports:
    port: 1
        state:            PORT_ACTIVE (4)
        max_mtu:          4096 (5)
        active_mtu:       4096 (5)
        sm_lid:           1
        port_lid:         2000
        port_lmc:         0x00
        link_layer:       InfiniBand
        max_msg_sz:       0x40000000
        port_cap_flags:   0x2651e848
        max_vl_num:       4 (3)
        bad_pkey_cntr:    0x0
        qkey_viol_cntr:   0x0
        sm_sl:            0
        pkey_tbl_len:     128
        gid_tbl_len:      8
        subnet_timeout:   18
        init_type_reply:  0
        active_width:     4X (2)
        active_speed:     25.0 Gbps (32)
        phys_state:       LINK_UP (5)
        GID[ 0]:          fec0:0000:0000:0000:506b:4b03:002b:e41e

General question of Intel Trace Analyzer and Collector


Hi:

I'm new here and need to ask some questions about this tool: Intel Trace Analyzer and Collector.

Is this software Intel Xeon exclusive, or can it run on multiple platforms?

Also, if our company wants to purchase this tool, where should I ask?

 

Many thanks

Chi


"dapl fabric is not available and fallback fabric is not enabled"


Hi Support team,

I am trying to use Intel MPI with the DAPL fabric to run molding simulation software on an InfiniBand/RDMA fabric.

But I get the error "dapl fabric is not available and fallback fabric is not enabled".

 

Detail info:

Cluster  nodes:
CPU: Intel Xeon E5-1620
RAM: 32 GB
NIC: Mellanox ConnectX-5 VPI adapter
Driver: WinOF-2 v2.10.50010
OS: Windows Server 2016 Standard

Test Case 1:
Using Microsoft MPI (Microsoft HPC Pack 2016 Update 2 + fixes), everything works well on the InfiniBand fabric.

Test Case 2:
Replacing MS-MPI with Intel MPI (IMPI) - Intel MPI 2018 - I get the error "dapl fabric is not available and fallback fabric is not enabled" when executing the Intel MPI command. My command is as below:

Command: c:\Users\Administrator>"C:\Program Files\Intel MPI 2018\x64\impiexec.exe" -genv I_MPI_DEBUG 5 -DAPL -host2 192.168.191.21 192.168.181.22 1 \\IBCN3\Moldex3D_R17]Bin\IMB-MPI1.exe

Please advise,

 

BRs,

Jeffrey

Job terminates abnormally on HPC using intel mpi


Hello all,

I recently installed the program cp2k on an HPC system using "Intel(R) Parallel Studio XE 2017 Update 4 for Linux". After a successful installation, when I run the executable with mpirun -machinefile $PBS_NODEFILE -n 40 ./cp2k $var >& out, I get the following error message at the end of the output file and my job terminates:

rank = 33, revents = 25, state = 8
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c at line 2988: (it_plfd->revents & POLLERR) == 0
internal ABORT - process 7

I'm using the following job script:

#!/bin/bash
#PBS -N test
#PBS -q mini
#PBS -l nodes=2:ppn=20
cd $PBS_O_WORKDIR
export I_MPI_FABRICS=shm:tcp
export I_MPI_MPD_TMPDIR=/scratch/$USER
EXEC=~/cp2k-6.1/exe/Linux-x86-64-intelx/cp2k.popt
cp $EXEC cp2k
mpirun -machinefile $PBS_NODEFILE -n 40 ./cp2k $var >& out
rm cp2k
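The failed assertion involving POLLERR in socksm.c indicates a TCP socket error between ranks, so a quick connectivity check across the allocated nodes may help; a sketch (run inside the job, using whatever hosts $PBS_NODEFILE contains):

# Confirm every allocated node is reachable and resolves consistently.
for h in $(sort -u $PBS_NODEFILE); do
  ping -c 1 "$h" > /dev/null && echo "$h ok" || echo "$h UNREACHABLE"
done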

I shall be grateful for the help.

Thank you,
Raghav

Running executable compiled with MPI without using mpirun


Hello,

I'm having trouble running the abinit (8.10.2) executable (electronic structure program; www.abinit.org) after compiling with the Intel 19 Update 3 compilers and with MPI enabled (64 bit intel linux).

If I compile with either the gnu tools or the Intel tools (icc, ifort), and without MPI enabled, I can directly run the abinit executable with no errors.

If I compile with the gnu tools and MPI enabled (using openmpi), I can still run the abinit executable directly (without using mpirun) without errors.

If I compile with the Intel tools (mpiicc, mpiifort) and MPI enabled (using intel MPI), and then try to run the abinit executable directly (without mpirun), then it fails with the following error when trying to read in the input file (t01.input):

abinit < t01.input > OUT-traceback
forrtl: severe (24): end-of-file during read, unit 5, file /proc/26824/fd/0
Image                 PC                              Routine                 Line           Source             
libifcoremt.so.5   00007F0847FAC7B6  for__io_return        Unknown  Unknown
libifcoremt.so.5   00007F0847FEAC00  for_read_seq_fmt   Unknown  Unknown
abinit                  000000000187BC1F  m_dtfil_mp_iofn1_   1363        m_dtfil.F90
abinit                  0000000000407C49  MAIN__                    251          abinit.F90
abinit                  0000000000407942  Unknown                 Unknown  Unknown
libc-2.27.so         00007F08459E4B97  __libc_start_main     Unknown  Unknown
abinit                  000000000040782A  Unknown                 Unknown  Unknown

If I compile with the Intel tools and MPI enabled and run the abinit executable with "mpirun -np 1 abinit < t01.input > OUT-traceback" then reading the input file succeeds.

Running the MPI enabled executable without mpirun succeeds when compiled with the gnu tools, but not when compiled with the intel tools.

A colleague of mine compiled abinit with MPI enabled using the Intel 17 compiler and is able to run the abinit executable without mpirun.

I am using Intel Parallel Studio XE Cluster Edition Update 3 and source psxevars.sh to set the environment before compiling/running with intel. The output of mpiifort -V is:

mpiifort -V
Intel(R) Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.0.3.199 Build 20190206
Copyright (C) 1985-2019 Intel Corporation.  All rights reserved.

Any ideas on what is causing this forrtl crash?
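To separate abinit's input handling from the MPI runtime's singleton-init path, a minimal test may help; a sketch, assuming mpiicc is on the PATH:

# Build a tiny MPI program and run it directly (singleton init, no mpirun),
# with stdin redirected the same way abinit is invoked.
cat > singleton.c <<'EOF'
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);   /* singleton init when run without mpirun */
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("singleton rank %d\n", rank);
    MPI_Finalize();
    return 0;
}
EOF
mpiicc singleton.c -o singleton
./singleton < /dev/null    # if this also fails, the problem is the runtime, not abinit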

Thanks for any suggestions.

I_MPI_WAIT_MODE replacement in Intel MPI?

Errors when compiling MUMPS


Hi,

I installed MUMPS using the Intel parallel libraries, but when I run the examples it shows fatal errors. Can you help me with the problem? Thanks in advance.

 

Here is the error:

[mlin4@min-workstation examples]$ ./dsimpletest < input_simpletest_real
Abort(1094543) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(649)......: 
MPID_Init(863).............: 
MPIDI_NM_mpi_init_hook(705): OFI addrinfo() failed (ofi_init.h:705:MPIDI_NM_mpi_init_hook:No data available)
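For what it's worth, "OFI addrinfo() failed" at MPI_Init commonly means the OFI layer found no usable provider on the machine. A sketch of a first isolation step, assuming a single-workstation run with Intel MPI 2019 (I_MPI_FABRICS and FI_PROVIDER are documented controls):

# Restrict Intel MPI to shared memory (valid for a single-node run), or force
# a plain TCP/IP libfabric provider, and rerun the example.
export I_MPI_FABRICS=shm
./dsimpletest < input_simpletest_real
# alternatively:
# FI_PROVIDER=sockets ./dsimpletest < input_simpletest_real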
 

Here is the make.inc file:

#
#  This file is part of MUMPS 5.1.2, released
#  on Mon Oct  2 07:37:01 UTC 2017
#
#Begin orderings

# NOTE that PORD is distributed within MUMPS by default. It is recommended to
# install other orderings. For that, you need to obtain the corresponding package
# and modify the variables below accordingly.
# For example, to have Metis available within MUMPS:
#          1/ download Metis and compile it
#          2/ uncomment (suppress # in first column) lines
#             starting with LMETISDIR,  LMETIS
#          3/ add -Dmetis in line ORDERINGSF
#             ORDERINGSF  = -Dpord -Dmetis
#          4/ Compile and install MUMPS
#             make clean; make   (to clean up previous installation)
#
#          Metis/ParMetis and SCOTCH/PT-SCOTCH (ver 6.0 and later) orderings are recommended.
#

#SCOTCHDIR  = ${HOME}/scotch_6.0
#ISCOTCH    = -I$(SCOTCHDIR)/include
#
# You have to choose one among the following two lines depending on
# the type of analysis you want to perform. If you want to perform only
# sequential analysis choose the first (remember to add -Dscotch in the ORDERINGSF
# variable below); for both parallel and sequential analysis choose the second 
# line (remember to add -Dptscotch in the ORDERINGSF variable below)

#LSCOTCH    = -L$(SCOTCHDIR)/lib -lesmumps -lscotch -lscotcherr
#LSCOTCH    = -L$(SCOTCHDIR)/lib -lptesmumps -lptscotch -lptscotcherr

LPORDDIR = $(topdir)/PORD/lib/
IPORD    = -I$(topdir)/PORD/include/
LPORD    = -L$(LPORDDIR) -lpord

LMETISDIR = /home/mlin4/metis/build/Linux-x86_64/libmetis
IMETIS    = -I/home/mlin4/metis/include

# You have to choose one among the following two lines depending on
# the type of analysis you want to perform. If you want to perform only
# sequential analysis choose the first (remember to add -Dmetis in the ORDERINGSF
# variable below); for both parallel and sequential analysis choose the second 
# line (remember to add -Dparmetis in the ORDERINGSF variable below)

LMETIS    = -L$(LMETISDIR) -lmetis
#LMETIS    = -L$(LMETISDIR) -lparmetis -lmetis

# The following variables will be used in the compilation process.
# Please note that -Dptscotch and -Dparmetis imply -Dscotch and -Dmetis respectively.
# If you want to use Metis 4.X or an older version, you should use -Dmetis4 instead of -Dmetis
# or in addition with -Dparmetis (if you are using parmetis 3.X or older).
#ORDERINGSF = -Dscotch -Dmetis -Dpord -Dptscotch -Dparmetis
#ORDERINGSF  = -Dpord -Dmetis -Dparmetis
ORDERINGSF  = -Dpord -Dmetis
ORDERINGSC  = $(ORDERINGSF)

LORDERINGS = $(LMETIS) $(LPORD) $(LSCOTCH)
IORDERINGSF = $(ISCOTCH)
IORDERINGSC = $(IMETIS) $(IPORD) $(ISCOTCH)

#End orderings
########################################################################
################################################################################

PLAT    =
LIBEXT  = .a
OUTC    = -o 
OUTF    = -o 
RM = /bin/rm -f
CC = mpiicc
FC = mpiifort
FL = mpiifort
AR = ar vr 
#RANLIB = ranlib
RANLIB  = echo
# Make this variable point to the path where the Intel MKL library is
# installed. It is set to the default install directory for Intel MKL.
MKLROOT=/home/mlin4/opt/intel/mkl/lib/intel64
LAPACK = -L$(MKLROOT) -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core
SCALAP = -L$(MKLROOT) -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64

LIBPAR = $(SCALAP) $(LAPACK)

INCSEQ = -I$(topdir)/libseq
LIBSEQ  = $(LAPACK) -L$(topdir)/libseq -lmpiseq

LIBBLAS = -L$(MKLROOT) -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core 
LIBOTHERS = -lpthread

#Preprocessor defs for calling Fortran from C (-DAdd_ or -DAdd__ or -DUPPER)
CDEFS   = -DAdd_

#Begin Optimized options
OPTF    = -O -nofor_main -DBLR_MT -qopenmp # or -openmp for old compilers
OPTL    = -O -nofor_main -qopenmp
OPTC    = -O -qopenmp
#End Optimized options
 
INCS = $(INCPAR)
LIBS = $(LIBPAR)
LIBSEQNEEDED =

 

 

Best,

Min

 

MPI Processes - socket mapping and threads per process - core mapping


Hi,
I have a node with 2 sockets and 20 cores per socket (Intel(R) Xeon(R) Gold 6148 CPU).
I wish to launch 1 process per socket and 20 threads per process, and, if possible, all threads should be pinned to their respective cores.

Earlier I used to run Intel binaries on a Cray machine with a similar core count, and the syntax was
aprun -n (mpi tasks) -N (tasks per node) -S (tasks per socket) -d (thread depth) <executable>, for example:

OMP_NUM_THREADS=20
aprun -n4 -N2 -S1 -d $OMP_NUM_THREADS ./a.out
 

node 0 socket 0 process#0 nprocs 4 thread id  0  nthreads 20  core id  0
node 0 socket 0 process#0 nprocs 4 thread id  1  nthreads 20  core id  1
....
node 0 socket 0 process#0 nprocs 4 thread id 19  nthreads 20  core id 19
node 0 socket 1 process#1 nprocs 4 thread id  0  nthreads 20  core id 20
...
node 0 socket 0 process#1 nprocs 4 thread id 19  nthreads 20  core id 39
....
node 1 socket 0 process#1 nprocs 4 thread id 19  nthreads 20  core id 39

 

How can I achieve the same or an equivalent effect using Intel's mpirun?
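A sketch of the usual Intel MPI equivalent; I_MPI_PIN_DOMAIN, OMP_PLACES, and OMP_PROC_BIND are documented controls, and the rank/thread counts mirror the aprun example above:

# 1 rank per socket, 20 threads per rank, threads pinned to their own cores.
export OMP_NUM_THREADS=20
export OMP_PLACES=cores        # one place per physical core
export OMP_PROC_BIND=close     # pack threads onto consecutive cores
export I_MPI_PIN_DOMAIN=socket # each rank is confined to one socket
mpirun -np 4 -ppn 2 ./a.out    # 2 ranks/node on 2 nodes, like aprun -n4 -N2 -S1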

How could I download previous version of Intel MPI library?


Hi,

I hope I can download version 4.1.0 of the Intel MPI library for a 32-bit application on Windows, but I can find nothing on Intel's website. I don't know if there are any backups of that Intel MPI library on the Intel website. I would appreciate it if someone could share some clues.

Please, I am waiting for that.


amplxe: Error: Ftrace is already in use.


Hi,
I am trying to run VTune Amplifier 2019 to collect system-overview data as follows:

export NPROCS=36
export OMP_NUM_THREADS=1
mpirun -genv OMP_NUM_THREADS $OMP_NUM_THREADS -np $NPROCS  amplxe-cl -collect system-overview  -result-dir /home/puneet/run_node02_impi2019_profiler_systemoverview/profiles/attempt1_p${NPROCS}_t${OMP_NUM_THREADS}  -quiet $INSTALL_ROOT/main/wrf.exe

I had collected hpc-performance data without any issue. Afterwards, I ran the aforementioned command but had to kill it (the result dir was incorrect). When I re-ran amplxe-cl, I got the following error messages:
 

amplxe: Error: Ftrace is already in use. Make sure to stop previous collection first.
amplxe: Error: Ftrace is already in use. Make sure to stop previous collection first.
amplxe: Error: Ftrace is already in use. Make sure to stop previous collection first.

I have tried deleting /home/puneet/run_node02_impi2019_profiler_systemoverview/profiles/* and I have also rebooted the node, but even then those error messages show up.
Please advise.
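One possible cause, stated as an assumption: the killed collection may have left the kernel's ftrace facility enabled. The paths below are the standard debugfs tracing interface, not VTune-specific:

# Check whether kernel tracing is still active from the killed collection.
cat /sys/kernel/debug/tracing/tracing_on                 # "1" means tracing is on
echo 0 | sudo tee /sys/kernel/debug/tracing/tracing_on   # turn it off
echo   | sudo tee /sys/kernel/debug/tracing/set_event    # clear enabled events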
 

I_MPI_AUTH_METHOD not working with IMPI 19 Update 3


Hi,

I moved from Intel MPI 2018 Update 4 to Intel MPI 2019 Update 3, and it seems that the new version ignores setting the user authorization method for mpirun via the environment variable I_MPI_AUTH_METHOD=delegate. Setting it directly with mpiexec -delegate still works fine. Can anyone confirm this?

https://software.intel.com/en-us/mpi-developer-reference-windows-user-au...

Thanks and kind regards,

Volker Jacht

Bug report: mpicc illegally pre-pends my LIBRARY_PATH


When trying to link my application using the mpicc wrapper of Intel MPI 2018.4, it prepends several paths to my LIBRARY_PATH. I set this variable to use a custom library instead of the one installed on my system, but since the system library's path is also prepended, my program is silently linked against the wrong library. There is nothing I can do about this except specify the library path explicitly via an -L option during linking, which I don't want to do!
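A sketch of the behavior and the workaround the post mentions (the paths and library name here are hypothetical): directories given with -L on the command line are searched before LIBRARY_PATH directories, so an explicit -L overrides whatever the wrapper prepends.

export LIBRARY_PATH=/opt/custom/lib   # custom library location (hypothetical)
mpicc app.c -lfoo                     # may silently link the system libfoo
mpicc app.c -L/opt/custom/lib -lfoo   # explicit -L wins over LIBRARY_PATH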

In my opinion, all wrapper scripts should only POST-pend their paths to the environment variables! The user-specified paths must always win!

Georg

 

Wrong MPICH_NUMVERSION in mpi.h on Windows?


Hi,

we've recently moved to Intel 19 Update 3 MPI and noticed that the MPICH_VERSION and MPICH_NUMVERSION macros in mpi.h are wrong/confusing on Windows and also differ from Linux:

2019.3.203/intel64/include/mpi.h:504 (windows)
#define MPICH_VERSION "0.0"
#define MPICH_NUMVERSION 300

on Linux:

#define MPICH_VERSION "3.3b3"
#define MPICH_NUMVERSION 30300103

The latter is much more reasonable and works fine, while the former seems to be broken.

In the end, that 3-digit version number makes PETSc assume it is using an old MPICH version, so it compares "0.0" with the result of MPI_Get_library_version(), which eventually fails on Windows.
https://bitbucket.org/petsc/petsc/src/f03f29e6b9f50a9f9419f7d348de13f7c6...

Can you please fix the version number on Windows?

Thank you and kind regards,

Volker Jacht

Can not find impi.dll on Windows


Hello again,

I have Intel MPI 2019 Update 3 SDK installed on Windows and set up a "hello world" example like this:
https://software.intel.com/en-us/mpi-developer-guide-windows-configuring...

But when I try to start my program with mpiexec.exe, it cannot find impi.dll.

After some research I noticed that in prior versions impi.dll was located in the same folder as mpiexec (C:\Program Files (x86)\IntelSWTools\mpi\2019.3.203\intel64\bin) and it worked right out of the box, but in the current version it seems to be missing.

I have also noticed that all the library symlinks in C:\Program Files (x86)\IntelSWTools\mpi\2019.3.203\intel64\lib\ are missing, too.

Thanks for your help and kind regards,

Volker Jacht
