Channel: Clusters and HPC Technology

Intel Cluster Checker collection issue


Dear All,

I'm using Intel(R) Cluster Checker 2017 Update 2 (build 20170117), installed locally on the master node in /opt/intel as part of Intel Parallel Studio XE.

However, when running clck-collect I get the following error for all connected compute nodes.

[root@master ~]# clck-collect -a -f nodelist
computenode02: bash: /opt/intel/clck/2017.2.019/libexec/clck_run_provider.sh: No such file or directory
pdsh@master: computenode02: ssh exited with exit code 127

computenode01: bash: /opt/intel/clck/2017.2.019/libexec/clck_run_provider.sh: No such file or directory
pdsh@master: computenode01: ssh exited with exit code 127

Please advise whether this is an issue with the installation of Parallel Studio, or whether I'm missing something.
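(In case it helps the diagnosis: exit code 127 means the shell on the compute nodes could not find that script, so a first check is whether the clck installation tree is visible there at all. One possible check, reusing pdsh; adjust the node names as needed:

pdsh -w computenode01,computenode02 ls -l /opt/intel/clck/2017.2.019/libexec/clck_run_provider.sh

If the file is missing, the compute nodes are probably not seeing /opt/intel from the master, so Cluster Checker would need to be installed on them or on a shared filesystem.)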


How to disable intra-node communication


I would like to test the network latency/bandwidth of each node that I am running on in parallel. I think the simplest way to do this would be to have each node test itself. 

 

My question is: how can I force all Intel MPI TCP communication to go through the network adapter, and not use the optimized node-local communication?
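(One knob that may do this, offered as a suggestion to verify against the Intel MPI reference manual: setting I_MPI_FABRICS to a single fabric, e.g.

mpirun -genv I_MPI_FABRICS tcp -n 2 ./your_benchmark

disables the shm path, so even ranks on the same node go through the TCP netmod. Note, though, that the kernel may still route node-local TCP traffic over the loopback interface rather than the physical adapter, so a two-node run per adapter may be a more faithful test. ./your_benchmark is just a placeholder.)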
 

Any advice would be greatly appreciated.

 

Best Regards,

John

MPI ISend/IRecv deadlock on AWS EC2


Hi, I'm encountering an unexpected deadlock in this Fortran test program, compiled using Parallel Studio XE 2017 Update 4 on an Amazon EC2 cluster (Linux system).

$ mpiifort -traceback nbtest.f90 -o test.x

 

On one node the program runs just fine, but on any more it deadlocks, leading me to suspect an inter-node communication failure, though my knowledge in this area is lacking. FYI, the test code is hardcoded to be run on 16 cores.

Any help or insight is appreciated!

Danny

Code

program nbtest

  use mpi
  implicit none

  !***____________________ Definitions _______________
  integer, parameter :: r4 = SELECTED_REAL_KIND(6,37)
  integer :: irank

  integer, allocatable :: gstart1(:)
  integer, allocatable :: gend1(:)
  integer, allocatable :: gstartz(:)
  integer, allocatable :: gendz(:)
  integer, allocatable :: ind_fl(:)
  integer, allocatable :: blen(:),disp(:)

  integer, allocatable :: ddt_recv(:),ddt_send(:)

  real(kind=r4), allocatable :: tmp_array(:,:,:)
  real(kind=r4), allocatable :: tmp_in(:,:,:)

  integer :: cnt, i, j
  integer :: count_send, count_recv

  integer :: ssend
  integer :: srecv
  integer :: esend
  integer :: erecv
  integer :: erecv2, srecv2

  integer :: mpierr, ierr, old, typesize, typesize2,typesize3
  integer :: mpi_requests(2*16)
  integer :: mpi_status_arr(MPI_STATUS_SIZE,2*16)

  character(MPI_MAX_ERROR_STRING) :: string
  integer :: resultlen
  integer :: errorcode
!***________Code___________________________
  !*_________initialize MPI__________________
  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD,irank,ierr)
  call MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN,ierr)

  allocate(gstart1(0:15), &
       gend1(0:15), &
       gstartz(0:15), &
       gendz(0:15))


  gstart1(0) = 1
  gend1(0) = 40
  gstartz(0) = 1
  gendz(0) = 27

  do i = 2, 16
     gstart1(i-1) = gend1(i-2) + 1
     gend1(i-1)   = gend1(i-2) + 40
     gstartz(i-1) = gendz(i-2) + 1
     gendz(i-1)   = gendz(i-2) + 27
  end do

  allocate(ind_fl(15))
  cnt = 1
  do i = 1, 16
     if ( (i-1) == irank ) cycle
     ind_fl(cnt) = (i - 1)
     cnt = cnt + 1
  end do
  cnt = 1
  do i = 1, 16
     if ( (i-1) == irank ) cycle
     ind_fl(cnt) = (i - 1)
     cnt = cnt + 1
  end do

  !*_________new datatype__________________
  allocate(ddt_recv(16),ddt_send(16))
  allocate(blen(60), disp(60))
  call mpi_type_size(MPI_REAL,typesize,ierr)

  do i = 1, 15
     call mpi_type_contiguous(3240,MPI_REAL, &
          ddt_send(i),ierr)
     call mpi_type_commit(ddt_send(i),ierr)

     srecv2 = (gstartz(ind_fl(i))-1)*2+1
     erecv2 = gendz(ind_fl(i))*2
     blen(:) = erecv2 - srecv2 + 1
     do j = 1, 60
        disp(j) = (j-1)*(852) + srecv2 - 1
     end do

     call mpi_type_indexed(60,blen,disp,MPI_REAL, &
          ddt_recv(i),ierr)
     call mpi_type_commit(ddt_recv(i),ierr)
     old = ddt_recv(i)
     call mpi_type_create_resized(old,int(0,kind=MPI_ADDRESS_KIND),&
          int(51120*typesize,kind=MPI_ADDRESS_KIND),&
          ddt_recv(i),ierr)
     call mpi_type_free(old,ierr)
     call mpi_type_commit(ddt_recv(i),ierr)

  end do


  allocate(tmp_array(852,60,40))
  allocate(tmp_in(54,60,640))
  tmp_array = 0.0_r4
  tmp_in = 0.0_r4

  ssend = gstart1(irank)
  esend = gend1(irank)
  cnt = 0

  do i = 1, 15
     srecv = gstart1(ind_fl(i))
     erecv = gend1(ind_fl(i))

     ! Number of datatype elements for the send/receive calls
     count_send = erecv - srecv + 1
     count_recv = esend - ssend + 1
     cnt = cnt + 1

     call mpi_irecv(tmp_array,count_recv,ddt_recv(i), &
          ind_fl(i),ind_fl(i),MPI_COMM_WORLD,mpi_requests(cnt),ierr)

     cnt = cnt + 1
     call mpi_isend(tmp_in(:,:,srecv:erecv), &
          count_send,ddt_send(i),ind_fl(i), &
          irank,MPI_COMM_WORLD,mpi_requests(cnt),ierr)

  end do

  call mpi_waitall(cnt,mpi_requests(1:cnt),mpi_status_arr(:,1:cnt),ierr)

  if (ierr /=  MPI_SUCCESS) then
     do i = 1,cnt
        errorcode = mpi_status_arr(MPI_ERROR,i)
        if (errorcode /= 0 .AND. errorcode /= MPI_ERR_PENDING) then
           call MPI_Error_string(errorcode,string,resultlen,mpierr)
           print *, "rank: ",irank, string
           !call MPI_Abort(MPI_COMM_WORLD,errorcode,ierr)
        end if
     end do
  end if

  deallocate(tmp_array)
  deallocate(tmp_in)

  print *, "great success"

  call MPI_FINALIZE(ierr)

end program nbtest

 

Running gdb on one of the processes during the deadlock:

 

(gdb) bt

#0  0x00002acb4c6bf733 in __select_nocancel () from /lib64/libc.so.6

#1  0x00002acb4b496a2e in MPID_nem_tcp_connpoll () from /opt/intel/psxe_runtime_2017.4.196/linux/mpi/intel64/lib/libmpi.so.12

#2  0x00002acb4b496048 in MPID_nem_tcp_poll () from /opt/intel/psxe_runtime_2017.4.196/linux/mpi/intel64/lib/libmpi.so.12

#3  0x00002acb4b350020 in MPID_nem_network_poll () from /opt/intel/psxe_runtime_2017.4.196/linux/mpi/intel64/lib/libmpi.so.12

#4  0x00002acb4b0cc5f2 in PMPIDI_CH3I_Progress () from /opt/intel/psxe_runtime_2017.4.196/linux/mpi/intel64/lib/libmpi.so.12

#5  0x00002acb4b50328f in PMPI_Waitall () from /opt/intel/psxe_runtime_2017.4.196/linux/mpi/intel64/lib/libmpi.so.12

#6  0x00002acb4ad1d53f in pmpi_waitall_ (v1=0x1e, v2=0xb0c320, v3=0x0, ierr=0x2acb4c6bf733 <__select_nocancel+10>) at ../../src/binding/fortran/mpif_h/waitallf.c:275

#7  0x00000000004064b0 in MAIN__ ()

#8  0x000000000040331e in main ()

 

Output log after I kill the job:

$ mpirun -n 16 ./test.x

forrtl: error (78): process killed (SIGTERM)

Image              PC                Routine            Line        Source

test.x             000000000040C12A  Unknown               Unknown  Unknown

libpthread-2.17.s  00002BA8B42F95A0  Unknown               Unknown  Unknown

libmpi.so.12       00002BA8B3303EBF  PMPIDI_CH3I_Progr     Unknown  Unknown

libmpi.so.12       00002BA8B373B28F  PMPI_Waitall          Unknown  Unknown

libmpifort.so.12.  00002BA8B2F5553F  pmpi_waitall          Unknown  Unknown

test.x             00000000004064B0  MAIN__                    129  nbtest.f90

test.x             000000000040331E  Unknown               Unknown  Unknown

libc-2.17.so       00002BA8B4829C05  __libc_start_main     Unknown  Unknown

test.x             0000000000403229  Unknown               Unknown  Unknown

(repeated 15 times, once for each processor)

Output with I_MPI_DEBUG = 6

[0] MPI startup(): Intel(R) MPI Library, Version 2017 Update 3  Build 20170405 (id: 17193)

[0] MPI startup(): Copyright (C) 2003-2017 Intel Corporation.  All rights reserved.

[0] MPI startup(): Multi-threaded optimized library

[12] MPI startup(): cannot open dynamic library libdat2.so.2

[7] MPI startup(): cannot open dynamic library libdat2.so.2

[10] MPI startup(): cannot open dynamic library libdat2.so.2

[13] MPI startup(): cannot open dynamic library libdat2.so.2

[4] MPI startup(): cannot open dynamic library libdat2.so.2

[9] MPI startup(): cannot open dynamic library libdat2.so.2

[14] MPI startup(): cannot open dynamic library libdat2.so.2

[5] MPI startup(): cannot open dynamic library libdat2.so.2

[11] MPI startup(): cannot open dynamic library libdat2.so.2

[15] MPI startup(): cannot open dynamic library libdat2.so.2

[6] MPI startup(): cannot open dynamic library libdat2.so.2

[8] MPI startup(): cannot open dynamic library libdat2.so.2

[0] MPI startup(): cannot open dynamic library libdat2.so.2

[3] MPI startup(): cannot open dynamic library libdat2.so.2

[2] MPI startup(): cannot open dynamic library libdat2.so.2

[4] MPI startup(): cannot open dynamic library libdat2.so

[7] MPI startup(): cannot open dynamic library libdat2.so

[8] MPI startup(): cannot open dynamic library libdat2.so

[9] MPI startup(): cannot open dynamic library libdat2.so

[6] MPI startup(): cannot open dynamic library libdat2.so

[10] MPI startup(): cannot open dynamic library libdat2.so

[13] MPI startup(): cannot open dynamic library libdat2.so

[0] MPI startup(): cannot open dynamic library libdat2.so

[15] MPI startup(): cannot open dynamic library libdat2.so

[3] MPI startup(): cannot open dynamic library libdat2.so

[12] MPI startup(): cannot open dynamic library libdat2.so

[4] MPI startup(): cannot open dynamic library libdat.so.1

[14] MPI startup(): cannot open dynamic library libdat2.so

[7] MPI startup(): cannot open dynamic library libdat.so.1

[5] MPI startup(): cannot open dynamic library libdat2.so

[8] MPI startup(): cannot open dynamic library libdat.so.1

[1] MPI startup(): cannot open dynamic library libdat2.so.2

[6] MPI startup(): cannot open dynamic library libdat.so.1

[9] MPI startup(): cannot open dynamic library libdat.so.1

[10] MPI startup(): cannot open dynamic library libdat.so.1

[0] MPI startup(): cannot open dynamic library libdat.so.1

[12] MPI startup(): cannot open dynamic library libdat.so.1

[4] MPI startup(): cannot open dynamic library libdat.so

[11] MPI startup(): cannot open dynamic library libdat2.so

[3] MPI startup(): cannot open dynamic library libdat.so.1

[13] MPI startup(): cannot open dynamic library libdat.so.1

[5] MPI startup(): cannot open dynamic library libdat.so.1

[15] MPI startup(): cannot open dynamic library libdat.so.1

[5] MPI startup(): cannot open dynamic library libdat.so

[7] MPI startup(): cannot open dynamic library libdat.so

[1] MPI startup(): cannot open dynamic library libdat2.so

[9] MPI startup(): cannot open dynamic library libdat.so

[8] MPI startup(): cannot open dynamic library libdat.so

[11] MPI startup(): cannot open dynamic library libdat.so.1

[6] MPI startup(): cannot open dynamic library libdat.so

[10] MPI startup(): cannot open dynamic library libdat.so

[14] MPI startup(): cannot open dynamic library libdat.so.1

[11] MPI startup(): cannot open dynamic library libdat.so

[13] MPI startup(): cannot open dynamic library libdat.so

[15] MPI startup(): cannot open dynamic library libdat.so

[12] MPI startup(): cannot open dynamic library libdat.so

[0] MPI startup(): cannot open dynamic library libdat.so

[14] MPI startup(): cannot open dynamic library libdat.so

[1] MPI startup(): cannot open dynamic library libdat.so.1

[3] MPI startup(): cannot open dynamic library libdat.so

[1] MPI startup(): cannot open dynamic library libdat.so

[2] MPI startup(): cannot open dynamic library libdat2.so

[2] MPI startup(): cannot open dynamic library libdat.so.1

[2] MPI startup(): cannot open dynamic library libdat.so

[4] MPI startup(): cannot load default tmi provider

[7] MPI startup(): cannot load default tmi provider

[5] MPI startup(): cannot load default tmi provider

[9] MPI startup(): cannot load default tmi provider

[0] MPI startup(): cannot load default tmi provider

[6] MPI startup(): cannot load default tmi provider

[10] MPI startup(): cannot load default tmi provider

[3] MPI startup(): cannot load default tmi provider

[15] MPI startup(): cannot load default tmi provider

[8] MPI startup(): cannot load default tmi provider

[1] MPI startup(): cannot load default tmi provider

[14] MPI startup(): cannot load default tmi provider

[11] MPI startup(): cannot load default tmi provider

[2] MPI startup(): cannot load default tmi provider

[12] MPI startup(): cannot load default tmi provider

[13] MPI startup(): cannot load default tmi provider

[12] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory

 

[4] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory

[9] ERROR - load_iblibrary(): [15] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory

 

[5] ERROR - load_iblibrary(): [0] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory

 

[10] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory

[1] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory

 

 

[3] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory

 

[13] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory

 

[7] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory

 

[2] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory

 

[6] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory

 

[8] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory

 

[11] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory

 

Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory

 

[14] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory

 

Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory

 

[0] MPI startup(): shm and tcp data transfer modes

[1] MPI startup(): shm and tcp data transfer modes

[2] MPI startup(): shm and tcp data transfer modes

[3] MPI startup(): shm and tcp data transfer modes

[4] MPI startup(): shm and tcp data transfer modes

[5] MPI startup(): shm and tcp data transfer modes

[7] MPI startup(): shm and tcp data transfer modes

[9] MPI startup(): shm and tcp data transfer modes

[8] MPI startup(): shm and tcp data transfer modes

[6] MPI startup(): shm and tcp data transfer modes

[10] MPI startup(): shm and tcp data transfer modes

[11] MPI startup(): shm and tcp data transfer modes

[12] MPI startup(): shm and tcp data transfer modes

[13] MPI startup(): shm and tcp data transfer modes

[14] MPI startup(): shm and tcp data transfer modes

[15] MPI startup(): shm and tcp data transfer modes

[0] MPI startup(): Device_reset_idx=1

[0] MPI startup(): Allgather: 4: 1-4 & 0-4

[0] MPI startup(): Allgather: 1: 5-11 & 0-4

[0] MPI startup(): Allgather: 4: 12-28 & 0-4

[0] MPI startup(): Allgather: 1: 29-1694 & 0-4


[0] MPI startup(): Allgather: 4: 1695-3413 & 0-4

[0] MPI startup(): Allgather: 1: 3414-513494 & 0-4

[0] MPI startup(): Allgather: 3: 513495-1244544 & 0-4

[0] MPI startup(): Allgather: 4: 0-2147483647 & 0-4

[0] MPI startup(): Allgather: 4: 1-16 & 5-16

[0] MPI startup(): Allgather: 1: 17-38 & 5-16

[0] MPI startup(): Allgather: 3: 0-2147483647 & 5-16

[0] MPI startup(): Allgather: 4: 1-8 & 17-2147483647

[0] MPI startup(): Allgather: 1: 9-23 & 17-2147483647

[0] MPI startup(): Allgather: 4: 24-35 & 17-2147483647

[0] MPI startup(): Allgather: 3: 0-2147483647 & 17-2147483647

[0] MPI startup(): Allgatherv: 1: 0-3669 & 0-4

[0] MPI startup(): Allgatherv: 4: 3669-4949 & 0-4

[0] MPI startup(): Allgatherv: 1: 4949-17255 & 0-4

[0] MPI startup(): Allgatherv: 4: 17255-46775 & 0-4

[0] MPI startup(): Allgatherv: 3: 46775-836844 & 0-4

[0] MPI startup(): Allgatherv: 4: 0-2147483647 & 0-4

[0] MPI startup(): Allgatherv: 4: 0-10 & 5-16

[0] MPI startup(): Allgatherv: 1: 10-38 & 5-16

[0] MPI startup(): Allgatherv: 3: 0-2147483647 & 5-16

[0] MPI startup(): Allgatherv: 4: 0-8 & 17-2147483647

[0] MPI startup(): Allgatherv: 1: 8-21 & 17-2147483647

[0] MPI startup(): Allgatherv: 3: 0-2147483647 & 17-2147483647

[0] MPI startup(): Allreduce: 5: 0-6 & 0-8

[0] MPI startup(): Allreduce: 7: 6-11 & 0-8

[0] MPI startup(): Allreduce: 5: 11-26 & 0-8

[0] MPI startup(): Allreduce: 4: 26-43 & 0-8

[0] MPI startup(): Allreduce: 5: 43-99 & 0-8

[0] MPI startup(): Allreduce: 1: 99-176 & 0-8

[0] MPI startup(): Allreduce: 6: 176-380 & 0-8

[0] MPI startup(): Allreduce: 2: 380-2967 & 0-8

[0] MPI startup(): Allreduce: 1: 2967-9460 & 0-8

[0] MPI startup(): Allreduce: 2: 0-2147483647 & 0-8

[0] MPI startup(): Allreduce: 5: 0-95 & 9-16

[0] MPI startup(): Allreduce: 1: 95-301 & 9-16

[0] MPI startup(): Allreduce: 2: 301-2577 & 9-16

[0] MPI startup(): Allreduce: 6: 2577-5427 & 9-16

[0] MPI startup(): Allreduce: 1: 5427-10288 & 9-16

[0] MPI startup(): Allreduce: 2: 0-2147483647 & 9-16

[0] MPI startup(): Allreduce: 6: 0-6 & 17-2147483647

[0] MPI startup(): Allreduce: 5: 6-11 & 17-2147483647

[0] MPI startup(): Allreduce: 6: 11-452 & 17-2147483647

[0] MPI startup(): Allreduce: 2: 452-2639 & 17-2147483647

[0] MPI startup(): Allreduce: 6: 2639-5627 & 17-2147483647

[0] MPI startup(): Allreduce: 1: 5627-9956 & 17-2147483647

[0] MPI startup(): Allreduce: 2: 9956-2587177 & 17-2147483647

[0] MPI startup(): Allreduce: 3: 0-2147483647 & 17-2147483647


[0] MPI startup(): Alltoall: 4: 1-16 & 0-8

[0] MPI startup(): Alltoall: 1: 17-69 & 0-8

[0] MPI startup(): Alltoall: 2: 70-1024 & 0-8

[0] MPI startup(): Alltoall: 2: 1024-52228 & 0-8

[0] MPI startup(): Alltoall: 4: 52229-74973 & 0-8

[0] MPI startup(): Alltoall: 2: 74974-131148 & 0-8

[0] MPI startup(): Alltoall: 3: 131149-335487 & 0-8

[0] MPI startup(): Alltoall: 4: 0-2147483647 & 0-8

[0] MPI startup(): Alltoall: 4: 1-16 & 9-16

[0] MPI startup(): Alltoall: 1: 17-40 & 9-16

[0] MPI startup(): Alltoall: 2: 41-497 & 9-16

[0] MPI startup(): Alltoall: 1: 498-547 & 9-16

[0] MPI startup(): Alltoall: 2: 548-1024 & 9-16

[0] MPI startup(): Alltoall: 2: 1024-69348 & 9-16

[0] MPI startup(): Alltoall: 4: 0-2147483647 & 9-16

[0] MPI startup(): Alltoall: 4: 0-1 & 17-2147483647

[0] MPI startup(): Alltoall: 1: 2-4 & 17-2147483647

[0] MPI startup(): Alltoall: 4: 5-24 & 17-2147483647

[0] MPI startup(): Alltoall: 2: 25-1024 & 17-2147483647

[0] MPI startup(): Alltoall: 2: 1024-20700 & 17-2147483647

[0] MPI startup(): Alltoall: 4: 20701-57414 & 17-2147483647

[0] MPI startup(): Alltoall: 3: 57415-66078 & 17-2147483647

[0] MPI startup(): Alltoall: 4: 0-2147483647 & 17-2147483647

[0] MPI startup(): Alltoallv: 2: 0-2147483647 & 0-2147483647

[0] MPI startup(): Alltoallw: 0: 0-2147483647 & 0-2147483647

[0] MPI startup(): Barrier: 0: 0-2147483647 & 0-2147483647

[0] MPI startup(): Bcast: 4: 1-29 & 0-8

[0] MPI startup(): Bcast: 7: 30-37 & 0-8

[0] MPI startup(): Bcast: 4: 38-543 & 0-8

[0] MPI startup(): Bcast: 6: 544-1682 & 0-8

[0] MPI startup(): Bcast: 4: 1683-2521 & 0-8

[0] MPI startup(): Bcast: 6: 2522-30075 & 0-8

[0] MPI startup(): Bcast: 7: 30076-34889 & 0-8

[0] MPI startup(): Bcast: 4: 34890-131072 & 0-8

[0] MPI startup(): Bcast: 6: 131072-409051 & 0-8

[0] MPI startup(): Bcast: 7: 0-2147483647 & 0-8

[0] MPI startup(): Bcast: 4: 1-13 & 9-2147483647

[0] MPI startup(): Bcast: 1: 14-25 & 9-2147483647

[0] MPI startup(): Bcast: 4: 26-691 & 9-2147483647

[0] MPI startup(): Bcast: 6: 692-2367 & 9-2147483647

[0] MPI startup(): Bcast: 4: 2368-7952 & 9-2147483647

[0] MPI startup(): Bcast: 6: 7953-10407 & 9-2147483647

[0] MPI startup(): Bcast: 4: 10408-17900 & 9-2147483647

[0] MPI startup(): Bcast: 6: 17901-36385 & 9-2147483647

[0] MPI startup(): Bcast: 7: 36386-131072 & 9-2147483647

[0] MPI startup(): Bcast: 7: 0-2147483647 & 9-2147483647

[0] MPI startup(): Exscan: 0: 0-2147483647 & 0-2147483647


[0] MPI startup(): Gather: 2: 1-3 & 0-8

[0] MPI startup(): Gather: 3: 4-4 & 0-8

[0] MPI startup(): Gather: 2: 5-66 & 0-8

[0] MPI startup(): Gather: 3: 67-174 & 0-8

[0] MPI startup(): Gather: 2: 175-478 & 0-8

[0] MPI startup(): Gather: 3: 479-531 & 0-8

[0] MPI startup(): Gather: 2: 532-2299 & 0-8

[0] MPI startup(): Gather: 3: 0-2147483647 & 0-8

[0] MPI startup(): Gather: 2: 1-141 & 9-16

[0] MPI startup(): Gather: 3: 142-456 & 9-16

[0] MPI startup(): Gather: 2: 457-785 & 9-16

[0] MPI startup(): Gather: 3: 786-70794 & 9-16

[0] MPI startup(): Gather: 2: 70795-254351 & 9-16

[0] MPI startup(): Gather: 3: 0-2147483647 & 9-16

[0] MPI startup(): Gather: 2: 1-89 & 17-2147483647

[0] MPI startup(): Gather: 3: 90-472 & 17-2147483647

[0] MPI startup(): Gather: 2: 473-718 & 17-2147483647

[0] MPI startup(): Gather: 3: 719-16460 & 17-2147483647

[0] MPI startup(): Gather: 2: 0-2147483647 & 17-2147483647

[0] MPI startup(): Gatherv: 2: 0-2147483647 & 0-16

[0] MPI startup(): Gatherv: 2: 0-2147483647 & 17-2147483647


[0] MPI startup(): Reduce_scatter: 5: 0-5 & 0-4

[0] MPI startup(): Reduce_scatter: 1: 5-192 & 0-4

[0] MPI startup(): Reduce_scatter: 3: 192-349 & 0-4

[0] MPI startup(): Reduce_scatter: 1: 349-3268 & 0-4

[0] MPI startup(): Reduce_scatter: 3: 3268-71356 & 0-4

[0] MPI startup(): Reduce_scatter: 2: 71356-513868 & 0-4

[0] MPI startup(): Reduce_scatter: 5: 513868-731452 & 0-4

[0] MPI startup(): Reduce_scatter: 2: 731452-1746615 & 0-4

[0] MPI startup(): Reduce_scatter: 5: 1746615-2485015 & 0-4

[0] MPI startup(): Reduce_scatter: 2: 0-2147483647 & 0-4

[0] MPI startup(): Reduce_scatter: 5: 0-5 & 5-16

[0] MPI startup(): Reduce_scatter: 1: 5-59 & 5-16

[0] MPI startup(): Reduce_scatter: 5: 59-99 & 5-16

[0] MPI startup(): Reduce_scatter: 3: 99-198 & 5-16

[0] MPI startup(): Reduce_scatter: 1: 198-360 & 5-16

[0] MPI startup(): Reduce_scatter: 3: 360-3606 & 5-16

[0] MPI startup(): Reduce_scatter: 2: 3606-4631 & 5-16

[0] MPI startup(): Reduce_scatter: 3: 0-2147483647 & 5-16

[0] MPI startup(): Reduce_scatter: 5: 0-22 & 17-2147483647

[0] MPI startup(): Reduce_scatter: 1: 22-44 & 17-2147483647

[0] MPI startup(): Reduce_scatter: 5: 44-278 & 17-2147483647

[0] MPI startup(): Reduce_scatter: 3: 278-3517 & 17-2147483647

[0] MPI startup(): Reduce_scatter: 5: 3517-4408 & 17-2147483647

[0] MPI startup(): Reduce_scatter: 3: 0-2147483647 & 17-2147483647

[0] MPI startup(): Reduce: 4: 4-5 & 0-4

[0] MPI startup(): Reduce: 1: 6-59 & 0-4

[0] MPI startup(): Reduce: 2: 60-188 & 0-4

[0] MPI startup(): Reduce: 6: 189-362 & 0-4

[0] MPI startup(): Reduce: 2: 363-7776 & 0-4

[0] MPI startup(): Reduce: 5: 7777-151371 & 0-4

[0] MPI startup(): Reduce: 1: 0-2147483647 & 0-4

[0] MPI startup(): Reduce: 4: 4-60 & 5-16

[0] MPI startup(): Reduce: 3: 61-88 & 5-16

[0] MPI startup(): Reduce: 4: 89-245 & 5-16

[0] MPI startup(): Reduce: 3: 246-256 & 5-16

[0] MPI startup(): Reduce: 4: 257-8192 & 5-16

[0] MPI startup(): Reduce: 3: 8192-1048576 & 5-16

[0] MPI startup(): Reduce: 3: 0-2147483647 & 5-16

[0] MPI startup(): Reduce: 4: 4-8192 & 17-2147483647

[0] MPI startup(): Reduce: 3: 8192-1048576 & 17-2147483647

[0] MPI startup(): Reduce: 3: 0-2147483647 & 17-2147483647

[0] MPI startup(): Scan: 0: 0-2147483647 & 0-2147483647

[0] MPI startup(): Scatter: 2: 1-7 & 0-16

[0] MPI startup(): Scatter: 3: 8-9 & 0-16

[0] MPI startup(): Scatter: 2: 10-64 & 0-16

[0] MPI startup(): Scatter: 3: 65-372 & 0-16

[0] MPI startup(): Scatter: 2: 373-811 & 0-16

[0] MPI startup(): Scatter: 3: 812-115993 & 0-16

[0] MPI startup(): Scatter: 2: 115994-173348 & 0-16

[0] MPI startup(): Scatter: 3: 0-2147483647 & 0-16

[0] MPI startup(): Scatter: 1: 1-1 & 17-2147483647

[0] MPI startup(): Scatter: 2: 2-76 & 17-2147483647

[0] MPI startup(): Scatter: 3: 77-435 & 17-2147483647

[0] MPI startup(): Scatter: 2: 436-608 & 17-2147483647

[0] MPI startup(): Scatter: 3: 0-2147483647 & 17-2147483647

[0] MPI startup(): Scatterv: 1: 0-2147483647 & 0-2147483647

[5] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)

[1] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)

[7] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)

[2] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)

[6] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)

[3] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)

[13] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)

[4] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)

[9] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)

[14] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)

[11] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)

[15] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)

[8] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)

[12] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)

[10] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)

[0] MPI startup(): Rank    Pid      Node name      Pin cpu

[0] MPI startup(): 0       10691    ip-10-0-0-189  0

[0] MPI startup(): 1       10692    ip-10-0-0-189  1

[0] MPI startup(): 2       10693    ip-10-0-0-189  2

[0] MPI startup(): 3       10694    ip-10-0-0-189  3

[0] MPI startup(): 4       10320    ip-10-0-0-174  0

[0] MPI startup(): 5       10321    ip-10-0-0-174  1

[0] MPI startup(): 6       10322    ip-10-0-0-174  2

[0] MPI startup(): 7       10323    ip-10-0-0-174  3

[0] MPI startup(): 8       10273    ip-10-0-0-104  0

[0] MPI startup(): 9       10274    ip-10-0-0-104  1

[0] MPI startup(): 10      10275    ip-10-0-0-104  2

[0] MPI startup(): 11      10276    ip-10-0-0-104  3

[0] MPI startup(): 12      10312    ip-10-0-0-158  0

[0] MPI startup(): 13      10313    ip-10-0-0-158  1

[0] MPI startup(): 14      10314    ip-10-0-0-158  2

[0] MPI startup(): 15      10315    ip-10-0-0-158  3

[0] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)

[0] MPI startup(): I_MPI_DEBUG=6

[0] MPI startup(): I_MPI_HYDRA_UUID=bb290000-2b37-e5b2-065d-050000bd0a00

[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=1

[0] MPI startup(): I_MPI_PIN_MAPPING=4:0 0,1 1,2 2,3 3

shared memory initialization failure


Hi all,

Running our MPI application on a newly set up RHEL 7.3 system using SGE, we obtain the following error:

Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(805): fail failed
MPID_Init(1817)......: fail failed
MPIR_Comm_commit(711): fail failed
(unknown)(): Other MPI error

With I_MPI_DEBUG=1000 the following error is reported:

[0] I_MPI_Init_shm_colls_space(): Cannot create shm object: /shm-col-space-69142-2-55D0EBDD4B46E errno=Permission denied
[0] I_MPI_Init_shm_colls_space(): Something goes wrong in shared memory initialization (Permission denied)

Usually Intel MPI creates shm objects in /dev/shm. Does anybody know why the library tries to create them in /?
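(A guess that may explain the confusing path: the leading "/" is most likely just the POSIX shared-memory object name passed to shm_open(), and such objects normally end up under /dev/shm, so this still points at a /dev/shm problem rather than the filesystem root. A quick permission check on the affected node:

ls -ld /dev/shm

This should normally show drwxrwxrwt; anything more restrictive would explain the Permission denied error.)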

Cheers,
Pieter

mpirun command does not distribute jobs to compute nodes


Dear Folks,

I have Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 13.0.1.117 Build 20121010 on my system. I am trying to submit a job using mpirun to my machine with the following hosts:

weather
compute-0-0
compute-0-1
compute-0-2
compute-0-3
compute-0-4
compute-0-5
compute-0-6
compute-0-7

After running mpdboot (as mpdboot -v -n 9 -f ~/hostfile -r ssh), I am using the command: mpirun -np 72 -f ~/hostfile ./wrf.exe &

After submitting the job, it fails with some error after 10-15 minutes. I checked top on the compute nodes and did not see any wrf.exe process running in the meantime. Please suggest whether I am making a mistake or whether something else is preventing jobs from reaching the compute nodes.
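(A sanity check that might narrow this down: verify that the MPD ring actually spans all nine hosts before launching, e.g.

mpdtrace -l

This should list every host from the hostfile; if some compute nodes are missing from the ring, mpirun would never start ranks there.)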

Thank you in anticipation.

Dhirendra

ITAC -- Naming generated .stf file to differentiate runs


Hello, 

I am using ITAC from the 2017.05 Intel Parallel Cluster Studio. I issue a number of mpirun command lines with ITAC tracing enabled. I am trying though to assign specific names to the generated .stf files so that I can associate the .stf files of a particular run with the corresponding mpirun command.  

How can I do this? 

Is there an option for this, like I_MPI_STATS_FILE for the statistics files?

Can I do something like  

mpiexec.hydra ... -stf-file-name MPIapp_$(date +%F_%T) ... ./MPIapp 
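Or perhaps through the trace collector's environment variables? If I understand the ITC configuration correctly (please correct me if not), VT_LOGFILE_NAME sets the trace file name and VT_LOGFILE_PREFIX the output directory, so something along these lines might work:

mpiexec.hydra ... -trace -genv VT_LOGFILE_NAME MPIapp_$(date +%F_%T).stf ... ./MPIapp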

Thank you!

Michael


Intel MPI cross-OS launch error


Env:

node1 : Windows 10 (192.168.137.1)

node2 : Debian 8 virtual machine (192.168.137.3)

 

Test app: the test.cpp included with the Intel MPI package

 

1. Launch from the Windows side (node1), 1 process (node1 only):

mpiexec -demux select -bootstrap=service -genv I_MPI_FABRICS=shm:tcp -n 1 -host localhost test

get output:

node1:

 

Hello world: rank 0 of 1 running on DESKTOP-J4KRVVD

2. Launch from the Windows side (node1), 1 process (node2 only):

mpiexec -demux select -bootstrap=service -genv I_MPI_FABRICS=shm:tcp -host 192.168.137.3 -hostos linux -n 1 -path /opt/intel/compilers_and_libraries_2017.2.174/linux/mpi/test test

get output:

node1:

Hello world: rank 0 of 1 running on vm-build-debian8

3. Launch from the Windows side (node1), 2 processes (1 on node1, 1 on node2):

mpiexec -demux select -bootstrap=service -genv I_MPI_FABRICS=shm:tcp -host 192.168.137.3 -hostos linux -n 1 -path /opt/intel/compilers_and_libraries_2017.2.174/linux/mpi/test test : -n 1 -host localhost test

get error:

node1:

rank = 1, revents = 29, state = 1
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c at line 2988: (it_plfd->revents & POLLERR) == 0
internal ABORT - process 0

 

node2:

[hydserv@vm-build-debian8] stdio_cb (../../tools/bootstrap/persist/persist_server.c:170): assert (!closed) failed
[hydserv@vm-build-debian8] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[hydserv@vm-build-debian8] main (../../tools/bootstrap/persist/persist_server.c:339): demux engine error waiting for event

 

If I try to turn on verbose output with -v or -genv I_MPI_HYDRA_DEBUG=on, even test 2 fails with the errors below, so I don't know what's wrong, or how to find out what's wrong.

node1:

[mpiexec@DESKTOP-J4KRVVD] STDIN will be redirected to 1 fd(s): 4

[mpiexec@DESKTOP-J4KRVVD] ..\hydra\utils\sock\sock.c (420): write error (Unknown error)
[mpiexec@DESKTOP-J4KRVVD] ..\hydra\tools\bootstrap\persist\persist_launch.c (52): assert (sent == hdr.buflen) failed
[mpiexec@DESKTOP-J4KRVVD] ..\hydra\tools\demux\demux_select.c (103): callback returned error status
[mpiexec@DESKTOP-J4KRVVD] ..\hydra\pm\pmiserv\pmiserv_pmci.c (501): error waiting for event
[mpiexec@DESKTOP-J4KRVVD] ..\hydra\ui\mpich\mpiexec.c (1147): process manager error waiting for completion

node2:

[hydserv@vm-build-debian8] stdio_cb (../../tools/bootstrap/persist/persist_server.c:170): assert (!closed) failed
[hydserv@vm-build-debian8] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[hydserv@vm-build-debian8] main (../../tools/bootstrap/persist/persist_server.c:339): demux engine error waiting for event

 


PBS system says: 'MPI startup(): ofa fabric is not available and fallback fabric is not enabled'


I've been using a PBS system for testing my code, and I have a PBS script to run my binary. But when I run it I get:

> [0] MPI startup(): ofa fabric is not available and fallback fabric is not enabled

And I read this site: https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technolog...

However, those methods did not solve the problem. My code runs on the head node and on the other nodes when launched directly, but it does not run through the PBS system.

What can I do about this problem?

Thanks.

P.S.

This my PBS script:

#!/bin/sh

#PBS -N job_1
#PBS -l nodes=1:ppn=12
#PBS -o example.out
#PBS -e example.err
#PBS -l walltime=3600:00:00
#PBS -q default_queue

echo -e --------- `date` ----------

echo HomeDirectory is $PWD
echo
echo Current Dir is $PBS_O_WORKDIR
echo


cd $PBS_O_WORKDIR

echo "------------This is the node file -------------"
cat $PBS_NODEFILE
echo "-----------------------------------------------"

np=$(cat $PBS_NODEFILE | wc -l)
echo The number of core is $np
echo
echo

cat $PBS_NODEFILE > $PBS_O_WORKDIR/mpd.host

mpdtrace  >/dev/null 2>&1
if [ "$?" != "0" ]
then
        echo -e
        mpdboot -n 1 -f mpd.host -r ssh
fi

mpirun -np 12 ./run_test
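(Not a definitive fix, but given the exact wording of the error, two settings that may be worth adding to the script before the mpirun line; please verify the variable names against the Intel MPI reference for your version:

export I_MPI_FABRICS=shm:tcp     # request TCP explicitly instead of the unavailable OFA fabric
export I_MPI_FALLBACK=enable     # or allow falling back when the requested fabric is unavailable

The symptom suggests that on the PBS execution hosts the OFA/InfiniBand stack is not available and fallback is disabled, which matches the message.)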

 

Fatal Error using MPI in Linux


Hi,

I'm using a virtual Linux Ubuntu machine (Linux-VirtualBox 4.4.0-101-generic #124-Ubuntu SMP Fri Nov 10 18:29:59 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux), with 8GB RAM.

For a process in Matlab, the software requires the Intel MPI runtime package v4.1.3.045 or later. I've installed version 2018.1.163 instead, not being sure about the 2018 version numbering.

Using 8 cores for the processing, the software failed with the following error:
Fatal error in MPI_Recv: Other MPI error, error stack:
MPI_Recv(224)...................: MPI_Recv(buf=0x7f566d59c040, count=9942500, MPI_FLOAT, src=3, tag=5, MPI_COMM_WORLD, status=0x7ffc43a72b60) failed
PMPIDI_CH3I_Progress(658).......: fail failed
MPID_nem_handle_pkt(1450).......: fail failed
pkt_RTS_handler(317)............: fail failed
do_cts(662).....................: fail failed
MPID_nem_lmt_dcp_start_recv(302): fail failed
dcp_recv(165)...................: Internal MPI error!  Cannot read from remote process
 Two workarounds have been identified for this issue:
 1) Enable ptrace for non-root users with:
    echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
 2) Or, use:
    I_MPI_SHM_LMT=shm
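(For anyone hitting the same message: the second workaround does not require root and can be applied per run, for example by exporting the variable in the shell that starts Matlab; a sketch, assuming a bash-like shell:

export I_MPI_SHM_LMT=shm

The variable name is taken verbatim from the error text above.)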

 

Reducing the number of cores to 4, the process has been hanging for more than 3 hours and I'm not sure it is still working.

What could be the problem?

thank you

Pietro

 

drastic reduction in performance when compute node running at half load


We have compute nodes with 24 cores (48 threads) and 64 GB RAM (2x32 GB). When I run a sample code (matrix multiplication) on one of the compute nodes in one thread, it takes only 4 seconds. But when I start more runs (copies of the same program) on the same compute node, the time taken increases drastically. When the number of programs running reaches 24 (I gave a maximum of 24 since only 24 physical cores are present), the time taken becomes around 40 seconds (10 times slower). When I checked the temperature, it was below 40 deg Celsius.

When I searched the Internet about this issue, I found some people saying that it may be due to the transfer of data from RAM to the processor slowing down when many programs run at once. I was not satisfied with this explanation, because compute nodes are designed to run at maximum load without much decrease in performance. Also, we are using only 1 GB of memory even with 24 programs running. Since we are seeing a performance reduction of about 10x, I guess the problem is something else.

Slowdown of message exchange by multiple orders of magnitude due to dynamic connections


Hello,

We develop MPI algorithms on the SuperMUC supercomputer [1]. We compile our algorithms with Intel MPI 2018. Unfortunately, it seems that message transfer between two processes which have not exchanged a message before is up to 1000 times slower than between two processes which have already communicated.

I want to give several examples:

1.: Let benchmark A perform the following operations: "First, execute a barrier on MPI_COMM_WORLD. Second, start a timer. Third, process 0 sends 256 messages of size 32 kB each. Message i is sent to process i + 1. Finally, stop the timer." The first execution of benchmark A takes about 686 microseconds on an instance of 2048 processes (2048 cores on 128 nodes). Subsequent executions of A just take 0.85 microseconds each.

Insight: If we perform a communication pattern (here 'partial' broadcast) the first time, the execution is slower than subsequent executions by a factor of about 800. Unfortunately, if we execute benchmark A again with a different communication partner, e.g., process 1 sends messages to process 1..256, benchmark A is slow again. Thus, an initial warm up phase which executes benchmark A once does not speed up communication in general.

2.: Let benchmark B perform the following operations: "First, execute a barrier on MPI_COMM_WORLD. Second, start a timer. Third, invoke MPI_Alltoall with messages of size 32 kB each. Finally, stop the timer." The first execution of benchmark B takes about 42.41 seconds(!) on an instance of 2048 processes (2048 cores on 128 nodes). The second execution of B just takes 0.12 seconds.

Insight: If we perform a communication pattern (here MPI_Alltoall) the first time, the execution is slower than subsequent executions by a factor of about 353. Unfortunately, the first MPI_Alltoall is unbelievably slow and gets even slower on larger machine instances.

3.: Let benchmark C perform the following operations: "First, execute a barrier on MPI_COMM_WORLD. Second, start a timer. Third, execute an all-to-all collective operation with messages of size 32 kB each. Finally, stop the timer." The all-to-all collective operation we use in benchmark C is our own implementation of the MPI_Alltoall interface. We now execute benchmark C first and then benchmark A afterwards. Benchmark C takes about 40 seconds and benchmark A takes about 0.85 seconds.

Insight: The first execution of our all-to-all implementation performs similarly to the first execution of MPI_Alltoall. Surprisingly, the subsequent execution of benchmark A is very fast (0.85 seconds) compared to the case where we do not have a preceding all-to-all. It seems that the all-to-all collective operation sets up the connections between all pairs of processes, which results in a fast execution of benchmark A. However, as the all-to-all collective operation (MPI_Alltoall as well) is unbelievably slow, we don't want to execute it as a warm-up at large scale.

We already figured out that setting the environment variable I_MPI_USE_DYNAMIC_CONNECTIONS=no avoids these slow running times at small scale (up to 2048 cores). However, setting I_MPI_USE_DYNAMIC_CONNECTIONS to 'no' does not have any effect on larger machine instances (more than 2048 cores).
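(For reference, the setting can be applied either globally in the environment or per run on the mpiexec command line, e.g. as follows, with ./benchmark being a placeholder:

export I_MPI_USE_DYNAMIC_CONNECTIONS=no
mpiexec.hydra -genv I_MPI_USE_DYNAMIC_CONNECTIONS no -n 2048 ./benchmark
)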

We think that these benchmarks give interesting insights into the running time of Intel MPI. Our supercomputer might be configured incorrectly. We tried to adjust several environment variables but did not find a satisfying configuration. We also want to mention that these fluctuations in running time do not occur with IBM MPI on that machine.

If you have further suggestions to handle this problem, please let us know. If required we run additional benchmarks, apply different configurations, and provide debug output, e.g.  I_MPI_DEBUG=xxx.

Best,

Michael A.

[1] https://www.lrz.de/services/compute/supermuc/systemdescription/

HPCC benchmark HPL results degrade as more cores are used


I have a 6-node cluster consisting of 12 cores per node with a total of 72 cores.

When running the HPCC benchmark on 6 cores (1 core per node, 6 nodes), the HPL result is 1198.87 GFLOPS. However, running HPCC on all available cores of the 6-node cluster, for a total of 72 cores, the HPL result is 847.421 GFLOPS.

MPI Library Used: Intel(R) MPI Library for Linux* OS, Version 2018 Update 1 Build 20171011 (id: 17941)

Options to mpiexec.hydra:
-print-rank-map
-pmi-noaggregate
-nolocal
-genvall
-genv I_MPI_DEBUG 5
-genv I_MPI_HYDRA_IFACE ens2f0
-genv I_MPI_FABRICS shm:tcp
-n 72
-ppn 12
-ilp64
--hostname filename

Any ideas?

Thanks in advance.

 

Error Loading libmpifort.so.12


Hello,

I am trying to run my executable (PDES Simulator - ROSS) on a Slurm cluster, and I load the module mpich/ge/gcc/64/3.2rc2 for MPI support.

But I get the error "while loading shared libraries: libmpifort.so.12: cannot open shared object file: No such file or directory".

Which module should I load for libmpifort.so.12? Or should mpich/ge/gcc/64/3.2rc2 already provide it, and maybe I am making a mistake while loading the module? "module list" shows me I have it loaded, though. Also, "which libmpifort.so.12" gives me a "no such" message.
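(Side note that may help the diagnosis: which only searches PATH for executables, so it will never find a shared library. A more telling check is to ask the loader what the binary wants and where it finds it, e.g., with ./your_ross_binary as a placeholder:

ldd ./your_ross_binary | grep -i mpi

libmpifort.so.12 is the Fortran binding library shipped with Intel MPI and MPICH-ABI-compatible MPIs, so the binary may have been built against a different MPI than the one the loaded module puts on the library path.)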

Thank you in advance.

Sincerely,

Ali.

Cannot correctly write to a file with MPI-IO and indexed type


Hello,

I am trying to write to a file using MPI-IO, with an indexed type used to set a view of the file (see code sample attached).
The program runs, but I do not get the result I was expecting.

When I output the content of the file via "od -f TEST", I get:
0000000 0 0 1 3

But I was expecting:
0000000 0 2 1 3

I am using Intel MPI 2018 Update 1 with gfortran.

Regards

Attachment: idx2.f (2.02 KB)

peak floating point operations of Intel Xeon E5345


I want to find the peak floating-point rate of the Intel Xeon E5345 processor. I searched for it and found 9.332 GFlop/s. I want to make sure. There is a formula (please correct me if I am wrong):

Flops/s = #instructions per second * clock cycle

The clock rate is 2.33 GHz (I am not sure) and I did not find the number of instructions per second the machine can perform.
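(For what it's worth, the usual formula is peak FLOP/s = cores x clock frequency x floating-point operations per cycle per core, rather than instructions per second x clock cycle. Assuming the E5345 runs at 2.33 GHz and each core can retire 4 double-precision FLOPs per cycle (one 128-bit SSE add plus one 128-bit SSE multiply, 2 FLOPs each) -- figures worth double-checking against Intel's documentation -- that gives 2.33 x 4 ≈ 9.33 GFlop/s per core, which matches the 9.332 figure, and about 4 x 9.33 ≈ 37.3 GFlop/s for the whole quad-core processor.)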

Any idea

Timers of PCH


Hi,

As far as I know, Intel PCH provides some high resolution timers.

I want to know how many timers are provided and how to use them under Linux.
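(If the timers in question are the HPET block in the PCH -- a guess about what is being asked -- then under Linux it shows up as a clock source and, when the driver is enabled, as a character device. Two quick checks:

cat /sys/devices/system/clocksource/clocksource0/available_clocksource   # e.g. tsc hpet acpi_pm
ls -l /dev/hpet                                                          # present when HPET support is built in

The number of HPET comparators/timers is reported by the kernel at boot; look for "hpet" in dmesg.)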

Thank you.

 

problems using PSM2 "Trying to connect to a HFI (subnet id - 0)on a different subnet - 1023 "


hello,

I installed the Omni-Path driver (IntelOPA-Basic.RHEL74-x86_64.10.6.1.0.2.tgz) on two identical KNL/F servers with CentOS (CentOS Linux release 7.4.1708 (Core)).

I executed the MPI benchmark provided by Intel using PSM2:

mpirun -PSM2 -host 10.0.0.5 -n 1 /opt/intel/impi/2018.1.163/bin64/IMB-MPI1 Sendrecv : -host 10.0.0.6 -n 1 /opt/intel/impi/2018.1.163/bin64/IMB-MPI1 Sendrecv

And the execution returns the following error:

[silvio@phi05 ~]$ mpirun -PSM2 -host 10.0.0.5 -n 1 /opt/intel/impi/2018.1.163/bin64/IMB-MPI1 Sendrecv : -host 10.0.0.6 -n 1 /opt/intel/impi/2018.1.163/bin64/IMB-MPI1 Sendrecv
init_provider_list: using configuration file: /opt/intel/compilers_and_libraries_2018.1.163/linux/mpi/intel64/etc/tmi.conf
init_provider_list: valid configuration line: psm2 1.3 libtmip_psm2.so ""
init_provider_list: using configuration file: /opt/intel/compilers_and_libraries_2018.1.163/linux/mpi/intel64/etc/tmi.conf
init_provider_list: valid configuration line: psm 1.2 libtmip_psm.so ""
init_provider_list: valid configuration line: mx 1.0 libtmip_mx.so ""
init_provider_list: valid configuration line: psm2 1.3 libtmip_psm2.so ""
init_provider_list: valid configuration line: psm 1.2 libtmip_psm.so ""
init_provider_list: valid configuration line: mx 1.0 libtmip_mx.so ""
tmi_psm2_init: tmi_psm2_connect_timeout=180
init_provider_lib: using provider: psm2, version 1.3
tmi_psm2_init: tmi_psm2_connect_timeout=180
init_provider_lib: using provider: psm2, version 1.3
phi05.11971 Trying to connect to a HFI (subnet id - 0)on a different subnet - 1023 

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 11971 RUNNING AT 10.0.0.5
=   EXIT CODE: 134
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 11971 RUNNING AT 10.0.0.5
=   EXIT CODE: 6
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
   Intel(R) MPI Library troubleshooting guide:
      https://software.intel.com/node/561764
===================================================================================

I searched for this message on Google ("Trying to connect to a HFI (subnet id - 0) on a different subnet - 1023") and the only reference is the following source code:

https://github.com/01org/opa-psm2/blob/master/ptl_ips/ips_proto_connect.c

How do I put the two HFIs on the same subnet?
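(Offered only as guesses, since I have not verified this: the subnet ID is normally assigned and configured by the fabric manager, so it may be worth checking that the opafm service is running on exactly one of the two nodes, and comparing what each port reports locally, e.g. by running

opainfo

on both hosts, to confirm both links are Active and managed by the same subnet manager.)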

When I change to InfiniBand, it works:

mpirun -IB -host 10.0.0.5 -n 1 /opt/intel/impi/2018.1.163/bin64/IMB-MPI1 Sendrecv : -host 10.0.0.6 -n 1 /opt/intel/impi/2018.1.163/bin64/IMB-MPI1 Sendrecv

#-----------------------------------------------------------------------------

# Benchmarking Sendrecv 
# #processes = 2 
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000        17.79        17.79        17.79         0.00
            1         1000        18.11        18.11        18.11         0.11
            2         1000        18.05        18.05        18.05         0.22
            4         1000        18.08        18.08        18.08         0.44
            8         1000        18.05        18.05        18.05         0.89
           16         1000        18.06        18.06        18.06         1.77
           32         1000        18.99        18.99        18.99         3.37
           64         1000        19.05        19.07        19.06         6.71
          128         1000        19.20        19.20        19.20        13.33
          256         1000        19.96        19.97        19.97        25.64
          512         1000        20.22        20.22        20.22        50.63
         1024         1000        20.38        20.39        20.39       100.44
         2048         1000        24.70        24.71        24.70       165.78
         4096         1000        25.98        25.98        25.98       315.31
         8192         1000        55.57        55.59        55.58       294.75
        16384         1000        61.89        61.90        61.90       529.33
        32768         1000       112.95       113.01       112.98       579.89
        65536          640       158.22       158.23       158.22       828.37
       131072          320       297.40       297.50       297.45       881.16
       262144          160       599.27       600.30       599.78       873.38
       524288           80     31394.80     31489.45     31442.13        33.30
      1048576           40     28356.10     28414.67     28385.39        73.81
      2097152           20     31387.65     31661.40     31524.53       132.47
      4194304           10     38455.80     40408.99     39432.39       207.59

 

MPI program calling an external MPI program fails


I tried to use MPI to run an external command-line program in parallel.

 

So I wrote `run.f90`:

 

    program run
          use mpi
          implicit none
          integer::num_process,rank,ierr;

          call MPI_Init(ierr);

          call MPI_Comm_rank(MPI_COMM_WORLD, rank,ierr);
          call MPI_Comm_size(MPI_COMM_WORLD, num_process,ierr);

          call execute_command_line('./a.out')

          call MPI_Finalize(ierr);

    end program 

 

 

I compile it using the Intel compiler:

mpiifort run.f90 -o run.out

 

Now if `./a.out` is a normal non-MPI program like

 

    

program test
    implicit none
    print*,'hello'
    end

then 

 

    mpiexec.hydra -n 4 ./run.out

 

works fine.

 

However, if `./a.out` is also an MPI program like

 

    program mpi_test
              use mpi
              implicit none
              integer::num_process,rank,ierr;

              call MPI_Init(ierr);

              call MPI_Comm_rank(MPI_COMM_WORLD, rank,ierr);
              call MPI_Comm_size(MPI_COMM_WORLD, num_process,ierr);

              print*,rank

              call MPI_Finalize(ierr);

        end program 

Then I get the following error after running "mpiexec.hydra -n 4 ./run.out":

===================================================================================

=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES

=   PID 19519 RUNNING AT i02n18

=   EXIT CODE: 13

=   CLEANING UP REMAINING PROCESSES

=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

===================================================================================

 

===================================================================================

=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES

=   PID 19522 RUNNING AT i02n18

=   EXIT CODE: 13

=   CLEANING UP REMAINING PROCESSES

=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

===================================================================================

   Intel(R) MPI Library troubleshooting guide:

      https://software.intel.com/node/561764

What is wrong?

Performance issues with Omni Path


Hi all,

I installed two Omni-Path Fabric cards on two Xeon servers.

I followed the instructions on this web site: https://software.intel.com/en-us/articles/using-intel-omni-path-architec...

The performance tests at that link show that the network achieved 100 Gb/s (4194304  10  360.39  360.39  360.39  23276.25).

In the network I deployed, I achieved half of this performance (4194304  10  661.40  661.40  661.40  12683.17):

Is there some configuration needed to achieve 100 Gb/s using Omni Path?
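(One low-level thing that can roughly halve large-message Omni-Path throughput, mentioned here only as a guess: the HFI sitting in a PCIe slot that negotiated x8 instead of x16. The negotiated link can be read with lspci, where <bus_address> is a placeholder for the HFI's address found by the first command:

lspci | grep -i omni
sudo lspci -vv -s <bus_address> | grep -i lnksta

Also note that IMB Sendrecv reports the sum of both directions, so the reference 23276 Mbytes/sec corresponds to roughly line rate in each direction.)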

 

Here is the complete output of the benchmark execution:

mpirun -PSM2 -host 10.0.0.3 -n 1 /opt/intel/impi/2018.1.163/bin64/IMB-MPI1 Sendrecv : -host 10.0.0.1 -n 1 /opt/intel/impi/2018.1.163/bin64/IMB-MPI1 Sendrecv

[silvio@phi03 ~]$ mpirun -PSM2 -host 10.0.0.3 -n 1 /opt/intel/impi/2018.1.163/bin64/IMB-MPI1 Sendrecv : -host 10.0.0.1 -n 1 /opt/intel/impi/2018.1.163/bin64/IMB-MPI1 Sendrecv
#------------------------------------------------------------
#    Intel (R) MPI Benchmarks 2018 Update 1, MPI-1 part    
#------------------------------------------------------------
# Date                  : Fri Feb  2 11:14:01 2018
# Machine               : x86_64
# System                : Linux
# Release               : 3.10.0-693.17.1.el7.x86_64
# Version               : #1 SMP Thu Jan 25 20:13:58 UTC 2018
# MPI Version           : 3.1
# MPI Thread Environment: 

# Calling sequence was: 

# /opt/intel/impi/2018.1.163/bin64/IMB-MPI1 Sendrecv

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE 
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM  
#
#

# List of Benchmarks to run:

# Sendrecv

#-----------------------------------------------------------------------------
# Benchmarking Sendrecv 
# #processes = 2 
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000         1.92         1.92         1.92         0.00
            1         1000         1.85         1.85         1.85         1.08
            2         1000         1.84         1.84         1.84         2.17
            4         1000         1.84         1.84         1.84         4.35
            8         1000         1.76         1.76         1.76         9.10
           16         1000         2.07         2.07         2.07        15.44
           32         1000         2.06         2.07         2.07        30.98
           64         1000         2.02         2.02         2.02        63.46
          128         1000         2.08         2.08         2.08       123.26
          256         1000         2.11         2.11         2.11       242.41
          512         1000         2.25         2.25         2.25       454.30
         1024         1000         3.56         3.56         3.56       575.46
         2048         1000         4.19         4.19         4.19       976.91
         4096         1000         5.16         5.16         5.16      1586.69
         8192         1000         7.15         7.15         7.15      2290.80
        16384         1000        14.32        14.32        14.32      2288.44
        32768         1000        20.77        20.77        20.77      3154.69
        65536          640        26.08        26.09        26.09      5024.04
       131072          320        34.77        34.77        34.77      7538.32
       262144          160        53.03        53.03        53.03      9886.58
       524288           80        93.55        93.55        93.55     11208.78
      1048576           40       172.25       172.28       172.26     12173.26
      2097152           20       355.15       355.21       355.18     11808.02
      4194304           10       661.40       661.40       661.40     12683.17

# All processes entering MPI_Finalize

 

 

Thanks in advance!

Silvio
