Channel: Clusters and HPC Technology

Bug in MPI_File_read_all / MPI_File_write_all subroutines


Dear Intel support team,

Perhaps I was not clear enough in my previous topic https://software.intel.com/en-us/comment/1907393#comment-1907393, where two different problems were discussed. Here I would like to concentrate on the issue that was not answered in that topic.

According to the MPI 3.0 standard, MPI_File_read_all takes the following two parameters:

INTEGER, INTENT(IN) :: count ;       TYPE(MPI_Datatype),INTENT(IN) :: datatype

count is a 4-byte integer that can hold values up to 2147483647. Therefore, it should be possible to read or write that many array elements of whatever type is supplied through the second argument, datatype. However, it does not work. For example, reading count=2147483647 elements with datatype=MPI_CHARACTER (1 byte) works fine, but reading count=2147483647 elements with datatype=MPI_INTEGER (4 bytes) makes the subroutine crash. The standard says the subroutine should be able to read count=2147483647 elements of any datatype, but the Intel implementation reduces the number of elements that can be read by a factor of the datatype size. For example, if I build a derived datatype of size 2147483647 bytes, I can only read count=1 element. Again, the standard says I can read 2147483647 elements of any size, not only 1-byte elements.
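
To make this concrete, here is a minimal Fortran reproducer along the lines of what fails for me (an illustrative sketch only: the file name, the element count, and the assumption that the file already holds enough data are mine for this example):

program read_all_large_count
  use mpi
  implicit none
  integer :: ierr, fh, rank, count
  integer(kind=MPI_OFFSET_KIND) :: disp
  integer, allocatable :: buf(:)
  integer :: status(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  count = 600000000        ! well below 2147483647 elements, but ~2.4 GB as MPI_INTEGER
  allocate(buf(count))
  disp = 0

  call MPI_File_open(MPI_COMM_WORLD, 'data.bin', MPI_MODE_RDONLY, &
                     MPI_INFO_NULL, fh, ierr)
  call MPI_File_set_view(fh, disp, MPI_INTEGER, MPI_INTEGER, 'native', &
                         MPI_INFO_NULL, ierr)
  ! The standard allows count up to 2147483647 for any datatype, yet with
  ! MPI_INTEGER this call fails once count*4 exceeds 2147483647 bytes.
  call MPI_File_read_all(fh, buf, count, MPI_INTEGER, status, ierr)
  if (ierr /= MPI_SUCCESS .and. rank == 0) print *, 'MPI_File_read_all failed'

  call MPI_File_close(fh, ierr)
  call MPI_Finalize(ierr)
end program read_all_large_count

The same call with datatype=MPI_CHARACTER and the same count completes without error.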

I hope I was clear enough this time. Could you confirm this bug?

Regards, Serhiy

Thread Topic: Bug Report

Intel MPI with QoS option


Does Intel MPI have a runtime option for using an Infiniband Quality of Service?

For instance OpenMPI has:

mpirun --mca btl_openib_service_level N

I would be grateful if someone could steer me towards a similar option in Intel MPI.

Memory registration cache feature in DAPL -> random failure in simple code


  MPI w/ DAPL user ** beware **
  
  In our open source finite element code, we have encountered a simple
  manager-worker code section that fails randomly while moving arrays (blocks)
  of double precision data from worker ranks to manager rank 0.
  
  The failures occur (consistently) with DAPL but never with tcp over IB (IPoIB).
  
  After much effort, the culprit was found to be the memory registration
  cache feature in DAPL.
  
  This feature/bug is ON by default ** even though ** the manual states:
   
   "The cache substantially increases performance, but may lead 
    to correctness issues in certain situations."
    
    From: Intel® MPI Library for Linux OS Developer Reference (2017). pg 95
  
  Once we set this option OFF, the code runs successfully for all test cases over
  large and small numbers of cluster nodes. The DAPL performance is still
  at least 2x better than IPoIB.
  
    export I_MPI_DAPL_TRANSLATION_CACHE=0
  
  Recommendation to Intel MPI group:
  
     Set I_MPI_DAPL_TRANSLATION_CACHE=0 as the DEFAULT. Encourage developers
     to explore setting this option ON ** if ** their code works properly
     with OFF.
     
  Specifics:
  
     - Intel ifort 17.0.2
     - Intel MPI 17.0.2
     - Ohio Supercomputer Center, Owens Cluster.
         RedHat 7.3
         Mellanox EDR (100Gbps) Infiniband
         Broadwell/Haswell cluster nodes.

 Code section that randomly fails:

    -> Blocks are ALLOCATEd with variable size in a Fortran
       derived type (itself also allocated to the number of blocks).
       All blocks on rank 0 are created before the code below is entered.

  sync worker ranks to this point
  
  if rank = 0 then
  
     loop sequentially over all blocks to be moved
        if rank 0 owns block -> next block
        send worker who owns block the block number (MPI_SEND)
        receive block from worker (MPI_RECV)
     end loop
     
     loop over all workers
       send block = 0 to signal we are done moving blocks
     end loop
     
  else ! worker code
  
     loop
        post MPI_RECV to get a block number 
        if block number = 0 -> done
        if worker does not own this block, the manager made an error !
        send root the entire block -> MPI_SEND
     end loop
     
  end if
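
 For reference, a compilable Fortran sketch of this pattern (the names, message tags,
 and the block_t type here are illustrative placeholders, not our actual code):

module block_exchange
  use mpi
  implicit none
  type :: block_t
     double precision, allocatable :: data(:)
  end type block_t
contains
  subroutine move_blocks(myrank, nranks, nblocks, owner, blocks)
    integer, intent(in) :: myrank, nranks, nblocks
    integer, intent(in) :: owner(nblocks)        ! rank that owns each block
    type(block_t), intent(inout) :: blocks(nblocks)
    integer :: w, blk, done, ierr, status(MPI_STATUS_SIZE)

    done = 0
    call MPI_Barrier(MPI_COMM_WORLD, ierr)       ! sync all ranks to this point

    if (myrank == 0) then
       do blk = 1, nblocks
          if (owner(blk) == 0) cycle             ! rank 0 already owns it
          ! tell the owner which block to ship, then receive it
          call MPI_Send(blk, 1, MPI_INTEGER, owner(blk), 1, MPI_COMM_WORLD, ierr)
          call MPI_Recv(blocks(blk)%data, size(blocks(blk)%data), &
                        MPI_DOUBLE_PRECISION, owner(blk), 2, MPI_COMM_WORLD, &
                        status, ierr)
       end do
       do w = 1, nranks - 1                      ! block number 0 signals "done"
          call MPI_Send(done, 1, MPI_INTEGER, w, 1, MPI_COMM_WORLD, ierr)
       end do
    else
       do
          call MPI_Recv(blk, 1, MPI_INTEGER, 0, 1, MPI_COMM_WORLD, status, ierr)
          if (blk == 0) exit                     ! manager signalled "done"
          call MPI_Send(blocks(blk)%data, size(blocks(blk)%data), &
                        MPI_DOUBLE_PRECISION, 0, 2, MPI_COMM_WORLD, ierr)
       end do
    end if
  end subroutine move_blocks
end module block_exchange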
              

Cache management instructions for Intel Xeon Processors


Hi all,

I looked around but couldn't find any cache management instructions for Xeon processors (I am working on a Xeon E7-8860 v4). I found that we can use _mm_clevict for the MIC architecture.

Is there a similar way to do this on the Xeon E7-8860 v4? What I am looking to do is reduce the priority of a cache line so that it will be one of the first ones to get evicted. For instance:

int* arr = new int[ length ];

for ( int i = 0; i < length; ++i )
{
   // use arr[i]
   if ( ( i - 1 ) % CACHE_LINE_SIZE == 0 )
      reduce_priority( arr[ i - 1 ] ); // reduce the priority of the cache line in which arr[ i - 1 ] resides
}

 

If not, can I achieve this by different means?

 

Any suggestion will be greatly appreciated.

Thanks a lot.

Matara Ma Sukoy

Thread Topic: Question

INTERNAL ERROR with SLURM and PMI2


I was pleasantly surprised to read that PMI2 with SLURM is supported by Intel MPI in the 2017 release. I tested it, but it fails immediately on my setup. I'm using Intel Parallel Studio 2017 Update 4 and SLURM 15.08.13. A simple MPI program doesn't work:

[donners@int1 pmi2]$ cat mpi.f90
program test
  use mpi
  implicit none

  integer ierr,nprocs,rank

  call mpi_init(ierr)
  call mpi_comm_size(MPI_COMM_WORLD,nprocs,ierr)
  call mpi_comm_rank(mpi_comm_world,rank,ierr)
  if (rank .eq. 0) then
    print *,'Number of processes: ',nprocs
  endif
  print*,'I am rank ',rank
  call mpi_finalize(ierr)

end
[donners@int1 pmi2]$ mpiifort mpi.f90
[donners@int1 pmi2]$ ldd ./a.out
    linux-vdso.so.1 =>  (0x00007ffcc0364000)
    libmpifort.so.12 => /opt/intel/parallel_studio_xe_2017_update4/compilers_and_libraries/linux/mpi/intel64/lib/libmpifort.so.12 (0x00002ad7432a9000)
    libmpi.so.12 => /opt/intel/parallel_studio_xe_2017_update4/compilers_and_libraries/linux/mpi/intel64/lib/release_mt/libmpi.so.12 (0x00002ad743652000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00002ad744397000)
    librt.so.1 => /lib64/librt.so.1 (0x00002ad74459c000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ad7447a4000)
    libm.so.6 => /lib64/libm.so.6 (0x00002ad7449c1000)
    libc.so.6 => /lib64/libc.so.6 (0x00002ad744c46000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002ad744fda000)
    /lib64/ld-linux-x86-64.so.2 (0x00002ad743086000)
[donners@int1 pmi2]$ I_MPI_PMI2=yes srun -n 1 --mpi=pmi2 ./a.out

INTERNAL ERROR: invalid error code ffffffff (Ring Index out of range) in MPID_Init:2104
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(805): fail failed
MPID_Init(1716)......: channel initialization failed
MPID_Init(2104)......: fail failed
srun: error: tcn1467: task 0: Exited with exit code 15
srun: Terminating job step 3270641.0

[donners@int1 pmi2]$ srun --version
slurm 15.08.13-Bull.1.0

The same problem occurs on a system with SLURM 17.02.3 (at TACC). What might be the problem here?

With regards,

John

 

Failed to generate trace file with mpirun


Hi, 

My application is a Python program and MPI is used through mpi4py (built with Intel MPI). The application needs to be killed during the run (it runs for a long time and we only want to profile a small part). I use LD_PRELOAD=libVTfs.so mpirun -trace -n 2 python <my application>, but it did not generate an .stf file as the documentation says. It only generated a folder containing stat-0.bin and stat-1.bin (filesize=0). Is anything wrong with my configuration? I have already sourced mpivars.sh and itacvars.sh. Thanks very much!

Thread Topic: Question

Performance of Iallreduce on Xeon Phi


Hi, 

We are trying to use the non-blocking API (MPI_Iallreduce) in a computation-intensive program. We ran it on two nodes (Xeon Phi) and found, with the Intel Trace Analyzer tool, that the two nodes are not balanced: one node spends more time in Iallreduce (the sum?). We would like to know whether we can create a thread so that the Iallreduce/sum runs on one specific core, in parallel with the user code (OpenMP). Or is there an API or configuration option in Intel MPI that can do this job? Thanks
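
To make the question concrete, here is a small sketch of what we are trying to achieve (my own illustration, not the real application; whether real overlap happens presumably depends on asynchronous progress support):

program overlap_iallreduce
  use mpi
  implicit none
  integer, parameter :: n = 1000000
  double precision :: partial(n), total(n), work(n)
  integer :: i, ierr, request

  call MPI_Init(ierr)
  partial = 1.0d0

  ! post the non-blocking reduction ...
  call MPI_Iallreduce(partial, total, n, MPI_DOUBLE_PRECISION, MPI_SUM, &
                      MPI_COMM_WORLD, request, ierr)

  ! ... and do OpenMP work that does not touch partial/total in the meantime
  !$omp parallel do
  do i = 1, n
     work(i) = sqrt(dble(i))
  end do
  !$omp end parallel do

  call MPI_Wait(request, MPI_STATUS_IGNORE, ierr)
  call MPI_Finalize(ierr)
end program overlap_iallreduce

Ideally the reduction would progress on one dedicated core while the OpenMP threads run on the remaining cores.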

 

Thread Topic: Question

Parallel jobs running on same processors.


Hello,

I just got a KNL system which has 68 cores with 4 threads each, so it should be able to run up to 272 processes. I submitted my first job using mpiexec, using 64 of them. Then I submitted another one with 64. But when I checked the CPU usage, I found that the two jobs were running on the same 64 threads and left the rest idle. What kind of environment variable should I set? Or what kind of job scheduler is recommended? By the way, we did not purchase PBS because it is commercial and outside our budget.

Thanks!

 

Thread Topic: Question

OpenMP application performance dropped with I_MPI_ASYNC_PROGRESS=enable


Hi,

I tried MPI/OpenMP process pinning with the non-blocking API (Iallreduce), using the command below. When I set I_MPI_ASYNC_PROGRESS=enable, the application spends much more time in libiomp.so (kmp_hyper_barrier_release), and vmlinux also gets a little hotter, compared with I_MPI_ASYNC_PROGRESS=disable. Is there any issue with my configuration? VTune shows that all threads are pinned to the right cores; the only difference is that core 67 is used by the MPI communication thread.

========command=================

mpirun    -n 2 -ppn 1    -genv OMP_PROC_BIND=true -genv  I_MPI_ASYNC_PROGRESS= -genv I_MPI_ASYNC_PROGRESS_PIN=67 -genv I_MPI_PIN_PROCS=0-66 -genv OMP_NUM_THREADS=67  -genv I_MPI_PIN_DOMAIN=sock -genv I_MPI_FABRICS=ofi -f ./hostfile   python train_imagenet_cpu.py  --arch alex --batchsize 256 --loaderjob 68  --epoch 100 --train_root /home/jiangzho/imagenet/ILSVRC2012_img_train --val_root /home/jiangzho/imagenet/ILSVRC2012_img_val --communicator naive /home/jiangzho/train.txt /home/jiangzho/val.txt

Thread Topic: Question

Separate processes on separate cores


I'm using MPI to run processes that are nearly independent. They only talk at the very end, for an MPI_GATHER operation. My machine has a 4-core, 8-thread CPU. I run it with:

mpirun -n 101 ./a.out

When I do so, I see (from htop) that it utilises 100% of all the threads. How do I bind it to just the cores? (I tried '-map-by core'.)

Also, I see that all the processes seem to be running concurrently (with ~3-8% CPU per process). Wouldn't it be more efficient if each process got 100% until it reaches the GATHER point?

Install Intel Studio Cluster Edition after installing Composer Edition


Hi all,

I am a student and have been using the Intel Studio XE Composer Edition for the past year. Recently I realized that Intel® Trace Analyzer and Collector is also available for students with the Cluster Edition. I wish to install only this tool without having to uninstall my previous installation of the Composer Edition. When I attempt to customize my installation I get the following error message:

Product files will be installed here:

/opt/intel/

"The install directory path cannot be changed because at least one software product component was detected as having already been installed on the system".

Any help on how I can solve this issue and install only the Trace Analyzer is highly appreciated.

Note that I have many jobs currently running, which I assume will be killed if I uninstall the current Studio version, so I would highly prefer not to kill my jobs.

 

 

Issue with MPI_Iallreduce and MPI_IN_PLACE


Hi, 

I'm having some issues using MPI_Iallreduce with MPI_IN_PLACE from Fortran (I haven't tested with C at this point), and I'm unclear whether I'm doing something wrong with respect to the standard. I've created a simple code that reproduces the issue:

Program Test
Use mpi
Implicit None
Integer, Dimension(0:19) :: test1, test2
Integer :: i, request, ierr, rank
Logical :: complete
Integer :: status(MPI_STATUS_SIZE)

Call MPI_Init(ierr)

do i =0,19
  test1(i) = i
end do

Call MPI_Iallreduce( test1, MPI_IN_PLACE, 20, MPI_INT, MPI_SUM, MPI_COMM_WORLD, request, ierr )
if(ierr /= MPI_Success) print *, "failed"

Call  MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
if(ierr /= MPI_Success) print *, "failed"

Call MPI_Wait(request, status, ierr)
if(ierr /= MPI_Success) print *, "failed"
do i = 0, 1
if(rank == i) print *, rank , test1
Call MPI_Barrier(MPI_COMM_WORLD, ierr)
end do

End Program Test

I've executed it with 2 ranks using this MPI version:

bash-4.1$ mpirun --version
Intel(R) MPI Library for Linux* OS, Version 2017 Update 2 Build 20170125 (id: 16752)
Copyright (C) 2003-2017, Intel Corporation. All rights reserved.

and the output is not as expected (i.e. rank, 0, 2, 4, ...) but instead:

           0           0           1           2           3           4
           5           6           7           8           9          10
          11          12          13          14          15          16
          17          18          19
           1           0           1           2           3           4
           5           6           7           8           9          10
          11          12          13          14          15          16
          17          18          19

i.e. the reduction sum never occurs. If instead of MPI_IN_PLACE I reduce into test2, the code works correctly.
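
For comparison, my reading of the standard is that the in-place form passes MPI_IN_PLACE as the send buffer and the data array as the receive buffer, i.e. something like:

Call MPI_Iallreduce( MPI_IN_PLACE, test1, 20, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD, request, ierr )

which is not what the code above does.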

Am I violating the standard in some way or is there a workaround?

Thanks

Aidan Chalk

Performance degradation with larger messages on KNL (>128 MB)


Hi, 

    When I run the IMB benchmark, I always see an obvious performance drop when the buffer size is greater than 128 MB with OFI. Is this reasonable, or is there some configuration I am missing? Thanks

mpirun -genv I_MPI_STATS=ipm    -genv I_MPI_FABRICS=tmi -n 2  -ppn 1 -f hostfile IMB-MPI1 -msglog 20:29 -iter 20000,1000 uniband -time 1000000  -mem 2

#---------------------------------------------------
# Benchmarking Uniband 
# #processes = 2 
#---------------------------------------------------
       #bytes #repetitions   Mbytes/sec      Msg/sec
            0        20000         0.00       607266
      1048576         1000      8356.72         7970
      2097152          500      8847.71         4219
      4194304          250      9295.13         2216
      8388608          125      9205.23         1097
     16777216           62      9498.35          566
     33554432           31      9577.55          285
     67108864           15      9564.31          143
    134217728            7      9523.83           71
   268435456            3      2700.73           10   <-----------performance dropped much
    536870912            1      3514.32            7

 

Fine-grain time synchronization among HPC nodes


Hi all,

I need to profile an HPC application on multiple nodes with very low overhead. In the application code, I need to monitor MPI synchronization points (barrier, alltoall, etc.). I'm using the invariant TSC (RDTSC/RDTSCP instructions) because I cannot rely on clock_gettime() due to the high overhead of syscalls. I know that TSCs should be synchronized among cores and sockets on the same node, hence I should have no problems with intra-node timing.
But I have the following concerns:

1) How can I synchronize TSCs among different nodes with very fine-grained accuracy (sub-microsecond)? I think the developers of Intel Trace Analyzer and Collector must have faced similar problems.

2) I suppose that TSCs on different nodes always increment at a fixed nominal frequency. Do you think that invariant clock oscillators can drift slightly? I suppose so, but in that case, for long application runs, profilers on different nodes can produce inconsistent inter-node timing information. Moreover, if TSCs are affected by clock drift, I cannot convert time stamps into seconds.

My target system is an HPC machine composed of dual-socket Broadwell nodes interconnected with an Omni-Path network.

Thanks to all in advance,
Daniele

Intel MPI installation problem


Hi,

I am trying to install Intel MPI on Windows Server 2012 R2 SERVERSTANDARDCORE, but during installation error 1603 occurs, connected with installation error 0x80040154: wixCreateInternetShortcuts: failed to create an instance of IUniformResourceLocatorW, and failed to create Internet shortcut. Do you have any idea how I can troubleshoot this?

Thanks for help,

Patrycja

 


No mpiicc or mpiifort with composer_xe/2016.0.109 ?


I started a new job and our company has composer_xe/2016.0.109. When I load the module I do not get the mpiicc or mpiifort compiler wrappers. Does one need the Cluster Edition for those?

Measuring data movement from DRAM to KNL memory


Dear All,

I am implementing and testing the LOBPCG algorithm on a KNL machine for some big sparse matrices. For the performance report, I need to measure how much data is transferred from DRAM to KNL memory. I am wondering if there is a simple way of doing this. Any help or idea is appreciated.

Regards,

Fazlay

BCAST error for message size greater than 2 GB


Hello,

I'm using Intel Fortran 16.0.1 and Intel MPI 5.1.3, and I'm getting an error with MPI_Bcast as follows:

Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(2231)........: MPI_Bcast(buf=0x2b460bcc0040, count=547061260, MPI_INTEGER, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1798)...:
MPIR_Bcast(1826)........:
I_MPIR_Bcast_intra(2007): Failure during collective
MPIR_Bcast_intra(1592)..:
MPIR_Bcast_binomial(253): message sizes do not match across processes in the collective routine: Received -32766 but expected -2106722256

I'm broadcasting a 4-byte integer array of size 547061260 (547061260 × 4 bytes ≈ 2.19 GB, i.e. more than 2147483647 bytes). Is there an upper limit on the message size? The bcast works fine for smaller counts.
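
In case it is relevant, here is a workaround sketch I am considering (it assumes the failure comes from an internal 32-bit byte count; the chunk size is an arbitrary choice of mine): broadcast the array in pieces so that each call stays below 2147483647 bytes.

subroutine bcast_large_int(buf, n, root, comm)
  use mpi
  implicit none
  integer, intent(in) :: n, root, comm
  integer, intent(inout) :: buf(n)
  integer, parameter :: chunk = 250000000     ! 250M integers = 1 GB per call
  integer :: start, cnt, ierr

  start = 1
  do while (start <= n)
     cnt = min(chunk, n - start + 1)
     call MPI_Bcast(buf(start), cnt, MPI_INTEGER, root, comm, ierr)
     start = start + cnt
  end do
end subroutine bcast_large_int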

Thanks!

MPI_Alltoall error when running more than 2 cores per node


We have 6 Intel(R) Xeon(R) CPU D-1557 @ 1.50GHz nodes, each containing 12 cores. hpcc version 1.5.0 has been compiled with Intel MPI and MKL. We are able to run hpcc successfully when configuring mpirun for 6 nodes and 2 cores per node. However, attempting to specify more than 2 cores per node (we have 12) causes the error "invalid error code ffffffff (Ring Index out of range) in MPIR_Alltoall_intra:204".

Any ideas as to what could be causing this issue?

The following environment variables have been set:
I_MPI_FABRICS=tcp
I_MPI_DEBUG=5
I_MPI_PIN_PROCESSOR_LIST=0,1,2,3,4,5,6,7,8,9,10,11

The MPI library version is:
Intel(R) MPI Library for Linux* OS, Version 2017 Update 3 Build 20170405 (id: 17193)

hosts.txt contains a list of 6 hostnames

The line below shows how mpirun is specified to execute hpcc on all 6 nodes, 3 cores per node:
mpirun -print-rank-map -n 18 -ppn 3  --hostfile hosts.txt  hpcc

INTERNAL ERROR: invalid error code ffffffff (Ring Index out of range) in MPIR_Alltoall_intra:204
Fatal error in PMPI_Alltoall: Other MPI error, error stack:
PMPI_Alltoall(974)......: MPI_Alltoall(sbuf=0x7fcdb107f010, scount=2097152, dtype=USER<contig>, rbuf=0x7fcdd1080010, rcount=2097152, dtype=USER<contig>, comm=0x84000004) failed
MPIR_Alltoall_impl(772).: fail failed
MPIR_Alltoall(731)......: fail failed
MPIR_Alltoall_intra(204): fail failed

Thanks!

 

Cluster error: /mpi/intel64/bin/pmi_proxy: No such file or directory


Hi,

I've installed Intel Parallel Studio Cluster Edition in the single-node installation configuration on the master node of a cluster of 8 nodes with 8 processors each. I performed the prerequisite steps before installation and verified shell connectivity by running .sshconnectivity and creating the machines.LINUX file, which reported that all 8 nodes were found, as follows:

*******************************************************************************
Node count = 8
Secure shell connectivity was established on all nodes.
See the log output listing "/tmp/sshconnectivity.aditya.log" for details.
Version number: $Revision: 259 $
Version date: $Date: 2012-06-11 23:26:12 +0400 (Mon, 11 Jun 2012) $
*******************************************************************************

The machines.LINUX file has the following hostnames:

octopus100.ubi.pt
compute-0-0.local 
compute-0-1.local 
compute-0-2.local 
compute-0-3.local 
compute-0-4.local 
compute-0-5.local 
compute-0-6.local 

I started the installation and installed all the modules in the /export/apps/intel directory, which can be accessed by all nodes, as suggested by the cluster administrator. After completing the installation I added the compiler environment scripts psxevars.sh and mpivars.sh to my bash startup script, as advised in the Getting Started manual. I then prepared the hostfile with all the nodes of the cluster for running in the MPI environment and verified the shell connectivity by running .sshconnectivity from the installation directory; it worked as before and detected all nodes successfully.

I wanted to check the cluster configuration, so I compiled and executed the test.c program from the mpi/test directory of the installation. It compiled well, but when I executed myprog it returned the error /mpi/intel64/bin/pmi_proxy: No such file or directory, as follows:

[aditya@octopus100 Desktop]$ mpiicc -o myprog test.c
[aditya@octopus100 Desktop]$ mpirun -n 2 -ppn 1 -f ./hostfile ./myprog
Intel(R) Parallel Studio XE 2017 Update 4 for Linux*
Copyright (C) 2009-2017 Intel Corporation. All rights reserved.
bash: /export/apps/intel/compilers_and_libraries_2017.4.196/linux/mpi/intel64/bin/pmi_proxy: No such file or directory
^C[mpiexec@octopus100.ubi.pt] Sending Ctrl-C to processes as requested
[mpiexec@octopus100.ubi.pt] Press Ctrl-C again to force abort
[mpiexec@octopus100.ubi.pt] HYDU_sock_write (../../utils/sock/sock.c:418): write error (Bad file descriptor)
[mpiexec@octopus100.ubi.pt] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:252): unable to write data to proxy
[mpiexec@octopus100.ubi.pt] ui_cmd_cb (../../pm/pmiserv/pmiserv_pmci.c:174): unable to send signal downstream
[mpiexec@octopus100.ubi.pt] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@octopus100.ubi.pt] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501): error waiting for event
[mpiexec@octopus100.ubi.pt] main (../../ui/mpich/mpiexec.c:1147): process manager error waiting for completion

Later I consulted the troubleshooting manual, which suggested running a non-MPI command (hostname), and it returned the same error, as follows:

[aditya@octopus100 Desktop]$ mpirun -ppn 1 -n 2 -hosts compute-0-0.local, compute-0-1.local hostname
Intel(R) Parallel Studio XE 2017 Update 4 for Linux*
Copyright (C) 2009-2017 Intel Corporation. All rights reserved.
bash: /export/apps/intel/compilers_and_libraries_2017.4.196/linux/mpi/intel64/bin/pmi_proxy: No such file or directory
^C[mpiexec@octopus100.ubi.pt] Sending Ctrl-C to processes as requested
[mpiexec@octopus100.ubi.pt] Press Ctrl-C again to force abort
[mpiexec@octopus100.ubi.pt] HYDU_sock_write (../../utils/sock/sock.c:418): write error (Bad file descriptor)
[mpiexec@octopus100.ubi.pt] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:252): unable to write data to proxy
[mpiexec@octopus100.ubi.pt] ui_cmd_cb (../../pm/pmiserv/pmiserv_pmci.c:174): unable to send signal downstream
[mpiexec@octopus100.ubi.pt] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@octopus100.ubi.pt] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501): error waiting for event
[mpiexec@octopus100.ubi.pt] main (../../ui/mpich/mpiexec.c:1147): process manager error waiting for completion

When I included the master node octopus100.ubi.pt it worked, but only for that node; the rest of the nodes do not seem able to run the MPI commands. I think it may be an environment problem, as the cluster nodes are not able to perform MPI communication with the master node.

Please help me resolve this issue so that I can perform some simulations on the cluster.

Thanks,

Aditya

 
