Channel: Clusters and HPC Technology
Viewing all 930 articles

MPI performance problem on inter-switch connection


Hi,

I have a cluster of 32 machines: the first 25 are in one rack and the remaining 7 are in a second rack.
Each rack has its own 1 Gbps Ethernet switch.

I run an MPI application across all 32 machines (one process per host).
When I use a network performance benchmark tool like 'iperf' to measure the network speed between the machines, there is no problem: every point-to-point connection among the 32 machines achieves the full bandwidth.

In my application (MPI_Send/MPI_Recv), each MPI process sends several 4 MB messages to the other machines, so message size should not be the problem.
I found that the communication speed between the first 25 machines and the other 7 machines is very poor (~10-20 MB/s),
while the communication speed within the first 25 machines, and within the last 7, is fast (100-110 MB/s).

 

What is the likely cause here? Is latency killing it?
What can I do to improve the performance?

Are there any suggested optimizations?


Signal propagation with Intel MPI


Our lab recently added another HPC cluster and updated most of the other clusters to RHEL 7 (from 6). The PBS system was also updated to a newer supported version. My codes implement signal handlers to effect a controlled shutdown, with or without checkpoints. I have been using the 2015 version of the Intel tool chain, in particular Intel MPI, to run the jobs. I use SIGUSR1 and SIGUSR2; before the upgrade (late December 2016) this all worked correctly, but now it does not. My code also always works with OpenMPI (tested on many different systems, including large national-lab computers).

After the upgrade, SIGUSR1 is not passed to the MPI processes and SIGUSR2 crashes the code. I have tested this with a very small program that just catches the signals and prints something. The behavior is independent of PBS (i.e., it also occurs when running the job interactively from a head node on the cluster). Note that the Intel tool chain, and presumably the MPI libraries within it, have not changed! I cannot find anything on using these signals, other than one web post from years ago (2013?) indicating it's a "bug". Also, SIGINT and SIGTERM are passed, even though the manual (for hydra exec variables, https://software.intel.com/en-us/node/528782) indicates that they should not be. Changing I_MPI_JOB_SIGNAL_PROPAGATION has no effect on this behavior.

1. Does anyone understand what is going on?

2. What is Intel MPI's position on passing SIGUSR1 and SIGUSR2 to the MPI processes? Why would these ever be blocked or not propagated? Even if something else used these signals (like a checkpoint library), my handler should catch them first and supersede all others.

3. Why do some hydra parameters not work as advertised?

tnx ... John G. Shaw

Laboratory for Laser Energetics
University of Rochester

 

libfabric.so.1 not found warning


One warning I get with I_MPI_DEBUG set to 5 is this odd one:
 
[0] MPID_nem_ofi_init(): cannot load default ofi library libfabric.so, error=libfabric.so.1: cannot open shared object file: No such file or directory
 
I *THINK* this is saying we don't have the OFI library installed. Is this something to worry about, or can I just ignore it?

Would OFI give any performance benefit on an Omni-Path fabric? What little I know suggests that OFI helps with portability but nothing else.

MPI 5.0 Runtime Environment silent uninstall


Hello,

I'm an administrator at our company, and I'm facing the task of uninstalling the MPI 5.0 Runtime Environment on 50+ Windows machines. The MPI 5.0 Runtime Environment comes with an FEM software package we use, and I successfully made the installation routine silent using this manual:

https://software.intel.com/sites/default/files/managed/45/20/intelmpi-20...

In case we want to uninstall the software from our machines again, I would like to uninstall its prerequisites (to which Intel MPI belongs) as well.

Unfortunately, only the silent installation is described in the manual, not the silent uninstallation. Does anyone know which parameters the setup.exe under C:\ProgramData\Intel\Installer\ParallelStudio\cache\{A5A3784F-E722-477F-875E-E7A6AAA5A771} supports? Is that even the right executable? At least in the registry, this executable is referenced in the uninstall string.

Thanks a lot for your help!

Tim


MPI Performance issue on bi-directional communication


Hi,

(I attached the performance measurement program written in C++)

I am experiencing a performance issue with bi-directional MPI_Send/MPI_Recv operations.

The program runs two threads: one for MPI_Send and the other for MPI_Recv.
- MPI_Recv receives any data from any source.
- MPI_Send sends data to the other ranks one at a time (starting from its own rank, then rank+1, ..., 0, ..., rank-1).

You can compile the attached file as follows:
$ mpiicpc -O3 -m64 -std=c++11 -mt_mpi -qopenmp ./mpi-test.cpp -o mpi-test

You can test it as follows:
$ mpiexec.hydra -genv I_MPI_PERHOST 1 -genv I_MPI_FABRICS tcp -n 2 -machinefile ./machine_list /home/TESTER/mpi-test
rank[0] --> rank[0]     BW=2060.27 [MB/sec]
rank[0] --> rank[1]     BW=56.38 [MB/sec]
rank[0] BW=219.21 [MB/sec]
rank[1] BW=217.20 [MB/sec]

$ mpiexec.hydra -genv I_MPI_PERHOST 1 -genv I_MPI_FABRICS tcp -n 4 -machinefile ./machine_list /home/TESTER/mpi-test
rank[0] --> rank[0]     BW=2050.59 [MB/sec]
rank[0] --> rank[1]     BW=112.35 [MB/sec]
rank[0] --> rank[2]     BW=57.19 [MB/sec]
rank[0] --> rank[3]     BW=109.64 [MB/sec]
rank[0] BW=218.28 [MB/sec]
rank[1] BW=219.17 [MB/sec]
rank[2] BW=220.75 [MB/sec]
rank[3] BW=221.17 [MB/sec]
 

What I am observing is that when data transfers from rank A to rank B and from rank B to rank A occur simultaneously, the performance drops significantly (almost by half).
The cluster machines run CentOS 7 over 1 Gbps Ethernet that supports full-duplex transmission mode.

How can I resolve this issue?

- Does Intel MPI support full-duplex transmission between two ranks?

Attachment: mpi-test.cpp (2.5 KB)

How to limit the amount of memory to be used for Eager messaging protocol?


Intel MPI has a parameter, I_MPI_EAGER_THRESHOLD.

My program crashes with an out-of-memory error caused by the memory used for the eager messaging protocol.

However, I cannot switch to the Rendezvous protocol for performance reasons.
(I'll skip the reasons here, sorry :D)

 

Is there any way to limit the amount of memory used by the eager messaging protocol?
I just want to know whether such a way exists!

MPI_Info_set crashes


Here is a small section of my test code:

use mpi

.......

  call mpi_init(info)
  comm = mpi_comm_world
  
  call MPI_COMM_SIZE(comm, nproc, info)
  
  call MPI_COMM_RANK(comm, node_me, info)
  
  len0 = 300
  len1 = int8(len0)*10*100
  size = len1 
  disp_unit = 8
  size = size*disp_unit
  
  call MPI_Info_create(info_nc,ierr)
   
  call MPI_Info_set(info_nc, "alloc_shared_noncontig" , ".true." ,ierr)
  
  call MPI_Win_allocate_shared(size, disp_unit, info_nc, &
                             comm, baseptr, win,ierr)

.......

It always crashes when calling the subroutine MPI_Info_set, with a message as follows:

Unhandled exception at 0x000007FEE07FDA6E (impi.dll) 

I used the latest Intel Fortran compiler 17.0 and the Intel MPI Library 2017 (2017.1.143). It was tested on MS Windows 7.

 

Thanks.

Qinghua

 

 

 

 


MPI_ISend bug on Fortran allocatable character


I am encountering what I believe to be a bug involving allocatable character arrays and MPI_ISend: sending an allocatable character array with MPI_ISend and receiving it with MPI_Recv returns junk. See the test program below, compiled with -assume realloc_lhs. The same code built with the GNU compiler and OpenMPI receives and prints the expected result. Are there any other compiler flags I should be using that would influence compiler behavior here? Statically declaring the send string produces the expected result; that declaration is commented out in the example below.

Intel Fortran Linux 17.0.1 20161005
Intel MPI Linux 5.1.3

program main
    use mpi_f08
    implicit none

    integer                      :: ierr
    !character(11)               :: send_char
    character(:),   allocatable  :: send_char
    character(11)                :: recv_char
    type(mpi_request)            :: handle

    call MPI_Init(ierr)

    send_char = 'test string'

    call MPI_ISend(send_char,11,MPI_CHARACTER,0,0,MPI_COMM_WORLD,handle,ierr)
    call MPI_Recv(recv_char, 11,MPI_CHARACTER,0,0,MPI_COMM_WORLD,MPI_STATUS_IGNORE,ierr)
    call MPI_Wait(handle,MPI_STATUS_IGNORE,ierr)

    print*, recv_char

end program main


mpiifort running fine on some nodes and showing "open_hca: device mlx4_0 not found" for others


Dear all,

Using mpiifort on a cluster results in "open_hca: device mlx4_0 not found" on some groups of nodes, while on others there is no error and everything runs perfectly fine. All the nodes have the same hardware/software configuration. I already had a look at the similar topic at:

https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/393416

and applied the proposed solution of commenting out the ofa-v2-mlx4_0-1 and ofa-v2-mlx4_0-2 lines in /etc/dat.conf, but it did not solve the issue.

Would you have any idea what might be wrong? I attach the error log, as well as the ibstat output in case it helps:

$ ibstat

CA 'mlx4_0'
        CA type: MT4099
        Number of ports: 1
        Firmware version: 2.31.5050
        Hardware version: 1
        Node GUID: 0xf45214030090c050
        System image GUID: 0xf45214030090c053
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 56
                Base lid: 1
                LMC: 0
                SM lid: 1
                Capability mask: 0x0251486a
                Port GUID: 0xf45214030090c051
                Link layer: InfiniBand

Many thanks in advance,

Edrisse

Problem executing HPCC and IMB benchmarks on IBM platform HPC


I am a newbie in the HPC world, and I know a little about MPI and InfiniBand.

I want to run the IMB 2017 and HPCC 1.5.0 benchmarks on our HPC system to make sure everything is configured correctly.

We have 32 compute nodes, each with 16 cores and 32 GB of memory. Each node has a QLogic InfiniBand card with one port at 40 Gb/s.

The OS is RHEL 6.5 with IBM Platform HPC 4.2.

The OFED used is IntelIB-OFED.RHEL6-x86_64.3.5.2.34.

GCC: gcc version 4.4.7

I managed to compile IMB and HPCC with both IBM Platform MPI (PMPI) and OpenMPI 2.0.1 (OMPI).

1) IMB: when executing the IMB benchmark with both PMPI and OMPI over the InfiniBand links, I get at most:

-----------------------------------------------------
Benchmarking PingPong
# processes = 2
---------------------------------------------------

   #bytes #repetitions      t[usec]   Mbytes/sec
        0         1000         1.51         0.00
        1         1000         1.51         0.63
       ...         ...           ...         ...
  2097152           20       675.20      2962.09
  4194304           10      1320.45      3029.26

3029 MB/s of throughput. I expected more, something near 4000 MB/s. Is this result correct?

2) HPCC: I used this site http://www.advancedclustering.com/act-kb/tune-hpl-dat-file/ to generate the test profile. When executing the benchmark on fewer than 25 nodes, the test runs without problems (I didn't wait for it to complete). My problem is when I start the benchmark on all nodes: after 2 to 5 seconds I get this error message: compute014.6359 Exhausted 1048576 MQ irecv request descriptors, which usually indicates a user program error or insufficient request descriptors (PSM_MQ_RECVREQS_MAX=1048576)

and the benchmark is killed and exits. It is not the same node every time; each time it's another node. Any idea?

These are the commands used to start the HPCC benchmark: OMPI: mpirun -np 512 --display-allocation --mca btl self,sm --mca mtl psm --hostfile hosts32 /shared/build/hpcc-150-blas-ompi-201/hpcc hpccinf.txt

PMPI : mpirun -np 512 -PSM -hostfile hosts32 /shared/build/hpcc-150-blas-pmpi/hpcc hpccinf.txt

If you need more information, let me know.

Regards

 


MPI processes got killed when the computer goes in sleeping mode


I notice that all processes started under MPI are gone when the computer wakes up from sleep mode.

Is this normal behavior? Is there any setting that keeps the processes alive?

 

kr

Dany


Why is runtime invalid?


Hi, Intel technician,

When I set the environment variable OMP_SCHEDULE=runtime, the following warning is given:

D:\users\dingjun\build\data>mx201610.exe  -f D:\users\dingjun\build\data\mx8321x105x10.dat -log -jacdoms 32 -parasol 32 -o mx8321x105x10_32cores1job_i
mbdw2s11_windows_runtime_compact1_affinity_manualRun_1
OMP: Warning #42: OMP_SCHEDULE: "runtime" is an invalid value; ignored.
OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #156: KMP_AFFINITY: 32 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 16 cores/pkg x 1 threads/core (32 total cores)

Could you tell me why the runtime value is invalid? According to your manual it seems to be valid. Please see the following:

 

Type and effect of each schedule value:

STATIC
Divides iterations into contiguous pieces by dividing the number of iterations by the number of threads in the team. Each piece is then dispatched to a thread before loop execution begins.
If chunk is specified, iterations are divided into pieces of a size specified by chunk. The pieces are statically dispatched to the threads in the team in a round-robin fashion, in the order of the thread number.

DYNAMIC
Can be used to get a set of iterations dynamically. The chunk defaults to 1 unless it is specified.
If chunk is specified, the iterations are broken into pieces of a size specified by chunk. As each thread finishes a piece of the iteration space, it dynamically gets the next set of iterations.

GUIDED
Can be used to specify a minimum number of iterations. The chunk defaults to 1 unless it is specified.
If chunk is specified, the chunk size is reduced exponentially with each succeeding dispatch. The chunk specifies the minimum number of iterations to dispatch each time. If fewer than chunk iterations remain, the rest are dispatched.

AUTO (1)
Delegates the scheduling decision to compile time or run time. The schedule is processor dependent. The programmer gives the implementation the freedom to choose any possible mapping of iterations to threads in the team.

RUNTIME (1)
Defers the scheduling decision until run time. You can choose a schedule type and chunk size at run time by using the environment variable OMP_SCHEDULE.

(1) No chunk is permitted for this type.

 

 

I look forward to hearing from you. Thanks.

 

best regards,

 

Dingjun

Why is the 'logic' value of KMP_AFFINITY invalid?


Hi, INTEL technician,

 

When the environment variable KMP_AFFINITY=logic is set, the following warning is given:

D:\users\dingjun\tests>set KMP_AFFINITY=verbose,logic

D:\users\dingjun\tests>set OMP_SCHEDULE=static,1

D:\users\dingjun\tests>cd D:\users\dingjun\tests

D:\users\dingjun\tests>mx201510.exe  -f D:\users\dingjun\tests\mx1041x105x10.dat -log -jacdoms 32 -parasol 32 -o mx1041x105x10_32cores1job_windows_static_logic_affinity_manualRun_1
OMP: Warning #58: KMP_AFFINITY: parameter invalid, ignoring "logic".
OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #156: KMP_AFFINITY: 32 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 4 packages x 8 cores/pkg x 1 threads/core (32 total cores)
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #147: KMP_AFFINITY: Internal thread 8 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #147: KMP_AFFINITY: Internal thread 9 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #147: KMP_AFFINITY: Internal thread 10 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #147: KMP_AFFINITY: Internal thread 11 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #147: KMP_AFFINITY: Internal thread 12 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #147: KMP_AFFINITY: Internal thread 13 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #147: KMP_AFFINITY: Internal thread 14 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #147: KMP_AFFINITY: Internal thread 15 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #147: KMP_AFFINITY: Internal thread 16 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #147: KMP_AFFINITY: Internal thread 17 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #147: KMP_AFFINITY: Internal thread 18 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #147: KMP_AFFINITY: Internal thread 19 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #147: KMP_AFFINITY: Internal thread 20 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #147: KMP_AFFINITY: Internal thread 21 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #147: KMP_AFFINITY: Internal thread 22 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #147: KMP_AFFINITY: Internal thread 23 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #147: KMP_AFFINITY: Internal thread 24 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #147: KMP_AFFINITY: Internal thread 25 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #147: KMP_AFFINITY: Internal thread 26 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #147: KMP_AFFINITY: Internal thread 27 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #147: KMP_AFFINITY: Internal thread 28 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #147: KMP_AFFINITY: Internal thread 29 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #147: KMP_AFFINITY: Internal thread 30 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #147: KMP_AFFINITY: Internal thread 31 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}

Such warnings occur only on the 4-socket, 32-core Sandy Bridge computer. On a 2-socket, 32-core Broadwell computer with KMP_AFFINITY=logic there is no such warning, so the 'logic' value appears to work for KMP_AFFINITY on Broadwell computers.

 

Could you tell me why this occurred?

 

Thanks,

 

Best regards,

 

Dingjun

 

 

memory problem in I_MPI_WAIT_MODE=1 with shm:ofa fabrics


Hi,

We found that when using wait mode with the shm:ofa fabrics, the processes of an MPI program use more memory than with other configurations. In some situations the program crashes with memory exhaustion. The issue seems reproducible in a couple of programs, including WRF and CMAQ. We tried using Intel's HPL benchmark to reproduce the problem; it does not crash, but we get similar warning messages, as follows:

================================================================================
HPLinpack 2.1  --  High-Performance Linpack benchmark  --   October 26, 2012
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N        :  100000
NB       :     168
PMAP     : Row-major process mapping
P        :      24
Q        :       1
PFACT    :   Right
NBMIN    :       4
NDIV     :       2
RFACT    :   Crout
BCAST    :  1ringM
DEPTH    :       0
SWAP     : Mix (threshold = 64)
L1       : transposed form
U        : transposed form
EQUIL    : yes
ALIGN    :    8 double precision words

--------------------------------------------------------------------------------

[...]

Column=057624 Fraction=0.575 Mflops=296889.01
Column=059640 Fraction=0.595 Mflops=295321.12
Column=061656 Fraction=0.615 Mflops=293425.73
[15] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[3] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[18] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[0] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
Column=063504 Fraction=0.635 Mflops=291756.47
[6] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[12] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[21] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[13] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[4] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[9] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[1] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[22] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[7] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[10] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[14] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[5] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[11] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[2] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[16] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[8] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[19] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[23] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[17] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
Column=065520 Fraction=0.655 Mflops=290125.43
[20] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended

[...]

The test was done on CentOS 6.7 with Intel MPI 5.0.2.

Thank you very much,

regards,

tofu

 

MPI_Recv Hangs While Receiving Large Array


I'm noticing some strange behavior with a very simple piece of MPI code:

#include <mpi.h>
#include <iostream>   // needed for std::cout

int main(int argc, char* argv[])
{
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // We are assuming exactly 2 processes for this task
    if (world_size != 2)
    {
        std::cout << "World size must be equal to 2" << std::endl;
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    // const makes the array bound a compile-time constant
    // (a plain int would make this a non-standard VLA in C++)
    const int numberCounter = 10000;
    double number[numberCounter];

    if (world_rank == 0)
    {
        std::cout << world_rank << std::endl;
        MPI_Send(number, numberCounter, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    }
    else if (world_rank == 1)
    {
        std::cout << world_rank << std::endl;
        MPI_Recv(number, numberCounter, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
}

The above works fine provided that numberCounter is small (~1000). When the value is larger (>10000), however, the code hangs and never reaches the end. I checked with MPI_Iprobe, and it does show that rank 1 is receiving a message, but MPI_Recv always hangs.

What could be causing this? Can anyone else reproduce this behaviour?


MPI/OpenMP segmentation error


Hello,

When I run my program on two nodes with the number of threads set larger than 1 (i.e., 2-16 threads), I encounter this segmentation error:

forrtl: severe (174): SIGSEGV, segmentation fault occurred

If the OpenMP directives are commented out (still using MPI), the error goes away.

Any comment/advice is appreciated!

 

mpsvars error


Dear All,

I have successfully installed parallel_studio_xe_2017_update1 on CentOS 7. The installed package works well. However, I get an error when I source it from my .tcshrc file; my default SHELL is tcsh. Please see the error below, which appears when I log in to the terminal and source it there. Please suggest what the issue could be.

Raju

 Last login: Fri Feb 10 10:32:49 2017 from 146.114.195.2
Intel(R) Parallel Studio XE 2017 Update 1 for Linux*
Copyright (C) 2009-2016 Intel Corporation. All rights reserved.
mpsvars.csh: ERROR: Could not find correct folders structure.
[XXXX@air ~]$ source /opt/intel/parallel_studio_xe_2017.1.043/bin/psxevars.csh
Intel(R) Parallel Studio XE 2017 Update 1 for Linux*
Copyright (C) 2009-2016 Intel Corporation. All rights reserved.
[XXXX@air ~]$
 

 


cross-compilation


I would like to buy Parallel Studio XE if cross-compilation is possible.
I generate executable modules of 2 GB or more.
I have compiled on Linux and confirmed that this works.
However, I need to run the executables on Windows. Is this possible? I tried various options, but good results were not obtained.
Please tell me whether it is possible or impossible. If it is, please let me know the procedure.
Thank you.


mpirun errors


Hi,

I've been using Intel compilers for a long time and I'm really impressed with them. Recently I created a small cluster and installed Intel Parallel Studio XE Cluster Edition, wanting to use MPI for my scientific computing work. I installed everything correctly and configured all the environment variables. The cluster's operating system is openSUSE Leap. I set up a password-less ssh connection manually, tested it, and it works properly; the mpiifort wrapper works fine. Everything works up to the point where I have to run the mpirun command. When I give the following mpirun command

       mpirun -n 2 -ppn 1 -f hosts ./a.out 

it goes into an infinite loop, and when I break the process by force it gives the following messages:

^C[mpiexec@master] Sending Ctrl-C to processes as requested
[mpiexec@master] Press Ctrl-C again to force abort
[mpiexec@master] HYDU_sock_write (../../utils/sock/sock.c:418): write error (Bad file descriptor)
[mpiexec@master] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:252): unable to write data to proxy
[mpiexec@master] ui_cmd_cb (../../pm/pmiserv/pmiserv_pmci.c:174): unable to send signal downstream
[mpiexec@master] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@master] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501): error waiting for event
[mpiexec@master] main (../../ui/mpich/mpiexec.c:1147): process manager error waiting for completion

I had previously installed OpenMPI and an MPICH version, but I removed them all and tried again, ending up with the same results. I checked which mpirun is called using the command 'which mpirun', and it gives me:

/home/lakshitha/intel/compilers_and_libraries_2017.1.132/linux/mpi/intel64/bin/mpirun

 

I've attached the hosts file, which contains the host names. I tried several things, but they did not work. I would really appreciate it if somebody could sort this out.

Attachment: hosts.txt (20 bytes)

optimized executables for specific processors


We have four clusters composed of nodes with different vintages of Intel Xeon processors:

Intel(R) Xeon(R) CPU E5-2697

Intel(R) Xeon(R) E5-2690

Intel(R) Xeon(R) X5675

Intel(R) Xeon(R) E5530

We are using the 16U3 version of the Intel ifort compiler.

Are there compiler optimization parameters I should use, looking for ultimate performance, that would produce the best executable for each machine?

Or is there one set of optimization parameters that would produce just as good an executable, which I could compile on any machine and execute on another (or all) machines? Again, we are looking for ultimate performance rather than portability as the prime concern.

 

 

 

 

 
