Channel: Clusters and HPC Technology

Communication layers and eager thresholds


Hello,

I am a bit confused about the run-time behaviour of one of our codes compiled with Intel Parallel Studio XE 2017.2.050 and run with the corresponding Intel MPI. Depending on the choice of communication layer and eager-threshold settings, the hybrid-parallel code either crashes, apparently with corrupted data (it seems to end up with wrong numbers in its data arrays, which then lead to program errors), or runs to completion.

What does work:

- I_MPI_FABRICS shm:tcp

- I_MPI_FABRICS shm:ofa with I_MPI_EAGER_THRESHOLD=2097152 (2MB)

What does not work:

- I_MPI_FABRICS shm:ofa without I_MPI_EAGER_THRESHOLD set, or with too low a value (the default should be 256 kB)
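
For completeness, the working InfiniBand combination looks like this in our job script (a minimal sketch; the rank count and binary name are placeholders):

# combination that runs cleanly for us over InfiniBand (OFA)
export I_MPI_FABRICS=shm:ofa
export I_MPI_EAGER_THRESHOLD=2097152   # 2 MB instead of the 256 kB default
# alternative that also works: Ethernet, with no threshold change
# export I_MPI_FABRICS=shm:tcp
mpirun -np 64 ./our_hybrid_code        # placeholder rank count and binary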

The problem definitely seems to be the inter-node communication, as specifying only I_MPI_INTRANODE_EAGER_THRESHOLD does not help, while switching from InfiniBand (ofa) to Ethernet (tcp) does. Just for information: on other clusters (admittedly with different compiler/MPI suite combinations) we have not seen the same issue. Do you have some general information on how such behaviour can be explained, without going into the details of this specific software? Is it possible that data is overwritten in too-small message buffers or due to wrong memory addressing, or something along those lines? Would you recommend starting the debugging on the code side or on the InfiniBand configuration side, and do you have suggestions as to the possible cause? As you might guess by now, I am not a software developer, especially when it comes to parallel applications, so I would greatly appreciate any help and hints.

Thanks and best regards,

Alexander


Unable to compile Timer sample (timertest.c, timerperformance.c)


 

"This example provides functionality to test performance of the VT_timestamp() function."

"Ensure that the corresponding compiler, Intel MPI Library, and Intel Trace Analyzer and Collector are already in your PATH"

When trying to build (make) the example, all I can get is the following error message:

"You need to set VT_ROOT env variable to use Intel(R) Trace Collector"

What should I do to set up the proper environment to compile only this utility?
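
For context, my understanding from the documentation is that the environment is supposed to come from the various vars scripts, something like the following (the install paths are guesses based on a default installation, and I may well be sourcing the wrong thing):

# sketch: set up compiler, Intel MPI and ITAC before running make
source /opt/intel/compilers_and_libraries_2018/linux/bin/compilervars.sh intel64
source /opt/intel/itac/2018/bin/itacvars.sh   # should export VT_ROOT, among others
make                                          # build timertest / timerperformance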

The utility is attached to this message as a ZIP file.

Thanks.

Attachment: timertest.zip (16.49 KB)

..\hydra\pm\pmiserv\pmiserv_cb.c (781): connection to proxy failed


Hi,

I am having issues with Intel MPI 5.1 for Windows. I am trying to run a program that uses Intel MPI on two Windows 7 computers.

This is the error from the MPI program I am running:

[mpiexec@hostname] ..\hydra\pm\pmiserv\pmiserv_cb.c (781): connection to proxy 1 at host 10.10.13.116 failed

[mpiexec@hostname] ..\hydra\tools\demux\demux_select.c (103): callback returned error status

[mpiexec@hostname] ..\hydra\pm\pmiserv\pmiserv_pmci.c (500): error waiting for event

[mpiexec@hostname] ..\hydra\ui\mpich\mpiexec.c (1130): process manager error waiting for completion

 

The following items have been checked.

1) .\hydra_service.exe -status
hydra service running on hostname

2) Both machines have the same software version and setup.

3) .\mpiexec -ppn 1 -n 2 -v  -f hostfile.txt hostname
returns the host name of the two computers which I am using.

4) Firewall is turned off on both computers.

Are there tools I can use to debug the proxy connection issue?
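
For reference, these are the commands I have been using to check the service, and what I would try next on both machines; the reinstall and -register steps are only what I understand from the documentation, so they may not be the missing piece:

rem check, and if necessary reinstall, the hydra process manager service (elevated prompt)
hydra_service.exe -status
hydra_service.exe -remove
hydra_service.exe -install

rem register the user credentials that mpiexec should use on the remote host
mpiexec -register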

Thanks,

Andy

 

MPI Linpack from MKL, SSE4_2, turbo, and Skylake: SSE 4.2 threads run at the AVX2 turbo frequency


This is a follow-on from this related topic in the MKL Forum
https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/782951

In that situation, there were a couple of versions of Intel MPI 2017.* in which a couple of the threads in each MPI process running Linpack with SSE 4.2 instructions would run at the AVX-512 turbo frequency rather than at the non-AVX frequency. (The other threads would all run at the non-AVX frequency, as expected.) I'm using the version of Linpack that comes with MKL; I find it at /opt/intel/compilers_and_libraries_2018.3.222/linux/mkl/benchmarks/mp_linpack/, for example.

That problem was avoided by using older or newer versions of MPI.

The current problem is more subtle, and stems, I suspect, from Intel MPI using "light-weight" AVX-512 instructions on Skylake, perhaps for copying data. The wikichip page https://en.wikichip.org/wiki/intel/frequency_behavior says that "light-weight" AVX-512 instructions will run at the AVX 2.0 frequency.

Typically it wouldn't be a problem for MPI communication to run at a lower frequency, as the core's frequency will drop from the non-AVX frequency to the AVX 2.0 frequency while executing these instructions, but will return to the non-AVX frequency soon after.

The problem, however, arises in jitter-sensitive environments. Even there it typically won't be a problem, because the MPI code will likely be in a section that is not sensitive to jitter; other cores should not be affected, and the thread doing communication should soon return to its higher frequency.

However, Hewlett Packard Enterprise has a feature called Jitter Smoothing as part of its Intelligent System Tuning feature set (https://support.hpe.com/hpsc/doc/public/display?docId=a00018313en_us). When this feature is set to "Auto" mode, the server notices changes to the operating frequency of the cores and decreases the frequency of all cores on all processors accordingly. The system soon reaches a stable state in which frequency changes no longer occur, with a resulting state of minimized jitter, but at the lower AVX 2.0 turbo frequency.

On my servers, I have this Jitter Smoothing feature enabled.  When I run, for example, the MPI version of Linpack with the environment variable MKL_ENABLE_INSTRUCTIONS=SSE4_2 on Skylake with turbo enabled, I see the operating frequency gradually drop from the non-AVX frequency to the AVX2 frequency.  This happens on all cores across the server, which is what Jitter Smoothing is expected to do. I do not see this happen when I run the OpenMP (non-MPI) version of Linpack; in that case all the CPUs stay at the non-AVX frequency for the duration of the workload.
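
For reference, this is roughly how I launch the run (the paths are from my system, and the binary name and rank count may differ on other MKL versions):

# roughly how I launch the MPI Linpack run on one Skylake node
source /opt/intel/compilers_and_libraries_2018.3.222/linux/mpi/intel64/bin/mpivars.sh
cd /opt/intel/compilers_and_libraries_2018.3.222/linux/mkl/benchmarks/mp_linpack
export MKL_ENABLE_INSTRUCTIONS=SSE4_2   # restrict MKL's compute kernels to SSE 4.2
mpirun -np 4 ./xhpl_intel64_dynamic     # binary name as shipped with my MKL version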

As I said, I hypothesize that MPI is using some AVX-512 instructions, perhaps to copy data, and that's causing my performance problem.  If my hypothesis is correct, my question is if there is a way to tell Intel MPI to not use AVX instructions, similar to the MKL_ENABLE_INSTRUCTIONS environment variable that Intel MKL uses.

 

Bug Report for impi 2017 and 2018 versions on AVX512 nodes


For a specific code, which has a range of calls using impi, there is a probability of about 0.1% of obtaining incorrect results when using impi versions later than 5.0.2.044 on machines with AVX512 registers (Intel Xeon Gold 6130 2.10 GHz). The problem is highly reproducible, although because it is a low probability error it requires 1K or more repeat runs and quantifying the exact rate is difficult. (It seems to be slightly more common when there is other activity on the node.) It is not a catastrophic error, it just ends up with slightly incorrect results being produced.

 

The problem does not occur on Xeon E5-2650 machines (AVX), or other older E5410 that we have.

 

There are zero indications that this has anything to do with errors on our InfiniBand fabric. We have spent significant time with the cluster vendor and there are no indicators.

 

I regressed the ifort compiler and even removed MKL ScaLAPACK (using the Netlib version instead), so I am 99.9999% certain that it is impi related; only regressing the impi version removed the problem.

 

Of course, because it is a low probability error it may well have been undetected.

 

N.B., I originally posted this to one of the Intel support links, but for some reason they wanted it to go to a Forum rather than providing me with direct support.

MPI error on 2019 Beta version


 

I am the admin of a small cluster, trying to help a user. The user has been able to compile and run this code on a Cray with Intel XE 2017, Update 7.

On our local cluster we only had 2017, Update 1, so I decided to install the 2019 Beta version. The user was able to compile, but when running she received this error:

 

mpirun -np 16 /users/kings/navgem-x/bin/sst_interp

Abort(1618063) on node 1 (rank 1 in comm 0): Fatal error in MPI_Init: Other MPI error, error stack:

MPIR_Init_thread(613)..........:

MPID_Init(378).................:

MPIDI_NM_mpi_init_hook(1047)...:

MPIDI_OFI_create_endpoint(1873): OFI EP enable failed (ofi_init.h:1873:MPIDI_OFI_create_endpoint:Cannot allocate memory)

Abort(1618063) on node 12 (rank 12 in comm 0): Fatal error in MPI_Init: Other MPI error, error stack:

MPIR_Init_thread(613)..........:

MPID_Init(378).................:

MPIDI_NM_mpi_init_hook(1047)...:

MPIDI_OFI_create_endpoint(1873): OFI EP enable failed (ofi_init.h:1873:MPIDI_OFI_create_endpoint:Cannot allocate memory)

 

 

 

Any clue as to what this means? Thank you.

API for I_MPI_ADJUST Family


Hi,

I am wondering whether Intel MPI provides a C API for changing the algorithm ID of collectives at run time.

As far as I see in the documentation, I can only set the I_MPI_ADJUST* environment variables to choose the desired algorithm.

Perhaps there is an unofficial way of doing this?
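
For context, what I do today is fix the algorithm from the outside before launching, along these lines (the algorithm IDs and size ranges here are only examples):

# current approach: choose algorithms via the environment before launch
export I_MPI_ADJUST_BCAST=9                                  # one algorithm for all sizes
export I_MPI_ADJUST_ALLREDUCE="5:0-8192;1:8193-2147483647"   # per-size-range selection
mpirun -n 64 ./my_app

What I would like instead is something callable from C between communication phases, so the algorithm can be switched without relaunching the job.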

Thanks

 



I_MPI_ADJUST_BCAST: srun vs. mpiexec, performance differences


Hi,

I was testing the I_MPI_ADJUST_BCAST settings on our Omni-Path cluster (36 nodes), using the OSU micro-benchmarks for comparison and testing.

Problem 1: mpiexec vs. srun when I_MPI_ADJUST_BCAST is unset

First, I get a SLURM allocation like this: salloc -N 36 --cpu-freq=HIGH -t 10:00:00 (The processors are now all running at the maximum frequency.)

I either leave I_MPI_ADJUST_BCAST unset, or unset it explicitly:

unset I_MPI_ADJUST_BCAST

Now with srun I get this

srun --mpi=pmi2 --cpu-bind=core -N 36 --ntasks-per-node=32 ./osu-micro-benchmarks-5.4.1/mpi/collective/osu_bcast

# OSU MPI Broadcast Latency Test v5.4.1
# Size       Avg Latency(us)
1                     417.27
2                     418.78
4                     417.23
8                     417.51
16                    420.89
32                    463.69
64                    553.98
128                   569.21
256                   606.14
512                   672.11
1024                  848.40
2048                 1009.13
4096                 1360.44
8192                 2081.46
16384                2907.58
32768                3046.97
65536                3184.76
131072               3416.80
262144               3873.73
524288               4500.13
1048576              5556.90

However, the same call with mpiexec gives me these values (same allocation)

mpiexec -machinefile ./machinefile_36_intel -n 1152 ./osu-micro-benchmarks-5.4.1/mpi/collective/osu_bcast

# OSU MPI Broadcast Latency Test v5.4.1
# Size       Avg Latency(us)
1                       4.92
2                       4.94
4                       4.85
8                       5.01
16                      5.56
32                      5.49
64                      5.57
128                    15.20
256                    13.80
512                    14.22
1024                   15.55
2048                   17.12
4096                   20.43
8192                   27.16
16384                 131.68
32768                 143.90
65536                 170.50
131072                222.50
262144                328.82
524288                513.49
1048576               998.40

Thus, I assume that the Intel MPI library uses some optimized values for mpiexec (although there are none in the etc directories). It seems that the selection strategy differs significantly between the two cases. Why?

Problem 2: Differences with I_MPI_ADJUST_BCAST=0

Similarly, there are also differences between srun and mpiexec when setting I_MPI_ADJUST_BCAST=0

export I_MPI_ADJUST_BCAST=0
  • srun
srun --mpi=pmi2 --cpu-bind=core -N 36 --ntasks-per-node=32 ./osu-micro-benchmarks-5.4.1/mpi/collective/osu_bcast

# OSU MPI Broadcast Latency Test v5.4.1
# Size       Avg Latency(us)
1                       6.94
2                       6.94
4                       6.99
8                       6.91
16                     11.99
32                     12.28
64                     12.45
128                    12.61
256                    12.88
512                    13.52
1024                   15.46
2048                   18.42
4096                   23.35
8192                   32.81
16384                3157.92
32768                3229.47
65536                3411.76
131072               3447.75
262144               3517.37
524288               3783.26
1048576              4297.23
  • mpiexec
mpiexec -machinefile ./machinefile_36_intel -n 1152 ./osu-micro-benchmarks-5.4.1/mpi/collective/osu_bcast

# OSU MPI Broadcast Latency Test v5.4.1
# Size       Avg Latency(us)
1                       5.69
2                       5.63
4                       5.76
8                       5.64
16                      6.54
32                      6.61
64                      6.64
128                    14.39
256                    14.63
512                    15.34
1024                   17.06
2048                   19.73
4096                   25.12
8192                   35.69
16384                1350.87
32768                1357.04
65536                1410.95
131072               3447.24
262144               3508.40
524288               3683.10
1048576              4053.10

Interestingly, the performance changes depending on whether or not I set I_MPI_ADJUST_BCAST=0. The reference guide states ">= 0: The default value of zero selects the reasonable settings". So it is not really a default value, as leaving I_MPI_ADJUST_BCAST unset produces different results. Why is that happening?

And why are there performance differences depending on whether srun or mpiexec is used? For example, with 16-byte messages it is consistently slower with srun than with mpiexec.

Strangely enough, there are also differences between srun and mpiexec when I set I_MPI_ADJUST_BCAST to any other fixed value, e.g., export I_MPI_ADJUST_BCAST=1

Can it really be caused by the different ways of launching the processes, pmi2 (srun) vs. ssh (mpiexec)?
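
One thing I could try, to take the launcher difference out of the equation, is to force both launch paths through the same process-management interface; I am not sure these are the right knobs, so corrections are welcome:

# let mpiexec.hydra bootstrap its proxies via SLURM instead of ssh
export I_MPI_HYDRA_BOOTSTRAP=slurm
mpiexec -machinefile ./machinefile_36_intel -n 1152 ./osu-micro-benchmarks-5.4.1/mpi/collective/osu_bcast

# and point srun's PMI2 support at the system PMI library (the path is a guess for our install)
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so
srun --mpi=pmi2 --cpu-bind=core -N 36 --ntasks-per-node=32 ./osu-micro-benchmarks-5.4.1/mpi/collective/osu_bcast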

Problem 3: The table format of mpitune

I ran a few tests with mpitune to see what values would work best for me. The log gives me the following information:

10'Aug'18 00:00:46     | TASK 40 from 507 : Launch line template with default parameters: /opt/intel/compilers_and_libraries_2018.3.222/linux/mpi/intel64/bin/mpiexec.hydra -perhost 32 -machinefile "/home/hunold/tmp/mpitune_temp/mpituner_qTPWzY_tmp" -n 1152 -env I_MPI_FABRICS ofi "IMB-MPI1" -npmin 1152 -iter 100 -root_shift 0 -sync 1 -imb_barrier 1 -msglen "/home/hunold/tmp/mpitune_temp/mpituner_q_WnMb_tmp" Bcast
10'Aug'18 00:01:01     | TASK 40 from 507 : Report preparation successfully completeded. Printing... 
10'Aug'18 00:01:01     | TASK 40 from 507 : Spreadsheet for I_MPI_ADJUST_BCAST option:
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|              |         |                                             Times for algorithms                                            |              |           |              |
| Message size | Initial |-------------------------------------------------------------------------------------------------------------|   Tuned vs   | Validated | Validated vs |
|   (bytes)    |   time  |  Alg 1  |  Alg 2  |  Alg 3  |  Alg 4  |  Alg 5  |  Alg 6  |  Alg 7  |  Alg 8  |  Alg 9  |  Alg 10 |  Alg 11 | initial (ex) |    time   | initial (ex) |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|            1 |    7.96 |   10.61 | 1310.25 | 1317.49 |    9.26 |   12.50 |   39.31 |  807.26 |    8.78 |   *7.77*|    8.36 |    8.03 |        0.98X |      8.49 |        1.07X |
|            2 |    7.49 |   10.44 | 1308.36 | 1321.40 |    9.38 |   12.39 |   39.37 |  763.48 |    8.85 |    7.72 |    7.98 |   *7.85*|        1.05X |      8.32 |        1.11X |
|            4 |    7.79 |   10.03 | 1308.57 | 1320.85 |    9.59 |   13.13 |   39.39 |  755.16 |    8.79 |    7.93 |    7.87 |   *7.76*|        1.00X |      7.54 |        0.97X |
|            8 |    7.23 |   10.08 | 1314.70 | 1321.07 |    9.00 |   13.07 |   39.82 |  759.66 |    8.54 |    8.07 |    7.88 |   *7.69*|        1.06X |      7.26 |        1.00X |
|           16 |    8.52 |   11.55 | 1317.22 | 1323.71 |   10.54 |   14.85 |   41.07 |  963.03 |    9.75 |   *8.48 |    8.21*|    8.34 |        0.96X |      8.14 |        0.96X |
|           32 |    8.23 |   17.10 | 1312.19 | 1322.06 |   15.17 |   16.18 |   41.80 | 2002.17 |   16.12 |    8.44 |   *7.91*|    8.22 |        0.96X |      8.14 |        0.99X |
|           64 |    8.94 |   13.04 | 1313.15 | 1322.47 |   11.15 |   17.76 |   42.65 | 1116.06 |   10.96 |   *8.31*|    8.85 |    8.23 |        0.93X |      8.38 |        0.94X |
|          128 |   26.71 |   22.49*| 1313.86 | 1326.43 |   28.37 |   33.92 |   58.99 | 1135.23 |   32.15 |  *26.23 |   26.74 |   26.04 |        0.84X |     25.68 |        0.96X |
|          256 |   31.72 |   23.12*| 1319.28 | 1332.02 |  *28.56 |   34.55 |   59.23 | 1147.76 |   32.55 |   26.82 |   26.98 |   26.81 |        0.73X |     23.02 |        0.73X |
|          512 |   28.79 |   23.88*| 1322.55 | 1340.19 |   29.06 |   37.10 |   63.43 | 1183.59 |   32.87 |   27.26 |  *27.72 |   27.94 |        0.83X |     23.66 |        0.82X |
|         1024 |   21.51 |   25.56 | 1335.18 | 1346.72 |   26.30 |   33.11 |   71.77 | 1356.15 |   35.04 |  *21.26*|   21.46 |   21.17 |        0.99X |     21.08 |        0.98X |
|         2048 |   24.12 |   29.25 | 1338.87 | 1350.27 |   29.65 |   35.48 |   72.93 | 1907.35 |   38.60 |  *23.92*|   24.13 |   24.09 |        0.99X |     24.01 |        1.00X |
|         4096 |   29.22 |   37.21 | 1387.39 | 1403.71 |   38.48 |   40.52 |   81.73 | 3033.93 |   49.60 |   27.83 |  *28.53*|   28.75 |        0.98X |     28.02 |        0.96X |
|         8192 |   36.75 |   49.39 | 1384.39 | 1404.95 |   49.85 |   50.34 |   91.12 | 4113.16 |   63.48 |  *37.07*|   36.85 |   36.46 |        1.01X |     37.05 |        1.01X |
|        16384 |  134.19 |   84.40*| 1496.26 | 1494.57 |  156.92 |  158.23 |  194.76 | 5402.68 |  117.79 |  133.16 | *132.46 |  134.09 |        0.63X |     84.83 |        0.63X |
|        32768 |  148.22 |  143.02*| 1886.85 | 1898.17 |  187.09 |  201.65 |  217.07 | 5395.04 |  200.72 |  148.02 |  149.20 | *149.19 |        0.96X |    145.48 |        0.98X |
|        65536 |  176.27 |  269.84 | 1996.62 | 1992.10 |  241.02 |  276.40 |  261.34 | 6700.34 |  391.33 |  175.88 |  175.16 | *174.58*|        0.99X |    175.44 |        1.00X |
|       131072 |  232.73 |  513.06 | 3446.27 | 3443.95 |  357.87 |  398.02 |  345.07 | 5645.96 |  742.91 |  229.82 |  229.54 | *230.49*|        0.99X |    230.70 |        0.99X |
|       262144 |  340.12 | 1014.85 | 3448.24 | 3451.58 |  594.88 |  613.78 |  502.43 | 6764.08 | 1433.04 |  340.40 | *339.58*|  338.81 |        1.00X |    340.36 |        1.00X |
|       524288 |  538.05 | 1977.55 | 3672.74 | 3670.65 | 1056.41 | 1060.79 |  846.06 | 6342.24 | 3007.40 | *534.47*|  533.28 |  542.45 |        0.99X |    537.60 |        1.00X |
|      1048576 | 1042.30 | 4202.97 | 4047.54 | 4066.88 | 2070.31 | 1895.42 | 1373.85 | 7872.20 | 6182.58 | 1049.38 | 1025.09 |*1043.05*|        1.00X |   1041.17 |        1.00X |
|          AVG |     n/a |     n/a |     n/a |     n/a |     n/a |     n/a |     n/a |     n/a |     n/a |     n/a |     n/a |     n/a |        0.95X |       n/a |        0.96X |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
10'Aug'18 00:01:01     | TASK 40 from 507 : Initial settings were : I_MPI_ADJUST_BCAST=11:0-0;9:1-1;11:2-12;9:13-22;10:23-39;9:40-147;4:148-464;10:465-598;9:599-2048;10:2049-4096;9:4097-8192;10:8193-16384;11:16385-131072;10:131073-262144;9:262145-524288;11:524289-3727170;6:0-2147483647
10'Aug'18 00:01:01     | TASK 40 from 507 : Tuned settings are : I_MPI_ADJUST_BCAST=9:1-1;11:2-12;10:13-47;9:48-99;1:100-737;9:738-2048;10:2049-4096;9:4097-9844;1:9845-34761;11:34762-131072;10:131073-262144;9:262145-816658;11:816659-2147483647

The values that are marked in each row are not always the smallest ones. Why is that? And what is the meaning of one asterisk versus two asterisks? What is the "Initial time"? Lastly, the documentation describes an mpitune flag called "-zb" (zero base), saying "Set zero as the base for all options before tuning." Which values are set to zero: only the algorithm ID, or all values (segment sizes, etc.)?

It would be great if someone could help me to solve my three problems.

Thank you very much

-Sascha

Intel(R) MPI Library for Windows* OS, Version 2018 Update 2 Build 20180125 Fails to run in Windows 10


I installed parallel_studio_xe_2018_update2_cluster_edition_setup on Microsoft Windows [Version 10.0.17134.165], and I noticed that MPI fails to run on my laptop. I built the simple MPI test (test.f90) that comes with the suite, and it fails when running in parallel.

These are the relevant environment paths:

where mpiexec
C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2018.2.185\windows\mpi\intel64\bin\mpiexec.exe
C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64_win\mpirt\mpiexec.exe
C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\ia32_win\mpirt\mpiexec.exe

mpiexec --version
Intel(R) MPI Library for Windows* OS, Version 2018 Update 2 Build 20180125
Copyright 2003-2018 Intel Corporation.

This is the error that I receive when I run it

mpiexec -np 2 -localonly test.exe
[unset]: Error reading initack on 620
Error on readline:: No error
[unset]: write_line error; fd=620 buf=:cmd=init pmi_version=1 pmi_subversion=1
:
system msg for write_line failure : No error
[unset]: Unable to write to PMI_fd
[unset]: write_line error; fd=620 buf=:cmd=barrier_in
:
system msg for write_line failure : No error
[unset]: write_line error; fd=620 buf=:cmd=get_ranks2hosts
:
system msg for write_line failure : No error
[unset]: expecting cmd="put_ranks2hosts", got cmd=""
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(805): fail failed
MPID_Init(1743)......: channel initialization failed
MPID_Init(2144)......: PMI_Init returned -1
[unset]: write_line error; fd=620 buf=:cmd=abort exitcode=68204815
:
system msg for write_line failure : No error
[unset]: Error reading initack on 684
Error on readline:: No error
[unset]: write_line error; fd=684 buf=:cmd=init pmi_version=1 pmi_subversion=1
:
system msg for write_line failure : No error
[unset]: Unable to write to PMI_fd
[unset]: write_line error; fd=684 buf=:cmd=barrier_in
:
system msg for write_line failure : No error
[unset]: write_line error; fd=684 buf=:cmd=get_ranks2hosts
:
system msg for write_line failure : No error
[unset]: expecting cmd="put_ranks2hosts", got cmd=""
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(805): fail failed
MPID_Init(1743)......: channel initialization failed
MPID_Init(2144)......: PMI_Init returned -1
[unset]: write_line error; fd=684 buf=:cmd=abort exitcode=68204815
:
system msg for write_line failure : No error

I want to mention that this same test works fine on some other Windows 10 desktop machines, but fails on others.

Could you please let me know if there is a fix for this issue?
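
For completeness, here is what I plan to try next on the failing laptop (the service and -register steps are only what I understand from the documentation, so they may not be relevant):

rem confirm the process manager service is running
hydra_service.exe -status

rem re-enter the account credentials that mpiexec uses
mpiexec -register

rem verbose run to see where the PMI handshake breaks down
mpiexec -verbose -np 2 -localonly test.exe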

Thanks !
 

 

 

Attachment: test.f90 (2.36 KB)

MPI strtok_r () from /lib64/libc.so.6


Hello,

When I use the Intel MPI mpiexec (which mpiexec resolves to /opt/intel/compilers_and_libraries_2017.7.259/linux/mpi/intel64/bin/mpiexec), with the environment set up via /opt/intel/compilers_and_libraries_2017.7.259/linux/bin/compilervars.sh intel64, I get "Program received signal SIGSEGV", a segmentation fault at 0x00007f26188aff62 in strtok_r () from /lib64/libc.so.6.

The program is the hello-world test.c supplied with the Intel installation.

If I use MPICH with test.c (compiled using the Intel mpicc -o test test.c), I get no errors running the test program.

I have disabled the firewall, and by commenting out code I know the program crashes at MPI_Init(&argc, &argv). Looking online, I see comments indicating that strtok_r() cannot handle NULL, but replacing argc and argv with explicit values didn't help.

Can someone point me in the right direction?

I'm using compilers_and_libraries_2017.7.259 running on Suse Linux Enterprise Server 12 SP2.
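
For reference, here is exactly how I set up, build and run the test; the -g flag and the I_MPI_DEBUG setting are just things I added while trying to narrow this down:

# environment and build
source /opt/intel/compilers_and_libraries_2017.7.259/linux/bin/compilervars.sh intel64
mpicc -g -O0 -o test test.c          # the hello-world test.c shipped with the install

# run; this is where the SIGSEGV in strtok_r() appears
I_MPI_DEBUG=6 mpiexec -n 2 ./test    # extra startup diagnostics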

 

 

 

 

 

How to find the firmware version of an Intel QLE7342 HCA

Explicit MPI/OpenMP process pinning with Intel MPI


I have an MPI/OpenMP application. I would like to launch e.g. 23 MPI processes on 2 nodes (12 cores per node) where one process (e.g. rank 14) should have 2 OMP threads while all the others have a single thread. I want the rank 14 process pinned to 2 cores and the other ranks pinned to a single core. The job is run under Torque/PBS. How can I do that?
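
To make the question concrete, below is the kind of launch line I have been experimenting with under Torque. The MPMD (colon-separated) sections give me the per-rank thread counts, but I do not know how to express the per-rank pinning (2 cores for rank 14, 1 core for each of the others), so that part is still missing:

#!/bin/bash
#PBS -l nodes=2:ppn=12
cd $PBS_O_WORKDIR
# ranks 0-13 with 1 thread, rank 14 with 2 threads, ranks 15-22 with 1 thread
mpirun -machinefile $PBS_NODEFILE \
       -n 14 -env OMP_NUM_THREADS 1 ./my_app : \
       -n 1  -env OMP_NUM_THREADS 2 ./my_app : \
       -n 8  -env OMP_NUM_THREADS 1 ./my_app
# open question: which I_MPI_PIN_* settings give rank 14 a 2-core domain
# and every other rank a 1-core domain?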

Lost message with MPI_Improbe


Hi,

The attached code (compiled with mpic++) works as expected with up to 15 MPI processes, but loops forever with 16 or more.

Apparently MPI_Improbe fails to find the message:

 

[alainm@pollux test]$ mpirun --version
Intel(R) MPI Library for Linux* OS, Version 2018 Update 3 Build 20180411 (id: 18329)
Copyright 2003-2018 Intel Corporation.
[alainm@pollux test]$

[alainm@pollux test]$ mpirun -n 15 ./truc
Proc 0 sent msg 2 to Proc 1
Proc 0 got msg 2 from proc 14
Proc 1 sent msg 2 to Proc 2
[...]
Proc 14 sent msg 2 to Proc 0
Proc 14 got msg 2 from proc 13
[alainm@pollux test]$

[alainm@pollux test]$ mpirun -n 16 ./truc | grep "Proc 10"  # same output with other procs
Proc 9 sent msg 2 to Proc 10
Proc 10 sent msg 2 to Proc 11
Proc 10 haven't got msg 2 from proc 9 yet
Proc 10 haven't got msg 2 from proc 9 yet
Proc 10 haven't got msg 2 from proc 9 yet
[...]
Proc 10 haven't got msg 2 from proc 9 yet

[...]

The code works just fine with Open MPI.

Fun fact: running the code with -check_mpi triggers the bug even with 2 processes.

Am I doing something wrong?

Thanks,

Alain

 

 

 

 

Attachment: probe_bug.cpp (1.15 KB)

MPI_Iallgather & MPI_Iallgatherv in Intel MPI


Hi there,

How can I figure out what's the earliest version of Intel MPI that supports some specific MPI functions, such as MPI_Iallgather and MPI_Iallgatherv? I googled a bit but didn't find the answer.

I'm asking because we have an application making use of these two functions and we would like to let the user know what versions of MPI to use.
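
One check I can run myself on each installation we have is to look at the MPI standard level the headers report, since these two calls are MPI-3.0 nonblocking collectives (the paths below are just examples from one of our systems):

# example check on one installation (paths are illustrative)
source /opt/intel/compilers_and_libraries_2017.7.259/linux/mpi/intel64/bin/mpivars.sh
mpirun --version
# MPI_Iallgather / MPI_Iallgatherv are MPI-3.0 nonblocking collectives,
# so the installation's mpi.h should report MPI_VERSION >= 3
grep -E 'define MPI_(SUB)?VERSION' "$I_MPI_ROOT"/intel64/include/mpi.h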

Thanks,

Victor


Trace Analyzer CLI and thread aggregation


Hi,

I want to output a .csv file from a tracefile I've previously generated.

Because I'm especially interested in the timings across all nodes, I want to use the "All_nodes" aggregation.

If I'm not mistaken, I'm supposed to use the -t<ID> option for that. However, I cannot figure out how to get the correct ID. So far, what I've done is look in the GUI and find the ID in the ID column of the "Process Aggregation" window. The ID seems to be (at least) dependent on the number of processes, though.

So, what is the correct way of finding the right value for the -t<ID> option? Note that I want to generate those .csv files automatically and then feed them into an R script, so manually looking up the ID in the GUI isn't an option.

Thanks,

Fabian.

 

2019 beta update 1 - tcp fabric unknown


Using a small cluster of Skylake-based systems, as below:

  • 2019 beta update 1 
  • Red Hat 7.6 beta x86_64 (3.10.0-938.el7.x86_64)
  • Systems include Intel Omni-Path HFAs in addition to an onboard gigabit Ethernet NIC
  • Systems are using the RH7.6 inbox Omni-Path support

Attempting to run the included IMB-MPI1 binary over the OPA HFAs specifying psm2 as the transport appears to work correctly. However, trying to run it across the onboard Ethernet network specifying tcp as the transport generates the below message from MPI startup:

MPI startup(): tcp fabric is unknown or has been removed from the product, please use ofi or shm:ofi instead

The job does execute, but over the OPA fabric instead of the Ethernet network. If the OPA HFA is disconnected, the job fails. 

 fi_info -l

psm2:

    version: 1.6

ofi_rxm:

    version: 1.0

sockets:

    version: 2.0
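
Based on the hint in the startup message and the providers that fi_info lists, this is what I am planning to try in order to force the Ethernet path (I am not sure which of these variables the 2019 beta actually honors):

# try to route the ofi fabric over the sockets provider, i.e. the Ethernet NIC
export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=sockets            # generic libfabric provider selection
# or, possibly, the Intel MPI specific spelling:
# export I_MPI_OFI_PROVIDER=sockets
mpirun -n 64 ./IMB-MPI1 PingPong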

 

 

 

 

 

 

 

Pinning / affinity for hybrid MPI/OpenMP on cluster


Dear Intel HPC community,

I’m trying to optimize the scalability of my hybrid Fortran MPI/OpenMP code, based on PETSc and MUMPS, that I’m running on an SGI ICE-X cluster, and I am seeking advice and good-practice rules for optimal process binding/affinity.

The code executes in a pure MPI fashion until MUMPS is called, which is when hybrid MPI/multithreading is exploited (calls to multithreaded BLAS routines).

The cluster CPU architecture is heterogeneous; compute nodes can be Westmere, Sandy Bridge, Haswell or Broadwell.

I’m compiling the code with the appropriate ‘-x’ architecture options and -openmp, using the Intel Fortran compiler 17.0.6 and the Intel MPI Library 2018.1.163, and linking against the multithreaded Intel MKL 2017.6.

PBSPro is used to launch jobs on the cluster, where I can choose nselect, ncpus, mpiprocs and ompthreads in the job script.

  • For the large matrices I'm dealing with, I have heard it is advisable to have 1 MPI process per socket. Is setting I_MPI_PIN_DOMAIN=socket enough, or do I need to tweak other I_MPI_xxx environment variables? (A sketch of what I currently have in mind follows this list.)
  • Is it acceptable to have idle OpenMP threads before the multithreading calls?
  • How can I ensure that my threads are pinned optimally for my application? Should KMP_AFFINITY and KMP_PLACE_THREADS be set instead of OMP_PROC_BIND and OMP_PLACES? To which setting(s) should those environment variables be set?
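
As mentioned in the first bullet, here is a sketch of the PBS job preamble I currently have in mind; every value in it is a guess, and this is exactly what I would like feedback on:

#!/bin/bash
#PBS -l select=4:ncpus=24:mpiprocs=2:ompthreads=12
export I_MPI_PIN_DOMAIN=socket                  # one MPI rank per socket
export OMP_NUM_THREADS=12                       # threads fill the socket during MUMPS/BLAS
export KMP_AFFINITY=granularity=core,compact    # or OMP_PROC_BIND/OMP_PLACES instead?
mpirun -np 8 ./my_petsc_mumps_app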

I have done a lot of research and reading about affinity, but since most presentations and webpages I found on hybrid MPI/OpenMP setups are neither generic nor clear enough for me, I would rather ask here.

I appreciate your support and am happy to provide additional information.
Thank you in advance!

I_MPI_PIN_DOMAIN example


Hi,

   I am trying to pin 6 ranks on a dual-socket (12 cores per socket) node, with 4 OpenMP threads per MPI rank. I set I_MPI_PIN_DOMAIN=omp:compact, but I get this I_MPI_DEBUG output:

[0] MPI startup(): 0       14563    n2470      {0,1,12,13}
[0] MPI startup(): 1       14564    n2470      {2,3,14,15}
[0] MPI startup(): 2       14565    n2470      {4,5,16,17}
[0] MPI startup(): 3       14566    n2470      {6,7,18,19}
[0] MPI startup(): 4       14567    n2470      {8,9,20,21}
[0] MPI startup(): 5       14568    n2470      {10,11,22,23}

 

I would have expected

[0] MPI startup(): 0       14563    n2470      {0,1,2,3}
[0] MPI startup(): 1       14564    n2470      {4,5,6,7}
[0] MPI startup(): 2       14565    n2470      {8,9,10,11}
[0] MPI startup(): 3       14566    n2470      {12,13,14,15}
[0] MPI startup(): 4       14567    n2470      {16,17,18,19}
[0] MPI startup(): 5       14568    n2470      {20,21,22,23}

This is Intel Parallel Studio Cluster Edition 2017 Update 5. Am I setting something incorrectly, or is this behaviour intended to protect me from hurting performance?
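
For reference, this is everything I set for the run (a minimal sketch; the binary name is a placeholder), in case something else in the environment matters:

export OMP_NUM_THREADS=4
export I_MPI_PIN_DOMAIN=omp:compact
export I_MPI_DEBUG=4                  # produces the pinning map shown above
mpirun -n 6 ./my_hybrid_app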

Thanks!  - Chris

 

Announcing Intel® Parallel Studio XE 2019 Release


Just Released Intel® Parallel Studio XE 2019! Accelerate Parallel Code—Transform Enterprise to Cloud, & HPC to AI

Boost your parallel application performance on the latest Intel® processors with this new, comprehensive suite of development tools. It includes new high-performance Python* capabilities to speed machine learning, a rapid visual prototyping environment to help you visualize parallelism, a simplified profiling workflow with a new platform profiler, capabilities that extend HPC solutions on the path to exascale, and more.
