Channel: Clusters and HPC Technology

I_MPI_ADJUST_BCAST: srun vs. mpiexec, performance differences

Hi,

I have been testing the I_MPI_ADJUST_BCAST settings on our Omni-Path cluster (36 nodes), using the OSU micro-benchmarks for comparison.

1 mpiexec vs. srun when I_MPI_ADJUST_BCAST is unset

First, I get a SLURM allocation like this: salloc -N 36 --cpu-freq=HIGH -t 10:00:00 (the processors are now all running at the maximum frequency).

I either leave I_MPI_ADJUST_BCAST unset or explicitly unset it:

unset I_MPI_ADJUST_BCAST

With srun I get the following:

srun --mpi=pmi2 --cpu-bind=core -N 36 --ntasks-per-node=32 ./osu-micro-benchmarks-5.4.1/mpi/collective/osu_bcast

# OSU MPI Broadcast Latency Test v5.4.1
# Size       Avg Latency(us)
1                     417.27
2                     418.78
4                     417.23
8                     417.51
16                    420.89
32                    463.69
64                    553.98
128                   569.21
256                   606.14
512                   672.11
1024                  848.40
2048                 1009.13
4096                 1360.44
8192                 2081.46
16384                2907.58
32768                3046.97
65536                3184.76
131072               3416.80
262144               3873.73
524288               4500.13
1048576              5556.90

However, the same benchmark launched with mpiexec (same allocation) gives me these values:

mpiexec -machinefile ./machinefile_36_intel -n 1152 ./osu-micro-benchmarks-5.4.1/mpi/collective/osu_bcast

# OSU MPI Broadcast Latency Test v5.4.1
# Size       Avg Latency(us)
1                       4.92
2                       4.94
4                       4.85
8                       5.01
16                      5.56
32                      5.49
64                      5.57
128                    15.20
256                    13.80
512                    14.22
1024                   15.55
2048                   17.12
4096                   20.43
8192                   27.16
16384                 131.68
32768                 143.90
65536                 170.50
131072                222.50
262144                328.82
524288                513.49
1048576               998.40

I therefore assume that the Intel MPI library applies some tuned settings when launched with mpiexec, although I could not find any in the etc directories. In any case, the algorithm selection strategy seems to differ significantly between the two launchers. Why?
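To put a number on the gap, the two outputs can be compared size by size. The following is only a minimal sketch: the function names parse_osu and slowdown are made up for illustration, and the sample strings contain just three rows copied from the two runs above.

```python
import re

def parse_osu(text):
    """Return {message_size: avg_latency_us} from osu_bcast output."""
    result = {}
    for line in text.splitlines():
        # Data lines look like "<size> <latency>"; header lines start with '#'.
        m = re.match(r"^\s*(\d+)\s+([\d.]+)\s*$", line)
        if m:
            result[int(m.group(1))] = float(m.group(2))
    return result

def slowdown(slow, fast):
    """Per-size latency ratio slow/fast for sizes present in both runs."""
    return {size: slow[size] / fast[size] for size in sorted(slow) if size in fast}

# Three rows copied from the two runs above (I_MPI_ADJUST_BCAST unset).
SRUN = """# OSU MPI Broadcast Latency Test v5.4.1
# Size       Avg Latency(us)
1                     417.27
8192                 2081.46
1048576              5556.90
"""
MPIEXEC = """# OSU MPI Broadcast Latency Test v5.4.1
# Size       Avg Latency(us)
1                       4.92
8192                   27.16
1048576               998.40
"""

if __name__ == "__main__":
    ratios = slowdown(parse_osu(SRUN), parse_osu(MPIEXEC))
    for size, ratio in ratios.items():
        print(f"{size:>8} bytes: srun is {ratio:.1f}x slower")
```

For the 1-byte broadcast this gives a factor of roughly 85, while for 1 MiB it drops to between 5 and 6, so the gap is largest for small messages.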

2 Differences with I_MPI_ADJUST_BCAST=0

Similarly, srun and mpiexec also behave differently when I set I_MPI_ADJUST_BCAST=0:

export I_MPI_ADJUST_BCAST=0
  • srun
srun --mpi=pmi2 --cpu-bind=core -N 36 --ntasks-per-node=32 ./osu-micro-benchmarks-5.4.1/mpi/collective/osu_bcast

# OSU MPI Broadcast Latency Test v5.4.1
# Size       Avg Latency(us)
1                       6.94
2                       6.94
4                       6.99
8                       6.91
16                     11.99
32                     12.28
64                     12.45
128                    12.61
256                    12.88
512                    13.52
1024                   15.46
2048                   18.42
4096                   23.35
8192                   32.81
16384                3157.92
32768                3229.47
65536                3411.76
131072               3447.75
262144               3517.37
524288               3783.26
1048576              4297.23
  • mpiexec
mpiexec -machinefile ./machinefile_36_intel -n 1152 ./osu-micro-benchmarks-5.4.1/mpi/collective/osu_bcast

# OSU MPI Broadcast Latency Test v5.4.1
# Size       Avg Latency(us)
1                       5.69
2                       5.63
4                       5.76
8                       5.64
16                      6.54
32                      6.61
64                      6.64
128                    14.39
256                    14.63
512                    15.34
1024                   17.06
2048                   19.73
4096                   25.12
8192                   35.69
16384                1350.87
32768                1357.04
65536                1410.95
131072               3447.24
262144               3508.40
524288               3683.10
1048576              4053.10

Interestingly, the performance changes depending on whether I set I_MPI_ADJUST_BCAST=0 or leave the variable unset. The reference guide states "(>= 0 The default value of zero selects the reasonable settings)". So zero is not really the default, since "unset I_MPI_ADJUST_BCAST" produces different timings. Why is that happening?

And why does the performance depend on whether I use srun or mpiexec? For example, the 16-byte broadcast is consistently slower with srun than with mpiexec (11.99 us vs. 6.54 us in the runs above).

Strangely enough, there are also differences between srun and mpiexec when I set I_MPI_ADJUST_BCAST to any other fixed value, e.g., export I_MPI_ADJUST_BCAST=1

Can it really be caused by the different ways of launching the processes, pmi2 (srun) vs. ssh (mpiexec)?
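One way to narrow this down is to compare the environment that each launcher hands to the ranks. The sketch below is hypothetical: the helper name launcher_env and the choice of prefixes (I_MPI_, SLURM_, PMI_) are assumptions about which variables are worth looking at, and it says nothing about rank placement or fabric selection.

```python
import os

# Prefixes assumed to be relevant here; extend the tuple as needed.
PREFIXES = ("I_MPI_", "SLURM_", "PMI_")

def launcher_env(environ=None, prefixes=PREFIXES):
    """Return the sorted (name, value) pairs whose name starts with a prefix."""
    environ = os.environ if environ is None else environ
    return sorted((k, v) for k, v in environ.items() if k.startswith(prefixes))

if __name__ == "__main__":
    # Print one NAME=value pair per line so two runs can be diffed.
    for name, value in launcher_env():
        print(f"{name}={value}")
```

Saved under a name such as dump_env.py (hypothetical), the script could be started as a single task once with srun and once with mpiexec inside the same allocation, and the two outputs diffed.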

3 The table format of mpitune

I ran a few tests with mpitune to see what values would work best for me. The log gives me the following information:

10'Aug'18 00:00:46     | TASK 40 from 507 : Launch line template with default parameters: /opt/intel/compilers_and_libraries_2018.3.222/linux/mpi/intel64/bin/mpiexec.hydra   -perhost 32 -machinefile "/home/hunold/tmp/mpitune_temp/mpituner_qTPWzY_tmp"  -n 1152 -env I_MPI_FABRICS ofi  "IMB-MPI1" -npmin 1152 -iter 100 -root_shift 0 -sync 1 -imb_barrier 1 -msglen "/home/hunold/tmp/mpitune_temp/mpituner_q_WnMb_tmp" Bcast
10'Aug'18 00:01:01     | TASK 40 from 507 : Report preparation successfully completeded. Printing... 
10'Aug'18 00:01:01     | TASK 40 from 507 : Spreadsheet for I_MPI_ADJUST_BCAST option:
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|              |         |                                             Times for algorithms                                            |              |           |              |
| Message size | Initial |-------------------------------------------------------------------------------------------------------------|   Tuned vs   | Validated | Validated vs |
|   (bytes)    |   time  |  Alg 1  |  Alg 2  |  Alg 3  |  Alg 4  |  Alg 5  |  Alg 6  |  Alg 7  |  Alg 8  |  Alg 9  |  Alg 10 |  Alg 11 | initial (ex) |    time   | initial (ex) |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|            1 |    7.96 |   10.61 | 1310.25 | 1317.49 |    9.26 |   12.50 |   39.31 |  807.26 |    8.78 |   *7.77*|    8.36 |    8.03 |        0.98X |      8.49 |        1.07X |
|            2 |    7.49 |   10.44 | 1308.36 | 1321.40 |    9.38 |   12.39 |   39.37 |  763.48 |    8.85 |    7.72 |    7.98 |   *7.85*|        1.05X |      8.32 |        1.11X |
|            4 |    7.79 |   10.03 | 1308.57 | 1320.85 |    9.59 |   13.13 |   39.39 |  755.16 |    8.79 |    7.93 |    7.87 |   *7.76*|        1.00X |      7.54 |        0.97X |
|            8 |    7.23 |   10.08 | 1314.70 | 1321.07 |    9.00 |   13.07 |   39.82 |  759.66 |    8.54 |    8.07 |    7.88 |   *7.69*|        1.06X |      7.26 |        1.00X |
|           16 |    8.52 |   11.55 | 1317.22 | 1323.71 |   10.54 |   14.85 |   41.07 |  963.03 |    9.75 |   *8.48 |    8.21*|    8.34 |        0.96X |      8.14 |        0.96X |
|           32 |    8.23 |   17.10 | 1312.19 | 1322.06 |   15.17 |   16.18 |   41.80 | 2002.17 |   16.12 |    8.44 |   *7.91*|    8.22 |        0.96X |      8.14 |        0.99X |
|           64 |    8.94 |   13.04 | 1313.15 | 1322.47 |   11.15 |   17.76 |   42.65 | 1116.06 |   10.96 |   *8.31*|    8.85 |    8.23 |        0.93X |      8.38 |        0.94X |
|          128 |   26.71 |   22.49*| 1313.86 | 1326.43 |   28.37 |   33.92 |   58.99 | 1135.23 |   32.15 |  *26.23 |   26.74 |   26.04 |        0.84X |     25.68 |        0.96X |
|          256 |   31.72 |   23.12*| 1319.28 | 1332.02 |  *28.56 |   34.55 |   59.23 | 1147.76 |   32.55 |   26.82 |   26.98 |   26.81 |        0.73X |     23.02 |        0.73X |
|          512 |   28.79 |   23.88*| 1322.55 | 1340.19 |   29.06 |   37.10 |   63.43 | 1183.59 |   32.87 |   27.26 |  *27.72 |   27.94 |        0.83X |     23.66 |        0.82X |
|         1024 |   21.51 |   25.56 | 1335.18 | 1346.72 |   26.30 |   33.11 |   71.77 | 1356.15 |   35.04 |  *21.26*|   21.46 |   21.17 |        0.99X |     21.08 |        0.98X |
|         2048 |   24.12 |   29.25 | 1338.87 | 1350.27 |   29.65 |   35.48 |   72.93 | 1907.35 |   38.60 |  *23.92*|   24.13 |   24.09 |        0.99X |     24.01 |        1.00X |
|         4096 |   29.22 |   37.21 | 1387.39 | 1403.71 |   38.48 |   40.52 |   81.73 | 3033.93 |   49.60 |   27.83 |  *28.53*|   28.75 |        0.98X |     28.02 |        0.96X |
|         8192 |   36.75 |   49.39 | 1384.39 | 1404.95 |   49.85 |   50.34 |   91.12 | 4113.16 |   63.48 |  *37.07*|   36.85 |   36.46 |        1.01X |     37.05 |        1.01X |
|        16384 |  134.19 |   84.40*| 1496.26 | 1494.57 |  156.92 |  158.23 |  194.76 | 5402.68 |  117.79 |  133.16 | *132.46 |  134.09 |        0.63X |     84.83 |        0.63X |
|        32768 |  148.22 |  143.02*| 1886.85 | 1898.17 |  187.09 |  201.65 |  217.07 | 5395.04 |  200.72 |  148.02 |  149.20 | *149.19 |        0.96X |    145.48 |        0.98X |
|        65536 |  176.27 |  269.84 | 1996.62 | 1992.10 |  241.02 |  276.40 |  261.34 | 6700.34 |  391.33 |  175.88 |  175.16 | *174.58*|        0.99X |    175.44 |        1.00X |
|       131072 |  232.73 |  513.06 | 3446.27 | 3443.95 |  357.87 |  398.02 |  345.07 | 5645.96 |  742.91 |  229.82 |  229.54 | *230.49*|        0.99X |    230.70 |        0.99X |
|       262144 |  340.12 | 1014.85 | 3448.24 | 3451.58 |  594.88 |  613.78 |  502.43 | 6764.08 | 1433.04 |  340.40 | *339.58*|  338.81 |        1.00X |    340.36 |        1.00X |
|       524288 |  538.05 | 1977.55 | 3672.74 | 3670.65 | 1056.41 | 1060.79 |  846.06 | 6342.24 | 3007.40 | *534.47*|  533.28 |  542.45 |        0.99X |    537.60 |        1.00X |
|      1048576 | 1042.30 | 4202.97 | 4047.54 | 4066.88 | 2070.31 | 1895.42 | 1373.85 | 7872.20 | 6182.58 | 1049.38 | 1025.09 |*1043.05*|        1.00X |   1041.17 |        1.00X |
|          AVG |     n/a |     n/a |     n/a |     n/a |     n/a |     n/a |     n/a |     n/a |     n/a |     n/a |     n/a |     n/a |        0.95X |       n/a |        0.96X |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
10'Aug'18 00:01:01     | TASK 40 from 507 : Initial settings were : I_MPI_ADJUST_BCAST=11:0-0;9:1-1;11:2-12;9:13-22;10:23-39;9:40-147;4:148-464;10:465-598;9:599-2048;10:2049-4096;9:4097-8192;10:8193-16384;11:16385-131072;10:131073-262144;9:262145-524288;11:524289-3727170;6:0-2147483647
10'Aug'18 00:01:01     | TASK 40 from 507 : Tuned settings are : I_MPI_ADJUST_BCAST=9:1-1;11:2-12;10:13-47;9:48-99;1:100-737;9:738-2048;10:2049-4096;9:4097-9844;1:9845-34761;11:34762-131072;10:131073-262144;9:262145-816658;11:816659-2147483647
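
The settings strings in the log have the form <algorithm>:<low>-<high>, with entries separated by semicolons. To check which algorithm a given message size maps to, a small parser helps. This is a sketch only: parse_adjust and algorithm_for are illustrative names, only the simple range form seen in this log is handled, and "first matching range wins" is an assumption rather than documented behaviour.

```python
def parse_adjust(setting):
    """Parse '<alg>:<lo>-<hi>;...' into a list of (alg, lo, hi) tuples."""
    value = setting.split("=", 1)[-1]  # accept 'NAME=value' or just 'value'
    rules = []
    for part in value.split(";"):
        alg, _, size_range = part.partition(":")
        lo, _, hi = size_range.partition("-")
        rules.append((int(alg), int(lo), int(hi)))
    return rules

def algorithm_for(rules, size):
    """Return the algorithm of the first range containing size, else None.

    Assumption: the first matching range wins.
    """
    for alg, lo, hi in rules:
        if lo <= size <= hi:
            return alg
    return None

# The "Tuned settings" string copied from the log above.
TUNED = ("I_MPI_ADJUST_BCAST=9:1-1;11:2-12;10:13-47;9:48-99;1:100-737;"
         "9:738-2048;10:2049-4096;9:4097-9844;1:9845-34761;11:34762-131072;"
         "10:131073-262144;9:262145-816658;11:816659-2147483647")

if __name__ == "__main__":
    rules = parse_adjust(TUNED)
    for size in (16, 128, 16384, 1048576):
        print(size, "->", algorithm_for(rules, size))
```

With the tuned string, 128 bytes and 16384 bytes both fall into ranges assigned to algorithm 1, which is also the fastest algorithm in those two rows of the spreadsheet.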

The marked values in each row are not always the smallest ones. Why is that? What do one asterisk and two asterisks mean, and what is the "Initial time"? Lastly, the documentation describes an mpitune flag "-zb" (zero based) as "Set zero as the base for all options before tuning." Which values are set to zero: only the algorithm ID, or all values (segment sizes, etc.)?
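
To list the rows where the marked entry is not the row minimum, the spreadsheet can be parsed with the asterisks stripped. A minimal sketch, assuming the column layout shown above (message size, initial time, then Alg 1 to Alg 11); parse_row and fastest are illustrative names, not part of mpitune.

```python
def parse_row(row):
    """Split one mpitune spreadsheet row into (size, [times for Alg 1..11])."""
    cells = [c.strip() for c in row.strip().strip("|").split("|")]
    size = int(cells[0])
    # cells[1] is the "Initial time"; the next 11 cells are Alg 1..Alg 11.
    times = [float(c.replace("*", "")) for c in cells[2:13]]
    return size, times

def fastest(times):
    """Return (algorithm_id, time) of the smallest entry; ids start at 1."""
    best = min(range(len(times)), key=times.__getitem__)
    return best + 1, times[best]

# The 128-byte row copied from the spreadsheet above.
ROW_128 = ("|          128 |   26.71 |   22.49*| 1313.86 | 1326.43 |   28.37 |"
           "   33.92 |   58.99 | 1135.23 |   32.15 |  *26.23 |   26.74 |"
           "   26.04 |        0.84X |     25.68 |        0.96X |")

if __name__ == "__main__":
    size, times = parse_row(ROW_128)
    alg, t = fastest(times)
    print(f"{size} bytes: fastest is Alg {alg} ({t} us)")
```

For the 128-byte row the minimum is Alg 1 at 22.49, which agrees with the 1:100-737 range in the tuned settings even though the asterisks in that row are placed around other columns as well.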

It would be great if someone could help me to solve my three problems.

Thank you very much

-Sascha

