Hi,
I have been testing the I_MPI_ADJUST_BCAST settings on our OmniPath cluster (36 nodes), using the OSU micro-benchmarks for comparison.
1 mpiexec vs. srun when I_MPI_ADJUST_BCAST is unset
First, I get a SLURM allocation like this: salloc -N 36 --cpu-freq=HIGH -t 10:00:00 (The processors are now all running at the maximum frequency.)
I leave I_MPI_ADJUST_BCAST unset, or unset it explicitly:
unset I_MPI_ADJUST_BCAST
With srun I get this:
srun --mpi=pmi2 --cpu-bind=core -N 36 --ntasks-per-node=32 ./osu-micro-benchmarks-5.4.1/mpi/collective/osu_bcast
# OSU MPI Broadcast Latency Test v5.4.1
# Size       Avg Latency(us)
1                     417.27
2                     418.78
4                     417.23
8                     417.51
16                    420.89
32                    463.69
64                    553.98
128                   569.21
256                   606.14
512                   672.11
1024                  848.40
2048                 1009.13
4096                 1360.44
8192                 2081.46
16384                2907.58
32768                3046.97
65536                3184.76
131072               3416.80
262144               3873.73
524288               4500.13
1048576              5556.90
However, the same benchmark launched with mpiexec gives me these values (same allocation):
mpiexec -machinefile ./machinefile_36_intel -n 1152 ./osu-micro-benchmarks-5.4.1/mpi/collective/osu_bcast
# OSU MPI Broadcast Latency Test v5.4.1
# Size       Avg Latency(us)
1                       4.92
2                       4.94
4                       4.85
8                       5.01
16                      5.56
32                      5.49
64                      5.57
128                    15.20
256                    13.80
512                    14.22
1024                   15.55
2048                   17.12
4096                   20.43
8192                   27.16
16384                 131.68
32768                 143.90
65536                 170.50
131072                222.50
262144                328.82
524288                513.49
1048576               998.40
Thus, I assume that the Intel MPI library uses some optimized values for mpiexec (although there are none in the etc directories). In any case, the algorithm selection strategy seems to differ significantly between the two launchers. Why?
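To put a number on the gap, here is a minimal Python sketch that parses two osu_bcast outputs and prints the srun/mpiexec latency ratio per message size. The helper name parse_osu is made up for this illustration, and only four rows from each of the runs above are pasted in to keep it short. For these rows, srun comes out roughly 85x slower at 1 byte and still about 5.6x slower at 1 MiB.

```python
def parse_osu(text):
    """Return {message_size_bytes: avg_latency_us} from osu_bcast output."""
    results = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly two fields: size and latency.
        # Header lines start with '#' and are skipped by the isdigit() check.
        if len(fields) == 2 and fields[0].isdigit():
            results[int(fields[0])] = float(fields[1])
    return results


# Excerpts from the two runs with I_MPI_ADJUST_BCAST unset.
srun_out = """\
# OSU MPI Broadcast Latency Test v5.4.1
# Size       Avg Latency(us)
1                     417.27
16                    420.89
8192                 2081.46
1048576              5556.90
"""

mpiexec_out = """\
# OSU MPI Broadcast Latency Test v5.4.1
# Size       Avg Latency(us)
1                       4.92
16                      5.56
8192                   27.16
1048576               998.40
"""

srun = parse_osu(srun_out)
mpiexec = parse_osu(mpiexec_out)

# Ratio > 1 means srun is slower than mpiexec for that message size.
ratios = {
    size: srun[size] / mpiexec[size]
    for size in sorted(srun.keys() & mpiexec.keys())
}

for size, ratio in ratios.items():
    print(f"{size:>8} bytes: srun/mpiexec = {ratio:.1f}x")
```

The same function can be fed the complete outputs to get the ratio for every message size.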
2 Differences with I_MPI_ADJUST_BCAST=0
Similarly, there are also differences between srun and mpiexec when setting I_MPI_ADJUST_BCAST=0:
export I_MPI_ADJUST_BCAST=0
- srun
srun --mpi=pmi2 --cpu-bind=core -N 36 --ntasks-per-node=32 ./osu-micro-benchmarks-5.4.1/mpi/collective/osu_bcast
# OSU MPI Broadcast Latency Test v5.4.1
# Size       Avg Latency(us)
1                       6.94
2                       6.94
4                       6.99
8                       6.91
16                     11.99
32                     12.28
64                     12.45
128                    12.61
256                    12.88
512                    13.52
1024                   15.46
2048                   18.42
4096                   23.35
8192                   32.81
16384                3157.92
32768                3229.47
65536                3411.76
131072               3447.75
262144               3517.37
524288               3783.26
1048576              4297.23
- mpiexec
mpiexec -machinefile ./machinefile_36_intel -n 1152 ./osu-micro-benchmarks-5.4.1/mpi/collective/osu_bcast
# OSU MPI Broadcast Latency Test v5.4.1
# Size       Avg Latency(us)
1                       5.69
2                       5.63
4                       5.76
8                       5.64
16                      6.54
32                      6.61
64                      6.64
128                    14.39
256                    14.63
512                    15.34
1024                   17.06
2048                   19.73
4096                   25.12
8192                   35.69
16384                1350.87
32768                1357.04
65536                1410.95
131072               3447.24
262144               3508.40
524288               3683.10
1048576              4053.10
Interestingly, the performance differs depending on whether I set I_MPI_ADJUST_BCAST=0 or leave the variable unset. The reference guide states that "(>= 0 The default value of zero selects the reasonable settings)". So zero does not really behave like a default value, since "unset I_MPI_ADJUST_BCAST" produces different results. Why is that happening?
And why are there performance differences depending on the use of srun or mpiexec? For example, with 16 bytes, srun is consistently slower than mpiexec.
Strangely enough, there are also differences between srun and mpiexec when I set I_MPI_ADJUST_BCAST to any other fixed value, e.g., export I_MPI_ADJUST_BCAST=1
Can it really be caused by the different ways of launching the processes, pmi2 (srun) vs. ssh (mpiexec)?
3 The table format of mpitune
I ran a few tests with mpitune to see what values would work best for me. The log gives me the following information:
10'Aug'18 00:00:46 | TASK 40 from 507 : Launch line template with default parameters: /opt/intel/compilers_and_libraries_2018.3.222/linux/mpi/intel64/bin/mpiexec.hydra -perhost 32 -machinefile "/home/hunold/tmp/mpitune_temp/mpituner_qTPWzY_tmp" -n 1152 -env I_MPI_FABRICS ofi "IMB-MPI1" -npmin 1152 -iter 100 -root_shift 0 -sync 1 -imb_barrier 1 -msglen "/home/hunold/tmp/mpitune_temp/mpituner_q_WnMb_tmp" Bcast
10'Aug'18 00:01:01 | TASK 40 from 507 : Report preparation successfully completeded. Printing...
10'Aug'18 00:01:01 | TASK 40 from 507 : Spreadsheet for I_MPI_ADJUST_BCAST option:
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Message size | Initial  | Times for algorithms                                                                                                              | Tuned vs     | Validated | Validated vs |
| (bytes)      | time     |     Alg 1 |     Alg 2 |     Alg 3 |     Alg 4 |     Alg 5 |     Alg 6 |     Alg 7 |     Alg 8 |     Alg 9 |    Alg 10 |    Alg 11 | initial (ex) | time      | initial (ex) |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|            1 |     7.96 |     10.61 |   1310.25 |   1317.49 |      9.26 |     12.50 |     39.31 |    807.26 |      8.78 |    *7.77* |      8.36 |      8.03 |        0.98X |      8.49 |        1.07X |
|            2 |     7.49 |     10.44 |   1308.36 |   1321.40 |      9.38 |     12.39 |     39.37 |    763.48 |      8.85 |      7.72 |      7.98 |    *7.85* |        1.05X |      8.32 |        1.11X |
|            4 |     7.79 |     10.03 |   1308.57 |   1320.85 |      9.59 |     13.13 |     39.39 |    755.16 |      8.79 |      7.93 |      7.87 |    *7.76* |        1.00X |      7.54 |        0.97X |
|            8 |     7.23 |     10.08 |   1314.70 |   1321.07 |      9.00 |     13.07 |     39.82 |    759.66 |      8.54 |      8.07 |      7.88 |    *7.69* |        1.06X |      7.26 |        1.00X |
|           16 |     8.52 |     11.55 |   1317.22 |   1323.71 |     10.54 |     14.85 |     41.07 |    963.03 |      9.75 |     *8.48 |     8.21* |      8.34 |        0.96X |      8.14 |        0.96X |
|           32 |     8.23 |     17.10 |   1312.19 |   1322.06 |     15.17 |     16.18 |     41.80 |   2002.17 |     16.12 |      8.44 |    *7.91* |      8.22 |        0.96X |      8.14 |        0.99X |
|           64 |     8.94 |     13.04 |   1313.15 |   1322.47 |     11.15 |     17.76 |     42.65 |   1116.06 |     10.96 |    *8.31* |      8.85 |      8.23 |        0.93X |      8.38 |        0.94X |
|          128 |    26.71 |    22.49* |   1313.86 |   1326.43 |     28.37 |     33.92 |     58.99 |   1135.23 |     32.15 |    *26.23 |     26.74 |     26.04 |        0.84X |     25.68 |        0.96X |
|          256 |    31.72 |    23.12* |   1319.28 |   1332.02 |    *28.56 |     34.55 |     59.23 |   1147.76 |     32.55 |     26.82 |     26.98 |     26.81 |        0.73X |     23.02 |        0.73X |
|          512 |    28.79 |    23.88* |   1322.55 |   1340.19 |     29.06 |     37.10 |     63.43 |   1183.59 |     32.87 |     27.26 |    *27.72 |     27.94 |        0.83X |     23.66 |        0.82X |
|         1024 |    21.51 |     25.56 |   1335.18 |   1346.72 |     26.30 |     33.11 |     71.77 |   1356.15 |     35.04 |   *21.26* |     21.46 |     21.17 |        0.99X |     21.08 |        0.98X |
|         2048 |    24.12 |     29.25 |   1338.87 |   1350.27 |     29.65 |     35.48 |     72.93 |   1907.35 |     38.60 |   *23.92* |     24.13 |     24.09 |        0.99X |     24.01 |        1.00X |
|         4096 |    29.22 |     37.21 |   1387.39 |   1403.71 |     38.48 |     40.52 |     81.73 |   3033.93 |     49.60 |     27.83 |   *28.53* |     28.75 |        0.98X |     28.02 |        0.96X |
|         8192 |    36.75 |     49.39 |   1384.39 |   1404.95 |     49.85 |     50.34 |     91.12 |   4113.16 |     63.48 |   *37.07* |     36.85 |     36.46 |        1.01X |     37.05 |        1.01X |
|        16384 |   134.19 |    84.40* |   1496.26 |   1494.57 |    156.92 |    158.23 |    194.76 |   5402.68 |    117.79 |    133.16 |   *132.46 |    134.09 |        0.63X |     84.83 |        0.63X |
|        32768 |   148.22 |   143.02* |   1886.85 |   1898.17 |    187.09 |    201.65 |    217.07 |   5395.04 |    200.72 |    148.02 |    149.20 |   *149.19 |        0.96X |    145.48 |        0.98X |
|        65536 |   176.27 |    269.84 |   1996.62 |   1992.10 |    241.02 |    276.40 |    261.34 |   6700.34 |    391.33 |    175.88 |    175.16 |  *174.58* |        0.99X |    175.44 |        1.00X |
|       131072 |   232.73 |    513.06 |   3446.27 |   3443.95 |    357.87 |    398.02 |    345.07 |   5645.96 |    742.91 |    229.82 |    229.54 |  *230.49* |        0.99X |    230.70 |        0.99X |
|       262144 |   340.12 |   1014.85 |   3448.24 |   3451.58 |    594.88 |    613.78 |    502.43 |   6764.08 |   1433.04 |    340.40 |  *339.58* |    338.81 |        1.00X |    340.36 |        1.00X |
|       524288 |   538.05 |   1977.55 |   3672.74 |   3670.65 |   1056.41 |   1060.79 |    846.06 |   6342.24 |   3007.40 |  *534.47* |    533.28 |    542.45 |        0.99X |    537.60 |        1.00X |
|      1048576 |  1042.30 |   4202.97 |   4047.54 |   4066.88 |   2070.31 |   1895.42 |   1373.85 |   7872.20 |   6182.58 |   1049.38 |   1025.09 | *1043.05* |        1.00X |   1041.17 |        1.00X |
|          AVG |      n/a |       n/a |       n/a |       n/a |       n/a |       n/a |       n/a |       n/a |       n/a |       n/a |       n/a |       n/a |        0.95X |       n/a |        0.96X |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
10'Aug'18 00:01:01 | TASK 40 from 507 : Initial settings were : I_MPI_ADJUST_BCAST=11:0-0;9:1-1;11:2-12;9:13-22;10:23-39;9:40-147;4:148-464;10:465-598;9:599-2048;10:2049-4096;9:4097-8192;10:8193-16384;11:16385-131072;10:131073-262144;9:262145-524288;11:524289-3727170;6:0-2147483647
10'Aug'18 00:01:01 | TASK 40 from 507 : Tuned settings are : I_MPI_ADJUST_BCAST=9:1-1;11:2-12;10:13-47;9:48-99;1:100-737;9:738-2048;10:2049-4096;9:4097-9844;1:9845-34761;11:34762-131072;10:131073-262144;9:262145-816658;11:816659-2147483647
The values that are marked in each row are not always the smallest values in that row. Why is that so? And what is the meaning of one asterisk versus two asterisks? What is the "Initial time"? Lastly, the documentation describes an mpitune flag called "-zb" (zero based), which is said to "Set zero as the base for all options before tuning." Which values are set to zero? Does that apply only to the algorithm ID or to all values (segment sizes, etc.)?
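One way to investigate the asterisks is to look up which algorithm the "Initial settings" and "Tuned settings" strings from the log select for a given message size. The Python sketch below does that. The function names are made up for this illustration, and the rule that the first matching range wins is an assumption on my part (it is what makes the trailing catch-all 6:0-2147483647 harmless), not something taken from the documentation.

```python
def parse_adjust(setting):
    """Parse 'NAME=alg:lo-hi;alg:lo-hi;...' into a list of (alg, lo, hi)."""
    value = setting.split("=", 1)[-1]  # drop the variable name if present
    rules = []
    for item in value.split(";"):
        alg, size_range = item.split(":")
        lo, hi = size_range.split("-")
        rules.append((int(alg), int(lo), int(hi)))
    return rules


def algorithm_for(rules, size):
    """Return the algorithm ID whose byte range contains size.

    Assumption: ranges are checked in order and the first match wins.
    """
    for alg, lo, hi in rules:
        if lo <= size <= hi:
            return alg
    return None


# Copied verbatim from the mpitune log.
INITIAL = ("I_MPI_ADJUST_BCAST=11:0-0;9:1-1;11:2-12;9:13-22;10:23-39;9:40-147;"
           "4:148-464;10:465-598;9:599-2048;10:2049-4096;9:4097-8192;"
           "10:8193-16384;11:16385-131072;10:131073-262144;9:262145-524288;"
           "11:524289-3727170;6:0-2147483647")
TUNED = ("I_MPI_ADJUST_BCAST=9:1-1;11:2-12;10:13-47;9:48-99;1:100-737;"
         "9:738-2048;10:2049-4096;9:4097-9844;1:9845-34761;11:34762-131072;"
         "10:131073-262144;9:262145-816658;11:816659-2147483647")

initial_rules = parse_adjust(INITIAL)
tuned_rules = parse_adjust(TUNED)

for size in (1, 16, 128, 256, 512, 16384, 32768, 1048576):
    print(f"{size:>8} bytes: initial Alg {algorithm_for(initial_rules, size)}, "
          f"tuned Alg {algorithm_for(tuned_rules, size)}")
```

Under that assumption, the lookup lines up with the markers in every row of this table: the opening asterisk sits on the algorithm chosen by the initial settings and the closing asterisk on the algorithm chosen by the tuned settings (for 16 bytes, Alg 9 and Alg 10; for 128 bytes, Alg 9 and Alg 1). A cell with both asterisks would then mean that tuning kept the initial algorithm. This is only an observation from this one log, so confirmation of the intended meaning would still be welcome.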
It would be great if someone could help me to solve my three problems.
Thank you very much
-Sascha