Dear Intel HPC community,
I’m trying to improve the scalability of a hybrid Fortran MPI/OpenMP code based on PETSc and MUMPS that I’m running on an SGI ICE-X cluster, and I’m seeking advice and good-practice rules for optimal process binding/affinity.
The code executes in a pure MPI fashion until MUMPS is called; that is when hybrid MPI/multithreading is exploited (calls to multithreaded BLAS routines).
The cluster's CPU architecture is heterogeneous: compute nodes can be Westmere, Sandy Bridge, Haswell, or Broadwell.
I’m compiling the code with the appropriate ‘-x’ architecture options and -openmp, using the Intel Fortran compiler 17.0.6 and the Intel MPI Library 2018.1.163, and linking against the multithreaded Intel MKL 2017.6.
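For reference, a build command looks roughly like the following (file names are placeholders, the -x flag shown is the one I use for the Haswell/Broadwell nodes, and the exact PETSc/MUMPS link line depends on how those libraries were installed):

    mpiifort -O2 -xCORE-AVX2 -openmp -mkl=parallel my_solver.f90 -o my_solver \
        -L${PETSC_DIR}/${PETSC_ARCH}/lib -lpetsc \
        -ldmumps -lmumps_common -lpord \
        -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64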
PBSPro is used to launch jobs on the cluster; in the job script I can set nselect, ncpus, mpiprocs, and ompthreads (a minimal example script is shown below).
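As an illustration, a typical job script of mine looks like this (the numbers are placeholders for a 2-socket, 14-cores-per-socket Broadwell node, with one MPI rank per socket and 14 threads per rank):

    #PBS -l select=4:ncpus=28:mpiprocs=2:ompthreads=14
    #PBS -l place=scatter
    cd $PBS_O_WORKDIR
    mpirun -np 8 ./my_solver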
- For the large matrices I'm dealing with, I've heard it is advised to have one MPI process per socket. Is setting I_MPI_PIN_DOMAIN=socket enough, or do I need to tweak other I_MPI_* environment variables as well?
- Is it acceptable to have the OpenMP threads sit idle during the pure-MPI phase, before the multithreaded BLAS calls?
- How can I make sure that my threads are pinned optimally for my application? Should KMP_AFFINITY and KMP_PLACE_THREADS be set instead of OMP_PROC_BIND and OMP_PLACES, and to which setting(s) should those environment variables be set? (My current attempt is sketched after this list.)
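For reference, this is the kind of setup I'm currently experimenting with in the job script, before the mpirun line; it is only a sketch of my current attempt (the thread count and the binding choices are placeholders), not something I'm confident is right:

    # one pinning domain per socket for the MPI ranks
    export I_MPI_PIN=1
    export I_MPI_PIN_DOMAIN=socket

    # OpenMP threads: 14 per rank on a Broadwell socket (placeholder count),
    # bound to cores within the rank's domain
    export OMP_NUM_THREADS=14
    export OMP_PLACES=cores
    export OMP_PROC_BIND=close

    # or, alternatively, the Intel-specific equivalent:
    # export KMP_AFFINITY=granularity=core,compact,1,0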
I have done a lot of research and reading about affinity, but since most of the presentations and web pages I found on hybrid MPI/OpenMP setups are neither generic nor clear enough for me, I'd rather ask here.
I appreciate your support and am happy to provide additional information.
Thank you in advance!