Recent benchmarks of a hybrid-parallelized flow solver showed what one has to consider to get the best performance.
In theory, hybrid implementations are thought to be the most flexible while still maintaining high performance. The reasoning is that OpenMP is ideal for intranode communication and faster than MPI there.
Between nodes, MPI is the choice for portable distributed-memory parallelization anyway.
In reality, however, quite a few MPI implementations already use shared-memory buffers when communicating with other ranks on the same shared-memory system. So basically OpenMP has no advantage over MPI at the intranode level when MPI is used for internode communication anyway.
On the contrary: apart from the additional implementation effort, a hybrid code demands a much deeper understanding of the processor and memory hierarchy layout, and of thread and process affinity to cores, than a pure MPI implementation.
Nevertheless, there are scenarios where hybrid really can pay off, since MPI lacks, for example, the OpenMP feature of accessing shared data in shared caches.
Finally, if you want to run your hybrid code on RRZE systems, the following features are available.
Pinning of MPI/OpenMP hybrids
I assume you use the provided mpirun wrapper:
- mpirun -pernode starts just one MPI process per node, regardless of the nodefile content.
- mpirun -npernode n starts just n MPI processes per node, regardless of the nodefile content.
- mpirun -npernode 2 -pin 0,1_2,3 env LD_PRELOAD=/apps/rrze/lib/ptoverride-ubuntu64.so starts 2 MPI processes per node and gives the threads of MPI process 0 access to cores 0 and 1 only, and the threads of MPI process 1 access to cores 2 and 3 only (of course the MPI processes themselves are also limited to those cores). Furthermore, the threads (e.g. OpenMP threads) are each pinned to a single core, so that migration is no longer an issue.
- mpirun -npernode 2 -pin 0_1_2_3 is your choice if you would like to test 1 OpenMP thread per MPI process and 4 MPI processes in total per node. Adding the LD_PRELOAD from above, however, decreases performance a lot. This is currently under investigation.
- export PINOMP_MASK=2 changes the skip mask of the pinning tool
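As a minimal sketch of how the -pin strings in the list above can be read, the following Python snippet splits such a string into one core list per MPI process on a node. The splitting rule (underscore separates processes, comma separates cores) is inferred only from the two examples given here, and the function name parse_pin_spec is made up for illustration; it is not part of the mpirun wrapper.

```python
def parse_pin_spec(spec):
    """Split a -pin string into one list of core IDs per MPI process.

    Assumption (inferred from the examples in the text): '_' separates
    the MPI processes on a node, ',' separates the cores of one process.
    """
    return [[int(core) for core in group.split(",")]
            for group in spec.split("_")]


if __name__ == "__main__":
    # Two MPI processes with two cores each
    print(parse_pin_spec("0,1_2,3"))
    # Four MPI processes with one core each
    print(parse_pin_spec("0_1_2_3"))
```

Read this way, the number of groups in the string should match the value passed to -npernode.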
OpenMP spawns not only worker threads but also threads for administrative tasks such as synchronization. Usually you only want to pin the threads that contribute to the computation. The default skip mask, which skips the threads that are not computationally intensive, might not be correct for hybrid programs, because MPI spawns non-worker threads as well. The PINOMP_MASK variable is interpreted as a bitmask, e.g. 2 -> 10 and 6 -> 110 in binary. A 0 means the thread is pinned, and a 1 means pinning is skipped for that thread. The least significant bit corresponds to thread zero (bit 0 is 0 in the examples above).
A mask of 6 was used for the algorithm under investigation as soon as one MPI process and 4 OpenMP worker threads were used per node, to get the correct thread pinning.
The usage of the rankfile for pinning hybrid jobs is described in Thomas Zeiser's blog.
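The bitmask interpretation described above can be checked with a few lines of Python. This is only an illustrative sketch: the helper names and the total of six threads are assumptions made for the example, and the pinning tool itself may count threads differently.

```python
def skipped_threads(mask, n_threads):
    """Thread indices whose bit in the mask is 1 (pinning is skipped)."""
    return [t for t in range(n_threads) if (mask >> t) & 1]


def pinned_threads(mask, n_threads):
    """Thread indices whose bit in the mask is 0 (thread gets pinned)."""
    return [t for t in range(n_threads) if not (mask >> t) & 1]


if __name__ == "__main__":
    n_threads = 6  # hypothetical number of threads created, in creation order
    for mask in (2, 6):
        print("mask", mask, "binary", format(mask, "b"),
              "skip", skipped_threads(mask, n_threads),
              "pin", pinned_threads(mask, n_threads))
```

With a mask of 6, threads 1 and 2 are skipped and thread 0 plus all later threads are pinned, which fits the case of one master thread followed by two non-worker threads.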
Thanks to Thomas Zeiser and Michael Meier for their help in resolving this issue.
Keywords: Thread Pinning Hybrid Affinity
Incorporated Comments of Thomas Zeiser
Thomas Zeiser, Thursday, 23 July 2009, 18:15
PINOMP_MASK for hybrid codes using Open-MPI
If recent Intel compilers and Open-MPI are used for hybrid OpenMP-MPI programming, the correct PINOMP_MASK seems to be 7 (instead of 6 for hybrid codes using Intel-MPI).
Thomas Zeiser, Monday, 22 February 2010, 20:41
PIN OMP and recent mvapich2
Recent mvapich2 also requires special handling for pinning hybrid codes: PINOMP_SKIP=1,2 seems to be appropriate.