Johannes Habichs Blog

11 June 2010

Thread Pinning/Affinity

Johannes Habich, 09:43, in HPC

Thread pinning is very important in order to get reliable and reproducible results on today's multi- and many-core architectures. Without pinning, threads migrate from one core to another and waste clock cycles. Even more important: if you have placed your memory correctly by first touch on a ccNUMA system, e.g. an SGI Altix or any dual-socket Intel Xeon (Core i7) machine, a thread that is migrated to the other socket has to access its memory over the QPI interface connecting the two sockets.
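As a minimal sketch of what first-touch placement looks like in practice (assuming a C compiler with OpenMP and the usual Linux first-touch page placement policy), the arrays are initialized in parallel with the same static loop schedule that the compute loop uses later:

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    int main(void)
    {
        const long n = 50000000;              /* two arrays of roughly 400 MB each */
        double *a = malloc(n * sizeof(*a));
        double *b = malloc(n * sizeof(*b));

        /* First touch: each thread initializes the part of the arrays it   */
        /* will work on later, so those pages end up in its local memory.   */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; i++) {
            a[i] = 0.0;
            b[i] = (double)i;
        }

        /* The same static schedule keeps every thread on the data it       */
        /* touched first, provided the threads are pinned and do not        */
        /* migrate to the other socket.                                     */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; i++)
            a[i] = 2.0 * b[i];

        printf("a[n-1] = %f\n", a[n - 1]);
        free(a);
        free(b);
        return 0;
    }

If the threads stay where they touched their data, all accesses in the second loop are socket-local; as soon as a thread migrates to the other socket, they turn into remote traffic over QPI.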

Jan Treibig developed a tool for that called likwid-pin.

A sample usage would be as follows:

likwid-pin -s 0 -c "0,1,2,3,4,5,6,7" ./Exec

This would pin the 8 threads of the executable to the cores 0 to 7.
For information about the topology, use the companion tool likwid-topology, which shows you the cache and core hierarchy.
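A plain call of likwid-topology already lists sockets, cores and cache levels; assuming your likwid version supports it, likwid-topology -g additionally prints an ASCII drawing of the topology.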
The skip mask is important and specific to the threading implementation. Also consider that in hybrid programs, e.g. OpenMP plus MPI, multiple shepherd threads are present.

19 June 2009

MPI/OpenMP hybrid: pinning or no pinning, that is the question

Johannes Habich, 14:34, in HPC, Parallel Computing MPI/OpenMP

Intro

Recent benchmarks of a hybrid-parallelized flow solver showed what one has to consider in order to get the best performance.
On the theoretical side, hybrid implementations are thought to be the most flexible while still maintaining high performance, the assumption being that OpenMP is perfect for intra-node communication and faster than MPI there.
Between nodes, MPI is in any case the choice for portable distributed-memory parallelization.

In reality, however, quite a few MPI implementations already use shared-memory buffers when communicating with other ranks in the same shared-memory system. So there is basically no advantage of an OpenMP parallelization over MPI on the intra-node level when MPI is used for inter-node communication anyway.

Quite the contrary: apart from the additional implementation effort, a hybrid code requires a lot more understanding of the processor and memory hierarchy layout and of thread and process affinity to cores than a pure MPI implementation.

Nevertheless there are scenarios where hybrid really can pay off, as MPI lacks, for example, the OpenMP feature of accessing shared data in shared caches.
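To make the affinity questions below more concrete, here is a minimal hybrid sketch (assuming an MPI library that provides MPI_Init_thread and a glibc system where sched_getcpu() is available) that reports on which core every rank/thread combination runs; with correct pinning this mapping stays fixed for the whole run:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sched.h>      /* sched_getcpu(), a GNU extension */
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv)
    {
        int provided, rank;

        /* MPI_THREAD_FUNNELED: only the master thread makes MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            /* Report where each OpenMP thread of each MPI rank runs. */
            printf("rank %d, thread %d of %d, on core %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads(),
                   sched_getcpu());
        }

        MPI_Finalize();
        return 0;
    }

Compiled with something like mpicc -fopenmp and launched with the mpirun options described below, this kind of check quickly reveals whether threads migrate or whether two ranks end up sharing the same cores.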

Finally, if you want to run your hybrid code on the RRZE systems, the following features are available.

Pinning of MPI/OpenMP hybrids

I assume you use the mpirun wrapper provided:

  • mpirun -pernode issues just one MPI process per node regardless of the nodefile content
  • mpirun -npernode n issues just n MPI processes per node regardless of the nodefile content
  • mpirun -npernode 2 -pin 0,1_2,3 env LD_PRELOAD=/apps/rrze/lib/ptoverride-ubuntu64.so issues 2 MPI processes per node and gives the threads of MPI process 0 access to cores 0 and 1 only, and the threads of MPI process 1 access to cores 2 and 3 (of course the MPI processes themselves are also limited to those cores). Furthermore, the (e.g. OpenMP) threads are pinned to one core each, so that migration is no longer an issue.
  • mpirun -npernode 2 -pin 0_1_2_3 is your choice if you would like to test 1 OpenMP thread per MPI process and 4 MPI processes in total per node. Adding the LD_PRELOAD from above, however, decreases performance a lot; this is currently under investigation.
  • export PINOMP_MASK=2 changes the skip mask of the pinning tool

OpenMP spawns not only worker threads but also threads for administrative business such as synchronization. Usually you would only pin the threads contributing to the computation. The default skip mask, which skips those non-computational threads, might not be correct in the case of hybrid programming, as MPI spawns non-worker threads as well. The PINOMP_MASK variable is interpreted as a bitmask, e.g. 2 -> 10 and 6 -> 110 in binary. A 0 bit means the thread is pinned and a 1 bit means the pinning of that thread is skipped. The least significant bit corresponds to thread zero (bit 0 is 0 in both examples above).

    A mask of 6 was used for the algorithm under investigation as soon as one MPI process with 4 OpenMP worker threads ran per node, in order to get the correct thread pinning.
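As a purely illustrative sketch (the helper below is hypothetical and not part of the RRZE tools or likwid), this is how such a mask translates into per-thread pin/skip decisions:

    #include <stdio.h>

    /* Hypothetical helper: decode a PINOMP_MASK-style bitmask into   */
    /* per-thread pin/skip decisions, using the convention described  */
    /* above: bit t corresponds to thread t, 0 = pin, 1 = skip.       */
    static void decode_skip_mask(unsigned mask, int nthreads)
    {
        printf("mask %u:\n", mask);
        for (int t = 0; t < nthreads; t++) {
            int skip = (mask >> t) & 1;
            printf("  thread %d: %s\n", t, skip ? "skipped" : "pinned");
        }
    }

    int main(void)
    {
        decode_skip_mask(2, 4);  /* 2 = 10b:  skip thread 1, pin all others        */
        decode_skip_mask(6, 4);  /* 6 = 110b: skip threads 1 and 2, pin all others */
        return 0;
    }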

The usage of the rankfile for pinning hybrid jobs is described in Thomas Zeiser's blog.

Thanks to Thomas Zeiser and Michael Meier for their help in resolving this issue.

Keywords: Thread Pinning, Hybrid, Affinity

Incorporated Comments of Thomas Zeiser

Thomas Zeiser, Thursday, 23 July 2009, 18:15

PINOMP_MASK for hybrid codes using Open-MPI

If recent Intel compilers and Open-MPI are used for hybrid OpenMP-MPI programming, the correct PINOMP_MASK seems to be 7 (instead of 6 for hybrid codes using Intel-MPI).

Thomas Zeiser, Monday, 22 February 2010, 20:41
PIN OMP and recent mvapich2

Recent mvapich2 also requires special handling for pinning hybrid codes: PINOMP_SKIP=1,2 seems to be appropriate.
