Tutorial at Supercomputing 2010, Nov. 14-19, 2010, New Orleans, LA:
Ingredients for good parallel performance on multicore-based systems
Slides: SC2010-Tutorial-Multicore
Authors/Presenters
Georg Hager and Gerhard Wellein
Erlangen Regional Computing Center
University of Erlangen-Nuremberg
Germany
{georg.hager,gerhard.wellein}@rrze.uni-erlangen.de
Abstract
This tutorial covers program optimization techniques for multicore processors and the systems they are used in. It concentrates on the dominating parallel programming paradigms, MPI and OpenMP.
We start by giving an architectural overview of multicore processors. Peculiarities like shared vs. separate caches, bandwidth bottlenecks, and ccNUMA characteristics are pointed out. We show typical performance features like synchronization overhead, intranode MPI bandwidths and latencies, ccNUMA locality, and bandwidth saturation (in cache and memory) in order to pinpoint the influence of system topology and thread affinity on the performance of typical parallel programming constructs. Multiple ways of probing system topology and establishing affinity, either by explicit coding or separate tools, are demonstrated. Finally we elaborate on programming techniques that help establish optimal parallel memory access patterns and/or cache reuse, with an emphasis on leveraging shared caches for improving performance.
Detailed Description
Tutorial goals:
With clusters of multicore, multisocket shared memory nodes forming the price/performance sweet spot in parallel computers, ignoring the specific features of those systems almost inevitably has a negative impact on program performance. Participants will
- gain an awareness of the typical multicore-specific bottlenecks and obstacles one has to face when writing efficient parallel software with MPI and/or OpenMP,
- learn to deduce the qualitative performance behavior of applications from cache/socket topologies and low-level benchmark data,
- learn how to leverage the shared cache feature of modern multicore chips,
- encounter tools to probe multicore and socket topology,
- and learn to establish proper thread-core mapping for OpenMP, MPI, and hybrid MPI/OpenMP.
Targeted audience:
Everyone who has to write efficient parallel software (MPI, OpenMP, hybrid) in a multicore environment.
Content level:
50% Introductory, 25% Intermediate, 25% Advanced
Audience prerequisites:
Some knowledge about parallel programming with MPI and OpenMP.
General description of tutorial content (extended abstract):
There is a common misunderstanding, especially among application programmers and software users, that the building blocks of modern HPC clusters are symmetric multiprocessors (SMPs). The inner structure of multicore chips and the “anisotropy” it creates in a system environment are often overlooked. With this tutorial we will create awareness of the performance issues that arise from ignoring chip and node topology, and convey the necessary knowledge to enable attendees to write efficient software and circumvent performance bottlenecks.
We start by giving an overview of multicore processor architecture, as far as it is relevant in HPC. While putting multiple cores on a chip or in a package without common resources apart from the memory connection is nothing but a peak-performance/bottleneck game, it introduces a new hierarchy level into system design, and has to be considered by programmers and software users alike. Additionally, the shared cache feature of more modern processors has brought about a real opportunity for advanced code optimizations. The tutorial will cover the following relevant issues:
- System topology and affinity control. How can the arrangement of hardware threads, cores, caches, and sockets (“packages”) be properly probed, and how can affinity (“pinning”) be enforced for OpenMP threads, MPI processes, and both at the same time? Concentrating on Linux, we will introduce some tools that can probe node topology and provide the information the user needs to establish the desired level of thread/core (or software thread/hardware thread) affinity. Standard tools like taskset and numactl, and improved variants as contained in the LIKWID toolset, are presented, together with affinity libraries (PLPA, hwloc). Examples will illustrate how to “pin” OpenMP, MPI, and hybrid MPI+OpenMP applications (a minimal pinning sketch is shown after this list). Compiler-specific facilities and tools available on non-Linux systems are briefly touched upon.
- Bandwidth bottlenecks. What are the consequences of having more than one core use the bandwidth of a single memory bus? Low-level benchmarks and application data will be used to demonstrate the typical bandwidth saturation effects in current multicore processors (see the triad sketch after this list).
- ccNUMA locality. There are hardly any UMA-type compute nodes left in current HPC systems. Although not strictly a multicore issue, ccNUMA locality is an important prerequisite of good node performance for memory-bound code. We will show how to employ ccNUMA page placement via the first-touch policy, external tools, and under program control (a first-touch sketch follows after this list). The question of how to deal with ccNUMA when dynamic scheduling or tasking cannot be avoided is given due attention.
- Shared vs. separate caches, node topology. What are the pros and cons of having a shared cache used by multiple cores in a package, and how can system topology impact program performance? We will show the influence of shared caches and overall system topology on OpenMP overheads (synchronization, dynamic scheduling) and on intra-node MPI behavior, and point out how topology awareness can pay off (a barrier micro-benchmark sketch is given after this list). With a growing number of cores and (possibly) MPI processes per node, there is a lot of optimization potential in the mapping between MPI ranks and subdomains when using domain decomposition; we will point out how this mapping can be improved if the MPI library is not aware of this problem. The influence and significance of hardware multithreading (SMT) are demonstrated. An often-used argument against shared caches is the bandwidth bottleneck they create. We will analyze when and why a shared cache may pose a bandwidth problem.
- Leveraging shared caches for performance. Can shared caches be used to boost parallel performance beyond merely reducing OpenMP and MPI overheads? We will demonstrate a strategy (coined “pipelined temporal blocking”) that explicitly uses shared caches on multicore chips to overlap memory traffic with useful computation, and show how mutual thread synchronization can be used to work around the large penalties of global OpenMP barriers (a synchronization sketch follows after this list).
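
Illustrative code sketches:

For the affinity item, here is a minimal sketch of explicit pinning under Linux using the standard sched_setaffinity interface, assuming a one-to-one mapping of OpenMP thread IDs onto core IDs 0..N-1; real topologies may require a different mapping, which is exactly what tools like likwid-pin, taskset, and numactl handle without code changes.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <omp.h>

int main(void)
{
#pragma omp parallel
    {
        int tid = omp_get_thread_num();
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(tid, &set);   /* assumption: core IDs coincide with thread IDs */
        if (sched_setaffinity(0, sizeof(set), &set) != 0)
            perror("sched_setaffinity");
        /* verify the binding by reporting the core each thread runs on */
        printf("OpenMP thread %d pinned to core %d\n", tid, sched_getcpu());
    }
    return 0;
}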
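The bandwidth saturation effect can be reproduced qualitatively with a simple STREAM-like triad run at increasing thread counts. This is only an illustrative sketch, not the low-level benchmark used in the tutorial; the array size is an assumption chosen to exceed all cache levels, and write-allocate traffic is not counted in the reported bandwidth.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 40000000L   /* assumption: large enough to exceed all caches */

int main(void)
{
    int maxthreads = omp_get_max_threads();

    for (int nt = 1; nt <= maxthreads; nt++) {
        omp_set_num_threads(nt);

        /* allocate fresh so the first-touch placement matches this thread count */
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c = malloc(N * sizeof(double));

#pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        double t = omp_get_wtime();
#pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = b[i] + 1.5 * c[i];   /* STREAM-like triad */
        t = omp_get_wtime() - t;

        /* three arrays of 8-byte doubles per iteration */
        printf("%2d threads: %6.1f GB/s\n", nt, 3.0 * 8.0 * N / t / 1e9);

        free(a); free(b); free(c);
    }
    return 0;
}

Saturation typically sets in well before all cores of a socket are used, which is the point the low-level data in the tutorial makes quantitatively.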
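The first-touch mechanism mentioned in the ccNUMA item boils down to initializing data with the same parallel access pattern that is later used in the computation. A minimal sketch, assuming pinned threads and static scheduling:

#include <stdlib.h>
#include <omp.h>

#define N 100000000L

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));

    /* Parallel first touch: with pinned threads and static scheduling,
       each memory page is mapped into the ccNUMA domain of the thread
       that initializes it. */
#pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = (double)i; }

    /* The computation uses the same schedule, so every thread
       accesses data in its local ccNUMA domain. */
#pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 2.0 * b[i];

    free(a); free(b);
    return 0;
}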
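The topology dependence of OpenMP synchronization overhead can be made visible with a barrier micro-benchmark like the following sketch; running it once with threads pinned inside one socket and once spread across sockets typically yields markedly different per-barrier costs. The iteration count and timing method are assumptions for illustration.

#include <stdio.h>
#include <omp.h>

#define ITER 100000

int main(void)
{
    double t = 0.0;

#pragma omp parallel
    {
#pragma omp barrier          /* make sure all threads have arrived */
#pragma omp master
        t = omp_get_wtime();

        for (int i = 0; i < ITER; i++) {
#pragma omp barrier          /* the construct being measured */
        }

#pragma omp master
        t = omp_get_wtime() - t;
    }

    printf("average barrier cost: %.2f microseconds with %d threads\n",
           1e6 * t / ITER, omp_get_max_threads());
    return 0;
}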
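Finally, the idea of replacing expensive global barriers by mutual (pairwise) thread synchronization can be sketched as follows. This is only a skeleton of the synchronization pattern, not the tutorial's pipelined temporal blocking kernel: thread t may only process a block after thread t-1 has finished it, mirroring the producer/consumer relation between successive time levels in the real scheme. The flag-and-flush mechanism and the fixed thread limit are assumptions for illustration.

#include <stdio.h>
#include <omp.h>

#define NBLOCK     1000
#define MAXTHREADS 64   /* assumption: enough for the machine at hand */

/* Per-thread progress counters replace global barriers: thread t only
   waits for its left neighbor t-1, not for the whole team. Padding to
   one counter per cache line would avoid false sharing in real code. */
static volatile int progress[MAXTHREADS];

int main(void)
{
#pragma omp parallel
    {
        int tid = omp_get_thread_num();

        for (int b = 0; b < NBLOCK; b++) {
            if (tid > 0) {
                /* spin until the left neighbor has completed block b */
                while (progress[tid - 1] <= b) {
#pragma omp flush
                }
            }

            /* ... update block b on this thread's time level here ... */

            /* announce completion of block b to the right neighbor */
#pragma omp flush
            progress[tid] = b + 1;
#pragma omp flush
        }
    }
    printf("pipeline finished\n");
    return 0;
}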