Full-day tutorial at Supercomputing 2012, Nov. 11-16, 2012, Salt Lake City, UT:
The practitioner’s cookbook for good parallel performance on multi- and manycore systems
Georg Hager and Gerhard Wellein
Erlangen Regional Computing Center
University of Erlangen-Nuremberg
The advent of multi- and manycore chips has led to a further opening of the gap between peak and application performance for many scientific codes. This trend is accelerating as we move from petascale to exascale. Paradoxically, bad node-level performance helps to “efficiently” scale to massive parallelism, but at the price of increased overall time to solution. If the user cares about time to solution on any scale, optimal performance on the node level is often the key factor. Also, the potential of node-level improvements is widely underestimated, so it is vital to understand the performance-limiting factors on modern hardware. We convey the architectural features of current processor chips, multiprocessor nodes, and accelerators, as well as the dominant MPI and OpenMP programming models, as far as they are relevant for the practitioner. Peculiarities like shared vs. separate caches, bandwidth bottlenecks, and ccNUMA characteristics are pointed out, and the influence of system topology and affinity on the performance of typical parallel programming constructs is demonstrated. Performance engineering is introduced as a powerful tool that helps the user assess the impact of possible code optimizations by establishing models for the interaction of the software with the hardware on which it runs.
With clusters of multi-core multi-socket shared memory nodes (possibly equipped with accelerator boards) forming the price/performance sweet spot in parallel computers, ignoring the specific features of those systems almost inevitably has a negative impact on time to solution. Participants will
- learn about the differences and dependencies between scalability, time to solution, and resource utilization,
- gain an awareness of the typical bottlenecks and obstacles one has to face when writing efficient parallel software with MPI, OpenMP, CUDA, or a combination of those on modern cluster nodes, and learn to distinguish between hardware and software reasons for limited scalability,
- learn to establish proper thread-core mapping for OpenMP, MPI, and hybrid MPI/OpenMP,
- learn about tools to probe multicore and socket topology, obtain hardware performance counter data, and use this data properly to get a picture of a code’s interaction with the hardware,
- learn to employ modeling techniques to predict and understand the performance behavior of applications from architectural properties and low-level benchmark data,
- learn to employ a structured performance engineering method that will lead to a better understanding of performance issues and the possible benefit of optimizations,
- use the skills they have acquired to work with toy codes on real hardware during the hands-on sessions.
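The first point above, the tension between scalability and time to solution, can be illustrated with a toy model. The sketch below assumes that the runtime on N nodes consists of a perfectly parallel compute part plus a communication cost that does not shrink with N. The function names and all numbers are hypothetical and chosen for illustration only; they are not measurements from the tutorial.

```python
def runtime(n_nodes, compute_time, comm_time):
    """Toy model: perfectly parallel compute part plus a fixed
    communication cost that only occurs on more than one node."""
    overhead = comm_time if n_nodes > 1 else 0.0
    return compute_time / n_nodes + overhead


def parallel_efficiency(n_nodes, compute_time, comm_time):
    """Speedup relative to the single-node run, divided by the node count."""
    t1 = runtime(1, compute_time, comm_time)
    return t1 / (n_nodes * runtime(n_nodes, compute_time, comm_time))


# Hypothetical numbers: the same problem with slow and with optimized node code.
COMM = 1.0          # communication time per run in seconds, independent of N
SLOW_NODE = 1000.0  # single-node runtime of the unoptimized code in seconds
FAST_NODE = 250.0   # single-node runtime after node-level optimization

for n in (1, 16, 64, 256):
    t_slow = runtime(n, SLOW_NODE, COMM)
    t_fast = runtime(n, FAST_NODE, COMM)
    e_slow = parallel_efficiency(n, SLOW_NODE, COMM)
    e_fast = parallel_efficiency(n, FAST_NODE, COMM)
    print(f"N={n:4d}  slow: {t_slow:8.2f} s (eff {e_slow:.2f})  "
          f"fast: {t_fast:8.2f} s (eff {e_fast:.2f})")
```

In this model the slow code always shows the higher parallel efficiency, because its compute part dominates the fixed overhead for longer, yet the optimized code finishes earlier at every node count.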
Everyone who has to write efficient parallel software (MPI, OpenMP, hybrid) in a multi- and manycore environment.
50% Introductory, 25% Intermediate, 25% Advanced
Some knowledge about MPI and OpenMP. Hands-on sessions require a working knowledge of Linux environments (SSH, basic commands, compiling) and the principles of batch processing.
General description of tutorial content (extended abstract):
There is a common misunderstanding, especially among application programmers and software users, that the building blocks of modern HPC clusters are (possibly accelerated) symmetric multiprocessors (SMPs). With this tutorial we will create awareness of the performance issues that arise from ignoring chip and node topology, and convey the necessary knowledge to enable attendees to understand and, where possible, circumvent the inherent bottlenecks.
We start by giving an overview of multicore processor and accelerator (i.e., GPGPU) architecture, as far as it is relevant in HPC. While putting multiple cores on a chip or in a package without common resources apart from the memory connection is nothing but a peak-performance/bottleneck game, it introduces a new hierarchy level into system design, and has to be considered by programmers and software users alike. The tutorial will cover the following relevant issues:
- Bandwidth bottlenecks. What are the consequences of having compute units that share memory levels (cache or main memory)? Low-level benchmarks and application data will be used to demonstrate the typical bandwidth saturation effects in current multi-core processors, and the importance of thread occupancy and latency hiding on GPGPUs.
- ccNUMA locality. ccNUMA locality is an important prerequisite of good node performance for memory-bound code, and it will not be handled automatically. We show how to employ ccNUMA page placement via the first-touch policy, external tools, and under program control.
- Shared vs. separate caches, node topology. What are the pros and cons of having a shared cache used by multiple cores in a package, and how can system topology impact program performance? Which resources on a node are easy to scale, and which are not? We will show the influence of shared resources and overall system topology on program performance. The influence and significance of hardware multithreading (SMT) is demonstrated and popular myths about SMT are cleared up.
- System topology, affinity control, and hardware performance metrics. How can the arrangement of hardware threads, cores, caches, and sockets (“packages”) be properly probed, and how can affinity (“pinning”) be enforced for OpenMP threads, MPI processes, and both at the same time? Concentrating on Linux, we present standard tools like taskset and numactl as well as the improved variants contained in the LIKWID toolset.
- Performance counter measurements. The likwid-perfctr utility from the LIKWID suite is a hardware counter tool that is easier to use than, e.g., PAPI. We show best practices for program optimization guided by hardware counter data.
- Performance modeling and engineering. When is the performance “good enough”? Can it still be improved? We will first introduce simple bandwidth-based performance models, which work well in many cases, to stress the importance of resource utilization. As a typical application, we will use such models to estimate the expected performance gains from putting (parts of) a code onto an accelerator. Refinements of the model (including PCIe overhead, in-core execution, inter-cache data transfer) will then be developed and applied in relevant test cases (stencil codes, sparse matrix-vector multiplication). We will also establish performance engineering as a tool for replacing shot-in-the-dark optimizations with a structured approach that tries to reach a definite performance goal.
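A simple bandwidth-based model of the kind mentioned in the last item can be sketched in a few lines. The version below assumes the common form in which the achievable performance is the smaller of the in-core peak and the memory bandwidth divided by the code balance (bytes transferred per flop). The function name, the machine parameters, and the choice of a vector triad loop as the example are assumptions made for illustration, not material from the tutorial.

```python
def bandwidth_limited_performance(peak_gflops, mem_bw_gbytes, bytes_per_flop):
    """Upper performance limit in GFlop/s: the smaller of the in-core peak
    and the rate at which memory can feed the loop at its code balance."""
    return min(peak_gflops, mem_bw_gbytes / bytes_per_flop)


# Code balance of the loop a[i] = b[i] + s * c[i] in double precision:
# 2 flops per iteration; 2 loads + 1 store + 1 write-allocate transfer
# of 8 bytes each (assuming the store misses the cache and no
# nontemporal stores are used).
TRIAD_BALANCE = (4 * 8) / 2  # bytes per flop

# Hypothetical machine parameters, not taken from any specific system.
PEAK = 160.0       # in-core peak performance in GFlop/s
BANDWIDTH = 40.0   # saturated memory bandwidth in GByte/s

limit = bandwidth_limited_performance(PEAK, BANDWIDTH, TRIAD_BALANCE)
print(f"code balance: {TRIAD_BALANCE:.1f} bytes/flop")
print(f"predicted limit: {limit:.2f} GFlop/s "
      f"({100.0 * limit / PEAK:.1f}% of peak)")
```

With these made-up numbers the loop is limited to a small fraction of peak by memory bandwidth alone, which is the kind of statement such a model is meant to deliver before any optimization is attempted.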
All aspects will be substantiated by performance data from low-level benchmarks (where applicable) or case studies using real application codes.
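The effect of affinity control from the topic list can be tried out from within a single process. The sketch below assumes a Linux system and uses Python's `os.sched_getaffinity` and `os.sched_setaffinity`, which expose the scheduler affinity mask that command-line tools such as taskset also manipulate. The helper name `pin_to_cpu` is hypothetical, and the sketch does not cover OpenMP threads or MPI processes.

```python
import os


def pin_to_cpu(cpu_id):
    """Restrict the calling process to one logical CPU.
    Returns the previous affinity mask so that it can be restored."""
    previous = os.sched_getaffinity(0)  # pid 0 means "the calling process"
    os.sched_setaffinity(0, {cpu_id})
    return previous


# Pick a CPU the process is currently allowed to run on (Linux only).
allowed = os.sched_getaffinity(0)
target = min(allowed)

saved = pin_to_cpu(target)
print(f"allowed before pinning: {sorted(saved)}")
print(f"allowed after pinning:  {sorted(os.sched_getaffinity(0))}")

# Restore the original mask so later work is not confined to one CPU.
os.sched_setaffinity(0, saved)
```

Choosing the target from the current mask avoids requesting a CPU that a batch system or container has already excluded.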