Full-day tutorial at PPoPP 2014, Feb. 16, 2014, Orlando, FL:
The practitioner’s cookbook for good parallel performance on multi- and manycore systems
Author/presenter: Dr. Jan Treibig
The advent of multi- and manycore chips has led to a further widening of the gap between peak and application performance for many scientific codes. This trend is accelerating as we move from petascale to exascale. Paradoxically, bad node-level performance helps to “efficiently” scale to massive parallelism, but at the price of increased overall time to solution. If the user cares about time to solution on any scale, optimal performance on the node level is often the key factor. Moreover, the potential of node-level improvements is widely underestimated; it is thus vital to understand the performance-limiting factors on modern hardware. We convey the architectural features of current processor chips, multiprocessor nodes, and accelerators, as well as the dominant MPI and OpenMP programming models, as far as they are relevant for the practitioner. Peculiarities like SIMD vectorization, shared vs. separate caches, bandwidth bottlenecks, and ccNUMA characteristics are introduced, and the influence of system topology and affinity on the performance of typical parallel programming constructs is demonstrated. Performance engineering is introduced as a powerful tool that helps the user assess the impact of possible code optimizations by establishing models for the interaction of the software with the hardware on which it runs.
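To give a flavor of the topology and affinity topic: thread placement can be controlled via the OpenMP 4.0 affinity environment variables, or with node-topology tools such as those in the LIKWID suite. The following sketch is illustrative only (the thread count and core range are assumptions about a typical node), not part of the tutorial materials:

```shell
# Illustrative affinity settings, assuming an OpenMP binary on a
# typical multicore node (thread count and core IDs are assumptions).
export OMP_NUM_THREADS=8     # e.g., one thread per physical core
export OMP_PLACES=cores      # one place per physical core (OpenMP 4.0)
export OMP_PROC_BIND=close   # pack threads on neighboring cores
# ./a.out                    # hypothetical OpenMP application

# Alternatively, inspect the node layout and pin threads explicitly
# with the LIKWID tools:
#   likwid-topology          # print socket/core/cache topology
#   likwid-pin -c 0-7 ./a.out
```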
With clusters of multi-core multi-socket shared memory nodes (possibly equipped with accelerator boards) forming the price/performance sweet spot in parallel computers, ignoring the specific features of those systems almost inevitably has a negative impact on time to solution. Participants will
- learn about the differences and dependencies between scalability, time to solution, and resource utilization,
- learn about the relevant architectural features of modern multi- and manycore hardware, such as SIMD execution, SMT, cache hierarchies, shared resources, and ccNUMA memory architecture,
- gain an awareness of the typical bottlenecks and obstacles one has to face when writing efficient parallel software on modern cluster nodes, and learn to distinguish between hardware and software reasons for limited scalability,
- learn to establish proper thread-core mapping for OpenMP, MPI, and hybrid MPI/OpenMP,
- learn about tools to probe multicore and socket topology, obtain hardware performance counter data, and use this data properly to get a picture of a code’s interaction with the hardware,
- learn to employ modeling techniques to predict and understand the performance behavior of applications from architectural properties and low-level benchmark data,
- learn to employ a structured performance engineering method that will lead to a better understanding of performance issues and the possible benefit of optimizations.
Intended audience: Everyone who has to write efficient parallel software (MPI, OpenMP, hybrid) in a multi- and manycore environment.
Content level: 40% introductory, 35% intermediate, 25% advanced.