Half-day tutorial at the 20th High Performance Computing Symposium (HPC 2012), part of the SCS Spring Simulation Multiconference (SpringSim ’12), March 26-29, 2012, Orlando, FL:
Ingredients for good parallel performance on multicore-based systems
Georg Hager, Gerhard Wellein, and Jan Treibig, University of Erlangen-Nuremberg, Germany
This tutorial covers program optimization techniques for multicore processors and the systems they are used in. It concentrates on the dominant parallel programming paradigms, MPI and OpenMP.
We start with an architectural overview of multicore processors, pointing out peculiarities such as shared vs. separate caches, bandwidth bottlenecks, and ccNUMA characteristics. We then show typical performance features, including synchronization overhead, intranode MPI bandwidths and latencies, ccNUMA locality, and bandwidth saturation (in cache and memory), in order to pinpoint the influence of system topology and thread affinity on the performance of typical parallel programming constructs. Multiple ways of probing system topology and establishing affinity, either by explicit coding or with separate tools, are demonstrated. Finally, we elaborate on programming techniques that help establish optimal parallel memory access patterns and cache reuse, with an emphasis on leveraging shared caches to improve performance.
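As an illustration of the ccNUMA locality topic mentioned above, the following is a minimal sketch of parallel first-touch initialization in C with OpenMP. It is not taken from the tutorial material: the function names `alloc_first_touch` and `triad` are hypothetical, and the sketch assumes an operating system with a first-touch page placement policy (the usual default on Linux) and threads that stay pinned to their cores between the two loops. Under those assumptions, each thread writes, and later reads, the part of the arrays that resides in its own NUMA domain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: allocate n doubles and initialize them in parallel.
 * Assumption: with a first-touch policy, a memory page is mapped into the
 * NUMA domain of the thread that writes to it first, so the static
 * schedule below distributes the pages across the domains. */
double *alloc_first_touch(size_t n, double value)
{
    double *v = malloc(n * sizeof *v);
    if (v == NULL)
        return NULL;

#pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; ++i)
        v[i] = value;   /* first write to each page happens here */

    return v;
}

/* Hypothetical compute kernel: a[i] = b[i] + s * c[i].
 * It uses the same static schedule as the initialization, so each thread
 * works on the index range it touched first. */
void triad(double *a, const double *b, const double *c, double s, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);

#pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; ++i)
        a[i] = b[i] + s * c[i];
}
```

With GCC, the sketch would be compiled with `-fopenmp`. Without that flag the pragmas are ignored and the code runs serially with the same results. The locality benefit only materializes if threads do not migrate, which is why establishing thread affinity matters here.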