Full-day tutorial at the virtual Supercomputing Conference 2020 (SC20), November 9-20, 2020, Atlanta, GA, USA:
Node-Level Performance Engineering
Slides for download:
- General introduction
- Introduction to node-level computer architecture
- Performance tools part 1: topology and affinity
- Microbenchmarking
- Introduction to the Roofline model
- Performance tools part 2: performance events and clock speed
- Case study: tall & skinny dense matrix-vector multiplication
- Case study: a 2D five-point stencil smoother
- Case study: sparse matrix-vector multiplication
- Programming for single instruction multiple data (SIMD) parallelism
- Programming for cache-coherent non-uniform memory architectures (ccNUMA)
Transcript of the Q&A chat: SC20_NLPE_QandA.pdf
Interesting links:
- Intel® 64 and IA-32 Architectures Optimization Reference Manual
- Agner Fog: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs
- LIKWID tool suite
- LIKWID Wiki
Modified STREAM source code: stream.c
Compile with, e.g.:
icc -Ofast -xHost -qopenmp -fno-alias -nolib-inline -qopt-streaming-stores never|always -o stream.exe stream.c
Run with:
likwid-pin -c <pin_mask> ./stream.exe
LIKWID-instrumented STREAM source code: stream-mapi.c
Compile with, e.g.:
icc <options-from-above> -DLIKWID_PERFMON -I<path_to_likwid_inc> stream-mapi.c -o stream-mapi.exe -L<path_to_likwid_lib> -llikwid
Run with:
likwid-perfctr -C <pin_mask> -m -g <perf_group> ./stream-mapi.exe
Vector triad throughput benchmark: triad-throughput.tar.gz
Compile with:
icc -c timing.c
icc -c dummy.c
ifort -Ofast -xHost -qopenmp -fno-alias -fno-inline triad-tp.f90 dummy.o timing.o -o triad.exe
Run with:
echo <size> | likwid-pin <PIN_OPTIONS> ./triad.exe
Jacobi 3D stencil code: j3d_with_likwid.tar.gz
Build with the supplied Makefile (may need to adapt to your LIKWID setup).
Run with:
likwid-perfctr -C <pin_mask> -m -g <perf_group> ./J3D.exe <size>
Sparse matrix benchmark code (CSR/SELL-C-sigma):
Authors/Presenters
Georg Hager¹, Jan Eitzinger¹, and Gerhard Wellein²
¹ Erlangen Regional Computing Center
² Department of Computer Science
University of Erlangen-Nuremberg
Germany
{georg.hager,jan.eitzinger,gerhard.wellein}@fau.de
Abstract
The advent of multi- and manycore chips has led to a further opening of the gap between peak and application performance for many scientific codes. This trend is accelerating as we move from petascale to exascale. Paradoxically, bad node-level performance helps to “efficiently” scale to massive parallelism, but at the price of increased overall time to solution. If the user cares about time to solution at any scale, optimal performance on the node level is often the key factor. We convey the architectural features of current processor chips, multiprocessor nodes, and accelerators, as far as they are relevant for the practitioner. Peculiarities like SIMD vectorization, shared vs. separate caches, bandwidth bottlenecks, and ccNUMA characteristics are introduced, and the influence of system topology and affinity on the performance of typical parallel programming constructs is demonstrated. Performance engineering and performance patterns are suggested as powerful tools that help the user understand the bottlenecks at hand and assess the impact of possible code optimizations. A cornerstone of these concepts is the roofline model, which is described in detail, including useful case studies, limits of its applicability, and possible refinements.