Georg Hager's Blog

At the International Parallel and Distributed Processing Symposium (IPDPS) 2024 next week in San Fancisco, CA, our PhD student Jan Laukemann will present a paper that came out of a collaboration between NHR@FAU and Brookhaven National Laboratory (BNL): “CloverLeaf on Intel Multi-Core CPUs: A Case Study in Write-Allocate Evasion” investigates a curious effect when benchmarking the MPI-only version of the CloverLeaf proxy application, which is part of the SPEChpc 2021 benchmark suite: When scaling this code across the cores of an Intel Xeon Ice Lake or Sapphire Rapids node, we observed peculiar breakdowns in performance when the number of processes is prime. Being aware that we might just have discovered the most horribly expensive way to calculate prime numbers, we looked into the details with meticulous measurements and performance models. We came to the conclusion that this was neither caused by excessive MPI communication nor breaking layer conditions but by a new feature of Intel CPUs since Ice Lake: a write-allocate evasion mechanism called “SpecI2M.” SpecI2M is supposed to eliminate the write-allocate transfers from memory initiated by write misses if the cache line will be completely overwritten anyway (in which case the write-allocate is just overhead). As it turns out, SpecI2M is especially ineffective when the number of MPI processes is prime. To discover why, visit Jan’s talk (if you happen to be in San Francisco) or take a look at the paper:

J. Laukemann, T. Gruber, G. Hager, D. Oryspayev, and G. Wellein: CloverLeaf on Intel Multi-Core CPUs: A Case Study in Write-Allocate Evasion. Accepted for publication at IPDPS 2024, the 38th IEEE International Parallel & Distributed Processing Symposium. Preprint: arXiv:2311.04797

Along the way, we provided the first comprehensive, predictive Roofline model for the hot-spot loops in CloverLeaf, which helped a lot in figuring out what was actually going on. To our great delight, the paper is one of four best-paper candidates at the conference!

On Sunday, May 12, the brand-new tutorial “Performance Engineering for Linear Solvers” will be presented at ISC High Performance in Hamburg by Christie Alappat (still a PhD student at FAU but now working for Intel), Jonas Thies (TU Delft), Hartwig Anzt (TU München Campus Heilbronn), and myself.

This tutorial was in the making for a long time; many concepts were made, re-made, and updated again. We aimed at a slightly higher abstraction level than in our popular tutorial “Node-Level Performance Engineering,” which has a strong focus on the Roofline model and the optimization of simple loops and loop nests. In contrast, the new tutorial concentrates on the performance of sparse linear solvers, which includes a coverage of sparse matrix-vector multiplication (SpMV), preconditioners, and even cache blocking of matrix powers via RACE, Christie’s Recursive Algebraic Coloring Engine. Since the tutorial was accepted as a half-day event, we could only accommodate online demos instead of hands-on exercises for attendees. However, all code (mostly python/numba) is available for download.

Random thoughts on High Performance Computing

Content

Paper on Write-Allocate Evasion is Best Paper Candidate at IPDPS 2024

New tutorial “Performance Engineering for Linear Solvers” at ISC High Performance 2024