
Get a full dose of performance engineering with our two half-day tutorials at ISC25!

June 8, 2025

This year’s ISC High Performance conference takes place in Hamburg from June 10-13. As a refreshing counterweight to the usual enema of AI and Quantum Computing you will get there, we conduct two half-day tutorials about good old solid performance engineering.

In the morning of June 13, Jan Laukemann and I will present the “Core-Level Performance Engineering” tutorial: 3.5 hours packed with information on how modern CPUs execute your code. Pipelining, out-of-order execution, superscalarity, SIMD, plus hands-on exercises using Matt Godbolt’s Compiler Explorer and OSACA, our Open-Source Architecture Code Analyzer, which is integrated with it. If you need to model the in-core performance of code for optimization or co-design, this tutorial is for you.

In the afternoon of June 13, Christie Alappat (still a PhD student at FAU but now working for Intel), Jonas Thies (TU Delft), Hartwig Anzt (TU München Campus Heilbronn), and I will conduct the tutorial “Performance Engineering for Sparse Linear Solvers.” It provides thorough coverage of sparse matrix-vector multiplication (SpMV), preconditioners, and even cache blocking of matrix powers via RACE, Christie’s Recursive Algebraic Coloring Engine. In the hands-on exercises, attendees will get access to an A100 GPU and be able to experiment with SpMV and sparse linear solvers. All code (mostly python/numba) is available for download.
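For readers who have not met SpMV before, here is a minimal sketch of the kernel for a matrix in the common CSR (compressed sparse row) format. It is not taken from the tutorial material; the function name `spmv_csr`, the array names, and the tiny 3x3 example matrix are illustrative assumptions only.

```python
import numpy as np


def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR format.

    row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i,
    col_idx holds their column indices, values their numerical values.
    (Hypothetical helper for illustration, not the tutorial's code.)
    """
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows, dtype=np.result_type(values, x))
    for row in range(n_rows):
        acc = 0.0
        # Indirect access x[col_idx[k]] is what makes SpMV irregular.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Example matrix (assumed for illustration):
# [[4, 0, 1],
#  [0, 3, 0],
#  [2, 0, 5]]
row_ptr = np.array([0, 2, 3, 5])
col_idx = np.array([0, 2, 1, 0, 2])
values = np.array([4.0, 1.0, 3.0, 2.0, 5.0])
x = np.array([1.0, 2.0, 3.0])

y = spmv_csr(row_ptr, col_idx, values, x)
print(y)
```

The plain Python loops are slow and only meant to show the data access pattern; a compiled or JIT-compiled version would be needed for any serious measurement.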

Paper on Write-Allocate Evasion is Best Paper Candidate at IPDPS 2024

May 20, 2024

At the International Parallel and Distributed Processing Symposium (IPDPS) 2024 next week in San Francisco, CA, our PhD student Jan Laukemann will present a paper that came out of a collaboration between NHR@FAU and Brookhaven National Laboratory (BNL): “CloverLeaf on Intel Multi-Core CPUs: A Case Study in Write-Allocate Evasion” investigates a curious effect when benchmarking the MPI-only version of the CloverLeaf proxy application, which is part of the SPEChpc 2021 benchmark suite: When scaling this code across the cores of an Intel Xeon Ice Lake or Sapphire Rapids node, we observed peculiar breakdowns in performance when the number of processes is prime. Being aware that we might just have discovered the most horribly expensive way to calculate prime numbers, we looked into the details with meticulous measurements and performance models. We came to the conclusion that this was neither caused by excessive MPI communication nor breaking layer conditions but by a new feature of Intel CPUs since Ice Lake: a write-allocate evasion mechanism called “SpecI2M.” SpecI2M is supposed to eliminate the write-allocate transfers from memory initiated by write misses if the cache line will be completely overwritten anyway (in which case the write-allocate is just overhead). As it turns out, SpecI2M is especially ineffective when the number of MPI processes is prime. To discover why, visit Jan’s talk (if you happen to be in San Francisco) or take a look at the paper:

J. Laukemann, T. Gruber, G. Hager, D. Oryspayev, and G. Wellein: CloverLeaf on Intel Multi-Core CPUs: A Case Study in Write-Allocate Evasion. Accepted for publication at IPDPS 2024, the 38th IEEE International Parallel & Distributed Processing Symposium. Preprint: arXiv:2311.04797

Along the way, we provided the first comprehensive, predictive Roofline model for the hot-spot loops in CloverLeaf, which helped a lot in figuring out what was actually going on. To our great delight, the paper is one of four best-paper candidates at the conference!
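To make the cost of write-allocate transfers concrete, here is a small back-of-the-envelope sketch. It is not from the paper: the loop `a[i] = b[i] + c[i]`, the function name `bytes_per_iteration`, and the idea of describing partial evasion by a single fraction `wa_evasion` are simplifying assumptions for illustration.

```python
def bytes_per_iteration(n_loads, n_stores, wa_evasion=0.0, word_bytes=8):
    """Memory traffic per loop iteration for purely streaming arrays.

    n_loads / n_stores: number of array streams read / written.
    wa_evasion: assumed fraction (0..1) of write-allocate transfers
                that the hardware manages to avoid.
    """
    if not 0.0 <= wa_evasion <= 1.0:
        raise ValueError("wa_evasion must be between 0 and 1")
    # Every store stream that is not evaded costs an extra read
    # of the cache line before it is overwritten.
    write_allocate_streams = n_stores * (1.0 - wa_evasion)
    return word_bytes * (n_loads + n_stores + write_allocate_streams)


# Hypothetical loop: a[i] = b[i] + c[i] with double-precision data
full_wa = bytes_per_iteration(n_loads=2, n_stores=1)                # 32.0
no_wa = bytes_per_iteration(n_loads=2, n_stores=1, wa_evasion=1.0)  # 24.0

# Upper limit on the speedup of a memory-bound loop from perfect evasion
speedup_limit = full_wa / no_wa
print(full_wa, no_wa, round(speedup_limit, 3))
```

For a memory-bound loop, runtime is roughly proportional to this traffic, so the efficiency of the evasion mechanism shows up directly in the performance.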

New tutorial “Performance Engineering for Linear Solvers” at ISC High Performance 2024

May 5, 2024

On Sunday, May 12, the brand-new tutorial “Performance Engineering for Linear Solvers” will be presented at ISC High Performance in Hamburg by Christie Alappat (still a PhD student at FAU but now working for Intel), Jonas Thies (TU Delft), Hartwig Anzt (TU München Campus Heilbronn), and me.

This tutorial was in the making for a long time; many concepts were made, re-made, and updated again. We aimed at a slightly higher abstraction level than in our popular tutorial “Node-Level Performance Engineering,” which has a strong focus on the Roofline model and the optimization of simple loops and loop nests. In contrast, the new tutorial concentrates on the performance of sparse linear solvers, which includes coverage of sparse matrix-vector multiplication (SpMV), preconditioners, and even cache blocking of matrix powers via RACE, Christie’s Recursive Algebraic Coloring Engine. Since the tutorial was accepted as a half-day event, we could only accommodate online demos instead of hands-on exercises for attendees. However, all code (mostly python/numba) is available for download.

SIAM Parallel Processing 24 (PP24) Minisymposium on “Advancements in Sparse Linear Algebra: Hardware-Aware Algorithms and Optimization Techniques”

December 12, 2023

Together with Christie Alappat and Gerhard Wellein, I am organizing a two-part minisymposium at SIAM Parallel Processing 2024 in Baltimore, MD (full conference program available), on March 7, 2024. It is titled “Advancements in Sparse Linear Algebra: Hardware-Aware Algorithms and Optimization Techniques.” This is the abstract:

Over the last decade, the landscape of computer architecture has undergone tremendous transformations. At the same time, the field of sparse linear algebra (LA) has experienced a resurgence in interest, witnessing the emergence of novel algorithms and the revival of traditional ones. The irregular access patterns inherent in sparse LA often pose significant challenges for efficient execution on highly parallel modern computing devices. This minisymposium delves into diverse algorithmic and programming techniques that address these challenges. We will explore various methods for enhancing the computational intensity and node-level performance of sparse algorithms, along with approaches to improve their scalability. The topics covered encompass mixed precision computation, batched solvers, methods for reducing or hiding communication, cache blocking, and the efficient utilization of parallel paradigms. Naturally, the implementation of some of these techniques is not straightforward and may necessitate algorithmic reformulation. The discussions will shed light on this aspect while also examining their applications, benefits, and limitations. The primary objective is to present an overview of cutting-edge sparse LA techniques that effectively leverage hardware capabilities.

Here’s an overview of the agenda:

Part 1 – MS43 (March 7, 11:00 am – 1:00 pm EST)

11:00-11:25 Accelerating Sparse Solvers with Cache-Optimized Matrix Power Kernels abstract
Christie Louis Alappat, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany

11:30-11:55 Efficient Schwarz Preconditioning Techniques on Current Hardware Using FROSch abstract
Alexander Heinlein, Delft University of Technology, Netherlands; Sivasankaran Rajamanickam and Ichitaro Yamazaki, Sandia National Laboratories, U.S.

12:00-12:25 Sparse Algorithms for Large-Scale Bayesian Inference Problems abstract
Lisa Gaedke-Merzhäuser and Olaf Schenk, Università della Svizzera italiana, Switzerland

12:30-12:55 Communication-Reduced Sparse-Dense Matrix Multiplication with Adaptive Parallelization abstract
Hua Huang and Edmond Chow, Georgia Institute of Technology, U.S.

Part 2 – MS54 (March 7, 3:45 pm – 5:45 pm EST)

3:45-4:10 Residual Inverse Formulation of the Feast Eigenvalue Algorithm Using Mixed-Precision and Inexact System Solves abstract
Ivan Williams and Eric Polizzi, University of Massachusetts, Amherst, U.S.

4:15-4:40 How Mixed Precision Can Accelerate Sparse Solvers abstract
Hartwig Anzt, University of Tennessee, U.S.; Terry Cojean, Pratik Nayak, Thomas Gruetzmacher, Yu-Hsiang Mike Tsai, Marcel Koch, Tobias Ribizel, Fritz Goebel, and Gregor Olenik, Karlsruhe Institute of Technology, Germany

4:45-5:10 Algebraic Programming for High Performance Auto-Parallelised Solvers abstract
Albert Jan N. Yzelman, Denis Jelovina, Aristeidis Mastoras, Alberto Scolari, and Daniele Giuseppe Spampinato, Huawei Technologies Switzerland AG, Switzerland

5:15-5:40 Pipelined Sparse Solvers: Can More Reliable Computations Help Us to Converge Faster? abstract
Roman Iakymchuk, Umeå University and Uppsala University, Sweden

Node-Level Performance Engineering tutorial to be featured again at SC23

October 23, 2023

Our popular “Node-Level Performance Engineering” full-day tutorial has been accepted again (now the twelfth time in a row!) for presentation at SC23, the International Conference for High Performance Computing, Networking, Storage and Analysis. Together with Thomas Gruber and Gerhard Wellein I will teach the basics of node-level computer architecture, the LIKWID performance tools suite, analytic performance modeling (via the Roofline model), and model-guided optimization. Find the details in the official SC23 agenda.
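As a taste of what analytic performance modeling means here, this is a minimal sketch of the basic Roofline formula: performance is limited either by the peak compute rate or by the memory bandwidth times the computational intensity of the loop. The machine numbers below are made up for illustration and do not describe any particular system.

```python
def roofline(peak_gflops, mem_bw_gbytes_per_s, intensity_flop_per_byte):
    """Basic Roofline prediction in Gflop/s.

    peak_gflops: maximum in-core performance of the loop
    mem_bw_gbytes_per_s: attainable memory bandwidth
    intensity_flop_per_byte: flops executed per byte of memory traffic
    """
    return min(peak_gflops, intensity_flop_per_byte * mem_bw_gbytes_per_s)


# Hypothetical node: 1000 Gflop/s peak, 200 GB/s memory bandwidth
PEAK = 1000.0
BANDWIDTH = 200.0

# Low intensity (2 flops per 16 bytes): memory bound
memory_bound = roofline(PEAK, BANDWIDTH, 0.125)   # 25.0 Gflop/s

# High intensity (10 flops per byte): limited by the peak
compute_bound = roofline(PEAK, BANDWIDTH, 10.0)   # 1000.0 Gflop/s

print(memory_bound, compute_bound)
```

The interesting work, and most of the tutorial, lies in determining the right inputs: the actual data traffic of a loop and the realistic in-core and bandwidth limits of the hardware.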

Get the gist of it in our flashy promo video:

PERMAVOST Workshop submission deadline approaching

March 27, 2023

PERMAVOST 2023, the 3rd Workshop on Performance EngineeRing, Modelling, Analysis, and VisualizatiOn STrategy, is calling for papers. The workshop will be held in conjunction with ACM HPDC 2023 in Orlando, FL.

Modern software engineering is getting increasingly complicated. Effective monitoring and analysis of High-Performance Computing (HPC) applications and infrastructure is critical for ongoing improvement, design, and maintenance. The development and maintenance of these applications expand beyond the realm of computer science to encompass a diverse range of experts in mathematics, science, and other engineering disciplines. Many developers from these disciplines rely increasingly on the tools created by computer scientists to analyze and optimize their code. Thus, there is a pressing need for a forum to bring together a diverse group of experts from these different communities to share their experiences, collaborate on solving challenges, and work towards advancing the field of HPC performance analysis and optimization.

Submission deadline: April 1, 2023, 11:59 pm Anywhere on Earth (AoE)
Notification: April 20, 2023
Camera ready deadline: May 12, 2023
Workshop: June 20, 2023

New tutorial on “Core-Level Performance Engineering” accepted for ICPE 2023

March 3, 2023

Our brand-new tutorial “Core-Level Performance Engineering” has been accepted as a full-day tutorial at ICPE 2023, the 14th ACM/SPEC International Conference on Performance Engineering. This tutorial concentrates on the in-core aspects of performance modeling and analysis on CPUs. We use Matt Godbolt’s Compiler Explorer and our Open-Source Architecture Code Analyzer (OSACA), which is now integrated with the Compiler Explorer, to teach the basics of code execution including pipelining, superscalarity, SIMD, intra-iteration and loop-carried dependencies, and more. Intel/AMD x86 and ARM Neon/SVE assembly code is introduced, and participants can get their hands dirty exploring the depths of machine code execution using only a web browser! Lead OSACA developer Jan Laukemann did most of the work for this exciting new event. Find the details at: https://icpe2023.spec.org/tutorials/tutorial3/.

All slides and some of the exercises are available at: http://tiny.cc/CLPE.
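To illustrate why loop-carried dependencies matter, here is a simple sketch of a textbook-style lower bound for a sum reduction: with a single accumulator, each addition has to wait for the previous one, so the add latency dominates; with several independent accumulators, the throughput limit takes over. The function name and the latency and throughput numbers are assumptions for illustration, not data for a specific CPU and not output from OSACA.

```python
def cycles_per_add(latency_cycles, adds_per_cycle, n_accumulators):
    """Lower bound on cycles per addition in a sum reduction.

    latency_cycles: latency of one add instruction
    adds_per_cycle: maximum add instructions issued per cycle
    n_accumulators: number of independent partial sums (unrolling factor)
    """
    if n_accumulators < 1:
        raise ValueError("need at least one accumulator")
    # Each dependency chain can retire one add every latency_cycles,
    # and n_accumulators chains run concurrently.
    latency_bound = latency_cycles / n_accumulators
    # Independent of dependencies, the add units have finite throughput.
    throughput_bound = 1.0 / adds_per_cycle
    return max(latency_bound, throughput_bound)


# Hypothetical core: add latency of 4 cycles, 2 adds per cycle
single = cycles_per_add(4, 2, n_accumulators=1)    # 4.0: latency bound
four = cycles_per_add(4, 2, n_accumulators=4)      # 1.0: still latency bound
eight = cycles_per_add(4, 2, n_accumulators=8)     # 0.5: throughput bound
sixteen = cycles_per_add(4, 2, n_accumulators=16)  # 0.5: no further gain

print(single, four, eight, sixteen)
```

In this simplified picture, latency times throughput (here 4 x 2 = 8) gives the number of independent accumulators needed to saturate the add units.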

ISC 2022 has started

May 30, 2022

ISC High Performance 2022 is finally here! After two years of online ISC, people can finally get together and talk face to face (albeit with one or two layers of mask tissue in between). The first “full conference” day (Sunday is traditionally reserved for tutorials) marks some notable events this year.

After a warm welcome and introduction by ISC22 Program Chair Keren Bergman, Rev Lebaredian (NVIDIA) and Michele Melchiorre (BMW) gave an enthralling keynote about how digital twins are used in industry and how they may become the one thing that you never knew about but desperately need. A digital twin is like a computational model of reality – be it a manufacturing plant, a building, or even a whole city. Such twins are used a lot in design phases, but they are rarely kept up to date over the whole life cycle of the structures they describe. With the advent of powerful AI methods, this may change as AI could close the gap between model and reality. Rev even went so far as to speculate that, given enough computing power, one could use the twin to move back and forth in time for improved insight, forecasting, or decision making.

An emotional moment (well, for a technical event at least) came with the special session for celebrating the life and work of Jack Dongarra, recipient of the 2021 ACM Turing Award. Horst Simon, Tony Hey, David E. Keyes, and Satoshi Matsuoka recalled Jack’s many achievements, the most prominent of which are the BLAS and LAPACK libraries (and their more current descendants), the initial seedling of the MPI standard, sustainable and scalable HPC benchmarks, and numerous contributions to mixed-precision linear algebra methods.

Next came Erich Strohmaier with his long-awaited presentation of the June 2022 Top500 list, in which he analyzed current trends in supercomputing using the historical data that has been available since 1993. One surprising fact about the current list is that there was never such a low turnover – only 39 systems fell out of the list, and the entry threshold of just above 1.6 Pflop/s hasn’t even changed. On the positive side, Fritz and Alex, the new NHR@FAU systems, are officially on it: Fritz is at #323, despite some network hardware still missing, and Alex is at #184 although the more than 300 NVIDIA A40 GPUs could not even be used for running LINPACK. In addition, Alex struck #16 in the Green500, in which energy efficiency in Gflop/W determines the ranking. With that, Alex is the most energy-efficient system in Germany.

Oh yes, and there’s a new #1: “Frontier” at Oak Ridge National Lab broke the Exaflop barrier. We have now officially entered the age of exascale, at least if we use the debatable LINPACK metric as the yardstick. Erich pointed out that Frontier now encompasses 25% of the total aggregated Top500 performance; grain-of-salty extrapolations indicate that this ratio may go up to 50% in 2030, but who knows which marvels await behind the closed doors of Nvidia, Intel, or AMD labs.

Gprofng is the next-generation GNU profiler

April 14, 2022

This week, Ruud van der Pas of OpenMP fame gave a talk in our NHR PerfLab seminar on gprofng, the next-generation GNU profiling tool. If you ever felt that gprof was sorely lacking features like threading support, sampling, and drilling down to source, gprofng comes to the rescue. Now you can profile code without even recompiling it, which comes in handy (not only) if you don’t have the source. It has recently been accepted as part of the Linux binutils package and will inevitably find its way into standard Linux distros. If you don’t want to wait that long, clone the development repo with

git clone git://sourceware.org/git/binutils-gdb.git

and compile it yourself. Here’s the recording of Ruud’s talk, where he explains the basic functions of gprofng and also takes a peek at upcoming features like HTML output and hardware performance counter support:

SIAM Parallel Processing 2022 Minisymposium on “Advances in Performance Modeling of Parallel Code”

December 22, 2021

Together with Alexandru Calotoiu (ETH Zurich), I am organizing a two-part minisymposium at SIAM Parallel Processing 2022. It is to take place on February 25, 2022, and it is titled “Advances in Performance Modeling of Parallel Code.” This is the abstract:

Performance modeling is an indispensable tool for the assessment, analysis, prediction, and optimization of parallel code in scientific computing and computational science. Modeling approaches can take a variety of forms, from purely analytic, first-principle models to curve fitting, machine learning, and AI-based solutions. The goals of modeling are just as diverse: Identification of bottlenecks or scaling problems, extrapolation, architectural exploration, and even the prediction of power dissipation and energy consumption can all be supported by modeling procedures. This minisymposium tries to provide an overview of the current state of the art in performance, or more generally, resource modeling of parallel code. The hardware focus will be very broad, from the node to the massively parallel level, including standard multicore systems, GPUs, and reconfigurable hardware. Contributions will cover fundamental research as well as tools development and case studies. After the minisymposium, the organizers plan to issue an open call for a journal special issue.

An impressive lineup of international speakers has been brought together:

Part 1 (February 25, 11:10 am – 12:50 pm PST)

11:10-11:30 Computational Waves in Parallel Programs and Their Impact on Performance Modeling

Ayesha Afzal, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany; Georg Hager, Erlangen National High Performance Computing Center, Germany

11:35-11:55 The Price Performance of Performance Models

Alexandru Calotoiu, ETH Zurich, Switzerland; Alexander Geiss, Benedikt Naumann, Marcus Ritter, and Felix Wolf, Technische Universität Darmstadt, Germany

12:00-12:20 perf-taint: Extracting Clean Performance Models from Tainted Programs

Marcin Copik, ETH Zurich, Switzerland

12:25-12:45 Extra-P Meets Hatchet: Towards Modeling in Performance Analytics

Sergei Shudler, Lawrence Livermore National Laboratory, U.S.

Part 2 (February 25, 3:35 pm – 5:15 pm PST)

3:35-3:55 Performance Modeling of Graph Processing Workloads

Ana Lucia Varbanescu and Merijn Verstraaten, University of Amsterdam, Netherlands

4:00-4:20 Machine Learning–enabled Scalable Performance Prediction of Scientific Codes

Stephan Eidenbenz and Nandakishore Santhi, Los Alamos National Laboratory, U.S.

4:25-4:45 Automatic Application Performance Data Collection with Caliper and Adiak

David Boehme, Lawrence Livermore National Laboratory, U.S.

SIAM PP22 will be a hybrid conference, and many details are still to be sorted out. The full conference program can be viewed at: https://meetings.siam.org/program.cfm?CONFCODE=PP22. Stay tuned for news.

Georg Hager's Blog

Random thoughts on High Performance Computing