Georg Hager's Blog

Random thoughts on High Performance Computing

Content

PERMAVOST Workshop submission deadline approaching

PERMAVOST 2023, the 3rd Workshop on Performance EngineeRing, Modelling, Analysis, and VisualizatiOn STrategy, is calling for papers. The workshop will be held in conjunction with ACM HPDC 2023 in Orlando, FL.

Modern software engineering is getting increasingly complicated. Effective monitoring and analysis of High-Performance Computing (HPC) applications and infrastructure is critical for ongoing improvement, design, and maintenance. The development and maintenance of these applications expand beyond the realm of computer science to encompass a diverse range of experts in mathematics, science, and other engineering disciplines. Many developers from these disciplines rely increasingly on the tools created by computer scientists to analyze and optimize their code. Thus, there is a pressing need for a forum to bring together a diverse group of experts between these different communities to share their experiences, collaborate on solving challenges, and work towards advancing the field of HPC performance analysis and optimization.

  • Submission deadline: April 1, 2023, 11:59 pm Anywhere on Earth (AoE)
  • Notification; April 20, 2023
  • Camera ready deadline: May 12, 2023
  • Workshop: June 20, 2023

Continue reading

New tutorial on “Core-Level Performance Engineering” accepted for ICPE 2023

ICPE 2023 LogoOur brand-new tutorial “Core-Level Performance Engineering” has been accepted as a full-day tutorial at ICPE 2023, the 14th ACM/SPEC International Conference on Performance Engineering. This tutorial concentrates on the in-core aspects of performance modeling and analysis on CPUs. We use Matt Godbolt’s Compiler Explorer and our Open-Source Architecture Code Analyzer (OSACA), which is now integrated with the Compiler Explorer, to teach the basics of code execution including pipelining, superscalarity, SIMD, intra-iteration and loop-carried dependencies, and more. Intel/AMD x86 and ARM Neon/SVE assembly code is introduced, and participants can get their hands dirty exploring the depths of machine code execution using only a web browser! Lead OSACA developer Jan Laukemann did most of the work for this exciting new event. Find the details at: https://icpe2023.spec.org/tutorials/tutorial3/.

All slides and some of the exercises are available at: http://tiny.cc/CLPE.

ISC 2022 has started

ISC High Performance 2022 is finally here! After two years of online ISC, people can finally get together and talk face to face (albeit with one or two layers of mask tissue in between). The first “full conference” day (Sunday is traditionally reserved for tutorials) marks some notable events this year.

After a warm welcome and introduction by ISC22 Program Chair Keren Bergman,  Rev Lebaredian (NVIDIA) and Michele Melchiorre (BMW) gave an enthralling keynote about how digital twins are used in industry and how they may become the one thing that you never knew about but desperately need. A digital twin is like a computational model of reality – be it a manufacturing plant, a building, or even a whole city. Such twins are used a lot in design phases, but they are rarely kept up to date over the whole life cycle of the structures they describe. With the advent of powerful AI methods, this may change as AI could close the gap between model and reality. Rev even went so far as to speculate that, given enough computing power, one could use the twin to move back and forth in time for improved insight, forecasting, or decision making.

An emotional moment (well, for a technical event at least) came with the special session for celebrating the life and work of Jack Dongarra, recipient of the 2021 ACM Turing Award. Horst Simon, Tony Hey, David E. Keyes, and Satoshi Matsuoka recalled Jack’s many achievements, the most prominent of which are the BLAS and LAPACK libraries (and their more current descendants), the initial seedling of the MPI standard, sustainable and scalable HPC benchmarks, and numerous contributions to mixed-precision linear algebra methods.

Next came Erich Strohmaier with his long-awaited presentation of the June 2022 Top500 list, in which he analyzed current trends in supercomputing using the historical data that is now available since 1993. One surprising fact about the current list is that there was never such a low turnaround – only 39 systems fell out of the list, and the entry threshold of just above 1.6 Pflop/s hasn’t even changed. On the positive side, Fritz and Alex, the new NHR@FAU systems, are officially on it: Fritz is at #323, despite some network hardware still missing, and Alex is at #184 although the more than 300 NVIDIA A40 GPUs could not even be used for running LINPACK. In addition, Alex struck #16 in the Green500, in which energy efficiency in Gflop/W determines the ranking. With that, Alex is the most energy-efficient system in Germany.

Oh yes, and there’s a new #1: “Frontier” at Oak Ridge National Lab broke the Exaflop barrier. We have now officially entered the age of exascale, at least if we use the debatable LINPACK metric as the yardstick. Erich pointed out that Frontier now encompasses 25% of the total aggregated Top500 performance; grain-of-salty extrapolations indicate that this ratio may go up to 50% in 2030, but who knows which marvels await behind the closed doors of Nvidia, Intel, or AMD labs.

 

Gprofng is the next-generation GNU profiler

This week, Ruud van der Pas of OpenMP fame gave a talk in our NHR PerfLab seminar on gprofng, the next-generation GNU profiling tool. If you ever felt that gprof was sorely lacking features like threading support, sampling, and drilling down to source, gprofng comes to rescue. Now you can profile code without even recompiling it, which comes in handy (not only) if you don’t have the source. It has recently been accepted as part of the Linux binutils package and will inevitably find its way into standard Linux distros. If you don’t want to wait that long, clone the development repo with

git clone git://sourceware.org/git/binutils-gdb.git

and compile it yourself. Here’s the recording of Ruud’s talk, where he explains the basic functions of gprofng and also takes a peek at upcoming features like HTML output and hardware performance counter support:

SIAM Parallel Processing 2022 Minisymposium on “Advances in Performance Modeling of Parallel Code”

Together with Alexandru Calotoiu (ETH Zurich), I am organizing a two-part minisymposium at SIAM Parallel Processing 2022.  It is to take place on February 25, 2022, and it is titled “Advances in Performance Modeling of Parallel Code.” This is the abstract:

Performance modeling is an indispensable tool for the assessment, analysis, prediction, and optimization of parallel code in scientific computing and computational science. Modeling approaches can take a variety of forms, from purely analytic, first-principle models to curve fitting, machine learning, and AI-based solutions. The goals of modeling are just as diverse: Identification of bottlenecks or scaling problems, extrapolation, architectural exploration, and even the prediction of power dissipation and energy consumption can all be supported be modeling procedures. This minisymposium tries to provide an overview of the current state of the art in performance, or more generally, resource modeling of parallel code. The hardware focus will be very broad, from the node to the massively parallel level, including standard multicore systems, GPUs, and reconfigurable hardware. Contributions will cover fundamental research as well as tools development and case studies. After the minisymposium, the organizers plan to issue an open call for a journal special issue.

An impressive lineup of international speakers have been brought together:

Part 1 (February 25, 11:10 am – 12:50 pm PST)

11:10-11:30 Computational Waves in Parallel Programs and Their Impact on Performance Modeling

Ayesha Afzal, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany; Georg Hager, Erlangen National High Performance Computing Center, Germany

11:35-11:55 The Price Performance of Performance Models

Alexandru Calotoiu, ETH Zurich, Switzerland; Alexander Geiss, Benedikt Naumann, Marcus Ritter, and Felix Wolf, Technische Universität Darmstadt, Germany

12:00-12:20 perf-taint: Extracting Clean Performance Models from Tainted Programs

Marcin Copik, ETH Zurich, Switzerland

12:25-12:45 Extra-P Meets Hatchet: Towards Modeling in Performance Analytics

Sergei Shudler, Lawrence Livermore National Laboratory, U.S.

Part 2 (February 25, 3:35 pm – 5:15 pm PST)

3:35-3:55 Performance Modeling of Graph Processing Workloads

Ana Lucia Varbanescu and Merijn Verstraaten, University of Amsterdam, Netherlands

4:00-4:20 Machine Learning–enabled Scalable Performance Prediction of Scientific Codes

Stephan Eidenbenz and Nandakishore Santhi, Los Alamos National Laboratory, U.S.

4:25-4:45 Automatic Application Performance Data Collection with Caliper and Adiak

David Boehme, Lawrence Livermore National Laboratory, U.S.

 

SIAM PP22 will be a hybrid conference, and many details are still to be sorted out. The full conference program can be viewed at: https://meetings.siam.org/program.cfm?CONFCODE=PP22. Stay tuned for news.

LIKWID 5.2.1 is out!

LIKWID stickersLIKWID 5.2.1 is out! This bugfix release addresses a lot of small and not-so-small issues:

  • Support for Intel Rocket Lake and AMD Zen3 variant (Family 19, Model 0x50)
  • Fix for perf_event multiplexing (important!)
  • Fix for a potential deadlock in MarkerAPI (thx @jenny-cheung)
  • Build and runtime fixes for Nvidia GPU backend, updates for CUDA test codes
  • likwid-bench “peakflops” kernel for ARMv8
  • Updates for AMD Zen1/2/3 event lists and groups
  • Support spaces in MarkerAPI region tags (thx @jrmadsen)
  • Use ‘online’ cpulist instead of ‘present’
  • Check PID if given through –perfpid
  • Intel Icelake: OFFCORE_RESPONSE events
  • likwid-mpirun: Set MPI type for SLURM automatically
  • likwid-mpirun: Fix skip mask for OpenMPI
  • Fix for triad_sve* benchmarks

You can download the new version from the FTP or GitHub.

A riddle in D/A conversion – and its solution

This post is not about high performance computing, for a change; it does, however, have a distinct performance-related vibe, so I think it qualifies for this blog. Besides, I had so much fun solving this riddle that I just cannot keep it to myself, so here goes.

D/A conversion with Arduino and ZN426

I was always interested in digital signal processing, and recently I decided to look into an old drawer and found a couple of ZN426E8 8-bit D/A converters that I had bought in the 80s. I asked myself whether a standard Arduino Uno with its measly 16 MHz ATmega 328 processor would be able to muster enough performance to produce waveforms in the audio range (up to at least a couple of kHz) by copying bytes from a pre-computed array to the D/A converter. The ZN426 itself wouldn’t be a bottleneck in this because it’s just an R-2R ladder with a worst-case settling time of 2 µs, which leads to a Nyquist frequency of 250 kHz. Ample headroom for audio. At 16 MHz, 2 µs means 32 clock cycles, so it shouldn’t really be a problem for the Arduino as well even with compiled code. Or so I thought.

ZN426E8

Fig. 1: The input lines of the D/A converter are numbered with bit 8 being the LSB (image cut from the Plessey data sheet for the ZN426E8)

So I hooked up the digital output lines of port D on the Uno to the input lines of the D/A converter and added the other necessary components according to the minimal example in the data sheet (basically a resistor and a capacitor needed by the internal 2.55 V reference). Unfortunately, the ZN426 data sheet uses an unusual numbering of the bits: bit 1 is the MSB, and bit 8 is the LSB. I missed that and inadvertently mirrored the bits, which led to an “interesting” waveform the first time I fired up the program. Now instead of fixing the circuit, I thought I could do that in software; just mirror the bits before writing them out to PORTD:

// output bytes in out[]
for(int i=0; i<N; i++) {
  unsigned char d=0;
  // mirror the bits in out[i], result in d
  for(int j=0; j<=7; ++j){
    unsigned char lmask = 1 << j, rmask = 128 >> j;
    if(out[i] & lmask) d |= rmask;
  }
  PORTD = out[i];
}

I knew that this was just a quick hack and that one could do it faster, but I just wanted the waveform to look right, and then care for performance. By the way, shift and mask operations should be fast, right? As a first test I wanted to approximate a square wave using Fourier synthesis:

f(t)=\sin t +\frac{1}{3}\sin 3t+\frac{1}{5}\sin 5t+\ldots

Fourier-synthesized square wave (first 20 harmonics) with wrong duty cylcle

Fig. 2: Fourier-synthesized square wave (first 20 harmonics) with wrong duty cycle

The problem

Of course, the resulting function was then shifted and scaled appropriately to map to the range {0,…,255} of the D/A input. As a test, I used 200 data points per period and stopped at 20 harmonics (10 of which were nonzero). On first sight, the resulting waveform looked OK on the scope (see Fig. 2): the usual Gibbs phenomenon near the discontinuities, the correct number of oscillations. However, the duty cycle was not the expected 1:1 but more like 1.3:1. I first assumed some odd programming error, so I simplified the problem by using a single Fourier component, a simple sine wave. Figure 3 shows the result: It was obviously not a correct sine function, since the “arc” at the top is clearly blunter than that at the bottom. So it wasn’t some bug in the Fourier synthesis but already present in the fundamental harmonic.

This sine function is not quite a sine function

Fig. 3: This sine function is not quite a sine function

Bug hunting

In order to track the bug I made some experiments. First I compiled an adapted version of the code on my PC so I could easily inspect and plot the data in the program. It turned out that the math was correct – when plotting the out[] array, I got exactly what was expected, without distortions. Although this was reassuring, it left basically two options: Either it was an “analog” problem, i.e., something was wrong with the linearity of the D/A converter or with its voltage reference, or the timing was somehow off so that data points with high voltages took longer to output.

To check the voltage reference, I tried to spot fluctuations with the scope hooked up to it, but there was nothing – the voltage was at a solid 2.5 V (ish), and there was no discernible AC component visible within the limitations of my scope (a Hameg HM 208, by the way); a noise component of 20 mVpp would have been easily in range. The next test involved the timing: I inserted a significant delay (about 200 µs) after writing the byte to PORTD. This reduced the output frequency quite a bit since the original code spat out one byte about every 20 µs. Lo and behold – the asymmetries were gone. At this point it should have dawned on me, but I still thought it was an analog problem, somehow connected to the speed at which the D/A converter settles to the correct output voltage. However, the specs say that this should take at most 2 µs, so with one new value every 20 µs I should have been fine.

Correct duty cycle after rewiring the D/A input lines

Fig. 4: Correct duty cycle after rewiring the D/A input lines

The solution

In the end, out of sheer desperation I decided to take out the last bit of complexity and rewired the D/A input bits so that the byte-mirror code became obsolete. Figure 4 shows the resulting output voltage for the square wave experiment. There are two major observations here: (1) The duty cycle was now correct at 1:1, and (2) the performance of the code was greatly improved. Instead of one value every 20 µs it now wrote one value every 550 ns (about 9 clock cycles). Problem solved – but why?

The observation that the nonlinearity disappeared when I inserted a delay was the key. In the byte-mirror code above, the more bits in the input byte out[i] are set to one, the more often the if() condition is true and the or-masking of the output byte d takes place. In other words, the larger the number of one-bits, the longer it takes to put together the output byte. And since voltages near the maximum are generated by numbers close to 255 (i.e., with lots of set bits), a phase in the output waveform with a large average voltage used a longer time step than one with a low average voltage. In other words, the time step had a component that was linear in the Hamming weight (or popcount) of the binary output value.

Of course I did some checks to confirm this hypothesis. Rewriting the byte-mirror code without an if condition (instead of rewiring the data lines) fixed the problem as well:

for(int i=0; i<N; i++) {
  unsigned char d=0,lmask=1,rmask=128,v;
  for(unsigned char j=0; j<=7; ++j){
    v = out[i] & lmask;
    d |= (v<<(7-j))>>j;
  }
  PORTD = d;
}

There are many ways to do this, but all resulted in (1) correct output and (2) slow code, more than an order of magnitude slower than the “plain” code with correctly wired data lines. Whatever I tried to mirror the bits, it always took a couple hundred cycles.

In summary, a problem which looked like a nonlinearity in a D/A converter output was actually caused by my code taking longer when more bits were set in the output, resulting in a larger time step for higher voltages.

Odds and ends

At a time step of roughly half a microsecond, the behavior of the D/A converter and the Arduino become visible

Fig. 5: At a time step of roughly half a microsecond, the behavior of the D/A converter and the Arduino become visible

Without mirroring, the time step between successive updates is much smaller than the worst-case settling time of the D/A converter. Figure 5 shows a pure sine function again. One can clearly discern spikes at the points where lots of input bits change concurrently, e.g., at half of the maximum voltage. I don’t know if this is caused by the outputs of the Arduino port being not perfectly synchronous, or if it is an inherent property of the converter. In practice it probably should not matter since there will be some kind of buffer/amplifier/offset compensation circuit that acts as a low-pass filter.

Figure 5 shows yet another oddity: From time to time, the Arduino seems to take a break (there is a plateau at about 75% of the width). This first showed up on the scope as a “ghost image,” i.e., there was a second, shifted version of the waveform on the screen. Fortunately, my Hameg scope has a builtin digital storage option, and after a couple of tries I could capture what is seen in Fig. 5. These plateaus are roughly 6-7 µs in length and turn up regularly every 1.1 ms or so. I checked that this is not something that occurs in between calls to the loop() function; it is really regular, and asynchronous to whatever else happens on the Arduino. I don’t know exactly what it is, but the frequency of these plateaus is suspiciously close to the frequency of the PWM outputs on pins 5 and 6 of the Uno (960 Hz), so it might have something to do with that.

You can download the code of the sketch here: dftsynth.c Remember that this is just a little test code. It only implements  a pure sine transform. I am aware that the Arduino Uno with its limited RAM is not the ideal platform for actually producing audio output.

Please drop me a note if you have any comments or questions.

Write-allocate evasion has finally arrived at Intel – or has it?

Intel’s Xeon Ice Lake CPU has finally caught up with AMD’s Rome in terms of full-chip peak performance and memory bandwidth. And, at long last, they have also fixed the Port 7 AGU problem I wrote about two years ago: Ice Lake now has two fully capable Store AGUs and an additional Store unit (although you can only do two stores concurrently if they go to the same cache line). There is one thing, however, that has appeared in recent years on ARM-based CPUs: automatic write-allocate elimination. We saw this for the first time on Cavium/Marvell’s ThunderX2 [1], although it was presumably present before on other ARM-based chips as well.

Basically what happens is that in situations where a write-allocate operation would be necessary, the hardware detects if the whole cache line is going to be overwritten. If it is, there’s no point in reading the cache line at all, and it can just be claimed in the cache right away. This saves 1/3 of the memory traffic on STREAM Copy, leading to a 50% performance gain if the saturated bandwidth doesn’t change. On the Fujitsu A64FX, this is not automatic but can be triggered by a special instruction, to the same effect [2]. Up to now, Intel followed a different path and supported nontemporal stores, which also avoid the write-allocate but in a different way: The cache line is stored “directly” to memory (actually through a write-combine buffer) so that it does not end up in the cache in the first place. Which strategy is better depends on the application: If the data is not needed soon, nontemporal stores may be better because the stored cache lines do not pollute the cache.

With Ice Lake, Intel provides for the first time a working mechanism for write-allocate evasion that’s similar to what the TX2 did. Intel calls it “SpecI2M,” and it’s described on slide 12 of a HotChips 2020 presentation:

SpecI2M optimization: Convert RFO to specI2M when memory subsystem is heavily loaded

  • Reduces mem bandwidth demand on streaming WLs that do full cache line writes (25% efficiency increase)

This is quite cryptic, and there’s a lot of speculation about what SpecI2M actually does, but it’s actually simple to figure out. likwid-bench is the perfect tool for that. The copy_avx512 kernel provided with it is as simple as it gets:

vmovapd    zmm1, [STR0 + GPR1 * 8]
vmovapd    zmm2, [STR0 + GPR1 * 8 + 64]
vmovapd    zmm3, [STR0 + GPR1 * 8 + 128]
vmovapd    zmm4, [STR0 + GPR1 * 8 + 192]
vmovapd    [STR1 + GPR1 * 8]     , zmm1
vmovapd    [STR1 + GPR1 * 8 + 64], zmm2
vmovapd    [STR1 + GPR1 * 8 + 128], zmm3
vmovapd    [STR1 + GPR1 * 8 + 192], zmm4

I’ve omitted the loop mechanics here (and don’t worry about the base/index registers; the likwid-bench code generator substitutes them with the real names). This is the version with normal stores (vmovapd). The copy_mem_avx512 kernel uses nontemporal stores (vmovntpd) instead.

Figure 1: STREAM copy standard stores (blue, red) vs. nontemporal stores on Xeon Platinum 8358

Figure 2: Ratio of actual vs. reported memory bandwidth

I’ve run scaling tests with these kernels using a 2 GB working set on a Xeon Platinum 8358 (Ice Lake 32 cores, 2.6 GHz base frequency, SNC off, THP=always, numa_balancing=0):

$ likwid-bench -t copy_avx512 -W S0:2GB:${NUM_THREADS}:1:2

For standard stores, Figure 1 shows the reported bandwidth in blue and the actual bus bandwidth (measured with likwid-perfctr) in red. With a few cores, the actual bandwidth is exactly 50% larger than reported, as expected. This can be seen in Figure 2 where I plotted this ratio (basically red divided by blue in Fig. 1). However, the ratio gradually goes down as the number of cores goes up. It’s as if the write-allocate is avoided, but not based on the (sole) fact that full cache lines are overwritten but based on the actual bus utilization! At 29 cores and above, the ratio is finally down at 1.0, so the write-allocate is fully gone.

I don’t like that behavior.

Figure 3: Reported bandwidth of the copy_avx512 kernel on the 38-core Ice Lake at KIT (HoreKa)

What I do like is a nice saturation curve like the one we see for NT stores (gray in Fig. 1). Not everyone can use NT stores, though; they only exist in SIMD variants (not quite, but for practical purposes they do), and you have to convince the compiler to use them unless you want to employ intrinsics or assembly. I looked for a way to switch off or alter the behavior of SpecI2M in the machine’s BIOS, but to no avail. [Update 2022: There is an MSR bit which can switch SpecI2M off or on; however, Intel does not disclose it publicly so we can’t include it in an open-source tool. Duh!]

Even worse, there are machines on which SpecI2M acts even more weirdly. The HoreKa cluster at KIT has 38-core Ice Lake Xeon Platinum 8368 CPUs. Figure 3 shows the reported bandwidth of the copy_avx512 loop vs. cores.  Here, the write-allocate evasion mechanism seems to kick in only beyond 30 cores, when the memory bandwidth is already very close to saturation (and it has been this way at 20 cores already). This means that in order to get the full socket performance, I have to use (almost) all cores although I could get away with much fewer if only the SpecI2M were more aggressive. I wonder what speaks against letting it fire already from core 1. What harm could it do? And why can I not just turn it off?

To make sure that this is not just a property of the specific benchmark setup, here are some additional observations:

  • The working set has no influence on the behavior. 10 GB instead of 2 GB make no difference whatsoever.
  • It’s not specific to STREAM Copy but it shows with all streaming loops that have store misses (STREAM Triad, Schönauer Triad, etc.). Of course, the higher the load/store ratio, the smaller the effect.
  • Turbo mode was switched off in these experiments, but it makes no difference if it’s on (apart from the higher single-core bandwidth, obviously).
  • It’s not a specific quirk of likwid-bench; the same behavior can be observed with code that was compiled from a high-level language, such as Jan Eitzinger’s TheBandwidthBenchmark or the original McCalpin STREAM.

Although it makes teaching people about memory traffic harder, I still see SpecI2M as a step in the right direction. We certainly have to ask additional questions: At which cache level is the cache line claimed? Does the write-allocate evasion also work with more realistic code such as a stencil smoother? Great questions. Stay tuned.

 

[1] J. Hofmann, C. L. Alappat, G. Hager, D. Fey, and G. Wellein: Bridging the Architecture Gap: Abstracting Performance-Relevant Properties of Modern Server Processors. Supercomputing Frontiers and Innovations 7(2), 54-78, July 2020. Available with Open Access. DOI: 10.14529/jsfi200204.

[2] C. L. Alappat, N. Meyer, J. Laukemann, T. Gruber, G. Hager, G. Wellein, and T. Wettig: ECM modeling and performance tuning of SpMV and Lattice QCD on A64FX. Concurrency and Computation: Practice and Experience, e6512 (2021). Available with Open Access. DOI: 10.1002/cpe.6512.

IACS Stony Brook seminar talk available

On October 14, 2021 I gave an invited online talk at Stony Brook University‘s Institute for Advanced Computational Science (IACS). I talked about white/gray-box approaches to performance modeling and how they can fail in interesting ways on highly parallel systems because of desynchronization effects. The slides and a video recording are now available:

Title: From numbers to insight via performance models

Abstract: High-performance parallel computers are complex systems. There seems to be a general consensus among developers that the performance of application programs is to be taken for granted, and that it cannot really be understood in terms of simple rules and models. This talk is about using analytic performance models to make sense of performance numbers. By means of examples from computational science, I will motivate that it makes a lot of sense to try and set up performance models even if their accuracy is sometimes limited. In fact, it is when a model yields false predictions that we learn more about the problem because our assumptions are challenged. I will start with a general categorization of performance models and then turn to ECM and Roofline models for loop-based code on multicore CPUs. Going beyond the compute node level and adding communication models to the mix, I will show how stacking models on top of each other may not work as intended but instead open new insights and a fresh view on how massively parallel code is executed.