# ISC High Performance 2022 deadlines extended!

Get into gear! The research paper and workshop deadlines of ISC High Performance 2022 have just been updated:

• Research papers: November 29 (abstracts) / December 6 (papers)
• Workshops: November 30 (with proceedings) / January 27 (without proceedings)

ISC 2022 takes place from May 29 to June 2 at Messe Hamburg. Ana Lucia Varbanescu (University of Amsterdam) and Abhinav Bhatele (University of Maryland) are in charge of the paper track. Keren Bergman from Columbia University is the 2022 Program Chair.

# Write-allocate evasion has finally arrived at Intel – or has it?

Intel’s Xeon Ice Lake CPU has finally caught up with AMD’s Rome in terms of full-chip peak performance and memory bandwidth. And, at long last, they have also fixed the Port 7 AGU problem I wrote about two years ago: Ice Lake now has two fully capable Store AGUs and an additional Store unit (although you can only do two stores concurrently if they go to the same cache line). There is one thing, however, that has appeared in recent years on ARM-based CPUs: automatic write-allocate elimination. We saw this for the first time on Cavium/Marvell’s ThunderX2 [1], although it was presumably present before on other ARM-based chips as well.

Basically what happens is that in situations where a write-allocate operation would be necessary, the hardware detects if the whole cache line is going to be overwritten. If it is, there’s no point in reading the cache line at all, and it can just be claimed in the cache right away. This saves 1/3 of the memory traffic on STREAM Copy, leading to a 50% performance gain if the saturated bandwidth doesn’t change. On the Fujitsu A64FX, this is not automatic but can be triggered by a special instruction, to the same effect [2]. Up to now, Intel followed a different path and supported nontemporal stores, which also avoid the write-allocate but in a different way: The cache line is stored “directly” to memory (actually through a write-combine buffer) so that it does not end up in the cache in the first place. Which strategy is better depends on the application: If the data is not needed soon, nontemporal stores may be better because the stored cache lines do not pollute the cache.
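
As a worked version of the 1/3 and 50% figures, here is the per-iteration traffic for a copy loop `a[i] = b[i]`, assuming 8-byte double-precision elements and a saturated memory bandwidth $b_S$ that is the same in both cases (the symbols $V$, $P$, and $b_S$ are introduced here for illustration only):

$$V_\mathrm{WA} = \underbrace{8}_{\text{load } b} + \underbrace{8}_{\text{write-allocate } a} + \underbrace{8}_{\text{store } a} = 24~\text{bytes}, \qquad V_\mathrm{noWA} = 8 + 8 = 16~\text{bytes}$$

$$\frac{V_\mathrm{WA} - V_\mathrm{noWA}}{V_\mathrm{WA}} = \frac{8}{24} = \frac{1}{3}, \qquad \frac{P_\mathrm{noWA}}{P_\mathrm{WA}} = \frac{b_S / V_\mathrm{noWA}}{b_S / V_\mathrm{WA}} = \frac{24}{16} = 1.5$$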

With Ice Lake, Intel provides for the first time a working mechanism for write-allocate evasion that’s similar to what the TX2 did. Intel calls it “SpecI2M,” and it’s described on slide 12 of a HotChips 2020 presentation:

SpecI2M optimization: Convert RFO to specI2M when memory subsystem is heavily loaded

• Reduces mem bandwidth demand on streaming WLs that do full cache line writes (25% efficiency increase)

This is quite cryptic, and there’s a lot of speculation about what SpecI2M actually does, but it’s easy to figure out. likwid-bench is the perfect tool for that. The copy_avx512 kernel provided with it is as simple as it gets:

vmovapd    zmm1, [STR0 + GPR1 * 8]
vmovapd    zmm2, [STR0 + GPR1 * 8 + 64]
vmovapd    zmm3, [STR0 + GPR1 * 8 + 128]
vmovapd    zmm4, [STR0 + GPR1 * 8 + 192]
vmovapd    [STR1 + GPR1 * 8]     , zmm1
vmovapd    [STR1 + GPR1 * 8 + 64], zmm2
vmovapd    [STR1 + GPR1 * 8 + 128], zmm3
vmovapd    [STR1 + GPR1 * 8 + 192], zmm4

I’ve omitted the loop mechanics here (and don’t worry about the base/index registers; the likwid-bench code generator substitutes them with the real names). This is the version with normal stores (vmovapd). The copy_mem_avx512 kernel uses nontemporal stores (vmovntpd) instead.

Figure 1: STREAM copy standard stores (blue, red) vs. nontemporal stores on Xeon Platinum 8358

Figure 2: Ratio of actual vs. reported memory bandwidth

I’ve run scaling tests with these kernels using a 2 GB working set on a Xeon Platinum 8358 (Ice Lake 32 cores, 2.6 GHz base frequency, SNC off, THP=always, numa_balancing=0):

$ likwid-bench -t copy_avx512 -W S0:2GB:${NUM_THREADS}:1:2

For standard stores, Figure 1 shows the reported bandwidth in blue and the actual bus bandwidth (measured with likwid-perfctr) in red. With a few cores, the actual bandwidth is exactly 50% larger than reported, as expected. This can be seen in Figure 2 where I plotted this ratio (basically red divided by blue in Fig. 1). However, the ratio gradually goes down as the number of cores goes up. It’s as if the write-allocate is avoided, but not based on the (sole) fact that full cache lines are overwritten but based on the actual bus utilization! At 29 cores and above, the ratio is finally down at 1.0, so the write-allocate is fully gone.
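
The ratio in Figure 2 can be turned into an estimate of how many write-allocates were actually avoided. The following is a minimal sketch for the copy kernel only, assuming that the reported bandwidth counts just the explicit load and store streams and that the measured bus traffic contains nothing beyond loads, stores, and write-allocates. The function name and the example numbers are made up for illustration; they are not measurements.

```python
def evaded_write_allocate_fraction(actual_bw, reported_bw):
    """Estimate the fraction of write-allocates avoided in a copy kernel.

    For copy (one load stream, one store stream) the traffic ratio is
        actual / reported = 1 + 0.5 * (1 - f)
    where f is the evaded fraction: f = 0 gives 1.5, f = 1 gives 1.0.
    """
    if reported_bw <= 0:
        raise ValueError("reported bandwidth must be positive")
    ratio = actual_bw / reported_bw
    fraction = (1.5 - ratio) / 0.5
    # Clamp, because measurement noise can push the ratio slightly
    # outside the interval [1.0, 1.5].
    return min(1.0, max(0.0, fraction))


if __name__ == "__main__":
    # Hypothetical bandwidth pairs in GB/s, not taken from the figures.
    for actual, reported in [(150.0, 100.0), (125.0, 100.0), (100.0, 100.0)]:
        f = evaded_write_allocate_fraction(actual, reported)
        print(f"actual={actual} reported={reported} evaded={f:.2f}")
```

A ratio of 1.5 then maps to no evasion at all and a ratio of 1.0 to full evasion, which is the range the curve in Figure 2 sweeps through as the core count grows.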

I don’t like that behavior.

Figure 3: Reported bandwidth of the copy_avx512 kernel on the 38-core Ice Lake at KIT (HoreKa)

What I do like is a nice saturation curve like the one we see for NT stores (gray in Fig. 1). Not everyone can use NT stores, though; they only exist in SIMD variants (not quite, but for practical purposes they do), and you have to convince the compiler to use them unless you want to employ intrinsics or assembly. I looked for a way to switch off or alter the behavior of SpecI2M in the machine’s BIOS, but to no avail.

Even worse, there are machines on which SpecI2M acts even more weirdly. The HoreKa cluster at KIT has 38-core Ice Lake Xeon Platinum 8368 CPUs. Figure 3 shows the reported bandwidth of the copy_avx512 loop vs. cores. Here, the write-allocate evasion mechanism seems to kick in only beyond 30 cores, when the memory bandwidth is already very close to saturation (and it has been this way at 20 cores already). This means that in order to get the full socket performance, I have to use (almost) all cores although I could get away with far fewer if only SpecI2M were more aggressive. I wonder what speaks against letting it fire already from core 1. What harm could it do? And why can I not just turn it off?

To make sure that this is not just a property of the specific benchmark setup, here are some additional observations:

• The working set size has no influence on the behavior. Using 10 GB instead of 2 GB makes no difference whatsoever.
• It’s not specific to STREAM Copy but it shows with all streaming loops that have store misses (STREAM Triad, Schönauer Triad, etc.). Of course, the higher the load/store ratio, the smaller the effect.
• Turbo mode was switched off in these experiments, but it makes no difference if it’s on (apart from the higher single-core bandwidth, obviously).
• It’s not a specific quirk of likwid-bench; the same behavior can be observed with code that was compiled from a high-level language, such as Jan Eitzinger’s TheBandwidthBenchmark or the original McCalpin STREAM.
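
The remark about the load/store ratio can be made concrete with a small sketch. It computes the largest possible ratio of actual to reported traffic for a streaming kernel, assuming every store stream misses the cache and triggers a write-allocate, and that all streams have the same element size. The stream counts follow the usual definitions of the kernels (Copy `a = b`, STREAM Triad `a = b + s*c`, Schönauer Triad `a = b + c*d`); the function and dictionary names are made up for illustration.

```python
def max_traffic_ratio(load_streams, store_streams):
    """Actual/reported traffic if every store causes a write-allocate."""
    explicit = load_streams + store_streams
    with_write_allocate = explicit + store_streams
    return with_write_allocate / explicit


# (load streams, store streams) per kernel
KERNELS = {
    "STREAM Copy": (1, 1),
    "STREAM Triad": (2, 1),
    "Schoenauer Triad": (3, 1),
}

if __name__ == "__main__":
    for name, (loads, stores) in KERNELS.items():
        ratio = max_traffic_ratio(loads, stores)
        print(f"{name:18s} ratio without evasion: {ratio:.3f}")
```

Under these assumptions the ratio drops from 1.5 for Copy to 4/3 for the STREAM Triad and 1.25 for the Schönauer Triad, so there is less to gain from write-allocate evasion as more load streams are added.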

Although it makes teaching people about memory traffic harder, I still see SpecI2M as a step in the right direction. We certainly have to ask additional questions: At which cache level is the cache line claimed? Does the write-allocate evasion also work with more realistic code such as a stencil smoother? Great questions. Stay tuned.

[1] J. Hofmann, C. L. Alappat, G. Hager, D. Fey, and G. Wellein: Bridging the Architecture Gap: Abstracting Performance-Relevant Properties of Modern Server Processors. Supercomputing Frontiers and Innovations 7(2), 54-78, July 2020. Available with Open Access. DOI: 10.14529/jsfi200204.

[2] C. L. Alappat, N. Meyer, J. Laukemann, T. Gruber, G. Hager, G. Wellein, and T. Wettig: ECM modeling and performance tuning of SpMV and Lattice QCD on A64FX. Concurrency and Computation: Practice and Experience, e6512 (2021). Available with Open Access. DOI: 10.1002/cpe.6512.

# IACS Stony Brook seminar talk available

On October 14, 2021, I gave an invited online talk at Stony Brook University’s Institute for Advanced Computational Science (IACS). I talked about white/gray-box approaches to performance modeling and how they can fail in interesting ways on highly parallel systems because of desynchronization effects. The slides and a video recording are now available:

Abstract: High-performance parallel computers are complex systems. There seems to be a general consensus among developers that the performance of application programs is to be taken for granted, and that it cannot really be understood in terms of simple rules and models. This talk is about using analytic performance models to make sense of performance numbers. By means of examples from computational science, I will motivate that it makes a lot of sense to try and set up performance models even if their accuracy is sometimes limited. In fact, it is when a model yields false predictions that we learn more about the problem because our assumptions are challenged. I will start with a general categorization of performance models and then turn to ECM and Roofline models for loop-based code on multicore CPUs. Going beyond the compute node level and adding communication models to the mix, I will show how stacking models on top of each other may not work as intended but instead open new insights and a fresh view on how massively parallel code is executed.

# Upcoming: 38th VI-HPS Online Tuning Workshop, March 1-3, 2021

It is our pleasure to announce the 38th VI-HPS Tuning Workshop, organized by NHR@FAU. FAU is a member of VI-HPS, the “Virtual Institute – High Productivity Supercomputing.” The mission of VI-HPS is to improve the quality and accelerate the development process of complex simulation programs in science and engineering that are being designed for the most advanced parallel computer systems.

To this end, VI-HPS organizes a series of tuning workshops that introduce advanced performance analysis tools. This workshop will:

• give an overview of the VI-HPS programming tools suite,
• explain the functionality of individual tools, and how to use them effectively,
• offer hands-on experience and expert assistance using the tools.

In this particular event, we will cover the tools TAU, MAQAO, Score-P, Paraver/Extrae/Dimemas, and Extra-P. On completion, participants will be familiar with common performance analysis and diagnosis techniques and how they can be employed in practice. Those who prepared their own application test cases will have been coached in the tuning of their measurement and analysis, and provided with optimization suggestions.

Important: Note that this workshop is aimed at HPC developers. Participants must be familiar with handling a Linux environment over an SSH connection, basic parallel programming, and working with a batch system. There will be no time to teach these topics during the workshop.

Workshop dates: March 1-3, 2021, 9:00-17:00

More information (agenda, registration) is available on the workshop page. You can register directly by sending an e-mail to georg.hager@fau.de with the following information:

Participation is free of charge. Please register only if you are really planning to attend. No-shows will be blacklisted and excluded from future events.

# Tutorial: Empirical Roofline model with LIKWID

Thomas Gruber (a.k.a. TomTheBear), the main developer of the LIKWID tool suite, has published a short tutorial about constructing empirical Roofline models with likwid-perfctr.  An empirical Roofline model uses measurements of computational intensity and performance to compare the resource utilization of running code with the limits set by the hardware.

Tutorial: Empirical Roofline Model

This is something that often comes up as a question in our node-level or tools courses. Keep in mind that the computational intensity can also be predicted analytically if you know enough about the loop(s) in your application and the properties of the hardware. Comparing the analytical prediction with the measurement and the machine limits is a powerful way to analyze the performance of code. You can learn about this and more in one of our Node-Level Performance Engineering tutorials.
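
To illustrate the comparison, here is a minimal sketch of the basic Roofline formula, where the performance limit is the minimum of the peak performance and the product of computational intensity and memory bandwidth. All machine numbers and the "measured" values are hypothetical placeholders, not results from any real system, and the function names are invented for this example.

```python
def roofline_limit(intensity, peak_gflops, mem_bw_gbs):
    """Roofline bound in Gflop/s for an intensity given in flop/byte."""
    return min(peak_gflops, intensity * mem_bw_gbs)


def fraction_of_roofline(measured_gflops, measured_intensity,
                         peak_gflops, mem_bw_gbs):
    """How close a measured run comes to its Roofline limit."""
    limit = roofline_limit(measured_intensity, peak_gflops, mem_bw_gbs)
    return measured_gflops / limit


if __name__ == "__main__":
    PEAK_GFLOPS = 2000.0   # hypothetical peak performance
    MEM_BW_GBS = 200.0     # hypothetical saturated memory bandwidth

    # Analytic prediction for a = b + s*c with write-allocate:
    # 2 flops per iteration, 4 * 8 bytes of memory traffic.
    predicted_intensity = 2 / 32
    print(roofline_limit(predicted_intensity, PEAK_GFLOPS, MEM_BW_GBS))

    # Hypothetical measurement to compare against the limit.
    print(fraction_of_roofline(10.0, 0.0625, PEAK_GFLOPS, MEM_BW_GBS))
```

If the measured intensity deviates from the analytic prediction, that is usually the more interesting finding, since it points to excess or missing data traffic.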

# LIKWID 5.1 released

We are happy to announce a new major release 5.1.0 of LIKWID. This release adds support for the latest and upcoming architectures. Besides numerous bug fixes, these are the major new features:

• Support for Intel Icelake desktop (Core + Uncore)
• Support for Intel Icelake server (Core only)
• Support for Intel Tigerlake desktop (Core only)
• Support for Intel Cannon Lake (Core only)
• Support for Nvidia GPUs with compute capability >= 7.0 (CUPTI Profiling API)
• Initial support for Fujitsu A64FX (Core) including SVE assembly benchmarks
• Support for ARM Neoverse N1 (AWS Graviton 2)
• Support for AMD Zen3 (Core + Uncore but without any events)
• Fortran 90 interface for NvMarkerAPI (update)

We want to thank Intel, AMD, AWS and the University of Regensburg for their support.

# EoCoE webinar on A64FX

Our friends from the “EoCoE-II” project have invited us to share our results about the new A64FX processor. Attendance is free and open to everyone. Please register using the link given below.

Title: The A64FX processor: Understanding streaming kernels and sparse matrix-vector multiplication

Date: November 18, 2020, 10:00 a.m. CET

Speakers: Christie L. Alappat and Georg Hager (RRZE)

Registration URL: https://attendee.gotowebinar.com/register/3926945771611115789

Abstract:  The A64FX CPU powers the current #1 supercomputer on the Top500 list. Although it is a traditional cache-based multicore processor, its peak performance and memory bandwidth rival accelerator devices. Generating efficient code for such a new architecture requires a good understanding of its performance features. Using these features, the Erlangen Regional Computing Center (RRZE) team will detail how they construct the Execution-Cache-Memory (ECM) performance model for the A64FX processor in the FX700 supercomputer and validate it using streaming loops. They will describe how the machine model points to peculiarities in the microarchitecture to keep in mind when optimizing applications, and how, applying the ECM model to sparse matrix-vector multiplication (SpMV), they motivate why the CRS matrix storage format is inappropriate and how the SELL-C-sigma format can achieve bandwidth saturation for SpMV. In this context, they will also look into some code optimization strategies that are relevant for A64FX and compare SpMV performance with AMD Rome, Intel Cascade Lake and NVIDIA V100.

This webinar is organized by the European Energy-Oriented Center of Excellence (EoCoE). A video recording is now available on the EoCoE YouTube channel:

# SC20 tutorial “Node-Level Performance Engineering”

Our most popular tutorial was accepted again for the SC20 conference in Atlanta! SC is a 100% virtual event this year. The tutorial will be airing on November 9 and 10 as a number of pre-recorded presentations and live Q&A sessions. There’s still time to register: https://show.jspargo.com/sc20/

# LIKWID 5.0.2 released

We are happy to announce a new release 5.0.2 of LIKWID. It is mainly a bugfix release, but it also has some important updates for modern architectures (IBM Power9, AMD Zen[2]). If you want to use LIKWID on AMD Zen/Zen2 systems, we highly recommend updating. Thanks to HLRS and LANL for valuable input.

Here is the full Changelog:

• Fix memory leak in calc_metric()
• New peakflops benchmarks in likwid-bench
• Fix for NUMA domain handling
• Improvements for perf_event backend
• Fix for perfctr and powermeter with perf_event backend
• Fix for likwid-mpirun for SLURM with cpusets
• Fix for likwid-setFrequencies in cpusets
• Update for POWER9 event list
• Updates for AMD Zen, Zen+ and Zen2 (events, groups)
• Fix for Intel Uncore events with same name for different devices
• Fix for file descriptor handling
• Fix for compilation with GCC10
• Remove sleep timer warning
• Update examples C-markerAPI and C-internalMarkerAPI

Problems with GPU measurements on recent Nvidia GPUs are not addressed with this release. The fixes will be part of the 5.1.0 release (including support for Fujitsu A64FX and ARM Neoverse N1).

# Fujitsu’s A64FX demystified. Well, somewhat.

With all the craze around the Fugaku supercomputer (current Top500 #1) and its 48-core A64FX CPU, it was high time for some in-depth analysis of that beast. At a peak double-precision performance of about 3 Tflop/s and a memory bandwidth close to 1 Tbyte/s it’s certainly an interesting piece of silicon. Through our friends at the physics department of the University of Regensburg, where the “QPACE 4” system is installed (an FX700, the “little brother” of the FX1000 at RIKEN), we had access to one. Although it lacked the Fujitsu compiler and the Tofu network, we still got some very interesting results, which you can read about in our recent paper (which got, incidentally, the Best Short Paper Award at the PMBS20 workshop):

C. L. Alappat, J. Laukemann, T. Gruber, G. Hager, G. Wellein, N. Meyer, and T. Wettig: Performance Modeling of Streaming Kernels and Sparse Matrix-Vector Multiplication on A64FX. Accepted for the 11th International Workshop on Performance Modeling, Benchmarking, and Simulation of High Performance Computer Systems (PMBS20). Preprint: arXiv:2009.13903

The first step towards a good understanding of the performance features (and quirks) of a new CPU is to get a good grasp of its instruction execution resources and its memory hierarchy; connoisseurs know that these are the ingredients for ECM performance models of steady-state loops. We were able to show that the cache hierarchy of the A64FX is partially overlapping, mainly with respect to data writes. That’s a good thing. What’s not so good is that many instructions in the A64FX core have rather long latencies. For instance, the 512-bit Scalable Vector Extensions (SVE) floating-point ADD and FMA instructions take 9 cycles to complete, and horizontal ADDs across a SIMD register take even more, which means that sum reductions, scalar products, etc. can be very slow if the compiler doesn’t have a clue about modulo variable expansion. To add insult to injury, the core seems to have very limited out-of-order (OoO) capabilities, putting even more burden on the compiler.
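
Modulo variable expansion breaks the single dependency chain of a reduction into several independent partial sums so that a new instruction can be issued while earlier ones are still in flight. The sketch below only shows the transformation, since Python cannot demonstrate the pipelining benefit. The 9-cycle latency is taken from the text above; the throughput of two ADD/FMA instructions per cycle is an assumption made for illustration, and the function names are hypothetical.

```python
import math


def accumulators_needed(latency_cycles, instr_per_cycle):
    """Independent partial sums needed to keep the pipelines busy."""
    return math.ceil(latency_cycles * instr_per_cycle)


def sum_with_partial_sums(values, n_acc):
    """Sum reduction using n_acc independent accumulators."""
    acc = [0.0] * n_acc
    n_main = len(values) - len(values) % n_acc
    for i in range(0, n_main, n_acc):
        # The updates in this inner loop do not depend on each other,
        # which is what lets the hardware overlap them.
        for k in range(n_acc):
            acc[k] += values[i + k]
    total = sum(acc)
    # Remainder loop for the elements that did not fill a full block.
    for v in values[n_main:]:
        total += v
    return total


if __name__ == "__main__":
    n_acc = accumulators_needed(latency_cycles=9, instr_per_cycle=2)
    data = [float(i) for i in range(100)]
    print(n_acc, sum_with_partial_sums(data, n_acc))
```

Note that reordering a floating-point sum like this can change the rounding of the result, which is one reason compilers often need explicit permission to do it.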

As a consequence, sparse matrix-vector multiplication (SpMV) needs special care to get good performance (i.e., to saturate the memory bandwidth). In particular, you need a proper data format: Compressed Row Storage (CRS) just doesn’t cut it unless the number of nonzeros per row is ridiculously large. Our SELL-C-$\sigma$ format is just the right fit as it supports SIMD vectorization and deep unrolling without much hassle. As a result, SpMV can easily exceed the 100 Gflop/s barrier for reasonably benign matrices on the A64FX, but you need almost all twelve cores on each of the four ccNUMA domains, which means that any load imbalance will immediately be punished with a performance loss. Your run-of-the-mill x86 server chips are much more forgiving in this respect since load imbalance can be partially hidden by the strong memory saturation.
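
For readers unfamiliar with the format, here is a plain-Python sketch of the general SELL-C-$\sigma$ idea: rows are sorted by length within windows of $\sigma$ rows, grouped into chunks of $C$ rows, padded to the longest row of each chunk, and stored column-major within the chunk. This is not the SVE code from the paper; function names, data layout details, and the tiny test matrix are assumptions chosen for illustration.

```python
def crs_to_sell(row_ptr, col_idx, values, C, sigma):
    """Convert a CRS matrix to a simple SELL-C-sigma representation."""
    n_rows = len(row_ptr) - 1
    row_len = [row_ptr[r + 1] - row_ptr[r] for r in range(n_rows)]
    # Sort rows by descending length inside each window of sigma rows.
    perm = []
    for start in range(0, n_rows, sigma):
        window = list(range(start, min(start + sigma, n_rows)))
        window.sort(key=lambda r: -row_len[r])
        perm.extend(window)
    chunk_ptr, chunk_len, sell_val, sell_col = [0], [], [], []
    for start in range(0, n_rows, C):
        rows = perm[start:start + C]
        width = max(row_len[r] for r in rows)
        chunk_len.append(width)
        for j in range(width):            # column-major within the chunk
            for lane in range(C):
                if lane < len(rows) and j < row_len[rows[lane]]:
                    k = row_ptr[rows[lane]] + j
                    sell_val.append(values[k])
                    sell_col.append(col_idx[k])
                else:                     # zero padding
                    sell_val.append(0.0)
                    sell_col.append(0)
        chunk_ptr.append(len(sell_val))
    return perm, chunk_ptr, chunk_len, sell_val, sell_col


def sell_spmv(perm, chunk_ptr, chunk_len, sell_val, sell_col, x, C):
    """Compute y = A * x for a matrix stored by crs_to_sell."""
    n_rows = len(perm)
    y = [0.0] * n_rows
    for c, width in enumerate(chunk_len):
        base = chunk_ptr[c]
        for j in range(width):
            for lane in range(C):         # maps to SIMD lanes in real code
                pos = c * C + lane
                if pos < n_rows:
                    k = base + j * C + lane
                    y[perm[pos]] += sell_val[k] * x[sell_col[k]]
    return y


if __name__ == "__main__":
    # Small 5x5 example matrix in CRS form (row 2 is empty).
    row_ptr = [0, 1, 4, 4, 6, 7]
    col_idx = [0, 0, 1, 3, 2, 4, 4]
    values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    sell = crs_to_sell(row_ptr, col_idx, values, C=2, sigma=4)
    print(sell_spmv(*sell, x, C=2))
```

The innermost loop over `lane` touches $C$ consecutive entries that belong to $C$ different rows, which is what makes the format SIMD-friendly; sorting within the $\sigma$ window keeps rows of similar length together and thus limits the amount of padding.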

The SVE intrinsics code for all experiments can be found in our artifacts description at https://github.com/RRZE-HPC/pmbs2020-paper-artifact.