Georg Hager's Blog

Random thoughts on High Performance Computing

Content

LIKWID 5.1 released

We are happy to announce a new major release 5.1.0 of LIKWID. This release adds support for the latest and upcoming architectures. Besides numerous bug fixes, these are the major new features:

  • Support for Intel Icelake desktop (Core + Uncore)
  • Support for Intel Icelake server (Core only)
  • Support for Intel Tigerlake desktop (Core only)
  • Support for Intel Cannon Lake (Core only)
  • Support for Nvidia GPUs with compute capability >= 7.0 (CUpti Profiling API)
  • Initial support for Fujitsu A64FX (Core) including SVE assembly benchmarks
  • Support for ARM Neoverse N1 (AWS Graviton 2)
  • Support for AMD Zen3 (Core + Uncore but without any events)
  • Fortran 90 interface for NvMarkerAPI (update)

We want to thank Intel, AMD, AWS and the University of Regensburg for their support.

EoCoE webinar on A64FX

Our friends from the “EoCoE-II” project have invited us to share our results about the new A64FX processor. Attendance is free and open to everyone. Please register using the link given below.


Title: The A64FX processor: Understanding streaming kernels and sparse matrix-vector multiplication

Date: November 18, 2020, 10:00 a.m. CET

Speakers: Christie L. Alappat and Georg Hager (RRZE)

Registration URL: https://attendee.gotowebinar.com/register/3926945771611115789

Abstract:  The A64FX CPU powers the current #1 supercomputer on the Top500 list. Although it is a traditional cache-based multicore processor, its peak performance and memory bandwidth rival accelerator devices. Generating efficient code for such a new architecture requires a good understanding of its performance features. Using these features, the Erlangen Regional Computing Center (RRZE) team will detail how they construct the Execution-Cache-Memory (ECM) performance model for the A64FX processor in the FX700 supercomputer and validate it using streaming loops. They will describe how the machine model points to peculiarities in the microarchitecture to keep in mind when optimizing applications, and how, applying the ECM model to sparse matrix-vector multiplication (SpMV), they motivate why the CRS matrix storage format is inappropriate and how the SELL-C-sigma format can achieve bandwidth saturation for SpMV. In this context, they will also look into some code optimization strategies that are relevant for A64FX and compare SpMV performance with AMD Rome, Intel Cascade Lake and NVIDIA V100.

This webinar is organized by the European Energy-Oriented Center of Excellence (EoCoE). A video recording is now available on the EoCoE YouTube channel:

LIKWID 5.0.2 released

We are happy to announce a new release 5.0.2 of LIKWID. It is mainly a bugfix release, but it also has some important updates for modern architectures (IBM Power9, AMD Zen[2]). If you want to use LIKWID on AMD Zen/Zen2 systems, we highly recommend updating. Thanks to HLRS and LANL for valuable input.

Here is the full Changelog:

  • Fix memory leak in calc_metric()
  • New peakflops benchmarks in likwid-bench
  • Fix for NUMA domain handling
  • Improvements for perf_event backend
  • Fix for perfctr and powermeter with perf_event backend
  • Fix for likwid-mpirun for SLURM with cpusets
  • Fix for likwid-setFrequencies in cpusets
  • Update for POWER9 event list
  • Updates for AMD Zen, Zen+ and Zen2 (events, groups)
  • Fix for Intel Uncore events with same name for different devices
  • Fix for file descriptor handling
  • Fix for compilation with GCC10
  • Remove sleep timer warning
  • Update examples C-markerAPI and C-internalMarkerAPI

Get the download from our FTP server: ftp://ftp.fau.de/mirrors/likwid/

Problems with GPU measurements on recent Nvidia GPUs are not addressed with this release. The fixes will be part of the 5.1.0 release (including support for Fujitsu A64FX and ARM Neoverse N1).

Fujitsu’s A64FX demystified. Well, somewhat.

With all the craze around the Fugaku supercomputer (current Top500 #1) and its 48-core A64FX CPU, it was high time for some in-depth analysis of that beast. At a peak double-precision performance of about 3 Tflop/s and a memory bandwidth close to 1 Tbyte/s it’s certainly an interesting piece of silicon. Through our friends at the physics department of the University of Regensburg, where the “QPACE 4” system is installed (an FX700, the “little brother” of the FX1000 at RIKEN), we had access to one. Although it lacked the Fujitsu compiler and the Tofu network, we still got some very interesting results, which you can read about in our recent paper (which got, incidentally, the Best Short Paper Award at the PMBS20 workshop):

C. L. Alappat, J. Laukemann, T. Gruber, G. Hager, G. Wellein, N. Meyer, and T. Wettig: Performance Modeling of Streaming Kernels and Sparse Matrix-Vector Multiplication on A64FX. Accepted for the 11th International Workshop on Performance Modeling, Benchmarking, and Simulation of High Performance Computer Systems (PMBS20). Preprint: arXiv:2009.13903

The first step towards a good understanding of the performance features (and quirks) of a new CPU is to get a good grasp of its instruction execution resources and its memory hierarchy; connoisseurs know that these are the ingredients for ECM performance models of steady-state loops. We were able to show that the cache hierarchy of the A64FX is partially overlapping, mainly with respect to data writes. That’s a good thing. What’s not so good is that many instructions in the A64FX core have rather long latencies. For instance, the 512-bit Scalable Vector Extensions (SVE) floating-point ADD and FMA instructions take 9 cycles to complete, and horizontal ADDs across a SIMD register take even more, which means that sum reductions, scalar products, etc. can be very slow if the compiler doesn’t have a clue about modulo variable expansion. To add insult to injury, the core seems to have very limited out-of-order (OoO) capabilities, putting even more burden on the compiler.

As a consequence, sparse matrix-vector multiplication (SpMV) needs special care to get good performance (i.e, to saturate the memory bandwidth). In particular, you need a proper data format: Compressed Row Storage (CRS) just doesn’t cut it unless the number of nonzeros per row is ridiculously large. Our SELL-C-\sigma format is just the right fit as it supports SIMD vectorization and deep unrolling without much hassle. As a result, SpMV can easily exceed the 100 Gflop/s barrier for reasonably benign matrices on the A64FX, but you need almost all the twelve cores on each of the four ccNUMA domains – which means that any load imbalance will immediately by punished with a performance loss. Your run-of-the-mill x86 server chips are much more forgiving in this respect since load imbalance can be partially hidden by the strong memory saturation.

The SVE intrinsics code for all experiments can be found in our artifacts description at https://github.com/RRZE-HPC/pmbs2020-paper-artifact.

Introducing the MachineState reproducibility tool

MachineState is a python3 module and CLI application for documenting and comparing settings known to affect application performance: e.g., CPU/Uncore frequencies, hardware prefetchers, memory capacity, but also OS and software settings like NUMA balancing, writeback workqueues, scheduling, or the versions of common tools and libraries (e.g., compilers and MPI). All this information can be essential for reproduction of benchmark results. The MachineState tool gathers all (known) settings and presents them as a JSON document. A state file written earlier can be compared to the current machine state to uncover deviations from  the original test system.

Check out the MachineState github project, maintained by Thomas “TomTheBear” Gruber

PMACS 2020 Workshop Call for Papers

Euro-Par 2020, Warsaw, PolandThe Euro-Par conference is a well-established venue for presenting and discussing research on parallel computing in Europe. In association with the Euro-Par 2020 conference, which takes place from August 24-28 in Warsaw, Poland, the workshop “PMACS – Performance Monitoring and Analysis of Cluster Systems” is organized by Thomas Gruber from our group. The workshop offers presentations and exchange of experience about various solutions on the collection, transmission, storage, evaluation, and visualization of runtime data about the hard- and software of whole cluster systems.

Continue reading

Intel’s port 7 AGU blunder

Everyone’s got their pet peeves: For Poempelfox it’s Schiphol Airport, for Andreas Stiller it’s the infamous A20 gate. Compared to those glorious fails, my favorite tech blunder is a rather measly one, and it may not be relevant to many users in practice. However, it’s not so much the importance of it but the total mystery of how it came to happen. So here’s the thing.

Loads, stores, and AGUs

Sandy Bridge and Ivy Bridge LOAD and STORE units, AGUs, and their respective ports.

The Intel Sandy Bridge and Ivy Bridge architectures have six execution ports, two of which (#2 & #3) feed one LOAD pipeline each and one (#4) feeds a STORE pipe. These units are capable of transferring 16 bytes of data per cycle each. With AVX code, the core is thus able to sustain one full-width 32-byte LOAD (in two adjacent 16-byte chunks) and one half of a 32-byte STORE per cycle. But the LOAD and STORE ports are not the only thing that’s needed to execute these instructions – the core must also generate the corresponding memory addresses, which can be rather complicated. In a LOAD instruction like:

vmovupd ymm0, [rdx+rsi*8+32]

the memory address calculation involves two integer add operations and a shift. It is the task of the address generation units (AGUs) to do this. Each of ports 2 and 3 serves an AGU in addition to the LOAD unit, so the core can generate two addresses per cycle – more than enough to sustain the maximum LOAD and STORE throughput with AVX.

The peculiar configuration of LOAD and STORE units and AGUs causes some strange effects. For instance, if we execute the Schönauer vector triad: Continue reading

Node-Level Performance Engineering Course at LRZ

We proudly present a retake of our PRACE course on “Node-Level Performance Engineering” on December 3-4, 2019 at LRZ Garching.

This course covers performance engineering approaches on the compute node level. Even application developers who are fluent in OpenMP and MPI often lack a good grasp of how much performance could at best be achieved by their code. This is because parallelism takes us only half the way to good performance. Even worse, slow serial code tends to scale very well, hiding the fact that resources are wasted. This course conveys the required knowledge to develop a thorough understanding of the interactions between software and hardware. This process must start at the core, socket, and node level, where the code gets executed that does the actual computational work. We introduce the basic architectural features and bottlenecks of modern processors and compute nodes. Pipelining, SIMD, superscalarity, caches, memory interfaces, ccNUMA, etc., are covered. A cornerstone of node-level performance analysis is the Roofline model, which is introduced in due detail and applied to various examples from computational science. We also show how simple software tools can be used to acquire knowledge about the system, run code in a reproducible way, and validate hypotheses about resource consumption. Finally, once the architectural requirements of a code are understood and correlated with performance measurements, the potential benefit of code changes can often be predicted, replacing hope-for-the-best optimizations by a scientific process.

This is a two-day course with demos. It is provided free of charge to members of European research institutions and universities.

 Date: Tuesday, Dec 3, 2019 09:00 – 17:00
Wednesday, Dec 4, 2019 09:00 – 17:00
Location: LRZ Building, University campus Garching, near Munich, Hörsaal H.E.009 (Lecture hall)
Course webpage with detailed agenda: https://events.prace-ri.eu/event/901/
Registration: Via https://events.prace-ri.eu/event/901/registrations/633/