Content

New tutorial “Performance Engineering for Linear Solvers” at ISC High Performance 2024

May 5, 2024

On Sunday, May 12, the brand-new tutorial “Performance Engineering for Linear Solvers” will be presented at ISC High Performance in Hamburg by Christie Alappat (still a PhD student at FAU but now working for Intel), Jonas Thies (TU Delft), Hartwig Anzt (TU München Campus Heilbronn), and myself.

This tutorial was in the making for a long time; many concepts were made, re-made, and updated again. We aimed at a slightly higher abstraction level than in our popular tutorial “Node-Level Performance Engineering,” which has a strong focus on the Roofline model and the optimization of simple loops and loop nests. In contrast, the new tutorial concentrates on the performance of sparse linear solvers, which includes a coverage of sparse matrix-vector multiplication (SpMV), preconditioners, and even cache blocking of matrix powers via RACE, Christie’s Recursive Algebraic Coloring Engine. Since the tutorial was accepted as a half-day event, we could only accommodate online demos instead of hands-on exercises for attendees. However, all code (mostly python/numba) is available for download.

LIKWID 5.1 released

November 20, 2020

We are happy to announce a new major release 5.1.0 of LIKWID. This release adds support for the latest and upcoming architectures. Besides numerous bug fixes, these are the major new features:

Support for Intel Icelake desktop (Core + Uncore)
Support for Intel Icelake server (Core only)
Support for Intel Tigerlake desktop (Core only)
Support for Intel Cannon Lake (Core only)
Support for Nvidia GPUs with compute capability >= 7.0 (CUpti Profiling API)
Initial support for Fujitsu A64FX (Core) including SVE assembly benchmarks
Support for ARM Neoverse N1 (AWS Graviton 2)
Support for AMD Zen3 (Core + Uncore but without any events)
Fortran 90 interface for NvMarkerAPI (update)

We want to thank Intel, AMD, AWS and the University of Regensburg for their support.

EoCoE webinar on A64FX

October 28, 2020

Our friends from the “EoCoE-II” project have invited us to share our results about the new A64FX processor. Attendance is free and open to everyone. Please register using the link given below.

Title: The A64FX processor: Understanding streaming kernels and sparse matrix-vector multiplication

Date: November 18, 2020, 10:00 a.m. CET

Speakers: Christie L. Alappat and Georg Hager (RRZE)

Registration URL: https://attendee.gotowebinar.com/register/3926945771611115789

Abstract: The A64FX CPU powers the current #1 supercomputer on the Top500 list. Although it is a traditional cache-based multicore processor, its peak performance and memory bandwidth rival accelerator devices. Generating efficient code for such a new architecture requires a good understanding of its performance features. Using these features, the Erlangen Regional Computing Center (RRZE) team will detail how they construct the Execution-Cache-Memory (ECM) performance model for the A64FX processor in the FX700 supercomputer and validate it using streaming loops. They will describe how the machine model points to peculiarities in the microarchitecture to keep in mind when optimizing applications, and how, applying the ECM model to sparse matrix-vector multiplication (SpMV), they motivate why the CRS matrix storage format is inappropriate and how the SELL-C-sigma format can achieve bandwidth saturation for SpMV. In this context, they will also look into some code optimization strategies that are relevant for A64FX and compare SpMV performance with AMD Rome, Intel Cascade Lake and NVIDIA V100.

This webinar is organized by the European Energy-Oriented Center of Excellence (EoCoE). A video recording is now available on the EoCoE YouTube channel:

LIKWID 5.0 is here

November 15, 2019

Laptop decorations available at SC19!

Just in time for SC19, version 5 of our popular LIKWID tool suite has been released. There are tons of new developments in there; these are the most important ones:

Support for ARM architectures, especially for Marvell Thunder X2
Support for IBM POWER architectures (POWER8 and POWER9)
Support for AMD Zen2 and for data fabric counters of the AMD Zen microarchitecture
Support for Nvidia GPU monitoring (with NvMarkerAPI)
New clock frequency backend (with less overhead)
Generation of benchmarks for likwid-bench on-the-fly from ptt files
Integration of GOTCHA for hooking into client applications at runtime
Thread-local initialization of streams for likwid-bench
Enhanced support for SLURM with likwid-mpirun
New MPI and Hybrid pinning features for likwid-mpirun
JSON output filter file (use -o output.json)
Updated quick reference sheet with all the new options

The full list is available at the github release page. And if you need something really cool to cover that empty spot on your laptop lid, we’ll have LIKWID stickers available during our SC19 tutorial “Node-Level Performance Engineering” and at the Bavarian Supercomputing booth (#2063).

Direct download from FAU FTP

LIKWID documentation Wiki

Github project

PPAM 2019 Workshops Best Paper Award for Dominik Ernst

September 16, 2019

Dominik and his PPAM Best Poster Award

Our paper “Performance Engineering for a Tall & Skinny Matrix Multiplication Kernel on GPUs” by Dominik Ernst, Georg Hager, Jonas Thies, and Gerhard Wellein just received the best workshop paper award at the PPAM 2019, the 13th International Conference on Parallel Processing and Applied Mathematics, in Bialystok, Poland. In this paper, Dominik investigated different methods to optimize an important but often neglected computational kernel: the multiplication of two extremely non-square matrices, with millions of rows but very few (tens of) columns. Vendor libraries still do not perform well in this situation, so you have to roll your own implementation, which is a daunting task because of the huge optimization parameter space, especially on GPGPUs.

The optimizations were guided by the Roofline model (hence “Performance Engineering”), which provides an upper limit for the performance of the kernel. On an Nvidia V100 GPGPU, Dominik’s solution achieves more than 90% of the maximum performance for matrices with up to 40 columns, and more than 50% at up to 64 columns. This is significantly faster than vendor libraries at the time of writing.

The work was funded by the project ESSEX (Equipping Sparse Solvers for Exascale) within the DFG priority programme 1648 (SPPEXA). A preprint of the paper is available at arXiv:1905.03136.

Fooling the masses – Stunt 9: Boast massive speedups with accelerators!

November 3, 2012

(See the prelude for some general information on what this is all about)

GPGPUs, FPGAs, massively parallel single-chip multiprocessors: Accelerators are cool. The prospect of boosting your application’s performance by 100x or more is mouthwatering, so you invest days and weeks, even months to port and tweak code and make it run fast on that shiny new piece of hardware. However, in the end the outcome may well be a meager 2.5x, or even much less, if you compare to well-optimized parallel code on a standard 2-socket server that everyone can handle.

What happened? Well, a direct peak performance and memory bandwidth comparison reveals that – even without considering Amdahl’s Law and overheads like PCIe transfers – 2x-4x is just what can be expected. Assuming a fair game, of course. And that’s exactly your chance to sugarcoat your mediocre results and fabricate a blazing success! Here are some hints:

Compare bleeding-edge accelerator technology with vintage CPUs. A Pentium III will be fine for most purposes today. If you can’t get your hands on one of those, give me a call.
Always compare the fully parallel accelerator code with a sequential (single-core) CPU version. You can even argue in favor of this, saying that “serial CPU code constitutes a well-defined baseline.” Be sure to rush ahead to the next slide before anyone dares to ask why you don’t use a single GPGPU “core” to make the comparison even more “well-defined.”
Use no optimizations whatsoever with the CPU compiler; -O0 will be fine. Also use a plain, proof-of-concept implementation of your algorithm with no manual code changes. It’s good scientific practice because a rock-bottom baseline is more meaningful!
Use single precision on the GPU but double precision on the CPU. That will effectively cut the CPU’s cache, the memory bandwidth, and its peak performance in half.
Accelerate only the kernels that are easily portable and report only their performance. Amdahl’s Law and communication overhead will be nothing but smoke and mirrors!

Figure 1: If you do it right, a 200x speedup for the GPU is absolutely feasible (comparing nVIDIA Tesla C2050 and Intel Dual Xeon 5650, dense matrix-vector multiply, matrix size 4500×4500, Intel compiler V12 update 11)

These tricks should give you a competitive edge; 20x-200x reported speedup should be no problem, depending on your unscrupulousness.

See Fig. 1 for an instructive example: A dense matrix-vector multiply using a 4500×4500 matrix is memory-bound on a Fermi-type GPGPU as well as on a standard CPU-based system such as a dual-socket Xeon 5650 (“Westmere”). The GPGPU baseline of about 45 GFlop/s at single precision (SP) is completely in line with the achievable memory bandwidth of roughly 90 GB/s (4 bytes per multiply-add operation, assuming the right-hand side is mostly held in shared memory). To get a slight speedup you can switch off ECC (who cares for some flipped bits?) to get about 120 GB/s, corresponding to 60 GFlop/s. No need to mention that the PCIe transfers to get the RHS and result vectors to and from the device are neglected. So much for the GPGPU.

On the CPU side, the full dual-Xeon node has a bandwidth of close to 40 GB/s, leading to an SP performance of 19 GFlop/s – much too fast to be shown in public! Going to double precision cuts that in half, but there is more: Using just a single core gets you down another factor of 4, and disabling vectorization gives you a further 0.5x (since you’re now far away from any bandwidth limitation). Finally, turning off optimization altogether (-O0) lets you hit rock bottom at 248 MFlop/s – a whopping 214x slower than the best GPGPU result! Going to my old P-III is not even necessary any more.

Note the cunning use of a linear scale in the diagram; in this particular case it is much better than a log scale since it reduces the CPU performance “baseline” to a mere flyspeck (see Stunt 3 for more information on log vs. linear scale).

This stunt is essentially identical to #6 and #7 of Bailey’s original collection.

The 400x GPU speedup baloney

December 7, 2010

Recently, in the HPC services office…

(… phone rings …)

“Computing Center, HPC services. How can I help you?”

“Yes, High Performance Computing.”

“You want to use our GPGPU cluster? Great! The load on this baby could be higher anyway. It’s hard to believe, but people seem to avoid it like the plague (jovial laughter). Do you have a code that runs on GPUs already?”

“I see, the compiler should be able to handle this. But the code is SIMD vectorized for standard processors, right?”

“No, this has nothing to do with cell phones. SIMD means ‘Single Instruction Multiple Data,’ i.e., several operations can be performed on different operands with a single machine instruction. If that works, chances are that the program can be run efficiently on a GPU as well. After all, GPUs implement the SIMD principle quite extensively.”

“Hm? I think I don’t understand…”

“Ah, ok. No, you don’t have to learn assembly programming to do this. But you may have to think a little more about how the compiler looks at your code. Often you can help it by supplying additional information, like source directives. And of course you need to use a compiler that understands what SIMD is. Alas, the GNU compilers don’t have a clue about it, mostly. By the way, how have you parallelized the code?”

“Um, no. Compilers can’t help you much here, except for very simple cases where a 10-year-old can do it just as well. But you do have to parallelize. How do you want to draw a meaningful comparison to the GPU version?”

“What do you mean, you don’t need to do this?”

“Um, yes, I think I’ve seen this paper recently. It should be on my desk somewhere… (paper rustling) And what exactly are you referring to?”

“Section 4.3, just a sec… here we are: ‘As shown in Fig. whatever, we have achieved a 400x speedup on an NVIDIA Tesla GPU as compared with our CPU implementation.’ (long pause)”

“Sorry, no, I’m still here. I’ve just been looking for the details of the CPU implementation. One moment… (longer pause) Ok, here’s something in the caption of the pretty CFD visualization: ‘In order to avoid issues with OpenMP parallelization we have run the CPU version on a single core.’ Wow. This has to sink in. And if I’m not mistaken, they compare a single-precision GPU code with a double precision version on the CPU. Truly hilarious.”

“No, I’m not making fun of those scientists. But ‘scientists’ is not the word I would use. They obviously think that everyone else is stupid.”

“Why? Because a factor of 400 is impossible. Neither the floating-point peak performance nor the memory bandwidth of the GPU is 400 times larger than that of a current standard compute node, or a chip, or even a single core. So they must have fabricated a slow CPU code on purpose. Realistically one may expect a factor of 2-4 if you compare a reasonably optimized CPU code on a single node with a single GPU, and use the same precision on both.”

“Yes, I agree. 2-4 isn’t bad at all. But that’s just counting the raw GPU. How would the data transfer affect the performance of you code? Can you estimate this?”

“Well, somehow the data must be brought into the GPU and the results must be copied back so that they don’t start rotting over there…”

“Yes (sighing with resignation). Compilers are smart. But there are limits. If you copy the whole problem through the PCI bus after every step, the only way to exploit the performance advantage of the GPU is to perform ridiculously many flops. It’s all a matter of code balance.”

“Code balance tells you how much data transfer your code needs per executed floating-point operation – and you have to count everything, including communication via buses, the network, the memory interface, etc. This can add up to quite some data volume. And then you compare with the amount of data the hardware can transfer per peak flop executed, which gives you an estimate for the performance.”

“You don’t know how many flops and how much data transfer your code needs? Can’t you just count that by looking at it? In most cases, compute time is spent in a very limited number of numerical kernel loops.”

“No, the compiler doesn’t count that for you (keeping composure with obvious effort). It’s also a matter of optimization; perhaps one can reduce the data transfers a bit. One would have to take a look at the code.”

“The compiler? (devastated) Yes, you should try that. Definitely… No listen, people are much too enthusiastic about the compiler’s abilities. Compilers can not read your mind, they can’t even look through the standard C++ template tricks that programmers are so fond of. Let alone generate optimal code for GPUs.”

“Ok, err, sorry, are you crying? (nonplussed) Please don’t. May I suggest that we sit together over a cup of coffee and I give you a crash course in basic performance modeling? Don’t worry, everything’s going to be alright. There, there…”

Georg Hager's Blog

Random thoughts on High Performance Computing