Fujitsu’s A64FX demystified. Well, somewhat.

With all the craze around the Fugaku supercomputer (current Top500 #1) and its 48-core A64FX CPU, it was high time for some in-depth analysis of that beast. At a peak double-precision performance of about 3 Tflop/s and a memory bandwidth close to 1 Tbyte/s it’s certainly an interesting piece of silicon. Through our friends at the physics department of the University of Regensburg, where the “QPACE 4” system is installed (an FX700, the “little brother” of the FX1000 at RIKEN), we had access to one. Although it lacked the Fujitsu compiler and the Tofu network, we still got some very interesting results, which you can read about in our recent paper (which got, incidentally, the Best Short Paper Award at the PMBS20 workshop):

C. L. Alappat, J. Laukemann, T. Gruber, G. Hager, G. Wellein, N. Meyer, and T. Wettig: Performance Modeling of Streaming Kernels and Sparse Matrix-Vector Multiplication on A64FX. Accepted for the 11th International Workshop on Performance Modeling, Benchmarking, and Simulation of High Performance Computer Systems (PMBS20). Preprint: arXiv:2009.13903

The first step towards a good understanding of the performance features (and quirks) of a new CPU is to get a good grasp of its instruction execution resources and its memory hierarchy; connoisseurs know that these are the ingredients for ECM performance models of steady-state loops. We were able to show that the cache hierarchy of the A64FX is partially overlapping, mainly with respect to data writes. That’s a good thing. What’s not so good is that many instructions in the A64FX core have rather long latencies. For instance, the 512-bit Scalable Vector Extensions (SVE) floating-point ADD and FMA instructions take 9 cycles to complete, and horizontal ADDs across a SIMD register take even more, which means that sum reductions, scalar products, etc. can be very slow if the compiler doesn’t have a clue about modulo variable expansion. To add insult to injury, the core seems to have very limited out-of-order (OoO) capabilities, putting even more burden on the compiler.

As a consequence, sparse matrix-vector multiplication (SpMV) needs special care to get good performance (i.e, to saturate the memory bandwidth). In particular, you need a proper data format: Compressed Row Storage (CRS) just doesn’t cut it unless the number of nonzeros per row is ridiculously large. Our SELL-C-$\sigma$ format is just the right fit as it supports SIMD vectorization and deep unrolling without much hassle. As a result, SpMV can easily exceed the 100 Gflop/s barrier for reasonably benign matrices on the A64FX, but you need almost all the twelve cores on each of the four ccNUMA domains – which means that any load imbalance will immediately by punished with a performance loss. Your run-of-the-mill x86 server chips are much more forgiving in this respect since load imbalance can be partially hidden by the strong memory saturation.

The SVE intrinsics code for all experiments can be found in our artifacts description at https://github.com/RRZE-HPC/pmbs2020-paper-artifact.