Georg Hager's Blog

Random thoughts on High Performance Computing

Content

Fun with likwid-pin and shepherd threads

Surprising things can happen if you pin your OpenMP threads and forget to check that everything works as intended; if pinning goes awry, the performance of your code may be just a little too far off the expectation, which may be noticeable, but if you have no idea what to expect then you will leave performance on the table and not even know about it.

The case

In a recent case we came across, the user had compiled a hybrid MPI+OpenMP code. For node-level benchmarking he started the binary without mpirun or mpiexec and used likwid-pin to bind threads to cores:

$ likwid-pin -C N:0-27 ./a.out

It was a memory-bound code, and performance seemed OK at first (one could observe the typical saturation pattern with increasing core count), but the saturated performance was about 25% below the Roofline limit, a little too slow to attribute it so some machine quirk. Of course we made sure that the Roofline model used the correct computational intensity, and that the memory bandwidth was derived from a reasonable STREAM measurement. 25% may not seem much, but in such a situation (and on a well-known architecture like the Intel Broadwell EP) it is often worthwhile to try and find out what’s going on – probably we can learn something new along the way.

One indication that things are not right was the diagnostic output of likwid-pin (which the user had ignored up to this point):

[... SNIP ...]
     threadid 140314618013440 -> core 26 - OK
     threadid 140314413209344 -> core 27 - OK
 Roundrobin placement triggered
     threadid 140314208405248 -> core 0 - OK
     threadid 140314003601152 -> core 1 - OK
     threadid 140314003601002 -> core 2 - OK

The “Roundrobin placement triggered” message should never show up. It means that more threads were spawned than the pin mask could accommodate.  If you want to conduct a very special experiment, that may be what you want, but in general it isn’t. likwid-pin has those nice diagnostic messages, so it’s actually quite easy to see, but if you use some other affinity mechanism (or the -q switch with likwid-pin) then you must use some other means of checking. The “top” tool comes to mind: Many users don’t know that it can be configured to (i) show individual threads of a running binary (by hitting “H”), (ii) display the core each thread or process is running on (by hitting “f” and selecting “Last CPU used” as a display column), and (iii) to display the utilization of individual cores (by hitting “t” repeatedly, cycling through several display options). This way one could have noticed that the code above always left core 3 idle, although the pin mask definitely included it, and that core 0 was running two application threads in time-sharing mode. Note also that if we had used OMP_NUM_THREADS to set a smaller thread count (e.g., 14) but left the pin mask as it is, the “Roundrobin” message would not have shown up since the pin mask would have had ample space for the extra threads. This is a common scenario when doing intra-node scaling tests.

Shepherd Threads

So what was going on? To understand this we have to learn about shepherd threads. These are threads that are generated by your program, or rather the runtime underneath your chosen programming model, to work some under-the-hood magic. For instance, the Intel compilers up to version 17 Update 0 used a single shepherd for OpenMP. When your code hit the first parallel OpenMP region (this is where usually the application threads are brought to life), the runtime generated an extra thread first (i.e., as the first newly spawned thread after the master). There is no documentation about what this thread is for, but we have indications that it is at least used for waking up the team of threads after they went to sleep in a barrier. The important thing is, however, that the shepherd does not execute any user code, nor does it use any significant CPU time.

This is why likwid-pin sometimes displays a “SKIP SHEPHERD” message:

[... SNIP ...]
[pthread wrapper] 
[pthread wrapper] MAIN -> 0
[pthread wrapper] PIN_MASK: 0->1 1->2 2->3 3->4 4->5 5->6 6->7 7->8 8->9 
[pthread wrapper] SKIP MASK: 0x0
   threadid 139941087876864 -> SKIP SHEPHERD
   threadid 139941064513408 -> core 1 - OK
[... SNIP ...]

likwid-pin tries to figure out automatically if a newly generated thread is a shepherd. If it is, no pinning takes place, and it is left to roam around freely in the machine. When Intel dumped the shepherd thread in their 17.0 Update 1 compiler, this gave the developers some headache, and the code for shepherd detection had to be adapted. As of LIKWID 4.3, likwid-pin (and, of course, likwid-mpirun and likwid-perfctr) can reliably detect shepherds with all Intel compilers. The GCC compilers do not use shepherds at all (as of today), and LIKWID handles that, too.

What’s all the fuss about then? Well, shepherds are still something to be reckoned with, and they are typically not well documented. In our introductory example, the user had used g++ with OpenMPI and asynchronous progress enabled. It turned out that, although g++ itself did not spawn a shepherd, OpenMPI did: It spawned three, to be precise. In the hybrid MPI+OpenMP program, these three extra threads were generated after the main thread. This is why likwid-pin complained about “Roundrobin” placement, and this is also why core 3 was idle and core 0 was overloaded. Core 0 was running the OpenMP master, cores 0-2 were running the last three user threads, cores 1 and 2 were additionally running two shepherds (with no adverse effects), while core 3 had only the third shepherd to tend to. OpenMPI is not the only MPI implementation to use shepherds. Intel MPI has them, too, and what’s worse, their number depends on whether you use intra-node communication only or not. LIKWID does its best to detect the shepherds, but ultimately the only way to be sure that everything is OK is to check it using, e.g., “top.”

The LIKWID skip mask

If likwid-pin cannot figure out the shepherds correctly, you can still do it on your own and instruct the tool to ignore specific threads for pinning. This is what the skip mask is for. It is a hex code in which each bit represents a thread (excluding the master). For example, if you know that for some reason you have three shepherds, all generated right after the master thread, you would call likwid-pin (and all other LIKWID tools that do affinity) with the -s option and a hex argument:

$ likwid-pin -C N:0-27 -s 0x7 ./a.out

This will lead to three “SKIP SHEPHERD” messages after the master thread is pinned, and subsequent threads will be pinned according to the given mask. In the case described above, this option fixed the problem, eliminated the “Roundrobin” warning, and led to an outright 30% increase in performance because core 0 now had the same workload as everyone else.

Note that the shepherd thing can go either way performance-wise. Imagine you have a large skip mask covering all cores in a ccNUMA system, the shepherds are not detected correctly, and you use OMP_NUM_THREADS to run a team that spans a single ccNUMA domain only – or so you thought. Instead, the shepherd(s) are pinned to cores on the first domain, and the last couple of threads go to the second domain. Voilà: more bandwidth for everyone, and thus more performance than what Roofline on one domain would allow.

The gist of it is, if you use some affinity mechanism, check that it works as intended in your environment. If you change the compiler or the MPI implementation, check again. Note also that correct pinning can be a challenge for hybrid MPI+OpenMP programs. This is why we have likwid-mpirun. And finally, it goes without saying that a performance model really helps with figuring out such issues. As an added bonus, it gives you good karma.

LIKWID 4.3 is out – and it has a quick reference, too!

LIKWID ToolsLIKWID 4.3 is out! You can download it from the Github site. Thanks to initial work by Anja Gerbes we also have a quick reference sheet, which you can tack to your monitor, display as a permanent background image, or put under your pillow for diffusive knowledge transfer… Of course, you may also consult our extensive documentation.

These are the most important changes in the new release compared to 4.2.1:

  • Support for Intel Skylake SP architecture (core, uncore, energy)
  • Support for AMD Zen architecture (core, L2, energy)
  • Support for Intel Goldmont Plus architecture
  • Pinning strategy `balanced’
  • New Lua based calculator
  • Support for Intel PState CPU frequency daemon

The current release is actually 4.3.1 already – some minor fixes have required a quick update.

Himeno stencil benchmark: Roofline performance modeling and validation

[Update 17/11/29: Pointed out that the C version was modified from the original code – thanks Julian]

The Himeno benchmark [1] is a very popular code in the performance analysis and optimization community. Countless papers have been written that use it for performance assessment, prediction, optimization, comparisons, etc. Surprisingly, there is hardly a solid analysis of its data transfer properties. It’s a stencil code after all, and most of those can be easily analyzed.

The code

The OpenMP-parallel C version looks as shown below. I have made a slight change to the original code: The order of indices on the arrays a, b, and c hinders efficient SIMD vectorization, so I moved the short index to the first position. This also makes it equivalent to the Fortran version.

// all data structures hold single-precision values
for(n=0;n<nn;++n){
  gosa = 0.0;
  #pragma omp parallel for reduction(+:gosa) private(s0,ss,j,k)
  for(i=1 ; i<imax-1 ; ++i)
    for(j=1 ; j<jmax-1 ; ++j)
      for(k=1 ; k<kmax-1 ; ++k){
        // short index on a, b, c was moved up front
        s0 = a[0][i][j][k] * p[i+1][j ][k ]
           + a[1][i][j][k] * p[i ][j+1][k ]
           + a[2][i][j][k] * p[i ][j ][k+1]
           + b[0][i][j][k] * ( p[i+1][j+1][k ] - p[i+1][j-1][k ]
                             - p[i-1][j+1][k ] + p[i-1][j-1][k ] )
           + b[1][i][j][k] * ( p[i ][j+1][k+1] - p[i ][j-1][k+1]
                             - p[i ][j+1][k-1] + p[i ][j-1][k-1] )
           + b[2][i][j][k] * ( p[i+1][j ][k+1] - p[i-1][j ][k+1]
                             - p[i+1][j ][k-1] + p[i-1][j ][k-1] )
           + c[0][i][j][k] * p[i-1][j ][k ]
           + c[1][i][j][k] * p[i ][j-1][k ]
           + c[2][i][j][k] * p[i ][j ][k-1]
           + wrk1[i][j][k];
        ss = ( s0 * a[3][i][j][k] - p[i][j][k] ) * bnd[i][j][k];
        gosa = gosa + ss*ss;
        wrk2[i][j][k] = p[i][j][k] + omega * ss;
      }
  // copy-back loop ignored for analysis
  #pragma omp parallel for private(j,k)
  for(i=1 ; i<imax-1 ; ++i)
    for(j=1 ; j<jmax-1 ; ++j)
      for(k=1 ; k<kmax-1 ; ++k)
        p[i][j][k] = wrk2[i][j][k];
} /* end n loop */

Himeno stancil

Figure 1: Structure of the 19-point stencil showing the data access pattern to the p[][][] array in the Himeno benchmark. The k index is the inner (fast) loop index here.

There is an outer iteration loop over n. The first (parallel) loop nest over i, j, and k updates the wrk2 array from the arrays a, b, c, wrk1, bnd, and p, of which only p has a stencil-like access pattern (see Fig. 1). All others are accessed in a consecutive, cacheline-friendly way. Since the coefficient arrays a, b, and c carry a fourth index in the first position, the row-major data layout of the C language leads to many concurrent data streams. We will see whether or not this impacts the performance of the code.

A second parallel loop nest copies the result in wrk2 back to the stencil array p. This second loop can be easily optimized away (how?), so we ignore it in the following; all analysis and performance numbers pertain to the first loop only.

Amount of work

There are 14 floating-point additions, 7 subtractions, and 13 multiplications in the loop body. Hence, one lattice site update (LUP) amounts to 34 flops.

Data transfers and best-case code balance

For this analysis the working set shall be larger than any cache. It is straightforward to calculate a lower limit for the data transfers if we assume perfect spatial and temporal locality for all data accesses within one update sweep: All arrays except wrk2 must be read at least once, and wrk2 must be written. This leads to (13+1) single-precision floating-point numbers being transferred between the core(s) and main memory. The best-case code balance is thus Bc = 1.65 byte/flop = 56 byte/LUP. If the architecture has a write-back cache, an additional write-allocate transfer must be accounted for if it cannot be avoided (e.g., by nontemporal stores). In this case the best-case code balance is Bc = 1.76 byte/flop = 60 byte/LUP.

Considering that even the most balanced machines available today are not able to feed such a hunger for data (e.g., the new NEC-SX Aurora TSUBASA vector engine with 0.5 byte/flop), we know that this code will be memory bound. If the memory bandwidth can be saturated, the upper performance limit is the memory bandwidth divided by the code balance.

Continue reading

Open Source Architecture Code Analyzer (OSACA) has been released

The Open Source Architecture Code Analyzer (OSACA) is a tool that can analyze assembly or machine code and produce a best-case (throughput-limited) runtime prediction assuming that the data is in the L1 cache. Such a tool is sorely needed for processor architectures other than Intel’s. Intel provides the Intel Architecture Code Analyzer (IACA) for free, but it is not open source and its future development path is uclear.

Why such a tool? Analytic performance models, such as the ECM model, depend on an accurate assessment of in-core execution performance. You can either do that manually by code (source or assembly) inspection, or you can use a tool that knows the instruction set and the limitations of a particular microarchitecture. The data flow analysis must be done by someone else – again, it’s either your brain or, e.g., our Kerncraft tool.

Jan Laukemann, our bachelor’s candidate, has taken on the tremendous task of developing the initial version of an IACA replacement, and OSACA is the result. It is downloadable from github: https://github.com/RRZE-HPC/osaca. We haven’t even bothered to give it a version number yet, so this is definitely in alpha stage, but it is usable and can do some things that IACA can’t, such as, e.g., analyze non-compiled assembly code or extend its own database with new instructions.

Feedback is encouraged and welcome.

 

Why IPC (or CPI) is not a good performance metric

Instructions per cycle (IPC) or its reciprocal CPI (cycles per instruction) are the processor architect’s dearest metrics. They quantify how effectively a core can execute instructions. If that is what you care about, it may do the job, and there are many cases where it does: If you have a pile of codes from all sorts of application fields, and those codes don’t really change (meaning that the binary representation is always the same) as you go from one architecture to another, IPC is indeed a good performance metric. When running the same binary on different CPUs, higher IPC means higher performance.

In high performance computing, however, this is not what we do. Once a new architecture comes out, we at least recompile our codes (hoping that the compiler knows something about the gory details of that new chip that it can use to some advantage). Sometimes we do architecture-specific optimizations, insert directives, change blocking factors to adapt to cache sizes, etc. This breaks the “same binary” condition above, and who knows how many instructions are needed to do the same thing using two different instruction sets or two different low-level code transformations?

SIMD (Single Instruction Multiple Data) is a perfect example. Consider this very simple double-precision STREAM ADD code:

#pragma omp parallel for
for(int i=0; i<N; ++i)
  c[i] = a[i] + b[i];

To keep things simple we will assume that the working set is too large to fit into any cache, so the performance will be memory bound. A modern Intel Skylake CPU with six DDR4-2666 memory channels can deliver about 110 Gbyte/s of memory bandwidth, so at a code balance of Bc=32 byte/flop (assuming nontemporal stores canot be used) the maximum achievable performance for this loop is about \frac{110\,\text{Gbyte/s}}{32\,\text{byte/flop}} \approx 3.4\,\text{Gflop/s}~. At this performance, what is the IPC value? That depends on how many instructions are executed in the loop body, and of which type they are. Let us distinguish the two corner cases:

  1. Fully vectorized AVX-512 code with extra four-way unrolling. The loop body consists of eight LOADs, four STOREs, and four ADDs (I am counting micro-ops here; the x86 ADDs may carry a memory argument), plus the loop mechanics (increment, compare, conditional jump). These are 19 instructions for 32 flops, so at 3.4 Gflop/s and a clock frequency of 2 GHz we have \text{IPC}_\mathrm{AVX512} = \frac{3.4\times 10^9\,\text{flop/s}\times\frac{19}{32}\text{instr.}/\text{flop}}{2.0\times 10^9\,\text{cy/s}}\approx 1.0\,\text{instr.}/\text{cy}~. Note that this number does not depend on the number of cores we use as long as the memory bandwidth is saturated. The IPC per core is, of course, this value divided by the number of cores that participate in the calculation.
  2. Scalar code without any additional unrolling. In this case we have two LOADs, one STORE, one ADD, and the three loop instructions, which boils down to 7 instructions for 1 flop. At saturation we thus get \text{IPC}_\mathrm{scalar} = \frac{3.4\times 10^9\,\text{flop/s}\times 7\,\text{instr.}/\text{flop}}{2.0\times 10^9\,\text{cy/s}}\approx 12\,\text{instr.}/\text{cy}~. It’s not just eight times the result above because the loop mechanics has much more impact when the loop is not unrolled.

The two codes have the same performance in the saturated case, but the scalar code has a much “better” (i.e., higher) IPC. It’s not always this pronounced, but one can easily imagine the IPC ratio going in both directions when comparing “baseline” with “optimized” code if there is no concept of what “good performance” means. In the example above, the roofline model provided a clear guideline. If you don’t have that and rely on IPC alone instead, any comparison is useless.

Vectorization is just one of many things a compiler can do to the code, and which may change the IPC value in an uncontrolled way that has nothing to do with performance. Although this topic pops up time and again in many of your tutorials, an article in the inaugural issue of the ACM Transactions on Parallel Computing has raised my interest [1]. The paper is actually about using data structure metrics for performance optimization, but one of their examples made me scratch my head:

// fused loop
#pragma omp parallel for
for(i=0; i<N ; ++i)
  w[i] = a1[i]+a2[i]+a3[i]+a4[i]+a5[i]+a6[i]+a7[i]+a8[i]+a9[i]+a10[i];

(I have substituted the generic parallelization directives in the paper by OpenMP and translated the code to C). This loop, the authors claim, has a problem because of the many concurrent read streams (11 actually, counting the write-allocate on w[]) the architecture must handle. It would thus be better to split it:

// split loop
#pragma omp parallel
{
#pragma omp for
  for(i=0; i<N ; ++i)
    w[i] = a1[i]+a2[i]+a3[i]+a4[i]+a5[i];
#pragma omp for 
  for(i=0; i<N ; ++i)
    w[i] += a6[i]+a7[i]+a8[i]+a9[i]+a10[i];
}

Indeed, loop splitting (just like the reverse, loop fusion) is a standard technique. It reduces the intricacy of loop code, avoiding, e.g., register spills, compiler confusion, and, of course, too many concurrent streams. The paper then shows CPI data for the loop running with an in-memory working set on an eight-core Sandy Bridge processor, observing a >2x reduction in CPI (i.e., a >2x increase in IPC) for the above case when the loop is split. Runtime or “work per time” data is missing, i.e., the claim about splitting being good for performance rests on the CPI data alone. I could not reproduce their experimental setup completely because I don’t have a working Intel compiler 11.1 any more, but this does not really change the message: IPC is not a performance metric.

What I did was run the codes on an Ivy Bridge CPU (10 cores, but otherwise quite similar to Sandy Bridge; fixed 2.2 GHz clock speed) using a range of compilers (Intel 12.0 to 17.0). This is what my code prints out (download see below [2]):

$ likwid-pin -q -c S0:0-9 ./SPLITME.exe
Split performance: 375 MIt/s, BW: 41.97 GB/s
Fused performance: 438 MIt/s, BW: 42.08 GB/s

The bandwidth numbers are calculated using the known code balance of both loops (96 byte/iteration and 112 byte/iteration, respectively). likwid-perfctr with the MEM group reports something very close to this. LIKWID code markers are in the code so that the two regions can be measured separately. The result did not change much between compilers: As expected, the performance is limited by the main memory bandwidth (about 42 Gbyte/s in my case) for both versions, and the performance in Gflop/s was better for the fused version by about 17% (16.7% are expected because of the increase in code balance when splitting the loop). The CPI (as measured with likwid-perfctr) was even slightly larger for the split loop version (6.7 vs. 6.2); although it has slightly more instructions (one LOAD and one STORE on the left-hand side), the performance loss is more significant. I have also repeated the experiment on an older six-core Westmere EP processor, with identical results (of course, the actual numbers differ: it has less than half the memory bandwidth of the IVB and only SSE4.2 vector extensions).

I am not saying that an excessive number of streams may not be detrimental to performance; it actually is, and this is well known in the Lattice-Boltzmann community [3]. My only point here is that CPI (or IPC) is neither a useful nor a reliable performance metric. Performance is work divided by wall-clock time, and “number of instructions” is no good for quantifying “work” except in very narrow areas. Another striking example is the observed CPI in load-imbalanced OpenMP codes. If you want to learn more about this, come visit one of our Node-Level Performance Engineering tutorials.

 

[1] A. Rane and J. Browne: Enhancing Performance Optimization of Multicore/Multichip Nodes with Data Structure Metrics. ACM Trans. Parallel Comput. 1(1), 3:1-3:20 (2014). DOI: 10.1145/2588788

[2] Download: splitme.tar.gz

[3] M. Wittmann, G. Hager, T. Zeiser, J. Treibig, and G. Wellein: Chip-level and multi-node analysis of energy-optimized lattice-Boltzmann CFD simulations. Concurrency and Computation: Practice and Experience 28(7), 2295-2315 (2015). DOI: 10.1002/cpe.3489 Preprint: arXiv:1304.7664

Node-Level Performance Engineering tutorial to be featured again at SC17

Our popular “Node-Level Performance Engineering” full-day tutorial has been accepted again (now the sixth time in a row!) for presentation at SC17, the International Conference for High Performance Computing, Networking, Storage and Analysis. We teach the basics of node-level computer architecture, analytic performance modeling (via the Roofline model), and model-guided optimization. Watch this cool video to whet your appetite:

When: November 12, 2017, 8:30am-5:00pm

Where: Colorado Convention Center, Denver, CO.

 

Benchmarking the memory hierarchy of the Intel “Skylake” Xeon E3-1275 v5 using the vector triad

The vector triad results for the AMD Ryzen 1700X in my previous post allowed some interesting conclusions, but it is more instructive to compare them with a current Intel CPU. We have a desktop PC with an Intel “Skylake” Xeon E5-1275 v5 and dual-channel DDR4-2133 memory. This model was introduced in Q4/2015, and it does not support AVX-512 instructions. Although it has only four cores, it is a good fit because it has a similar price tag and power dissipation as the Ryzen 1700X, and it is also built with 14nm technology. The slightly lower theoretical memory bandwidth of 34.1 GB/sec does not really make a difference in practice.

The game is the same as before: I have fixed the clock frequency to 3.0 GHz, and the benchmark is the “throughput-mode vector triad”:

S = get_walltime()
do r=1,NITER
  do i=1,N
    A(i) = B(i) + C(i) * D(i)
  enddo
enddo
WT = get_walltime() - S
MFLOPS = 2.d0 * N * NITER / WT / 1.d6

This was run in parallel, without work sharing, on 1 to 4 cores. Again I have checked that FMA instructions make no difference whatsoever for this extremely data-bound code, and that likwid-bench delivers the same performance levels. An important architectural detail of the Skylake is worth mentioning: Just like its predecessors Haswell and Broadwell, it has a third address generation unit (AGU) for STOREs, hooked to execution port 7. In theory the core can thus sustain two AVX LOADs and one AVX STORE per cycle, doubling the per-cycle data throughput compared to Sandy Bridge and Ivy Bridge. However, Haswell and Broadwell had a limitation on this AGU: It could only process “simple” addresses, i.e., none of the complex addressing modes the Intel compiler uses by default (e.g., [rsi+8*rdx+32]) was able to use the AGU. The compiler (I tried up to version 17.0 update 2) does not know about this and refuses to choose simple addresses at least for the STOREs. By hacking the assembly code I could produce a version using simple addresses for the STOREs; see below for the consequences.

Throughput-mode vector triad on the Skylake chip vs. array length with 1, 2, and 4 cores. Inset: Performance for an in-memory dataset vs. number of cores. The bandwidth numbers assume a code balance of 20 B/flop in all hierarchy levels except L1.

The figure shows the results. I have chosen the same x and y axis ranges as in the Ryzen post, so the comparison should be straightforward:

In the L1 cache the data throughput is at 64 B/cy because the core can sustain either two AVX LOADs or one AVX LOAD and one AVX STORE per cycle. This proves that the third AGU at port 7 cannot be used with compiler-generated code. My “hacked” variant with simple addresses (only using [rcx] for the STOREs, shown as a dotted line) can get to the theoretical limit of 85.3 B/cy (not 96 because the bottleneck is the LOAD port – see any of our recent NLPE tutorials to understand why). But even without port 7 and AVX-512 the Skylake core can outperform the Ryzen by a factor of two per cycle. As a consequence, Skylake can reach the same L1 performance as Ryzen with four (or three, with some port 7 tweaking) instead of eight cores. Well, this was not entirely unexpected.

The large performance breakdown from L1 to L2 is characteristic of Intel architectures because of their “single-ported” L1 cache: In any given cycle, the cache can either talk to the registers or to the L2 cache, but not both. This “feature” was the starting point of our ECM performance model. On Skylake (and Broadwell) the increased L2 bandwidth of 64 B/cy, which was in theory available since Haswell,  can actually be observed [1] (indirectly, of course, due to the non-overlapping L1).  Despite its lower L1 performance, a Ryzen core can easily keep up with the Skylake because of its overlapping cache hierarchy. Due to its eight cores, the Ryzen 1700X has a massively better aggregate L2 performance, of course, and tops out at over 256 B/cy, whereas the Skylake chip stays at just below 160 B/cy.

The L3 cache on the Skylake is, as opposed to Ryzen, bandwidth-scalable up to all 4 cores. The Ryzen’s two “compute complexes” save its neck here because two L3s have two times the bandwidth, so the overall performance level is about 1.6x higher than on Skylake. One could argue that it is unfair to compare a four-core with an eight-core part, but I don’t actually care – this is not a bang-for-the-buck shootout but an architectural comparison.

The memory bandwidth (inset in the figure) shows similar characteristics to the Ryzen: A single core can almost saturate the bandwidth. Actually, if you let Turbo Mode do its thing there is hardly a discernible speedup in memory bandwidth for multiple cores. The Skylake achieves about 84% of the theoretical peak memory bandwidth for this benchmark. Just as on Ryzen, the efficiency goes up dramatically with a read-only benchmark and tops out at an impressive 97%.

Summing up, the Ryzen CPU cannot keep up with current Intel designs in terms of in-core performance with AVX(2) code. On the other hand, its overlapping cache hierarchy and good memory bandwidth (for a desktop chip) can regain a lot of lost ground despite the non-scalable L3 cache. I’m looking forward to the upcoming Naples CPU, although its value for HPC is probably questionable.

 

[1] J. Hofmann, D. Fey, M. Riedmann, J. Eitzinger, G. Hager, and G. Wellein: Performance analysis of the Kahan-enhanced scalar product on current multi- and manycore processors. Concurrency & Computation: Practice & Experience (2016). Available online, DOI: 10.1002/cpe.3921. Preprint: arXiv:1604.01890

 

Benchmarking the memory hierarchy of the new AMD Ryzen CPU using the vector triad

We have recently acquired the brand-new AMD Ryzen 1700X (8 cores, 3.4+ GHz). Reading all the random numbers that people produce with their benchmarks made me sick, so I decided to run something that is well understood: The Schönauer vector triad. Everyone who knows our lectures and tutorials also knows this simple benchmark:

S = get_walltime()
do r=1,NITER
  do i=1,N
    A(i) = B(i) + C(i) * D(i)
  enddo
enddo
WT = get_walltime() - S
MFLOPS = 2.d0 * N * NITER / WT / 1.d6

Performance is reported in Mflops/sec for a wide range of problem sizes N, and NITER is chosen such that the benchmark runs for a sufficiently long time, long enough so that the wall-clock time measurement is accurate. On top of it we usually run the whole thing on different numbers of cores (using an OMP PARALLEL directive, but without worksharing on the inner loop) so we can evaluate the scalability of the different data paths without any adverse effects from OpenMP overhead. All this can be done very easily and efficiently using the likwid-bench microbenchmarking tool, but I have done the triad runs with my own Fortran-based code and checked afterwards that likwid-bench gives the same results.

The Ryzen system is equipped with 2x 16 GB of DDR4-2400 DRAM for a maximum theoretical memory bandwidth of 38.4 GB/sec. For the measurements I fixed the clock speed to 3.0 GHz, although the nominal clock speed is 3.4 GHz. The reason is simply my current inability to keep the CPU from running in Turbo mode when I set it to 3.4 GHz; it doesn’t matter for the benchmarks, though, because all in-cache numbers will be converted to bytes per cycle, and the memory bandwidth does not (significantly) depend on the clock frequency. I used the Intel compiler version 17.0 Update 2, which generates very decent AVX code on the chip if you give it the -Ofast -xHost options. Using FMAs would not make a difference because the Ryzen core executes a full-width 256-bit FMA in two chunks of 128 bits (and I have checked using likwid-bench that the FMA code runs exactly as fast as the non-FMA version). Besides, this code is totally bound by the data transfers even with data in the L1 cache. I have not used non-temporal store instructions; this is a topic for a future post.

Throughput-mode vector triad on the Ryzen chip vs. array length with 1, 2, 4, and 8 cores. Inset: Performance vs. number of cores for an in-memory dataset. The bandwidth numbers assume a code balance of 20 B/flop in all hierarchy levels except L1.

Now for the data. The figure shows the usual stuff: Performance vs. array size for different numbers of threads (i.e., physical cores – SMT was ignored for this test). Here is my interpretation of the data:

With data in the L1 cache we get very close to 6 Gflops/sec per core, which amounts to a limit of 32 B/cy. Actually, the slide I stumbled across in the AMD booth at SC16 says “2 x 16B LOAD”, while another one says “2 LOADs & 1 STORE per cycle”. I don’t know how they want to sustain this with only two address generation units (AGUs), and the measured data is in line with a limit of “two 16-B LOADs OR one 16-B LOAD and one 16-B STORE per cycle,” identical to SSE limits on Intel Sandy Bridge and Ivy Bridge. If anyone can tell me how to get to the 8 Gflops/sec that can be expected if the 2LD+1ST story were true, go ahead. (Update: The two AGUs would be OK for one AVX LOAD and one half AVX STORE per cycle, just as on Sandy and Ivy Bridge; this would also lead to a maximum triad performance of 8 flops in 3 cycles. However, I can’t see it although the code is definitely AVX).

Between N=1024 and N=16384 the data comes from the 512 KiB per-core L2 cache. The drop in performance occurs at the same N independently of the core count, because we run the benchmark in “throughput mode”, i.e., with the same data size N on each core (think “weak scaling”). You can’t see it in the single-core data on this graph, but the 8-core runs tells us that the L2 actually delivers a little more than 32 B/cy. I have assumed here that there is a write-allocate for every store miss in the L1, increasing the code balance from 16 B/flop to 20 B/flop. The AMD slides linked above are a little cloudy on the details, they only state “32 B/cy” above two arrows between L1 and L2, so it isn’t clear whether this is 32 B/cy per direction or not. Another explanation could be that the write-allocate is optimized away by the hardware if it detects in its line fill buffers a write miss on a cache line that will be completely overwritten. We don’t know for sure, but once likwid-perfctr is ported to the architecture it should be able to shed some light on this.

Another important observation about the L2 (and also the L3): Data transfers overlap quite efficiently with retired LOADs in the L1. If you know the ECM performance model [1] you know that this is not at all the case on current Intel CPUs, making the AMD caches very effective (the IBM Power8 also behaves like this [2]).

The L3 cache is not core-local, but 8 MiB of shared L3 are available to each of the two four-core “core complexes” (CCX). The data clearly reflects this organization: Since I have pinned the threads in core order, i.e., filling one CCX first and then the other, this means we observe a performance drop beyond N=218 on a single core and beyond N=216 on the four cores of one CCX. Nothing changes from four to eight, however, because we get another 8 MiB of L3.

The most important observation about the L3 cache is that its bandwidth does not scale from one to four cores. There is only a slight speedup from 2 to 4, and it tops out at slightly above 64 B/cy assuming a code balance of 20 B/flop. The L3 is also a victim cache, i.e., only cache lines overflowing from the higher-level caches end up there. We will see how the efficiency of the L3 can be influenced by optimizations (such as, e.g., prefetch instructions, which I haven’t used at all here). The L3 bandwidth scales, of course, from 4 to 8 cores simply because there are two L3s.

Finally, the memory bandwidth behavior. This chip has the (IMO very attractive) property that a single core can almost saturate the memory bandwidth of the full chip. Intel desktop CPUs (but not the Xeons) behave in a similar way. The inset in the figure shows performance vs. number of cores for large N. The best bandwidth on a single CCX is obtained with two cores, and there is a slight drop for 3 and 4. We may speculate that this is a faint echo of the non-scalable L3 cache, but there may be many other reasons. With all eight cores we top out at just below 30 GB/sec, which is 77% of the theoretical memory bandwidth (dashed line in the inset). A read-only benchmark yields almost 90%, so it seems that the Ryzen also shares this feature with current Intel chips: the more read-dominated the code, the closer we can get to the peak memory bandwidth. The best overall chip bandwidth is obtained with two threads, one running on each CCX.

My next post will show the same test on a Skylake-type chip for comparison.

[1] H. Stengel, J. Treibig, G. Hager, and G. Wellein: Quantifying performance bottlenecks of stencil computations using the Execution-Cache-Memory model. Proc. ICS15, the 29th International Conference on Supercomputing, June 8-11, 2015, Newport Beach, CA. DOI: 10.1145/2751205.2751240. Preprint: arXiv:1410.5010

[2] J. Hofmann, D. Fey, M. Riedmann, J. Eitzinger, G. Hager, and G. Wellein: Performance analysis of the Kahan-enhanced scalar product on current multi- and manycore processors. Concurrency & Computation: Practice & Experience (2016). Available online, DOI: 10.1002/cpe.3921. Preprint: arXiv:1604.01890

Stepanov test faster than light?

If you program in C++ and care about performance, you have probably heard about the Stepanov abstraction benchmark [1]. It is a simple sum reduction that adds 2000 double-precision floating-point numbers using 13 code variants. The variants are successively harder to see through by the compiler because they add layers upon layers of abstractions. The first variant (i.e., the baseline) is plain C code and looks like this:

// original source of baseline sum reduction
void test0(double* first, double* last) {
  start_timer();
  for(int i = 0; i < iterations; ++i) {
    double result = 0;
    for (int n = 0; n < last - first; ++n) result += first[n];
    check(result);
  }
  result_times[current_test++] = timer();
}

It is quite easy to figure out how fast this code can possibly run on a modern CPU core. There is one LOAD and one ADD in the inner loop, and there is a loop-carried dependency due to the single accumulation variable result. If the compiler adheres to the language standard it cannot reorder the operations, i.e., modulo variable expansion to overlap the stalls in the ADD pipeline is ruled out. Thus, on a decent processor such as, e.g., a moderately modern Intel design, each iteration will take as many cycles as there are stages in the ADD pipeline. All current Intel CPUs have an ADD pipeline of depth three, so the expected performance will be one third of the clock speed in GFlop/s.

If we allow some deviation from the language standard, especially unsafe math optimizations, then the performance may be much higher, though. Modulo variable expansion (unrolling the loop by at least three and using three different result variables) can overlap several dependency chains and fill the bubbles in the ADD pipelines if no other bottlenecks show up. Since modern Intel CPUs can do at least one LOAD per cycle, this will boost the performance to one ADD per cycle. On top of that, the compiler can do another modulo variable expansion for SIMD vectorization. E.g., with AVX four partial results can be computed in parallel in a 32-byte register. This gives us another factor of four.

Original baseline assembly code
-O3 -march=native -O3 -ffast-math -march=native
  vxorpd %xmm0, %xmm0, %xmm0
.L17:
  vaddsd (%rax), %xmm0, %xmm0
  addq $8, %rax
  cmpq %rbx, %rax
  jne .L17
  vxorpd %xmm1, %xmm1, %xmm1
.L26:
  addq $1, %rcx
  vaddpd (%rsi), %ymm1, %ymm1
  addq $32, %rsi
  cmpq %rcx, %r13
  ja .L26
  vhaddpd %ymm1, %ymm1, %ymm1
  vperm2f128 $1, %ymm1, %ymm1, %ymm3
  vaddpd %ymm3, %ymm1, %ymm1
  vaddsd %xmm1, %xmm0, %xmm0

Now let us put these models to the test. We use an Intel Xeon E5-2660v2 “Ivy Bridge” running at a fixed clock speed of 2.2 GHz (later models can run faster than four flops per cycle due to their two FMA units). On this CPU the Stepanov peak performance is 8.8 GFlop/s for the optimal code, 2.93 GFlop/s with vectorization but no further unrolling, 2.2 GFlop/s with (at least three-way) modulo unrolling but no SIMD, and 733 MFlop/s for standard-compliant code. The GCC 6.1.0 was used for all tests, and only the baseline (i.e., C) version was run.
Manual assembly code inspection shows that the GCC compiler does not vectorize or unroll the loop unless -ffast-math allows reordering of arithmetic expressions. Even in this case only SIMD vectorization is done but no further modulo unrolling, which means that the 3-stage ADD pipeline is the principal bottleneck in both cases. The (somewhat cleaned) assembly code of the inner loop for both versions is shown in the first table. No surprises here; the vectorized version needs a horizontal reduction across the ymm1 register after the main loop, of course (last four instructions).

Original baseline code performance
g++ options Measured [MFlop/s] Expected [MFlop/s]
-O3 -march=native 737.46 733.33
-O3 -ffast-math -march=native 2975.2 2933.3

In defiance of my usual rant I give the performance measurements with five significant digits; you will see why in a moment. I also selected the fastest among several measurements, because we want to compare the highest measured performance with the theoretically achievable performance. Statistical variations do not matter here. The performance values are quite close to the theoretical values, but there is a very slight deviation of 1.3% and 0.5%, respectively. In performance modeling at large, such a good coincidence of measurement and model would be considered a success. However, the circumstances are different here. The theoretical performance numbers are absolute upper limits (think “Roofline”)! The ADD pipeline depth is not 2.96 cycles but 3 cycles after all. So why is the baseline version of the Stepanov test faster than light? Can the Intel CPU by some secret magic defy the laws of physics? Is the compiler smarter than we think?

A first guess in such cases is usually “measuring error,” but this was ruled out: The clock speed was measured by likwid-perfctr to be within 2.2 GHz with very high precision, and longer measurement times (by increasing the number of outer iterations) do not change anything. Since the assembly code looks reasonable, the only conclusion left is that the dependency chain on the target register, which is completely intact in the inner loop, gets interrupted between iterations of the outer loop because the register is assigned a new value. The next run of the inner loop can thus start already before the previous run has ended, leading to overlap. A simple test supports this assumption: If we cut the array size in half, the relative deviation doubles. If we reduce it to 500, the deviation doubles again. This strongly indicates an overlap effect (absolute runtime reduction) that is independent of the actual loop size.

In order to get a benchmark that stays within the light speed limit, we modify the code so that the result is only initialized once, before the outer loop:

// modified code with intact (?) dependency chain
void test0(double* first, double* last) {
  start_timer();
  double result = 0; // moved outside
  for(int i = 0; i < iterations; ++i) {
    for (int n = 0; n < last - first; ++n) result += first[n];
    if(result<0.) check(result);
  }
  result_times[current_test++] = timer();
}

The result check is masked out since it would fail now, and the branch due to the if condition can be predicted with 100% accuracy. As expected, the performance of the non-SIMD code now falls below the theoretical maximum. However, the SIMD code is still faster than light.

Modified baseline code performance
g++ options Measured [MFlop/s] Expected [MFlop/s]
-O3 -march=native 733.14 733.33
-O3 -ffast-math -march=native 2985.1 2933.3

How is this possible? Well, the dependency chain is doomed already once SIMD vectorization is done, and the assembly code of the modified version is very similar to the original one. The horizontal sum over the ymm1 register puts the final result into the lowest slot of ymm0, so that ymm1 can be initialized with zero for another run of the inner loop. From a dependencies point of view there is no difference between the two versions. Accumulation into another register is ruled out for the standard-conforming code because it would alter the order of operations. Once this requirement has been dropped, the compiler is free to choose any order. This is why the -ffast-math option makes such a difference: Only the standard-conforming code  is required to maintain an unbroken dependency chain.

Of course, if the g++ compiler had the guts to add another layer of modulo unrolling on top of SIMD (this is what the Intel V16 compiler does here), the theoretical performance limit would be ADD peak (four additions per cycle, or 8.8 GFlop/s). Such a code must of course stay within the limit, and indeed the best Intel-compiled code ends up at 93% of peak.

Note that this is all not meant to be a criticism of the abstraction benchmark; I refuse to embark on a discussion of religious dimensions. It just happened to be the version of the sum reduction I have investigated closely enough to note a performance level that is 1.3% faster than “the speed of light.”

[1] http://www.open-std.org/jtc1/sc22/wg21/docs/D_3.cpp