Georg Hager's Blog

Random thoughts on High Performance Computing

Content

And the 2018 Gauss Award goes to: Erlangen!

The Gauss Award is sponsored by the Gauss Centre for Supercomputing (GCS), which is a collaboration of the German national supercomputing centers at Garching, Jülich and Stuttgart. The winner receives a cash prize of 3,000 €, courtesy of the Gauss Center, which is traditionally presented during the ISC Conference Opening Session. This year, the Gauss Award committee has selected a paper which reports on the outcome of a collaboration between the Chair for Computer Architecture and the HPC group at the computing center (RRZE) of FAU:

Johannes Hofmann, Georg Hager, and Dietmar Fey:

On the Accuracy and Usefulness of Analytic Energy Models for Contemporary Multicore Processors

In this paper we have expanded the execution-cache-memory (ECM) performance model developed at RRZE to describe more accurately the saturation behavior of memory-bound code when the number of cores in increased. Together with an improved power consumption model, which takes into account frequency- (and thus voltage-) dependent static power dissipation and the presence of a separate Uncore clock domain in recent Intel CPUs, we can now very accurately describe the performance and the energy consumption of steady-state loops over a wide range of clock frequency settings and core numbers. Although the paper mostly deals with Xeon Intel Sandy Bridge and Broadwell CPUs and “simple” kernels such as STREAM and DGEMM, our models work very well for other architectures (like, e.g., the AMD Epyc) and codes (such as stencil algorithms), too.

Johannes Hofmann will present our work during the ISC award session on June 25, 2018: https://2018.isc-program.com/?page_id=10&id=pap122&sess=sess201

You can download a (pre-review) preprint of the paper at arXiv:1803.01618.

LIKWID marker overhead and “Meltdown” patches

The Marker API of likwid-perfctr lets you count hardware events on your CPU core(s) separately for different execution regions. E.g., in order to count events for a loop, you would use it like this:

#include <likwid.h>

int main(...) {
  // always required once
  LIKWID_MARKER_INIT;
  // ...
  LIKWID_MARKER_START("loop");
  for(int i=0; i<n; ++i) {
    do_some_work();
  }
  LIKWID_MARKER_STOP("loop");
  // ...
  LIKWID_MARKER_CLOSE;
  return 0;
}

An arbitrary number of regions is allowed, and you can use the LIKWID_MARKER_START and LIKWID_MARKER_STOP macros in parallel regions to get per-core readings. The events to be counted are configured on the likwid-perfctr command line. As with anything that is not part of the actual work in a code, one may ask about the cost of the marker API calls. Do they impact the runtime of the code? Does the number of cores play a role?

I ran benchmark tests on a dual-Xeon “Broadwell” E5-2697 v4 (2.3 GHz) node with 18 cores per socket in non-CoD mode. The OS was Ubuntu 16.04.4 LTS with kernel 4.4.0-119-generic and the latest patches. The clock speed (core and Uncore) was set to 2.3 GHz.

This is the benchmark code (slightly simplified – see below for a download link):

LIKWID_MARKER_INIT;
#pragma omp parallel
{
 LIKWID_MARKER_REGISTER("iter");
}

NITER=1;
do {
  // time measurement
  timing(&wct_start);

  #pragma omp parallel private(k)
  {
    for(k=0; k<NITER; ++k) {
      LIKWID_MARKER_START("iter");
 
      if(divsleep(200000) < 0.0)
        exit(1);

      LIKWID_MARKER_STOP("iter");
    }
  }
  timing(&wct_end);
  NITER = NITER*2;
} while (wct_end-wct_start < 0.2);

NITER = NITER/2;

Figure 1: LIKWID marker overhead vs. number of cores (compact pinning, physical cores only) for four different metric groups on the 36-core Broadwell system (LIKWID 4.3.2).

The divsleep() function performs the indicated number of floating-point divides and is a reliable busy-waiting in-core routine that always takes the same amount of time (about 4 cycles per scalar divide on a Broadwell CPU, 7 on an Ivy Bridge). Running the benchmark without markers activated (i.e., without the -m option to likwid-perfctr) , it reports exactly the expected number of cycles (800k on BDW, 1.4 million on IVY). The marker calls return immediately in this case, and of course the runtime does not depend on the number of OpenMP threads. The macro LIKWID_MARKER_REGISTER avoids excessive overhead when LIKWID_MARKER_START is called for the first time. In our case it’s not strictly necessary since the actual measurement is taken multiple times with increasing NITER anyway, but in general it’s a good idea to call LIKWID_MARKER_REGISTER for every marker region that is used later in the code.

With markers activated, the overhead becomes visible. I subtracted the constant runtime of the delay routine from the measured runtime. The result has a slight statistical variation, but the chosen runtime of at least 0.2 seconds makes the measurements quite reproducible.

Figure 1 shows the resulting overhead cycles on the Broadwell system versus the number of threads (compact pinning, physical cores only) with the current version of LIKWID (4.3.2). Out of the considerable number of available metric groups in LIKWID, I chose four: DATA (core-only, loads/stores), CACHES (full account of cacheline traffic through the memory hierarchy), ENERGY (RAPL measurements of energy consumption), and MEM_DP (combination of flops, memory traffic, and energy). These are the relevant observations:

Figure 2: Overhead for the LIKWID marker API calls on a patched Ivy Bridge node (LIKWID 4.3.1).

  1. The overhead depends weakly on the number of cores. There is a noticeable jump when the second socket is involved, but the overhead cycles always stay in the same ballpark.
  2. The overhead depends strongly on the events that are counted. Roughly speaking, “the more events, the more expensive.” CACHES is most toxic in this respect, with over 5 million cycles on the whole node.
  3. The marker overhead is much larger than typical OpenMP overheads (which are around a couple of 1000s cycles for barriers, for instance, when using the Intel OpenMP runtime).

Since starting and stopping counters requires access to the MSR registers and/or PCI devices (for Uncore events), do the recent microcode and kernel patches to mitigate the “Meltdown” hardware vulnerability scenario worsen the overhead? All our test cluster machines have the latest patches installed, but we have kept one node of our Ivy-Bridge-based Emmy cluster in a non-patched state.  Emmy has Intel Xeon “Ivy Bridge” E5-2660 v2 CPUs at 2.2 GHz and runs CentOS with kernel 3.10.0-693.21.1.el7.x86_64; on the non-patched node we run 3.10.0-693.11.6.el7.x86_64. There are also no microcode patches on the latter, so it is in the state it was in before Meltdown was discovered. We don’t have the latest LIKWID version on Emmy yet, but 4.3.1 is available, and there haven’t been any changes recently that would influence the marker overhead.

Figure 3: LIKWID marker overhead on a non-patched Ivy Bridge node (LIKWID 4.3.1).

Figure 2 shows the results for the same experiment as above on a patched node. The general observations are the same, even the relative overhead between different metric groups is comparable, and we are also in the range between 1 and 5 million cycles per start/stop pair.

Figure 3 shows the results on the non-patched node. The overhead is roughly a factor of two smaller. So, while the Meltdown patches haven’t let the marker overhead grow without bounds, they still caused a significant increase.

A couple of million cycles may be insignificant for many applications, but this is in the millisecond range – knowing the cost enables us to calculate how fine-grained the marker regions may become before impacting the code performance.

Benchmark code for download: likwid_marker_overhead.zip. The makefile assumes that the variables LIKWID_LIB and LIKWID_INC are set appropriately, which is done automatically by our modules system. Adapt as needed.

 

LIKWID 4.3.2 released

LIKWID ToolsLIKWID 4.3.2 has just been released. This is an important maintenance release if you need support for the latest architectures:

  • Support for Intel Knights Mill (core, RAPL, Uncore)
  • AMD Zen: Use RETIRED_INSTRUCTIONS instead of fixed-purpose counter for metric calculation
  • Intel Skylake X: Some fixes for events and performance groups

Bug fixes and minor enhancements:

  • Set KMP_INIT_AT_FORK to bypass bug in Intel OpenMP memory allocator
  • All FLOPS_* groups now have vectorization ratio
  • Fix for Marker API with perf_event backend
  • Fix for maximal/minimal Uncore frequency
  • Skip counters that are already in use, don’t exit
  • likwid-mpirun: minor fix when overloading a host
  • Improved detection of PCI devices
  • Fix in internal metric calculator

You can either clone the git repository or download the package from our FTP server.

 

Himeno stencil benchmark: ECM model, SIMD, data layout

In a previous post I have shown how to construct and validate a Roofline performance model for the Himeno benchmark. The relevant findings were:

  • The Himeno benchmark is a rather standard stencil code that is amenable to the well-known layer condition analysis. For in-memory data sets it achieves a performance that is well described by the Roofline model.
  • The performance potential of spatial blocking is limited to about 10% in the saturated case (on a Haswell-EP socket), because the data transfers are dominated by coefficient arrays with no temporal reuse.
  • The large number of concurrent data streams through the cache hierarchy and into memory does not hurt the performance, at least not too much. We had chosen a version of the code which was easy to vectorize but had a lot of parallel data streams (at least 15, probably more if layer conditions are broken).

Some further questions pop up if you want more insight: Is SIMD vectorization relevant at all? Does the data layout matter? What is the single-core performance in relation to the saturated performance, and why? All these questions can be answered by a detailed ECM model, and this is what we are going to do here. This is a long post, so I provide some links to the sections below:

Hardware and code

I will assume that the reader is familiar with stencil analysis, layer conditions, and the ECM model for Intel x86 CPUs. A nice intro to the ECM model in the context of stencil codes is given in [1]. You should also read my post about the Roofline model for Himeno. Most of the manual analysis below was double-checked with our kerncraft tool for automatic loop performance modeling and benchmarking [3].

We start with the same code as in the Roofline analysis: C with SIMD-friendly data layout (in contrast to the original code). This is the hardware:

  • Xeon Haswell E5-2695v3, CoD mode, 14 cores per socket (7 per ccNUMA domain)
  • Cache sizes: 32 KiB L1 per core, 256 KiB L2 per core, 17.5 MiB shared L3 for 7 cores
  • Memory bandwidth per ccNUMA domain: 28.1 GB/s with Schönauer vector triad
    (measured with likwid-bench)
  • Clock frequency (core and Uncore) fixed to 2.3 GHz via likwid-setFrequencies. Forgetting to fix the Uncore clock speed is a frequent source of errors in benchmarking modern Intel CPUs. The separate Uncore clock domain exists since Haswell.

As before, we only look at the main loop and ignore the copy-back part, since that can be easily eliminated. The center loop nest is the following:

// SIMD-friendly data layout
for(int i=1 ; i<imax-1 ; ++i)
  for(int j=1 ; j<jmax-1 ; ++j)
    for(int k=1 ; k<kmax-1 ; ++k){
      // short index on a, b, c was moved up front
      s0 = a[0][i][j][k] * p[i+1][j ][k ]
         + a[1][i][j][k] * p[i ][j+1][k ]
         + a[2][i][j][k] * p[i ][j ][k+1]
         + b[0][i][j][k] * ( p[i+1][j+1][k ] - p[i+1][j-1][k ]
                           - p[i-1][j+1][k ] + p[i-1][j-1][k ] )
         + b[1][i][j][k] * ( p[i ][j+1][k+1] - p[i ][j-1][k+1]
                           - p[i ][j+1][k-1] + p[i ][j-1][k-1] )
         + b[2][i][j][k] * ( p[i+1][j ][k+1] - p[i-1][j ][k+1]
                           - p[i+1][j ][k-1] + p[i-1][j ][k-1] )
         + c[0][i][j][k] * p[i-1][j ][k ]
         + c[1][i][j][k] * p[i ][j-1][k ]
         + c[2][i][j][k] * p[i ][j ][k-1]
         + wrk1[i][j][k];
      ss = ( s0 * a[3][i][j][k] - p[i][j][k] ) * bnd[i][j][k];
      gosa = gosa + ss*ss;
      wrk2[i][j][k] = p[i][j][k] + omega * ss;
    }

The inner loop index k is also the last index on all arrays, which makes SIMD vectorization rather easy. If the short index on the arrays a, b, and c is moved to the back, this is more of a challenge since the accesses to those now become strided with respect to the inner loop index:

// SIMD-unfriendly data layout
for(int i=1 ; i<imax-1 ; ++i)
  for(int j=1 ; j<jmax-1 ; ++j)
    for(int k=1 ; k<kmax-1 ; ++k){
      // short index on a, b, c in original position
      s0 = a[i][j][k][0] * p[i+1][j ][k ]
         + a[i][j][k][1] * p[i ][j+1][k ]
         + a[i][j][k][2] * p[i ][j ][k+1]
         + b[i][j][k][0] * ( p[i+1][j+1][k ] - p[i+1][j-1][k ]
                           - p[i-1][j+1][k ] + p[i-1][j-1][k ] )
         + b[i][j][k][1] * ( p[i ][j+1][k+1] - p[i ][j-1][k+1]
                           - p[i ][j+1][k-1] + p[i ][j-1][k-1] )
         + b[i][j][k][2] * ( p[i+1][j ][k+1] - p[i-1][j ][k+1]
                           - p[i+1][j ][k-1] + p[i-1][j ][k-1] )
         + c[i][j][k][0] * p[i-1][j ][k ]
         + c[i][j][k][1] * p[i ][j-1][k ]
         + c[i][j][k][2] * p[i ][j ][k-1]
         + wrk1[i][j][k]; 
      ss = ( s0 * a[i][j][k][3] - p[i][j][k] ) * bnd[i][j][k]; 
      gosa = gosa + ss*ss; 
      wrk2[i][j][k] = p[i][j][k] + omega * ss; 
    }

However, all cache lines are still fully used, so the Roofline model does not change (assuming everything is still strongly bandwidth bound). In the following analysis  we will look at both code versions; as seen from the ECM model they do not differ in the data transfer volume but only in the in-core part.

In-core model

The kernel executes 13 multiplications and 21 additions or subtractions. Some of those are fused by the compiler into FMAs. Just by looking at the code we need 32 loads and one store, but the massive amount of memory references may lead to some spilling, presumably with registers holding array base addresses. This is the inner loop code that the Intel compiler generates for the SIMD-friendly code (version 17.0 Update 5, compile options -std=c99 -O3 -xCORE-AVX2 -fno-alias):

# SIMD-friendly layout (Intel 17.0up5)
mov r15, qword ptr [rsp+0x210]
vmovups ymm8, ymmword ptr [r10+r12*4+0x8]
vmovups ymm15, ymmword ptr [r9+r12*4+0x8]
vmovups ymm12, ymmword ptr [r15+r12*4+0x4]
vmovups ymm4, ymmword ptr [r13+r12*4+0x4]
vmovups ymm6, ymmword ptr [r10+r12*4+0x4]
vmovups ymm7, ymmword ptr [r14+r12*4+0x8]
vmovups ymm9, ymmword ptr [r9+r12*4+0x4]
vmovups ymm10, ymmword ptr [r14+r12*4+0x4]
vsubps ymm8, ymm8, ymmword ptr [rdi+r12*4+0x8]
vsubps ymm2, ymm15, ymmword ptr [r13+r12*4+0x8]
mov r15, qword ptr [rsp+0x1d0]
vsubps ymm3, ymm2, ymmword ptr [r9+r12*4]
vmovups ymm2, ymmword ptr [rdi+r12*4+0x4]
vsubps ymm13, ymm12, ymmword ptr [r15+r12*4+0x4]
vsubps ymm12, ymm8, ymmword ptr [r10+r12*4]
vaddps ymm3, ymm3, ymmword ptr [r13+r12*4]
vsubps ymm14, ymm13, ymmword ptr [r11+r12*4+0x4]
vaddps ymm13, ymm12, ymmword ptr [rdi+r12*4]
vmovups ymm8, ymmword ptr [rsi+r12*4+0x4]
mov r15, qword ptr [rsp+0x1d8]
vfmadd213ps ymm8, ymm4, ymmword ptr [rdx+r12*4+0x4]
vaddps ymm5, ymm14, ymmword ptr [r15+r12*4+0x4]
vmovups ymm14, ymmword ptr [rax+r12*4+0x4]
vmulps ymm5, ymm5, ymmword ptr [r8+r12*4+0x4]
vmulps ymm4, ymm14, ymmword ptr [r14+r12*4]
mov r15, qword ptr [rsp+0x1e8]
vfmadd231ps ymm8, ymm13, ymmword ptr [r15+r12*4+0x4]
mov r15, qword ptr [rsp+0x1f0]
vfmadd231ps ymm4, ymm2, ymmword ptr [r15+r12*4+0x4]
mov r15, qword ptr [rsp+0x1c8]
vaddps ymm15, ymm8, ymm4
vfmadd231ps ymm5, ymm6, ymmword ptr [r15+r12*4+0x4]
mov r15, qword ptr [rsp+0x1e0]
vmulps ymm6, ymm3, ymmword ptr [r15+r12*4+0x4]
mov r15, qword ptr [rsp+0x218]
vfmadd231ps ymm6, ymm7, ymmword ptr [r15+r12*4+0x4]
vaddps ymm7, ymm5, ymm6
nop
mov r15, qword ptr [rsp+0x220]
vfmadd231ps ymm7, ymm9, ymmword ptr [r15+r12*4+0x4]
mov r15, qword ptr [rsp+0x200]
vaddps ymm2, ymm7, ymm15
vmovups ymm9, ymmword ptr [r15+r12*4+0x4]
vfmsub213ps ymm9, ymm2, ymm10
vmulps ymm3, ymm9, ymmword ptr [rcx+r12*4+0x4]
vfmadd231ps ymm10, ymm1, ymm3
# this is the reduction on gosa
vfmadd231ps ymm11, ymm3, ymm3  
vmovups ymmword ptr [rbx+r12*4+0x4], ymm10
add r12, 0x8
cmp r12, qword ptr [rsp+0x1b0]
jb 0xfffffffffffffeab

This is one full AVX iteration (eight scalar iterations). For some reason the compiler  refrains from using half-wide LOAD instructions, which makes the code very clean. As expected, some integer register spill has occurred: Register r15 is loaded ten times from the stack, for an overall load count of 42. As for arithmetic, there are 13 FMA or multiply instructions and 12 add or subtract instructions in the assembly code.

For a full cache line (16 scalar iterations) we thus have a non-overlapping time of \(T_\mathrm{nOL}=42\,\mbox{cy}\) because the core can execute two loads per cycle. Just looking at the arithmetic, the 26 FMAs and 24 add/subtracts take 37 cycles, but we need 43 cycles to generate the necessary addresses. Hence, we have \(T_\mathrm{OL}=43\,\mbox{cy}\). Intel IACA reports 46 cycles due to a backend stall. Close enough IMO. As long as \(T_\mathrm{nOL}\) and \(T_\mathrm{OL}\) are in the same ballpark, we know that the data delay will dominate anyway in the end.

This analysis assumes full instruction throughput, i.e., completely independent instructions that are fed to the execution ports as fast as they arrive. If you know your CPU architecture there is actually a little problem with that (highlighted in the assembly listing above): The sum reduction on the gosa variable causes a stall due to an inter-iteration dependency on register ymm11 . However, this little extra time can be easily hidden behind all the other stuff that’s going on in the kernel. In other words, it is not on the critical path.

How about the SIMD-unfriendly layout? The Intel compiler, in its unique way of doing everything it can to vectorize the code, generates pretty much the same arithmetic but has to account for scattered loads from the coefficient arrays:

# SIMD-unfriendly layout (Intel 17.0up5)
vmovups xmm7, xmmword ptr [rdx]
vmovups xmm10, xmmword ptr [rdx+0x10]
vmovups xmm13, xmmword ptr [rdx+0x20]
vmovups xmm5, xmmword ptr [rdx+0x30]
vmovups xmm15, xmmword ptr [rbx]
mov r15, qword ptr [rsp+0x1a8]
vinsertf128 ymm9, ymm7, xmmword ptr [rdx+0x40], 0x1
vinsertf128 ymm3, ymm10, xmmword ptr [rdx+0x50], 0x1
vinsertf128 ymm2, ymm13, xmmword ptr [rdx+0x60], 0x1
vinsertf128 ymm4, ymm5, xmmword ptr [rdx+0x70], 0x1
add rdx, 0x80
vshufps ymm7, ymm9, ymm3, 0x14
vshufps ymm14, ymm4, ymm2, 0x41
vshufps ymm6, ymm7, ymm14, 0x6c
vshufps ymm7, ymm7, ymm14, 0x39
vmovups xmm14, xmmword ptr [rbx+0x10]
vmovups xmm5, xmmword ptr [rbx+0x20]
vmovups ymm10, ymmword ptr [r9+r10*4+0x4]
vshufps ymm9, ymm9, ymm3, 0xbe
vshufps ymm0, ymm4, ymm2, 0xeb
vmovups ymm4, ymmword ptr [r14+r10*4+0x8]
vsubps ymm4, ymm4, ymmword ptr [r8+r10*4+0x8]
vsubps ymm4, ymm4, ymmword ptr [r14+r10*4]
vshufps ymm13, ymm9, ymm0, 0x6c
vshufps ymm9, ymm9, ymm0, 0x39
vmovups ymm0, ymmword ptr [rcx+r10*4+0x8]
vsubps ymm0, ymm0, ymmword ptr [rdi+r10*4+0x8]
vinsertf128 ymm12, ymm15, xmmword ptr [rbx+0x30], 0x1
vinsertf128 ymm3, ymm14, xmmword ptr [rbx+0x40], 0x1
vinsertf128 ymm5, ymm5, xmmword ptr [rbx+0x50], 0x1
add rbx, 0x60
vblendps ymm2, ymm3, ymm5, 0x22
vblendps ymm15, ymm12, ymm5, 0x44
vshufps ymm2, ymm12, ymm2, 0x6c
vshufps ymm14, ymm3, ymm15, 0x9c
vblendps ymm3, ymm12, ymm3, 0x22
vmovups ymm12, ymmword ptr [r15+r10*4+0x4]
vaddps ymm15, ymm4, ymmword ptr [r8+r10*4]
vsubps ymm4, ymm0, ymmword ptr [rcx+r10*4]
vmovups xmm0, xmmword ptr [rax]
vaddps ymm4, ymm4, ymmword ptr [rdi+r10*4]
vshufps ymm5, ymm3, ymm5, 0xc6
vsubps ymm3, ymm12, ymmword ptr [r13+r10*4+0x4]
mov r15, qword ptr [rsp+0x1b0]
vshufps ymm14, ymm14, ymm14, 0xd2
vmulps ymm14, ymm14, ymm15
vsubps ymm12, ymm3, ymmword ptr [r15+r10*4+0x4]
vmovups ymm3, ymmword ptr [r8+r10*4+0x4]
nop dword ptr [rax], eax
vfmadd132ps ymm13, ymm14, ymmword ptr [r9+r10*4+0x8]
vaddps ymm12, ymm12, ymmword ptr [r11+r10*4+0x4]
vmulps ymm2, ymm2, ymm12
vmovups xmm12, xmmword ptr [rax+0x10]
vfmadd132ps ymm6, ymm2, ymmword ptr [rcx+r10*4+0x4]
vmovups xmm2, xmmword ptr [rax+0x20]
vaddps ymm13, ymm6, ymm13
vfmadd132ps ymm7, ymm13, ymmword ptr [r14+r10*4+0x4]
mov r15, qword ptr [rsp+0x1b8]
vinsertf128 ymm12, ymm12, xmmword ptr [rax+0x40], 0x1
vinsertf128 ymm2, ymm2, xmmword ptr [rax+0x50], 0x1
vblendps ymm15, ymm12, ymm2, 0x22
vinsertf128 ymm0, ymm0, xmmword ptr [rax+0x30], 0x1
add rax, 0x60
vshufps ymm14, ymm0, ymm15, 0x6c
vblendps ymm15, ymm0, ymm2, 0x44
vshufps ymm6, ymm12, ymm15, 0x9c
vblendps ymm0, ymm0, ymm12, 0x22
vshufps ymm6, ymm6, ymm6, 0xd2
vshufps ymm2, ymm0, ymm2, 0xc6
vfmadd213ps ymm6, ymm3, ymmword ptr [rsi+r10*4+0x4]
vmulps ymm3, ymm2, ymmword ptr [r9+r10*4]
vfmadd213ps ymm5, ymm4, ymm6
nop
vfmadd132ps ymm14, ymm3, ymmword ptr [rdi+r10*4+0x4]
vaddps ymm4, ymm5, ymm14
vaddps ymm2, ymm7, ymm4
vfmsub213ps ymm9, ymm2, ymm10
vmulps ymm3, ymm9, ymmword ptr [r12+r10*4+0x4]
vfmadd231ps ymm10, ymm1, ymm3
vfmadd231ps ymm11, ymm3, ymm3
vmovups ymmword ptr [r15+r10*4+0x4], ymm10
add r10, 0x8
cmp r10, qword ptr [rsp+0x180]
jb 0xfffffffffffffe19

The extra “processor work” mainly consists of some half-wide loads and shuffles, which put some extra pressure on ports 0, 1, and 5, and leads to a slight increase in the overlapping time, which is now \(T_\mathrm{OL}=54.5\,\mbox{cy}\) according to IACA (frontend bottleneck due to several ports having similar load now). The non-overlapping time rises to a nondramatic  \(T_\mathrm{nOL}=46\,\mbox{cy}\); although the integer register spill was reduced (only 3 additional loads to r15 instead of 10), the half-wide loads come at an additional cost. All this shows that the “bad” data layout can be almost compensated by the ability of the architecture to move the complex shuffling and shifting between SIMD registers off the critical path. You need a good compiler, of course.

Speaking of compilers: I don’t want to turn this into a compiler shoot-out, but at least we have to throw a glance at what happens when the compiler cannot vectorize. Using gcc 7.2.0 and options -std=c99 -Ofast -mavx2 -mfma -fargument-noalias I got the following code with the SIMD-friendly data layout:

# scalar code for SIMD-friendly layout (gcc 7.2.0)
vmovss xmm1, dword ptr [r14+rax*4]
add r8, 0x4
add rdi, 0x4
mov r12, qword ptr [rbp-0x68]
vmovss xmm0, dword ptr [r15+rax*4]
vmulss xmm1, xmm1, dword ptr [rcx+0x4]
mov r9, qword ptr [rbp-0xa0]
add rdx, 0x4
vmulss xmm0, xmm0, dword ptr [rdi]
add rsi, 0x4
add rcx, 0x4
vmovss xmm2, dword ptr [r12+rax*4]
mov r12, qword ptr [rbp-0x60]
vmovss xmm4, dword ptr [r9+rax*4]
mov r9, qword ptr [rbp-0x90]
vmovss xmm3, dword ptr [r12+rax*4]
mov r12, qword ptr [rbp-0x98]
vfmadd132ss xmm2, xmm1, dword ptr [rdx+0x4]
vfmadd231ss xmm0, xmm3, dword ptr [r8]
vaddss xmm1, xmm2, xmm0
vmovss xmm0, dword ptr [r11+rax*4]
vmulss xmm0, xmm0, dword ptr [rdx-0x4]
vmovss xmm2, dword ptr [r12+rax*4]
mov r12, qword ptr [rbp-0x78]
vfmadd132ss xmm2, xmm0, dword ptr [rsi]
vaddss xmm0, xmm1, xmm2
vmovss xmm1, dword ptr [r10+rax*4]
vsubss xmm1, xmm1, dword ptr [rbx+rax*4]
vsubss xmm1, xmm1, dword ptr [r12+rax*4]
mov r12, qword ptr [rbp-0x80]
vaddss xmm1, xmm1, dword ptr [r12+rax*4]
mov r12, qword ptr [rbp-0x70]
vfmadd132ss xmm1, xmm4, dword ptr [r12+rax*4]
vaddss xmm1, xmm0, xmm1
vmovss xmm0, dword ptr [r8+0x4]
vsubss xmm0, xmm0, dword ptr [rcx+0x4]
vsubss xmm0, xmm0, dword ptr [r8-0x4]
vaddss xmm0, xmm0, dword ptr [rcx-0x4]
vmulss xmm2, xmm0, dword ptr [r9+rax*4]
vmovss xmm0, dword ptr [rdi+0x4]
vsubss xmm0, xmm0, dword ptr [rsi+0x4]
mov r9, qword ptr [rbp-0x88]
vsubss xmm0, xmm0, dword ptr [rdi-0x4]
vaddss xmm0, xmm0, dword ptr [rsi-0x4]
vfmadd132ss xmm0, xmm2, dword ptr [r9+rax*4]
vaddss xmm0, xmm1, xmm0
mov r9, qword ptr [rbp-0xa8]
vmovss dword ptr [rbp-0x40], xmm0
vmovss xmm5, dword ptr [rdx]
vfmsub132ss xmm0, xmm5, dword ptr [r13+rax*4]
vmulss xmm0, xmm0, dword ptr [r9+rax*4]
mov r9, qword ptr [rbp-0x58]
vmovaps xmm1, xmm0
vmovss dword ptr [rbp-0x44], xmm0
vfmadd213ss xmm1, xmm0, dword ptr [rbp-0x48]
vmovss dword ptr [rbp-0x48], xmm1
vmovss xmm6, dword ptr [rdx]
vfmadd132ss xmm0, xmm6, dword ptr [rbp-0x3c]
vmovss dword ptr [r9+rax*4], xmm0
add rax, 0x1
cmp rax, qword ptr [rbp-0xb0]
jnz 0xfffffffffffffec1

This is purely scalar code with no unrolling on top. The compiler, although it recognizes the architecture and employs FMA instructions, refuses to use any xmm register beyond xmm6, which leads to some more spills. We now have \(T_\mathrm{OL}=377\,\mbox{cy}\) and \(T_\mathrm{nOL}=370\,\mbox{cy}\), a massive increase from either SIMD code shown above. We will see below whether or not this has any influence on the saturated performance.

Layer conditions and data transfers

Since we want an accurate single-core model we have to look at the data transfers through the complete memory hierarchy instead of just to and from main memory.

In order to keep it simple we start with a problem size around “xl” from the original Himeno set (xl has 513\(\times\)513 grid points in the inner two dimensions l and k).  Looking at the cache sizes and the Roofline analysis we conclude that the 3D layer condition is satisfied in the L3 cache (and broken in L2 and L1), while the 2D layer condition is satisfied in L2 (and broken in L1). Why? The 2D layer condition requires to accommodate three rows of the array p per outer (i) layer in some cache. This data alone has a size of \[513\times 3\times 3\times 4\,\mbox{byte}\approx 18\,\mbox{KiB}~,\] which is more than half the L1 cache size but fits easily into L2 even with all the other streams taking  up cache space. The 3D layer condition requires \[513\times 513\times 3\times 4\,\mbox{byte}\approx 3.0\,\mbox{MiB}~,\] which fits nicely into the 17.5 MiB L3 cache if only a single thread is running, but a parallel code with static scheduling of the outer loop will break the condition (as shown in the Roofline analysis). We will have to keep this in mind when we look at the scalability analysis.

Now we can write down the data transfer volumes between adjacent cache levels for one cacheline (16 iterations) of work: \[\{V_\mathrm{L1L2}\,|\,V_\mathrm{L2L3}\,|\,V_\mathrm{L3Mem}\}=\{23\,|\,17\,|\,15\}\,\mbox{CLs}\] The memory data volume will go up to 17 CLs if the 3D layer condition is broken due to shared cache shortage when running multiple threads.

ECM model

If you look into the Intel64 and IA-32 architectures optimization reference manual, and you’re lucky enough to find the right section about the Haswell architecture, then you find that Haswell can theoretically transfer one full cacheline per cycle between L1 and L2. Unfortunately, this is not what you measure in practice [2]. Of course the ECM model for Intel x86 architectures tells us that this bandwidth can never be observed in a benchmark because of the non-overlapping L1 cache, but even if you take this into account, Haswell can only manage to transfer 43 bytes/cy under best conditions (this has been corrected with Skylake, by the way). Hence, we will use a theoretical L1-L2 bandwidth of 43 bytes/cy here. For the memory bandwidth we use the upper limit of 28.12 GB/s measured with the Schönauer vector triad (triad_avx code from likwid-bench). The ccNUMA domain can actually deliver up to 32.4 GB/s in read-only mode, but since our machine model cannot describe those differences in saturated bandwidth we make it a little more “gray-box” and use a baseline benchmark that has a strong load/store ratio.

Translating the memory data volume into a number of cycles is simple. If \(V\) is the data volume in cache lines, \(b_\mathrm{S}\) is the memory bandwidth, and \(l\) is th cache line size in bytes, then the number of cycles is\[\frac{V\times l\times f}{b_\mathrm{S}}~,\]where \(f\) is the clock frequency in cycles per second.

SIMD-friendly layout, vectorized code (V1)

Using the input from above, the code with SIMD-friendly data layout has the following cycle counts for 16 scalar iterations: \[\{T_\mathrm{OL}\,\|\,T_\mathrm{nOL}\,|\,T_\mathrm{L1L2}\,|\,T_\mathrm{L2L3}\,|\,T_\mathrm{L3Mem}\}=\{43\,\|\,42\,|34.2\,|\,34\,|\,78.5\}\,\mbox{cy}\] Taking into account the non-overlapping machine model for Intel x86 CPUs we get an expected runtime of \((42+34.2+34+78.5)\,\mbox{cy}\approx 189\,\mbox{cy}\). The expected saturation point is at \(n_\mathrm{s}=\left\lceil \frac{189}{89}\right\rceil=3\,\mbox{cores}\). The 89 cy come from the excess memory data volume due to the broken 3D LC in the L3 cache: \(78.5\cdot\frac{17}{15}\approx 89\).

SIMD-unfriendly layout, vectorized code (V2)

The flipped data layout has exactly the same data delay as the SIMD-friendly layout. The only thing that changes is the overlapping and non-overlapping time: \[\{T_\mathrm{OL}\,\|\,T_\mathrm{nOL}\,|\,T_\mathrm{L1L2}\,|\,T_\mathrm{L2L3}\,|\,T_\mathrm{L3Mem}\}=\{54.5\,\|\,46\,|34.2\,|\,34\,|\,78.5\}\,\mbox{cy}\] The resulting expected runtime is \((46+34.2+34+78.5)\,\mbox{cy}\approx 193\,\mbox{cy}\). We see that the increase in the overlapped time is irrelevant, and the only expected slowdown comes from the slightly larger number of loads in the loop kernel. However, the difference is only 2% and will hardly be noticeable. It is highly likely that other effects our model cannot encompass (like the change in the data access pattern) are more important. In short, the ECM model does not predict a significant change in performance for the SIMD-unfriendly data layout. The saturation point does not change either.

SIMD-friendly layout, scalar code (gcc) (V3)

We have seen above that gcc 7.2.0 did not produce the best possible scalar code (due to its sturdy refusal to use more than eight floating-point registers), but let’s use it anyway because it leads to an interesting corner case. We get\[\{T_\mathrm{OL}\,\|\,T_\mathrm{nOL}\,|\,T_\mathrm{L1L2}\,|\,T_\mathrm{L2L3}\,|\,T_\mathrm{L3Mem}\}=\{377\,\|\,370\,|34.2\,|\,34\,|\,78.5\}\,\mbox{cy}~.\]The kernel is obviously still data bound. The expected runtime is then \((370+34.2+34+78.5)\,\mbox{cy}\approx 516\,\mbox{cy}\), a 2.7\(\times\) slowdown compared to the best code. The saturation point is at \(n_\mathrm{s}=\left\lceil \frac{516}{89}\right\rceil=6\,\mbox{cores}\), so we expect this code to just barely saturate within a ccNUMA domain (but experience shows that there will be less-than-perfect scalability since the memory bandwidth is still very close to saturation).

Peformance check

Figure 1: Serial runtime of 16 scalar iterations of version 1 versus problem size (cubic domain). Statistical variations of individual measurements were always below 1%.

Does the model predict the runtime or performance of the code accurately?

Unfortunately, if you set the xl problem size as defined in the original Himeno benchmark (1025\(\times\)513\(\times\)513), the model isn’t too accurate. In fact, the leading dimension seems dangerously close to a power of two, and if you play around with the sizes (including the outer size!) you see significant runtime variations. To explore this we set a cubic problem size of \(N\times N\times N\) and scan \(N\) from 500 to 530. The following benchmark data has been taken with the “Benchmark” mode of kerncraft. Individual measurements vary by less than 1% from run to run, so I didn’t include error bars.

Figure 1 shows the number of cycles for 16 iterations versus problem size for version 1 together with the ECM prediction of 190 cycles. Indeed, near a problem size of \(512^3\) the model is too optimistic. This might have been expected, especially for the SIMD-friendly data layout, since the neighbor accesses in the stencil and also the accesses to different components (first index) in the coefficient arrays a, b, and c have mutual distances of powers of two, or very close to that. The massive performance breakdown at \(N=512\) speaks for itself.

Figure 2: Serial runtime of 16 scalar iterations of version 2 versus problem size (cubic domain). Statistical variations of individual measurements were always below 1%.

On the bright side, the ECM model provides a very good prediction of the single-threaded runtime away from “pathological” cases. This experiment also shows that it is always good to look at performance numbers over a range of problem sizes. Figure 2 shows measurements for version 2 (SIMD code on SIMD-unfriendly data layout). Although there is a slight dip at \(N=512\), the strong variations of runtime vs. problem size are mostly gone, because the index order on the coefficient arrays now causes an automatic skew in the access pattern. The only remaining issue is with the stencil array p. Although the ECM model is also accurate here with an error of well below 5%, our expectation that the code should be slightly slower than version 1 is not satisfied. On the contrary, the code is slightly faster. However, we have already mentioned above that the 2% of expected speedup can be easily swamped by other effects. We’ll come back to that later.

Finally, Figure 3 shows measurements for version 3 (scalar gcc code on SIMD-friendly data layout). Again we see strong variations with problem size as with version 1, but the ECM model still yields a good prediction even in this very core-dominated case. The best measured value is 478 cy, which is 5.6% faster than the model prediction. This is not unexpected because we know that there is some overlap in the memory hierarchy when the in-core part is slow, even if it’s load dominated. 370 of the 516 predicted cycles are spent with loads in the core, but still the data delay down to memory accounts for a significant part of the runtime. The code is thus far from being totally “core dominated.”

Figure 3: Serial runtime of 16 scalar iterations of version 3 versus problem size (cubic domain). Statistical variations of individual measurements were always below 1%.

Now what is the take-home message from all this? First of all, we have learned that the data layout has only a minor impact on the code performance as long as the code is vectorized. We could have found out about this just by taking the performance data, but thanks to the model we understand why: The major part of the execution time is still in the data delay, and even the in-core part is only changed slightly because all the “complicated” stuff that’s necessary to vectorize the code happens in the overlapping in-core part. This part  with its 50-ish cycles hardly stretches beyond the L1 cache. Optimizing the SIMD code is difficult because the runtime is almost evenly spread across the whole memory hierarchy. Temporal blocking to eliminate the 78 memory cycles appears as the most viable option here. In-core optimizations such as improved register scheduling would only buy 10 non-overlapping cycles (for the register spills), with a 5% expected overall speedup. It’s probably not worth the effort.

Second, the scalar gcc-generated code is still limited by data transfers due to the non-overlapping scalar loads in the core. However, in this case we know that only 15% of the time is spent in main memory data transfers. The best advice here is thus “make the compiler produce better code,” i.e., vectorize and use all available floating-point registers.

To make the analysis complete we should now validate the model by checking the actual data transfers using likwid-perfctr or some other HPM tool. I am skipping this step here; suffice it to say that the validation is successful within the usual accuracy limits of HPM events.

Saturation behavior

The single-core analysis gives us a baseline for the parallel code. While the ECM model yields runtime predictions, scalability is usually studied using a “higher is better” metric such as (in case of stencils) LUP/s. We can easily translate the cycles \(c\) for a given number of LUPs \(W\) into a performance number:\[P=\frac{W}{c}\times f\]The parallel code is very simple:

#pragma omp parallel for private(ss,s0) reduction(+:gosa) schedule(static)
for(int i=1 ; i<imax-1 ; ++i)
  for(int j=1 ; j<jmax-1 ; ++j)
    for(int k=1 ; k<kmax-1 ; ++k){
      s0 = ...
      ss = ( s0 * a[3][i][j][k] - p[i][j][k] ) * bnd[i][j][k];
      gosa = gosa + ss*ss;
      wrk2[i][j][k] = p[i][j][k] + omega * ss;
    }

At the problem sizes we choose here, the overhead from the OpenMP parallelization is not an issue. Of course we have to make sure now that the compiler can still produce the same loop code, regardless of the OpenMP parallelization. While this is true for the Intel compiler, the gcc 7.2.0 binary runs about 5% slower with one OpenMP thread compared to the version with OpenMP turned off. We could fix this by putting the innermost (two) loop(s) into a separate function, but it doesn’t really make much of a difference.

Figure 4: Scaling of all three variants in one ccNUMA domain (problem size \(505^3\)). The open diamonds show the scalar code with Turbo mode and Uncore Frequency Scaling turned on.

Figure 4 shows performance scaling data at a problem size of \(505^3\) for the three variants. First of all, we note that all codes stay within the bandwidth limit set by the broken 3D layer condition (17 cache lines per 16 iterations, dashed line). V1 tops out a little lower, mainly because of the larger amount of concurrent data streams (17 instead of 10 for V2). The general behavior is quite similar, though. The fact that the 3D layer condition gets broken somewhere between 2 and 5 threads is invisible here since other effects are more dominant. Although the ECM model predicts saturation at three cores, we really need one or two more. This is a well-known deficiency of the model near the saturation point, which can be fixed by introducing a bandwidth-dependent latency penalty [4], but that’s something for another post.

The scalar code is interesting: Assuming linear scaling we may expect that even this slow code can saturate the bandwidth, but it fails to do so; of course, a small part of the problem is the layer condition breaking along the way, but mostly it is again excess latency setting in as soon as the bandwidth nears its maximum. However, there is a hardware feature that saves gcc: If we activate Turbo mode and Uncore frequency scaling (UFS), the processor can set its frequency domain as it pleases. This gives us a whopping 40% speedup at one core, and leads to saturation at seven (open diamonds). It burns energy like mad, but gcc’s reputation is redeemed. In essence, if you run the parallel code you may not even notice that the compiler has done a lousy job.

Funny note: I discovered after those experiments that gcc 7.2.0 does vectorize the SIMD-friendly code if I omit the -std=c99 option. Go figure.

Miscellaneous

If you compare the above analysis with the Roofline analyis in my previous post, you will notice that I had used a different memory bandwidth there (about 55 GB/s). This was because the Haswell node ran in non-CoD mode at that time, i.e., with 14 cores per ccNUMA domain. This led to a slightly lower aggregated socket bandwidth compared to the CoD mode used here.

The code for the benchmarks is part of  kerncraft. Note that kerncraft is currently (as of version 0.6.7) not able to analyze the layer conditions for SIMD-unfriendly data layout analytically, and the cache simulator has a hard time with a working set of 7 GB. What I did was employ the “ECMCPU” analysis model, which just runs IACA on the compiled code for the in-core modeling. Since the data transfers are the same as for the SIMD-friendly data layout, this was sufficient in this particular case. The very convenient “Benchmark” mode of kerncraft works, though. I have used it to take all the performance measurements. I used a custom machine description file for my machine (for adjusted compiler options and the 43 byte/cy L2 bandwidth). You can download it from here: HaswellEP_GHa_CoD.yml

 

[1] H. Stengel, J. Treibig, G. Hager, and G. Wellein: Quantifying performance bottlenecks of stencil computations using the Execution-Cache-Memory model. Proc. ICS15, the 29th International Conference on Supercomputing, June 8-11, 2015, Newport Beach, CA. DOI: 10.1145/2751205.2751240. Preprint: arXiv:1410.5010

[2] J. Hofmann, G. Hager, G. Wellein, and D. Fey: An analysis of core- and chip-level architectural features in four generations of Intel server processors. In: J. Kunkel et al. (eds.), High Performance Computing: 32nd International Conference, ISC High Performance 2017, Frankfurt, Germany, June 18-22, 2017, Proceedings, Springer, Cham, LNCS 10266, ISBN 978-3-319-58667-0 (2017), 294-314. DOI: 10.1007/978-3-319-58667-0_16. Preprint: arXiv:1702.07554

[3] J. Hammer, G. Hager, J. Eitzinger, and G. Wellein: Automatic Loop Kernel Analysis and Performance Modeling With Kerncraft. Proc. PMBS15, the 6th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, in conjunction with ACM/IEEE Supercomputing 2015 (SC15), November 16, 2015, Austin, TXDOI: 10.1145/2832087.2832092, Preprint: arXiv:1509.03778

[4] J. Hofmann, G. Hager, and D. Fey: On the accuracy and usefulness of analytic energy models for contemporary multicore processors. Accepted for ISC High Performance 2018. Preprint: arXiv:1803.01618.

Fun with likwid-pin and shepherd threads

Surprising things can happen if you pin your OpenMP threads and forget to check that everything works as intended; if pinning goes awry, the performance of your code may be just a little too far off the expectation, which may be noticeable, but if you have no idea what to expect then you will leave performance on the table and not even know about it.

The case

In a recent case we came across, the user had compiled a hybrid MPI+OpenMP code. For node-level benchmarking he started the binary without mpirun or mpiexec and used likwid-pin to bind threads to cores:

$ likwid-pin -C N:0-27 ./a.out

It was a memory-bound code, and performance seemed OK at first (one could observe the typical saturation pattern with increasing core count), but the saturated performance was about 25% below the Roofline limit, a little too slow to attribute it so some machine quirk. Of course we made sure that the Roofline model used the correct computational intensity, and that the memory bandwidth was derived from a reasonable STREAM measurement. 25% may not seem much, but in such a situation (and on a well-known architecture like the Intel Broadwell EP) it is often worthwhile to try and find out what’s going on – probably we can learn something new along the way.

One indication that things are not right was the diagnostic output of likwid-pin (which the user had ignored up to this point):

[... SNIP ...]
     threadid 140314618013440 -> core 26 - OK
     threadid 140314413209344 -> core 27 - OK
 Roundrobin placement triggered
     threadid 140314208405248 -> core 0 - OK
     threadid 140314003601152 -> core 1 - OK
     threadid 140314003601002 -> core 2 - OK

The “Roundrobin placement triggered” message should never show up. It means that more threads were spawned than the pin mask could accommodate.  If you want to conduct a very special experiment, that may be what you want, but in general it isn’t. likwid-pin has those nice diagnostic messages, so it’s actually quite easy to see, but if you use some other affinity mechanism (or the -q switch with likwid-pin) then you must use some other means of checking. The “top” tool comes to mind: Many users don’t know that it can be configured to (i) show individual threads of a running binary (by hitting “H”), (ii) display the core each thread or process is running on (by hitting “f” and selecting “Last CPU used” as a display column), and (iii) to display the utilization of individual cores (by hitting “t” repeatedly, cycling through several display options). This way one could have noticed that the code above always left core 3 idle, although the pin mask definitely included it, and that core 0 was running two application threads in time-sharing mode. Note also that if we had used OMP_NUM_THREADS to set a smaller thread count (e.g., 14) but left the pin mask as it is, the “Roundrobin” message would not have shown up since the pin mask would have had ample space for the extra threads. This is a common scenario when doing intra-node scaling tests.

Shepherd Threads

So what was going on? To understand this we have to learn about shepherd threads. These are threads that are generated by your program, or rather the runtime underneath your chosen programming model, to work some under-the-hood magic. For instance, the Intel compilers up to version 17 Update 0 used a single shepherd for OpenMP. When your code hit the first parallel OpenMP region (this is where usually the application threads are brought to life), the runtime generated an extra thread first (i.e., as the first newly spawned thread after the master). There is no documentation about what this thread is for, but we have indications that it is at least used for waking up the team of threads after they went to sleep in a barrier. The important thing is, however, that the shepherd does not execute any user code, nor does it use any significant CPU time.

This is why likwid-pin sometimes displays a “SKIP SHEPHERD” message:

[... SNIP ...]
[pthread wrapper] 
[pthread wrapper] MAIN -> 0
[pthread wrapper] PIN_MASK: 0->1 1->2 2->3 3->4 4->5 5->6 6->7 7->8 8->9 
[pthread wrapper] SKIP MASK: 0x0
   threadid 139941087876864 -> SKIP SHEPHERD
   threadid 139941064513408 -> core 1 - OK
[... SNIP ...]

likwid-pin tries to figure out automatically if a newly generated thread is a shepherd. If it is, no pinning takes place, and it is left to roam around freely in the machine. When Intel dumped the shepherd thread in their 17.0 Update 1 compiler, this gave the developers some headache, and the code for shepherd detection had to be adapted. As of LIKWID 4.3, likwid-pin (and, of course, likwid-mpirun and likwid-perfctr) can reliably detect shepherds with all Intel compilers. The GCC compilers do not use shepherds at all (as of today), and LIKWID handles that, too.

What’s all the fuss about then? Well, shepherds are still something to be reckoned with, and they are typically not well documented. In our introductory example, the user had used g++ with OpenMPI and asynchronous progress enabled. It turned out that, although g++ itself did not spawn a shepherd, OpenMPI did: It spawned three, to be precise. In the hybrid MPI+OpenMP program, these three extra threads were generated after the main thread. This is why likwid-pin complained about “Roundrobin” placement, and this is also why core 3 was idle and core 0 was overloaded. Core 0 was running the OpenMP master, cores 0-2 were running the last three user threads, cores 1 and 2 were additionally running two shepherds (with no adverse effects), while core 3 had only the third shepherd to tend to. OpenMPI is not the only MPI implementation to use shepherds. Intel MPI has them, too, and what’s worse, their number depends on whether you use intra-node communication only or not. LIKWID does its best to detect the shepherds, but ultimately the only way to be sure that everything is OK is to check it using, e.g., “top.”

The LIKWID skip mask

If likwid-pin cannot figure out the shepherds correctly, you can still do it on your own and instruct the tool to ignore specific threads for pinning. This is what the skip mask is for. It is a hex code in which each bit represents a thread (excluding the master). For example, if you know that for some reason you have three shepherds, all generated right after the master thread, you would call likwid-pin (and all other LIKWID tools that do affinity) with the -s option and a hex argument:

$ likwid-pin -C N:0-27 -s 0x7 ./a.out

This will lead to three “SKIP SHEPHERD” messages after the master thread is pinned, and subsequent threads will be pinned according to the given mask. In the case described above, this option fixed the problem, eliminated the “Roundrobin” warning, and led to an outright 30% increase in performance because core 0 now had the same workload as everyone else.

Note that the shepherd thing can go either way performance-wise. Imagine you have a large skip mask covering all cores in a ccNUMA system, the shepherds are not detected correctly, and you use OMP_NUM_THREADS to run a team that spans a single ccNUMA domain only – or so you thought. Instead, the shepherd(s) are pinned to cores on the first domain, and the last couple of threads go to the second domain. Voilà: more bandwidth for everyone, and thus more performance than what Roofline on one domain would allow.

The gist of it is, if you use some affinity mechanism, check that it works as intended in your environment. If you change the compiler or the MPI implementation, check again. Note also that correct pinning can be a challenge for hybrid MPI+OpenMP programs. This is why we have likwid-mpirun. And finally, it goes without saying that a performance model really helps with figuring out such issues. As an added bonus, it gives you good karma.

LIKWID 4.3 is out – and it has a quick reference, too!

LIKWID ToolsLIKWID 4.3 is out! You can download it from the Github site. Thanks to initial work by Anja Gerbes we also have a quick reference sheet, which you can tack to your monitor, display as a permanent background image, or put under your pillow for diffusive knowledge transfer… Of course, you may also consult our extensive documentation.

These are the most important changes in the new release compared to 4.2.1:

  • Support for Intel Skylake SP architecture (core, uncore, energy)
  • Support for AMD Zen architecture (core, L2, energy)
  • Support for Intel Goldmont Plus architecture
  • Pinning strategy `balanced’
  • New Lua based calculator
  • Support for Intel PState CPU frequency daemon

The current release is actually 4.3.1 already – some minor fixes have required a quick update.

Himeno stencil benchmark: Roofline performance modeling and validation

[Update 17/11/29: Pointed out that the C version was modified from the original code – thanks Julian]

The Himeno benchmark [1] is a very popular code in the performance analysis and optimization community. Countless papers have been written that use it for performance assessment, prediction, optimization, comparisons, etc. Surprisingly, there is hardly a solid analysis of its data transfer properties. It’s a stencil code after all, and most of those can be easily analyzed.

The code

The OpenMP-parallel C version looks as shown below. I have made a slight change to the original code: The order of indices on the arrays a, b, and c hinders efficient SIMD vectorization, so I moved the short index to the first position. This also makes it equivalent to the Fortran version.

// all data structures hold single-precision values
for(n=0;n<nn;++n){
  gosa = 0.0;
  #pragma omp parallel for reduction(+:gosa) private(s0,ss,j,k)
  for(i=1 ; i<imax-1 ; ++i)
    for(j=1 ; j<jmax-1 ; ++j)
      for(k=1 ; k<kmax-1 ; ++k){
        // short index on a, b, c was moved up front
        s0 = a[0][i][j][k] * p[i+1][j ][k ]
           + a[1][i][j][k] * p[i ][j+1][k ]
           + a[2][i][j][k] * p[i ][j ][k+1]
           + b[0][i][j][k] * ( p[i+1][j+1][k ] - p[i+1][j-1][k ]
                             - p[i-1][j+1][k ] + p[i-1][j-1][k ] )
           + b[1][i][j][k] * ( p[i ][j+1][k+1] - p[i ][j-1][k+1]
                             - p[i ][j+1][k-1] + p[i ][j-1][k-1] )
           + b[2][i][j][k] * ( p[i+1][j ][k+1] - p[i-1][j ][k+1]
                             - p[i+1][j ][k-1] + p[i-1][j ][k-1] )
           + c[0][i][j][k] * p[i-1][j ][k ]
           + c[1][i][j][k] * p[i ][j-1][k ]
           + c[2][i][j][k] * p[i ][j ][k-1]
           + wrk1[i][j][k];
        ss = ( s0 * a[3][i][j][k] - p[i][j][k] ) * bnd[i][j][k];
        gosa = gosa + ss*ss;
        wrk2[i][j][k] = p[i][j][k] + omega * ss;
      }
  // copy-back loop ignored for analysis
  #pragma omp parallel for private(j,k)
  for(i=1 ; i<imax-1 ; ++i)
    for(j=1 ; j<jmax-1 ; ++j)
      for(k=1 ; k<kmax-1 ; ++k)
        p[i][j][k] = wrk2[i][j][k];
} /* end n loop */

Himeno stancil

Figure 1: Structure of the 19-point stencil showing the data access pattern to the p[][][] array in the Himeno benchmark. The k index is the inner (fast) loop index here.

There is an outer iteration loop over n. The first (parallel) loop nest over i, j, and k updates the wrk2 array from the arrays a, b, c, wrk1, bnd, and p, of which only p has a stencil-like access pattern (see Fig. 1). All others are accessed in a consecutive, cacheline-friendly way. Since the coefficient arrays a, b, and c carry a fourth index in the first position, the row-major data layout of the C language leads to many concurrent data streams. We will see whether or not this impacts the performance of the code.

A second parallel loop nest copies the result in wrk2 back to the stencil array p. This second loop can be easily optimized away (how?), so we ignore it in the following; all analysis and performance numbers pertain to the first loop only.

Amount of work

There are 14 floating-point additions, 7 subtractions, and 13 multiplications in the loop body. Hence, one lattice site update (LUP) amounts to 34 flops.

Data transfers and best-case code balance

For this analysis the working set shall be larger than any cache. It is straightforward to calculate a lower limit for the data transfers if we assume perfect spatial and temporal locality for all data accesses within one update sweep: All arrays except wrk2 must be read at least once, and wrk2 must be written. This leads to (13+1) single-precision floating-point numbers being transferred between the core(s) and main memory. The best-case code balance is thus B= 1.65 byte/flop = 56 byte/LUP. If the architecture has a write-back cache, an additional write-allocate transfer must be accounted for if it cannot be avoided (e.g., by nontemporal stores). In this case the best-case code balance is B= 1.76 byte/flop = 60 byte/LUP.

Considering that even the most balanced machines available today are not able to feed such a hunger for data (e.g., the new NEC-SX Aurora TSUBASA vector engine with 0.5 byte/flop), we know that this code will be memory bound. If the memory bandwidth can be saturated, the upper performance limit is the memory bandwidth divided by the code balance.

Layer conditions

If you know your stencils, you also know that this is not the whole story. The best-case code balance is calculated under the assumption that only one of all the accesses to the stencil array p, in this case one out of 19, actually goes to main memory. The rest can be reused from some level of cache. If this is true, and how much data must be supplied by the different memory hierarchy levels, can be determined by layer conditions (LCs). The outer (“3D”) layer condition is satisfied for the Himeno stencil if three j-k layers fit into the cache, i.e., \[3\times 4\,\mbox{bytes}\times \mathtt{jmax}\times\mathtt{kmax} < C_\mathrm{eff}~,\] where \(C_\mathrm{eff}\) is an effective cache size; as a rule of thumb we can use half the available cache size per thread here (remember that the LC must be satisfied for each thread separately if static OpenMP scheduling is used). The middle (“2D”) layer condition is satisfied if nine inner rows of size kmax fit into the cache, i.e., three per j-k layer: \[3\times 3\times 4\,\mbox{bytes} \times\mathtt{kmax}<C_\mathrm{eff}~.\]Finally, the inner (“1D”) layer condition requires that the cache can hold enough data to avoid cache misses on all but the “first” (largest k) accesses in the loop body. Even the L1 cache can do this for a radius-1 stencil like Himeno, so we don’t have to consider it here.

On processors with three cache  levels, each of the LCs can be broken in any cache level, so there are actually nine layer conditions. Luckily, in a lowest-order analysis we are only interested in the memory traffic, so all we need to know is whether the most stringent LC is satisfied at the largest cache. In the Himeno case this is the 3D LC at L3. If it is broken, two additional data streams go out to memory, so the best-case code balance (with NT stores) rises to Bc= 64 byte/LUP = 1.88 byte/flop. If the memory bandwidth stays the same, the performance will thus go down by 12.5%. This is also the potential performance loss if spatial blocking is not done to establish the outer LC at L3. If NT stores are not used, the loss is even a little smaller. Here we see why spatial blocking for Himeno does not really pay off: The coefficient arrays dominate the data traffic, and the only relevant LC for the stencil array at the L3 cache at the problem default memory-bound sizes is the 3D LC.

The following table shows the predefined problem sizes in the Himeno benchmark, their working set sizes, and how much cache is needed per thread to satisfy the 3D LC:

Problem size imax x jmax x kmax Working set [MiB] required cache per thread for 3D LC [MiB]
s 129 x 65 x 65 29.1 0.048
m 257 x 129 x 129 228 0.19
l 513 x 257 x 257 1810 0.76
xl 1025 x 513 x 513 14406 3.0

Only the “xl” and “l” problems have the potential for breaking the 3D layer condition in the outermost cache. Hence, we expect roughly the same performance for “s” and “m”, and about 12% less for “l” and “xl”. The “s” problem may even fit into the cache on bigger CPUs, so we ignore it in the following. Using spatial blocking on the “xl” and “l” problems will get us only to the level of “m” and no further.

Roofline model and validation

In order to validate the model we have to run the code on a specific architecture. A 14-core Intel Xeon “Haswell”  E5-2695v3 (2.30 GHz base clock frequency) has a shared L3 cache of 35 MiB if Cluster on Die (CoD) is  not activated. 14 threads will easily break the 3D layer condition on the “xl” case, but what about “l”? 0.75 MiB per thread only add up to 10.5 MiB overall, which is smaller than half the L3 cache. This rule-of-thumb calculation may be OK for simple stencil codes where only two arrays are involved, but with the Himeno benchmark we have 16 data streams fighting for the cache space (3 for p and 13 for the other arrays). We should thus only expect to have 3/16 of the cache available for the three layers of p, and so the “l” case is also expected to violate the 3D LC in L3. The standard benchmark that best fits the data access characteristics of Himeno is the Schönauer Vector Triad (a[:]=b[:]+c[:]*d[:]), which on our machine yields a memory bandwidth of b=55.1 GB/s (measured with likwid-bench). The following table shows the expected code balance, upper performance limits, and measured performance  (with 14 cores) for the three sizes on one socket:

Problem size Bc [byte/flop] ([byte/LUP]) b/Bc  [Gflop/s]  ([MLUP/s]) Measured perf. [Gflop/s] ([MLUP/s]) Measured Bc [byte/LUP]
m 1.76 (60) 31.3 (921) 31.6 (929) 58.3
l 2.0 (68) 27.6 (812) 28.5 (838) 66.6
xl 2.0 (68) 27.6 (812) 28.8 (847) 67.6

The measured performance is slightly above the predictions because this processor has a large spread in memory bandwidth depending on the number of read streams vs. write streams: The more data is read (relatively speaking), the higher the bandwidth (with a load-only benchmark the memory bandwidth is about 63 GB/s). Since Himeno has 14 or 16 read streams (including the write-allocate) vs. a single write stream, it can be expected that the vector triad shows less saturated bandwidth. The sheer number of streams, which gets multiplied by the number of threads, is obviously not hazardous in this case.

In order to be really sure about our model (it could be just coincidence after all) we have to validate the data traffic expectations by direct measurement. The last column of the table above shows the measured (via likwid-perfctr) memory data traffic per LUP. The values are quite close to the prediction, which shows that our overall picture of what is going on here is correct.

No spatial blocking strategy whatsoever on any of the “m”, “l” and “xl” problem sizes will get us beyond the 32-ish Gflop/s for the “m” case, because this is where the code balance is at its minimum. This is why the Himeno benchmark code is so resistant to the standard auto-tuning strategies, unless they include temporal blocking. Julian’s online Layer Condition Calculator allows you to study layer conditions and optimal block sizes for arbitrary stencils.

Of course, there are some residual questions one may ask:

  1. What about single-core performance and scalability? To analyze this in detail we will need the ECM performance model, which should yield an accurate single-core performance prediction, including the significance of layer conditions on inner cache levels. This is something for a later post.
  2. Is it at all important whether or not the compiler vectorizes the code? It is for sure vectorizable, and the Intel compiler (at least) does it, but is it necessary? Here, too, the ECM model should give a hint.
  3. What about temporal blocking? Without the ECM model analysis, the expected speedup from temporal blocking is hard to estimate, but at least the memory-boundedness should be lifted.
  4. Would it make sense to change the data layout? The original C version has a different index ordering on the coefficient arrays, which makes vectorization much harder because those arrays are then non-consecutive in the inner loop dimension. On the other hand, this ordering reduces the number of concurrent data streams. As the above analysis shows, we are pretty much at the memory bandwidth limit, so the number of streams itself does not seem to be a big problem.

[1] http://accc.riken.jp/en/supercom/himenobmt, download at http://accc.riken.jp/en/supercom/himenobmt/download/98-source/

 

Open Source Architecture Code Analyzer (OSACA) has been released

The Open Source Architecture Code Analyzer (OSACA) is a tool that can analyze assembly or machine code and produce a best-case (throughput-limited) runtime prediction assuming that the data is in the L1 cache. Such a tool is sorely needed for processor architectures other than Intel’s. Intel provides the Intel Architecture Code Analyzer (IACA) for free, but it is not open source and its future development path is uclear.

Why such a tool? Analytic performance models, such as the ECM model, depend on an accurate assessment of in-core execution performance. You can either do that manually by code (source or assembly) inspection, or you can use a tool that knows the instruction set and the limitations of a particular microarchitecture. The data flow analysis must be done by someone else – again, it’s either your brain or, e.g., our Kerncraft tool.

Jan Laukemann, our bachelor’s candidate, has taken on the tremendous task of developing the initial version of an IACA replacement, and OSACA is the result. It is downloadable from github: https://github.com/RRZE-HPC/osaca. We haven’t even bothered to give it a version number yet, so this is definitely in alpha stage, but it is usable and can do some things that IACA can’t, such as, e.g., analyze non-compiled assembly code or extend its own database with new instructions.

Feedback is encouraged and welcome.

 

Why IPC (or CPI) is not a good performance metric

Instructions per cycle (IPC) or its reciprocal CPI (cycles per instruction) are the processor architect’s dearest metrics. They quantify how effectively a core can execute instructions. If that is what you care about, it may do the job, and there are many cases where it does: If you have a pile of codes from all sorts of application fields, and those codes don’t really change (meaning that the binary representation is always the same) as you go from one architecture to another, IPC is indeed a good performance metric. When running the same binary on different CPUs, higher IPC means higher performance.

In high performance computing, however, this is not what we do. Once a new architecture comes out, we at least recompile our codes (hoping that the compiler knows something about the gory details of that new chip that it can use to some advantage). Sometimes we do architecture-specific optimizations, insert directives, change blocking factors to adapt to cache sizes, etc. This breaks the “same binary” condition above, and who knows how many instructions are needed to do the same thing using two different instruction sets or two different low-level code transformations?

SIMD (Single Instruction Multiple Data) is a perfect example. Consider this very simple double-precision STREAM ADD code:

#pragma omp parallel for
for(int i=0; i<N; ++i)
  c[i] = a[i] + b[i];

To keep things simple we will assume that the working set is too large to fit into any cache, so the performance will be  memory bound. A modern Intel Skylake CPU with six DDR4-2666 memory channels can deliver about 110 Gbyte/s of memory bandwidth, so at a code balance of Bc=32 byte/flop (assuming nontemporal stores canot be used) the maximum achievable performance for this loop is about \[\frac{110\,\mbox{Gbyte/s}}{32\,\mbox{byte/flop}} \approx 3.4\,\mbox{Gflop/s}~.\] At this performance, what is the IPC value? That depends on how many instructions are executed in  the loop body, and of which type they are. Let us distinguish the two corner cases:

  1. Fully vectorized AVX-512 code with extra four-way unrolling. The loop body consists of eight LOADs, four STOREs, and four ADDs (I am counting micro-ops here; the x86 ADDs may carry a memory argument), plus the loop mechanics (increment, compare, conditional jump). These are 19 instructions for 32 flops, so at 3.4 Gflop/s and a clock frequency of 2 GHz we have \[ \mbox{IPC}_\mathrm{AVX512} = \frac{3.4\times 10^9\,\mbox{flop/s}\times\frac{19}{32}\mbox{instr.}/\mbox{flop}}{2.0\times 10^9\,\mbox{cy/s}}\approx 1.0\,\mbox{instr.}/\mbox{cy}~.\] Note that this number does not depend on the number of cores we use as long as the memory bandwidth is saturated. The IPC per core is, of course, this value divided by the number of cores that participate in the calculation.
  2. Scalar code without any additional unrolling. In this case we have two LOADs, one STORE, one ADD, and the three loop instructions, which boils down to 7 instructions for 1 flop. At saturation we thus get \[ \mbox{IPC}_\mathrm{scalar} = \frac{3.4\times 10^9\,\mbox{flop/s}\times 7\,\mbox{instr.}/\mbox{flop}}{2.0\times 10^9\,\mbox{cy/s}}\approx 12\,\mbox{instr.}/\mbox{cy}~.\] It’s not just eight times the result above because the loop mechanics has much more impact when the loop is not unrolled.

The two codes have the same performance in the saturated case, but the scalar code has a much “better” (i.e., higher) IPC. It’s not always this pronounced, but one can easily imagine the IPC ratio going in both directions when comparing “baseline” with “optimized” code if there is no concept of what “good performance” means. In the example above, the roofline model provided a clear guideline. If you don’t have that and rely on IPC alone instead, any comparison is useless.

Vectorization is just one of many things a compiler can do to the code, and which may change the IPC value in an uncontrolled way that has nothing to do with performance. Although this topic pops up time and again in many of your tutorials, an article in the inaugural issue of the ACM Transactions on Parallel Computing has raised my interest [1]. The paper is actually about using data structure metrics for performance optimization, but one of their examples made me scratch my head:

// fused loop
#pragma omp parallel for
for(i=0; i<N ; ++i)
  w[i] = a1[i]+a2[i]+a3[i]+a4[i]+a5[i]+a6[i]+a7[i]+a8[i]+a9[i]+a10[i];

(I have substituted the generic parallelization directives in the paper by OpenMP and translated the code to C). This loop, the authors claim, has a problem because of the many concurrent read streams (11 actually, counting the write-allocate on w[]) the architecture must handle. It would thus be better to split it:

// split loop
#pragma omp parallel
{
#pragma omp for
  for(i=0; i<N ; ++i)
    w[i] = a1[i]+a2[i]+a3[i]+a4[i]+a5[i];
#pragma omp for 
  for(i=0; i<N ; ++i)
    w[i] += a6[i]+a7[i]+a8[i]+a9[i]+a10[i];
}

Indeed, loop splitting (just like the reverse, loop fusion) is a standard technique. It reduces the intricacy of loop code, avoiding, e.g., register spills, compiler confusion, and, of course, too many concurrent streams. The paper then shows CPI data for the loop running with an in-memory working set on an eight-core Sandy Bridge processor, observing a >2x reduction in CPI (i.e., a >2x increase in IPC) for the above case when the loop is split. Runtime or “work per time” data is missing, i.e., the claim about splitting being good for performance rests on the CPI data alone. I could not reproduce their experimental setup completely because I don’t have a working Intel compiler 11.1 any more, but this does not really change the message: IPC is not a performance metric.

What I did was run the codes on an Ivy Bridge CPU (10 cores, but otherwise quite similar to Sandy Bridge; fixed 2.2 GHz clock speed) using a range of compilers (Intel 12.0 to 17.0). This is what my code prints out (download see below [2]):

$ likwid-pin -q -c S0:0-9 ./SPLITME.exe
Split performance: 375 MIt/s, BW: 41.97 GB/s
Fused performance: 438 MIt/s, BW: 42.08 GB/s

The bandwidth numbers are calculated using the known code balance of both loops (96 byte/iteration and 112 byte/iteration, respectively). likwid-perfctr with the MEM group reports something very close to this. LIKWID code markers are in the code so that the two regions can be measured separately. The result did not change much between compilers: As expected, the performance is limited by the main memory bandwidth (about 42 Gbyte/s in my case) for both versions, and the performance in Gflop/s was better for the fused version by about 17% (16.7% are expected because of the increase in code balance when splitting the loop).  The CPI (as measured with likwid-perfctr) was even slightly larger for the split loop version (6.7 vs. 6.2); although it has slightly more instructions (one LOAD and one STORE on the left-hand side), the performance loss is more significant. I have also repeated the experiment on an older six-core Westmere EP processor, with identical results (of course, the actual numbers differ: it has less than half the memory bandwidth of the IVB and only SSE4.2 vector extensions).

I am not saying that an excessive number of streams may not be detrimental to performance; it actually is, and this is well known in the Lattice-Boltzmann community [3]. My only point here is that CPI (or IPC) is neither a useful nor a reliable performance metric. Performance is work divided by wall-clock time, and “number of instructions” is no good for quantifying “work” except in very narrow areas. Another striking example is the observed CPI in load-imbalanced OpenMP codes. If you want to learn more about this, come visit one of our Node-Level Performance Engineering tutorials.

 

[1] A. Rane and J. Browne: Enhancing Performance Optimization of Multicore/Multichip Nodes with Data Structure Metrics. ACM Trans. Parallel Comput. 1(1), 3:1-3:20 (2014). DOI: 10.1145/2588788

[2] Download: splitme.tar.gz

[3] M. Wittmann, G. Hager, T. Zeiser, J. Treibig, and G. Wellein: Chip-level and multi-node analysis of energy-optimized lattice-Boltzmann CFD simulations. Concurrency and Computation: Practice and Experience 28(7), 2295-2315 (2015). DOI: 10.1002/cpe.3489 Preprint: arXiv:1304.7664