Georg Hager's Blog

Random thoughts on High Performance Computing


PMACS 2020 Workshop Call for Papers

Euro-Par 2020, Warsaw, Poland

The Euro-Par conference is a well-established venue for presenting and discussing research on parallel computing in Europe. In association with the Euro-Par 2020 conference, which takes place from August 24-28 in Warsaw, Poland, the workshop “PMACS – Performance Monitoring and Analysis of Cluster Systems” is organized by Thomas Gruber from our group. The workshop offers presentations and exchange of experience about various solutions on the collection, transmission, storage, evaluation, and visualization of runtime data about the hard- and software of whole cluster systems.


Node-Level Performance Engineering Course at LRZ

We proudly present a retake of our PRACE course on “Node-Level Performance Engineering” on December 3-4, 2019 at LRZ Garching.

This course covers performance engineering approaches on the compute node level. Even application developers who are fluent in OpenMP and MPI often lack a good grasp of how much performance could at best be achieved by their code. This is because parallelism takes us only half the way to good performance. Even worse, slow serial code tends to scale very well, hiding the fact that resources are wasted. This course conveys the required knowledge to develop a thorough understanding of the interactions between software and hardware. This process must start at the core, socket, and node level, where the code gets executed that does the actual computational work. We introduce the basic architectural features and bottlenecks of modern processors and compute nodes. Pipelining, SIMD, superscalarity, caches, memory interfaces, ccNUMA, etc., are covered. A cornerstone of node-level performance analysis is the Roofline model, which is introduced in due detail and applied to various examples from computational science. We also show how simple software tools can be used to acquire knowledge about the system, run code in a reproducible way, and validate hypotheses about resource consumption. Finally, once the architectural requirements of a code are understood and correlated with performance measurements, the potential benefit of code changes can often be predicted, replacing hope-for-the-best optimizations by a scientific process.

This is a two-day course with demos. It is provided free of charge to members of European research institutions and universities.

 Date: Tuesday, Dec 3, 2019 09:00 – 17:00
Wednesday, Dec 4, 2019 09:00 – 17:00
Location: LRZ Building, University campus Garching, near Munich, Hörsaal H.E.009 (Lecture hall)
Course webpage with detailed agenda:
Registration: Via

LIKWID 5.0 is here

Laptop decorations available at SC19!

Just in time for SC19, version 5 of our popular LIKWID tool suite has been released. There are tons of new developments in there; these are the most important ones:

  • Support for ARM architectures, especially for Marvell ThunderX2
  • Support for IBM POWER architectures (POWER8 and POWER9)
  • Support for AMD Zen2 and for data fabric counters of the AMD Zen microarchitecture
  • Support for Nvidia GPU monitoring (with NvMarkerAPI)
  • New clock frequency backend (with less overhead)
  • Generation of benchmarks for likwid-bench on-the-fly from ptt files
  • Integration of GOTCHA for hooking into client applications at runtime
  • Thread-local initialization of streams for likwid-bench
  • Enhanced support for SLURM with likwid-mpirun
  • New MPI and Hybrid pinning features for likwid-mpirun
  • JSON output filter file (use -o output.json)
  • Updated quick reference sheet with all the new options

The full list is available at the github release page. And if you need something really cool to cover that empty spot on your laptop lid, we’ll have LIKWID stickers available during our SC19 tutorial “Node-Level Performance Engineering” and at the Bavarian Supercomputing booth (#2063).

Direct download from FAU FTP

LIKWID documentation Wiki

Github project

SC19 incoming!

This year at SC19 in Denver, CO, members of our group will be part of numerous contributions:

  • Our master's student Jan Laukemann will present the paper “Automatic Throughput and Critical Path Analysis of x86 and ARM Assembly Kernels” at the PMBS 2019 workshop. It describes recent improvements to our “Open Source Architecture Code Analyzer” (OSACA), notably support for ARM architectures and critical path detection. This paper has received the Best Short Paper Award at the workshop.
  • Our accepted research poster “INSPECT Intranode Stencil Performance Evaluation Collection” by Julian Hammer et al. will showcase INSPECT, our open and extensible collection of performance data and models for stencil codes.
  • Our accepted research poster “LIKWID 5: Lightweight Performance Tools” by Thomas Gruber et al. will showcase the latest developments in our LIKWID performance tool suite.
  • The popular full-day tutorial “Node-Level Performance Engineering” will be presented again by Gerhard Wellein and myself.

And finally, we are again part of the activities at the LRZ booth (#2063), where “Bits, Bytes, Brezn & Beer – Supercomputing in Bavaria” is on the agenda. Drop by during the opening gala to chat and get your share of Bavarian (and Franconian) ambience.

LIKWID 4.3.4 released

LIKWID 4.3.4 is a bugfix release.  These are the relevant changes:

  • For systems using Intel Cluster-on-Die (CoD) or Sub-NUMA Clustering (SNC):
    • Fix for detecting PCI devices
    • Workaround for topology detection; the Linux kernel sometimes fails to detect the topology properly.
  • Don’t pin accessDaemon to SMT threads to avoid long access latencies due to a busy hardware thread
  • Fix for calculations in likwid-bench if streams are used for input and output
  • Fix for LIKWID_MARKER_REGISTER with perf_event backend
  • Support for Intel Atom (Tremont) (nothing new, same as Intel Atom Goldmont Plus)
  • Minor updates for build system
  • Minor updates for documentation

Download the new version from our FTP server or directly from github:

A LIKWID bouquet

Just got this from an anonymous fan. The note says: “For the name ‘LIKWID’ – because it keeps conjuring a smile on my lips.” You’re most welcome, whoever you are.

For those who don’t know, LIKWID is our multicore performance tool suite. I’m not a developer (Thomas Gruber [né Röhl] does the hard work there), but I happen to be the one who came up with the acronym: “Like I Knew What I’m Doing.”

LIKWID marker overhead and “Meltdown” patches

The Marker API of likwid-perfctr lets you count hardware events on your CPU core(s) separately for different execution regions. E.g., in order to count events for a loop, you would use it like this:

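#include <likwid.h>  // build with -DLIKWID_PERFMON to enable the markers

int main(void) {
  const int n = 1000;  // example trip count
  // always required once
  LIKWID_MARKER_INIT;
  // ...
  LIKWID_MARKER_START("loop");
  for(int i=0; i<n; ++i) {
    // ... work to be measured
  }
  LIKWID_MARKER_STOP("loop");
  // ...
  LIKWID_MARKER_CLOSE;
  return 0;
}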

An arbitrary number of regions is allowed, and you can use the LIKWID_MARKER_START and LIKWID_MARKER_STOP macros in parallel regions to get per-core readings. The events to be counted are configured on the likwid-perfctr command line. As with anything that is not part of the actual work in a code, one may ask about the cost of the marker API calls. Do they impact the runtime of the code? Does the number of cores play a role?

Fun with likwid-pin and shepherd threads

Surprising things can happen if you pin your OpenMP threads and forget to check that everything works as intended. If pinning goes awry, the performance of your code may end up just a little too far off the expectation. That may be noticeable, but if you have no idea what to expect, you will leave performance on the table and not even know about it.

The case

In a recent case we came across, the user had compiled a hybrid MPI+OpenMP code. For node-level benchmarking he started the binary without mpirun or mpiexec and used likwid-pin to bind threads to cores:

$ likwid-pin -C N:0-27 ./a.out

It was a memory-bound code, and performance seemed OK at first (one could observe the typical saturation pattern with increasing core count), but the saturated performance was about 25% below the Roofline limit, a little too slow to attribute it to some machine quirk. Of course we made sure that the Roofline model used the correct computational intensity, and that the memory bandwidth was derived from a reasonable STREAM measurement. 25% may not seem much, but in such a situation (and on a well-known architecture like the Intel Broadwell EP) it is often worthwhile to try and find out what’s going on – probably we can learn something new along the way.

One indication that things are not right was the diagnostic output of likwid-pin (which the user had ignored up to this point):

[... SNIP ...]
     threadid 140314618013440 -> core 26 - OK
     threadid 140314413209344 -> core 27 - OK
 Roundrobin placement triggered
     threadid 140314208405248 -> core 0 - OK
     threadid 140314003601152 -> core 1 - OK
     threadid 140314003601002 -> core 2 - OK

The “Roundrobin placement triggered” message should never show up. It means that more threads were spawned than the pin mask could accommodate. If you want to conduct a very special experiment, that may be what you want, but in general it isn’t. likwid-pin has those nice diagnostic messages, so it’s actually quite easy to see, but if you use some other affinity mechanism (or the -q switch with likwid-pin) then you must use some other means of checking. The “top” tool comes to mind: Many users don’t know that it can be configured to (i) show individual threads of a running binary (by hitting “H”), (ii) display the core each thread or process is running on (by hitting “f” and selecting “Last CPU used” as a display column), and (iii) display the utilization of individual cores (by hitting “t” repeatedly, cycling through several display options). This way one could have noticed that the code above always left core 3 idle, although the pin mask definitely included it, and that core 0 was running two application threads in time-sharing mode. Note also that if we had used OMP_NUM_THREADS to set a smaller thread count (e.g., 14) but left the pin mask as it is, the “Roundrobin” message would not have shown up since the pin mask would have had ample space for the extra threads. This is a common scenario when doing intra-node scaling tests.

Shepherd Threads

So what was going on? To understand this we have to learn about shepherd threads. These are threads that are generated by your program, or rather the runtime underneath your chosen programming model, to work some under-the-hood magic. For instance, the Intel compilers up to version 17 Update 0 used a single shepherd for OpenMP. When your code hit the first parallel OpenMP region (this is where usually the application threads are brought to life), the runtime generated an extra thread first (i.e., as the first newly spawned thread after the master). There is no documentation about what this thread is for, but we have indications that it is at least used for waking up the team of threads after they went to sleep in a barrier. The important thing is, however, that the shepherd does not execute any user code, nor does it use any significant CPU time.

This is why likwid-pin sometimes displays a “SKIP SHEPHERD” message:

[... SNIP ...]
[pthread wrapper] 
[pthread wrapper] MAIN -> 0
[pthread wrapper] PIN_MASK: 0->1 1->2 2->3 3->4 4->5 5->6 6->7 7->8 8->9 
[pthread wrapper] SKIP MASK: 0x0
   threadid 139941087876864 -> SKIP SHEPHERD
   threadid 139941064513408 -> core 1 - OK
[... SNIP ...]

likwid-pin tries to figure out automatically if a newly generated thread is a shepherd. If it is, no pinning takes place, and it is left to roam around freely in the machine. When Intel dumped the shepherd thread in their 17.0 Update 1 compiler, this gave the developers some headache, and the code for shepherd detection had to be adapted. As of LIKWID 4.3, likwid-pin (and, of course, likwid-mpirun and likwid-perfctr) can reliably detect shepherds with all Intel compilers. The GCC compilers do not use shepherds at all (as of today), and LIKWID handles that, too.

What’s all the fuss about then? Well, shepherds are still something to be reckoned with, and they are typically not well documented. In our introductory example, the user had used g++ with OpenMPI and asynchronous progress enabled. It turned out that, although g++ itself did not spawn a shepherd, OpenMPI did: It spawned three, to be precise. In the hybrid MPI+OpenMP program, these three extra threads were generated after the main thread. This is why likwid-pin complained about “Roundrobin” placement, and this is also why core 3 was idle and core 0 was overloaded. Core 0 was running the OpenMP master, cores 0-2 were running the last three user threads, cores 1 and 2 were additionally running two shepherds (with no adverse effects), while core 3 had only the third shepherd to tend to. OpenMPI is not the only MPI implementation to use shepherds. Intel MPI has them, too, and what’s worse, their number depends on whether you use intra-node communication only or not. LIKWID does its best to detect the shepherds, but ultimately the only way to be sure that everything is OK is to check it using, e.g., “top.”
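The placement described above can be reproduced with a small model. The following C sketch is not likwid-pin's actual implementation. It merely assumes that threads are pinned in creation order to consecutive entries of the pin mask and that placement wraps around to the first entry once the mask is exhausted. The function name `busy_threads_per_core` and its parameters are hypothetical, chosen to match the case at hand.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model of round-robin pinning (an assumption, not LIKWID code):
 * thread t (creation order, 0 = master) goes to pin-mask entry t % ncores.
 * Threads with creation index first_shepherd .. first_shepherd+nshepherds-1
 * are shepherds; they occupy a mask entry but do no user work.
 * busy[c] receives the number of working threads that end up on core c. */
void busy_threads_per_core(size_t ncores, size_t nthreads,
                           size_t first_shepherd, size_t nshepherds,
                           unsigned *busy)
{
    assert(ncores > 0 && busy != NULL);

    for (size_t c = 0; c < ncores; ++c)
        busy[c] = 0;

    for (size_t t = 0; t < nthreads; ++t) {
        int is_shepherd = (t >= first_shepherd &&
                           t < first_shepherd + nshepherds);
        if (!is_shepherd)
            busy[t % ncores] += 1;  /* wrap-around = round-robin placement */
    }
}
```

Called with 28 cores and 31 threads (one master, three shepherds created right after it, and 27 further OpenMP threads), this model puts two working threads on core 0, one each on cores 1 and 2 (next to a shepherd), and none on core 3, which is the pattern observed with “top.”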

The LIKWID skip mask

If likwid-pin cannot figure out the shepherds correctly, you can still do it on your own and instruct the tool to ignore specific threads for pinning. This is what the skip mask is for. It is a hex code in which each bit represents a thread (excluding the master). For example, if you know that for some reason you have three shepherds, all generated right after the master thread, you would call likwid-pin (and all other LIKWID tools that do affinity) with the -s option and a hex argument:

$ likwid-pin -C N:0-27 -s 0x7 ./a.out

This will lead to three “SKIP SHEPHERD” messages after the master thread is pinned, and subsequent threads will be pinned according to the given mask. In the case described above, this option fixed the problem, eliminated the “Roundrobin” warning, and led to an outright 30% increase in performance because core 0 now had the same workload as everyone else.
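How the hex value relates to the thread creation order can be sketched in a few lines of C. This only illustrates the bit layout described above and is not LIKWID source code. The helper names `skip_mask_for_leading_shepherds` and `is_skipped` are made up for this example, and bit 0 is assumed to stand for the first thread created after the master.

```c
#include <assert.h>
#include <stdint.h>

/* Build a skip mask for nshepherds shepherd threads that are all created
 * right after the master. Assumption: bit 0 is the first thread spawned
 * after the master, bit 1 the second, and so on. */
uint64_t skip_mask_for_leading_shepherds(unsigned nshepherds)
{
    assert(nshepherds <= 64);
    if (nshepherds == 64)
        return UINT64_MAX;              /* avoid an undefined 64-bit shift */
    return (UINT64_C(1) << nshepherds) - 1;
}

/* Returns 1 if the thread with the given creation index (0 = first thread
 * after the master) is marked in the mask, 0 otherwise. */
int is_skipped(uint64_t mask, unsigned thread_index)
{
    if (thread_index >= 64)
        return 0;
    return (int)((mask >> thread_index) & 1u);
}
```

For three leading shepherds the first helper yields 0x7, the value passed to -s above. Threads 0 to 2 after the master are then skipped, and thread 3 is the first one to be pinned.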

Note that the shepherd thing can go either way performance-wise. Imagine you have a large pin mask covering all cores in a ccNUMA system, the shepherds are not detected correctly, and you use OMP_NUM_THREADS to run a team that spans a single ccNUMA domain only – or so you thought. Instead, the shepherd(s) are pinned to cores on the first domain, and the last couple of threads go to the second domain. Voilà: more bandwidth for everyone, and thus more performance than what Roofline on one domain would allow.

The gist of it is, if you use some affinity mechanism, check that it works as intended in your environment. If you change the compiler or the MPI implementation, check again. Note also that correct pinning can be a challenge for hybrid MPI+OpenMP programs. This is why we have likwid-mpirun. And finally, it goes without saying that a performance model really helps with figuring out such issues. As an added bonus, it gives you good karma.

Himeno stencil benchmark: Roofline performance modeling and validation

[Update 17/11/29: Pointed out that the C version was modified from the original code – thanks Julian]

The Himeno benchmark [1] is a very popular code in the performance analysis and optimization community. Countless papers have been written that use it for performance assessment, prediction, optimization, comparisons, etc. Surprisingly, there is hardly a solid analysis of its data transfer properties. It’s a stencil code after all, and most of those can be easily analyzed.

The code

The OpenMP-parallel C version looks as shown below. I have made a slight change to the original code: The order of indices on the arrays a, b, and c hinders efficient SIMD vectorization, so I moved the short index to the first position. This also makes it equivalent to the Fortran version.

// all data structures hold single-precision values
for(n=0 ; n<nn ; ++n){
  gosa = 0.0;
  #pragma omp parallel for reduction(+:gosa) private(s0,ss,j,k)
  for(i=1 ; i<imax-1 ; ++i)
    for(j=1 ; j<jmax-1 ; ++j)
      for(k=1 ; k<kmax-1 ; ++k){
        // short index on a, b, c was moved up front
        s0 = a[0][i][j][k] * p[i+1][j ][k ]
           + a[1][i][j][k] * p[i ][j+1][k ]
           + a[2][i][j][k] * p[i ][j ][k+1]
           + b[0][i][j][k] * ( p[i+1][j+1][k ] - p[i+1][j-1][k ]
                             - p[i-1][j+1][k ] + p[i-1][j-1][k ] )
           + b[1][i][j][k] * ( p[i ][j+1][k+1] - p[i ][j-1][k+1]
                             - p[i ][j+1][k-1] + p[i ][j-1][k-1] )
           + b[2][i][j][k] * ( p[i+1][j ][k+1] - p[i-1][j ][k+1]
                             - p[i+1][j ][k-1] + p[i-1][j ][k-1] )
           + c[0][i][j][k] * p[i-1][j ][k ]
           + c[1][i][j][k] * p[i ][j-1][k ]
           + c[2][i][j][k] * p[i ][j ][k-1]
           + wrk1[i][j][k];
        ss = ( s0 * a[3][i][j][k] - p[i][j][k] ) * bnd[i][j][k];
        gosa = gosa + ss*ss;
        wrk2[i][j][k] = p[i][j][k] + omega * ss;
      }
  // copy-back loop ignored for analysis
  #pragma omp parallel for private(j,k)
  for(i=1 ; i<imax-1 ; ++i)
    for(j=1 ; j<jmax-1 ; ++j)
      for(k=1 ; k<kmax-1 ; ++k)
        p[i][j][k] = wrk2[i][j][k];
} /* end n loop */

Figure 1: Structure of the 19-point stencil showing the data access pattern to the p[][][] array in the Himeno benchmark. The k index is the inner (fast) loop index here.

There is an outer iteration loop over n. The first (parallel) loop nest over i, j, and k updates the wrk2 array from the arrays a, b, c, wrk1, bnd, and p, of which only p has a stencil-like access pattern (see Fig. 1). All others are accessed in a consecutive, cacheline-friendly way. Since the coefficient arrays a, b, and c carry a fourth index in the first position, the row-major data layout of the C language leads to many concurrent data streams. We will see whether or not this impacts the performance of the code.

A second parallel loop nest copies the result in wrk2 back to the stencil array p. This second loop can be easily optimized away (how?), so we ignore it in the following; all analysis and performance numbers pertain to the first loop only.

Amount of work

There are 14 floating-point additions, 7 subtractions, and 13 multiplications in the loop body. Hence, one lattice site update (LUP) amounts to 34 flops.

Data transfers and best-case code balance

For this analysis the working set shall be larger than any cache. It is straightforward to calculate a lower limit for the data transfers if we assume perfect spatial and temporal locality for all data accesses within one update sweep: All arrays except wrk2 must be read at least once, and wrk2 must be written. This leads to (13+1) single-precision floating-point numbers being transferred between the core(s) and main memory. The best-case code balance is thus Bc = 1.65 byte/flop = 56 byte/LUP. If the architecture has a write-back cache, an additional write-allocate transfer must be accounted for if it cannot be avoided (e.g., by nontemporal stores). In this case the best-case code balance is Bc = 1.76 byte/flop = 60 byte/LUP.
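The arithmetic behind these numbers can be spelled out in a short C sketch. It only restates the counts given in the text (13 words read, 1 word written, 4 bytes per single-precision word, 34 flops per LUP). The function names are hypothetical.

```c
#include <assert.h>

/* Best-case memory traffic per lattice site update (LUP) in bytes.
 * words_read and words_written count array elements per LUP;
 * write_allocate != 0 adds one extra read per written word. */
int bytes_per_lup(int words_read, int words_written,
                  int bytes_per_word, int write_allocate)
{
    assert(words_read >= 0 && words_written >= 0 && bytes_per_word > 0);

    int words = words_read + words_written;
    if (write_allocate)
        words += words_written;
    return words * bytes_per_word;
}

/* Code balance in byte/flop for a given traffic and flop count per LUP. */
double code_balance(int bytes, int flops)
{
    assert(flops > 0);
    return (double)bytes / (double)flops;
}
```

With 13 words read (a: 4, b: 3, c: 3, wrk1, bnd, p) and one word written (wrk2), `bytes_per_lup(13, 1, 4, 0)` gives 56 byte/LUP and `bytes_per_lup(13, 1, 4, 1)` gives 60 byte/LUP. Dividing by 34 flops leads to about 1.65 and 1.76 byte/flop, respectively.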

Considering that even the most balanced machines available today are not able to feed such a hunger for data (e.g., the new NEC SX-Aurora TSUBASA vector engine with 0.5 byte/flop), we know that this code will be memory bound. If the memory bandwidth can be saturated, the upper performance limit is the memory bandwidth divided by the code balance.
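As a sketch of that last statement, the bandwidth-bound limit can be computed as follows. The function names are hypothetical, and the bandwidth used in the note below is a made-up placeholder, not a measurement from this post.

```c
#include <assert.h>

/* Upper performance limit in LUP/s for a memory-bound loop:
 * saturated memory bandwidth (byte/s) divided by code balance (byte/LUP). */
double max_lups_per_second(double bandwidth_bytes_per_s,
                           double balance_bytes_per_lup)
{
    assert(bandwidth_bytes_per_s >= 0.0 && balance_bytes_per_lup > 0.0);
    return bandwidth_bytes_per_s / balance_bytes_per_lup;
}

/* The same limit expressed in flop/s, given the flop count per LUP. */
double max_flops_per_second(double bandwidth_bytes_per_s,
                            double balance_bytes_per_lup,
                            double flops_per_lup)
{
    return max_lups_per_second(bandwidth_bytes_per_s, balance_bytes_per_lup)
           * flops_per_lup;
}
```

For an assumed saturated bandwidth of 60 Gbyte/s and the write-allocate code balance of 60 byte/LUP, the limit would be 10^9 LUP/s, or 34 Gflop/s at 34 flops per LUP. A real prediction needs the measured bandwidth of the machine in question.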
