Georg Hager's Blog

Random thoughts on High Performance Computing


LIKWID marker overhead and “Meltdown” patches

The Marker API of likwid-perfctr lets you count hardware events on your CPU core(s) separately for different execution regions. E.g., in order to count events for a loop, you would use it like this:

#include <likwid.h>

int main(...) {
  // always required once
  // ...
  for(int i=0; i<n; ++i) {
  // ...
  return 0;

An arbitrary number of regions is allowed, and you can use the LIKWID_MARKER_START and LIKWID_MARKER_STOP macros in parallel regions to get per-core readings. The events to be counted are configured on the likwid-perfctr command line. As with anything that is not part of the actual work in a code, one may ask about the cost of the marker API calls. Do they impact the runtime of the code? Does the number of cores play a role?

I ran benchmark tests on a dual-Xeon “Broadwell” E5-2697 v4 (2.3 GHz) node with 18 cores per socket in non-CoD mode. The OS was Ubuntu 16.04.4 LTS with kernel 4.4.0-119-generic and the latest patches. The clock speed (core and Uncore) was set to 2.3 GHz.

This is the benchmark code (slightly simplified – see below for a download link):

#pragma omp parallel

do {
  // time measurement

  #pragma omp parallel private(k)
    for(k=0; k<NITER; ++k) {
      if(divsleep(200000) < 0.0)

} while (wct_end-wct_start < 0.2);


Figure 1: LIKWID marker overhead vs. number of cores (compact pinning, physical cores only) for four different metric groups on the 36-core Broadwell system (LIKWID 4.3.2).

The divsleep() function performs the indicated number of floating-point divides and is a reliable busy-waiting in-core routine that always takes the same amount of time (about 4 cycles per scalar divide on a Broadwell CPU, 7 on an Ivy Bridge). Running the benchmark without markers activated (i.e., without the -m option to likwid-perfctr) , it reports exactly the expected number of cycles (800k on BDW, 1.4 million on IVY). The marker calls return immediately in this case, and of course the runtime does not depend on the number of OpenMP threads. The macro LIKWID_MARKER_REGISTER avoids excessive overhead when LIKWID_MARKER_START is called for the first time. In our case it’s not strictly necessary since the actual measurement is taken multiple times with increasing NITER anyway, but in general it’s a good idea to call LIKWID_MARKER_REGISTER for every marker region that is used later in the code.

With markers activated, the overhead becomes visible. I subtracted the constant runtime of the delay routine from the measured runtime. The result has a slight statistical variation, but the chosen runtime of at least 0.2 seconds makes the measurements quite reproducible.

Figure 1 shows the resulting overhead cycles on the Broadwell system versus the number of threads (compact pinning, physical cores only) with the current version of LIKWID (4.3.2). Out of the considerable number of available metric groups in LIKWID, I chose four: DATA (core-only, loads/stores), CACHES (full account of cacheline traffic through the memory hierarchy), ENERGY (RAPL measurements of energy consumption), and MEM_DP (combination of flops, memory traffic, and energy). These are the relevant observations:

Figure 2: Overhead for the LIKWID marker API calls on a patched Ivy Bridge node (LIKWID 4.3.1).

  1. The overhead depends weakly on the number of cores. There is a noticeable jump when the second socket is involved, but the overhead cycles always stay in the same ballpark.
  2. The overhead depends strongly on the events that are counted. Roughly speaking, “the more events, the more expensive.” CACHES is most toxic in this respect, with over 5 million cycles on the whole node.
  3. The marker overhead is much larger than typical OpenMP overheads (which are around a couple of 1000s cycles for barriers, for instance, when using the Intel OpenMP runtime).

Since starting and stopping counters requires access to the MSR registers and/or PCI devices (for Uncore events), do the recent microcode and kernel patches to mitigate the “Meltdown” hardware vulnerability scenario worsen the overhead? All our test cluster machines have the latest patches installed, but we have kept one node of our Ivy-Bridge-based Emmy cluster in a non-patched state.  Emmy has Intel Xeon “Ivy Bridge” E5-2660 v2 CPUs at 2.2 GHz and runs CentOS with kernel 3.10.0-693.21.1.el7.x86_64; on the non-patched node we run 3.10.0-693.11.6.el7.x86_64. There are also no microcode patches on the latter, so it is in the state it was in before Meltdown was discovered. We don’t have the latest LIKWID version on Emmy yet, but 4.3.1 is available, and there haven’t been any changes recently that would influence the marker overhead.

Figure 3: LIKWID marker overhead on a non-patched Ivy Bridge node (LIKWID 4.3.1).

Figure 2 shows the results for the same experiment as above on a patched node. The general observations are the same, even the relative overhead between different metric groups is comparable, and we are also in the range between 1 and 5 million cycles per start/stop pair.

Figure 3 shows the results on the non-patched node. The overhead is roughly a factor of two smaller. So, while the Meltdown patches haven’t let the marker overhead grow without bounds, they still caused a significant increase.

A couple of million cycles may be insignificant for many applications, but this is in the millisecond range – knowing the cost enables us to calculate how fine-grained the marker regions may become before impacting the code performance.

Benchmark code for download: The makefile assumes that the variables LIKWID_LIB and LIKWID_INC are set appropriately, which is done automatically by our modules system. Adapt as needed.


LIKWID 4.3.2 released

LIKWID ToolsLIKWID 4.3.2 has just been released. This is an important maintenance release if you need support for the latest architectures:

  • Support for Intel Knights Mill (core, RAPL, Uncore)
  • AMD Zen: Use RETIRED_INSTRUCTIONS instead of fixed-purpose counter for metric calculation
  • Intel Skylake X: Some fixes for events and performance groups

Bug fixes and minor enhancements:

  • Set KMP_INIT_AT_FORK to bypass bug in Intel OpenMP memory allocator
  • All FLOPS_* groups now have vectorization ratio
  • Fix for Marker API with perf_event backend
  • Fix for maximal/minimal Uncore frequency
  • Skip counters that are already in use, don’t exit
  • likwid-mpirun: minor fix when overloading a host
  • Improved detection of PCI devices
  • Fix in internal metric calculator

You can either clone the git repository or download the package from our FTP server.


Fun with likwid-pin and shepherd threads

Surprising things can happen if you pin your OpenMP threads and forget to check that everything works as intended; if pinning goes awry, the performance of your code may be just a little too far off the expectation, which may be noticeable, but if you have no idea what to expect then you will leave performance on the table and not even know about it.

The case

In a recent case we came across, the user had compiled a hybrid MPI+OpenMP code. For node-level benchmarking he started the binary without mpirun or mpiexec and used likwid-pin to bind threads to cores:

$ likwid-pin -C N:0-27 ./a.out

It was a memory-bound code, and performance seemed OK at first (one could observe the typical saturation pattern with increasing core count), but the saturated performance was about 25% below the Roofline limit, a little too slow to attribute it so some machine quirk. Of course we made sure that the Roofline model used the correct computational intensity, and that the memory bandwidth was derived from a reasonable STREAM measurement. 25% may not seem much, but in such a situation (and on a well-known architecture like the Intel Broadwell EP) it is often worthwhile to try and find out what’s going on – probably we can learn something new along the way.

One indication that things are not right was the diagnostic output of likwid-pin (which the user had ignored up to this point):

[... SNIP ...]
     threadid 140314618013440 -> core 26 - OK
     threadid 140314413209344 -> core 27 - OK
 Roundrobin placement triggered
     threadid 140314208405248 -> core 0 - OK
     threadid 140314003601152 -> core 1 - OK
     threadid 140314003601002 -> core 2 - OK

The “Roundrobin placement triggered” message should never show up. It means that more threads were spawned than the pin mask could accommodate.  If you want to conduct a very special experiment, that may be what you want, but in general it isn’t. likwid-pin has those nice diagnostic messages, so it’s actually quite easy to see, but if you use some other affinity mechanism (or the -q switch with likwid-pin) then you must use some other means of checking. The “top” tool comes to mind: Many users don’t know that it can be configured to (i) show individual threads of a running binary (by hitting “H”), (ii) display the core each thread or process is running on (by hitting “f” and selecting “Last CPU used” as a display column), and (iii) to display the utilization of individual cores (by hitting “t” repeatedly, cycling through several display options). This way one could have noticed that the code above always left core 3 idle, although the pin mask definitely included it, and that core 0 was running two application threads in time-sharing mode. Note also that if we had used OMP_NUM_THREADS to set a smaller thread count (e.g., 14) but left the pin mask as it is, the “Roundrobin” message would not have shown up since the pin mask would have had ample space for the extra threads. This is a common scenario when doing intra-node scaling tests.

Shepherd Threads

So what was going on? To understand this we have to learn about shepherd threads. These are threads that are generated by your program, or rather the runtime underneath your chosen programming model, to work some under-the-hood magic. For instance, the Intel compilers up to version 17 Update 0 used a single shepherd for OpenMP. When your code hit the first parallel OpenMP region (this is where usually the application threads are brought to life), the runtime generated an extra thread first (i.e., as the first newly spawned thread after the master). There is no documentation about what this thread is for, but we have indications that it is at least used for waking up the team of threads after they went to sleep in a barrier. The important thing is, however, that the shepherd does not execute any user code, nor does it use any significant CPU time.

This is why likwid-pin sometimes displays a “SKIP SHEPHERD” message:

[... SNIP ...]
[pthread wrapper] 
[pthread wrapper] MAIN -> 0
[pthread wrapper] PIN_MASK: 0->1 1->2 2->3 3->4 4->5 5->6 6->7 7->8 8->9 
[pthread wrapper] SKIP MASK: 0x0
   threadid 139941087876864 -> SKIP SHEPHERD
   threadid 139941064513408 -> core 1 - OK
[... SNIP ...]

likwid-pin tries to figure out automatically if a newly generated thread is a shepherd. If it is, no pinning takes place, and it is left to roam around freely in the machine. When Intel dumped the shepherd thread in their 17.0 Update 1 compiler, this gave the developers some headache, and the code for shepherd detection had to be adapted. As of LIKWID 4.3, likwid-pin (and, of course, likwid-mpirun and likwid-perfctr) can reliably detect shepherds with all Intel compilers. The GCC compilers do not use shepherds at all (as of today), and LIKWID handles that, too.

What’s all the fuss about then? Well, shepherds are still something to be reckoned with, and they are typically not well documented. In our introductory example, the user had used g++ with OpenMPI and asynchronous progress enabled. It turned out that, although g++ itself did not spawn a shepherd, OpenMPI did: It spawned three, to be precise. In the hybrid MPI+OpenMP program, these three extra threads were generated after the main thread. This is why likwid-pin complained about “Roundrobin” placement, and this is also why core 3 was idle and core 0 was overloaded. Core 0 was running the OpenMP master, cores 0-2 were running the last three user threads, cores 1 and 2 were additionally running two shepherds (with no adverse effects), while core 3 had only the third shepherd to tend to. OpenMPI is not the only MPI implementation to use shepherds. Intel MPI has them, too, and what’s worse, their number depends on whether you use intra-node communication only or not. LIKWID does its best to detect the shepherds, but ultimately the only way to be sure that everything is OK is to check it using, e.g., “top.”

The LIKWID skip mask

If likwid-pin cannot figure out the shepherds correctly, you can still do it on your own and instruct the tool to ignore specific threads for pinning. This is what the skip mask is for. It is a hex code in which each bit represents a thread (excluding the master). For example, if you know that for some reason you have three shepherds, all generated right after the master thread, you would call likwid-pin (and all other LIKWID tools that do affinity) with the -s option and a hex argument:

$ likwid-pin -C N:0-27 -s 0x7 ./a.out

This will lead to three “SKIP SHEPHERD” messages after the master thread is pinned, and subsequent threads will be pinned according to the given mask. In the case described above, this option fixed the problem, eliminated the “Roundrobin” warning, and led to an outright 30% increase in performance because core 0 now had the same workload as everyone else.

Note that the shepherd thing can go either way performance-wise. Imagine you have a large skip mask covering all cores in a ccNUMA system, the shepherds are not detected correctly, and you use OMP_NUM_THREADS to run a team that spans a single ccNUMA domain only – or so you thought. Instead, the shepherd(s) are pinned to cores on the first domain, and the last couple of threads go to the second domain. Voilà: more bandwidth for everyone, and thus more performance than what Roofline on one domain would allow.

The gist of it is, if you use some affinity mechanism, check that it works as intended in your environment. If you change the compiler or the MPI implementation, check again. Note also that correct pinning can be a challenge for hybrid MPI+OpenMP programs. This is why we have likwid-mpirun. And finally, it goes without saying that a performance model really helps with figuring out such issues. As an added bonus, it gives you good karma.

Open Source Architecture Code Analyzer (OSACA) has been released

The Open Source Architecture Code Analyzer (OSACA) is a tool that can analyze assembly or machine code and produce a best-case (throughput-limited) runtime prediction assuming that the data is in the L1 cache. Such a tool is sorely needed for processor architectures other than Intel’s. Intel provides the Intel Architecture Code Analyzer (IACA) for free, but it is not open source and its future development path is uclear.

Why such a tool? Analytic performance models, such as the ECM model, depend on an accurate assessment of in-core execution performance. You can either do that manually by code (source or assembly) inspection, or you can use a tool that knows the instruction set and the limitations of a particular microarchitecture. The data flow analysis must be done by someone else – again, it’s either your brain or, e.g., our Kerncraft tool.

Jan Laukemann, our bachelor’s candidate, has taken on the tremendous task of developing the initial version of an IACA replacement, and OSACA is the result. It is downloadable from github: We haven’t even bothered to give it a version number yet, so this is definitely in alpha stage, but it is usable and can do some things that IACA can’t, such as, e.g., analyze non-compiled assembly code or extend its own database with new instructions.

Feedback is encouraged and welcome.