Georg Hager's Blog

Random thoughts on High Performance Computing


Intel Architecture Code Analyzer (IACA) R.I.P.! All hail OSACA!

Intel recently announced that their popular Architecture Code Analyzer (IACA) “has reached its End of Life” (sic!). Frankly speaking, it was never an official product anyway, but performance-aware bitfiddlers like my colleagues and me found it extremely useful. It’s strange that Intel decided to dump it right after a complete rewrite with version 3.0. Big mistake. Think “A380”.

Given a piece of object code, the latest version of IACA was able to predict its runtime, assuming no dependencies between instructions and full pipeline throughput. This is quite an optimistic assumption – earlier versions (here’s another useful thing they dumped) could also produce a “pessimistic” prediction based on the instruction latencies along the critical path. The actual runtime was typically in between, and an experienced performance engineer could read a lot out of the IACA output. Furthermore, the IACA predictions were one input to the ECM performance model and to Kerncraft, our loop performance modeling tool.
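In case you never used it: you mark the kernel of interest in the source and let the tool analyze the marked block in the compiled object file. Here is a minimal sketch (the kernel is made up for illustration; iacaMarks.h ships with the IACA download and provides the IACA_START/IACA_END byte markers):

#include "iacaMarks.h"

/* Illustrative kernel only -- mark one loop iteration for throughput analysis */
void scale(double *restrict a, const double *restrict b, double s, int n) {
  for (int i = 0; i < n; ++i) {
    IACA_START            /* begin of the analyzed block (inside the loop) */
    a[i] = s * b[i];
  }
  IACA_END                /* end marker, placed right after the loop */
}

The compiled object file is then fed to the analyzer (in version 3 something like iaca -arch SKL scale.o), which prints a throughput prediction and the port pressure per instruction. OSACA (see below) understands IACA-style markers, too.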

Fortunately, alternatives exist. Besides LLVM-MCA, which may or may not be useful for some, our tool OSACA has set out to become a full-fledged replacement for IACA, batteries included. Right now it can handle throughput analysis for Intel and AMD CPUs [1]; work on critical path analysis and support for ARM architectures is ongoing. Some undisclosed insight that was coded into IACA is unavailable to us, so predictions may differ. It’s work in progress, but you can check it out. Feedback is always welcome!

[1]  J. Laukemann, J. Hammer, J. Hofmann, G. Hager, and G. Wellein: Automated Instruction Stream Throughput Prediction for Intel and AMD Microarchitectures. 2018 IEEE/ACM Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), Dallas, TX, USA, 2018, pp. 121-131. DOI: 10.1109/PMBS.2018.8641578. Preprint: arXiv:1809.00912


Christie Alappat takes second place in ACM Student Research Competition Grand Finals

By winning the ACM Student Research Competition (SRC) at SC18 at the graduate level last year, our PhD student Christie Louis Alappat advanced to the ACM SRC Grand Finals, where the winners from 26 ACM conferences contend for the Grand Prize. For this final round he had to prepare a five-page paper about his research. This paper, and the whole body of his work, was then evaluated once more by a panel of judges.

We are now happy to announce that Christie has taken second place in the SRC Grand Finals. Together with his advisor, Prof. Gerhard Wellein, he has been invited to the awards ceremony, which will take place in San Francisco on June 15. This is the very same ceremony at which Yoshua Bengio, Geoffrey Hinton, and Yann LeCun will receive the prestigious 2018 ACM Turing Award for their seminal work on deep learning algorithms. Talk about good company!

Christie’s research revolves around a long-standing problem in computer science: How must a graph be colored to enable parallel processing in the presence of dependencies? His solution, the “Recursive Algebraic Coloring Engine” (RACE), can be used to parallelize many sparse algorithms in a hardware-efficient way, taking the specific properties of modern multicore chips into account. It outperforms existing approaches and libraries by a significant margin on an operation as important as symmetric sparse matrix-vector multiplication (SymmSpMV), but its range of applicability is much broader. Christie has prepared a walk-through of his SC18 poster to explain the details:

[Embedded video: walk-through of Christie’s SC18 poster]
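To get an idea of why coloring helps at all, here is a minimal sketch in plain C/OpenMP. This is not RACE (all array names are made up, and Christie’s engine is far more elaborate and hardware-aware); it merely shows the classic approach: given a distance-2 coloring, rows of the same color touch disjoint entries of the result vector, so they can be processed concurrently, while the colors themselves run one after another.

#include <omp.h>

/* Sketch only, *not* the RACE library: SymmSpMV (y += A*x) with the upper
 * triangle of the symmetric matrix A stored in CRS format. A distance-2
 * coloring guarantees that same-color rows share no column index, so their
 * updates to y cannot conflict. */
void symm_spmv_colored(int ncolors,
                       const int *colorPtr,   /* size ncolors+1             */
                       const int *colorRows,  /* row indices grouped by color */
                       const int *rowPtr, const int *colIdx,
                       const double *val,     /* CRS of the upper triangle  */
                       const double *x, double *y)
{
  for (int c = 0; c < ncolors; ++c) {         /* colors: sequential         */
    #pragma omp parallel for schedule(static)
    for (int k = colorPtr[c]; k < colorPtr[c+1]; ++k) {
      int row = colorRows[k];                 /* same-color rows: parallel  */
      for (int j = rowPtr[row]; j < rowPtr[row+1]; ++j) {
        int col = colIdx[j];
        y[row] += val[j] * x[col];
        if (col != row)                       /* symmetric counterpart      */
          y[col] += val[j] * x[row];
      }
    }
  }
}

The trouble with this naive scheme is the synchronization at every color boundary and the loss of data locality from jumping between scattered rows – exactly the kind of problem that RACE’s recursive, hardware-aware approach addresses.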

LIKWID 4.3.4 released

LIKWID 4.3.4 is a bugfix release. These are the relevant changes:

  • For systems using Intel Cluster-on-Die (CoD) or Sub-NUMA Clustering (SNC):
    • Fix for detecting PCI devices
    • Workaround for topology detection, since the Linux kernel sometimes does not detect it properly
  • Don’t pin accessDaemon to SMT threads, to avoid long access latencies caused by a busy hardware thread
  • Fix for calculations in likwid-bench if streams are used for input and output
  • Fix for LIKWID_MARKER_REGISTER with perf_event backend
  • Support for Intel Atom (Tremont) (nothing new, handled the same as Intel Atom Goldmont Plus)
  • Minor updates for build system
  • Minor updates for documentation

Download the new version from our FTP server or directly from GitHub:

ftp://ftp.rrze.uni-erlangen.de/mirrors/likwid/likwid-4.3.4.tar.gz
https://github.com/RRZE-HPC/likwid/releases/tag/4.3.4

The McCalpin STREAM benchmark: How to do it right and interpret the results

The STREAM benchmark [1] is more than 20 years old, and it has become the de facto standard for measuring the maximum achievable memory bandwidth of processors. However, some of the published numbers are misinterpreted. This is partly because STREAM is, contrary to expectations, not a “black box” benchmark that you can trust to yield the right answer. In order to understand the STREAM results, it is necessary to grasp the following concepts:

  1. Machine topology and thread affinity
  2. Write-allocate transfers (see the example right after this list)
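To see what write-allocate transfers do to the numbers, take the STREAM “triad” a(i) = b(i) + s*c(i): two loads and one store per iteration make 24 byte of “visible” traffic with double precision arrays. On a cache with write-allocate policy, however, the store misses on a[] additionally read each cache line from memory first, so 32 byte per iteration actually cross the memory interface – and counting only 24 underestimates the achieved bandwidth by 25%. A minimal sketch of the kernel (not the official benchmark; no timing, and the array size is made up):

#include <stdio.h>
#include <stdlib.h>

#define N 20000000L  /* ~480 MB working set in total: far too big for any cache */

int main(void) {
  double *a = malloc(N * sizeof *a);
  double *b = malloc(N * sizeof *b);
  double *c = malloc(N * sizeof *c);
  double s = 3.0;

  for (long i = 0; i < N; ++i) {  /* initialization (also maps the pages) */
    a[i] = 0.0; b[i] = 1.0; c[i] = 2.0;
  }

  /* Triad: 16 byte loaded (b, c) and 8 byte stored (a) per iteration,
   * plus 8 byte of write-allocate traffic for a[i] unless the compiler
   * generates nontemporal ("streaming") stores. */
  for (long i = 0; i < N; ++i)
    a[i] = b[i] + s * c[i];

  printf("a[0] = %f\n", a[0]);    /* prevent dead-code elimination */
  free(a); free(b); free(c);
  return 0;
}

Whether the compiler emits nontemporal stores (which bypass the write-allocate) is one reason why published STREAM numbers are hard to compare; thread affinity is the other: without pinning (e.g., via likwid-pin, or OMP_PLACES/OMP_PROC_BIND), threads and their memory pages may end up in the wrong places of the machine topology.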

Most of the mistakes people make when taking STREAM measurements are based on a mis- or non-understanding of these two concepts. Continue reading

Christie Louis Alappat wins the SC18 ACM Student Research Competition Award

Picture (c) SC Conference. Used with permission.

Our PhD student Christie Louis Alappat just took first place in the ACM Student Research Competition at SC18. His work revolved around the “Recursive Algebraic Coloring Engine,” a new method and library for hardware-efficient graph coloring. This means that he will advance to the Grand Finals next year. Congratulations!

Christie’s project is part of the activities in ESSEX-II, a project funded by the German Science Foundation (DFG) within the priority programme SPPEXA.

As part of the competition, Christie has prepared a video with details about his work:

[Embedded video: Christie explains his SC18 SRC project]

SC18 incoming!

This year at SC18 in Dallas, TX, members of our group will be part of numerous contributions:

And finally, we are part of the activities at the LRZ booth (#1841), where “Bits, Bytes, Brezel & Bier – Supercomputing in Bavaria” is on the agenda. Drop by during the opening gala to chat and get your share of Bavarian (and Franconian) ambience.

A LIKWID bouquet


Just got this from an anonymous fan. The note says: “For the name ‘LIKWID’ – because it keeps conjuring a smile on my lips.” You’re most welcome, whoever you are.

For those who don’t know, LIKWID is our multicore performance tool suite. I’m not a developer (Thomas Gruber [né Röhl] does the hard work there), but I happen to be the one who came up with the acronym: “Like I Knew What I’m Doing.”

And the 2018 Gauss Award goes to: Erlangen!

The Gauss Award is sponsored by the Gauss Centre for Supercomputing (GCS), a collaboration of the German national supercomputing centers at Garching, Jülich, and Stuttgart. The winner receives a cash prize of 3,000 €, courtesy of the Gauss Centre, which is traditionally presented during the ISC Conference Opening Session. This year, the Gauss Award committee has selected a paper reporting on the outcome of a collaboration between the Chair for Computer Architecture and the HPC group at the computing center (RRZE) of FAU:

Johannes Hofmann, Georg Hager, and Dietmar Fey:

On the Accuracy and Usefulness of Analytic Energy Models for Contemporary Multicore Processors

In this paper we have extended the execution-cache-memory (ECM) performance model developed at RRZE to describe more accurately the saturation behavior of memory-bound code as the number of cores is increased. Together with an improved power consumption model, which takes into account frequency- (and thus voltage-) dependent static power dissipation and the presence of a separate Uncore clock domain in recent Intel CPUs, we can now describe very accurately the performance and energy consumption of steady-state loops over a wide range of clock frequency settings and core counts. Although the paper mostly deals with Intel Xeon Sandy Bridge and Broadwell CPUs and “simple” kernels such as STREAM and DGEMM, our models work very well for other architectures (e.g., AMD Epyc) and other codes (such as stencil algorithms), too.
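If you have not encountered the ECM model before: its multicore part boils down to a simple scaling law. As a generic sketch (simplified notation, not the refined formulation from the paper),

P(n) = \min\left( n \, P_{\mathrm{ECM}},\; P_{\mathrm{sat}} \right), \qquad n_s = \left\lceil P_{\mathrm{sat}} / P_{\mathrm{ECM}} \right\rceil,

i.e., the single-core performance P_ECM scales linearly with the number of cores n until the bandwidth-imposed saturation performance P_sat is reached, which happens at n_s cores. The modeled energy consumption then follows from multiplying the modeled package power by the modeled runtime.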

Johannes Hofmann will present our work during the ISC award session on June 25, 2018: https://2018.isc-program.com/?page_id=10&id=pap122&sess=sess201

The paper is available at DOI: 10.1007/978-3-319-92040-5_2. You can download a (pre-review) preprint version at arXiv:1803.01618.

LIKWID marker overhead and “Meltdown” patches

The Marker API of likwid-perfctr lets you count hardware events on your CPU core(s) separately for different execution regions. For example, to count events for a loop, you would use it like this:

#include <likwid.h>

void do_some_work(void);  // the work to be measured (defined elsewhere)

int main(void) {
  int n = 1000;           // number of iterations (example)
  // always required once
  LIKWID_MARKER_INIT;
  // ...
  LIKWID_MARKER_START("loop");
  for(int i=0; i<n; ++i) {
    do_some_work();
  }
  LIKWID_MARKER_STOP("loop");
  // ...
  LIKWID_MARKER_CLOSE;
  return 0;
}

An arbitrary number of regions is allowed, and you can use the LIKWID_MARKER_START and LIKWID_MARKER_STOP macros in parallel regions to get per-core readings. The events to be counted are configured on the likwid-perfctr command line. As with anything that is not part of the actual work in a code, one may ask about the cost of the marker API calls. Do they impact the runtime of the code? Does the number of cores play a role? Continue reading
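A practical note in case you have never used the Marker API: the macros expand to nothing unless the code is compiled with -DLIKWID_PERFMON and linked against the library (-llikwid), and the instrumented binary must be run under likwid-perfctr with the marker option, e.g., likwid-perfctr -C 0-3 -g FLOPS_DP -m ./a.out. Without the -m switch you only get plain end-to-end counts for the whole run.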

Himeno stencil benchmark: ECM model, SIMD, data layout

In a previous post I have shown how to construct and validate a Roofline performance model for the Himeno benchmark. The relevant findings were:

  • The Himeno benchmark is a rather standard stencil code that is amenable to the well-known layer condition analysis. For in-memory data sets it achieves a performance that is well described by the Roofline model.
  • The performance potential of spatial blocking is limited to about 10% in the saturated case (on a Haswell-EP socket), because the data transfers are dominated by coefficient arrays with no temporal reuse.
  • The large number of concurrent data streams through the cache hierarchy and into memory does not hurt the performance, at least not too much. We had chosen a version of the code which was easy to vectorize but had a lot of parallel data streams (at least 15, probably more if layer conditions are broken).
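A quick reminder before the deep dive (a generic sketch in the usual notation; all terms are defined properly in the full post): the single-core ECM prediction composes the time to process one unit of work (typically one cache line per data stream) from the in-core execution time and the data transfer times through the memory hierarchy,

T_{\mathrm{ECM}} = \max\left( T_{\mathrm{OL}},\; T_{\mathrm{nOL}} + T_{\mathrm{L1L2}} + T_{\mathrm{L2L3}} + T_{\mathrm{L3Mem}} \right),

where T_OL is the part of the in-core time that overlaps with data transfers, T_nOL is the non-overlapping part (essentially the load/store instructions), and the remaining terms are the cache line transfer times between adjacent memory levels.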

Some further questions pop up if you want more insight: Is SIMD vectorization relevant at all? Does the data layout matter? What is the single-core performance in relation to the saturated performance, and why? All these questions can be answered by a detailed ECM model, and this is what we are going to do here. This is a long post, so I provide some links to the sections below:

Continue reading