Georg Hager's Blog

Random thoughts on High Performance Computing

Content

SC19 incoming!

This year at SC19 in Denver, CO, members of our group will be part of numerous contributions:

  • Our master student Jan Laukemann will present the paper “Automatic Throughput and Critical Path Analysis of x86 and ARM Assembly Kernels” at the PMBS 2019 workshop. It describes recent improvements to our “Open Source Architecture Code Analyzer” (OSACA), notably support for ARM architectures and critical path detection. This paper has received the Best Short Paper Award at the workshop.
  • Our accepted research poster “INSPECT Intranode Stencil Performance Evaluation Collection” by Julian Hammer et al. will showcase INSPECT, our open and extensible collection of performance data and models for stencil codes.
  • Our accepted research poster “LIKWID 5: Lightweight Performance Tools” by Thomas Gruber et al. will showcase the latest developments in our LIKWID performance tool suite.
  • The popular full-day tutorial “Node-Level Performance Engineering” will be presented again by Gerhard Wellein and myself.

And finally, we are again part of the activities at the LRZ booth (#2063), where “Bits, Bytes, Brezn & Beer – Supercomputing in Bavaria” is on the agenda. Drop by during the opening gala to chat and get your share of Bavarian (and Franconian) ambience.

ISC 2020 call for research papers: publish with Gold Open Access – deadline extended

The ISC High Performance 2020 paper deadline is on October 21 November 4. This time, ISC has decided to provide free-of-charge Gold Open Access for all accepted papers. Presenters will receive a free conference day pass for the day of their talk. In addition, there are prestigious awards to win: The Hans Meuer Award and the GCS Award, each including a cash prize of 5000 €.

ISC has been improving the quality of their scientific program significantly in recent years. In 2018 and 2019, the paper acceptance rate was about 25%. The program committees comprise renowned researchers from all sectors of HPC.

PPAM 2019 Workshops Best Paper Award for Dominik Ernst

Best Paper Award PPAM 2019

Dominik and his PPAM Best Poster Award

Our paper “Performance Engineering for a Tall & Skinny Matrix Multiplication Kernel on GPUs” by Dominik Ernst, Georg Hager, Jonas Thies, and Gerhard Wellein just received the best workshop paper award at the PPAM 2019, the 13th International Conference on Parallel Processing and Applied Mathematics, in Bialystok, Poland. In this paper, Dominik investigated different methods to optimize an important but often neglected computational kernel: the multiplication of two extremely non-square matrices, with millions of rows but very few (tens of) columns. Vendor libraries still do not perform well in this situation, so you have to roll your own implementation, which is a daunting task because of the huge optimization parameter space, especially on GPGPUs.

The optimizations were guided by the Roofline model (hence “Performance Engineering”), which provides an upper limit for the performance of the kernel. On an Nvidia V100 GPGPU, Dominik’s solution achieves more than 90% of the maximum performance for matrices with up to 40 columns, and more than 50% at up to 64 columns. This is significantly faster than vendor libraries at the time of writing.

The work was funded by the project ESSEX (Equipping Sparse Solvers for Exascale) within the DFG priority programme 1648 (SPPEXA). A preprint of the paper is available at arXiv:1905.03136.

Intel Architecture Code Analyzer (IACA) R.I.P.! All hail OSACA!

Intel has announced recently that their popular Architecture Code Analyzer (IACA) “has reached its End of Life” (sic!). Frankly speaking, it was never an official product anyway, but performance-aware bitfiddlers like my colleagues and me found it extremely useful. It’s strange that Intel decided to dump it right after a complete rewrite with version 3.0. Big mistake. Think “A380”.

Given a piece of object code, the latest version of IACA was able to calculate a prediction about its runtime, assuming no dependencies between instructions and full pipeline throughput. This is quite an optimistic assumption –  earlier versions (here’s another useful thing they dumped)  could also produce a “pessimistic” prediction based on the instruction latencies along the critical path. In reality, the actual runtime was typically in between, and an experienced performance engineer could read a lot out of the IACA output. Furthermore, the IACA predictions were one input to the ECM performance model and Kerncraft, our loop performance modeling tool.

Fortunately, alternatives exist. Besides LLVM-MCA, which may or may not be useful for some, our OSACA tool set out to become a full-fledged replacement for IACA, batteries included. Right now it can handle throughput analysis for Intel and AMD CPUs [1]; work on critical path analysis and support for ARM architectures is ongoing. Some undisclosed insight that was coded into IACA is unavailable to us, so predictions may differ. It’s work in progress, but you can check it out. Feedback is always welcome!

[1]  J. Laukemann, J. Hammer, J. Hofmann, G. Hager, and G. Wellein: Automated Instruction Stream Throughput Prediction for Intel and AMD Microarchitectures. 2018 IEEE/ACM Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), Dallas, TX, USA, 2018, pp. 121-131. DOI: 10.1109/PMBS.2018.8641578. Preprint: arXiv:1809.00912

 

Christie Alappat comes second place in ACM Student Research Competition Grand Finals

Our PhD student Christie Louis Alappat, by winning the ACM Student Research Competition (SRC) at SC18 at the graduate level last year, has advanced to the ACM SRC Grand Finals, where the winners from 26 ACM conferences contend for the Grand Prize. For this last round he had to prepare a five-page paper about his research. This paper, and the whole body of his work, was evaluated again by a panel of judges.

We are now happy to announce that Christie has come second place in the Grand SRC Finale. Together with his advisor, Prof. Gerhard Wellein, he is invited to the awards ceremony which will take place in San Francisco on June 15. This is the very same ceremony at which Yoshua Bengio, Geoffrey Hinton, and Yann LeCun will receive the prestigious ACM Turing Award 2018 for their seminal work on deep learning algorithms. Talk about good company!

Christie’s research revolves around a long-standing problem in computer science: How must a graph be colored to enable parallel processing in the presence of dependencies? His solution, the “Recursive Algebraic Coloring Engine,” can be used to parallelize many sparse algorithms in a hardware-efficient way, taking the specific properties of modern multicore chips into account. It outperforms existing approaches and libraries by a significant margin at such an important operation as symmetric sparse matrix-vector multiplication (SymmSpMV), but its range of applicability is much broader. Christie has prepared a walk-through of his SC18 poster to explain the details:

 

 

LIKWID 4.3.4 released

LIKWID 4.3.4 is a bugfix release.  These are the relevant changes:

  • For systems using Intel Cluster-on-Die (CoD) or Sub-NUMA Clustering (SNC):
    • Fix for detecting PCI devices
    • Workaround for topology detection. The Linux kernel does not detect it properly sometimes.
  • Don’t pin accessDaemon to SMT threads to avoid long access latencies due to busy hardware thread
  • Fix for calculations in likwid-bench if streams are used for input and output
  • Fix for LIKWID_MARKER_REGISTER with perf_event backend
  • Support for Intel Atom (Tremont) (nothing new, same as Intel Atom Goldmont Plus)
  • Minor updates for build system
  • Minor updates for documentation

Download the new version from our FTP server or directly from github:

ftp://ftp.rrze.uni-erlangen.de/mirrors/likwid/likwid-4.3.4.tar.gz
https://github.com/RRZE-HPC/likwid/releases/tag/4.3.4

The McCalpin STREAM benchmark: How to do it right and interpret the results

The STREAM benchmark [1] is more than 20 years old, and it has become the de facto standard for measuring the maximum achievable memory bandwidth of processors. However, some of the published numbers are misinterpreted. This is partly because STREAM is, contrary to expectations, not a “black box” benchmark that you can trust to yield the right answer. In order to understand the STREAM results, it is necessary to grasp the following concepts:

  1. Machine topology and thread affinity
  2. Write-allocate transfers

Most of the mistakes people make when taking STREAM measurements are based on a mis- or non-understanding of these two concepts. Continue reading

Christie Louis Alappat wins the SC18 ACM Student Research Competition Award

Picture (c) SC Conference. Used with permission.

Our PhD student Christie Louis Alappat just took first place in the ACM Student Research Competition at SC18. His work revolved around the “Recursive Algebraic Coloring Engine,” which is a new method and library for hardware-efficient graph coloring. This means that he will advance to the grand finale next year. Congratulations!

Christie’s project is part of the activities in ESSEX-II, a project funded by the German Science Foundation (DFG) within the priority programme SPPEXA.

As part of the competition, Christie has prepared a video with details about his work:

 

SC18 incoming!

This year at SC18 in Dallas, TX, members of our group will be part of numerous contributions:

And finally, we are part of the activities at the LRZ booth (#1841), where “Bits, Bytes, Brezel & Bier – Supercomputing in Bavaria” is on the agenda. Drop by during the opening gala to chat and get your share of Bavarian (and Franconian) ambience.

A LIKWID bouquet

The note says: “For the name ‘LIKWID’ – because it keeps conjuring a smile on my lips.”

Just got this from an anonymous fan. The note says: “For the name ‘LIKWID’ – because it keeps conjuring a smile on my lips.” You’re most welcome, whoever you are.

For those who don’t know, LIKWID is our multicore performance tool suite. I’m not a developer (Thomas Gruber [né Röhl] does the hard work there), but I happen to be the one who came up with the acronym: “Like I Knew What I’m Doing.”