If you ever wondered what this mysterious LIKWID tool suite is: Jan Eitzinger has made a posh little intro video that gives you the gist of it in minutes. View it on YouTube:
The Euro-Par conference is a well-established venue for presenting and discussing research on parallel computing in Europe. In association with the Euro-Par 2020 conference, which takes place from August 24-28 in Warsaw, Poland, the workshop “PMACS – Performance Monitoring and Analysis of Cluster Systems” is organized by Thomas Gruber from our group. The workshop offers presentations and exchange of experience about various solutions on the collection, transmission, storage, evaluation, and visualization of runtime data about the hard- and software of whole cluster systems.
Everyone’s got their pet peeves: For Poempelfox it’s Schiphol Airport, for Andreas Stiller it’s the infamous A20 gate. Compared to those glorious fails, my favorite tech blunder is a rather measly one, and it may not be relevant to many users in practice. However, it’s not so much the importance of it but the total mystery of how it came to happen. So here’s the thing.
Loads, stores, and AGUs
The Intel Sandy Bridge and Ivy Bridge architectures have six execution ports, two of which (#2 & #3) feed one LOAD pipeline each and one (#4) feeds a STORE pipe. These units are capable of transferring 16 bytes of data per cycle each. With AVX code, the core is thus able to sustain one full-width 32-byte LOAD (in two adjacent 16-byte chunks) and one half of a 32-byte STORE per cycle. But the LOAD and STORE ports are not the only thing that’s needed to execute these instructions – the core must also generate the corresponding memory addresses, which can be rather complicated. In a LOAD instruction like:
vmovupd ymm0, [rdx+rsi*8+32]
the memory address calculation involves two integer add operations and a shift. It is the task of the address generation units (AGUs) to do this. Each of ports 2 and 3 serves an AGU in addition to the LOAD unit, so the core can generate two addresses per cycle – more than enough to sustain the maximum LOAD and STORE throughput with AVX.
The peculiar configuration of LOAD and STORE units and AGUs causes some strange effects. For instance, if we execute the Schönauer vector triad: Continue reading
This course covers performance engineering approaches on the compute node level. Even application developers who are fluent in OpenMP and MPI often lack a good grasp of how much performance could at best be achieved by their code. This is because parallelism takes us only half the way to good performance. Even worse, slow serial code tends to scale very well, hiding the fact that resources are wasted. This course conveys the required knowledge to develop a thorough understanding of the interactions between software and hardware. This process must start at the core, socket, and node level, where the code gets executed that does the actual computational work. We introduce the basic architectural features and bottlenecks of modern processors and compute nodes. Pipelining, SIMD, superscalarity, caches, memory interfaces, ccNUMA, etc., are covered. A cornerstone of node-level performance analysis is the Roofline model, which is introduced in due detail and applied to various examples from computational science. We also show how simple software tools can be used to acquire knowledge about the system, run code in a reproducible way, and validate hypotheses about resource consumption. Finally, once the architectural requirements of a code are understood and correlated with performance measurements, the potential benefit of code changes can often be predicted, replacing hope-for-the-best optimizations by a scientific process.
This is a two-day course with demos. It is provided free of charge to members of European research institutions and universities.
|Date:||Tuesday, Dec 3, 2019 09:00 – 17:00
Wednesday, Dec 4, 2019 09:00 – 17:00
|Location:||LRZ Building, University campus Garching, near Munich, Hörsaal H.E.009 (Lecture hall)|
|Course webpage with detailed agenda:||https://events.prace-ri.eu/event/901/|
Our paper “Automatic Throughput and Critical Path Analysis of x86 and ARM Assembly Kernels” has just won the “Best Late-Breaking Paper Award” at the 10th Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS19), a renowned workshop co-located with the SC19 conference. The lead author, our master student Jan Laukemann, presented his work on a new version of the OSACA tool (Open-Source Architecture Code Analyzer), which now supports throughput, critical path, and loop-carried dependency analysis for assembly loop kernels on x86 and ARM architectures. It is thus a critical component for ECM and Roofline modeling and can be used as a more capable substitute for Intel’s discontinued IACA tool.
Just in time for SC19, version 5 of our popular LIKWID tool suite has been released. There are tons of new developments in there; these are the most important ones:
- Support for ARM architectures, especially for Marvell Thunder X2
- Support for IBM POWER architectures (POWER8 and POWER9)
- Support for AMD Zen2 and for data fabric counters of the AMD Zen microarchitecture
- Support for Nvidia GPU monitoring (with NvMarkerAPI)
- New clock frequency backend (with less overhead)
- Generation of benchmarks for likwid-bench on-the-fly from ptt files
- Integration of GOTCHA for hooking into client applications at runtime
- Thread-local initialization of streams for likwid-bench
- Enhanced support for SLURM with likwid-mpirun
- New MPI and Hybrid pinning features for likwid-mpirun
- JSON output filter file (use -o output.json)
- Updated quick reference sheet with all the new options
The full list is available at the github release page. And if you need something really cool to cover that empty spot on your laptop lid, we’ll have LIKWID stickers available during our SC19 tutorial “Node-Level Performance Engineering” and at the Bavarian Supercomputing booth (#2063).
This year at SC19 in Denver, CO, members of our group will be part of numerous contributions:
- Our master student Jan Laukemann will present the paper “Automatic Throughput and Critical Path Analysis of x86 and ARM Assembly Kernels” at the PMBS 2019 workshop. It describes recent improvements to our “Open Source Architecture Code Analyzer” (OSACA), notably support for ARM architectures and critical path detection. This paper has received the Best Short Paper Award at the workshop.
- Our accepted research poster “INSPECT Intranode Stencil Performance Evaluation Collection” by Julian Hammer et al. will showcase INSPECT, our open and extensible collection of performance data and models for stencil codes.
- Our accepted research poster “LIKWID 5: Lightweight Performance Tools” by Thomas Gruber et al. will showcase the latest developments in our LIKWID performance tool suite.
- The popular full-day tutorial “Node-Level Performance Engineering” will be presented again by Gerhard Wellein and myself.
And finally, we are again part of the activities at the LRZ booth (#2063), where “Bits, Bytes, Brezn & Beer – Supercomputing in Bavaria” is on the agenda. Drop by during the opening gala to chat and get your share of Bavarian (and Franconian) ambience.
The ISC High Performance 2020 paper deadline is on October 21 November 4. This time, ISC has decided to provide free-of-charge Gold Open Access for all accepted papers. Presenters will receive a free conference day pass for the day of their talk. In addition, there are prestigious awards to win: The Hans Meuer Award and the GCS Award, each including a cash prize of 5000 €.
ISC has been improving the quality of their scientific program significantly in recent years. In 2018 and 2019, the paper acceptance rate was about 25%. The program committees comprise renowned researchers from all sectors of HPC.
Our paper “Performance Engineering for a Tall & Skinny Matrix Multiplication Kernel on GPUs” by Dominik Ernst, Georg Hager, Jonas Thies, and Gerhard Wellein just received the best workshop paper award at the PPAM 2019, the 13th International Conference on Parallel Processing and Applied Mathematics, in Bialystok, Poland. In this paper, Dominik investigated different methods to optimize an important but often neglected computational kernel: the multiplication of two extremely non-square matrices, with millions of rows but very few (tens of) columns. Vendor libraries still do not perform well in this situation, so you have to roll your own implementation, which is a daunting task because of the huge optimization parameter space, especially on GPGPUs.
The optimizations were guided by the Roofline model (hence “Performance Engineering”), which provides an upper limit for the performance of the kernel. On an Nvidia V100 GPGPU, Dominik’s solution achieves more than 90% of the maximum performance for matrices with up to 40 columns, and more than 50% at up to 64 columns. This is significantly faster than vendor libraries at the time of writing.
Intel has announced recently that their popular Architecture Code Analyzer (IACA) “has reached its End of Life” (sic!). Frankly speaking, it was never an official product anyway, but performance-aware bitfiddlers like my colleagues and me found it extremely useful. It’s strange that Intel decided to dump it right after a complete rewrite with version 3.0. Big mistake. Think “A380”.
Given a piece of object code, the latest version of IACA was able to calculate a prediction about its runtime, assuming no dependencies between instructions and full pipeline throughput. This is quite an optimistic assumption – earlier versions (here’s another useful thing they dumped) could also produce a “pessimistic” prediction based on the instruction latencies along the critical path. In reality, the actual runtime was typically in between, and an experienced performance engineer could read a lot out of the IACA output. Furthermore, the IACA predictions were one input to the ECM performance model and Kerncraft, our loop performance modeling tool.
Fortunately, alternatives exist. Besides LLVM-MCA, which may or may not be useful for some, our OSACA tool set out to become a full-fledged replacement for IACA, batteries included. Right now it can handle throughput analysis for Intel and AMD CPUs ; work on critical path analysis and support for ARM architectures is ongoing. Some undisclosed insight that was coded into IACA is unavailable to us, so predictions may differ. It’s work in progress, but you can check it out. Feedback is always welcome!
 J. Laukemann, J. Hammer, J. Hofmann, G. Hager, and G. Wellein: Automated Instruction Stream Throughput Prediction for Intel and AMD Microarchitectures. 2018 IEEE/ACM Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), Dallas, TX, USA, 2018, pp. 121-131. DOI: 10.1109/PMBS.2018.8641578. Preprint: arXiv:1809.00912