MachineState is a Python 3 module and CLI application for documenting and comparing settings known to affect application performance: e.g., CPU/Uncore frequencies, hardware prefetchers, and memory capacity, but also OS and software settings like NUMA balancing, writeback workqueues, scheduling, and the versions of common tools and libraries (e.g., compilers and MPI). All this information can be essential for reproducing benchmark results. The MachineState tool gathers all (known) settings and presents them as a JSON document. A state file written earlier can be compared to the current machine state to uncover deviations from the original test system.
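The underlying idea is easy to illustrate in a few lines of Python. The following is a minimal sketch of the concept, not MachineState's actual API; the settings it collects (kernel release, architecture, NUMA balancing) are just examples, and the real tool covers far more:

import json
import pathlib
import platform

def gather_state():
    # Collect a small, illustrative subset of performance-relevant settings.
    state = {"kernel": platform.release(), "arch": platform.machine()}
    numa = pathlib.Path("/proc/sys/kernel/numa_balancing")
    if numa.exists():
        state["numa_balancing"] = numa.read_text().strip()
    return state

def diff_state(saved_json, current):
    # Report every key whose value deviates between a saved state file
    # and the current machine state.
    saved = json.loads(saved_json)
    return {key: (saved.get(key), current.get(key))
            for key in saved.keys() | current.keys()
            if saved.get(key) != current.get(key)}

Writing json.dumps(gather_state()) to a file at benchmark time and running diff_state() on it later is essentially the workflow the tool automates.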
The Euro-Par conference is a well-established venue for presenting and discussing research on parallel computing in Europe. In association with the Euro-Par 2020 conference, which takes place from August 24–28 in Warsaw, Poland, the workshop “PMACS – Performance Monitoring and Analysis of Cluster Systems” is organized by Thomas Gruber from our group. The workshop offers presentations and an exchange of experience about various solutions for the collection, transmission, storage, evaluation, and visualization of runtime data about the hardware and software of whole cluster systems.
Everyone’s got their pet peeves: For Poempelfox it’s Schiphol Airport, for Andreas Stiller it’s the infamous A20 gate. Compared to those glorious fails, my favorite tech blunder is a rather measly one, and it may not be relevant to many users in practice. However, it’s not so much the importance of it but the total mystery of how it came to happen. So here’s the thing.
Loads, stores, and AGUs
Figure: Sandy Bridge and Ivy Bridge LOAD and STORE units, AGUs, and their respective ports.
The Intel Sandy Bridge and Ivy Bridge architectures have six execution ports, two of which (#2 & #3) feed one LOAD pipeline each and one (#4) feeds a STORE pipe. These units are capable of transferring 16 bytes of data per cycle each. With AVX code, the core is thus able to sustain one full-width 32-byte LOAD (in two adjacent 16-byte chunks) and one half of a 32-byte STORE per cycle. But the LOAD and STORE ports are not the only thing that’s needed to execute these instructions – the core must also generate the corresponding memory addresses, which can be rather complicated. In a LOAD instruction like:
vmovupd ymm0, [rdx+rsi*8+32]   ; unaligned AVX load of 32 bytes from address rdx + rsi*8 + 32
the memory address calculation involves two integer add operations and a shift. It is the task of the address generation units (AGUs) to do this. Each of ports 2 and 3 serves an AGU in addition to the LOAD unit, so the core can generate two addresses per cycle – more than enough to sustain the maximum LOAD and STORE throughput with AVX.
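In plain arithmetic, the effective address of the load above is base plus scaled index plus displacement. A small Python sketch with made-up register values shows the three operations the AGU has to perform:

rdx = 0x1000                    # hypothetical base address
rsi = 5                         # hypothetical loop index
addr = rdx + (rsi << 3) + 32    # one shift (rsi*8) and two integer adds
print(hex(addr))                # 0x1048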
The peculiar configuration of LOAD and STORE units and AGUs causes some strange effects. For instance, consider what happens if we execute the Schönauer vector triad, sketched below.
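For reference, the Schönauer vector triad computes A(i) = B(i) + C(i) * D(i) over long arrays. The benchmark is usually written in Fortran or C, but a NumPy sketch shows the data flow just as well:

import numpy as np

n = 1_000_000                   # hypothetical array length
b, c, d = (np.random.rand(n) for _ in range(3))
a = b + c * d                   # per element: three loads, one store, one add, one multiply

Counting the loads and stores per element, and hence the addresses that must be generated, is the first step toward understanding those effects.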
We proudly present a retake of our PRACE course on “Node-Level Performance Engineering” on December 3-4, 2019 at LRZ Garching.
This course covers performance engineering approaches on the compute node level. Even application developers who are fluent in OpenMP and MPI often lack a good grasp of how much performance could at best be achieved by their code. This is because parallelism takes us only half the way to good performance. Even worse, slow serial code tends to scale very well, hiding the fact that resources are wasted. This course conveys the required knowledge to develop a thorough understanding of the interactions between software and hardware. This process must start at the core, socket, and node level, where the code that does the actual computational work is executed. We introduce the basic architectural features and bottlenecks of modern processors and compute nodes: pipelining, SIMD, superscalarity, caches, memory interfaces, ccNUMA, etc. A cornerstone of node-level performance analysis is the Roofline model, which is introduced in due detail and applied to various examples from computational science. We also show how simple software tools can be used to acquire knowledge about the system, run code in a reproducible way, and validate hypotheses about resource consumption. Finally, once the architectural requirements of a code are understood and correlated with performance measurements, the potential benefit of code changes can often be predicted, replacing hope-for-the-best optimizations with a scientific process.
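For reference, the basic Roofline model mentioned above bounds the attainable performance P of a loop with computational intensity I (flops per byte of memory traffic) on a machine with peak performance P_peak and memory bandwidth b_S:

$$ P = \min\left(P_\mathrm{peak},\; I \cdot b_S\right) $$

Whichever of the two terms is smaller identifies the bottleneck: in-core execution or data transfer.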
This is a two-day course with demos. It is provided free of charge to members of European research institutions and universities.
Tuesday, Dec 3, 2019 09:00 – 17:00
Wednesday, Dec 4, 2019 09:00 – 17:00
LRZ Building, University campus Garching, near Munich, Hörsaal H.E.009 (Lecture hall)
This year at SC19 in Denver, CO, members of our group are involved in numerous contributions:
Our master's student Jan Laukemann will present the paper “Automatic Throughput and Critical Path Analysis of x86 and ARM Assembly Kernels” at the PMBS 2019 workshop. It describes recent improvements to our “Open Source Architecture Code Analyzer” (OSACA), notably support for ARM architectures and critical path detection. The paper received the Best Short Paper Award at the workshop.
The ISC High Performance 2020 paper deadline has been extended from October 21 to November 4. This time, ISC has decided to provide free-of-charge Gold Open Access for all accepted papers. Presenters will receive a free conference day pass for the day of their talk. In addition, there are prestigious awards to win: the Hans Meuer Award and the GCS Award, each including a cash prize of 5000 €.
ISC has significantly improved the quality of its scientific program in recent years. In 2018 and 2019, the paper acceptance rate was about 25%. The program committees comprise renowned researchers from all sectors of HPC.
Our paper “Performance Engineering for a Tall & Skinny Matrix Multiplication Kernel on GPUs” by Dominik Ernst, Georg Hager, Jonas Thies, and Gerhard Wellein just received the Best Workshop Paper Award at PPAM 2019, the 13th International Conference on Parallel Processing and Applied Mathematics, in Bialystok, Poland. In this paper, Dominik investigated different methods to optimize an important but often neglected computational kernel: the multiplication of two extremely non-square matrices, with millions of rows but very few (tens of) columns. Vendor libraries still do not perform well in this situation, so you have to roll your own implementation, which is a daunting task because of the huge optimization parameter space, especially on GPGPUs.
The optimizations were guided by the Roofline model (hence “Performance Engineering”), which provides an upper limit for the performance of the kernel. On an Nvidia V100 GPGPU, Dominik’s solution achieves more than 90% of the maximum performance for matrices with up to 40 columns, and more than 50% at up to 64 columns. This is significantly faster than vendor libraries at the time of writing.
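To make the problem shape concrete, here is a NumPy sketch with hypothetical dimensions, interpreting the product of two tall & skinny matrices as the transposed product A^T * B (the paper's actual kernels are hand-tuned GPU code, not library calls):

import numpy as np

n, m, k = 1_000_000, 32, 32     # hypothetical: many rows, tens of columns
A = np.random.rand(n, m)
B = np.random.rand(n, k)
C = A.T @ B                     # small (m x k) result from two tall inputs
# Rough Roofline estimate: 2*n*m*k flops vs. about 8*n*(m+k) bytes of traffic,
# i.e., intensity I ~ m*k / (4*(m+k)) flop/byte -- low for small m and k,
# so the kernel is memory bound and its Roofline limit is I * b_S.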
The work was funded by the project ESSEX (Equipping Sparse Solvers for Exascale) within the DFG priority programme 1648 (SPPEXA). A preprint of the paper is available at arXiv:1905.03136.