Thomas Gruber (a.k.a. TomTheBear), the main developer of the LIKWID tool suite, has published a short tutorial about constructing empirical Roofline models with likwid-perfctr. An empirical Roofline model uses measurements of computational intensity and performance to compare the resource utilization of running code with the limits set by the hardware.
This is something that often comes up as a question in our node-level or tools courses. Keep in mind that the computational intensity can also be predicted analytically if you know enough about the loop(s) in your application and the properties of the hardware. Comparing the analytical prediction with the measurement and the machine limits is a powerful way to analyze the performance of code. You can learn more about this, and more, in one of our Node-Level Performance Engineering tutorials.
The Euro-Par conference is a well-established venue for presenting and discussing research on parallel computing in Europe. In association with the Euro-Par 2020 conference, which takes place from August 24-28 in Warsaw, Poland, the workshop “PMACS – Performance Monitoring and Analysis of Cluster Systems” is organized by Thomas Gruber from our group. The workshop offers presentations and exchange of experience about various solutions on the collection, transmission, storage, evaluation, and visualization of runtime data about the hard- and software of whole cluster systems.
We proudly present a retake of our PRACE course on “Node-Level Performance Engineering” on December 3-4, 2019 at LRZ Garching.
This course covers performance engineering approaches on the compute node level. Even application developers who are fluent in OpenMP and MPI often lack a good grasp of how much performance could at best be achieved by their code. This is because parallelism takes us only half the way to good performance. Even worse, slow serial code tends to scale very well, hiding the fact that resources are wasted. This course conveys the required knowledge to develop a thorough understanding of the interactions between software and hardware. This process must start at the core, socket, and node level, where the code gets executed that does the actual computational work. We introduce the basic architectural features and bottlenecks of modern processors and compute nodes. Pipelining, SIMD, superscalarity, caches, memory interfaces, ccNUMA, etc., are covered. A cornerstone of node-level performance analysis is the Roofline model, which is introduced in due detail and applied to various examples from computational science. We also show how simple software tools can be used to acquire knowledge about the system, run code in a reproducible way, and validate hypotheses about resource consumption. Finally, once the architectural requirements of a code are understood and correlated with performance measurements, the potential benefit of code changes can often be predicted, replacing hope-for-the-best optimizations by a scientific process.
This is a two-day course with demos. It is provided free of charge to members of European research institutions and universities.
Tuesday, Dec 3, 2019 09:00 – 17:00
Wednesday, Dec 4, 2019 09:00 – 17:00
LRZ Building, University campus Garching, near Munich, Hörsaal H.E.009 (Lecture hall)
This year at SC19 in Denver, CO, members of our group will be part of numerous contributions:
Our master student Jan Laukemann will present the paper “Automatic Throughput and Critical Path Analysis of x86 and ARM Assembly Kernels” at the PMBS 2019 workshop. It describes recent improvements to our “Open Source Architecture Code Analyzer” (OSACA), notably support for ARM architectures and critical path detection. This paper has received the Best Short Paper Award at the workshop.
The note says: “For the name ‘LIKWID’ – because it keeps conjuring a smile on my lips.”
Just got this from an anonymous fan. The note says: “For the name ‘LIKWID’ – because it keeps conjuring a smile on my lips.” You’re most welcome, whoever you are.
For those who don’t know, LIKWID is our multicore performance tool suite. I’m not a developer (Thomas Gruber [né Röhl] does the hard work there), but I happen to be the one who came up with the acronym: “Like I Knew What I’m Doing.”
An arbitrary number of regions is allowed, and you can use the LIKWID_MARKER_START and LIKWID_MARKER_STOP macros in parallel regions to get per-core readings. The events to be counted are configured on the likwid-perfctr command line. As with anything that is not part of the actual work in a code, one may ask about the cost of the marker API calls. Do they impact the runtime of the code? Does the number of cores play a role? Continue reading →