On October 14, 2021 I gave an invited online talk at Stony Brook University‘s Institute for Advanced Computational Science (IACS). I talked about white/gray-box approaches to performance modeling and how they can fail in interesting ways on highly parallel systems because of desynchronization effects. The slides and a video recording are now available:
Abstract: High-performance parallel computers are complex systems. There seems to be a general consensus among developers that the performance of application programs is to be taken for granted, and that it cannot really be understood in terms of simple rules and models. This talk is about using analytic performance models to make sense of performance numbers. By means of examples from computational science, I will motivate that it makes a lot of sense to try and set up performance models even if their accuracy is sometimes limited. In fact, it is when a model yields false predictions that we learn more about the problem because our assumptions are challenged. I will start with a general categorization of performance models and then turn to ECM and Roofline models for loop-based code on multicore CPUs. Going beyond the compute node level and adding communication models to the mix, I will show how stacking models on top of each other may not work as intended but instead open new insights and a fresh view on how massively parallel code is executed.
IMB multi-mode Ping-Pong demystified?
Everybody knows the ubiquitous PingPong benchmark, first published in the Pallas MPI benchmarks and today contained in Intels IMB suite. Using one process on each node and plotting bandwidth versus message size, you see the well-known behaviour – saturating bandwidth for large messages and latency domination for small ones (blue curve – measurements were done on our Woody cluster between nodes that are connected to a common IB switch nodeboard, i.e. with DDR speed):
If you run this code in multi-mode, i.e. two or more processes on one node sending and two or more on the other node receiving, the bandwidth pattern changes considerably (black squares). There is a significant “overshoot” for medium-sized messages and a sudden drop at about 1 MByte. Eventually the saturation bandwidth is the same as in the simple case.
It turns out that one can explain this, at least for the 2-2 case, by assuming that MPI (or the IB driver or whoever) splits messages that are larger that a certain limit (e.g., 1 MByte) into chunks and transmits/receives those chunks alternating between the two connections. Using this model and fitting the parameters we can predict the multi-mode Ping-Pong bandwidth quite well (orange curve). In the 3-3 case, however, things get more complicated (green curve) and there is an additional plateau. I’m not sure whether one can really extend the model to encompass this effect as well.
If you are interested, the gory details have been written up: mm-pingpong.pdf