Georg Hager's Blog

#include <likwid.h>

int main(...) {
    // always required once
    LIKWID_MARKER_INIT;
    // ...
    LIKWID_MARKER_START("loop");
    for(int i=0; i<n; ++i) {
        do_some_work();
    }
    LIKWID_MARKER_STOP("loop");
    // ...
    LIKWID_MARKER_CLOSE;
    return 0;
}

In a previous post I showed how to construct and validate a Roofline performance model for the Himeno benchmark. The relevant findings were:

The Himeno benchmark is a rather standard stencil code that is amenable to the well-known layer condition analysis. For in-memory data sets it achieves a performance that is well described by the Roofline model.
The performance potential of spatial blocking is limited to about 10% in the saturated case (on a Haswell-EP socket), because the data transfers are dominated by coefficient arrays with no temporal reuse.
The large number of concurrent data streams through the cache hierarchy and into memory does not hurt the performance, at least not too much. We chose a version of the code that was easy to vectorize but had a lot of parallel data streams (at least 15, probably more if layer conditions are broken).
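As a reminder of what the Roofline model predicts: the attainable performance is the minimum of the peak in-core performance and the product of computational intensity and memory bandwidth, P = min(P_peak, I × b_S). The following is only a minimal sketch of that formula in C. The function name and its parameters are my own illustration and do not come from the benchmark code, and the numbers in the usage note are made up for the arithmetic; they are not measurements from the Haswell-EP socket.

```c
#include <assert.h>

/* Roofline bound in Gflop/s: P = min(P_peak, I * b_S).
 * Hypothetical helper for illustration only.
 *   peak_gflops      peak in-core performance [Gflop/s]
 *   intensity        computational intensity  [flop/byte]
 *   mem_bw_gbytes    saturated memory bandwidth [Gbyte/s]
 */
double roofline_gflops(double peak_gflops, double intensity,
                       double mem_bw_gbytes)
{
    /* All three inputs must be non-negative to be meaningful. */
    assert(peak_gflops >= 0.0 && intensity >= 0.0 && mem_bw_gbytes >= 0.0);

    /* Performance limit imposed by the memory interface. */
    double mem_bound = intensity * mem_bw_gbytes;

    /* The lower of the two ceilings is the prediction. */
    return (mem_bound < peak_gflops) ? mem_bound : peak_gflops;
}
```

With an assumed intensity of 0.125 flop/byte, an assumed bandwidth of 64 Gbyte/s, and an assumed peak of 500 Gflop/s, the function returns 8 Gflop/s, i.e., the loop is memory bound. A stencil code with in-memory data sets typically sits on this bandwidth-limited part of the roof, which is why reducing the data traffic is the only lever in the saturated case.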

Some further questions pop up if you want more insight: Is SIMD vectorization relevant at all? Does the data layout matter? What is the single-core performance in relation to the saturated performance, and why? All these questions can be answered by a detailed ECM model, and this is what we are going to do here. This is a long post, so I provide some links to the sections below:

Hardware and code
In-core model
Layer conditions and data transfers
ECM model
Performance check
Saturation behavior
Miscellaneous


Random thoughts on High Performance Computing

Content

LIKWID marker overhead and “Meltdown” patches

Himeno stencil benchmark: ECM model, SIMD, data layout