# Himeno stencil benchmark: Roofline performance modeling and validation

[Update 17/11/29: Pointed out that the C version was modified from the original code – thanks Julian]

The Himeno benchmark [1] is a very popular code in the performance analysis and optimization community. Countless papers have been written that use it for performance assessment, prediction, optimization, comparisons, etc. Surprisingly, there is hardly a solid analysis of its data transfer properties. It’s a stencil code after all, and most of those can be easily analyzed.

#### The code

The OpenMP-parallel C version looks as shown below. I have made a slight change to the original code: The order of indices on the arrays a, b, and c hinders efficient SIMD vectorization, so I moved the short index to the first position. This also makes it equivalent to the Fortran version.

// all data structures hold single-precision values
for(n=0;n<nn;++n){
gosa = 0.0;
#pragma omp parallel for reduction(+:gosa) private(s0,ss,j,k)
for(i=1 ; i<imax-1 ; ++i)
for(j=1 ; j<jmax-1 ; ++j)
for(k=1 ; k<kmax-1 ; ++k){
// short index on a, b, c was moved up front
s0 = a[0][i][j][k] * p[i+1][j ][k ]
+ a[1][i][j][k] * p[i ][j+1][k ]
+ a[2][i][j][k] * p[i ][j ][k+1]
+ b[0][i][j][k] * ( p[i+1][j+1][k ] - p[i+1][j-1][k ]
- p[i-1][j+1][k ] + p[i-1][j-1][k ] )
+ b[1][i][j][k] * ( p[i ][j+1][k+1] - p[i ][j-1][k+1]
- p[i ][j+1][k-1] + p[i ][j-1][k-1] )
+ b[2][i][j][k] * ( p[i+1][j ][k+1] - p[i-1][j ][k+1]
- p[i+1][j ][k-1] + p[i-1][j ][k-1] )
+ c[0][i][j][k] * p[i-1][j ][k ]
+ c[1][i][j][k] * p[i ][j-1][k ]
+ c[2][i][j][k] * p[i ][j ][k-1]
+ wrk1[i][j][k];
ss = ( s0 * a[3][i][j][k] - p[i][j][k] ) * bnd[i][j][k];
gosa = gosa + ss*ss;
wrk2[i][j][k] = p[i][j][k] + omega * ss;
}
// copy-back loop ignored for analysis
#pragma omp parallel for private(j,k)
for(i=1 ; i<imax-1 ; ++i)
for(j=1 ; j<jmax-1 ; ++j)
for(k=1 ; k<kmax-1 ; ++k)
p[i][j][k] = wrk2[i][j][k];
} /* end n loop */

Figure 1: Structure of the 19-point stencil showing the data access pattern to the p[][][] array in the Himeno benchmark. The k index is the inner (fast) loop index here.

There is an outer iteration loop over n. The first (parallel) loop nest over i, j, and k updates the wrk2 array from the arrays a, b, c, wrk1, bnd, and p, of which only p has a stencil-like access pattern (see Fig. 1). All others are accessed in a consecutive, cacheline-friendly way. Since the coefficient arrays a, b, and c carry a fourth index in the first position, the row-major data layout of the C language leads to many concurrent data streams. We will see whether or not this impacts the performance of the code.

A second parallel loop nest copies the result in wrk2 back to the stencil array p. This second loop can be easily optimized away (how?), so we ignore it in the following; all analysis and performance numbers pertain to the first loop only.

#### Amount of work

There are 14 floating-point additions, 7 subtractions, and 13 multiplications in the loop body. Hence, one lattice site update (LUP) amounts to 34 flops.

#### Data transfers and best-case code balance

For this analysis the working set shall be larger than any cache. It is straightforward to calculate a lower limit for the data transfers if we assume perfect spatial and temporal locality for all data accesses within one update sweep: All arrays except wrk2 must be read at least once, and wrk2 must be written. This leads to (13+1) single-precision floating-point numbers being transferred between the core(s) and main memory. The best-case code balance is thus B= 1.65 byte/flop = 56 byte/LUP. If the architecture has a write-back cache, an additional write-allocate transfer must be accounted for if it cannot be avoided (e.g., by nontemporal stores). In this case the best-case code balance is B= 1.76 byte/flop = 60 byte/LUP.

Considering that even the most balanced machines available today are not able to feed such a hunger for data (e.g., the new NEC-SX Aurora TSUBASA vector engine with 0.5 byte/flop), we know that this code will be memory bound. If the memory bandwidth can be saturated, the upper performance limit is the memory bandwidth divided by the code balance.

#### Layer conditions

If you know your stencils, you also know that this is not the whole story. The best-case code balance is calculated under the assumption that only one of all the accesses to the stencil array p, in this case one out of 19, actually goes to main memory. The rest can be reused from some level of cache. If this is true, and how much data must be supplied by the different memory hierarchy levels, can be determined by layer conditions (LCs). The outer (“3D”) layer condition is satisfied for the Himeno stencil if three j-k layers fit into the cache, i.e., $3\times 4\,\mbox{bytes}\times \mathtt{jmax}\times\mathtt{kmax} < C_\mathrm{eff}~,$ where $$C_\mathrm{eff}$$ is an effective cache size; as a rule of thumb we can use half the available cache size per thread here (remember that the LC must be satisfied for each thread separately if static OpenMP scheduling is used). The middle (“2D”) layer condition is satisfied if nine inner rows of size kmax fit into the cache, i.e., three per j-k layer: $3\times 3\times 4\,\mbox{bytes} \times\mathtt{kmax}<C_\mathrm{eff}~.$Finally, the inner (“1D”) layer condition requires that the cache can hold enough data to avoid cache misses on all but the “first” (largest k) accesses in the loop body. Even the L1 cache can do this for a radius-1 stencil like Himeno, so we don’t have to consider it here.

On processors with three cache  levels, each of the LCs can be broken in any cache level, so there are actually nine layer conditions. Luckily, in a lowest-order analysis we are only interested in the memory traffic, so all we need to know is whether the most stringent LC is satisfied at the largest cache. In the Himeno case this is the 3D LC at L3. If it is broken, two additional data streams go out to memory, so the best-case code balance (with NT stores) rises to Bc= 64 byte/LUP = 1.88 byte/flop. If the memory bandwidth stays the same, the performance will thus go down by 12.5%. This is also the potential performance loss if spatial blocking is not done to establish the outer LC at L3. If NT stores are not used, the loss is even a little smaller. Here we see why spatial blocking for Himeno does not really pay off: The coefficient arrays dominate the data traffic, and the only relevant LC for the stencil array at the L3 cache at the problem default memory-bound sizes is the 3D LC.

The following table shows the predefined problem sizes in the Himeno benchmark, their working set sizes, and how much cache is needed per thread to satisfy the 3D LC:

 Problem size imax x jmax x kmax Working set [MiB] required cache per thread for 3D LC [MiB] s 129 x 65 x 65 29.1 0.048 m 257 x 129 x 129 228 0.19 l 513 x 257 x 257 1810 0.76 xl 1025 x 513 x 513 14406 3.0

Only the “xl” and “l” problems have the potential for breaking the 3D layer condition in the outermost cache. Hence, we expect roughly the same performance for “s” and “m”, and about 12% less for “l” and “xl”. The “s” problem may even fit into the cache on bigger CPUs, so we ignore it in the following. Using spatial blocking on the “xl” and “l” problems will get us only to the level of “m” and no further.

#### Roofline model and validation

In order to validate the model we have to run the code on a specific architecture. A 14-core Intel Xeon “Haswell”  E5-2695v3 (2.30 GHz base clock frequency) has a shared L3 cache of 35 MiB if Cluster on Die (CoD) is  not activated. 14 threads will easily break the 3D layer condition on the “xl” case, but what about “l”? 0.75 MiB per thread only add up to 10.5 MiB overall, which is smaller than half the L3 cache. This rule-of-thumb calculation may be OK for simple stencil codes where only two arrays are involved, but with the Himeno benchmark we have 16 data streams fighting for the cache space (3 for p and 13 for the other arrays). We should thus only expect to have 3/16 of the cache available for the three layers of p, and so the “l” case is also expected to violate the 3D LC in L3. The standard benchmark that best fits the data access characteristics of Himeno is the Schönauer Vector Triad (a[:]=b[:]+c[:]*d[:]), which on our machine yields a memory bandwidth of b=55.1 GB/s (measured with likwid-bench). The following table shows the expected code balance, upper performance limits, and measured performance  (with 14 cores) for the three sizes on one socket:

 Problem size Bc [byte/flop] ([byte/LUP]) b/Bc  [Gflop/s]  ([MLUP/s]) Measured perf. [Gflop/s] Measured Bc [byte/LUP] m 1.76 (60) 31.3 (921) 31.6 58.3 l 2.0 (68) 27.6 (812) 28.5 66.6 xl 2.0 (68) 27.6 (812) 28.8 67.6

The measured performance is slightly above the predictions because this processor has a large spread in memory bandwidth depending on the number of read streams vs. write streams: The more data is read (relatively speaking), the higher the bandwidth (with a load-only benchmark the memory bandwidth is about 63 GB/s). Since Himeno has 14 or 16 read streams (including the write-allocate) vs. a single write stream, it can be expected that the vector triad shows less saturated bandwidth. The sheer number of streams, which gets multiplied by the number of threads, is obviously not hazardous in this case.

In order to be really sure about our model (it could be just coincidence after all) we have to validate the data traffic expectations by direct measurement. The last column of the table above shows the measured (via likwid-perfctr) memory data traffic per LUP. The values are quite close to the prediction, which shows that our overall picture of what is going on here is correct.

No spatial blocking strategy whatsoever on any of the “m”, “l” and “xl” problem sizes will get us beyond the 32-ish Gflop/s for the “m” case, because this is where the code balance is at its minimum. This is why the Himeno benchmark code is so resistant to the standard auto-tuning strategies, unless they include temporal blocking. Julian’s online Layer Condition Calculator allows you to study layer conditions and optimal block sizes for arbitrary stencils.

Of course, there are some residual questions one may ask:

1. What about single-core performance and scalability? To analyze this in detail we will need the ECM performance model, which should yield an accurate single-core performance prediction, including the significance of layer conditions on inner cache levels. This is something for a later post.
2. Is it at all important whether or not the compiler vectorizes the code? It is for sure vectorizable, and the Intel compiler (at least) does it, but is it necessary? Here, too, the ECM model should give a hint.
3. What about temporal blocking? Without the ECM model analysis, the expected speedup from temporal blocking is hard to estimate, but at least the memory-boundedness should be lifted.
4. Would it make sense to change the data layout? The original C version has a different index ordering on the coefficient arrays, which makes vectorization much harder because those arrays are then non-consecutive in the inner loop dimension. On the other hand, this ordering reduces the number of concurrent data streams. As the above analysis shows, we are pretty much at the memory bandwidth limit, so the number of streams itself does not seem to be a big problem.

# Preview for SC15 tutorial on “Node-Level Performance Engineering” now available

SC15 solicits video previews of accepted tutorials for the first time this year. So watch the commercial for our SC15 full-day tutorial “Node-Level Performance Engineering“!

Kudos to Jörn Rüggeberg from the RRZE multimedia center for putting together this great piece of art.

# Fooling the masses – Stunt 10: Always emphasize the “interesting” part of your work!

(See the prelude for some general information on what this is all about)

Have you ever thought about how to improve the aerodynamic properties of bulldozers? Yes? Then this stunt is for you.

Figure 1: Always focus on the interesting stuff!

Code optimization is quite rewarding at times. You put all your effort into mutilating a chunk of code beyond all recognition, and in the end you get a 50% speedup. Often enough, this work requires a great deal of  expertise at the interface between hardware and software. And you certainly want to present your success to your peers so they understand how advanced you are.

If you do such a thing, don’t bother picking the low-hanging fruits. When 95% of the run time goes into a boring loop nest which multiplies a dense matrix with a vector, there is not much you can do beyond the standard textbook stuff – or use a library altogether. However, if the remaining 5% consist of a convoluted communication scheme, you’ve found your target. The fact that the possible gain is at most 5% shouldn’t stop you, because it’s all a matter of presentation; see Fig. 1: 3D bars and a proper choice of scales on the axes guide the reader’s eye and help you drive home your point. It’s like presenting the optimized drag coefficient of your aerodynamically improved bulldozer on high gloss paper, and the boring technical data in small black print on a dark blue background, on the back page.

Interestingly, if the optimization target is communication, or synchronization, or any other non-computational overhead, this stunt can be made to work even better: When you strongly scale to a pointless number of workers, bringing down such overheads will give you arbitrary speedups, far beyond the factor of two that is theoretically possible if you restrict yourself to a sensible minimum parallel efficiency of 50%. Coming back to the heavy gear analogy, if you accelerate your bulldozer to 200 mph, the drag coefficient does become essential, and you may feel vindicated. It won’t solve any real-world problem, but hey, this is research after all.

Thanks to Gerhard Wellein for discovering this stunt in a paper. Many papers, actually.

# A case for the non-temporal store

Non-temporal stores (also called “streaming stores”) were added to x86 processors with SSE2 (i.e. when the first Pentium IV called “Willamette” appeared in 2000). There was also some support in MMX before, but nobody should be required to use MMX any more. In contrast to standard store instructions which transfer data from registers to memory, NT stores do not require a prior cache line read for ownership (RFO) but write to memory “directly” (well not quite – but to first order it’s true). Beware the usual explanations for RFO that Google finds for you – they almost all describe RFO as something inherently connected to cache coherency in multi-processor systems, but RFO is also required on a single core, whenever there is a cache that is organized in lines.

What are NT stores good for? Again, most references cite them as a means to avoid cache pollution in cases where stored data will not be re-used soon. That’s true to some extent; every word that you don’t store in your cache means more room for other, more frequently used data. But that’s only one half the story. Imagine you have a code whose working set fits into the outer-level cache and which produces data to be stored in memory. Using NT stores for such a code will probably slow it down because, depending on the LD/ST balance, performance is then limited by the memory interface. The true benefit of NT stores can be seen with a truly memory-bound code which is not dominated by loads, like, e.g., the STREAM COPY benchmark, relaxation methods, the lattice-Boltzmann algorithm, or anything that can be formulated in terms of stride-one store streams: By saving the RFO, the pressure on the memory subsystem is reduced. See, e.g., my talk at the KONWIHR results and review workshop in December 2007 at LRZ where the performance boost through NT stores was demonstrated using a 2D Jacobi (5-point stencil) code.

The most important non-temporal store instruction is movntpd, which writes the contents of a full 16-byte SSE register (xmm0…xmm15) to memory. The memory address has to be a multiple of 16, else an exception will be generated. Also, this instruction exists in a “packed” variant only, i.e. there is no movntsd that would only store the lower 8 bytes of the register (AMD has included it, though, into their SSE4a extensions which are implemented with the K10 CPUs – read a nice writeup by Rahul Chaturvedi from AMD for more information). So, if you are stuck with Intel processors or a compiler which does not know about movntsd on K10, loops which use NT stores must be vectorizable and the store stream(s) must be 16-byte aligned.

Or so I thought, for a long time. This assumption was backed, of course, by stupid compilers who blatantly refused to use NT stores if they weren’t 100% sure at compile time whether the store stream was actually aligned. Never heard of loop peeling, eh? Thank goodness great Intel gave us the vector aligned pragma, which you can put in front of a loop in order to tell the compiler that the streams appearing inside the loop were 16-byte aligned! But that always refers to all streams, including reads, and if you somehow forget to pay proper attention, you’re immediately punished by an exception again. To make a long story short, you never really know what’s going on inside the compiler’s twisted little brain, and the diagnostics don’t help either (“LOOP WAS VECTORIZED”, yes, thank you very much).

In hope for some future improvement (unaligned NT store? Or *gasp* maybe even a scalar one?) I looked into Intel’s AVX extensions for SSE, to be implemented in the Sandy Bridge processor, which is to follow Westmere in 2010. That would be a “Streaming SIMD Extension Extension”, right? Whatever, there is neither a scalar nor an unaligned packed NT store in there. But I stumbled across something that had been there since SSE2: The beautiful maskmovdqu instruction. From the manual:

MASKMOVDQU xmm1,xmm2 – Store Selected Bytes of Double Quadword with NT Hint

Stores selected bytes from the source operand (first operand) into an 128-bit memory location. The mask operand (second operand) selects which bytes from the source operand are written to memory. The source and mask operands are XMM registers. The location of the first byte of the memory location is specified by DI/EDI/RDI and DS registers. The memory location does not need to be aligned on a natural boundary. […]

Way cool. Although the fixed address register is somewhat inflexible, the instruction is even better than pure scalar or pure unaligned NT store – it can store any part of an XMM register, down to the byte level, and with no alignment restrictions. And there’s a C/C++ compiler intrinsic, too. That has made it easy to implement our beloved vector triad in explicit SSE. These are two possible variants:

 Scalar (unaligned) Packed unaligned  __m128d xmm1,xmm2,xmm3,xmm4; __m128i mask_low=_mm_set_epi32(0,0,-1,-1); #pragma omp parallel private(j) { if(size > 1000000) { for(j=0; j(xmm4), mask_low, (char*)(a+i)); } } } else { // same with standard store }  __m128d xmm1,xmm2,xmm3,xmm4; __m128i mask_all=_mm_set_epi32(-1,-1,-1,-1); #pragma omp parallel private(j) { if(size > 1000000) { for(j=0; j(xmm4), mask_all, (char*)(a+i)); } } } else { // same with standard store }

The left one is an example for a purely scalar code which employs NT stores. The right one shows how a vectorized code can be endowed with NT stores if 16-byte alignment cannot be enforced. The compiler does neither by itself. Of course, NT stores only make sense if the arrays are large. I have set the lower limit to N=1000000, which is largish compared to the cache sizes but neatly shows the effect of NT vs. standard store:

The figure includes standard scalar versions without NT stores as well. Obviously, the maskmovdqu instruction does the job. There is hardly any performance loss in the serial case, and none at all with four threads. I have omitted any comment on in-cache behavior of the different variants; this is a separate story (and one which becomes more interesting once Nehalem is out!).

If you want to know more about using SSE intrinsics in C/C++ code, the Intel C++ Compiler for Linux Intrinsics Reference is for you.

# Array summation benchmark

A question came up on the OpenMP mailing list today concerning scalability of simple array summation on an Opteron processor. I have done some tests with the following code, using the Intel C++ compiler version 9.1:

#pragma omp parallel for private(j) reduction(+: sum)
#pragma vector always
for (j = 0; j < N; j++){
sum += array2[j];
}

There is a loop around that to ensure that for small sizes we actually see the cache effect. Here is the result:

The number of threads (1T, 2T,…) is indicated. In case of the Opteron system, this was a 2-socket dual-core 2GHz box and the 2-thread data was correspondingly measured on one (1S) or two (2S) sockets, respectively. Proper NUMA placement was implemented. The “Conroe” system is my standard Core2 workstation.

Data on purely serial runs (no -openmp) is shown for reference. In contrast to low-level benchmarks like the stream or vector triads which have more read streams and at least one write stream, there seems to be a lot of “headroom” for the second thread even for large N on an Opteron socket.