# Stepanov test faster than light?

If you program in C++ and care about performance, you have probably heard about the Stepanov abstraction benchmark [1]. It is a simple sum reduction that adds 2000 double-precision floating-point numbers using 13 code variants. The variants are successively harder to see through by the compiler because they add layers upon layers of abstractions. The first variant (i.e., the baseline) is plain C code and looks like this:

// original source of baseline sum reduction
void test0(double* first, double* last) {
start_timer();
for(int i = 0; i < iterations; ++i) {
double result = 0;
for (int n = 0; n < last - first; ++n) result += first[n];
check(result);
}
result_times[current_test++] = timer();
}


It is quite easy to figure out how fast this code can possibly run on a modern CPU core. There is one LOAD and one ADD in the inner loop, and there is a loop-carried dependency due to the single accumulation variable result. If the compiler adheres to the language standard it cannot reorder the operations, i.e., modulo variable expansion to overlap the stalls in the ADD pipeline is ruled out. Thus, on a decent processor such as, e.g., a moderately modern Intel design, each iteration will take as many cycles as there are stages in the ADD pipeline. All current Intel CPUs have an ADD pipeline of depth three, so the expected performance will be one third of the clock speed in GFlop/s.

If we allow some deviation from the language standard, especially unsafe math optimizations, then the performance may be much higher, though. Modulo variable expansion (unrolling the loop by at least three and using three different result variables) can overlap several dependency chains and fill the bubbles in the ADD pipelines if no other bottlenecks show up. Since modern Intel CPUs can do at least one LOAD per cycle, this will boost the performance to one ADD per cycle. On top of that, the compiler can do another modulo variable expansion for SIMD vectorization. E.g., with AVX four partial results can be computed in parallel in a 32-byte register. This gives us another factor of four.

-O3 -march=native -O3 -ffast-math -march=native Original baseline assembly code  vxorpd %xmm0, %xmm0, %xmm0 .L17: vaddsd (%rax), %xmm0, %xmm0 addq $8, %rax cmpq %rbx, %rax jne .L17  vxorpd %xmm1, %xmm1, %xmm1 .L26: addq$1, %rcx vaddpd (%rsi), %ymm1, %ymm1 addq $32, %rsi cmpq %rcx, %r13 ja .L26 vhaddpd %ymm1, %ymm1, %ymm1 vperm2f128$1, %ymm1, %ymm1, %ymm3 vaddpd %ymm3, %ymm1, %ymm1 vaddsd %xmm1, %xmm0, %xmm0 

Now let us put these models to the test. We use an Intel Xeon E5-2660v2 “Ivy Bridge” running at a fixed clock speed of 2.2 GHz (later models can run faster than four flops per cycle due to their two FMA units). On this CPU the Stepanov peak performance is 8.8 GFlop/s for the optimal code, 2.93 GFlop/s with vectorization but no further unrolling, 2.2 GFlop/s with (at least three-way) modulo unrolling but no SIMD, and 733 MFlop/s for standard-compliant code. The GCC 6.1.0 was used for all tests, and only the baseline (i.e., C) version was run.
Manual assembly code inspection shows that the GCC compiler does not vectorize or unroll the loop unless -ffast-math allows reordering of arithmetic expressions. Even in this case only SIMD vectorization is done but no further modulo unrolling, which means that the 3-stage ADD pipeline is the principal bottleneck in both cases. The (somewhat cleaned) assembly code of the inner loop for both versions is shown in the first table. No surprises here; the vectorized version needs a horizontal reduction across the ymm1 register after the main loop, of course (last four instructions).

g++ options Measured [MFlop/s] Expected [MFlop/s] Original baseline code performance -O3 -march=native 737.46 733.33 -O3 -ffast-math -march=native 2975.2 2933.3

In defiance of my usual rant I give the performance measurements with five significant digits; you will see why in a moment. I also selected the fastest among several measurements, because we want to compare the highest measured performance with the theoretically achievable performance. Statistical variations do not matter here. The performance values are quite close to the theoretical values, but there is a very slight deviation of 1.3% and 0.5%, respectively. In performance modeling at large, such a good coincidence of measurement and model would be considered a success. However, the circumstances are different here. The theoretical performance numbers are absolute upper limits (think “Roofline”)! The ADD pipeline depth is not 2.96 cycles but 3 cycles after all. So why is the baseline version of the Stepanov test faster than light? Can the Intel CPU by some secret magic defy the laws of physics? Is the compiler smarter than we think?

A first guess in such cases is usually “measuring error,” but this was ruled out: The clock speed was measured by likwid-perfctr to be within 2.2 GHz with very high precision, and longer measurement times (by increasing the number of outer iterations) do not change anything. Since the assembly code looks reasonable, the only conclusion left is that the dependency chain on the target register, which is completely intact in the inner loop, gets interrupted between iterations of the outer loop because the register is assigned a new value. The next run of the inner loop can thus start already before the previous run has ended, leading to overlap. A simple test supports this assumption: If we cut the array size in half, the relative deviation doubles. If we reduce it to 500, the deviation doubles again. This strongly indicates an overlap effect (absolute runtime reduction) that is independent of the actual loop size.

In order to get a benchmark that stays within the light speed limit, we modify the code so that the result is only initialized once, before the outer loop (see second listing).

// modified code with intact (?) dependency chain
void test0(double* first, double* last) {
start_timer();
double result = 0; // moved outside
for(int i = 0; i < iterations; ++i) {
for (int n = 0; n < last - first; ++n) result += first[n];
if(result<0.) check(result);
}
result_times[current_test++] = timer();
}


The result check is masked out since it would fail now, and the branch due to the if condition can be predicted with 100% accuracy. As expected, the performance of the non-SIMD code now falls below the theoretical maximum. However, the SIMD code is still faster than light.

g++ options Measured [MFlop/s] Expected [MFlop/s] Modified baseline code performance -O3 -march=native 733.14 733.33 -O3 -ffast-math -march=native 2985.1 2933.3

How is this possible? Well, the dependency chain is doomed already once SIMD vectorization is done, and the assembly code of the modified version is very similar to the original one. The horizontal sum over the ymm1 register puts the final result into the lowest slot of ymm0, so that ymm1 can be initialized with zero for another run of the inner loop. From a dependencies point of view there is no difference between the two versions. Accumulation into another register is ruled out for the standard-conforming code because it would alter the order of operations. Once this requirement has been dropped, the compiler is free to choose any order. This is why the -ffast-math option makes such a difference: Only the standard-conforming code  is required to maintain an unbroken dependency chain.

Of course, if the g++ compiler had the guts to add another layer of modulo unrolling on top of SIMD (this is what the Intel V16 compiler does here), the theoretical performance limit would be ADD peak (four additions per cycle, or 8.8 GFlop/s). Such a code must of course stay within the limit, and indeed the best Intel-compiled code ends up at 93% of peak.

Note that this is all not meant to be a criticism of the abstraction benchmark; I refuse to embark on a discussion of religious dimensions. It just happened to be the version of the sum reduction I have investigated closely enough to note a performance level that is 1.3% faster than “the speed of light.”

# The 400x GPU speedup baloney

Recently, in the HPC services office…

(… phone rings …)

“Computing Center, HPC services. How can I help you?”

“Yes, High Performance Computing.”

“You want to use our GPGPU cluster? Great! The load on this baby could be higher anyway. It’s hard to believe, but people seem to avoid it like the plague (jovial laughter). Do you have a code that runs on GPUs already?”

“I see, the compiler should be able to handle this. But the code is SIMD vectorized for standard processors, right?”

“No, this has nothing to do with cell phones. SIMD means ‘Single Instruction Multiple Data,’ i.e., several operations can be performed on different operands with a single machine instruction. If that works, chances are that the program can be run efficiently on a GPU as well. After all, GPUs implement the SIMD principle quite extensively.”

“Hm? I think I don’t understand…”

“Ah, ok. No, you don’t have to learn assembly programming to do this. But you may have to think a little more about how the compiler looks at your code. Often you can help it by supplying additional information, like source directives. And of course you need to use a compiler that understands what SIMD is. Alas, the GNU compilers don’t have a clue about it, mostly. By the way, how  have you parallelized the code?”

“Um, no. Compilers can’t help you much here, except for very simple cases where a 10-year-old can do it just as well. But you do have to parallelize. How do you want to draw a meaningful comparison to the GPU version?”

“What do you mean, you don’t need to do this?”

“Um, yes, I think I’ve seen this paper recently. It should be on my desk somewhere… (paper rustling) And what exactly are you referring to?”

“Section 4.3, just a sec… here we are: ‘As shown in Fig. whatever, we have achieved a 400x speedup on an NVIDIA Tesla GPU as compared with our CPU implementation.’ (long pause)”

“Sorry, no, I’m still here. I’ve just been looking for the details of the CPU implementation. One moment… (longer pause) Ok, here’s something in the caption of the pretty CFD visualization: ‘In order to avoid issues with OpenMP parallelization we have run the CPU version on a single core.’ Wow. This has to sink in. And if I’m not mistaken, they compare a single-precision GPU code with a double precision version on the CPU. Truly hilarious.”

“No, I’m not making fun of those scientists. But ‘scientists’ is not the word I would use. They obviously think that everyone else is stupid.”

“Why? Because a factor of 400 is impossible. Neither the floating-point peak performance nor the memory bandwidth of the GPU is 400 times larger than that of a current standard compute node, or a chip, or even a single core. So they must have fabricated a slow CPU code on purpose. Realistically one may expect a factor of 2-4 if you compare a reasonably optimized CPU code on a single node with a single GPU, and use the same precision on both.”

“Yes, I agree. 2-4 isn’t bad at all. But that’s just counting the raw GPU. How would the data transfer affect the performance of you code? Can you estimate this?”

“Well, somehow the data must be brought into the GPU and the results must be copied back so that they don’t start rotting over there…”

“Yes (sighing with resignation). Compilers are smart. But there are limits. If you copy the whole problem through the PCI bus after every step, the only way to exploit the performance advantage of the GPU is to perform ridiculously many flops. It’s all a matter of code balance.”

“Code balance tells you how much data transfer your code needs per executed floating-point operation – and you have to count everything, including communication via buses, the network, the memory interface, etc. This can add up to quite some data volume. And then you compare with the amount of data the hardware can transfer per peak flop executed, which gives you an estimate for the performance.”

“You don’t know how many flops and how much data transfer your code needs? Can’t you just count that by looking at it? In most cases, compute time is spent in a very limited number of numerical kernel loops.”

“No, the compiler doesn’t count that for you (keeping composure with obvious effort). It’s also a matter of optimization; perhaps one can reduce the data transfers a bit. One would have to take a look at the code.”

“The compiler? (devastated) Yes, you should try that. Definitely… No listen, people are much too enthusiastic about the compiler’s abilities. Compilers can not read your mind, they can’t even look through the standard C++ template tricks that programmers are so fond of. Let alone generate optimal code for GPUs.”

“Ok, err, sorry, are you crying? (nonplussed) Please don’t. May I suggest that we sit together over a cup of coffee and I give you a crash course in basic performance modeling? Don’t worry, everything’s going to be alright. There, there…”

# Allocatable array alignment with Intel Fortran compilers

Aligning heap data in C is simple: Just use the standard memalign() or better posix_memalign() functions instead of malloc() and you’re done. Intel compilers also feature special library calls to achieve the same thing, but you don’t really need them (you do need compiler support, though, for stack data, structures etc.). It should be clear for everyone familiar with current x86 architectures what properly aligned data can do for you: Packed aligned loads and non-temporal stores become possible. Even though the compiler can still employ aligned data movement by itself in some cases by loop peeling, one may want to align all references properly to enable the use of the #pragma vector aligned sledgehammer (why don’t they provide an argument list for this directive?).

In Fortran there is no standard way to make allocatable data automatically aligned on some address boundary (standard alignment is on 8 byte). The Intel compiler, however, provides a special directive to do just that. In order to enforce 16-byte alignment you can write:

 double precision, allocatable, dimension(:) :: array
!DEC\$ ATTRIBUTES ALIGN: 16 :: array

! ...

allocate(array(100000))


Although the compiler docs say at some point that this doesn’t work for allocatables, it does at least for versions 9.1 and 10.1 (I’ve checked by printing out the address explicitly).

This should enable the same vectorization stunts as in C without too much hassle.