Georg Hager's Blog

Random thoughts on High Performance Computing

Content

Node-Level Performance Engineering tutorial to be featured again at SC17

Our popular “Node-Level Performance Engineering” full-day tutorial has been accepted again (now the sixth time in a row!) for presentation at SC17, the International Conference for High Performance Computing, Networking, Storage and Analysis. We teach the basics of node-level computer architecture, analytic performance modeling (via the Roofline model), and model-guided optimization. Watch this cool video to whet your appetite:

When: November 12, 2017, 8:30am-5:00pm

Where: Colorado Convention Center, Denver, CO.

 

Benchmarking the memory hierarchy of the Intel “Skylake” Xeon E3-1275 v5 using the vector triad

The vector triad results for the AMD Ryzen 1700X in my previous post allowed some interesting conclusions, but it is more instructive to compare them with a current Intel CPU. We have a desktop PC with an Intel “Skylake” Xeon E5-1275 v5 and dual-channel DDR4-2133 memory. This model was introduced in Q4/2015, and it does not support AVX-512 instructions. Although it has only four cores, it is a good fit because it has a similar price tag and power dissipation as the Ryzen 1700X, and it is also built with 14nm technology. The slightly lower theoretical memory bandwidth of 34.1 GB/sec does not really make a difference in practice.

The game is the same as before: I have fixed the clock frequency to 3.0 GHz, and the benchmark is the “throughput-mode vector triad”:

S = get_walltime()
do r=1,NITER
  do i=1,N
    A(i) = B(i) + C(i) * D(i)
  enddo
enddo
WT = get_walltime() - S
MFLOPS = 2.d0 * N * NITER / WT / 1.d6

This was run in parallel, without work sharing, on 1 to 4 cores. Again I have checked that FMA instructions make no difference whatsoever for this extremely data-bound code, and that likwid-bench delivers the same performance levels. An important architectural detail of the Skylake is worth mentioning: Just like its predecessors Haswell and Broadwell, it has a third address generation unit (AGU) for STOREs, hooked to execution port 7. In theory the core can thus sustain two AVX LOADs and one AVX STORE per cycle, doubling the per-cycle data throughput compared to Sandy Bridge and Ivy Bridge. However, Haswell and Broadwell had a limitation on this AGU: It could only process “simple” addresses, i.e., none of the complex addressing modes the Intel compiler uses by default (e.g., [rsi+8*rdx+32]) was able to use the AGU. The compiler (I tried up to version 17.0 update 2) does not know about this and refuses to choose simple addresses at least for the STOREs. By hacking the assembly code I could produce a version using simple addresses for the STOREs; see below for the consequences.

Throughput-mode vector triad on the Skylake chip vs. array length with 1, 2, and 4 cores. Inset: Performance for an in-memory dataset vs. number of cores. The bandwidth numbers assume a code balance of 20 B/flop in all hierarchy levels except L1.

The figure shows the results. I have chosen the same x and y axis ranges as in the Ryzen post, so the comparison should be straightforward:

In the L1 cache the data throughput is at 64 B/cy because the core can sustain either two AVX LOADs or one AVX LOAD and one AVX STORE per cycle. This proves that the third AGU at port 7 cannot be used with compiler-generated code. My “hacked” variant with simple addresses (only using [rcx] for the STOREs, shown as a dotted line) can get to the theoretical limit of 85.3 B/cy (not 96 because the bottleneck is the LOAD port – see any of our recent NLPE tutorials to understand why). But even without port 7 and AVX-512 the Skylake core can outperform the Ryzen by a factor of two per cycle. As a consequence, Skylake can reach the same L1 performance as Ryzen with four (or three, with some port 7 tweaking) instead of eight cores. Well, this was not entirely unexpected.

The large performance breakdown from L1 to L2 is characteristic of Intel architectures because of their “single-ported” L1 cache: In any given cycle, the cache can either talk to the registers or to the L2 cache, but not both. This “feature” was the starting point of our ECM performance model. On Skylake (and Broadwell) the increased L2 bandwidth of 64 B/cy, which was in theory available since Haswell,  can actually be observed [1] (indirectly, of course, due to the non-overlapping L1).  Despite its lower L1 performance, a Ryzen core can easily keep up with the Skylake because of its overlapping cache hierarchy. Due to its eight cores, the Ryzen 1700X has a massively better aggregate L2 performance, of course, and tops out at over 256 B/cy, whereas the Skylake chip stays at just below 160 B/cy.

The L3 cache on the Skylake is, as opposed to Ryzen, bandwidth-scalable up to all 4 cores. The Ryzen’s two “compute complexes” save its neck here because two L3s have two times the bandwidth, so the overall performance level is about 1.6x higher than on Skylake. One could argue that it is unfair to compare a four-core with an eight-core part, but I don’t actually care – this is not a bang-for-the-buck shootout but an architectural comparison.

The memory bandwidth (inset in the figure) shows similar characteristics to the Ryzen: A single core can almost saturate the bandwidth. Actually, if you let Turbo Mode do its thing there is hardly a discernible speedup in memory bandwidth for multiple cores. The Skylake achieves about 84% of the theoretical peak memory bandwidth for this benchmark. Just as on Ryzen, the efficiency goes up dramatically with a read-only benchmark and tops out at an impressive 97%.

Summing up, the Ryzen CPU cannot keep up with current Intel designs in terms of in-core performance with AVX(2) code. On the other hand, its overlapping cache hierarchy and good memory bandwidth (for a desktop chip) can regain a lot of lost ground despite the non-scalable L3 cache. I’m looking forward to the upcoming Naples CPU, although its value for HPC is probably questionable.

 

[1] J. Hofmann, D. Fey, M. Riedmann, J. Eitzinger, G. Hager, and G. Wellein: Performance analysis of the Kahan-enhanced scalar product on current multi- and manycore processors. Concurrency & Computation: Practice & Experience (2016). Available online, DOI: 10.1002/cpe.3921. Preprint: arXiv:1604.01890

 

Benchmarking the memory hierarchy of the new AMD Ryzen CPU using the vector triad

We have recently acquired the brand-new AMD Ryzen 1700X (8 cores, 3.4+ GHz). Reading all the random numbers that people produce with their benchmarks made me sick, so I decided to run something that is well understood: The Schönauer vector triad. Everyone who knows our lectures and tutorials also knows this simple benchmark:

S = get_walltime()
do r=1,NITER
  do i=1,N
    A(i) = B(i) + C(i) * D(i)
  enddo
enddo
WT = get_walltime() - S
MFLOPS = 2.d0 * N * NITER / WT / 1.d6

Performance is reported in Mflops/sec for a wide range of problem sizes N, and NITER is chosen such that the benchmark runs for a sufficiently long time, long enough so that the wall-clock time measurement is accurate. On top of it we usually run the whole thing on different numbers of cores (using an OMP PARALLEL directive, but without worksharing on the inner loop) so we can evaluate the scalability of the different data paths without any adverse effects from OpenMP overhead. All this can be done very easily and efficiently using the likwid-bench microbenchmarking tool, but I have done the triad runs with my own Fortran-based code and checked afterwards that likwid-bench gives the same results.

The Ryzen system is equipped with 2x 16 GB of DDR4-2400 DRAM for a maximum theoretical memory bandwidth of 38.4 GB/sec. For the measurements I fixed the clock speed to 3.0 GHz, although the nominal clock speed is 3.4 GHz. The reason is simply my current inability to keep the CPU from running in Turbo mode when I set it to 3.4 GHz; it doesn’t matter for the benchmarks, though, because all in-cache numbers will be converted to bytes per cycle, and the memory bandwidth does not (significantly) depend on the clock frequency. I used the Intel compiler version 17.0 Update 2, which generates very decent AVX code on the chip if you give it the -Ofast -xHost options. Using FMAs would not make a difference because the Ryzen core executes a full-width 256-bit FMA in two chunks of 128 bits (and I have checked using likwid-bench that the FMA code runs exactly as fast as the non-FMA version). Besides, this code is totally bound by the data transfers even with data in the L1 cache. I have not used non-temporal store instructions; this is a topic for a future post.

Throughput-mode vector triad on the Ryzen chip vs. array length with 1, 2, 4, and 8 cores. Inset: Performance vs. number of cores for an in-memory dataset. The bandwidth numbers assume a code balance of 20 B/flop in all hierarchy levels except L1.

Now for the data. The figure shows the usual stuff: Performance vs. array size for different numbers of threads (i.e., physical cores – SMT was ignored for this test). Here is my interpretation of the data:

With data in the L1 cache we get very close to 6 Gflops/sec per core, which amounts to a limit of 32 B/cy. Actually, the slide I stumbled across in the AMD booth at SC16 says “2 x 16B LOAD”, while another one says “2 LOADs & 1 STORE per cycle”. I don’t know how they want to sustain this with only two address generation units (AGUs), and the measured data is in line with a limit of “two 16-B LOADs OR one 16-B LOAD and one 16-B STORE per cycle,” identical to SSE limits on Intel Sandy Bridge and Ivy Bridge. If anyone can tell me how to get to the 8 Gflops/sec that can be expected if the 2LD+1ST story were true, go ahead. (Update: The two AGUs would be OK for one AVX LOAD and one half AVX STORE per cycle, just as on Sandy and Ivy Bridge; this would also lead to a maximum triad performance of 8 flops in 3 cycles. However, I can’t see it although the code is definitely AVX).

Between N=1024 and N=16384 the data comes from the 512 KiB per-core L2 cache. The drop in performance occurs at the same N independently of the core count, because we run the benchmark in “throughput mode”, i.e., with the same data size N on each core (think “weak scaling”). You can’t see it in the single-core data on this graph, but the 8-core runs tells us that the L2 actually delivers a little more than 32 B/cy. I have assumed here that there is a write-allocate for every store miss in the L1, increasing the code balance from 16 B/flop to 20 B/flop. The AMD slides linked above are a little cloudy on the details, they only state “32 B/cy” above two arrows between L1 and L2, so it isn’t clear whether this is 32 B/cy per direction or not. Another explanation could be that the write-allocate is optimized away by the hardware if it detects in its line fill buffers a write miss on a cache line that will be completely overwritten. We don’t know for sure, but once likwid-perfctr is ported to the architecture it should be able to shed some light on this.

Another important observation about the L2 (and also the L3): Data transfers overlap quite efficiently with retired LOADs in the L1. If you know the ECM performance model [1] you know that this is not at all the case on current Intel CPUs, making the AMD caches very effective (the IBM Power8 also behaves like this [2]).

The L3 cache is not core-local, but 8 MiB of shared L3 are available to each of the two four-core “core complexes” (CCX). The data clearly reflects this organization: Since I have pinned the threads in core order, i.e., filling one CCX first and then the other, this means we observe a performance drop beyond N=218 on a single core and beyond N=216 on the four cores of one CCX. Nothing changes from four to eight, however, because we get another 8 MiB of L3.

The most important observation about the L3 cache is that its bandwidth does not scale from one to four cores. There is only a slight speedup from 2 to 4, and it tops out at slightly above 64 B/cy assuming a code balance of 20 B/flop. The L3 is also a victim cache, i.e., only cache lines overflowing from the higher-level caches end up there. We will see how the efficiency of the L3 can be influenced by optimizations (such as, e.g., prefetch instructions, which I haven’t used at all here). The L3 bandwidth scales, of course, from 4 to 8 cores simply because there are two L3s.

Finally, the memory bandwidth behavior. This chip has the (IMO very attractive) property that a single core can almost saturate the memory bandwidth of the full chip. Intel desktop CPUs (but not the Xeons) behave in a similar way. The inset in the figure shows performance vs. number of cores for large N. The best bandwidth on a single CCX is obtained with two cores, and there is a slight drop for 3 and 4. We may speculate that this is a faint echo of the non-scalable L3 cache, but there may be many other reasons. With all eight cores we top out at just below 30 GB/sec, which is 77% of the theoretical memory bandwidth (dashed line in the inset). A read-only benchmark yields almost 90%, so it seems that the Ryzen also shares this feature with current Intel chips: the more read-dominated the code, the closer we can get to the peak memory bandwidth. The best overall chip bandwidth is obtained with two threads, one running on each CCX.

My next post will show the same test on a Skylake-type chip for comparison.

[1] H. Stengel, J. Treibig, G. Hager, and G. Wellein: Quantifying performance bottlenecks of stencil computations using the Execution-Cache-Memory model. Proc. ICS15, the 29th International Conference on Supercomputing, June 8-11, 2015, Newport Beach, CA. DOI: 10.1145/2751205.2751240. Preprint: arXiv:1410.5010

[2] J. Hofmann, D. Fey, M. Riedmann, J. Eitzinger, G. Hager, and G. Wellein: Performance analysis of the Kahan-enhanced scalar product on current multi- and manycore processors. Concurrency & Computation: Practice & Experience (2016). Available online, DOI: 10.1002/cpe.3921. Preprint: arXiv:1604.01890

Stepanov test faster than light?

If you program in C++ and care about performance, you have probably heard about the Stepanov abstraction benchmark [1]. It is a simple sum reduction that adds 2000 double-precision floating-point numbers using 13 code variants. The variants are successively harder to see through by the compiler because they add layers upon layers of abstractions. The first variant (i.e., the baseline) is plain C code and looks like this:

// original source of baseline sum reduction
void test0(double* first, double* last) {
  start_timer();
  for(int i = 0; i < iterations; ++i) {
    double result = 0;
    for (int n = 0; n < last - first; ++n) result += first[n];
    check(result);
  }
  result_times[current_test++] = timer();
}

It is quite easy to figure out how fast this code can possibly run on a modern CPU core. There is one LOAD and one ADD in the inner loop, and there is a loop-carried dependency due to the single accumulation variable result. If the compiler adheres to the language standard it cannot reorder the operations, i.e., modulo variable expansion to overlap the stalls in the ADD pipeline is ruled out. Thus, on a decent processor such as, e.g., a moderately modern Intel design, each iteration will take as many cycles as there are stages in the ADD pipeline. All current Intel CPUs have an ADD pipeline of depth three, so the expected performance will be one third of the clock speed in GFlop/s.

If we allow some deviation from the language standard, especially unsafe math optimizations, then the performance may be much higher, though. Modulo variable expansion (unrolling the loop by at least three and using three different result variables) can overlap several dependency chains and fill the bubbles in the ADD pipelines if no other bottlenecks show up. Since modern Intel CPUs can do at least one LOAD per cycle, this will boost the performance to one ADD per cycle. On top of that, the compiler can do another modulo variable expansion for SIMD vectorization. E.g., with AVX four partial results can be computed in parallel in a 32-byte register. This gives us another factor of four.

Original baseline assembly code
-O3 -march=native -O3 -ffast-math -march=native
 vxorpd %xmm0, %xmm0, %xmm0
.L17:
  vaddsd  (%rax), %xmm0, %xmm0
  addq    $8, %rax
  cmpq    %rbx, %rax
  jne     .L17
  vxorpd %xmm1, %xmm1, %xmm1
.L26:
  addq    $1, %rcx
  vaddpd  (%rsi), %ymm1, %ymm1
  addq    $32, %rsi
  cmpq    %rcx, %r13
  ja      .L26
  vhaddpd %ymm1, %ymm1, %ymm1
  vperm2f128 $1, %ymm1, %ymm1, %ymm3
  vaddpd  %ymm3, %ymm1, %ymm1
  vaddsd  %xmm1, %xmm0, %xmm0

Now let us put these models to the test. We use an Intel Xeon E5-2660v2 “Ivy Bridge” running at a fixed clock speed of 2.2 GHz (later models can run faster than four flops per cycle due to their two FMA units). On this CPU the Stepanov peak performance is 8.8 GFlop/s for the optimal code, 2.93 GFlop/s with vectorization but no further unrolling, 2.2 GFlop/s with (at least three-way) modulo unrolling but no SIMD, and 733 MFlop/s for standard-compliant code. The GCC 6.1.0 was used for all tests, and only the baseline (i.e., C) version was run.
Manual assembly code inspection shows that the GCC compiler does not vectorize or unroll the loop unless -ffast-math allows reordering of arithmetic expressions. Even in this case only SIMD vectorization is done but no further modulo unrolling, which means that the 3-stage ADD pipeline is the principal bottleneck in both cases. The (somewhat cleaned) assembly code of the inner loop for both versions is shown in the first table. No surprises here; the vectorized version needs a horizontal reduction across the ymm1 register after the main loop, of course (last four instructions).

Original baseline code performance
g++ options Measured [MFlop/s] Expected [MFlop/s]
-O3 -march=native 737.46 733.33
-O3 -ffast-math -march=native 2975.2 2933.3

In defiance of my usual rant I give the performance measurements with five significant digits; you will see why in a moment. I also selected the fastest among several measurements, because we want to compare the highest measured performance with the theoretically achievable performance. Statistical variations do not matter here. The performance values are quite close to the theoretical values, but there is a very slight deviation of 1.3% and 0.5%, respectively. In performance modeling at large, such a good coincidence of measurement and model would be considered a success. However, the circumstances are different here. The theoretical performance numbers are absolute upper limits (think “Roofline”)! The ADD pipeline depth is not 2.96 cycles but 3 cycles after all. So why is the baseline version of the Stepanov test faster than light? Can the Intel CPU by some secret magic defy the laws of physics? Is the compiler smarter than we think?

A first guess in such cases is usually “measuring error,” but this was ruled out: The clock speed was measured by likwid-perfctr to be within 2.2 GHz with very high precision, and longer measurement times (by increasing the number of outer iterations) do not change anything. Since the assembly code looks reasonable, the only conclusion left is that the dependency chain on the target register, which is completely intact in the inner loop, gets interrupted between iterations of the outer loop because the register is assigned a new value. The next run of the inner loop can thus start already before the previous run has ended, leading to overlap. A simple test supports this assumption: If we cut the array size in half, the relative deviation doubles. If we reduce it to 500, the deviation doubles again. This strongly indicates an overlap effect (absolute runtime reduction) that is independent of the actual loop size.

In order to get a benchmark that stays within the light speed limit, we modify the code so that the result is only initialized once, before the outer loop (see second listing).

// modified code with intact (?) dependency chain
void test0(double* first, double* last) {
  start_timer();
  double result = 0; // moved outside
  for(int i = 0; i < iterations; ++i) {
    for (int n = 0; n < last - first; ++n) result += first[n];
    if(result<0.) check(result);
  }
  result_times[current_test++] = timer();
}

The result check is masked out since it would fail now, and the branch due to the if condition can be predicted with 100% accuracy. As expected, the performance of the non-SIMD code now falls below the theoretical maximum. However, the SIMD code is still faster than light.

Modified baseline code performance
g++ options Measured [MFlop/s] Expected [MFlop/s]
-O3 -march=native 733.14 733.33
-O3 -ffast-math -march=native 2985.1 2933.3

How is this possible? Well, the dependency chain is doomed already once SIMD vectorization is done, and the assembly code of the modified version is very similar to the original one. The horizontal sum over the ymm1 register puts the final result into the lowest slot of ymm0, so that ymm1 can be initialized with zero for another run of the inner loop. From a dependencies point of view there is no difference between the two versions. Accumulation into another register is ruled out for the standard-conforming code because it would alter the order of operations. Once this requirement has been dropped, the compiler is free to choose any order. This is why the -ffast-math option makes such a difference: Only the standard-conforming code  is required to maintain an unbroken dependency chain.

Of course, if the g++ compiler had the guts to add another layer of modulo unrolling on top of SIMD (this is what the Intel V16 compiler does here), the theoretical performance limit would be ADD peak (four additions per cycle, or 8.8 GFlop/s). Such a code must of course stay within the limit, and indeed the best Intel-compiled code ends up at 93% of peak.

Note that this is all not meant to be a criticism of the abstraction benchmark; I refuse to embark on a discussion of religious dimensions. It just happened to be the version of the sum reduction I have investigated closely enough to note a performance level that is 1.3% faster than “the speed of light.”

[1] http://www.open-std.org/jtc1/sc22/wg21/docs/D_3.cpp

 

Energy vs. performance: Introducing the Z-plot

Figure 1: Z-plot

Figure 1: Z-plot with relevant iso-lines

Energy consumption and power dissipation have become the latest craze in HPC, for several reasons. Unfortunately there is also a lot of confusion about the interplay of performance and power. Thomas Zeiser from the RRZE HPC group has come up with an interesting way of visualizing energy-performance data on the CPU socket level that makes it easier to reason about energy consumption. The Z-plot shows dissipated energy to solution (i.e., how much energy is spent for solving a given problem) versus the program’s performance, which can be quantified by any appropriate metric that is inversely proportional to the runtime. Figure 1 shows that, if we always solve the same problem, most of the interesting quantities one would like to study in the Z-plot are constant along some simple line: obviously, all program runs with the same energy lie on a horizontal line (black), and all runs with the same performance lie on a vertical (blue), but also all runs with the same energy-delay product (EDP) lie on a straight line through the origin (orange), the slope of the line being proportional to the EDP. Finally, all runs with the same power dissipation lie on a hyperbola (red). If we change something, such as the number of active cores per chip or the clock speed, or if we optimize or otherwise change the code, it is insightful to see how different metrics change as we move through the Z-plot. As a first example we study the behavior of a scalable code on a multi-core CPU socket.

Scalable performance vs. active cores

Figure 2: A scalable program in the Z-plot. Each circle represents a run at a certain clock speed (color coded) and a given number of cores. The solid lines interpolate between points with the same number of cores, i.e., the clock speed changes continuously along the line.

Figure 2: A scalable program in the Z-plot. Each circle represents a run at a certain clock speed (color coded) and a given number of cores. The solid lines interpolate between points with the same number of cores, i.e., the clock speed changes continuously along the line.

A good definition of “scalable” in a multicore context is “performance is linear in the number of active cores \(n\).” Usually such programs also scale linearly in the clock speed \(f\); the performance can thus be described by \[P(n,f)=nP_0f/f_0~~,\]where \(f_0\) is some base or reference clock speed. If we run such a code on a multicore CPU, varying the number of cores and the clock speed, we may get data as shown in Figure 2. Some important observations strike the eye:

  • The more cores we use at a given clock speed (same-color data points from left to right), the lower the energy to solution.
  • Increasing the clock speed at constant number of cores seems to always increase the energy for large core counts, but for smaller core count there appears to be an optimal clock speed at which energy is at a minimum. This optimal frequency goes up as the number of cores goes down. E.g., it is around 3 GHz at one core but below 2 GHz at four cores.
  • The energy-delay product (EDP) goes down as the frequency goes up at fixed core count. In our example it is lowest at 12 cores and 4 GHz (green dashed line).

The data in Figure 2 does not come from measurements; it was constructed from the combination of our “scalable performance” model above with a multi-core power model, which was introduced in [1]. In a follow-up post I will show some measurements on real systems. In [2] we have applied the model to a lattice-Boltzmann (LBM) solver, but LBM is non-scalable across cores, which requires special treatment. Stay tuned.

[1] G. Hager, J. Treibig, J. Habich, and G. Wellein: Exploring performance and power properties of modern multicore chips via simple machine models. Concurrency and Computation: Practice and Experience, DOI: 10.1002/cpe.3180 (2014). Preprint: arXiv:1208.2908

[2] M. Wittmann, G. Hager, T. Zeiser, J. Treibig, and G. Wellein: Chip-level and multi-node analysis of energy-optimized lattice-Boltzmann CFD simulations. Concurrency and Computation: Practice and Experience, DOI: 10.1002/cpe.3489 (2015). Preprint: arXiv:1304.7664

 

WPMVP 2016: 3rd Workshop on Programming Models for SIMD/Vector Processing

Single Instruction Multiple Data (SIMD) parallelism is a major factor in the arithmetic peak performance of modern CPUs, but there is no “silver bullet” for leveraging it in the optimal way for improved application performance. Jan Eitzinger (formerly Treibig) of LIKWID fame is organizing the 3rd workshop on programming models for SIMD/vector processing, co-located with PPoPP 2016 in Barcelona, Spain. The first two workshops in this series in 2014 and 2015 brought together an interesting mix of developers and vendors, sparking lively discussions. This time the workshop proceedings will again be published in the ACM digital library. For more details please take a look at the WPMVP 2016 website.

Important dates:

  • Fri 13 Nov 2015: Paper submission deadline
  • Mon 14 Dec 2015: Notification of acceptance

Fooling the masses – Stunt 16: Worship the God of Automation!

(See the prelude for some general information on what this is all about)

Developing a true understanding of a problem by thorough analysis is so 19th century. When there were no computers, people had to actually sit down with paper and pencil (quill?) and use their brains to figure out stuff. How old-fashioned and boring.

Automation.png

Figure 1: In automation, plugging together many existing frameworks is key. Why not reuse what you already have?

With the advent of computer science we have added substantial momentum to automation. And it’s not just about automating tedious computations, but also about day-to-day tasks such as filtering information, visualizing data, and (now imagine a heroic fanfare in the background) the process of thinking itself!  With automation we don’t have to care about the validity, usefulness, or tenability of our automatically generated results, because understanding or insight was not asked for. The point is that “stuff” drops out of some program. What else could we wish for?

In high performance computing we find a particularly strong case for automation. Parallel computers are so complicated. Code is complicated. Compilers are complicated. The way performance is composed of all those little bottlenecks in the machine is complicated. So let’s design a machine (i.e., a program) that takes all this intricacy and digests it. This usually means we have to produce and interpret lots of data, but computers are good at crunching data, too! So why not use a fancy data analysis framework to dig out the information? It’s all in the data, so the machine should see it.

This is all easier than one might think. Just dance the usual CS tango and plug together all the tools you can lay your hands on (see figure). Then take a bunch of source codes, feed your new tool with it, and draw a diagram that shows in fancy colors that L3 misses and long network latencies are correlated with low performance and high energy consumption. Et voilà: real science!

Of course there are those reactionary die-hards who are still fostering the brain-quill-and-paper approach. But this is 2015! Steal their thunder by repeating the CS mantra: “Can this be automated? This can be automated. Automate this!” (a little rocking back and forth with eyes closed will emphasize the religious dimensions of your message).

If they are still obstinate, throw random pieces of Haskell code at them. That should teach them a lesson.

 

Fooling the masses – Stunt 15: Play mysterious!

(See the prelude for some general information on what this is all about)

Do you know that feeling when you think to yourself “What on earth did they do there?” when reading a paper? After some depressing hours of trying to figure out how those guys got their results, you just give up, shrug, and cite them in your related work section because everybody else does. “My goodness” you think, “I wish I knew how to do this myself.” Here are some ideas.

Make it as hard as possible for others to figure out what you actually did, so that they can’t compare your results to theirs right away. A popular strategy is to present problem sizes, i.e., the “amount of work,” separately (on a different page) from performance results, which in turn should be given in seconds (time to solution). This will force your audience to figure out the conventional “work/time” performance metric by themselves; many will give up and just believe your statement that you are “faster than X.” It will also be harder to compare with performance models, which is a good thing if you’re off the model by some large factor. Runtime is not the only effective metric; anything unusual will do, such as “microflops per picosecond” or “cache misses per core hour.” Combine with stunt 12 as needed.

Test case problem size
[grid points]
# Iterations Runtime [s]
car see page 4 500  2.34521
plane see page 6 sufficient  3.14159
train see page 2 roughly 112  0.11991
chicken see page 12 whatever  42.0

It’s really useful to be secretive about the exact hardware you were using. “eight-core AMD Opteron” or “six-core Intel Xeon” will usually suffice as a description. You may throw in the clock speed for good measure, but don’t mention whether you have activated Turbo Mode or not, or whether you used one socket or two, or how you enforced affinity. On the other hand, never forget to specify the exact Linux distribution and kernel version! No performance paper will be complete without it. We don’t want to hide anything after all, right?

Talking about openness: leave a good impression by making your software framework available for download. But when you do, take care to make it impossible to actually run it. Forge a totally convoluted, mind-twisting build system. Put in dependencies to several huge Java frameworks with version numbers far into the past or the future (requiring gcc 2.95.2 will work fine, too!). Never write plain C code, always generate it, using popular languages such as Logo, Java2kWhitespace, or Brainfuck (pick any two). Because: what they can’t run, they can’t criticize! (Think “Stupida Mouse!”) Give your software framework a fancy, foreign-sounding name, preferably from a romance language. Acronyms are so 1980s1, but “Inocentada,” “Vanitas,” “Nebulosità,” or “Ordure” will really cut the mustard.

If your strategy works you can even afford to be bold in your abstract: “Our framework achieves up to 34% more performance than previous work.” Since nobody can refute that claim due to your bleeding-edge camouflage tactics, you have laid the foundation of what will be known in a few years as “seminal work.”


1 So I was told by Julian

Attendee survey results from the SC14 “Code Optimization War Stories” BoF

At SC14 in New Orleans we (Jan Treibig and I, with help by Thomas William from TU Dresden) conducted, for the first time, a BoF with the title “Code Optimization War Stories.” Its purpose was to share and discuss several interesting and (hopefully) surprising ventures in code optimization. Despite the fact that we were accused by an anonymous reviewer of using “sexist language” that was “off-putting and not welcoming” (the offending terms in our proposal being “man-hours” and “war stories”), the BoF sparked significant interest with some 60 attendees. In parallel to the talks and discussions, participants were asked to fill out a survey about several aspects of their daily work, which programming languages and programming models they use, how they approach program optimization, whether they use tools, etc. The results are now available after I have finally gotten around to typing in the answers given on the paper version; fortunately, more than half of the people went for the online version. Overall we had 29 responses. Download the condensed results here:  CodeOptimizationWarStories_SurveyResults.pdf

The results are, in my opinion, partly surprising and partly expected. Note that there is an inherent bias here since the audience was composed of people specifically interested in code optimization (according to the first survey question, they were mostly developers and domain scientists); believe it or not, such a a crowd is by no means a representative sample of all attendees of SC14, let alone the whole HPC community. The following observations should thus be taken with a grain of salt. Or two.


Management summary: In short, people mostly write OpenMP and MPI-parallel programs in C, C++, or Fortran. The awareness of SIMD is surprisingly strong, which is probably why single-core optimization is almost as popular as socket, node, or massively parallel tuning.  x86 is the dominant architecture, time to solution is the most important target metric, and hardly anyone cares about power dissipation. The most popular tools are the simple ones: gprof and friends mostly, supported by some hardware metrics. Among those, cache misses are still everybody’s darling. Among the few who use performance models at all, it’s almost exclusively Roofline.

lang.jpg

What is the programming language that you mostly deal with when optimizing code?

Here are the details:

  • Which area of science does the code typically come from that you work on? Physics leads the pack (59%)  here, followed by maths, CFD, and chemistry.
  • Programming languages. Fortran is far from dead: 52% answered “Fortran 90/95” when asked for the programming languages they typically deal with. Even Fortran 77 is still in the game with 24%. However, C++98 (69%) and C (66%) are clearly in the lead. For me it was a little surprising that C++11 almost matches Fortran at 45% already. However, this popularity is not caused by the new threading features (see next item).
  • Typical parallel programming models. Most people unsurprisingly answered “OpenMP” or “MPI” (both 76%). The unexpected (at least in my view) runner-up with 69% is SIMD, which is good since it finally seems to get the attention it deserves! CUDA landed at a respectable 31%, and C++11 threads trail behind at 14%, beaten even by the good old POSIX threads (17%). The PGAS pack (GASPI, Global Arrays, Co-Array Fortran, UPC) hits rock bottom, being less popular than Java threads and Cilk (7% and 10%, respectively). OpenCL struggles at a mere 14%. If it weren’t for this “grain of salt” thing (see above), one might be tempted to spot a few sinking ships here…
  • Do you optimize for single core, socket/node, or highly parallel? Code optimization efforts were almost unanimously reported to revolve around the highly parallel (69%) or socket/node level (76%), but 52% also care about single-core issues. The latter does not quite match the 69% who say they write SIMD-parallel code, which leaves some room for speculation. Anyway it is good to see that some people still don’t lose sight of the single core even in a highly parallel world.

    pmodels.jpg

    What is the parallel programming model that you mostly deal with when optimizing code?

  • Computer architectures. x86-based systems are clearly in the lead (90%) in terms of target architectures, because there’s hardly anything left. IBM Blue Gene (7%) and Cray MPPs (3%) came out very low, which is entirely expected: although big machines do get a lot of media attention, most people (have to) deal with the commodity stuff under their desks and in local computing centers. Xeon Phi (34%) and Nvidia GPUs (31%) go almost head to head; a clear success for Intel’s aggressive marketing strategy.
  • Typical optimization target metrics. This was a tricky one since many metrics overlap significantly, and it is not possible to use one and not touch the other. Still it is satisfying that low time to solution comes out first (72%), followed by speedup (59%) and high work/time ratio (52%). It would be interesting to analyze whether there is a correlation or an anti-correlation between these answers across the attendees; maybe I should have added “-O0” as another optimization option in the next question 😉 Low power and/or low energy, although being “pivotal today” (if you believe the umpteenth research paper on the subject) are just slightly above radar level. Probably there’s just not enough pressure from computing centers, or probably there weren’t enough attendees from countries where electric energy isn’t almost free, who knows?
  • Optimization strategies. No surprises here: 90% try to find a good choice of compiler options, and standard strategies like fixing load imbalance, doing loop transformations, making the compiler understand the code better, and optimizing communication patterns and volume, etc., are all quite popular. Autotuning methods mark the low end of the spectrum. It seems that the state-of-the-art autotuning frameworks haven’t made it into the mainstream yet.
  • Approaches for saving energy (or related). The very few people who actually did some energy-related optimization have understood that good performance is the lowest-order effect in getting good energy efficiency. Hardly anyone tried voltage and/or frequency scaling for saving energy. This may have to do with the fact that few centers allow users to play with these parameters, or provide well-defined user interfaces. Btw, since release 3.1.2 of the LIKWID tools we support likwid-setFrequencies, a simple program which allows you to set the CPU clock speed from the shell.
  • tools.jpg

    If you have used performance tools, which ones?

    Popular performance tools. 76% answered that they had used some tool for optimizing application code at least once. However, although optimizing highly parallel applications is quite popular (see above), the typical parallel tools such as Vampir or Scalasca are less fashionable than expected (both 10%, even behind Allinea MAP), and  TAU is only used by 28%. On the other hand, good old gprof or similar compiler-based facilities still seem to be very important (48%). There were some popular tools I hadn’t thought of (unforgivably) when putting together the survey, and they ended up in the “other” category: Linux perf, valgrind, HPC Toolkit. The whole picture shows that people do deal with single-thread/node issues a lot.  LIKWID deserves a little more attention, though 🙂

  • What are the performance tools used for? I put in “producing colorful graphs” as a fun option, but 21% actually selected it! Amazing, they must have read my “Show data! Plenty.” blog post. Joke aside, 76% and 66%, respectively, reported that they use tools for identifying bottlenecks and initial code assessment. Only very few employ tools for validating performance models, which is sad but will hopefully change in the future.
  • What are the most popular hardware performance events? Ahead even of clock cycles (31%) and Flops (41%), cache misses are the all-time winner with 55%. This is not unexpected, cache misses being the black plague of HPC and all. Again, somewhat surprisingly, SIMD-related metrics are strong with 28%. Hardly anyone looks at cross-NUMA domain traffic, and literally nobody uses hardware performance counters to study code balance (inverse computational intensity).
  • Do you use performance models in your optimization efforts? The previous two questions already gave a hint towards the outcome of this one:  41% answered “No,” 21% answered “Yes,” and 24% chose “What is this performance modeling stuff anyway?” Way to go.
  • Which performance model(s) do you use? Among those who do use performance models, Roofline is (expectedly) most popular by far, although there’s not much statistics with six voters.
  • The remaining questions have dealt with how useful it would be to have a dedicated “code optimization community” (whatever that means in practice), which received almost unanimous approval, and how attendees rated the BoF as a whole. 62% want to see more of this stuff in the future, which we interpret as encouraging. Two people even considered it the “best BoF ever.” So there.

As I wrote above, there is a strong bias in these answers, and the results are clearly not representative of the HPC community. But if we accept that those who attended the BoF are really interested in optimizing application code, the conclusion must be that recent developments in tools and models trickle into this community at a very slow rate.