Georg Hager's Blog

Random thoughts on High Performance Computing

Content

Fooling the masses – Stunt 5: Instead of performance, plot absolute runtime versus CPU count!

(See the prelude for some general information about what this is all about)

Runtime versus CPU count

Fig. 1: If you plot the program’s runtime versus CPU count at strong scaling, nobody will be able to tell whether scalability is good or bad at the larger CPU counts.

Apart from using a logarithmic scale (see Stunt 3), deficiencies in scalability can also be hidden effectively by plotting absolute runtime. This works for strong scaling only, since weak scaling shows constant or increasing runtime when the number of CPUs goes up.

Of course you should start at a very low number of CPUs (and not use log scale this time, or you’ll be caught immediately). Then, runtimes at large CPU counts will be compressed to a small ordinate range, and nobody will be able to tell the difference between a 1.5 or a 2.0 speedup — just as you need it (see Fig. 1). The casual observer is always stunned by the strong decrease in runtime, without noticing its insignificance. Even if the curve starts turning upward at some critical CPU count, you can often get away with the statement that your code scales up to this point, even though parallel efficiency has gone down the drain much sooner.

Excel eye candy obfuscation fury

Fig. 2: A little 3D and texture sugar will hide unpleasant truths even better. And yes, the numbers are very different at large CPU counts!

To make the results even more striking you can use the CPU time per core instead of overall wallclock time. Depending on how you determine CPU time, overhead like OS calls (system time) may be omitted and you get better-looking results. Not that it would matter in any way, because it’s all swamped by the scale.

A very popular variant of runtime graphs is the “obfuscation by eye candy fury”. Excel makes it so easy to add visual sugar to your graphs; you can literally click your audience into oblivion: 3D beveled columns, drop shadows, fill gradients on walls (and columns!), fancy textures, etc. etc.  A large number of data sets will also help (but see Stunt 11 for the full potential). Combined with the runtime-vs-CPU-count trick from above, comparing different systems without revealing the truth is a solved problem (see Fig. 2). Don’t forget to choose colors that will be indistinguishable in a grayscale printout; readers can’t criticize what they can’t see…

Fooling the masses – Stunt 4: Quietly employ weak scaling to show off!

(See the prelude for some general information about what this is all about)

As we have seen in Stunt 2 about slowing down code execution, making all parts in the speedup formula except communication overhead more dominant is good for the straightness of your scalability graph. Weak scaling, i.e., keeping the problem size per worker constant while increasing the number of workers, is probably the simplest way to achieve this (\alpha=1 in the speedup formula). Neglecting boundary effects, many codes will then show a constant ratio of communication overhead versus computation time; given a reasonably nonblocking network hierarchy, linear scalability often comes for free, at least for a large enough number of nodes. A positive side effect is that you don’t even have to recompile, let alone change your code.You can then even show real performance numbers, not just speedups!

Fig. 1: Weak scaling with a dominating serial part. The scalability function has a small slope, but the “work” done in the parallel part scales perfectly.

In case there is no communication problem but a large serial part, just by choosing the right metric to look at you can get perfect scaling, i.e., S(N)=N. Here’s how it works: Let’s assume that, despite weak scaling, your code is dominated by a sequential part. Hence, it’s S(N)=s+(1-s)N. The function is linear but has a small slope, which doesn’t look good and leads to a parallel efficiency \varepsilon(N)=S(N)/N that is much smaller than 1. However, looking at how the processors spend their time it is clear that the speedup function for the  purely parallel work per time unit, i.e., not counting any overhead, is linear in N with a slope of 1  (see Fig. 1). How can this be used to polish performance numbers? Just report a performance metric that is valid only for the purely parallel part of the application! MFlop/s or MLup/s (lattice site updates) will do fine, for instance. If the “overhead” part contains no lattice updates, the MLup/s metric has a perfect speedup function S_\mathrm{mlups}(N)=N, the performance graph is linear with no y-intercept, and \varepsilon(N)=1, even though all but one processor are twiddling their thumbs most of the time. Problem solved. Stay tuned for Stunt 10, where we will show that a similar trick is possible for communication-dominated cases as well.

The reason why it works is that we have chosen a different notion of what “work” means in the speedup formula. Speedup is defined as

S(N)=\frac{\text{work/time with \textit{N} workers}}{\text{work/time with 1 worker}}~,

no matter whether the amount of work varies with N or not. While there is no discussion about what “time” means, we can give different meanings to “work”. If “work” is chosen so that only operations in the parallel part are counted, parallel efficiency is perfect at weak scaling (\alpha=1). That trick cannot be pulled at strong scaling, though (it is left to the reader to prove this…).

It is certainly advisable to omit any mention of the fact that weak scaling was used. Your audience will cheer and rejoice in the light of such straight lines! However, don’t let them watch too long or they’ll ask nasty questions…

This stunt is essentially #4 of Bailey’s original “Twelve ways to fool the masses.”

Fooling the masses – Stunt 3: The log scale is your friend!

(See the prelude for some general information about what this is all about)

Sometimes a data plot just isn’t straight enough. You know that your code scales, but the bloody cluster stubbornly ignores this fact and the performance graph looks like a skew-whiff banana:

How are you supposed to drive your point home? Do not despair! Help is on the way. We’ll be guided by the Great Old Ones and just use a log scale. Whether it’s lin-log, log-lin, or log-log should be determined according to your particular needs for obfuscation. The point is that a log scale tends to underemphasize deviations between graphs in the same plot. Here we have chosen the log-log variant:

Doesn’t this look much nicer? You can easily attach the “almost linear scaling” label to it, although parallel efficiency is barely above 60% at the largest number of workers.

Fooling the masses – Stunt 2: Slow down code execution!

(See the prelude for some general information about what this is all about)

Common sense dictates that whenever you speed up any part of an application, be it computation, communication, or I/O, time to solution must go down. Why should one then try to slow down computations? In a sense, this stunt is similar to Stunt 1, but there’s more to it: Whenever there is some parallel overhead that adds to pure code execution time, the denominator in our “speedup” formula from Stunt 1 gets larger, impeding scalability. To make the discussion more general, let’s look at the speedup for parallel execution with N workers and a sequential part s, and scale the parallel problem size with a factor proportional to N^\alpha:

\large S(N)=\frac{s+(1-s)N^\alpha}{s+(1-s)N^{\alpha-1}+c_\alpha(N)}

Here, c_\alpha(N) contains all the overhead that is not directly related to code execution: Communication, I/O, synchronization, etc. The parameter \alpha can be used to set the problem size scaling: \alpha=0 for strong scaling, \alpha=1 for weak scaling. Now if the “computation” parts of this expression, i.e., everything except c_\alpha(N), get larger (e.g., by a factor of \mu>1), the impact of overhead goes down by just this factor:

\large S_\mu(N)=\frac{\mu(s+(1-s)N^\alpha)}{\mu(s+(1-s)N^{\alpha-1})+c_\alpha(N)}=\frac{s+(1-s)N^\alpha}{s+(1-s)N^{\alpha-1}+\color{red}{c_\alpha(N)\mu^{-1}}}

In layman’s terms, this effect can be summarized as “A slow machine scales better,” and it is one of the key reasons why Stunt 1 works.

Three corollaries immediately follow from this:

  1. Do not use high compiler optimization levels or the latest compiler versions. This is always possible if the machine you use just isn’t slow enough.
  2. Use a convoluted C++ framework that hides all performance complexities by neatly overloaded operators and template mechanics. You can then claim that, since the compiler will generate “optimal” code, performance is not your concern any more.
  3. If scalability is still bad, parallelize some short loops with OpenMP. That way you can get some extra bonus for a scalable hybrid code! Everyone knows today that “one should go hybrid”, even if there’s no indication that this will do any good.

If someone asks for time to solution, answer that if you had a bigger machine, you could get the solution as fast as you want. This is of course due to the superior scalability of your code!

However, let’s not forget that there are valid arguments for machines with slow processors like the (now extinct) IBM Blue Gene. Apart from the power consumption issue (a core that is \mu times slower than a standard x86 core consumes far less than 1/\mu times the power), it can be beneficial to use \mu N slow CPUs instead of N fast ones, if communication overhead has a certain dependence on N. See our book for a theoretical treatment of “slow computing”.

Fooling the masses – Stunt 1: Report speedup, not absolute performance!

(See the prelude for some general information about what this is all about.)

Have you ever been stuck with a slow machine, but needed to compare it with something else (much faster) you didn’t have the wits or guts to use? Or even worse, you want to sell one of those slow machines and no matter what you do, you just can’t get your codes run faster than your competition? This stunt may be for you. On a very simple level, if we define “speedup” as

\large S(N)=\frac{\text{work/time with \textit{N} workers}}{\text{work/time with 1 worker}}

it is clear that it is a gift from heaven: If S\approx N we speak of “good scalability”, but there is no indication of how fast a certain problem can be solved, or even how many “operations” per time unit are performed. Note that the speedup definition above works for strong and weak scaling scenarios alike.

Stunt 1 - Performance vs. speedup

In this example we see a comparison between some “big iron” machine, let’s call it “NEC”, and a standard cluster, which you would like to show off in your presentation. As you can see on the left, the big machine outperforms the cluster by far on a worker-by-worker basis; however, if the one-worker performance is normalized to one we see that the cluster “scales better”. Whatever the particular reasons for this may be, presenting scalability (or speedup) instead of absolute performance is key.

Certainly, not all audiences are so easily deceived, but labeling your talk with the word “executive” somewhere in the title will fend off the geeks and leave you with a convenient flock of suits who will eat what you give them.

Fooling the masses with performance results on parallel computers – prelude

In 1991, David H. Bailey published his insightful “Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers.” In that humorous article, Bailey pinpointed typical “evade and disguise” techniques for presenting mediocre performance results in the best possible light. These are the original 12 ways:

  1. Quote only 32-bit performance results, not 64-bit results.
  2. Present performance figures for an inner kernel, and then represent these figures as the performance of the entire application.
  3. Quietly employ assembly code and other low-level language constructs.
  4. Scale up the problem size with the number of processors, but omit any mention of this fact.
  5. Quote performance results projected to a full system.
  6. Compare your results against scalar, unoptimized code on Crays.
  7. When direct run time comparisons are required, compare with an old code on an obsolete system.
  8. If MFLOPS rates must be quoted, base the operation count on the parallel implementation, not on the best sequential implementation.
  9. Quote performance in terms of processor utilization, parallel speedups or MFLOPS per dollar.
  10. Mutilate the algorithm used in the parallel implementation to match the architecture.
  11. Measure parallel run times on a dedicated system, but measure conventional run times in a busy environment.
  12. If all else fails, show pretty pictures and animated videos, and don’t talk about performance.

There are further explanations in the original paper for each item.

After two decades, it’s high time for an update. In 1991 the supercomputing landscape was governed by the “chicken vs. oxen” debate: The famous question “If you were plowing a field, which would you rather use?… Two strong oxen or 1024 chickens?” is attributed to Seymour Cray who couldn’t have said it better. Cray’s machines were certainly dominating in the oxen department, but competition from massively parallel systems like the Connection Machine was building up. At that time, users were much more used to dive into system-specific optimizations — with no MPI and OpenMP standards, portability of parallel programs was pretty much restricted to a certain vendor. And the use of double precision floating point was probably not as much a matter of course as it is today.

In the past two decades, hybrid, hierarchical systems, multi-core processors, accelerator technology, and the dominating presence of commodity hardware have reshaped the landscape of High Performance Computing. It’s also not so much oxen vs. chickens anymore; ants have received more than their share of hype. However, some things never change. My points (which I prefer to call “stunts”) are derived from Bailey’s original collection, and some are identical or merely reformulated. Others are new, reflecting today’s boundary conditions.

Although these musings are certainly inspired by experience with many publications and talks in HPC, I wish to point out that (i) no offense is intended, (ii) I am not immune to the inherent temptations myself and (iii) this all still just meant to be fun.

This is the list of stunts. It will be extended along the way:

  1. Report speedup instead of absolute performance!
  2. Slow down code execution!
  3. The log scale is your friend!
  4. Quietly employ weak scaling to show off!
  5. Instead of performance, plot absolute runtime versus CPU count!
  6. Ignore affinity and topology issues! 
  7. Be creative when comparing scaled performance!
  8. Impress your audience with awe-inspiring accuracy!
  9. Boast massive speedups with accelerators!
  10. Always emphasize the “interesting” part of your work!
  11. Show data! Plenty. And then some.
  12. Redefine “performance” to suit your needs!
  13. If they get you cornered, blame it all on OS jitter!
  14. Secretly use fancy hardware setups and software tricks!
  15. Play mysterious!
  16. Worship the God of Automation!

 

 

LIKWID tools online

LIKWID stands for “Like I Knew What I’m Doing”.

Knowing a little more can never be wrong in HPC. LIKWID is a small set of Linux tools (developed by Thomas Röhl and Jan Eitzinger) that simplify getting information about a multi-core CPU and the code running on it:

  • likwid-topology: Show the thread and cache group structure (“topology”) on Intel and AMD processors
  • likwid-features: Show and Toggle hardware prefetch control bits on Intel Core 2 processors
  • likwid-perfctr: Measure hardware performance counters on Intel and AMD processors
  • likwid-pin: Pin threads and processes to cores from outside an application

These tools can help users a lot with “knowing what they are doing” when looking at code performance on their multi-core CPUs. likwid-perfctr is meant to be a rough equivalent of the perfex tool, which was available on SGI Origins as part of the Speedshop performance toolset (ah, sweet memories). It counts a configurable set of hardware performance metrics across the whole runtime of a program, or between markers you can insert into your code (similar to the calipers in Speedshop). It is much simpler and less powerful than things like Oprofile, CodeAnalyst, or VTune, but sometimes an overview is more than sufficient to know what’s going on. Plus, it does not require any kernel patches and works with Intel and AMD processors alike.

LIKWID is under active development. Check out the project, try the tools and provide feedback!

A case for the non-temporal store

Non-temporal stores (also called “streaming stores”) were added to x86 processors with SSE2 (i.e. when the first Pentium IV called “Willamette” appeared in 2000). There was also some support in MMX before, but nobody should be required to use MMX any more. In contrast to standard store instructions which transfer data from registers to memory, NT stores do not require a prior cache line read for ownership (RFO) but write to memory “directly” (well not quite – but to first order it’s true). Beware the usual explanations for RFO that Google finds for you – they almost all describe RFO as something inherently connected to cache coherency in multi-processor systems, but RFO is also required on a single core, whenever there is a cache that is organized in lines.

What are NT stores good for? Again, most references cite them as a means to avoid cache pollution in cases where stored data will not be re-used soon. That’s true to some extent; every word that you don’t store in your cache means more room for other, more frequently used data. But that’s only one half the story. Imagine you have a code whose working set fits into the outer-level cache and which produces data to be stored in memory. Using NT stores for such a code will probably slow it down because, depending on the LD/ST balance, performance is then limited by the memory interface. The true benefit of NT stores can be seen with a truly memory-bound code which is not dominated by loads, like, e.g., the STREAM COPY benchmark, relaxation methods, the lattice-Boltzmann algorithm, or anything that can be formulated in terms of stride-one store streams: By saving the RFO, the pressure on the memory subsystem is reduced. See, e.g., my talk at the KONWIHR results and review workshop in December 2007 at LRZ where the performance boost through NT stores was demonstrated using a 2D Jacobi (5-point stencil) code.

The most important non-temporal store instruction is movntpd, which writes the contents of a full 16-byte SSE register (xmm0…xmm15) to memory. The memory address has to be a multiple of 16, else an exception will be generated. Also, this instruction exists in a “packed” variant only, i.e. there is no movntsd that would only store the lower 8 bytes of the register (AMD has included it, though, into their SSE4a extensions which are implemented with the K10 CPUs – read a nice writeup by Rahul Chaturvedi from AMD for more information). So, if you are stuck with Intel processors or a compiler which does not know about movntsd on K10, loops which use NT stores must be vectorizable and the store stream(s) must be 16-byte aligned.

Or so I thought, for a long time. This assumption was backed, of course, by stupid compilers who blatantly refused to use NT stores if they weren’t 100% sure at compile time whether the store stream was actually aligned. Never heard of loop peeling, eh? Thank goodness great Intel gave us the vector aligned pragma, which you can put in front of a loop in order to tell the compiler that the streams appearing inside the loop were 16-byte aligned! But that always refers to all streams, including reads, and if you somehow forget to pay proper attention, you’re immediately punished by an exception again. To make a long story short, you never really know what’s going on inside the compiler’s twisted little brain, and the diagnostics don’t help either (“LOOP WAS VECTORIZED”, yes, thank you very much).

In hope for some future improvement (unaligned NT store? Or *gasp* maybe even a scalar one?) I looked into Intel’s AVX extensions for SSE, to be implemented in the Sandy Bridge processor, which is to follow Westmere in 2010. That would be a “Streaming SIMD Extension Extension”, right? Whatever, there is neither a scalar nor an unaligned packed NT store in there. But I stumbled across something that had been there since SSE2: The beautiful maskmovdqu instruction. From the manual:


MASKMOVDQU xmm1,xmm2 – Store Selected Bytes of Double Quadword with NT Hint

Stores selected bytes from the source operand (first operand) into an 128-bit memory location. The mask operand (second operand) selects which bytes from the source operand are written to memory. The source and mask operands are XMM registers. The location of the first byte of the memory location is specified by DI/EDI/RDI and DS registers. The memory location does not need to be aligned on a natural boundary. […]


Way cool. Although the fixed address register is somewhat inflexible, the instruction is even better than pure scalar or pure unaligned NT store – it can store any part of an XMM register, down to the byte level, and with no alignment restrictions. And there’s a C/C++ compiler intrinsic, too. That has made it easy to implement our beloved vector triad in explicit SSE. These are two possible variants:

Scalar (unaligned) Packed unaligned
  __m128d xmm1,xmm2,xmm3,xmm4;
  __m128i mask_low=_mm_set_epi32(0,0,-1,-1);
#pragma omp parallel private(j)
{
if(size > 1000000) {
  for(j=0; j<niter; j++){
#pragma omp for private(xmm1,xmm2,xmm3,xmm4)
    for(i=0; i<size; i++) {
      xmm1 = _mm_load_sd(d+i);
      xmm2 = _mm_load_sd(c+i);
      xmm3 = _mm_load_sd(b+i);
      xmm4 = _mm_mul_sd(xmm1,xmm2);
      xmm4 = _mm_add_sd(xmm4,xmm3);
      _mm_maskmoveu_si128(reinterpret_cast<__m128i>(xmm4), mask_low, (char*)(a+i));
    }
  }
} else {
// same with standard store
}
  __m128d xmm1,xmm2,xmm3,xmm4;
  __m128i mask_all=_mm_set_epi32(-1,-1,-1,-1);
#pragma omp parallel private(j)
{
if(size > 1000000) {
  for(j=0; j<niter; j++){
#pragma omp for private(xmm1,xmm2,xmm3,xmm4,xmm5)
    for(i=0; i<size; i+=2) {
      xmm1 = _mm_load_pd(d+i);
      xmm2 = _mm_load_pd(c+i);
      xmm3 = _mm_load_pd(b+i);
      xmm4 = _mm_mul_pd(xmm1,xmm2);
      xmm4 = _mm_add_pd(xmm4,xmm3);
      _mm_maskmoveu_si128(reinterpret_cast<__m128i>(xmm4), mask_all, (char*)(a+i));
    }
  }
} else {
// same with standard store
}

The left one is an example for a purely scalar code which employs NT stores. The right one shows how a vectorized code can be endowed with NT stores if 16-byte alignment cannot be enforced. The compiler does neither by itself. Of course, NT stores only make sense if the arrays are large. I have set the lower limit to N=1000000, which is largish compared to the cache sizes but neatly shows the effect of NT vs. standard store:

The figure includes standard scalar versions without NT stores as well. Obviously, the maskmovdqu instruction does the job. There is hardly any performance loss in the serial case, and none at all with four threads. I have omitted any comment on in-cache behavior of the different variants; this is a separate story (and one which becomes more interesting once Nehalem is out!).

If you want to know more about using SSE intrinsics in C/C++ code, the Intel C++ Compiler for Linux Intrinsics Reference is for you.

The “roofline model” for kernel performance assessment

Sam Williams from UCB has come up with a very nice method to illustrate optimization potential for loop kernels on a known architecture. Everyone who knows about things like code and machine balance can estimate the expected fraction of “light speed” for some loop kernel. However, depending on your knowledge (or your assumptions) about the architecture under consideration, machine balance can be a moving target: Do you consider SIMD instructions to be applicable? Does the data set fit into some cache? Can the arithmetic pipelines be used to their full capacity? Are MULTs and ADDs balanced in the code? Is prefetching possible? Can non-temporal stores be used? Usually, we compute different machine balance numbers for all those cases to get our estimates.

Williams has found a very nice way to incorporate all this information into a graphical representation, the roofline diagram. With it, one can illustrate not only the architectural limits for kernel performance, but also the optimization potential of some (given) implementation. Read the full presentation: The Roofline Model: A pedagogical tool for program analysis and optimization. There is also a nice poster.

Floating point computations demystified

For those who constantly wonder why floating-point results for the same numerical problem tend to vary across architectures, compilers, compiler options and parallel program runs, this paper may be interesting:

David Monniaux, The pitfalls of verifying floating-point computations. arXiv:cs/0701192v5

Especially the first chapters form a nice intro on things you always wanted to know about floating-point but never dared to ask…