Georg Hager's Blog

Random thoughts on High Performance Computing

Inhalt

Fooling the masses – Stunt 6: Ignore affinity and topology issues!

AMD Epyc 7451 node

Fig. 1: A dual-socket AMD Epyc 7451 node with four “Zeppelin” dies per socket (eight ccNUMA domains per node). Each 8 MiB L3 cache is shared by three cores, i.e., half the die.

(See the prelude for some general information on what this is all about)

Real scientists are not bothered by details such as multicore, cache groups, ccNUMA, SMT, or network hierarchies. Those are just parts of a vicious plot to take the fun out of computing. Ignoring the associated issues like bandwidth bottlenecks, NUMA locality and contention penalties, file system buffer cache, synchronization overheads, etc., will make them go away.

If you present performance data and people ask specific questions about affinity and topology awareness, answer that it’s the OS’s or the compiler’s job, or the job of some auto-tuning framework, to care about those technicalities. Probably you can show off by proving that your code scales even though you run it without topology awareness; this can be easily achieved by applying Stunt 2 – just slow down code execution enough to make all overheads vanish.

If you run into problems like mysterious performance fluctuations, or performance being too low altogether (a direct but often ignored consequence of applying Stunt 2), blame the system administrators. Being lowly minions of top-achieving scientists like you, they will have the time to figure out what’s wrong with their hardware. If they tell you it’s all your fault, send a scathing e-mail to their boss and cc: your university president or company CTO, just for good measure. Finally, if all else fails, publish a paper at a high-profile conference stating that the hardware of manufacturer X shows horrible performance, especially together with compiler Y and library Z, and that the cluster you had access to was too small to get the results that you wanted in time. That’s why you are about to write a generous research proposal for a federal supercomputing facility. Anything less just won’t cut the mustard!

 

The 400x GPU speedup baloney

Recently, in the HPC services office…

(… phone rings …)

“Computing Center, HPC services. How can I help you?”

“Yes, High Performance Computing.”

“You want to use our GPGPU cluster? Great! The load on this baby could be higher anyway. It’s hard to believe, but people seem to avoid it like the plague (jovial laughter). Do you have a code that runs on GPUs already?”

“I see, the compiler should be able to handle this. But the code is SIMD vectorized for standard processors, right?”

“No, this has nothing to do with cell phones. SIMD means ‘Single Instruction Multiple Data,’ i.e., several operations can be performed on different operands with a single machine instruction. If that works, chances are that the program can be run efficiently on a GPU as well. After all, GPUs implement the SIMD principle quite extensively.”

“Hm? I think I don’t understand…”

“Ah, ok. No, you don’t have to learn assembly programming to do this. But you may have to think a little more about how the compiler looks at your code. Often you can help it by supplying additional information, like source directives. And of course you need to use a compiler that understands what SIMD is. Alas, the GNU compilers don’t have a clue about it, mostly. By the way, how  have you parallelized the code?”

“Um, no. Compilers can’t help you much here, except for very simple cases where a 10-year-old can do it just as well. But you do have to parallelize. How do you want to draw a meaningful comparison to the GPU version?”

“What do you mean, you don’t need to do this?”

“Um, yes, I think I’ve seen this paper recently. It should be on my desk somewhere… (paper rustling) And what exactly are you referring to?”

“Section 4.3, just a sec… here we are: ‘As shown in Fig. whatever, we have achieved a 400x speedup on an NVIDIA Tesla GPU as compared with our CPU implementation.’ (long pause)”

“Sorry, no, I’m still here. I’ve just been looking for the details of the CPU implementation. One moment… (longer pause) Ok, here’s something in the caption of the pretty CFD visualization: ‘In order to avoid issues with OpenMP parallelization we have run the CPU version on a single core.’ Wow. This has to sink in. And if I’m not mistaken, they compare a single-precision GPU code with a double precision version on the CPU. Truly hilarious.”

“No, I’m not making fun of those scientists. But ‘scientists’ is not the word I would use. They obviously think that everyone else is stupid.”

“Why? Because a factor of 400 is impossible. Neither the floating-point peak performance nor the memory bandwidth of the GPU is 400 times larger than that of a current standard compute node, or a chip, or even a single core. So they must have fabricated a slow CPU code on purpose. Realistically one may expect a factor of 2-4 if you compare a reasonably optimized CPU code on a single node with a single GPU, and use the same precision on both.”

“Yes, I agree. 2-4 isn’t bad at all. But that’s just counting the raw GPU. How would the data transfer affect the performance of you code? Can you estimate this?”

“Well, somehow the data must be brought into the GPU and the results must be copied back so that they don’t start rotting over there…”

“Yes (sighing with resignation). Compilers are smart. But there are limits. If you copy the whole problem through the PCI bus after every step, the only way to exploit the performance advantage of the GPU is to perform ridiculously many flops. It’s all a matter of code balance.”

“Code balance tells you how much data transfer your code needs per executed floating-point operation – and you have to count everything, including communication via buses, the network, the memory interface, etc. This can add up to quite some data volume. And then you compare with the amount of data the hardware can transfer per peak flop executed, which gives you an estimate for the performance.”

“You don’t know how many flops and how much data transfer your code needs? Can’t you just count that by looking at it? In most cases, compute time is spent in a very limited number of numerical kernel loops.”

“No, the compiler doesn’t count that for you (keeping composure with obvious effort). It’s also a matter of optimization; perhaps one can reduce the data transfers a bit. One would have to take a look at the code.”

“The compiler? (devastated) Yes, you should try that. Definitely… No listen, people are much too enthusiastic about the compiler’s abilities. Compilers can not read your mind, they can’t even look through the standard C++ template tricks that programmers are so fond of. Let alone generate optimal code for GPUs.”

“Ok, err, sorry, are you crying? (nonplussed) Please don’t. May I suggest that we sit together over a cup of coffee and I give you a crash course in basic performance modeling? Don’t worry, everything’s going to be alright. There, there…”

Fooling the masses – Stunt 5: Instead of performance, plot absolute runtime versus CPU count!

(See the prelude for some general information about what this is all about)

Runtime versus CPU count

Fig. 1: If you plot the program’s runtime versus CPU count at strong scaling, nobody will be able to tell whether scalability is good or bad at the larger CPU counts.

Apart from using a logarithmic scale (see Stunt 3), deficiencies in scalability can also be hidden effectively by plotting absolute runtime. This works for strong scaling only, since weak scaling shows constant or increasing runtime when the number of CPUs goes up.

Of course you should start at a very low number of CPUs (and not use log scale this time, or you’ll be caught immediately). Then, runtimes at large CPU counts will be compressed to a small ordinate range, and nobody will be able to tell the difference between a 1.5 or a 2.0 speedup — just as you need it (see Fig. 1). The casual observer is always stunned by the strong decrease in runtime, without noticing its insignificance. Even if the curve starts turning upward at some critical CPU count, you can often get away with the statement that your code scales up to this point, even though parallel efficiency has gone down the drain much sooner.

Excel eye candy obfuscation fury

Fig. 2: A little 3D and texture sugar will hide unpleasant truths even better. And yes, the numbers are very different at large CPU counts!

To make the results even more striking you can use the CPU time per core instead of overall wallclock time. Depending on how you determine CPU time, overhead like OS calls (system time) may be omitted and you get better-looking results. Not that it would matter in any way, because it’s all swamped by the scale.

A very popular variant of runtime graphs is the “obfuscation by eye candy fury”. Excel makes it so easy to add visual sugar to your graphs; you can literally click your audience into oblivion: 3D beveled columns, drop shadows, fill gradients on walls (and columns!), fancy textures, etc. etc.  A large number of data sets will also help (but see Stunt 11 for the full potential). Combined with the runtime-vs-CPU-count trick from above, comparing different systems without revealing the truth is a solved problem (see Fig. 2). Don’t forget to choose colors that will be indistinguishable in a grayscale printout; readers can’t criticize what they can’t see…

Fooling the masses – Stunt 4: Quietly employ weak scaling to show off!

(See the prelude for some general information about what this is all about)

As we have seen in Stunt 2 about slowing down code execution, making all parts in the speedup formula except communication overhead more dominant is good for the straightness of your scalability graph. Weak scaling, i.e., keeping the problem size per worker constant while increasing the number of workers, is probably the simplest way to achieve this (\alpha=1 in the speedup formula). Neglecting boundary effects, many codes will then show a constant ratio of communication overhead versus computation time; given a reasonably nonblocking network hierarchy, linear scalability often comes for free, at least for a large enough number of nodes. A positive side effect is that you don’t even have to recompile, let alone change your code.You can then even show real performance numbers, not just speedups!

Fig. 1: Weak scaling with a dominating serial part. The scalability function has a small slope, but the “work” done in the parallel part scales perfectly.

In case there is no communication problem but a large serial part, just by choosing the right metric to look at you can get perfect scaling, i.e., S(N)=N. Here’s how it works: Let’s assume that, despite weak scaling, your code is dominated by a sequential part. Hence, it’s S(N)=s+(1-s)N. The function is linear but has a small slope, which doesn’t look good and leads to a parallel efficiency \varepsilon(N)=S(N)/N that is much smaller than 1. However, looking at how the processors spend their time it is clear that the speedup function for the  purely parallel work per time unit, i.e., not counting any overhead, is linear in N with a slope of 1  (see Fig. 1). How can this be used to polish performance numbers? Just report a performance metric that is valid only for the purely parallel part of the application! MFlop/s or MLup/s (lattice site updates) will do fine, for instance. If the “overhead” part contains no lattice updates, the MLup/s metric has a perfect speedup function S_\mathrm{mlups}(N)=N, the performance graph is linear with no y-intercept, and \varepsilon(N)=1, even though all but one processor are twiddling their thumbs most of the time. Problem solved. Stay tuned for Stunt 10, where we will show that a similar trick is possible for communication-dominated cases as well.

The reason why it works is that we have chosen a different notion of what “work” means in the speedup formula. Speedup is defined as

S(N)=\frac{\text{work/time with \textit{N} workers}}{\text{work/time with 1 worker}}~,

no matter whether the amount of work varies with N or not. While there is no discussion about what “time” means, we can give different meanings to “work”. If “work” is chosen so that only operations in the parallel part are counted, parallel efficiency is perfect at weak scaling (\alpha=1). That trick cannot be pulled at strong scaling, though (it is left to the reader to prove this…).

It is certainly advisable to omit any mention of the fact that weak scaling was used. Your audience will cheer and rejoice in the light of such straight lines! However, don’t let them watch too long or they’ll ask nasty questions…

This stunt is essentially #4 of Bailey’s original “Twelve ways to fool the masses.”

Fooling the masses – Stunt 3: The log scale is your friend!

(See the prelude for some general information about what this is all about)

Sometimes a data plot just isn’t straight enough. You know that your code scales, but the bloody cluster stubbornly ignores this fact and the performance graph looks like a skew-whiff banana:

How are you supposed to drive your point home? Do not despair! Help is on the way. We’ll be guided by the Great Old Ones and just use a log scale. Whether it’s lin-log, log-lin, or log-log should be determined according to your particular needs for obfuscation. The point is that a log scale tends to underemphasize deviations between graphs in the same plot. Here we have chosen the log-log variant:

Doesn’t this look much nicer? You can easily attach the “almost linear scaling” label to it, although parallel efficiency is barely above 60% at the largest number of workers.

Fooling the masses – Stunt 2: Slow down code execution!

(See the prelude for some general information about what this is all about)

Common sense dictates that whenever you speed up any part of an application, be it computation, communication, or I/O, time to solution must go down. Why should one then try to slow down computations? In a sense, this stunt is similar to Stunt 1, but there’s more to it: Whenever there is some parallel overhead that adds to pure code execution time, the denominator in our “speedup” formula from Stunt 1 gets larger, impeding scalability. To make the discussion more general, let’s look at the speedup for parallel execution with N workers and a sequential part s, and scale the parallel problem size with a factor proportional to N^\alpha:

\large S(N)=\frac{s+(1-s)N^\alpha}{s+(1-s)N^{\alpha-1}+c_\alpha(N)}

Here, c_\alpha(N) contains all the overhead that is not directly related to code execution: Communication, I/O, synchronization, etc. The parameter \alpha can be used to set the problem size scaling: \alpha=0 for strong scaling, \alpha=1 for weak scaling. Now if the “computation” parts of this expression, i.e., everything except c_\alpha(N), get larger (e.g., by a factor of \mu>1), the impact of overhead goes down by just this factor:

\large S_\mu(N)=\frac{\mu(s+(1-s)N^\alpha)}{\mu(s+(1-s)N^{\alpha-1})+c_\alpha(N)}=\frac{s+(1-s)N^\alpha}{s+(1-s)N^{\alpha-1}+\color{red}{c_\alpha(N)\mu^{-1}}}

In layman’s terms, this effect can be summarized as “A slow machine scales better,” and it is one of the key reasons why Stunt 1 works.

Three corollaries immediately follow from this:

  1. Do not use high compiler optimization levels or the latest compiler versions. This is always possible if the machine you use just isn’t slow enough.
  2. Use a convoluted C++ framework that hides all performance complexities by neatly overloaded operators and template mechanics. You can then claim that, since the compiler will generate “optimal” code, performance is not your concern any more.
  3. If scalability is still bad, parallelize some short loops with OpenMP. That way you can get some extra bonus for a scalable hybrid code! Everyone knows today that “one should go hybrid”, even if there’s no indication that this will do any good.

If someone asks for time to solution, answer that if you had a bigger machine, you could get the solution as fast as you want. This is of course due to the superior scalability of your code!

However, let’s not forget that there are valid arguments for machines with slow processors like the (now extinct) IBM Blue Gene. Apart from the power consumption issue (a core that is \mu times slower than a standard x86 core consumes far less than 1/\mu times the power), it can be beneficial to use \mu N slow CPUs instead of N fast ones, if communication overhead has a certain dependence on N. See our book for a theoretical treatment of “slow computing”.

Fooling the masses – Stunt 1: Report speedup, not absolute performance!

(See the prelude for some general information about what this is all about.)

Have you ever been stuck with a slow machine, but needed to compare it with something else (much faster) you didn’t have the wits or guts to use? Or even worse, you want to sell one of those slow machines and no matter what you do, you just can’t get your codes run faster than your competition? This stunt may be for you. On a very simple level, if we define “speedup” as

\large S(N)=\frac{\text{work/time with \textit{N} workers}}{\text{work/time with 1 worker}}

it is clear that it is a gift from heaven: If S\approx N we speak of “good scalability”, but there is no indication of how fast a certain problem can be solved, or even how many “operations” per time unit are performed. Note that the speedup definition above works for strong and weak scaling scenarios alike.

Stunt 1 - Performance vs. speedup

In this example we see a comparison between some “big iron” machine, let’s call it “NEC”, and a standard cluster, which you would like to show off in your presentation. As you can see on the left, the big machine outperforms the cluster by far on a worker-by-worker basis; however, if the one-worker performance is normalized to one we see that the cluster “scales better”. Whatever the particular reasons for this may be, presenting scalability (or speedup) instead of absolute performance is key.

Certainly, not all audiences are so easily deceived, but labeling your talk with the word “executive” somewhere in the title will fend off the geeks and leave you with a convenient flock of suits who will eat what you give them.

Fooling the masses with performance results on parallel computers – prelude

In 1991, David H. Bailey published his insightful “Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers.” In that humorous article, Bailey pinpointed typical “evade and disguise” techniques for presenting mediocre performance results in the best possible light. These are the original 12 ways:

  1. Quote only 32-bit performance results, not 64-bit results.
  2. Present performance figures for an inner kernel, and then represent these figures as the performance of the entire application.
  3. Quietly employ assembly code and other low-level language constructs.
  4. Scale up the problem size with the number of processors, but omit any mention of this fact.
  5. Quote performance results projected to a full system.
  6. Compare your results against scalar, unoptimized code on Crays.
  7. When direct run time comparisons are required, compare with an old code on an obsolete system.
  8. If MFLOPS rates must be quoted, base the operation count on the parallel implementation, not on the best sequential implementation.
  9. Quote performance in terms of processor utilization, parallel speedups or MFLOPS per dollar.
  10. Mutilate the algorithm used in the parallel implementation to match the architecture.
  11. Measure parallel run times on a dedicated system, but measure conventional run times in a busy environment.
  12. If all else fails, show pretty pictures and animated videos, and don’t talk about performance.

There are further explanations in the original paper for each item.

After two decades, it’s high time for an update. In 1991 the supercomputing landscape was governed by the “chicken vs. oxen” debate: The famous question “If you were plowing a field, which would you rather use?… Two strong oxen or 1024 chickens?” is attributed to Seymour Cray who couldn’t have said it better. Cray’s machines were certainly dominating in the oxen department, but competition from massively parallel systems like the Connection Machine was building up. At that time, users were much more used to dive into system-specific optimizations — with no MPI and OpenMP standards, portability of parallel programs was pretty much restricted to a certain vendor. And the use of double precision floating point was probably not as much a matter of course as it is today.

In the past two decades, hybrid, hierarchical systems, multi-core processors, accelerator technology, and the dominating presence of commodity hardware have reshaped the landscape of High Performance Computing. It’s also not so much oxen vs. chickens anymore; ants have received more than their share of hype. However, some things never change. My points (which I prefer to call “stunts”) are derived from Bailey’s original collection, and some are identical or merely reformulated. Others are new, reflecting today’s boundary conditions.

Although these musings are certainly inspired by experience with many publications and talks in HPC, I wish to point out that (i) no offense is intended, (ii) I am not immune to the inherent temptations myself and (iii) this all still just meant to be fun.

This is the list of stunts. It will be extended along the way:

  1. Report speedup instead of absolute performance!
  2. Slow down code execution!
  3. The log scale is your friend!
  4. Quietly employ weak scaling to show off!
  5. Instead of performance, plot absolute runtime versus CPU count!
  6. Ignore affinity and topology issues! 
  7. Be creative when comparing scaled performance!
  8. Impress your audience with awe-inspiring accuracy!
  9. Boast massive speedups with accelerators!
  10. Always emphasize the “interesting” part of your work!
  11. Show data! Plenty. And then some.
  12. Redefine “performance” to suit your needs!
  13. If they get you cornered, blame it all on OS jitter!
  14. Secretly use fancy hardware setups and software tricks!
  15. Play mysterious!
  16. Worship the God of Automation!