Georg Hager's Blog

Fooling the masses – Stunt 11: Show data! Plenty. And then some.

Posted on by Georg Hager

(See the prelude for some general information on what this is all about)


Figure 1: Lots of variants of an experiment, lots of machines, no idea for interpretation – why not show it all?

Computers produce data. Parallel computers produce more data. And doing performance experiments with parallel computers produces even more data, since we often conduct several runs with different code variants on different machines. Now imagine a situation where you’ve got all this data lying around on your disk, but you can’t make any sense of it. Since you wouldn’t bother to establish even a coarse performance model (your setup is just too complex for this kind of bean counting), you don’t know whether some particular performance measurement is “good” or “bad” on some particular machine, or how performance should change when some parameter is altered.

Why not put the cart before the horse and show all the data? This has decisive advantages: You can display all the work you did, without the embarrassing message that five or six numbers is all the relevant data you got from your research. Furthermore, it gives you the opportunity to add “discussion” at any desirable level of detail. For instance, if the page limit for submitting to your favorite conference is ten pages, and you have five pages full of messy bar graphs (such as the one shown in Fig. 1), you can easily fill the rest with meaningless “explanations” like “Variant 11 shows about 90% performance advantage on Machine G over Variant 10 on Machine F while, surprisingly, Variant 9 on Machine A is 6% slower than Variant 7 on Machine C.” You may want to write “92.23%” and “5.982%” here – see Stunt 8 for the reason.

In summary, you can make up for your lack of insight by showing data. Plenty of it. Swamp your readers or your audience with hundreds of colorful bar graphs, scatter plots, timing diagrams, or whatever your favorite data visualization tool has to offer, and describe them at length. If you really want to leave the impression of true understanding, resort to obscure technical details – but this is the theme of another stunt.

 

Fooling the masses – Stunt 10: Always emphasize the “interesting” part of your work!

Posted on by Georg Hager

(See the prelude for some general information on what this is all about)

Have you ever thought about how to improve the aerodynamic properties of bulldozers? Yes? Then this stunt is for you.


Figure 1: Always focus on the interesting stuff!

Code optimization is quite rewarding at times. You put all your effort into mutilating a chunk of code beyond all recognition, and in the end you get a 50% speedup. Often enough, this work requires a great deal of expertise at the interface between hardware and software. And you certainly want to present your success to your peers so they understand how advanced you are.

If you do such a thing, don’t bother picking the low-hanging fruits. When 95% of the run time goes into a boring loop nest which multiplies a dense matrix with a vector, there is not much you can do beyond the standard textbook stuff – or use a library altogether. However, if the remaining 5% consist of a convoluted communication scheme, you’ve found your target. The fact that the possible gain is at most 5% shouldn’t stop you, because it’s all a matter of presentation; see Fig. 1: 3D bars and a proper choice of scales on the axes guide the reader’s eye and help you drive home your point. It’s like presenting the optimized drag coefficient of your aerodynamically improved bulldozer on high gloss paper, and the boring technical data in small black print on a dark blue background, on the back page.

Interestingly, if the optimization target is communication, or synchronization, or any other non-computational overhead, this stunt can be made to work even better: When you strongly scale to a pointless number of workers, bringing down such overheads will give you arbitrary speedups, far beyond the factor of two that is theoretically possible if you restrict yourself to a sensible minimum parallel efficiency of 50%. Coming back to the heavy gear analogy, if you accelerate your bulldozer to 200 mph, the drag coefficient does become essential, and you may feel vindicated. It won’t solve any real-world problem, but hey, this is research after all.
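For the record, here is why the 50% efficiency limit caps the honest gain at a factor of two – a minimal Python sketch of a toy strong-scaling model (the runtime model and all numbers are assumptions for illustration, not measurements):

  # Toy strong-scaling model: per-worker compute time shrinks as 1/N,
  # a fixed communication/synchronization overhead does not. (Assumed numbers.)
  t_comp = 100.0   # total serial compute time [s]
  t_over = 0.05    # non-computational overhead per run [s]

  def runtime(n, overhead=t_over):
      return t_comp / n + overhead

  for n in (10, 100, 1000, 10000):
      before = runtime(n)
      after = runtime(n, overhead=0.0)      # overhead "optimized away" completely
      eff = (t_comp / n) / before           # parallel efficiency of the baseline
      print(f"N={n:6d}  efficiency={eff:6.1%}  gain from removing overhead={before/after:5.2f}x")

As long as the baseline efficiency stays at or above 50%, the printed gain can never exceed 2x; scale to a pointless N and it grows without bound.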

Thanks to Gerhard Wellein for discovering this stunt in a paper. Many papers, actually.

Fooling the masses – Stunt 9: Boast massive speedups with accelerators!

Posted on by Georg Hager

(See the prelude for some general information on what this is all about)

GPGPUs, FPGAs, massively parallel single-chip multiprocessors: Accelerators are cool. The prospect of boosting your application’s performance by 100x or more is mouthwatering, so you invest days and weeks, even months to port and tweak code and make it run fast on that shiny new piece of hardware. However, in the end the outcome may well be a meager 2.5x, or even much less, if you compare to well-optimized parallel code on a standard 2-socket server that everyone can handle.

What happened? Well, a direct peak performance and memory bandwidth comparison reveals that – even without considering Amdahl’s Law and overheads like PCIe transfers – 2x-4x is just what can be expected. Assuming a fair game, of course. And that’s exactly your chance to sugarcoat your mediocre results and fabricate a blazing success! Here are some hints:

  • Compare bleeding-edge accelerator technology with vintage CPUs. A Pentium III will be fine for most purposes today. If you can’t get your hands on one of those, give me a call.
  • Always compare the fully parallel accelerator code with a sequential (single-core) CPU version. You can even argue in favor of this, saying that “serial CPU code constitutes a well-defined baseline.” Be sure to rush ahead to the next slide before anyone dares to ask why you don’t use a single GPGPU “core” to make the comparison even more “well-defined.”
  • Use no optimizations whatsoever with the CPU compiler; -O0 will be fine. Also use a plain, proof-of-concept implementation of your algorithm with no manual code changes. It’s good scientific practice because a rock-bottom baseline is more meaningful!
  • Use single precision on the GPU but double precision on the CPU. That will effectively cut the CPU’s cache, the memory bandwidth, and its peak performance in half.
  • Accelerate only the kernels that are easily portable and report only their performance. Amdahl’s Law and communication overhead will be nothing but smoke and mirrors!

Figure 1: If you do it right, a 200x speedup for the GPU is absolutely feasible (comparing nVIDIA Tesla C2050 and Intel Dual Xeon 5650, dense matrix-vector multiply, matrix size 4500×4500, Intel compiler V12 update 11)

These tricks should give you a competitive edge; 20x-200x reported speedup should be no problem, depending on your unscrupulousness.

See Fig. 1 for an instructive example: A dense matrix-vector multiply using a 4500×4500 matrix is memory-bound on a Fermi-type GPGPU as well as on a standard CPU-based system such as a dual-socket Xeon 5650 (“Westmere”). The GPGPU baseline of about 45 GFlop/s at single precision (SP) is completely in line with the achievable memory bandwidth of roughly 90 GB/s (4 bytes per multiply-add operation, assuming the right-hand side is mostly held in shared memory). To get a slight speedup you can switch off ECC (who cares for some flipped bits?) to get about 120 GB/s, corresponding to 60 GFlop/s. No need to mention that the PCIe transfers to get the RHS and result vectors to and from the device are neglected. So much for the GPGPU.

On the CPU side, the full dual-Xeon node has a bandwidth of close to 40 GB/s, leading to an SP performance of 19 GFlop/s – much too fast to be shown in public! Going to double precision cuts that in half, but there is more: Using just a single core gets you down another factor of 4, and disabling vectorization slows things down by roughly another factor of 2 (since you’re now far away from any bandwidth limitation). Finally, turning off optimization altogether (-O0) lets you hit rock bottom at 248 MFlop/s – a whopping 214x slower than the best GPGPU result! Going to my old P-III is not even necessary any more.
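These numbers are nothing more than a simple bandwidth (roofline-type) estimate for the memory-bound dMVM kernel. Here is the arithmetic as a small Python sketch; the bandwidths are the figures quoted above, and the model assumes that only the matrix entries cause memory traffic (RHS and result vectors held in caches or shared memory):

  # Dense matrix-vector multiply: 2 flops (multiply + add) per matrix entry,
  # each entry loaded exactly once from main memory.
  def mvm_gflops(bandwidth_GBs, bytes_per_entry):
      bytes_per_flop = bytes_per_entry / 2.0
      return bandwidth_GBs / bytes_per_flop

  print("GPGPU, SP, ECC on :", mvm_gflops( 90.0, 4), "GFlop/s")  # ~45
  print("GPGPU, SP, ECC off:", mvm_gflops(120.0, 4), "GFlop/s")  # ~60
  print("CPU node, SP      :", mvm_gflops( 40.0, 4), "GFlop/s")  # ~20
  print("CPU node, DP      :", mvm_gflops( 40.0, 8), "GFlop/s")  # ~10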

Note the cunning use of a linear scale in the diagram; in this particular case it is much better than a log scale since it reduces the CPU performance “baseline” to a mere flyspeck (see Stunt 3 for more information on log vs. linear scale).

This stunt is essentially identical to #6 and #7 of Bailey’s original collection.

Fooling the masses – Stunt 8: Impress your audience with awe-inspiring accuracy

Posted on by Georg Hager

(See the prelude for some general information on what this is all about)

“Qualitative” statements make you cringe. Being a scientist, you know that science is all about accuracy. And what could be more accurate than numbers? Good for you! There is ample opportunity for generating numbers in computer science. The art is to not block your audience’s view on the raw and unfiltered truth.

  # cores   Time (plain) [s]   Time (optimized) [s]   Speedup [%]
        1         12.0435766              8.6198441   39.71919283
        2          6.1179025              5.5901951    9.439874469
        3          4.9041394              4.6966098    4.418710705
        4          4.7002801              4.6912042    0.193466317

Fig. 1: Awesome! Up to 1.77992 times faster!

A simple example: When comparing performance numbers between different machines or code versions (see Stunt 7 for some amazing tricks you can play there), don’t hesitate to provide timings with sub-microsecond resolution, even if the results fluctuate like mad. Then state, in a slightly generalizing, “management-style” demeanor, that your “optimizations lead to a performance increase of up to 39.71919283%”, and refer to the data (see the table) to substantiate your assertion. In other words, forget what your high school teachers told you about significant digits and use the full power of your pocket calculator, even if a slide rule would also do the trick.
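Just to show how little survives an honest treatment: with an assumed run-to-run fluctuation of ±0.2 s (hypothetical – the table gives single runs only), a few lines of Python turn the impressive figure back into what it really is:

  t_plain, t_opt = 12.0435766, 8.6198441   # single-run timings from the table [s]
  sigma = 0.2                              # assumed run-to-run fluctuation [s]

  ratio = t_plain / t_opt
  gain = 100.0 * (ratio - 1.0)
  # simple error propagation for the ratio of two noisy timings
  err = 100.0 * ratio * ((sigma / t_plain) ** 2 + (sigma / t_opt) ** 2) ** 0.5

  print(f"Stunt version : {gain:.7f} %")                  # 39.7191928 %
  print(f"Honest version: ({gain:.0f} +/- {err:.0f}) %")  # roughly (40 +/- 4) %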

By the way, the same works for drawing conclusions from data in diagrams. Since it is sometimes hard to read off precise values from graphs, help your audience by stating the raw facts (see Fig. 1). Using boldface and jazzy colors will help to keep people’s attention on the important messages. Do not waste time and space dealing with lengthy explanations but let the numbers speak for themselves!

 

Fooling the masses – Stunt 7: Be creative when comparing scaled performance!

Posted on by Georg Hager

(See the prelude for some general information on what this is all about)


Fig. 1: Comparing performance vs. number of compute units (CPU sockets, GPUs) under limited scalability can yield any result if you do it right!

Strong scaling performance comparisons between standard (i.e., CPU-based) and accelerated (e.g., GPU-based) systems are particularly rewarding, since there is ample opportunity to generate any speedup number you want. It is not about fabricating a large GPU vs. CPU speedup – this special topic deserves its own stunt (stay tuned!). You can just as well apply it to a “baseline” vs. an “optimized” code on the same machine.

This may not be easy to digest on a theoretical level, so I explain it using an example: Assume we have a parallel CPU code whose strong scalability is limited by Amdahl’s Law with a serial runtime fraction of 0.3%. Hence, the speedup factor is limited to 333. A parallel GPGPU version of the same program achieves a speedup of 10 (compared to one CPU socket) for the parallel part, and of 5 for the sequential part. Figure 1 shows a performance comparison for up to 16384 compute units (CPU sockets or GPUs, respectively). As expected, the GPGPU speedup is 10 for small numbers of workers and goes down to 5 for large numbers – always comparing equal numbers of GPGPUs and CPU sockets. So far so good.

However, if you pose the right question, the factor can be much larger – even infinity, if you want: If we ask “How many CPU sockets do we need to outcompute 32 GPUs?”, the answer is clearly 2048. This means that your GPU-vs-CPU speedup is not 5, and also not 10, but almost 64 (see the dashed line in the graph). And it gets even better: Somewhere between 32 and 40 GPUs the required number of CPU sockets slips away to infinity! Your GPUs can go where no CPU has ever gone before (and never will)!
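Here is the comparison as a hedged Python sketch. The model parameters are the ones stated above; the concrete socket counts quoted in the text are read off Fig. 1 at power-of-two worker counts, so the sketch reproduces the qualitative behavior rather than the precise values:

  # Amdahl-type model, everything normalized to one CPU socket. (Assumptions.)
  s = 0.003                  # serial runtime fraction of the CPU code
  a_par, a_ser = 10.0, 5.0   # GPU speedup of the parallel / serial part

  def perf_cpu(n_sockets):
      return 1.0 / (s + (1.0 - s) / n_sockets)

  def perf_gpu(n_gpus):
      return 1.0 / (s / a_ser + (1.0 - s) / (a_par * n_gpus))

  def sockets_to_match(n_gpus):
      """CPU sockets needed to reach the performance of n_gpus GPUs."""
      target = perf_gpu(n_gpus)
      if target >= 1.0 / s:          # beyond the CPU's Amdahl limit of 1/s
          return float("inf")
      return (1.0 - s) / (1.0 / target - s)

  for n in (8, 16, 32, 64):
      print(f"{n:3d} GPUs ~ {sockets_to_match(n):8.0f} CPU sockets   "
            f"(GPU/CPU ratio at equal worker count: {perf_gpu(n) / perf_cpu(n):4.1f})")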


Fig. 2: Plotting the parallel efficiency of both variants reveals the reason why the stunt works. Note that the efficiency of the GPU version is always worse.

The reason why this works is that you compare the two graphs at different numbers of workers – so different that the parallel efficiency for the CPU code is way down, whereas for the GPU code it is still acceptable (see Fig. 2)! Or to put it the other way round: If we compare equal numbers of workers on both sides, the parallel efficiency on the CPUs is always better.

The sensible way to compare achievable performance would be to set a certain practical lower limit for the parallel efficiency, e.g., 50%. From a computing center point of view, this is probably the lowest efficiency we’d be willing to accept, except in cases where the reason for going parallel is lack of memory. Scaling beyond that point does not make sense, not even for the purpose of comparison! A reasonable question to ask would be “How much performance can I get from the two codes at their respective minimum acceptable efficiency point?”

So the gist of it is: When comparing scaled performance data, ask the right questions and be the hero of the day. And don’t talk about parallel efficiency, or people might ask why you are wasting precious CPU cycles…

Thanks to Gerhard for discovering this extraordinary stunt!

 

 

Fooling the masses – Stunt 6: Ignore affinity and topology issues!

Posted on by Georg Hager


Fig. 1: A Cray XE6 node (based on AMD Interlagos processors) with a rich topology: four ccNUMA domains, shared FP units and L2/L3 caches, highly anisotropic HyperTransport connection performance.

(See the prelude for some general information on what this is all about)

Real scientists are not bothered by details such as multicore, cache groups, ccNUMA, SMT, or network hierarchies. Those are just parts of a vicious plot to take the fun out of computing. Ignoring the associated issues like bandwidth bottlenecks, NUMA locality and contention penalties, file system buffer cache, synchronization overheads, etc., will make them go away.

If you present performance data and people ask specific questions about affinity and topology awareness, answer that it’s the OS’s or the compiler’s job, or the job of some auto-tuning framework, to care about those technicalities. Probably you can show off by proving that your code scales even though you run it without topology awareness; this can be easily achieved by applying Stunt 2 – just slow down code execution enough to make all overheads vanish.

If you run into problems like mysterious performance fluctuations, or performance being too low altogether (a direct but often ignored consequence of applying Stunt 2), blame the system administrators. Being lowly minions of top-achieving scientists like you, they will have the time to figure out what’s wrong with their hardware. If they tell you it’s all your fault, send a scathing e-mail to their boss and cc: your university president or company CTO, just for good measure. Finally, if all else fails, publish a paper at a high-profile conference stating that the hardware of manufacturer X shows horrible performance, especially together with compiler Y and library Z, and that the cluster you had access to was too small to get the results that you wanted in time. That’s why you are about to write a generous research proposal for a federal supercomputing facility. Anything less just won’t cut the mustard!

 

The 400x GPU speedup baloney

Posted on by Georg Hager

Recently, in the HPC services office…

(… phone rings …)

“Computing Center, HPC services. How can I help you?”

“Yes, High Performance Computing.”

“You want to use our GPGPU cluster? Great! The load on this baby could be higher anyway. It’s hard to believe, but people seem to avoid it like the plague (jovial laughter). Do you have a code that runs on GPUs already?”

“I see, the compiler should be able to handle this. But the code is SIMD vectorized for standard processors, right?”

“No, this has nothing to do with cell phones. SIMD means ‘Single Instruction Multiple Data,’ i.e., several operations can be performed on different operands with a single machine instruction. If that works, chances are that the program can be run efficiently on a GPU as well. After all, GPUs implement the SIMD principle quite extensively.”

“Hm? I think I don’t understand…”

“Ah, ok. No, you don’t have to learn assembly programming to do this. But you may have to think a little more about how the compiler looks at your code. Often you can help it by supplying additional information, like source directives. And of course you need to use a compiler that understands what SIMD is. Alas, the GNU compilers don’t have a clue about it, mostly. By the way, how have you parallelized the code?”

“Um, no. Compilers can’t help you much here, except for very simple cases where a 10-year-old can do it just as well. But you do have to parallelize. How do you want to draw a meaningful comparison to the GPU version?”

“What do you mean, you don’t need to do this?”

“Um, yes, I think I’ve seen this paper recently. It should be on my desk somewhere… (paper rustling) And what exactly are you referring to?”

“Section 4.3, just a sec… here we are: ‘As shown in Fig. whatever, we have achieved a 400x speedup on an NVIDIA Tesla GPU as compared with our CPU implementation.’ (long pause)”

“Sorry, no, I’m still here. I’ve just been looking for the details of the CPU implementation. One moment… (longer pause) Ok, here’s something in the caption of the pretty CFD visualization: ‘In order to avoid issues with OpenMP parallelization we have run the CPU version on a single core.’ Wow. This has to sink in. And if I’m not mistaken, they compare a single-precision GPU code with a double precision version on the CPU. Truly hilarious.”

“No, I’m not making fun of those scientists. But ‘scientists’ is not the word I would use. They obviously think that everyone else is stupid.”

“Why? Because a factor of 400 is impossible. Neither the floating-point peak performance nor the memory bandwidth of the GPU is 400 times larger than that of a current standard compute node, or a chip, or even a single core. So they must have fabricated a slow CPU code on purpose. Realistically one may expect a factor of 2-4 if you compare a reasonably optimized CPU code on a single node with a single GPU, and use the same precision on both.”

“Yes, I agree. 2-4 isn’t bad at all. But that’s just counting the raw GPU. How would the data transfer affect the performance of your code? Can you estimate this?”

“Well, somehow the data must be brought into the GPU and the results must be copied back so that they don’t start rotting over there…”

“Yes (sighing with resignation). Compilers are smart. But there are limits. If you copy the whole problem through the PCI bus after every step, the only way to exploit the performance advantage of the GPU is to perform ridiculously many flops. It’s all a matter of code balance.”

“Code balance tells you how much data transfer your code needs per executed floating-point operation – and you have to count everything, including communication via buses, the network, the memory interface, etc. This can add up to quite some data volume. And then you compare with the amount of data the hardware can transfer per peak flop executed, which gives you an estimate for the performance.”

“You don’t know how many flops and how much data transfer your code needs? Can’t you just count that by looking at it? In most cases, compute time is spent in a very limited number of numerical kernel loops.”

“No, the compiler doesn’t count that for you (keeping composure with obvious effort). It’s also a matter of optimization; perhaps one can reduce the data transfers a bit. One would have to take a look at the code.”

“The compiler? (devastated) Yes, you should try that. Definitely… No listen, people are much too enthusiastic about the compiler’s abilities. Compilers cannot read your mind, they can’t even look through the standard C++ template tricks that programmers are so fond of. Let alone generate optimal code for GPUs.”

“Ok, err, sorry, are you crying? (nonplussed) Please don’t. May I suggest that we sit together over a cup of coffee and I give you a crash course in basic performance modeling? Don’t worry, everything’s going to be alright. There, there…”

Fooling the masses – Stunt 5: Instead of performance, plot absolute runtime versus CPU count!

Posted on by Georg Hager

(See the prelude for some general information about what this is all about)


Fig. 1: Plotting runtime versus CPU count at strong scaling, nobody will be able to tell whether scalability is good or bad at the larger CPU counts.

Apart from using a logarithmic scale (see Stunt 3), deficiencies in scalability can also be hidden effectively by plotting absolute runtime. This works for strong scaling only, since weak scaling shows constant or increasing runtime when the number of CPUs goes up.

Of course you should start at a very low number of CPUs (and not use log scale this time, or you’ll be caught immediately). Then, runtimes at large CPU counts will be compressed to a small ordinate range, and nobody will be able to tell the difference between a speedup of 1.5 and one of 2.0 — just as you need it (see Fig. 1). The casual observer is always stunned by the strong decrease in runtime, without noticing its insignificance. Even if the curve starts turning upward at some critical CPU count, you can often get away with the statement that your code scales up to this point, even though parallel efficiency has gone down the drain much sooner.
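To see how effective the compression is, here is a tiny numerical illustration (the Amdahl-type runtime model and its parameters are assumptions, not data from Fig. 1):

  # Two strong-scaling codes with different serial fractions. The linear runtime
  # axis starts at the single-CPU value, so everything at large N huddles near zero.
  t1 = 100.0                       # single-CPU runtime [s]

  def runtime(n, serial_frac):
      return t1 * (serial_frac + (1.0 - serial_frac) / n)

  for n in (1, 4, 16, 64, 256):
      good = runtime(n, 0.002)     # scales decently
      bad  = runtime(n, 0.02)      # parallel efficiency long gone
      print(f"N={n:4d}  good={good:7.2f} s  bad={bad:7.2f} s  "
            f"gap in units of the plot height: {(bad - good) / t1:5.1%}")

At N=256 the “bad” code is about four times slower than the “good” one, yet the two data points are separated by less than two percent of the ordinate range.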


Fig. 2: A little 3D and texture sugar will hide unpleasant truths even better. And yes, the numbers are very different at large CPU counts!

To make the results even more striking you can use the CPU time per core instead of overall wallclock time. Depending on how you determine CPU time, overhead like OS calls (system time) may be omitted and you get better-looking results. Not that it would matter in any way, because it’s all swamped by the scale.

A very popular variant of runtime graphs is the “obfuscation by eye candy fury”. Excel makes it so easy to add visual sugar to your graphs; you can literally click your audience into oblivion: 3D beveled columns, drop shadows, fill gradients on walls (and columns!), fancy textures, etc. etc. A large number of data sets will also help (but see Stunt 11 for the full potential). Combined with the runtime-vs-CPU-count trick from above, comparing different systems without revealing the truth is a solved problem (see Fig. 2). Don’t forget to choose colors that will be indistinguishable in a grayscale printout; readers can’t criticize what they can’t see…

Fooling the masses – Stunt 4: Quietly employ weak scaling to show off!

Posted on by Georg Hager

(See the prelude for some general information about what this is all about)

As we have seen in Stunt 2 about slowing down code execution, making all parts in the speedup formula except communication overhead more dominant is good for the straightness of your scalability graph. Weak scaling, i.e., keeping the problem size per worker constant while increasing the number of workers, is probably the simplest way to achieve this (α=1 in the speedup formula). Neglecting boundary effects, many codes will then show a constant ratio of communication overhead versus computation time; given a reasonably nonblocking network hierarchy, linear scalability often comes for free, at least for a large enough number of nodes. A positive side effect is that you don’t even have to recompile, let alone change your code. You can then even show real performance numbers, not just speedups!

Fig. 1: Weak scaling with a dominating serial part. The scalability function has a small slope, but the “work” done in the parallel part scales perfectly.

In case there is no communication problem but a large serial part, just by choosing the right metric to look at you can get perfect scaling, i.e., S(N)=N. Here’s how it works: Let’s assume that, despite weak scaling, your code is dominated by a sequential part. Hence, it’s S(N)=s+(1-s)N. The function is linear but has a small slope, which doesn’t look good and leads to a parallel efficiency ε(N)=S(N)/N that is much smaller than 1. However, looking at how the processors spend their time it is clear that the speedup function for the purely parallel work per time unit, i.e., not counting any overhead, is linear in N with a slope of 1 (see Fig. 1). How can this be used to polish performance numbers? Just report a performance metric that is valid only for the purely parallel part of the application! MFlop/s or MLup/s (lattice site updates) will do fine, for instance. If the “overhead” part contains no lattice updates, the MLup/s metric has a perfect speedup function, Smlups(N)=N, the performance graph is linear with no constant part, and ε(N)=1, even though all but one processor are twiddling their thumbs most of the time. Problem solved. Stay tuned for Stunt 10, where we will show that a similar trick is possible for communication-dominated cases as well.

The reason why it works is that we have chosen a different notion of what “work” means in the speedup formula. Speedup is defined as

S(N)=(work/time with N workers) / (work/time with 1 worker),

no matter whether the amount of work varies with N or not. While there is no discussion about what “time” means, we can give different meanings to “work”. If “work” is chosen so that only operations in the parallel part are counted, parallel efficiency is perfect at weak scaling (α=1). That trick can’t be pulled at strong scaling, though (it is left to the reader to prove this…).
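Here is the bookkeeping as a short Python sketch; the serial fraction and the convention that lattice updates happen only in the parallel part are assumptions for illustration:

  # Weak scaling with a dominating serial part: runtime per run stays constant,
  # each of the N workers contributes one unit of parallel work (lattice updates).
  s = 0.7                          # assumed serial runtime fraction

  def speedup(n):                  # "work" = everything  ->  S(N) = s + (1-s)N
      return s + (1.0 - s) * n

  def speedup_lups(n):             # "work" = lattice updates (parallel part only)
      return ((1.0 - s) * n) / (1.0 - s)

  for n in (1, 2, 8, 64):
      print(f"N={n:3d}  S(N)={speedup(n):6.1f}  efficiency={speedup(n) / n:6.1%}  "
            f"MLup/s-based speedup={speedup_lups(n):5.1f}")

The honest efficiency column sinks toward 1-s, while the MLup/s-based column stays at a perfect N, no matter how much time the serial part wastes.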

It is certainly advisable to omit any mention of the fact that weak scaling was used. Your audience will cheer and rejoice in the light of such straight lines! However, don’t let them watch too long or they’ll ask nasty questions…

This stunt is essentially #4 of Bailey’s original “Twelve ways to fool the masses.”

Fooling the masses – Stunt 3: The log scale is your friend!

Posted on by Georg Hager

(See the prelude for some general information about what this is all about)

Sometimes a data plot just isn’t straight enough. You know that your code scales, but the bloody cluster stubbornly ignores this fact and the performance graph looks like a skew-whiff banana:

How are you supposed to drive your point home? Do not despair! Help is on the way. We’ll be guided by the Great Old Ones and just use a log scale. Whether it’s lin-log, log-lin, or log-log should be determined according to your particular needs for obfuscation. The point is that a log scale tends to underemphasize deviations between graphs in the same plot. Here we have chosen the log-log variant:

Doesn’t this look much nicer? You can easily attach the “almost linear scaling” label to it, although parallel efficiency is barely above 60% at the largest number of workers.
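If you want to reproduce the effect, a minimal matplotlib sketch will do. The Amdahl data below is made up, merely tuned so that the parallel efficiency at the largest worker count ends up just above 60%, roughly as in the plot described here:

  import numpy as np
  import matplotlib.pyplot as plt

  n = np.array([1, 2, 4, 8, 16, 32, 64, 128])
  speedup = 1.0 / (0.005 + 0.995 / n)       # assumed: 0.5% serial fraction

  fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(8, 3))
  for ax in (ax_lin, ax_log):
      ax.plot(n, n, "k--", label="ideal")
      ax.plot(n, speedup, "o-", label="measured")
      ax.set_xlabel("# workers"); ax.set_ylabel("speedup"); ax.legend()
  ax_log.set_xscale("log"); ax_log.set_yscale("log")
  ax_lin.set_title("linear: skew-whiff banana")
  ax_log.set_title("log-log: 'almost linear scaling'")
  plt.tight_layout(); plt.show()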
