Georg Hager's Blog

Random thoughts on High Performance Computing


Fooling the masses – Stunt 16: Worship the God of Automation!

(See the prelude for some general information on what this is all about)

Developing a true understanding of a problem by thorough analysis is so 19th century. When there were no computers, people had to actually sit down with paper and pencil (quill?) and use their brains to figure out stuff. How old-fashioned and boring.


Figure 1: In automation, plugging one framework into the next is pivotal. Why not reuse what you already have?

With the advent of computer science we have added substantial momentum to automation. And it’s not just about automating tedious computations, but also about day-to-day tasks such as filtering information, visualizing data, and (now imagine a heroic fanfare in the background) the process of thinking itself!  With automation we don’t have to care about the validity, usefulness, or tenability of our automatically generated results, because understanding or insight was not asked for. The point is that “stuff” drops out of some program. What else could we wish for?

In high performance computing we find a particularly strong case for automation. Parallel computers are so complicated. Code is complicated. Compilers are complicated. The way performance is composed of all those little bottlenecks in the machine is complicated. So let’s design a machine (i.e., a program) that takes all this intricacy and digests it. This usually means we have to produce and interpret lots of data, but computers are good at crunching data, too! So why not use a fancy data analysis framework to dig out the information? It’s all in the data, so the machine should be able to see through it.

This is all easier than one might think. Do what tool developers do all the time: Build a new meta-tool from all the analysis tools you can lay your hands on (see figure). Then take a bunch of source codes, feed them into your new tool, and draw a diagram that shows in fancy colors that L3 misses and long network latencies are correlated with low performance and high energy consumption. Et voilà: real science!

Of course there are those reactionary die-hards who are still fostering the brain-quill-and-paper approach. But this is the age of cybernation1! Steal their thunder by repeating the computer science mantra: “Can this be automated? This can be automated. Automate this!” (a little rocking back and forth with eyes closed will emphasize the religious dimensions of your message).

If they are still obstinate, throw random pieces of Haskell code at them. That should teach them a lesson.

1 I love the similarity between cybernation and hibernation. It’s cyberlarious.

Fooling the masses – Stunt 15: Play mysterious!

(See the prelude for some general information on what this is all about)

Do you know that feeling when you think to yourself “What on earth did they do there?” when reading a paper? After some depressing hours of trying to figure out how those guys got their results, you just give up, shrug, and cite them in your related work section because everybody else does. “My goodness” you think, “I wish I knew how to do this myself.” Here are some ideas.

Make it as hard as possible for others to figure out what you actually did, so that they can’t compare your results to theirs right away. A popular strategy is to present problem sizes, i.e., the “amount of work,” separately (on a different page) from performance results, which in turn should be given in seconds (time to solution). This will force your audience to figure out the conventional “work/time” performance metric by themselves; many will give up and just believe your statement that you are “faster than X.” It will also be harder to compare with performance models, which is a good thing if you’re off the model by some large factor. Runtime is not the only effective metric; anything unusual will do, such as “microflops per picosecond” or “cache misses per core hour.” Combine with stunt 12 as needed.

Test case problem size
# Iterations Runtime [s]
car see page 4 500  2.34521
plane see page 6 sufficient  3.14159
train see page 2 roughly 112  0.11991
chicken see page 12 whatever  42.0

It’s really useful to be secretive about the exact hardware you were using. “eight-core AMD Opteron” or “six-core Intel Xeon” will usually suffice as a description. You may throw in the clock speed for good measure, but don’t mention whether you have activated Turbo Mode or not, or whether you used one socket or two, or how you enforced affinity. On the other hand, never forget to specify the exact Linux distribution and kernel version! No performance paper will be complete without it. We don’t want to hide anything after all, right?

Talking about openness: leave a good impression by making your software framework available for download. But when you do, take care to make it impossible to actually run it. Forge a totally convoluted, mind-twisting build system. Put in dependencies to several huge Java frameworks with version numbers far into the past or the future (requiring gcc 2.95.2 will work fine, too!). Never write plain C code, always generate it, using popular languages such as Logo, Java2kWhitespace, or Brainfuck (pick any two). Because: what they can’t run, they can’t criticize! (Think “Stupida Mouse!”) Give your software framework a fancy, foreign-sounding name, preferably from a romance language. Acronyms are so 1980s1, but “Inocentada,” “Vanitas,” “Nebulosità,” or “Ordure” will really cut the mustard.

If your strategy works you can even afford to be bold in your abstract: “Our framework achieves up to 34% more performance than previous work.” Since nobody can refute that claim due to your bleeding-edge camouflage tactics, you have laid the foundation of what will be known in a few years as “seminal work.”

1 So I was told by Julian

Fooling the masses – Stunt 14: Secretly use fancy hardware setups and software tricks!

(See the prelude for some general information on what this is all about)

You’re not the average computer or domain scientist, who has to buy off-the-shelf white-box hardware and run GCC-compiled code on it. You deserve something better, shinier, fancier, which, incidentally, makes your code run faster. And even without special hardware there are those neat little dirty tricks you only get to know when diving deep into the hardware-software interactions. In a sense, it’s the complement of slowing down the baseline (see stunt 2 and stunt 9). The only thing you need to care about is make sure nobody notices. Here’s what can give you a decisive edge:

File:2007TaipeiITMonth IntelOCLiveTest Overclocking-6.jpg

Figure 1: Liquid nitrogen cooling is overclocker’s heaven! (Image by Rico Shen)

Pimped-up hardware: Overclocked, nitrogen-cooled (see Fig. 1), high-bin hardware that takes its own power plant to run is a rewarding platform. Since you do not care about reproducibility of results, the fact that nobody else has access to your hardware shouldn’t give you sleepless nights.

No safety nets: ECC protection and short memory refresh cycles are for cowards! Bleeding-edge research has no use for pathetic whiners who prefer reliability over performance.

Quiet machine: Ask your friendly system administrators to stop the queues for everybody else and let you have the whole machine for your own benchmarking. That will lift the burden of statistical analysis off your shoulders.

Eliminate inconvenient details: Away with that bloody routine which takes only 10% of serial runtime but is so awfully hard to optimize. Who needs boundary handling anyway?

Low-level programming: “A=B+C” is lame. “vaddpd (%r13,%r8,8), %ymm1, %ymm2” rocks your world. You think assembly, you breathe machine code, and you use C intrinsics only on bad hair day. Keep your secrets and let the losers deal with the deficiencies of their rotten GNU compilers!

This stunt is a combination of #3, #10 and #11 from Bailey’s original collection.

Fooling the masses – Stunt 13: If they get you cornered, blame it all on OS jitter!

(See the prelude for some general information on what this is all about)

Ok, so you’re in real trouble: The pimpled smarty-pants in the audience has figured out that the data you have shown in your talk was a pile of sugar-coated white noise.  Or the conference deadline draws near, all you have is  a pile of useless data, and your boss desperately needs another publication. Or your paper is already beyond the page limit, and you need a convenient shortcut.

In such a situation it is good to have a backup plan that will allow you to save your neck and still look good. Fortunately, computers are so complex that there is ample opportunity to blame something that is not your fault.

Enter the “technical-detail-not-under-my-control.” Here are your friends – each is followed by a typical usage example:

Stupid compilers: “Our version of the code shows slightly worse single-thread performance, which is presumably due to the limited optimization capabilities of the compiler.

If the compiler can’t fix it, who can? And nobody wants to inspect assembly code anyway.

Out-of-order execution (or lack thereof): “Processor A shows better performance than processor B possibly because of A’s superior out-of-order processing capabilities.

That’s a truly ingenious statement, mainly because it is so very hard to prove. Among friends you may just bluntly state that B has a lousy design.

L1 instruction cache misses: “As shown in Table 1, our optimized code version B is faster because it has 20% fewer L1 instruction cache misses than version A.

Hardly anyone discusses L1 instruction cache misses, because they hardly ever do any harm. Thus they are the ideal scapegoat if you don’t know the exact reason for some effect. Be careful with presenting performance counter data, though: If the instruction count reveals that there is only one L1I miss every 10000 instructions, that’s not what you want to show in public…

TLB misses:Performance shows a gradual breakdown with growing problem size. This may be caused by excessive penalties due to TLB misses.

TLB misses are always a good choice because they have a reputation to be even more evil than cache misses! You may want to add that the time was too short to prove your statement by  measuring the misses and estimate their actual impact, and then search for the real cause.

Bad prefetching: “Performance does not scale beyond four cores on a socket. We attribute this to problems with the prefetching hardware.

Prefetching is necessary. Bad prefetching means bad performance because the latency (gasp!) cannot be hidden. Other reasons for performance saturation such as a bandwidth bottleneck or load imbalance are much too mundane to be interesting.

Bank conflicts: “Processor X has only [sic!] eight cache banks, which may explain the large fluctuations in performance vs. problem size.

Wow. Cache banks. And you are the one to know that eight just won’t cut it.

Transient network errors: “In contrast to other high-performance networks such as Cray’s Gemini, InfiniBand does not have link-level error detection, which impacts the scalability of our highly parallel code.

You are a good boy/girl, so you have read all of Cray’s marketing brochures. Link-level error detection is soooo cool, it just has to be what you need to scale.

OS jitter: “Beyond eight nodes our implementation essentially stops scaling. Since the cluster runs vanilla [insert your dearly hated distro here] Linux OS images, operating system noise (“OS jitter”) is the likely cause for this failure.

My all-time favorite. This Linux c**p is beyond contempt.

The list can be extended, of course. Whatever technical detail you choose, make sure to pick one that is widely misunderstood or has a reputation of being evil, very complicated, hard to understand, and whose responsibility for your problem is next to impossible to prove. Note the peculiar diction; be noncommittal and use “may explain,” “could be,” “likely,” “presumably,” etc., so you won’t be accountable in case you’re proven wrong.

Fooling the masses – Stunt 12: Redefine “performance” to suit your needs!

(See the prelude for some general information on what this is all about)

There is actually a very clear definition of the term “performance” in HPC: “performance” is “work” divided by “time.” You may think about “work” as “the problem”; occasionally, Flops is a possible (but easily manipulable) measure. “Time” is the overall wall-clock time to do the work, including everything that needs to be done but counts as overhead, such as communication and synchronization. If you look at performance with respect to parallel resources, i.e., if you want to know how performance scales with the number of workers, the same formula applies – to strong and to weak scaling alike.

Now sometimes the performance results for your parallel program are … mediocre. There may be two reasons for this:

  • Scalability is good but your single-core or single-node performance sucks. This is may be a consequence of your applying stunt 2, either by accident or on purpose. If you only show speedups (see stunt 1) you will be fine; however, sometimes it is not so easy to get away with this.
  • Single-core and single-socket performance is good (i.e., you understand and hit the relevant bottleneck[s]), but scalability is bad. You know the culprits: load imbalance, insufficient parallel work, communication, synchronization.

Either way, you probably don’t want to show actual performance numbers or data about time to solution. As it happens there is a host of options you can choose from. You can certainly play around with the “time” component in the performance calculation, as shown in stunt 5 and stunt 9, or quietly invoke weak scaling as shown in stunt 4. Here are some popular and more imaginative examples:

Employ “floptimization:” Throw in some extra Flops to make the CPU perform more “work” at hardly any extra cost; often there is at least some headroom in the floating-point pipelines when running real applications. Accumulating something in a register is a classic:

  for(s=0.0, i=0; i<N; ++i) {
    p[i] = f(q[i]); // actual work
    s += p[i];      // floptimization

Even if the sum over all elements of p[] is not needed at all, it’s one more Flop per iteration, it boosts your flop rate, and it will usually not impact your time to solution! If you do it right, only the sky is the limit.

Use manpower metrics: Developing code is expensive, so you can take manpower into account. State that your superior hardware or software environment enabled you to get a working code in less than one week, which amounts to so and so many “man-hours per GFlop/s.” This can be extended to other popular software metrics.

Play the money card: “GFlop/s per Dollar” may also be useful, especially if you compare different systems such as your average off-the-shelf departmental cluster and the big DoE-paid iron.

Go green: Since everyone in HPC today is all mad about saving energy, use “Joules per Flop,” “GFlop/s per Watt,” “CO2 equivalents,” “Joule-seconds,” “Joule-seconds-squared,” or any other combination of metrics that shows the superiority of your approach in terms of “green” or “infrastructure-aware” computing. Moreover, energy to solution is almost proportional to runtime, so you can rename your paper on “Performance Optimization of a Kolonovski-Butterfield Solver” to “Energy Optimization of a Kolonovski-Butterfield Solver” and turn a boring case study into bleeding-edge research.

Blame the wimpy hardware: For multi- or many-core chips, per-core or per-thread metrics mask the inherent bottlenecks and are great for bashing your competitors. For instance, when comparing a multi-core CPU with a GPGPU you could declare that “the available bandwidth per thread is 122 times higher on the CPU than on the GPGPU.” Oh wait – make that 121.68 times.


Figure 1: Windows task manager is your friend: Four cores, all of them 100% utilized – that’s performance!

Boast frantic activity: MIPS (millions of instructions per second) or, equivalently, IPC (instructions per cycle) are perfect higher-is-better metrics if true performance is low. IPC can be boosted on purpose by several means: You can use a highly abstracting language to make the compiler generate a vast amount of instructions for simple things like loading a value from memory, or disable SIMD so that more instructions are executed for the same amount of work. Waiting in OpenMP barriers or MPI functions is also good for a large IPC count, since it often involves spin-waiting loops. You can use special options to disable compiler optimizations which have the potential to change numerical results (such as common subexpression elimination). But take care: If that means moving slow operations like floating-point divides into an inner loop, the IPC value will go down dramatically.

Fight the evil: State that your new code or new algorithm shows a factor of X fewer cache misses compared to the baseline. Cache misses are the black plague of HPC, so nobody will doubt your success. And they are just as easy to manipulate as Flops. Combining with the “go green” strategy is straightforward: Cache misses cost vast amounts of energy (they say), so you can be green and good at the same time!

Wow the management: State that “all processors are 100% utilized, so the code makes perfect use of the available resources.” Windows task manager is the perfect tool to show off with a stunt like this (see Fig. 1), but “top” under Linux works fine, too:

 2318 492m 457m 596 98.6 22.9 0:35.52 2 a.out 
 2319 492m 457m 596 98.6 22.9 0:35.53 0 a.out 
 2321 492m 457m 596 98.6 22.9 0:35.50 1 a.out 
 2320 492m 457m 596 97.6 22.9 0:35.44 3 a.out

In summary, metrics are your wishing well: Find the right metric and any code will look good.

This stunt is essentially #9 of Bailey’s original “Twelve ways to fool the masses.”

Fooling the masses – Stunt 11: Show data! Plenty. And then some.

(See the prelude for some general information on what this is all about)


Figure 1: Lots of variants of an experiment, lots of machines, no idea for interpretation – why not show it all?

Computers produce data. Parallel computers produce more data. And doing performance experiments with parallel computers produces even more data, since we often conduct several runs with different code variants on different machines. Now imagine a situation where you’ve got all this data lying around on your disk, but you can’t make any sense of it. Since you wouldn’t bother to establish even a coarse performance model (your setup is just too complex for this kind of bean counting), you don’t know whether some particular performance measurement is “good” or “bad” on some particular machine, or how performance should change when some parameter is altered.

Why not put the cart before the horse and show all the data? This has decisive advantages: You can display all the work you did, without the embarrassing message that five or six numbers is all the relevant data you got from your research. Furthermore, it gives you the opportunity to add “discussion” at any desirable level of detail. For instance, if the page limit for submitting to your favorite conference is ten pages, and you have five pages full of messy bar graphs (such as the one shown in Fig. 1), you can easily fill the rest with meaningless “explanations” like “Variant 11 shows about 90% performance advantage on Machine G over Variant 10 on Machine F while, surprisingly, Variant 9 on Machine A is 6% slower than Variant 7 on Machine C.” You may want to write “92.23%” and “5.982%” here – see Stunt 8 for the reason.

In Summary, you can make up for your lack of insight by showing data. Plenty of it. Swamp your readers or your audience with hundreds of colorful bar graphs, scatter plots, timing diagrams, or whatever your favorite data visualization tool has to offer, and describe them at length. If you really want to leave the impression of true understanding, resort to obscure technical details − but this is the theme of another stunt.


Fooling the masses – Stunt 10: Always emphasize the “interesting” part of your work!

(See the prelude for some general information on what this is all about)

Have you ever thought about how to improve the aerodynamic properties of bulldozers? Yes? Then this stunt is for you.

Bulldozer aeodynamics

Figure 1: Always focus on the interesting stuff!

Code optimization is quite rewarding at times. You put all your effort into mutilating a chunk of code beyond all recognition, and in the end you get a 50% speedup. Often enough, this work requires a great deal of  expertise at the interface between hardware and software. And you certainly want to present your success to your peers so they understand how advanced you are.

If you do such a thing, don’t bother picking the low-hanging fruits. When 95% of the run time goes into a boring loop nest which multiplies a dense matrix with a vector, there is not much you can do beyond the standard textbook stuff – or use a library altogether. However, if the remaining 5% consist of a convoluted communication scheme, you’ve found your target. The fact that the possible gain is at most 5% shouldn’t stop you, because it’s all a matter of presentation; see Fig. 1: 3D bars and a proper choice of scales on the axes guide the reader’s eye and help you drive home your point. It’s like presenting the optimized drag coefficient of your aerodynamically improved bulldozer on high gloss paper, and the boring technical data in small black print on a dark blue background, on the back page.

Interestingly, if the optimization target is communication, or synchronization, or any other non-computational overhead, this stunt can be made to work even better: When you strongly scale to a pointless number of workers, reducing such overheads will give you arbitrary speedups, far beyond the factor of two that is theoretically possible if you restrict yourself to a sensible minimum parallel efficiency of 50%. Coming back to the heavy gear analogy, if you accelerate your bulldozer to 200 mph, the drag coefficient does become essential, and you may feel vindicated. It won’t solve any real-world problem, but hey, this is research after all.

Thanks to Gerhard Wellein for discovering this stunt in a paper. Many papers, actually.

Fooling the masses – Stunt 9: Boast massive speedups with accelerators!

(See the prelude for some general information on what this is all about)

GPGPUs, FPGAs, massively parallel single-chip multiprocessors: Accelerators are cool. The  prospect of boosting your application’s performance by 100x or more is mouthwatering, so you invest days and weeks, even months to port and tweak code and make it run fast on that shiny new piece of hardware. However, in the end the outcome may well be a meager 2.5x, or even much less, if you compare to well-optimized parallel code on a standard 2-socket server that everyone can handle.

What happened? Well, a direct peak performance and memory bandwidth comparison reveals that – even without considering Amdahl’s Law and overheads like PCIe transfers – 2x-4x is just what can be expected. Assuming a fair game, of course. And that’s exactly your chance to sugarcoat your mediocre results and fabricate a blazing success! Here are some hints:

  • Compare bleeding-edge accelerator technology with vintage CPUs. A Pentium III will be fine for most purposes today. If you can’t get your hands on one of those, give me a call.
  • Always compare the fully parallel accelerator code with a sequential (single-core) CPU version. You can even argue in favor of this, saying that “serial CPU code constitutes a well-defined baseline.” Be sure to rush ahead to the next slide before anyone dares to ask why you don’t use a single GPGPU “core” to make the comparison even more “well-defined.”
  • Use no optimizations whatsoever with the CPU compiler; -O0 will be fine. Also use a plain, proof-of-concept implementation of your algorithm with no manual code changes. It’s good scientific practice because a rock-bottom baseline is more meaningful!
  • Use single precision on the GPU but double precision on the CPU. That will effectively cut the CPU’s cache, the memory bandwidth, and its peak performance in half.
  • Accelerate only the kernels that are easily portable and report only their performance. Amdahl’s Law and communication overhead will be nothing but smoke and mirrors!
dens MVM GPU vs. CPU

Figure 1: If you do it right, a 200x speedup for the GPU is absolutely feasible (comparing nVIDIA Tesla C2050 and Intel Dual Xeon 5650, dense matrix-vector multiply, matrix size 4500×4500, Intel compiler V12 update 11)

These tricks should give you a competitive edge; 20x-200x reported speedup should be no problem, depending on your unscrupulousness.

See Fig. 1 for an instructive example: A dense matrix-vector multiply using a 4500×4500 matrix is memory-bound on a Fermi-type GPGPU as well as on a standard CPU-based system such as a dual-socket Xeon 5650 (“Westmere”). The GPGPU baseline of about 45 GFlop/s at single precision (SP) is completely in line with the achievable memory bandwidth of roughly 90 GB/s (4 bytes per multiply-add operation, assuming the right-hand side is mostly held in shared memory). To get a slight speedup you can switch off ECC (who cares for some flipped bits?) to get about 120 GB/s, corresponding to 60 GFlop/s. No need to mention that the PCIe transfers to get the RHS and result vectors to and from the device are neglected. So much for the GPGPU.

On the CPU side, the full dual-Xeon node has a bandwidth of close to 40 GB/s, leading to an SP performance of 19 GFlop/s – much too fast to be shown in public! Going to double precision cuts that in half, but there is more: Using just a single core gets you down another factor of 4, and disabling vectorization gives you a further 0.5x (since you’re now far away from any bandwidth limitation). Finally, turning off optimization altogether (-O0) lets you hit rock bottom at 248 MFlop/s – a whopping 214x slower than the best GPGPU result! Going to my old P-III is not even necessary any more.

Note the cunning use of a linear scale in the diagram; in this particular case it is much better than a log scale since it reduces the CPU performance “baseline” to a mere flyspeck (see Stunt 3 for more information on log vs. linear scale).

This stunt is essentially identical to #6 and #7 of Bailey’s original collection.

Fooling the masses – Stunt 8: Impress your audience with awe-inspiring accuracy

(See the prelude for some general information on what this is all about)

“Qualitative” statements make you cringe. Being a scientist, you know that science is all about accuracy. And what could be more accurate than numbers?  Good for you! There is ample opportunity for generating numbers in computer science. The art is to not block your audience’s view on the raw and unfiltered truth.

# cores Time (plain) [s] Time (optimized) [s] Speedup [%]
1 12.0435766 8.6198441 39.71919283
2 6.1179025 5.5901951 9.439874469
3 4.9041394 4.6966098 4.418710705
4 4.7002801 4.6912042 0.193466317

Fig. 1: Awesome! Up to 1.77992 times faster!

A simple example: When comparing performance numbers between different machines or code versions (see Stunt 7 for some amazing tricks you can play there), don’t hesitate to provide timings with sub-microsecond resolution, even if the results fluctuate like mad. Then state, in a slightly generalizing, “management-style” demeanor, that your “optimizations lead to a performance increase of up to 39.71919283%”, and refer to the data (see the table) to substantiate your assertion. In other words, forget what your high school teachers told you about significant digits and use the full power of your pocket calculator, even if a slide rule would also do the trick.

By the way, the same works for drawing conclusions from data in diagrams. Since it is sometimes hard to read off precise values from graphs, help your audience by stating the raw facts (see Fig. 1). Using boldface and jazzy colors will help to keep peoples’ attention on the important messages. Do not waste time and space dealing with lengthy explanations but let the numbers speak for themselves!


Fooling the masses – Stunt 7: Be creative when comparing scaled performance!

(See the prelude for some general information on what this is all about)

Scaling comparison

Fig. 1: Comparing performance vs. number of compute units (CPU sockets, GPUs) under limited scalability can yield any result if you do it right!

Strong scaling performance comparisons between standard (i.e., CPU-based) and accelerated (e.g., GPU-based) systems are particularly rewarding, since there is ample opportunity to generate any speedup number you want. It is not about fabricating a large GPU vs. CPU speedup – this special topic deserves its own stunt (stay tuned!). You can just as well apply it to a “baseline” vs. an “optimized” code on the same machine.

This may not be easy to digest on a theoretical level, so I explain it using an example: Assume we have a parallel CPU code whose strong scalability is limited by Amdahl’s Law with a serial runtime fraction of 0.3%. Hence, the speedup factor is limited to 333. A parallel GPGPU version of the same program achieves a speedup of 10 (compared to one CPU socket) for the parallel part, and of 5 for the sequential part. Figure 1 shows a performance comparison for up to 16384 compute units (CPU sockets or GPUs, respectively). As expected, the GPGPU speedup is 10 for small numbers of workers and goes down to 5 for large numbers – always comparing equal numbers of GPGPUs and CPU sockets. So far so good.

However, if you pose the right question, the factor can be much larger – even infinity, if you want: If we ask “How many CPU sockets do we need to outcompute 32 GPUs?”, the answer is clearly 2048. This means that your GPU-vs-CPU speedup is not 5, and also not 10, but almost 64 (see the dashed line in the graph). And its gets even better: Somewhere between 32 and 40 GPUs the required number of CPU sockets slips away to infinity! Your GPUs can go where no CPU has ever gone before (and never will)!

Efficiency comparison

Fig. 2: Plotting the parallel efficiency of both variants reveals the reason why the stunt works. Note that the efficiency of the GPU version is always worse.

The reason why this works is that you compare the two graphs at different numbers of workers – so different that the parallel efficiency for the CPU code is way down, whereas for the GPU code it is still acceptable (see Fig. 2)! Or to put it the other way round: If we compare equal numbers of workers on both sides, the parallel efficiency on the CPUs is always better.

The sensible way to compare achievable performance would be to set a certain practical lower limit for the parallel efficiency, e.g., 50%. From a computing center point of view, this is probably the lowest efficiency we’d be willing to accept, except in cases where the reason for going parallel is lack of memory. Scaling beyond that point does not make sense, not even for the purpose of comparison! A reasonable question to ask would be “How much performance can I get from the two codes at their respective minimum acceptable efficiency point?”

So the gist of it is: When comparing scaled performance data, ask the right questions and be the hero of the day. And don’t talk about parallel efficiency, or people might ask why you are wasting precious CPU cycles…

Thanks to Gerhard for discovering this extraordinary stunt!