Georg Hager's Blog

Suche


Gaussian Elimination Squad to fight in “SC13 Intel Parallel Universe Computing Challenge”

Posted on by Georg Hager

It seems that Intel’s marketing division has some money to spend for charity – $25,000 to be exact. To make its spending as entertaining as possible, they have set up a stage show at SC13 in Denver, Colorado (Nov 17-22, 2013): The “Intel Parallel Universe Computing Challenge.”

Eight teams from the US, Europe, China, and Korea are going to fight in a three-round sudden death tournament. The German team comprises members from Aachen, Darmstadt, Jülich, Dresden, Garching, and Erlangen, and happens to be led by me. According to the rules each match will be 30 minutes, in which the teams have to answer questions and optimize code. The audience will have the chance to answer questions and win prizes, too! So if you happen to be at SC13 and can spare the time, drop by the Intel booth (#2701, see floor plan) and see us fight! See the link above for the schedule.

Honoring one of the greatest mathematicians of all time, the German team has chosen a peculiar war name: The “Gaussian Elimination Squad.” We will take on “Team Milky Way” from China in our first match on Tuesday, Nov 19th, 11:00am. We do not know why they have picked the name of a popular chocolate bar ;-) but probably it was in anticipation of what’s going to happen to them:

Milky Way before processing

Milky Way before elimination

Milky Way after processing

Milky Way after elimination

 

 

 

 

 

 

 

Now you know why we call ourselves the “Gaussian Elimination Squad.”

Intel vs. GCC for the OpenMP vector triad: Barrier shootout!

Posted on by Georg Hager

We use the Schönauer Vector Triad for most of our microbenchmarking. It’s a simple benchmark that everyone can write. It looks quite simple when parallelized with OpenMP:

double precision, dimension(N) :: a,b,c,d
! initialization etc. omitted
s = walltime()
!$omp parallel private(R,i)
do R=1,NITER
!$omp do
  do i=1,N
    a(i) = b(i) + c(i) * d(i)
  enddo
!$omp end do
enddo
!$omp end parallel
e=walltime()
MFlops = R*N/(e-s)/1.e6

There are some details that are necessary to make it work as intended; you can read all about this in our book [1]. Usually we choose NITER for every N so that the runtime is a couple hundred milliseconds (so it can be measured accurately), and report performance for N ranging from small to large. The performance of the vector triad is determined by the data transfers, even when the data is in the L1 cache. In the parallel case we can additionally see the usual multicore bandwidth bottleneck(s).

The OpenMP parallelization adds its own overhead, of course. As it turns out, it is mostly concentrated in the implicit barrier at the end of the workshared loop in this case. So, when looking at the performance of the OpenMP code vs. N, we usually see that using more threads slows down the code if N is too small. We can even calculate the barrier overhead from this (again, the book will tell you the gory details).

The barrier overhead varies across compilers and compiler versions, and it depends on the positions of the threads in the machine (e.g., sharing caches or not). You can certainly measure it directly with a suitable microbenchmark [2], but it is quite interesting to see the impact directly in the vector triad performance data.

vtriad_Lima_icpc_vs_gcc

Here we see the OpenMP vector triad performance on one “Intel Xeon Westmere” socket (6 cores) running at about 2.8 GHz, comparing a reasonably current Intel compiler with g++ 4.7.0. With the Intel compiler the sequential code achieves “best possible” performance within the L1 cache (4 flops in 3 cycles). With OpenMP turned on you cannot see this, of course, since the barrier overhead dominates for loop lengths below a couple of 1000s.

Looking at the results for the two compilers we see that GCC has not learned anything over the last five years (this is for how long we have been comparing compilers in terms of OpenMP barrier overhead): The barrier takes roughly a factor of 20 longer with gcc than with the Intel compiler. Comparing with the ECM performance model [3] for the vector triad we see that the Intel compiler’s barrier is fast enough to at least get near the performance limit in the L2 cache (blue dashed line). Both compilers are on par where it’s easy, i.e., in L3 cache and memory, where the loop is so long that the overhead is negligible.

Note that the bad performance of g++ in this benchmark is not due to some “magic” compiler option that I’ve missed. It’s the devastatingly slow OpenMP barrier. For reference, these are the compiler options I have used:

icpc -openmp -Ofast -xHOST -fno-alias ...
g++ -fopenmp -O3 -msse4.2 -fargument-noalias-global ...

In conclusion, the GCC OpenMP barrier is still slooooow. If you have “short” loops to parallelize, GCC is not for you. Of course you might be able to work around such problems (mutilating a popular saying from one of the Great Old Ones: “If synchronization is the problem, don’t synchronize!”), but it’s still good to be aware of them.

If you are interested in concrete numbers you can take a look at any of our recent tutorials [4], where we always include some recent barrier measurements with current compilers.

[1] G. Hager and G. Wellein: Introduction to High Performance Computing for Scientists and Engineers. CRC Press, 2010.

[2] The EPCC OpenMP Microbenchmarks.

[3] G. Hager, J. Treibig, J. Habich, and G. Wellein: Exploring performance and power properties of modern multicore chips via simple machine models. Computation and Concurrency: Practice and Experience, DOI: 10.1002/cpe.3180 (2013), Preprint: arXiv:1208.2908

[4] My Tutorials blog page

Fooling the masses – Stunt 14: Secretly use fancy hardware setups and software tricks!

Posted on by Georg Hager

(See the prelude for some general information on what this is all about)

You’re not the average computer or domain scientist, who has to buy off-the-shelf white-box hardware and run GCC-compiled code on it. You deserve something better, shinier, fancier, which, incidentally, makes your code run faster. And even without special hardware there are those neat little dirty tricks you only get to know when diving deep into the hardware-software interactions. In a sense, it’s the complement of slowing down the baseline (see stunt 2 and stunt 9). The only thing you need to care about is make sure nobody notices. Here’s what can give you a decisive edge:

File:2007TaipeiITMonth IntelOCLiveTest Overclocking-6.jpg

Figure 1: Liquid nitrogen cooling is overclocker’s heaven! (Image by Rico Shen)

Pimped-up hardware: Overclocked, nitrogen-cooled (see Fig. 1), high-bin hardware that takes its own power plant to run is a rewarding platform. Since you do not care about reproducibility of results, the fact that nobody else has access to your hardware shouldn’t give you sleepless nights.

No safety nets: ECC protection and short memory refresh cycles are for cowards! Bleeding-edge research has no use for pathetic whiners who prefer reliability over performance.

Quiet machine: Ask your friendly system administrators to stop the queues for everybody else and let you have the whole machine for your own benchmarking. That will lift the burden of statistical analysis off your shoulders.

Eliminate inconvenient details: Away with that bloody routine which takes only 10% of serial runtime but is so awfully hard to optimize. Who needs boundary handling anyway?

Low-level programming: “A=B+C” is lame. “vaddpd (%r13,%r8,8), %ymm1, %ymm2″ rocks your world. You think assembly, you breathe machine code, and you use C intrinsics only on bad hair day. Keep your secrets and let the losers deal with the deficiencies of their rotten GNU compilers!

This stunt is a combination of #3, #10 and #11 from Bailey’s original collection.

Fooling the masses – Stunt 13: If they get you cornered, blame it all on OS jitter!

Posted on by Georg Hager

(See the prelude for some general information on what this is all about)

Ok, so you’re in real trouble: The pimpled smarty-pants in the audience has figured out that the data you have shown in your talk was a pile of sugar-coated white noise.  Or the conference deadline draws near, all you have is  a pile of useless data, and your boss desperately needs another publication. Or your paper is already beyond the page limit, and you need a convenient shortcut.

In such a situation it is good to have a backup plan that will allow you to save your neck and still look good. Fortunately, computers are so complex that there is ample opportunity to blame something that is not your fault.

Enter the “technical-detail-not-under-my-control.” Here are your friends – each is followed by a typical usage example:

Stupid compilers: “Our version of the code shows slightly worse single-thread performance, which is presumably due to the limited optimization capabilities of the compiler.

If the compiler can’t fix it, who can? And nobody wants to inspect assembly code anyway.

Out-of-order execution (or lack thereof): “Processor A shows better performance than processor B possibly because of A’s superior out-of-order processing capabilities.

That’s a truly ingenious statement, mainly because it is so very hard to prove. Among friends you may just bluntly state that B has a lousy design.

L1 instruction cache misses: “As shown in Table 1, our optimized code version B is faster because it has 20% fewer L1 instruction cache misses than version A.

Hardly anyone discusses L1 instruction cache misses, because they hardly ever do any harm. Thus they are the ideal scapegoat if you don’t know the exact reason for some effect. Be careful with presenting performance counter data, though: If the instruction count reveals that there is only one L1I miss every 10000 instructions, that’s not what you want to show in public…

TLB misses:Performance shows a gradual breakdown with growing problem size. This may be caused by excessive penalties due to TLB misses.

TLB misses are always a good choice because they have a reputation to be even more evil than cache misses! You may want to add that the time was too short to prove your statement by  measuring the misses and estimate their actual impact, and then search for the real cause.

Bad prefetching: “Performance does not scale beyond four cores on a socket. We attribute this to problems with the prefetching hardware.

Prefetching is necessary. Bad prefetching means bad performance because the latency (gasp!) cannot be hidden. Other reasons for performance saturation such as a bandwidth bottleneck or load imbalance are much too mundane to be interesting.

Bank conflicts: “Processor X has only [sic!] eight cache banks, which may explain the large fluctuations in performance vs. problem size.

Wow. Cache banks. And you are the one to know that eight just won’t cut it.

Transient network errors: “In contrast to other high-performance networks such as Cray’s Gemini, InfiniBand does not have link-level error detection, which impacts the scalability of our highly parallel code.

You are a good boy/girl, so you have read all of Cray’s marketing brochures. Link-level error detection is soooo cool, it just has to be what you need to scale.

OS jitter: “Beyond eight nodes our implementation essentially stops scaling. Since the cluster runs vanilla [insert your dearly hated distro here] Linux OS images, operating system noise (“OS jitter”) is the likely cause for this failure.

My all-time favorite. This Linux c**p is beyond contempt.

The list can be extended, of course. Whatever technical detail you choose, make sure to pick one that is widely misunderstood or has a reputation of being evil, very complicated, hard to understand, and whose responsibility for your problem is next to impossible to prove. Note the peculiar diction; be noncommittal and use “may explain,” “could be,” “likely,” “presumably,” etc., so you won’t be accountable in case you’re proven wrong.

Fooling the masses – Stunt 12: Redefine “performance” to suit your needs!

Posted on by Georg Hager

(See the prelude for some general information on what this is all about)

There is actually a very clear definition of the term “performance” in HPC: “performance” is “work” divided by “time.” You may think about “work” as “the problem”; occasionally, Flops is a possible (but easily manipulable) measure. “Time” is the overall wall-clock time to do the work, including everything that needs to be done but counts as overhead, such as communication and synchronization. If you look at performance with respect to parallel resources, i.e., if you want to know how performance scales with the number of workers, the same formula applies – to strong and to weak scaling alike.

Now sometimes the performance results for your parallel program are … mediocre. There may be two reasons for this:

  • Scalability is good but your single-core or single-node performance sucks. This is may be a consequence of your applying stunt 2, either by accident or on purpose. If you only show speedups (see stunt 1) you will be fine; however, sometimes it is not so easy to get away with this.
  • Single-core and single-socket performance is good (i.e., you understand and hit the relevant bottleneck[s]), but scalability is bad. You know the culprits: load imbalance, insufficient parallel work, communication, synchronization.

Either way, you probably don’t want to show actual performance numbers or data about time to solution. As it happens there is a host of options you can choose from. You can certainly play around with the “time” component in the performance calculation, as shown in stunt 5 and stunt 9, or quietly invoke weak scaling as shown in stunt 4. Here are some popular and more imaginative examples:

Employ “floptimization:” Throw in some extra Flops to make the CPU perform more “work” at hardly any extra cost; often there is at least some headroom in the floating-point pipelines when running real applications. Accumulating something in a register is a classic:

  for(s=0.0, i=0; i<N; ++i) {
    p[i] = f(q[i]); // actual work
    s += p[i];      // floptimization
  }

Even if the sum over all elements of p[] is not needed at all, it’s one more Flop per iteration, it boosts your flop rate, and it will usually not impact your time to solution! If you do it right, only the sky is the limit.

Use manpower metrics: Developing code is expensive, so you can take manpower into account. State that your superior hardware or software environment enabled you to get a working code in less than one week, which amounts to so and so many “man-hours per GFlop/s.” This can be extended to other popular software metrics.

Play the money card: “GFlop/s per Dollar” may also be useful, especially if you compare different systems such as your average off-the-shelf departmental cluster and the big DoE-paid iron.

Go green: Since everyone in HPC today is all mad about saving energy, use “Joules per Flop,” “GFlop/s per Watt,” “CO2 equivalents,” “Joule-seconds,” “Joule-seconds-squared,” or any other combination of metrics that shows the superiority of your approach in terms of “green” or “infrastructure-aware” computing. Moreover, energy to solution is almost proportional to runtime, so you can rename your paper on “Performance Optimization of a Kolonovski-Butterfield Solver” to “Energy Optimization of a Kolonovski-Butterfield Solver” and turn a boring case study into bleeding-edge research.

Blame the wimpy hardware: For multi- or many-core chips, per-core or per-thread metrics mask the inherent bottlenecks and are great for bashing your competitors. For instance, when comparing a multi-core CPU with a GPGPU you could declare that “the available bandwidth per thread is 122 times higher on the CPU than on the GPGPU.” Oh wait – make that 121.68 times.

taskmgr.png

Figure 1: Windows task manager is your friend: Four cores, all of them 100% utilized – that’s performance!

Boast frantic activity: MIPS (millions of instructions per second) or, equivalently, IPC (instructions per cycle) are perfect higher-is-better metrics if true performance is low. IPC can be boosted on purpose by several means: You can use a highly abstracting language to make the compiler generate a vast amount of instructions for simple things like loading a value from memory, or disable SIMD so that more instructions are executed for the same amount of work. Waiting in OpenMP barriers or MPI functions is also good for a large IPC count, since it often involves spin-waiting loops. You can use special options to disable compiler optimizations which have the potential to change numerical results (such as common subexpression elimination). But take care: If that means moving slow operations like floating-point divides into an inner loop, the IPC value will go down dramatically.

Fight the evil: State that your new code or new algorithm shows a factor of X fewer cache misses compared to the baseline. Cache misses are the black plague of HPC, so nobody will doubt your success. And they are just as easy to manipulate as Flops. Combining with the “go green” strategy is straightforward: Cache misses cost vast amounts of energy (they say), so you can be green and good at the same time!

Wow the management: State that “all processors are 100% utilized, so the code makes perfect use of the available resources.” Windows task manager is the perfect tool to show off with a stunt like this (see Fig. 1), but “top” under Linux works fine, too:

  PID VIRT RES  SHR %CPU %MEM  TIME+  P COMMAND 
 2318 492m 457m 596 98.6 22.9 0:35.52 2 a.out 
 2319 492m 457m 596 98.6 22.9 0:35.53 0 a.out 
 2321 492m 457m 596 98.6 22.9 0:35.50 1 a.out 
 2320 492m 457m 596 97.6 22.9 0:35.44 3 a.out

In summary, metrics are your wishing well: Find the right metric and any code will look good.

This stunt is essentially #9 of Bailey’s original “Twelve ways to fool the masses.”

Fooling the masses – Stunt 11: Show data! Plenty. And then some.

Posted on by Georg Hager

(See the prelude for some general information on what this is all about)

ShowDataPlenty_Bars

Figure 1: Lots of variants of an experiment, lots of machines, no idea for interpretation – why not show it all?

Computers produce data. Parallel computers produce more data. And doing performance experiments with parallel computers produces even more data, since we often conduct several runs with different code variants on different machines. Now imagine a situation where you’ve got all this data lying around on your disk, but you can’t make any sense of it. Since you wouldn’t bother to establish even a coarse performance model (your setup is just too complex for this kind of bean counting), you don’t know whether some particular performance measurement is “good” or “bad” on some particular machine, or how performance should change when some parameter is altered.

Why not put the cart before the horse and show all the data? This has decisive advantages: You can display all the work you did, without the embarrassing message that five or six numbers is all the relevant data you got from your research. Furthermore, it gives you the opportunity to add “discussion” at any desirable level of detail. For instance, if the page limit for submitting to your favorite conference is ten pages, and you have five pages full of messy bar graphs (such as the one shown in Fig. 1), you can easily fill the rest with meaningless “explanations” like “Variant 11 shows about 90% performance advantage on Machine G over Variant 10 on Machine F while, surprisingly, Variant 9 on Machine A is 6% slower than Variant 7 on Machine C.” You may want to write “92.23%” and “5.982%” here – see Stunt 8 for the reason.

In Summary, you can make up for your lack of insight by showing data. Plenty of it. Swamp your readers or your audience with hundreds of colorful bar graphs, scatter plots, timing diagrams, or whatever your favorite data visualization tool has to offer, and describe them at length. If you really want to leave the impression of true understanding, resort to obscure technical details − but this is the theme of another stunt.

 

Fooling the masses – Stunt 10: Always emphasize the “interesting” part of your work!

Posted on by Georg Hager

(See the prelude for some general information on what this is all about)

Have you ever thought about how to improve the aerodynamic properties of bulldozers? Yes? Then this stunt is for you.

Bulldozer aeodynamics

Figure 1: Always focus on the interesting stuff!

Code optimization is quite rewarding at times. You put all your effort into mutilating a chunk of code beyond all recognition, and in the end you get a 50% speedup. Often enough, this work requires a great deal of  expertise at the interface between hardware and software. And you certainly want to present your success to your peers so they understand how advanced you are.

If you do such a thing, don’t bother picking the low-hanging fruits. When 95% of the run time goes into a boring loop nest which multiplies a dense matrix with a vector, there is not much you can do beyond the standard textbook stuff – or use a library altogether. However, if the remaining 5% consist of a convoluted communication scheme, you’ve found your target. The fact that the possible gain is at most 5% shouldn’t stop you, because it’s all a matter of presentation; see Fig. 1: 3D bars and a proper choice of scales on the axes guide the reader’s eye and help you drive home your point. It’s like presenting the optimized drag coefficient of your aerodynamically improved bulldozer on high gloss paper, and the boring technical data in small black print on a dark blue background, on the back page.

Interestingly, if the optimization target is communication, or synchronization, or any other non-computational overhead, this stunt can be made to work even better: When you strongly scale to a pointless number of workers, bringing down such overheads will give you arbitrary speedups, far beyond the factor of two that is theoretically possible if you restrict yourself to a sensible minimum parallel efficiency of 50%. Coming back to the heavy gear analogy, if you accelerate your bulldozer to 200 mph, the drag coefficient does become essential, and you may feel vindicated. It won’t solve any real-world problem, but hey, this is research after all.

Thanks to Gerhard Wellein for discovering this stunt in a paper. Many papers, actually.

Fooling the masses – Stunt 9: Boast massive speedups with accelerators!

Posted on by Georg Hager

(See the prelude for some general information on what this is all about)

GPGPUs, FPGAs, massively parallel single-chip multiprocessors: Accelerators are cool. The  prospect of boosting your application’s performance by 100x or more is mouthwatering, so you invest days and weeks, even months to port and tweak code and make it run fast on that shiny new piece of hardware. However, in the end the outcome may well be a meager 2.5x, or even much less, if you compare to well-optimized parallel code on a standard 2-socket server that everyone can handle.

What happened? Well, a direct peak performance and memory bandwidth comparison reveals that – even without considering Amdahl’s Law and overheads like PCIe transfers – 2x-4x is just what can be expected. Assuming a fair game, of course. And that’s exactly your chance to sugarcoat your mediocre results and fabricate a blazing success! Here are some hints:

  • Compare bleeding-edge accelerator technology with vintage CPUs. A Pentium III will be fine for most purposes today. If you can’t get your hands on one of those, give me a call.
  • Always compare the fully parallel accelerator code with a sequential (single-core) CPU version. You can even argue in favor of this, saying that “serial CPU code constitutes a well-defined baseline.” Be sure to rush ahead to the next slide before anyone dares to ask why you don’t use a single GPGPU “core” to make the comparison even more “well-defined.”
  • Use no optimizations whatsoever with the CPU compiler; -O0 will be fine. Also use a plain, proof-of-concept implementation of your algorithm with no manual code changes. It’s good scientific practice because a rock-bottom baseline is more meaningful!
  • Use single precision on the GPU but double precision on the CPU. That will effectively cut the CPU’s cache, the memory bandwidth, and its peak performance in half.
  • Accelerate only the kernels that are easily portable and report only their performance. Amdahl’s Law and communication overhead will be nothing but smoke and mirrors!
dens MVM GPU vs. CPU

Figure 1: If you do it right, a 200x speedup for the GPU is absolutely feasible (comparing nVIDIA Tesla C2050 and Intel Dual Xeon 5650, dense matrix-vector multiply, matrix size 4500×4500, Intel compiler V12 update 11)

These tricks should give you a competitive edge; 20x-200x reported speedup should be no problem, depending on your unscrupulousness.

See Fig. 1 for an instructive example: A dense matrix-vector multiply using a 4500×4500 matrix is memory-bound on a Fermi-type GPGPU as well as on a standard CPU-based system such as a dual-socket Xeon 5650 (“Westmere”). The GPGPU baseline of about 45 GFlop/s at single precision (SP) is completely in line with the achievable memory bandwidth of roughly 90 GB/s (4 bytes per multiply-add operation, assuming the right-hand side is mostly held in shared memory). To get a slight speedup you can switch off ECC (who cares for some flipped bits?) to get about 120 GB/s, corresponding to 60 GFlop/s. No need to mention that the PCIe transfers to get the RHS and result vectors to and from the device are neglected. So much for the GPGPU.

On the CPU side, the full dual-Xeon node has a bandwidth of close to 40 GB/s, leading to an SP performance of 19 GFlop/s – much too fast to be shown in public! Going to double precision cuts that in half, but there is more: Using just a single core gets you down another factor of 4, and disabling vectorization gives you a further 0.5x (since you’re now far away from any bandwidth limitation). Finally, turning off optimization altogether (-O0) lets you hit rock bottom at 248 MFlop/s – a whopping 214x slower than the best GPGPU result! Going to my old P-III is not even necessary any more.

Note the cunning use of a linear scale in the diagram; in this particular case it is much better than a log scale since it reduces the CPU performance “baseline” to a mere flyspeck (see Stunt 3 for more information on log vs. linear scale).

This stunt is essentially identical to #6 and #7 of Bailey’s original collection.

Fooling the masses – Stunt 8: Impress your audience with awe-inspiring accuracy

Posted on by Georg Hager

(See the prelude for some general information on what this is all about)

“Qualitative” statements make you cringe. Being a scientist, you know that science is all about accuracy. And what could be more accurate than numbers?  Good for you! There is ample opportunity for generating numbers in computer science. The art is to not block your audience’s view on the raw and unfiltered truth.

# cores Time (plain) [s] Time (optimized) [s] Speedup [%]
1 12.0435766 8.6198441 39.71919283
2 6.1179025 5.5901951 9.439874469
3 4.9041394 4.6966098 4.418710705
4 4.7002801 4.6912042 0.193466317

Fig. 1: Awesome! Up to 1.77992 times faster!

A simple example: When comparing performance numbers between different machines or code versions (see Stunt 7 for some amazing tricks you can play there), don’t hesitate to provide timings with sub-microsecond resolution, even if the results fluctuate like mad. Then state, in a slightly generalizing, “management-style” demeanor, that your “optimizations lead to a performance increase of up to 39.71919283%”, and refer to the data (see the table) to substantiate your assertion. In other words, forget what your high school teachers told you about significant digits and use the full power of your pocket calculator, even if a slide rule would also do the trick.

By the way, the same works for drawing conclusions from data in diagrams. Since it is sometimes hard to read off precise values from graphs, help your audience by stating the raw facts (see Fig. 1). Using boldface and jazzy colors will help to keep peoples’ attention on the important messages. Do not waste time and space dealing with lengthy explanations but let the numbers speak for themselves!

 

Fooling the masses – Stunt 7: Be creative when comparing scaled performance!

Posted on by Georg Hager

(See the prelude for some general information on what this is all about)

Scaling comparison

Fig. 1: Comparing performance vs. number of compute units (CPU sockets, GPUs) under limited scalability can yield any result if you do it right!

Strong scaling performance comparisons between standard (i.e., CPU-based) and accelerated (e.g., GPU-based) systems are particularly rewarding, since there is ample opportunity to generate any speedup number you want. It is not about fabricating a large GPU vs. CPU speedup – this special topic deserves its own stunt (stay tuned!). You can just as well apply it to a “baseline” vs. an “optimized” code on the same machine.

This may not be easy to digest on a theoretical level, so I explain it using an example: Assume we have a parallel CPU code whose strong scalability is limited by Amdahl’s Law with a serial runtime fraction of 0.3%. Hence, the speedup factor is limited to 333. A parallel GPGPU version of the same program achieves a speedup of 10 (compared to one CPU socket) for the parallel part, and of 5 for the sequential part. Figure 1 shows a performance comparison for up to 16384 compute units (CPU sockets or GPUs, respectively). As expected, the GPGPU speedup is 10 for small numbers of workers and goes down to 5 for large numbers – always comparing equal numbers of GPGPUs and CPU sockets. So far so good.

However, if you pose the right question, the factor can be much larger – even infinity, if you want: If we ask “How many CPU sockets do we need to outcompute 32 GPUs?”, the answer is clearly 2048. This means that your GPU-vs-CPU speedup is not 5, and also not 10, but almost 64 (see the dashed line in the graph). And its gets even better: Somewhere between 32 and 40 GPUs the required number of CPU sockets slips away to infinity! Your GPUs can go where no CPU has ever gone before (and never will)!

Efficiency comparison

Fig. 2: Plotting the parallel efficiency of both variants reveals the reason why the stunt works. Note that the efficiency of the GPU version is always worse.

The reason why this works is that you compare the two graphs at different numbers of workers – so different that the parallel efficiency for the CPU code is way down, whereas for the GPU code it is still acceptable (see Fig. 2)! Or to put it the other way round: If we compare equal numbers of workers on both sides, the parallel efficiency on the CPUs is always better.

The sensible way to compare achievable performance would be to set a certain practical lower limit for the parallel efficiency, e.g., 50%. From a computing center point of view, this is probably the lowest efficiency we’d be willing to accept, except in cases where the reason for going parallel is lack of memory. Scaling beyond that point does not make sense, not even for the purpose of comparison! A reasonable question to ask would be “How much performance can I get from the two codes at their respective minimum acceptable efficiency point?”

So the gist of it is: When comparing scaled performance data, ask the right questions and be the hero of the day. And don’t talk about parallel efficiency, or people might ask why you are wasting precious CPU cycles…

Thanks to Gerhard for discovering this extraordinary stunt!

 

 

Nach oben