# Georg Hager's Blog

## SC13 Intel Parallel Universe Challenge: The Squad moves to the final round!

Posted on by

We’ve made it to the finals! After a slight disadvantage in the trivia round, we managed to catch up and overtake the Korean team in the coding challenge. It was not strictly a surge of pure brilliance, but who cares… Tomorrow, Thursday 2pm Mountain Time it is, against the winner of the other semi-final (Rice University or NCSA/Illinois).

## SC13 Intel Parallel Universe Challenge: China eliminated!

Posted on by

The winning team – thinking hard!

Today, the “Gaussian Elimination Squad” has kicked the Chinese team out of the competition! So we’re up for the semi-finals tomorrow (Wednesday) at 4pm Mountain Time.

## SC13 Intel Parallel Universe Computing Challenge: Teams disclosed

Posted on by

Intel has finally disclosed the list of teams and team members for the “SC13 Parallel Universe Computing Challenge:” Five teams from the US (including LLNL and ANL), one from China, one from Korea, and one from Germany: http://software.intel.com/en-us/supercomputing#pid-20366-1690

Posted on by

It seems that Intel’s marketing division has some money to spend for charity – $25,000 to be exact. To make its spending as entertaining as possible, they have set up a stage show at SC13 in Denver, Colorado (Nov 17-22, 2013): The “Intel Parallel Universe Computing Challenge.” Eight teams from the US, Europe, China, and Korea are going to fight in a three-round sudden death tournament. The German team comprises members from Aachen, Darmstadt, Jülich, Dresden, Garching, and Erlangen, and happens to be led by me. According to the rules each match will be 30 minutes, in which the teams have to answer questions and optimize code. The audience will have the chance to answer questions and win prizes, too! So if you happen to be at SC13 and can spare the time, drop by the Intel booth (#2701, see floor plan) and see us fight! See the link above for the schedule. Honoring one of the greatest mathematicians of all time, the German team has chosen a peculiar war name: The “Gaussian Elimination Squad.” We will take on “Team Milky Way” from China in our first match on Tuesday, Nov 19th, 11:00am. We do not know why they have picked the name of a popular chocolate bar 😉 but probably it was in anticipation of what’s going to happen to them: Milky Way before elimination Milky Way after elimination Now you know why we call ourselves the “Gaussian Elimination Squad.” ## Intel vs. GCC for the OpenMP vector triad: Barrier shootout! Posted on by We use the Schönauer Vector Triad for most of our microbenchmarking. It’s a simple benchmark that everyone can write. It looks quite simple when parallelized with OpenMP: double precision, dimension(N) :: a,b,c,d ! initialization etc. omitted s = walltime() !$omp parallel private(R,i)
do R=1,NITER
!$omp do do i=1,N a(i) = b(i) + c(i) * d(i) enddo !$omp end do
enddo
!\$omp end parallel
e=walltime()
MFlops = R*N/(e-s)/1.e6

There are some details that are necessary to make it work as intended; you can read all about this in our book [1]. Usually we choose NITER for every N so that the runtime is a couple hundred milliseconds (so it can be measured accurately), and report performance for N ranging from small to large. The performance of the vector triad is determined by the data transfers, even when the data is in the L1 cache. In the parallel case we can additionally see the usual multicore bandwidth bottleneck(s).

The OpenMP parallelization adds its own overhead, of course. As it turns out, it is mostly concentrated in the implicit barrier at the end of the workshared loop in this case. So, when looking at the performance of the OpenMP code vs. N, we usually see that using more threads slows down the code if N is too small. We can even calculate the barrier overhead from this (again, the book will tell you the gory details).

The barrier overhead varies across compilers and compiler versions, and it depends on the positions of the threads in the machine (e.g., sharing caches or not). You can certainly measure it directly with a suitable microbenchmark [2], but it is quite interesting to see the impact directly in the vector triad performance data.

Here we see the OpenMP vector triad performance on one “Intel Xeon Westmere” socket (6 cores) running at about 2.8 GHz, comparing a reasonably current Intel compiler with g++ 4.7.0. With the Intel compiler the sequential code achieves “best possible” performance within the L1 cache (4 flops in 3 cycles). With OpenMP turned on you cannot see this, of course, since the barrier overhead dominates for loop lengths below a couple of 1000s.

Looking at the results for the two compilers we see that GCC has not learned anything over the last five years (this is for how long we have been comparing compilers in terms of OpenMP barrier overhead): The barrier takes roughly a factor of 20 longer with gcc than with the Intel compiler. Comparing with the ECM performance model [3] for the vector triad we see that the Intel compiler’s barrier is fast enough to at least get near the performance limit in the L2 cache (blue dashed line). Both compilers are on par where it’s easy, i.e., in L3 cache and memory, where the loop is so long that the overhead is negligible.

Note that the bad performance of g++ in this benchmark is not due to some “magic” compiler option that I’ve missed. It’s the devastatingly slow OpenMP barrier. For reference, these are the compiler options I have used:

icpc -openmp -Ofast -xHOST -fno-alias ...
g++ -fopenmp -O3 -msse4.2 -fargument-noalias-global ...

In conclusion, the GCC OpenMP barrier is still slooooow. If you have “short” loops to parallelize, GCC is not for you. Of course you might be able to work around such problems (mutilating a popular saying from one of the Great Old Ones: “If synchronization is the problem, don’t synchronize!”), but it’s still good to be aware of them.

If you are interested in concrete numbers you can take a look at any of our recent tutorials [4], where we always include some recent barrier measurements with current compilers.

[1] G. Hager and G. Wellein: Introduction to High Performance Computing for Scientists and Engineers. CRC Press, 2010.

[3] G. Hager, J. Treibig, J. Habich, and G. Wellein: Exploring performance and power properties of modern multicore chips via simple machine models. Computation and Concurrency: Practice and Experience, DOI: 10.1002/cpe.3180 (2013), Preprint: arXiv:1208.2908

[4] My Tutorials blog page

## Fooling the masses – Stunt 14: Secretly use fancy hardware setups and software tricks!

Posted on by

(See the prelude for some general information on what this is all about)

You’re not the average computer or domain scientist, who has to buy off-the-shelf white-box hardware and run GCC-compiled code on it. You deserve something better, shinier, fancier, which, incidentally, makes your code run faster. And even without special hardware there are those neat little dirty tricks you only get to know when diving deep into the hardware-software interactions. In a sense, it’s the complement of slowing down the baseline (see stunt 2 and stunt 9). The only thing you need to care about is make sure nobody notices. Here’s what can give you a decisive edge:

Figure 1: Liquid nitrogen cooling is overclocker’s heaven! (Image by Rico Shen)

Pimped-up hardware: Overclocked, nitrogen-cooled (see Fig. 1), high-bin hardware that takes its own power plant to run is a rewarding platform. Since you do not care about reproducibility of results, the fact that nobody else has access to your hardware shouldn’t give you sleepless nights.

No safety nets: ECC protection and short memory refresh cycles are for cowards! Bleeding-edge research has no use for pathetic whiners who prefer reliability over performance.

Quiet machine: Ask your friendly system administrators to stop the queues for everybody else and let you have the whole machine for your own benchmarking. That will lift the burden of statistical analysis off your shoulders.

Eliminate inconvenient details: Away with that bloody routine which takes only 10% of serial runtime but is so awfully hard to optimize. Who needs boundary handling anyway?

Low-level programming: “A=B+C” is lame. “vaddpd (%r13,%r8,8), %ymm1, %ymm2″ rocks your world. You think assembly, you breathe machine code, and you use C intrinsics only on bad hair day. Keep your secrets and let the losers deal with the deficiencies of their rotten GNU compilers!

This stunt is a combination of #3, #10 and #11 from Bailey’s original collection.

## Fooling the masses – Stunt 13: If they get you cornered, blame it all on OS jitter!

Posted on by

(See the prelude for some general information on what this is all about)

Ok, so you’re in real trouble: The pimpled smarty-pants in the audience has figured out that the data you have shown in your talk was a pile of sugar-coated white noise.  Or the conference deadline draws near, all you have is  a pile of useless data, and your boss desperately needs another publication. Or your paper is already beyond the page limit, and you need a convenient shortcut.

In such a situation it is good to have a backup plan that will allow you to save your neck and still look good. Fortunately, computers are so complex that there is ample opportunity to blame something that is not your fault.

Enter the “technical-detail-not-under-my-control.” Here are your friends – each is followed by a typical usage example:

Stupid compilers: “Our version of the code shows slightly worse single-thread performance, which is presumably due to the limited optimization capabilities of the compiler.

If the compiler can’t fix it, who can? And nobody wants to inspect assembly code anyway.

Out-of-order execution (or lack thereof): “Processor A shows better performance than processor B possibly because of A’s superior out-of-order processing capabilities.

That’s a truly ingenious statement, mainly because it is so very hard to prove. Among friends you may just bluntly state that B has a lousy design.

L1 instruction cache misses: “As shown in Table 1, our optimized code version B is faster because it has 20% fewer L1 instruction cache misses than version A.

Hardly anyone discusses L1 instruction cache misses, because they hardly ever do any harm. Thus they are the ideal scapegoat if you don’t know the exact reason for some effect. Be careful with presenting performance counter data, though: If the instruction count reveals that there is only one L1I miss every 10000 instructions, that’s not what you want to show in public…

TLB misses:Performance shows a gradual breakdown with growing problem size. This may be caused by excessive penalties due to TLB misses.

TLB misses are always a good choice because they have a reputation to be even more evil than cache misses! You may want to add that the time was too short to prove your statement by  measuring the misses and estimate their actual impact, and then search for the real cause.

Bad prefetching: “Performance does not scale beyond four cores on a socket. We attribute this to problems with the prefetching hardware.

Prefetching is necessary. Bad prefetching means bad performance because the latency (gasp!) cannot be hidden. Other reasons for performance saturation such as a bandwidth bottleneck or load imbalance are much too mundane to be interesting.

Bank conflicts: “Processor X has only [sic!] eight cache banks, which may explain the large fluctuations in performance vs. problem size.

Wow. Cache banks. And you are the one to know that eight just won’t cut it.

Transient network errors: “In contrast to other high-performance networks such as Cray’s Gemini, InfiniBand does not have link-level error detection, which impacts the scalability of our highly parallel code.

You are a good boy/girl, so you have read all of Cray’s marketing brochures. Link-level error detection is soooo cool, it just has to be what you need to scale.

OS jitter: “Beyond eight nodes our implementation essentially stops scaling. Since the cluster runs vanilla [insert your dearly hated distro here] Linux OS images, operating system noise (“OS jitter”) is the likely cause for this failure.

My all-time favorite. This Linux c**p is beyond contempt.

The list can be extended, of course. Whatever technical detail you choose, make sure to pick one that is widely misunderstood or has a reputation of being evil, very complicated, hard to understand, and whose responsibility for your problem is next to impossible to prove. Note the peculiar diction; be noncommittal and use “may explain,” “could be,” “likely,” “presumably,” etc., so you won’t be accountable in case you’re proven wrong.

## Fooling the masses – Stunt 12: Redefine “performance” to suit your needs!

Posted on by

(See the prelude for some general information on what this is all about)

There is actually a very clear definition of the term “performance” in HPC: “performance” is “work” divided by “time.” You may think about “work” as “the problem”; occasionally, Flops is a possible (but easily manipulable) measure. “Time” is the overall wall-clock time to do the work, including everything that needs to be done but counts as overhead, such as communication and synchronization. If you look at performance with respect to parallel resources, i.e., if you want to know how performance scales with the number of workers, the same formula applies – to strong and to weak scaling alike.

Now sometimes the performance results for your parallel program are … mediocre. There may be two reasons for this:

• Scalability is good but your single-core or single-node performance sucks. This is may be a consequence of your applying stunt 2, either by accident or on purpose. If you only show speedups (see stunt 1) you will be fine; however, sometimes it is not so easy to get away with this.
• Single-core and single-socket performance is good (i.e., you understand and hit the relevant bottleneck[s]), but scalability is bad. You know the culprits: load imbalance, insufficient parallel work, communication, synchronization.

Either way, you probably don’t want to show actual performance numbers or data about time to solution. As it happens there is a host of options you can choose from. You can certainly play around with the “time” component in the performance calculation, as shown in stunt 5 and stunt 9, or quietly invoke weak scaling as shown in stunt 4. Here are some popular and more imaginative examples:

Employ “floptimization:” Throw in some extra Flops to make the CPU perform more “work” at hardly any extra cost; often there is at least some headroom in the floating-point pipelines when running real applications. Accumulating something in a register is a classic:

  for(s=0.0, i=0; i<N; ++i) {
p[i] = f(q[i]); // actual work
s += p[i];      // floptimization
}

Even if the sum over all elements of p[] is not needed at all, it’s one more Flop per iteration, it boosts your flop rate, and it will usually not impact your time to solution! If you do it right, only the sky is the limit.

Use manpower metrics: Developing code is expensive, so you can take manpower into account. State that your superior hardware or software environment enabled you to get a working code in less than one week, which amounts to so and so many “man-hours per GFlop/s.” This can be extended to other popular software metrics.

Play the money card: “GFlop/s per Dollar” may also be useful, especially if you compare different systems such as your average off-the-shelf departmental cluster and the big DoE-paid iron.

Go green: Since everyone in HPC today is all mad about saving energy, use “Joules per Flop,” “GFlop/s per Watt,” “CO2 equivalents,” “Joule-seconds,” “Joule-seconds-squared,” or any other combination of metrics that shows the superiority of your approach in terms of “green” or “infrastructure-aware” computing. Moreover, energy to solution is almost proportional to runtime, so you can rename your paper on “Performance Optimization of a Kolonovski-Butterfield Solver” to “Energy Optimization of a Kolonovski-Butterfield Solver” and turn a boring case study into bleeding-edge research.

Blame the wimpy hardware: For multi- or many-core chips, per-core or per-thread metrics mask the inherent bottlenecks and are great for bashing your competitors. For instance, when comparing a multi-core CPU with a GPGPU you could declare that “the available bandwidth per thread is 122 times higher on the CPU than on the GPGPU.” Oh wait – make that 121.68 times.

Figure 1: Windows task manager is your friend: Four cores, all of them 100% utilized – that’s performance!

Boast frantic activity: MIPS (millions of instructions per second) or, equivalently, IPC (instructions per cycle) are perfect higher-is-better metrics if true performance is low. IPC can be boosted on purpose by several means: You can use a highly abstracting language to make the compiler generate a vast amount of instructions for simple things like loading a value from memory, or disable SIMD so that more instructions are executed for the same amount of work. Waiting in OpenMP barriers or MPI functions is also good for a large IPC count, since it often involves spin-waiting loops. You can use special options to disable compiler optimizations which have the potential to change numerical results (such as common subexpression elimination). But take care: If that means moving slow operations like floating-point divides into an inner loop, the IPC value will go down dramatically.

Fight the evil: State that your new code or new algorithm shows a factor of X fewer cache misses compared to the baseline. Cache misses are the black plague of HPC, so nobody will doubt your success. And they are just as easy to manipulate as Flops. Combining with the “go green” strategy is straightforward: Cache misses cost vast amounts of energy (they say), so you can be green and good at the same time!

Wow the management: State that “all processors are 100% utilized, so the code makes perfect use of the available resources.” Windows task manager is the perfect tool to show off with a stunt like this (see Fig. 1), but “top” under Linux works fine, too:

  PID VIRT RES  SHR %CPU %MEM  TIME+  P COMMAND
2318 492m 457m 596 98.6 22.9 0:35.52 2 a.out
2319 492m 457m 596 98.6 22.9 0:35.53 0 a.out
2321 492m 457m 596 98.6 22.9 0:35.50 1 a.out
2320 492m 457m 596 97.6 22.9 0:35.44 3 a.out

In summary, metrics are your wishing well: Find the right metric and any code will look good.

This stunt is essentially #9 of Bailey’s original “Twelve ways to fool the masses.”

## Fooling the masses – Stunt 11: Show data! Plenty. And then some.

Posted on by

(See the prelude for some general information on what this is all about)

Figure 1: Lots of variants of an experiment, lots of machines, no idea for interpretation – why not show it all?

Computers produce data. Parallel computers produce more data. And doing performance experiments with parallel computers produces even more data, since we often conduct several runs with different code variants on different machines. Now imagine a situation where you’ve got all this data lying around on your disk, but you can’t make any sense of it. Since you wouldn’t bother to establish even a coarse performance model (your setup is just too complex for this kind of bean counting), you don’t know whether some particular performance measurement is “good” or “bad” on some particular machine, or how performance should change when some parameter is altered.

Why not put the cart before the horse and show all the data? This has decisive advantages: You can display all the work you did, without the embarrassing message that five or six numbers is all the relevant data you got from your research. Furthermore, it gives you the opportunity to add “discussion” at any desirable level of detail. For instance, if the page limit for submitting to your favorite conference is ten pages, and you have five pages full of messy bar graphs (such as the one shown in Fig. 1), you can easily fill the rest with meaningless “explanations” like “Variant 11 shows about 90% performance advantage on Machine G over Variant 10 on Machine F while, surprisingly, Variant 9 on Machine A is 6% slower than Variant 7 on Machine C.” You may want to write “92.23%” and “5.982%” here – see Stunt 8 for the reason.

In Summary, you can make up for your lack of insight by showing data. Plenty of it. Swamp your readers or your audience with hundreds of colorful bar graphs, scatter plots, timing diagrams, or whatever your favorite data visualization tool has to offer, and describe them at length. If you really want to leave the impression of true understanding, resort to obscure technical details − but this is the theme of another stunt.

## Fooling the masses – Stunt 10: Always emphasize the “interesting” part of your work!

Posted on by

(See the prelude for some general information on what this is all about)

Have you ever thought about how to improve the aerodynamic properties of bulldozers? Yes? Then this stunt is for you.

Figure 1: Always focus on the interesting stuff!

Code optimization is quite rewarding at times. You put all your effort into mutilating a chunk of code beyond all recognition, and in the end you get a 50% speedup. Often enough, this work requires a great deal of  expertise at the interface between hardware and software. And you certainly want to present your success to your peers so they understand how advanced you are.

If you do such a thing, don’t bother picking the low-hanging fruits. When 95% of the run time goes into a boring loop nest which multiplies a dense matrix with a vector, there is not much you can do beyond the standard textbook stuff – or use a library altogether. However, if the remaining 5% consist of a convoluted communication scheme, you’ve found your target. The fact that the possible gain is at most 5% shouldn’t stop you, because it’s all a matter of presentation; see Fig. 1: 3D bars and a proper choice of scales on the axes guide the reader’s eye and help you drive home your point. It’s like presenting the optimized drag coefficient of your aerodynamically improved bulldozer on high gloss paper, and the boring technical data in small black print on a dark blue background, on the back page.

Interestingly, if the optimization target is communication, or synchronization, or any other non-computational overhead, this stunt can be made to work even better: When you strongly scale to a pointless number of workers, bringing down such overheads will give you arbitrary speedups, far beyond the factor of two that is theoretically possible if you restrict yourself to a sensible minimum parallel efficiency of 50%. Coming back to the heavy gear analogy, if you accelerate your bulldozer to 200 mph, the drag coefficient does become essential, and you may feel vindicated. It won’t solve any real-world problem, but hey, this is research after all.

Thanks to Gerhard Wellein for discovering this stunt in a paper. Many papers, actually.