# Georg Hager's Blog

## Blaze 2.1 has been released

Posted on by

Blaze is an open-source, high-performance C++ math library for dense and sparse arithmetic based on Smart Expression Templates. Just in time for ISC’14, Blaze 2.1 has been released. The focus of this latest release is the improvement and extension of the shared memory parallelization capabilities: In addition to OpenMP parallelization, Blaze 2.1 now also enables shared memory parallelization based on C++11 and Boost threads.  Additionally, the underlying architecture for the parallel execution and automatic, context-sensitive optimization has been significantly improved. This not only improves the performance but also allows for an efficient parallel handling of computations involving block-structured vectors and matrices.

All of the Blaze documentation and code can be downloaded at https://code.google.com/p/blaze-lib/.

## Bill Gropp on “Engineering for Performance in High Performance Computing”

Posted on by

At the PASC14 conference in Zurich, Bill Gropp gave a great talk on Performance Engineering, the way it should be done. 100% along the lines of what we try to teach people in our tutorials! https://www.youtube.com/watch?v=sadfSARXSC0

## Gaussian Elimination Squad wins Intel Parallel Universe Computing Challenge at SC13!

Posted on by

Final team bracket for the Intel Parallel Universe Computing Challenge

We’ve made it! The final round against the “Coding Illini” team, comprising members from the University of Illinois and the National Center for Supercomputing Applications (NCSA), took place on Thursday, November 21, at 2pm Mountain Time. We were almost 30% ahead of the other team after the trivia, and managed to extend that head start in the coding challenge, which required the parallelization of a histogram computation. This involved quite a lot of typing, since OpenMP does not allow private clauses and reductions for arrays in C++. My request for a standard-sized keyboard was denied so we had to cope with a compact one, which made fast typing a little hard. As you can see in the video, a large crowd had gathered around the Intel booth to cheer for their teams.

Remember, each match consisted of two parts: The first was a trivia challenge with 15 questions about computing, computer history, programming languages, and the SC conference series. We were given 30 seconds per question, and the faster we logged in the correct answer (out of four), the more points we got. The second part was a coding challenge, which gave each team ten minutes to speed up a piece of code they had never seen before as much as possible. On top of it all, the audience could watch what we were doing on two giant screens.

The Gaussian Elimination Squad: (left to right) Michael Kluge, Michael Ott, Joachim Protze, Christian Iwainsky, Christian Terboven, Guido Juckeland, Damian Alvarez, Gerhard Wellein, and Georg Hager (Image: S. Wienke, RWTH Aachen)

In the first match against the Chinese “Milky Way” team, the code was a 2D nine-point Jacobi solver; an easy task by all means. We did well on that, with the exception that we didn’t think of using non-temporal stores for the target array, something that Gerhard and I had described in detail the day before in our tutorial! Fortunately, the Chinese had the same problem and we came out first. In the semi-final against Korea, the code was a multi-phase lattice-Boltzmann solver written in Fortran 90. After a first surge of optimism (after all, LBM has been a research topic in Erlangen for more than a decade) we saw that the code was quite complex. We won, but we’re still not quite sure whether it was sheer luck (and the Korean team not knowing Fortran). Compared to that, the histogram code in the final match was almost trivial apart from the typing problem.

According to an enthusiastic spectator we had “the nerdiest team name ever!”

After winning the challenge, the “Gaussian Elimination Squad” proudly presented a check for a $25,000 donation to the benefit of the Philippines Red Cross. I think that all teams did a great job, especially when taking into account the extreme stress levels caused by the insane time limit, and everyone watching the (sometimes not-so-brilliant) code that we wrote! It’s striking that none of the US HPC leadership facilities made it to the final match – they all dropped out in the first round (see the team bracket above). The Gaussian Elimination Squad represented the German HPC community, with members from RWTH Aachen (Christian Terboven and Joachim Protze), Jülich Supercomputing Center (Damian Alvarez), ZIH Dresden (Michael Kluge and Guido Juckeland), TU Darmstadt (Christian Iwainsky), Leibniz Supercomputing Center (Michael Ott), and Erlangen Regional Computing Center (Gerhard Wellein and Georg Hager). Only four team members were allowed per match, but the others could help by shouting advice and answers they thought were correct. ## SC13 Intel Parallel Universe Challenge: The Squad moves to the final round! Posted on by We’ve made it to the finals! After a slight disadvantage in the trivia round, we managed to catch up and overtake the Korean team in the coding challenge. It was not strictly a surge of pure brilliance, but who cares… Tomorrow, Thursday 2pm Mountain Time it is, against the winner of the other semi-final (Rice University or NCSA/Illinois). ## SC13 Intel Parallel Universe Challenge: China eliminated! Posted on by The winning team – thinking hard! Today, the “Gaussian Elimination Squad” has kicked the Chinese team out of the competition! So we’re up for the semi-finals tomorrow (Wednesday) at 4pm Mountain Time. ## SC13 Intel Parallel Universe Computing Challenge: Teams disclosed Posted on by Intel has finally disclosed the list of teams and team members for the “SC13 Parallel Universe Computing Challenge:” Five teams from the US (including LLNL and ANL), one from China, one from Korea, and one from Germany: http://software.intel.com/en-us/supercomputing#pid-20366-1690 ## Gaussian Elimination Squad to fight in “SC13 Intel Parallel Universe Computing Challenge” Posted on by It seems that Intel’s marketing division has some money to spend for charity –$25,000 to be exact. To make its spending as entertaining as possible, they have set up a stage show at SC13 in Denver, Colorado (Nov 17-22, 2013): The “Intel Parallel Universe Computing Challenge.”

Eight teams from the US, Europe, China, and Korea are going to fight in a three-round sudden death tournament. The German team comprises members from Aachen, Darmstadt, Jülich, Dresden, Garching, and Erlangen, and happens to be led by me. According to the rules each match will be 30 minutes, in which the teams have to answer questions and optimize code. The audience will have the chance to answer questions and win prizes, too! So if you happen to be at SC13 and can spare the time, drop by the Intel booth (#2701, see floor plan) and see us fight! See the link above for the schedule.

Honoring one of the greatest mathematicians of all time, the German team has chosen a peculiar war name: The “Gaussian Elimination Squad.” We will take on “Team Milky Way” from China in our first match on Tuesday, Nov 19th, 11:00am. We do not know why they have picked the name of a popular chocolate bar but probably it was in anticipation of what’s going to happen to them:

Milky Way before elimination

Milky Way after elimination

Now you know why we call ourselves the “Gaussian Elimination Squad.”

## Intel vs. GCC for the OpenMP vector triad: Barrier shootout!

Posted on by

We use the Schönauer Vector Triad for most of our microbenchmarking. It’s a simple benchmark that everyone can write. It looks quite simple when parallelized with OpenMP:

double precision, dimension(N) :: a,b,c,d
! initialization etc. omitted
s = walltime()
!$omp parallel private(R,i) do R=1,NITER !$omp do
do i=1,N
a(i) = b(i) + c(i) * d(i)
enddo
!$omp end do enddo !$omp end parallel
e=walltime()
MFlops = R*N/(e-s)/1.e6

There are some details that are necessary to make it work as intended; you can read all about this in our book [1]. Usually we choose NITER for every N so that the runtime is a couple hundred milliseconds (so it can be measured accurately), and report performance for N ranging from small to large. The performance of the vector triad is determined by the data transfers, even when the data is in the L1 cache. In the parallel case we can additionally see the usual multicore bandwidth bottleneck(s).

The OpenMP parallelization adds its own overhead, of course. As it turns out, it is mostly concentrated in the implicit barrier at the end of the workshared loop in this case. So, when looking at the performance of the OpenMP code vs. N, we usually see that using more threads slows down the code if N is too small. We can even calculate the barrier overhead from this (again, the book will tell you the gory details).

The barrier overhead varies across compilers and compiler versions, and it depends on the positions of the threads in the machine (e.g., sharing caches or not). You can certainly measure it directly with a suitable microbenchmark [2], but it is quite interesting to see the impact directly in the vector triad performance data.

Here we see the OpenMP vector triad performance on one “Intel Xeon Westmere” socket (6 cores) running at about 2.8 GHz, comparing a reasonably current Intel compiler with g++ 4.7.0. With the Intel compiler the sequential code achieves “best possible” performance within the L1 cache (4 flops in 3 cycles). With OpenMP turned on you cannot see this, of course, since the barrier overhead dominates for loop lengths below a couple of 1000s.

Looking at the results for the two compilers we see that GCC has not learned anything over the last five years (this is for how long we have been comparing compilers in terms of OpenMP barrier overhead): The barrier takes roughly a factor of 20 longer with gcc than with the Intel compiler. Comparing with the ECM performance model [3] for the vector triad we see that the Intel compiler’s barrier is fast enough to at least get near the performance limit in the L2 cache (blue dashed line). Both compilers are on par where it’s easy, i.e., in L3 cache and memory, where the loop is so long that the overhead is negligible.

Note that the bad performance of g++ in this benchmark is not due to some “magic” compiler option that I’ve missed. It’s the devastatingly slow OpenMP barrier. For reference, these are the compiler options I have used:

icpc -openmp -Ofast -xHOST -fno-alias ...
g++ -fopenmp -O3 -msse4.2 -fargument-noalias-global ...

In conclusion, the GCC OpenMP barrier is still slooooow. If you have “short” loops to parallelize, GCC is not for you. Of course you might be able to work around such problems (mutilating a popular saying from one of the Great Old Ones: “If synchronization is the problem, don’t synchronize!”), but it’s still good to be aware of them.

If you are interested in concrete numbers you can take a look at any of our recent tutorials [4], where we always include some recent barrier measurements with current compilers.

[1] G. Hager and G. Wellein: Introduction to High Performance Computing for Scientists and Engineers. CRC Press, 2010.

[3] G. Hager, J. Treibig, J. Habich, and G. Wellein: Exploring performance and power properties of modern multicore chips via simple machine models. Computation and Concurrency: Practice and Experience, DOI: 10.1002/cpe.3180 (2013), Preprint: arXiv:1208.2908

[4] My Tutorials blog page

## Fooling the masses – Stunt 14: Secretly use fancy hardware setups and software tricks!

Posted on by

(See the prelude for some general information on what this is all about)

You’re not the average computer or domain scientist, who has to buy off-the-shelf white-box hardware and run GCC-compiled code on it. You deserve something better, shinier, fancier, which, incidentally, makes your code run faster. And even without special hardware there are those neat little dirty tricks you only get to know when diving deep into the hardware-software interactions. In a sense, it’s the complement of slowing down the baseline (see stunt 2 and stunt 9). The only thing you need to care about is make sure nobody notices. Here’s what can give you a decisive edge:

Figure 1: Liquid nitrogen cooling is overclocker’s heaven! (Image by Rico Shen)

Pimped-up hardware: Overclocked, nitrogen-cooled (see Fig. 1), high-bin hardware that takes its own power plant to run is a rewarding platform. Since you do not care about reproducibility of results, the fact that nobody else has access to your hardware shouldn’t give you sleepless nights.

No safety nets: ECC protection and short memory refresh cycles are for cowards! Bleeding-edge research has no use for pathetic whiners who prefer reliability over performance.

Quiet machine: Ask your friendly system administrators to stop the queues for everybody else and let you have the whole machine for your own benchmarking. That will lift the burden of statistical analysis off your shoulders.

Eliminate inconvenient details: Away with that bloody routine which takes only 10% of serial runtime but is so awfully hard to optimize. Who needs boundary handling anyway?

Low-level programming: “A=B+C” is lame. “vaddpd (%r13,%r8,8), %ymm1, %ymm2″ rocks your world. You think assembly, you breathe machine code, and you use C intrinsics only on bad hair day. Keep your secrets and let the losers deal with the deficiencies of their rotten GNU compilers!

This stunt is a combination of #3, #10 and #11 from Bailey’s original collection.

## Fooling the masses – Stunt 13: If they get you cornered, blame it all on OS jitter!

Posted on by

(See the prelude for some general information on what this is all about)

Ok, so you’re in real trouble: The pimpled smarty-pants in the audience has figured out that the data you have shown in your talk was a pile of sugar-coated white noise.  Or the conference deadline draws near, all you have is  a pile of useless data, and your boss desperately needs another publication. Or your paper is already beyond the page limit, and you need a convenient shortcut.

In such a situation it is good to have a backup plan that will allow you to save your neck and still look good. Fortunately, computers are so complex that there is ample opportunity to blame something that is not your fault.

Enter the “technical-detail-not-under-my-control.” Here are your friends – each is followed by a typical usage example:

Stupid compilers: “Our version of the code shows slightly worse single-thread performance, which is presumably due to the limited optimization capabilities of the compiler.

If the compiler can’t fix it, who can? And nobody wants to inspect assembly code anyway.

Out-of-order execution (or lack thereof): “Processor A shows better performance than processor B possibly because of A’s superior out-of-order processing capabilities.

That’s a truly ingenious statement, mainly because it is so very hard to prove. Among friends you may just bluntly state that B has a lousy design.

L1 instruction cache misses: “As shown in Table 1, our optimized code version B is faster because it has 20% fewer L1 instruction cache misses than version A.

Hardly anyone discusses L1 instruction cache misses, because they hardly ever do any harm. Thus they are the ideal scapegoat if you don’t know the exact reason for some effect. Be careful with presenting performance counter data, though: If the instruction count reveals that there is only one L1I miss every 10000 instructions, that’s not what you want to show in public…

TLB misses:Performance shows a gradual breakdown with growing problem size. This may be caused by excessive penalties due to TLB misses.

TLB misses are always a good choice because they have a reputation to be even more evil than cache misses! You may want to add that the time was too short to prove your statement by  measuring the misses and estimate their actual impact, and then search for the real cause.

Bad prefetching: “Performance does not scale beyond four cores on a socket. We attribute this to problems with the prefetching hardware.

Prefetching is necessary. Bad prefetching means bad performance because the latency (gasp!) cannot be hidden. Other reasons for performance saturation such as a bandwidth bottleneck or load imbalance are much too mundane to be interesting.

Bank conflicts: “Processor X has only [sic!] eight cache banks, which may explain the large fluctuations in performance vs. problem size.

Wow. Cache banks. And you are the one to know that eight just won’t cut it.

Transient network errors: “In contrast to other high-performance networks such as Cray’s Gemini, InfiniBand does not have link-level error detection, which impacts the scalability of our highly parallel code.

You are a good boy/girl, so you have read all of Cray’s marketing brochures. Link-level error detection is soooo cool, it just has to be what you need to scale.

OS jitter: “Beyond eight nodes our implementation essentially stops scaling. Since the cluster runs vanilla [insert your dearly hated distro here] Linux OS images, operating system noise (“OS jitter”) is the likely cause for this failure.

My all-time favorite. This Linux c**p is beyond contempt.

The list can be extended, of course. Whatever technical detail you choose, make sure to pick one that is widely misunderstood or has a reputation of being evil, very complicated, hard to understand, and whose responsibility for your problem is next to impossible to prove. Note the peculiar diction; be noncommittal and use “may explain,” “could be,” “likely,” “presumably,” etc., so you won’t be accountable in case you’re proven wrong.