Georg Hager's Blog

Random thoughts on High Performance Computing


Fooling the masses – Stunt 12: Redefine “performance” to suit your needs!

(See the prelude for some general information on what this is all about)

There is actually a very clear definition of the term “performance” in HPC: “performance” is “work” divided by “time.” You may think about “work” as “the problem”; occasionally, Flops is a possible (but easily manipulable) measure. “Time” is the overall wall-clock time to do the work, including everything that needs to be done but counts as overhead, such as communication and synchronization. If you look at performance with respect to parallel resources, i.e., if you want to know how performance scales with the number of workers, the same formula applies – to strong and to weak scaling alike.
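In other words, P = W/T. If, for example, a run performs 10^12 floating-point operations in 500 seconds of wall-clock time, its performance is 2 GFlop/s, no matter how the work was distributed among the workers.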

Now sometimes the performance results for your parallel program are … mediocre. There may be two reasons for this:

  • Scalability is good but your single-core or single-node performance sucks. This may be a consequence of applying stunt 2, either by accident or on purpose. If you only show speedups (see stunt 1) you will be fine; however, sometimes it is not so easy to get away with this.
  • Single-core and single-socket performance is good (i.e., you understand and hit the relevant bottleneck[s]), but scalability is bad. You know the culprits: load imbalance, insufficient parallel work, communication, synchronization.

Either way, you probably don’t want to show actual performance numbers or data about time to solution. As it happens there is a host of options you can choose from. You can certainly play around with the “time” component in the performance calculation, as shown in stunt 5 and stunt 9, or quietly invoke weak scaling as shown in stunt 4. Here are some popular and more imaginative examples:

Employ “floptimization:” Throw in some extra Flops to make the CPU perform more “work” at hardly any extra cost; often there is at least some headroom in the floating-point pipelines when running real applications. Accumulating something in a register is a classic:

  for(s=0.0, i=0; i<N; ++i) {
    p[i] = f(q[i]); // actual work
    s += p[i];      // floptimization
  }

Even if the sum over all elements of p[] is not needed at all, it’s one more Flop per iteration, it boosts your flop rate, and it will usually not impact your time to solution! If you do it right, only the sky is the limit.
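One caveat from practice: make sure the compiler cannot prove that s is dead, or the extra Flops may simply be optimized away; using the sum once, e.g., by printing it at the end of the program, is usually enough.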

Use manpower metrics: Developing code is expensive, so you can take manpower into account. State that your superior hardware or software environment enabled you to get a working code in less than one week, which amounts to so and so many “man-hours per GFlop/s.” This can be extended to other popular software metrics.

Play the money card: “GFlop/s per Dollar” may also be useful, especially if you compare different systems such as your average off-the-shelf departmental cluster and the big DoE-paid iron.

Go green: Since everyone in HPC today is all mad about saving energy, use “Joules per Flop,” “GFlop/s per Watt,” “CO2 equivalents,” “Joule-seconds,” “Joule-seconds-squared,” or any other combination of metrics that shows the superiority of your approach in terms of “green” or “infrastructure-aware” computing. Moreover, energy to solution is almost proportional to runtime, so you can rename your paper on “Performance Optimization of a Kolonovski-Butterfield Solver” to “Energy Optimization of a Kolonovski-Butterfield Solver” and turn a boring case study into bleeding-edge research.
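This is not even much of a stretch: energy to solution is just average power draw times runtime (E = P_avg × T), so at roughly constant power, every runtime optimization doubles as an energy optimization.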

Blame the wimpy hardware: For multi- or many-core chips, per-core or per-thread metrics mask the inherent bottlenecks and are great for bashing your competitors. For instance, when comparing a multi-core CPU with a GPGPU you could declare that “the available bandwidth per thread is 122 times higher on the CPU than on the GPGPU.” Oh wait – make that 121.68 times.
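The arithmetic behind such statements is simple (numbers invented for illustration): eight CPU threads sharing 40 GB/s get 5 GB/s each, while 10,000 GPGPU threads sharing 400 GB/s get a measly 0.04 GB/s each; that is a factor of 125 "in favor" of the CPU, although the GPGPU has ten times the aggregate bandwidth.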

[Image: taskmgr.png]

Figure 1: Windows task manager is your friend: Four cores, all of them 100% utilized – that’s performance!

Boast frantic activity: MIPS (millions of instructions per second) or, equivalently at fixed clock speed, IPC (instructions per cycle) are perfect higher-is-better metrics if true performance is low. IPC can be boosted on purpose by several means: You can use a highly abstracting language to make the compiler generate a vast amount of instructions for simple things like loading a value from memory, or disable SIMD so that more instructions are executed for the same amount of work (a minimal sketch of this trick follows below). Waiting in OpenMP barriers or MPI functions is also good for a large IPC count, since it often involves spin-waiting loops. You can use special options to disable compiler optimizations that reduce the instruction count (such as common subexpression elimination) or that have the potential to change numerical results. But take care: If that means moving slow operations like floating-point divides into an inner loop, the IPC value will go down dramatically.
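To see the SIMD variant of this trick in action, here is a minimal sketch; the code, the compiler flags, and the use of Linux "perf" are illustrative assumptions, not a recipe from any particular study:

  /* sum.c: the same reduction, built once with and once without SIMD.
     Suggested (hypothetical) build and measurement commands:
       gcc -O3 -ffast-math sum.c -o sum_simd        # vectorized reduction
       gcc -O3 -fno-tree-vectorize sum.c -o sum_scalar
       perf stat -e instructions,cycles ./sum_scalar
     The scalar binary retires several times more instructions for the
     same work, so its IPC and MIPS figures look "better" although the
     time to solution is no shorter. */
  #include <stdio.h>
  #include <stdlib.h>

  int main(void) {
    const long n = 100000000;                /* 100M elements, ~800 MB */
    double *q = malloc((size_t)n * sizeof(double));
    if (q == NULL) return 1;
    double s = 0.0;
    for (long i = 0; i < n; ++i) q[i] = 1.0 / (double)(i + 1);
    for (long i = 0; i < n; ++i) s += q[i];  /* the measured kernel */
    printf("sum = %.15g\n", s);              /* keep the result alive */
    free(q);
    return 0;
  }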

Fight the evil: State that your new code or new algorithm shows a factor of X fewer cache misses compared to the baseline. Cache misses are the black plague of HPC, so nobody will doubt your success. And they are just as easy to manipulate as Flops. Combining with the “go green” strategy is straightforward: Cache misses cost vast amounts of energy (they say), so you can be green and good at the same time!

Wow the management: State that “all processors are 100% utilized, so the code makes perfect use of the available resources.” Windows task manager is the perfect tool to show off with a stunt like this (see Fig. 1), but “top” under Linux works fine, too:

  PID VIRT RES  SHR %CPU %MEM  TIME+  P COMMAND 
 2318 492m 457m 596 98.6 22.9 0:35.52 2 a.out 
 2319 492m 457m 596 98.6 22.9 0:35.53 0 a.out 
 2321 492m 457m 596 98.6 22.9 0:35.50 1 a.out 
 2320 492m 457m 596 97.6 22.9 0:35.44 3 a.out

In summary, metrics are your wishing well: Find the right metric and any code will look good.

This stunt is essentially #9 of Bailey’s original “Twelve ways to fool the masses.”

Fooling the masses – Stunt 11: Show data! Plenty. And then some.

(See the prelude for some general information on what this is all about)

[Image: ShowDataPlenty_Bars]

Figure 1: Lots of variants of an experiment, lots of machines, no idea for interpretation – why not show it all?

Computers produce data. Parallel computers produce more data. And doing performance experiments with parallel computers produces even more data, since we often conduct several runs with different code variants on different machines. Now imagine a situation where you’ve got all this data lying around on your disk, but you can’t make any sense of it. Since you wouldn’t bother to establish even a coarse performance model (your setup is just too complex for this kind of bean counting), you don’t know whether some particular performance measurement is “good” or “bad” on some particular machine, or how performance should change when some parameter is altered.

Why not put the cart before the horse and show all the data? This has decisive advantages: You can display all the work you did, without the embarrassing message that five or six numbers is all the relevant data you got from your research. Furthermore, it gives you the opportunity to add “discussion” at any desirable level of detail. For instance, if the page limit for submitting to your favorite conference is ten pages, and you have five pages full of messy bar graphs (such as the one shown in Fig. 1), you can easily fill the rest with meaningless “explanations” like “Variant 11 shows about 90% performance advantage on Machine G over Variant 10 on Machine F while, surprisingly, Variant 9 on Machine A is 6% slower than Variant 7 on Machine C.” You may want to write “92.23%” and “5.982%” here – see Stunt 8 for the reason.

In summary, you can make up for your lack of insight by showing data. Plenty of it. Swamp your readers or your audience with hundreds of colorful bar graphs, scatter plots, timing diagrams, or whatever your favorite data visualization tool has to offer, and describe them at length. If you really want to leave the impression of true understanding, resort to obscure technical details, but this is the theme of another stunt.

 

Fooling the masses – Stunt 8: Impress your audience with awe-inspiring accuracy

(See the prelude for some general information on what this is all about)

“Qualitative” statements make you cringe. Being a scientist, you know that science is all about accuracy. And what could be more accurate than numbers? Good for you! There is ample opportunity for generating numbers in computer science. The art is to not block your audience’s view on the raw and unfiltered truth.

  # cores   Time (plain) [s]   Time (optimized) [s]   Speedup [%]
        1         12.0435766              8.6198441   39.71919283
        2          6.1179025              5.5901951    9.439874469
        3          4.9041394              4.6966098    4.418710705
        4          4.7002801              4.6912042    0.193466317

Fig. 1: Awesome! Up to 1.77992 times faster!

A simple example: When comparing performance numbers between different machines or code versions (see Stunt 7 for some amazing tricks you can play there), don’t hesitate to provide timings with sub-microsecond resolution, even if the results fluctuate like mad. Then state, in a slightly generalizing, “management-style” demeanor, that your “optimizations lead to a performance increase of up to 39.71919283%”, and refer to the data (see the table) to substantiate your assertion. In other words, forget what your high school teachers told you about significant digits and use the full power of your pocket calculator, even if a slide rule would also do the trick.
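For the record, the "up to" case is the single-core run from the table: 12.0435766 s / 8.6198441 s ≈ 1.397, i.e., about 40% faster. Given that the timings fluctuate, anything beyond the first two digits is noise.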

By the way, the same works for drawing conclusions from data in diagrams. Since it is sometimes hard to read off precise values from graphs, help your audience by stating the raw facts (see Fig. 1). Using boldface and jazzy colors will help to keep people’s attention on the important messages. Do not waste time and space dealing with lengthy explanations but let the numbers speak for themselves!

 

Sun UltraSPARC T2 under test

Thanks to Sun Microsystems and the kind people from RWTH Aachen we had access to a pre-production UltraSPARC T2 (a.k.a. “Niagara 2”) system for some tests. We didn’t manage to get the whole RRZE benchmark suite running, but we could produce some pretty interesting low-level stuff. The peculiar way the N2 addresses its four built-in memory controllers leads to severe congestion if the mutual alignment of memory streams is unfortunate. This can be seen, e.g., with Lattice-Boltzmann codes and, of course, also with STREAM.

That said, if you know what you’re doing and somehow manage to get what we have learned into applications, the Niagara 2 is a pretty interesting processor. In a single socket it has a nominal memory bandwidth of more than 60 GB/s (40 read, 20 write), of which about a third can actually be measured. At a peak performance of 11 GFlop/s (8 cores at 1.4 GHz), this makes for a pretty impressive machine balance, far beyond that of any other cache-based CPU. And finally, the multi-threaded architecture is much less strange and hard to grasp than, e.g., the Cell design.
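(For the record: a third of 60 GB/s is roughly 20 GB/s of measurable bandwidth, which at 11 GFlop/s of peak performance amounts to a balance of about 1.8 bytes per Flop.)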

If you want to know more, here’s a presentation we prepared for the recent “SunDay” at RRZE. Please bear in mind that all of this is still preliminary data and will have to be confirmed on production hardware.

rrze-n2-ea.pdf

All about the N2’s microarchitecture can be found in this neat document:
http://opensparc-t2.sunsource.net/specs/OpenSPARCT2_Core_Micro_Arch.pdf

Array summation benchmark

A question came up on the OpenMP mailing list today concerning scalability of simple array summation on an Opteron processor. I have done some tests with the following code, using the Intel C++ compiler version 9.1:

  double sum = 0.0;  // reduction variable, must be initialized before the loop
  int j;
#pragma omp parallel for private(j) reduction(+: sum)
#pragma vector always
  for (j = 0; j < N; j++) {
    sum += array2[j];
  }

There is a loop around that to ensure that for small sizes we actually see the cache effect. Here is the result:

The number of threads (1T, 2T,…) is indicated. In the case of the Opteron system, this was a 2-socket dual-core 2 GHz box, and the 2-thread data was correspondingly measured on one (1S) or two (2S) sockets, respectively. Proper NUMA placement was implemented. The “Conroe” system is my standard Core2 workstation.

Data on purely serial runs (no -openmp) is shown for reference. In contrast to low-level benchmarks like STREAM or the vector triad, which have more read streams and at least one write stream, there seems to be a lot of “headroom” for the second thread even for large N on an Opteron socket.

Benchmarking fun with calloc() and zero pages

In the course of our yearly lecture on “Programming Techniques for Supercomputers (PTfS)” we hand out some homework assignments to the students. During the first weeks of the term, those are simple loop kernel benchmarks that are supposed to sharpen the students’ view with regard to basic performance bottlenecks like memory bandwidth, latency, pipelining and other issues. The first bug that 50% of all people doing such benchmarks encounter is that they forget to initialize their arrays. So, to prevent this, someone used the convenient calloc() function. Memory allocated in this way is zeroed out and will contain floating point zeroes (strictly speaking, a floating point zero does not have to be represented by a bit pattern with all zeroes, but in practice this will work). The code looked roughly like this (simplified):

  double *a = (double*)calloc(N, sizeof(double));
  double *b = (double*)calloc(N, sizeof(double));
  double *c = (double*)calloc(N, sizeof(double));
  double *d = (double*)calloc(N, sizeof(double));
  double start_t = get_walltime();
  for(i=0; i<N; ++i)
    a[i] = b[i] + c[i] * d[i];
  double wctime = get_walltime() - start_t;
  // now report walltime and performance

The goal was to benchmark the well-known vector triad which is limited by memory bandwidth on all computer architectures for large N (beyond cache sizes). On the system under consideration, we would have expected just below 200 MFlop/s. To our great surprise, the benchmark yielded a blazing performance of 1.4 GFlop/s! Something was obviously wrong.
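A quick sanity check shows why: counting the write-allocate transfer for a[], the triad moves 40 bytes per iteration for 2 Flops, so 200 MFlop/s corresponds to about 4 GB/s of memory bandwidth, while 1.4 GFlop/s would require a sustained 28 GB/s, far more than the memory interface could possibly deliver.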

At first we suspected some ingenious compiler optimization that somehow dumped the whole loop and jumped to the end right away. This could be ruled out easily by making sure the results are actually used and checking that absolute runtime depends on N and other parameters in the expected way. Interestingly, if another (redundant) initialization loop is added after the calloc()s,

  for(i=0; i<N; ++i)
    b[i] = c[i] = d[i] = 0.0;

performance goes down to the expected level, although the array contents are bitwise identical compared to the first version of the code. And what finally left us completely baffled was the assembly code generated for the kernel loop: no difference at all between the two versions.

In the end, Michael found the solution. I must frankly admit that I never would have figured that out by myself – here we go:

When allocating memory using calloc(), the requested amount of memory is not allocated right away. Instead, all pages belonging to the memory block are mapped to a single page containing all zeroes by some MMU magic (links below). If such pages are only read (which was true for arrays b, c, and d in the original version of the benchmark), the data is provided from the single zero page, which – of course – fits into cache. So much for memory-bound loop kernels. If a page gets written to (no matter how), a page fault occurs, a “real” page is mapped, and the zero page is copied to it. This is called copy-on-write, a well-known optimization approach (which I have even taught multiple times in my C++ lectures). After that, the zero-read trick does not work any more for that page, and this is why performance was so much lower after inserting the – supposedly redundant – init loop.
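If you want to watch the effect in isolation, here is a minimal self-contained sketch; this is my reconstruction for illustration, not the original assignment code, with get_walltime() implemented via gettimeofday():

  /* zeropage.c: reading a calloc()ed array before and after writing to it.
     Build with low optimization (e.g., gcc -O1 zeropage.c) so the compiler
     does not outsmart the experiment. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/time.h>

  static double get_walltime(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1.0e-6;
  }

  int main(void) {
    const long N = 50000000;                  /* ~400 MB, well beyond any cache */
    double *b = (double*)calloc(N, sizeof(double));
    if (b == NULL) return 1;
    double s = 0.0, t;

    for (long i = 0; i < N; ++i) s += b[i];   /* untimed warm-up: fault in pages */

    t = get_walltime();
    for (long i = 0; i < N; ++i) s += b[i];   /* all pages map to the zero page */
    printf("read-only:   %.3f s (s=%g)\n", get_walltime() - t, s);

    for (long i = 0; i < N; ++i) b[i] = 0.0;  /* writing triggers copy-on-write */

    t = get_walltime();
    for (long i = 0; i < N; ++i) s += b[i];   /* now really streams from memory */
    printf("after write: %.3f s (s=%g)\n", get_walltime() - t, s);

    free(b);
    return 0;
  }

The first timed loop should run at cache speed despite the 400 MB working set; after the write loop, the very same code is limited by memory bandwidth.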

It’s fascinating how many hours you can spend in front of your monitor staring at 20 lines of code.

Cheers!

Links:

“Copy on write” page from the Kernel Analysis HOWTO: http://tldp.org/HOWTO/KernelAnalysis-HOWTO-10.html#ss10.4

Wikipedia entry (mentions the calloc() issue explicitly): http://en.wikipedia.org/wiki/Copy-on-write