Georg Hager's Blog

Random thoughts on High Performance Computing


The “roofline model” for kernel performance assessment

Sam Williams from UCB has come up with a very nice method to illustrate optimization potential for loop kernels on a known architecture. Everyone who knows about things like code and machine balance can estimate the expected fraction of “light speed” for some loop kernel. However, depending on your knowledge (or your assumptions) about the architecture under consideration, machine balance can be a moving target: Do you consider SIMD instructions to be applicable? Does the data set fit into some cache? Can the arithmetic pipelines be used to their full capacity? Are MULTs and ADDs balanced in the code? Is prefetching possible? Can non-temporal stores be used? Usually, we compute different machine balance numbers for all those cases to get our estimates.
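As a reminder of the arithmetic behind such estimates, here is a minimal C sketch (the numbers are invented for illustration, not taken from any real machine): with machine balance and code balance expressed in the same unit, the expected fraction of “light speed” is simply their ratio, capped at one – which is the roofline limit in disguise.

 #include <stdio.h>

 int main(void)
 {
     /* invented example numbers, not a real machine */
     double peak   = 12.0;  /* peak performance in GFlop/s              */
     double membw  = 10.0;  /* attainable memory bandwidth in GB/s      */
     double code_b = 12.0;  /* code balance of the kernel in bytes/flop */

     double mach_b = membw / peak;     /* machine balance in bytes/flop */
     double light  = mach_b / code_b;  /* expected fraction of peak     */
     if (light > 1.0) light = 1.0;     /* compute-bound case            */

     printf("expected: %.2f GFlop/s (%.0f%% of peak)\n", light * peak, 100.0 * light);
     return 0;
 }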

Williams has found a very nice way to incorporate all this information into a graphical representation, the roofline diagram. With it, one can illustrate not only the architectural limits for kernel performance, but also the optimization potential of some (given) implementation. Read the full presentation: The Roofline Model: A pedagogical tool for program analysis and optimization. There is also a nice poster.

Floating point computations demystified

For those who constantly wonder why floating-point results for the same numerical problem tend to vary across architectures, compilers, compiler options and parallel program runs, this paper may be interesting:

David Monniaux, The pitfalls of verifying floating-point computations. arXiv:cs/0701192v5

The first chapters in particular are a nice introduction to the things you always wanted to know about floating point but never dared to ask…
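A two-line illustration of the kind of effect the paper dissects (the numbers are just an example): floating-point addition is not associative, so anything that changes the order of operations – a compiler optimization, a different option, a parallel reduction – may legitimately change the result.

 #include <stdio.h>

 int main(void)
 {
     double a = 1.0e16, b = -1.0e16, c = 1.0;
     /* mathematically identical sums, different rounding depending on order */
     printf("(a+b)+c = %g\n", (a + b) + c);  /* prints 1 */
     printf("a+(b+c) = %g\n", a + (b + c));  /* prints 0: the 1.0 is absorbed by -1e16 */
     return 0;
 }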

Allocatable array alignment with Intel Fortran compilers

Aligning heap data in C is simple: Just use the standard memalign() or, better, posix_memalign() instead of malloc() and you’re done. Intel compilers also feature special library calls to achieve the same thing, but you don’t really need them (you do need compiler support, though, for stack data, structures etc.). It should be clear to everyone familiar with current x86 architectures what properly aligned data can do for you: packed aligned loads and non-temporal stores become possible. Even though the compiler can in some cases still employ aligned data movement by itself via loop peeling, one may want to align all references properly to enable the use of the #pragma vector aligned sledgehammer (why don’t they provide an argument list for this directive?).
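For reference, a minimal C sketch of the standard way (the array length is a placeholder):

 #include <stdlib.h>

 int main(void)
 {
     double *a = NULL;
     size_t n = 100000;   /* placeholder size */

     /* request 16-byte alignment, as needed for SSE packed loads/stores */
     if (posix_memalign((void **)&a, 16, n * sizeof(double)) != 0)
         return 1;        /* allocation failed */

     /* ... use a[0..n-1] ... */

     free(a);
     return 0;
 }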

In Fortran there is no standard way to have allocatable data automatically aligned on a specific address boundary (the default alignment is 8 bytes). The Intel compiler, however, provides a special directive to do just that. In order to enforce 16-byte alignment you can write:

 ! Request 16-byte alignment for the allocatable array (Intel-specific directive)
 double precision, allocatable, dimension(:) :: array
!DEC$ ATTRIBUTES ALIGN: 16 :: array

! ...

 ! The allocation now returns memory aligned on a 16-byte boundary
 allocate(array(100000))

Although the compiler docs state at some point that this doesn’t work for allocatables, it does work, at least with versions 9.1 and 10.1 (I’ve checked by printing out the addresses explicitly).

This should enable the same vectorization stunts as in C without too much hassle.

Energy efficient HPC?

Energy efficient HPC seems to be the trend of the hour – everyone talks about strategies to reduce power consumption in high performance systems. BlueGene/P gets a lot of credit for being so power efficient (at least if you consider the common but obviously insane LINPACK metric). Let us do some simple calculations in order to pinpoint the truth behind the hype.

An average federal HPC center in Germany takes about 1-1.5 MW to run today. That’s the equivalent of roughly ten sports cars running 24/7, or maybe 1000 (European) standard cars under average utilization.

What’s the number of standard cars in just a single big city like Munich?

Another comparison: assume that 30 million households in Germany run their (quite power-efficient) 100 W TV sets for two hours a day. If they all left the TV off for just a single day, that would save 6 million kilowatt-hours of electrical energy. The HLRB2 federal supercomputer at LRZ Munich could run for almost a full year on this single day of abstinence.
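A quick sanity check of these numbers, assuming the machine draws about 1 MW: 30 × 10⁶ households × 0.1 kW × 2 h = 6 × 10⁶ kWh, and 6 × 10⁶ kWh / (1000 kW × 24 h/day) ≈ 250 days of operation.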

What would you rather spend your high-quality electrical energy on: supplying redundant non-information (a.k.a. standard TV entertainment) to millions of people who don’t need it anyway, or giving a couple of hundred scientists a powerful research tool?

I’m not saying that power saving strategies are of no use in HPC. The sheer problem of getting the required power to a small spot like an HPC center is quite a challenge. Heat production is slowly getting out of hand and needs to be considered. However, arguing that “green HPC” could make a significant contribution to environmental protection is nothing but a convenient device to get funding for some obscure research projects.

So there.

ccNUMA and Linux buffer cache

It may seem surprising to some, but ccNUMA has hit the mass market and will continue to do so in force by the end of 2008 with Intel’s Nehalem processor. And although ccNUMA bears the potential of vastly improved bandwidth scalability, many users and sysadmins meet it with ignorance. Alas, their ignorance is all too often vindicated by the fact that they are right – sometimes.

This is because the vast majority of parallel application codes uses MPI. If you run one MPI process per core and the kernel does an average job of maintaining affinity, you will benefit from ccNUMA without the hassle of paying attention to correct parallel page placement. The latter is mandatory with memory-bound OpenMP code and sometimes not easy to implement, in particular when the problem is not as regular as, say, a simple dense matrix-vector multiply.
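To make “correct parallel page placement” concrete, here is a minimal C/OpenMP sketch of first-touch initialization (array names and initialization values are made up for the example): memory pages end up in the locality domain of the thread that first touches them, so arrays must be initialized with the same parallel access pattern that the compute kernels will use later.

 void first_touch_init(double *a, double *b, long n)
 {
     long i;
     /* Each thread initializes the part of the arrays it will work on later,
      * so those pages are mapped into its own locality domain (first touch). */
 #pragma omp parallel for schedule(static)
     for (i = 0; i < n; ++i) {
         a[i] = 0.0;
         b[i] = 1.0;
     }
 }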

However, even with MPI you can run into the ccNUMA trap when the system’s memory is filled with something else before your code starts to run. This could be, e.g., file system buffer cache. In order to pinpoint the problem, Michael has done an interesting test on one of our dual-socket (dual-LD) Opteron nodes and, for comparison, on a dual-socket Woodcrest system. The former is ccNUMA while the latter is UMA. The test performs the following steps in a loop:

  1. Drop the FS caches by “echo 1 > /proc/sys/vm/drop_caches”. By the way, this facility has existed since Linux kernel 2.6.16. Earlier kernels may have similar features, but those are not “official”. This operation is equivalent to using the bcfree command on an SGI Altix.
  2. Write a file of some size (increasing with iteration count) to the local disk. The maximum file size equals the amount of memory.
  3. sync the cache so that there are no flush operations in the background (but the cache is still filled).
  4. Run a vector triad benchmark (sketched right after this list) with 4 MPI processes that fills exactly half of the node’s memory.
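The vector triad in step 4 is the standard Schönauer benchmark kernel; here is a minimal C sketch of its inner loop (omitting the MPI harness, timing, and repetition logic of the actual benchmark):

 /* Schoenauer vector triad: three loads, one store, two flops per iteration */
 void triad(double * restrict a, const double * restrict b,
            const double * restrict c, const double * restrict d, long n)
 {
     long i;
     for (i = 0; i < n; ++i)
         a[i] = b[i] + c[i] * d[i];
 }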

As the triad is purely bandwidth-bound, its performance depends crucially on the memory pages being allocated in the same locality domains as the processors that use them. However, the presence of a huge buffer cache prevents this:

The code can reach an aggregate performance of roughly 500 MFlop/s on the Opteron node and close to 400 MFlop/s on the Woodcrest. We see that performance is unharmed by the buffer cache on the UMA system, but there are huge fluctuations on the ccNUMA node. Minimum performance is reached when the cache size is about 2 GB, which is half of the installed memory. In this case the kernel has filled one LD with buffer cache and all MPI processes map their pages in the second LD. We end up with the well-known congestion problem – a single LD must service the bandwidth demands of all cores. When the file size grows even further, the effect vanishes: once all memory is filled by FS cache, any user-allocated page displaces a cache page and access becomes local again.

By default, Linux keeps the cache and rather lets an application map memory pages from a foreign LD when there is nothing left in the local LD. You can prevent this, of course, by using appropriate calls from the libnuma library, but you have to know about this. Most users don’t.
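As an illustration of what such calls look like, here is a sketch using libnuma (the wrapper function is mine; the underlying assumption – that strictly local allocation makes the kernel reclaim local buffer cache instead of spilling to a remote LD – should be verified on the system at hand):

 #include <numa.h>    /* link with -lnuma */
 #include <stdio.h>
 #include <stdlib.h>

 double *alloc_on_local_node(size_t n)
 {
     if (numa_available() < 0) {
         fprintf(stderr, "NUMA API not available\n");
         exit(1);
     }
     /* numa_alloc_local() returns page-aligned memory placed on the node
      * the calling thread currently runs on. */
     return (double *) numa_alloc_local(n * sizeof(double));
 }

 /* ... later, release with numa_free(ptr, n * sizeof(double)); ... */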

On the admin’s side we see that it’s a good idea to drop the cache whenever a cluster’s compute node becomes free. Users can’t do this themselves because you have to be root. As a makeshift, however, a user can flush and drop all cache pages by running a “sweeper” code that allocates and touches all the node’s memory before the real application starts.
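A possible shape of such a sweeper, as a sketch (the amount of memory to grab is a placeholder and must be adapted to the node):

 #include <stdlib.h>
 #include <string.h>

 int main(void)
 {
     size_t bytes = (size_t)15 << 30;  /* placeholder: ~15 GiB on a 16 GiB node */

     char *p = malloc(bytes);
     if (p == NULL)
         return 1;                     /* could not get that much; reduce the size */

     memset(p, 0, bytes);              /* touch every page so FS cache pages get evicted */
     free(p);
     return 0;
 }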

Summing up, users and admins must be aware of such effects. Imagine what happens with a job that uses hundreds of ccNUMA nodes on one of which there’s a huge FS cache…

Sun UltraSPARC T2 under test

Thanks to Sun Microsystems and the kind people from RWTH Aachen we had access to a pre-production UltraSPARC T2 (a.k.a. “Niagara 2”) system for some tests. We didn’t manage to get the whole RRZE benchmark suite running but could produce some pretty interesting low-level stuff. The peculiar way the N2 addresses its four built-in memory controllers leads to severe congestion if the mutual alignment of memory streams is unfortunate. This can be seen, e.g. with Lattice-Boltzmann codes and, of course, also with STREAM.

That said, if you know what you’re doing and somehow manage to get what we have learned into applications, the Niagara2 is a pretty interesting processor. In a single socket it has a nominal memory bandwidth of >60 GB/s (40 read, 20 write), of which about a third can actually be measured. At a peak performance of 11 GFlop/s (8 cores at 1.4 GHz), this makes for a pretty impressive machine balance that is far beyond any other cache-based CPU. And finally, the multi-threaded architecture is much less strange and hard to grasp than, e.g., the Cell design.
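Back-of-the-envelope, taking the measured third at face value: about 20 GB/s corresponds to roughly 2.5 GWords/s in double precision, and divided by ~11 GFlop/s of peak this gives a machine balance of about 0.23 words per flop (≈1.8 bytes/flop).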

If you want to know more, here’s a presentation we prepared for the recent “SunDay” at RRZE. Please bear in mind that all of this is still preliminary data and will have to be confirmed on production hardware.

rrze-n2-ea.pdf

All about the N2’s microarchitecture can be found in this neat document:
http://opensparc-t2.sunsource.net/specs/OpenSPARCT2_Core_Micro_Arch.pdf

IMB multi-mode Ping-Pong demystified?

Everybody knows the ubiquitous PingPong benchmark, first published in the Pallas MPI benchmarks and today contained in Intel’s IMB suite. Using one process on each node and plotting bandwidth versus message size, you see the well-known behaviour – saturating bandwidth for large messages and latency domination for small ones (blue curve – measurements were done on our Woody cluster between nodes connected to a common IB switch nodeboard, i.e. with DDR speed):

If you run this code in multi-mode, i.e. two or more processes on one node sending and two or more on the other node receiving, the bandwidth pattern changes considerably (black squares). There is a significant “overshoot” for medium-sized messages and a sudden drop at about 1 MByte. Eventually the saturation bandwidth is the same as in the simple case.

It turns out that one can explain this, at least for the 2-2 case, by assuming that MPI (or the IB driver or whoever) splits messages that are larger than a certain limit (e.g., 1 MByte) into chunks and transmits/receives those chunks alternating between the two connections. Using this model and fitting the parameters we can predict the multi-mode Ping-Pong bandwidth quite well (orange curve). In the 3-3 case, however, things get more complicated (green curve) and there is an additional plateau. I’m not sure whether one can really extend the model to encompass this effect as well.
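For the record, here is one way such a model could be written down; the functional form and the parameter names are my own reading of the description above, not the fitted model from the write-up:

 /* Crude model of the 2-2 multi-mode PingPong: below the split limit the two
  * connections transfer their messages concurrently; above it the chunks of
  * the two connections alternate on the wire and effectively serialize.
  * Returns the aggregate bandwidth for a given message size; all parameters
  * (latency, per-connection bandwidth, split limit) are assumed. */
 double mm_pingpong_bw(double msg, double latency, double link_bw, double split_limit)
 {
     if (msg <= split_limit)
         return 2.0 * msg / (latency + msg / link_bw);       /* overshoot region  */
     else
         return 2.0 * msg / (latency + 2.0 * msg / link_bw); /* serialized chunks */
 }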

If you are interested, the gory details have been written up: mm-pingpong.pdf

OpenMP, ccNUMA and C++

If you are interested in programming with C++ and OpenMP, the recently finished diploma thesis of Holger Stengel might be interesting for you (in German – available on request). It studies ccNUMA effects in C++ and ways to circumvent them. To whet your appetite, there is a nice English poster with most of the results: poster_cppnuma.pdf

This whole work was kicked off by some of the problems I had encountered during my PhD thesis, where I had parallelized a C++ code from condensed matter physics. At that time, nobody had even thought about what would happen if standard C++ elements (arrays of objects, std::vector<> etc.) were used on a ccNUMA machine with OpenMP. Another inspiration came from Matt Austern‘s article about Segmented Iterators and Hierarchical Algorithms. The segmented iterator described in this paper could be useful for many purposes, of which NUMA placement is only one. In the thesis we implemented a version in which data placement can be controlled exactly via configurable padding.

I would be glad to continue on this topic with another diploma/bachelor/masters student. If you are hooked, feel free to contact me.