Georg Hager's Blog

Random thoughts on High Performance Computing


A case for the non-temporal store

Non-temporal stores (also called “streaming stores”) were added to x86 processors with SSE2, i.e. when the first Pentium 4 (“Willamette”) appeared in 2000. There was also some support in MMX before, but nobody should be required to use MMX any more. In contrast to standard store instructions, which transfer data from registers to memory, NT stores do not require a prior cache line read for ownership (RFO) but write to memory “directly” (well, not quite – but to first order it’s true). Beware the usual explanations for RFO that Google finds for you – they almost all describe RFO as something inherently connected to cache coherency in multi-processor systems, but an RFO is also required on a single core, whenever there is a cache that is organized in lines.

What are NT stores good for? Again, most references cite them as a means to avoid cache pollution in cases where stored data will not be re-used soon. That’s true to some extent; every word that you don’t store in your cache means more room for other, more frequently used data. But that’s only half the story. Imagine you have a code whose working set fits into the outer-level cache and which produces data to be stored in memory. Using NT stores for such a code will probably slow it down because, depending on the LD/ST balance, performance is then limited by the memory interface. The true benefit of NT stores can be seen with a truly memory-bound code which is not dominated by loads, like, e.g., the STREAM COPY benchmark, relaxation methods, the lattice-Boltzmann algorithm, or anything that can be formulated in terms of stride-one store streams: By saving the RFO, the pressure on the memory subsystem is reduced. See, e.g., my talk at the KONWIHR results and review workshop in December 2007 at LRZ, where the performance boost through NT stores was demonstrated using a 2D Jacobi (5-point stencil) code.
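To see where the saving comes from, consider a plain copy a[i]=b[i]: with standard stores every cache line of a is first read in (the RFO) and later written back, so three cache lines cross the memory interface for every line of b that is loaded; with NT stores it is only two. A minimal sketch in C intrinsics (my own example, not the code from the talk), assuming a and b are 16-byte aligned and n is even:

#include <emmintrin.h>   /* SSE2 intrinsics */

/* copy a[i] = b[i] with non-temporal (streaming) stores */
void copy_nt(double *a, const double *b, long n) {
  long i;
  for (i = 0; i < n; i += 2) {
    __m128d v = _mm_load_pd(b + i);   /* aligned 16-byte load            */
    _mm_stream_pd(a + i, v);          /* movntpd: NT store, no RFO for a */
  }
  _mm_sfence();                       /* make the NT stores globally visible */
}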

The most important non-temporal store instruction is movntpd, which writes the contents of a full 16-byte SSE register (xmm0…xmm15) to memory. The memory address has to be a multiple of 16, else an exception will be generated. Also, this instruction exists in a “packed” variant only, i.e. there is no movntsd that would only store the lower 8 bytes of the register (AMD has included it, though, into their SSE4a extensions which are implemented with the K10 CPUs – read a nice writeup by Rahul Chaturvedi from AMD for more information). So, if you are stuck with Intel processors or a compiler which does not know about movntsd on K10, loops which use NT stores must be vectorizable and the store stream(s) must be 16-byte aligned.

Or so I thought, for a long time. This assumption was backed, of course, by stupid compilers which blatantly refused to use NT stores unless they were 100% sure at compile time that the store stream was actually aligned. Never heard of loop peeling, eh? Thank goodness great Intel gave us the vector aligned pragma, which you can put in front of a loop in order to tell the compiler that the streams appearing inside the loop are 16-byte aligned! But that always refers to all streams, including reads, and if you somehow forget to pay proper attention, you’re immediately punished by an exception again. To make a long story short, you never really know what’s going on inside the compiler’s twisted little brain, and the diagnostics don’t help either (“LOOP WAS VECTORIZED”, yes, thank you very much).

In the hope of some future improvement (an unaligned NT store? Or *gasp* maybe even a scalar one?) I looked into Intel’s AVX extensions to SSE, to be implemented in the Sandy Bridge processor, which is to follow Westmere in 2010. That would be a “Streaming SIMD Extension Extension”, right? Whatever – there is neither a scalar nor an unaligned packed NT store in there. But I stumbled across something that has been there since SSE2: the beautiful maskmovdqu instruction. From the manual:


MASKMOVDQU xmm1,xmm2 – Store Selected Bytes of Double Quadword with NT Hint

Stores selected bytes from the source operand (first operand) into an 128-bit memory location. The mask operand (second operand) selects which bytes from the source operand are written to memory. The source and mask operands are XMM registers. The location of the first byte of the memory location is specified by DI/EDI/RDI and DS registers. The memory location does not need to be aligned on a natural boundary. […]


Way cool. Although the fixed address register is somewhat inflexible, the instruction is even better than pure scalar or pure unaligned NT store – it can store any part of an XMM register, down to the byte level, and with no alignment restrictions. And there’s a C/C++ compiler intrinsic, too. That has made it easy to implement our beloved vector triad in explicit SSE. These are two possible variants:

Scalar (unaligned):

  __m128d xmm1,xmm2,xmm3,xmm4;
  // mask that selects only the lower 8 bytes of the XMM register
  __m128i mask_low=_mm_set_epi32(0,0,-1,-1);
#pragma omp parallel private(j)
{
  if(size > 1000000) {
    for(j=0; j<niter; j++){
#pragma omp for private(xmm1,xmm2,xmm3,xmm4)
      for(i=0; i<size; i++) {
        xmm1 = _mm_load_sd(d+i);
        xmm2 = _mm_load_sd(c+i);
        xmm3 = _mm_load_sd(b+i);
        xmm4 = _mm_mul_sd(xmm1,xmm2);
        xmm4 = _mm_add_sd(xmm4,xmm3);
        // NT store of the lower 8 bytes only; a+i need not be aligned
        _mm_maskmoveu_si128(_mm_castpd_si128(xmm4), mask_low, (char*)(a+i));
      }
    }
  } else {
    // same with standard store
  }
}

Packed unaligned:

  __m128d xmm1,xmm2,xmm3,xmm4;
  // mask that selects all 16 bytes of the XMM register
  __m128i mask_all=_mm_set_epi32(-1,-1,-1,-1);
#pragma omp parallel private(j)
{
  if(size > 1000000) {
    for(j=0; j<niter; j++){
#pragma omp for private(xmm1,xmm2,xmm3,xmm4)
      for(i=0; i<size; i+=2) {
        // aligned loads: b, c, and d are assumed 16-byte aligned here;
        // only the store stream a may be unaligned
        xmm1 = _mm_load_pd(d+i);
        xmm2 = _mm_load_pd(c+i);
        xmm3 = _mm_load_pd(b+i);
        xmm4 = _mm_mul_pd(xmm1,xmm2);
        xmm4 = _mm_add_pd(xmm4,xmm3);
        // NT store of the full 16 bytes; a+i need not be aligned
        _mm_maskmoveu_si128(_mm_castpd_si128(xmm4), mask_all, (char*)(a+i));
      }
    }
  } else {
    // same with standard store
  }
}

The first variant is an example of a purely scalar code which employs NT stores. The second shows how a vectorized code can be endowed with NT stores if 16-byte alignment of the store stream cannot be enforced. The compiler does neither by itself. Of course, NT stores only make sense if the arrays are large. I have set the lower limit to N=1000000, which is largish compared to the cache sizes but neatly shows the effect of NT vs. standard store:

The figure includes standard scalar versions without NT stores as well. Obviously, the maskmovdqu instruction does the job. There is hardly any performance loss in the serial case, and none at all with four threads. I have omitted any comment on in-cache behavior of the different variants; this is a separate story (and one which becomes more interesting once Nehalem is out!).

If you want to know more about using SSE intrinsics in C/C++ code, the Intel C++ Compiler for Linux Intrinsics Reference is for you.

The “roofline model” for kernel performance assessment

Sam Williams from UCB has come up with a very nice method to illustrate optimization potential for loop kernels on a known architecture. Everyone who knows about things like code and machine balance can estimate the expected fraction of “light speed” for some loop kernel. However, depending on your knowledge (or your assumptions) about the architecture under consideration, machine balance can be a moving target: Do you consider SIMD instructions to be applicable? Does the data set fit into some cache? Can the arithmetic pipelines be used to their full capacity? Are MULTs and ADDs balanced in the code? Is prefetching possible? Can non-temporal stores be used? Usually, we compute different machine balance numbers for all those cases to get our estimates.

Williams has found a very nice way to incorporate all this information into a graphical representation, the roofline diagram. With it, one can illustrate not only the architectural limits for kernel performance, but also the optimization potential of some (given) implementation. Read the full presentation: The Roofline Model: A pedagogical tool for program analysis and optimization. There is also a nice poster.
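For reference, the arithmetic behind the diagram is simple (my wording, not Williams’): the attainable performance of a loop kernel is P = min(P_peak, I * b_S), where I is the kernel’s computational intensity in flops per byte of memory traffic and b_S is the achievable memory bandwidth. Equivalently, with machine balance B_m = b_S / P_peak and code balance B_c = 1/I, the expected fraction of “light speed” is min(1, B_m/B_c); each of the questions above (SIMD, MULT/ADD mix, NT stores, …) merely moves one of the two “roofs” up or down.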

Floating point computations demystified

For those who constantly wonder why floating-point results for the same numerical problem tend to vary across architectures, compilers, compiler options and parallel program runs, this paper may be interesting:

David Monniaux, The pitfalls of verifying floating-point computations. arXiv:cs/0701192v5

Especially the first chapters form a nice intro on things you always wanted to know about floating-point but never dared to ask…

Allocatable array alignment with Intel Fortran compilers

Aligning heap data in C is simple: Just use the standard memalign() or, better, posix_memalign() function instead of malloc() and you’re done. Intel compilers also feature special library calls to achieve the same thing, but you don’t really need them (you do need compiler support, though, for stack data, structures etc.). It should be clear to everyone familiar with current x86 architectures what properly aligned data can do for you: Packed aligned loads and non-temporal stores become possible. Even though the compiler can still employ aligned data movement by itself in some cases by loop peeling, one may want to align all references properly to enable the use of the #pragma vector aligned sledgehammer (why don’t they provide an argument list for this directive?).
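For completeness, here is a minimal sketch of the C route (my example, not from the original text): posix_memalign() delivers a 16-byte aligned block, and the Intel-specific pragma then promises aligned streams to the compiler:

#include <stdlib.h>

/* allocate a 16-byte aligned array of n doubles; returns NULL on failure */
double *alloc_aligned(size_t n) {
  void *p = NULL;
  if (posix_memalign(&p, 16, n * sizeof(double)) != 0)
    return NULL;
  return (double *)p;
}

void init(double *a, size_t n) {
  size_t i;
#pragma vector aligned      /* Intel compiler: all streams in this loop are aligned */
  for (i = 0; i < n; i++)
    a[i] = 0.0;
}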

In Fortran there is no standard way to make allocatable data automatically aligned on some address boundary (the default alignment is 8 bytes). The Intel compiler, however, provides a special directive to do just that. In order to enforce 16-byte alignment you can write:

 double precision, allocatable, dimension(:) :: array
!DEC$ ATTRIBUTES ALIGN: 16 :: array

! ...

 allocate(array(100000))

Although the compiler docs say at some point that this doesn’t work for allocatables, it does at least for versions 9.1 and 10.1 (I’ve checked by printing out the address explicitly).

This should enable the same vectorization stunts as in C without too much hassle.

Energy efficient HPC?

Energy efficient HPC seems to be the trend of the hour – everyone talks about strategies to reduce power consumption in high performance systems. BlueGene/P gets a lot of credit for being so power efficient (at least if you consider the common but obviously insane LINPACK metric). Let us do some simple calculations in order to pinpoint the truth behind the hype.

An average federal HPC center in Germany takes about 1-1.5 MW to run today. That’s the equivalent of roughly ten sports cars running 24/7, or maybe 1000 (European) standard cars under average utilization.

What’s the number of standard cars in just a single big city like Munich?

Another comparison: Assume there are 30 million households in Germany, each running a (quite power-efficient) 100 W TV set for two hours a day. If they all left the TV off for just a single day, that would save 6 million kilowatt-hours of electrical energy. The HLRB2 federal supercomputer at LRZ Munich could run for almost a full year on this single day of abstinence.

What would you rather spend your high-quality electrical energy on: Supplying redundant non-information (a.k.a. standard TV entertainment) to millions of people who don’t need it anyway or giving a couple of hundred scientists a powerful research tool?

I’m not saying that power saving strategies are of no use in HPC. Just getting the required power to a small spot like an HPC center is quite a challenge. Heat production is slowly getting out of hand and needs to be considered. However, arguing that “green HPC” could make a significant contribution to environmental protection is nothing but a convenient device to get funding for some obscure research projects.

So there.

ccNUMA and Linux buffer cache

It may seem surprising to some, but ccNUMA has hit the mass market and will forcefully continue to do so by the end of 2008 with Intel’s Nehalem processor. And although ccNUMA bears the potential of vastly improved bandwidth scalability, many users and sysadmins meet it with ignorance. Alas, their ignorance is all too often vindicated by the fact that they are right – sometimes.

This is because the vast majority of parallel application codes uses MPI. If you run one MPI process per core and the kernel does an average job of maintaining affinity, you will benefit from ccNUMA without the hassle of paying attention to correct parallel page placement. The latter is mandatory with memory-bound OpenMP code and sometimes not easy to implement, in particular when the problem is not as regular as, say, a simple dense matrix-vector multiply.
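For the OpenMP case the standard recipe is first-touch initialization: initialize the data with the same loop schedule that the compute loops will later use, so that every page is first touched – and hence placed – in the locality domain of the thread that will work on it. A minimal sketch (mine, not from the post):

/* first-touch placement for a ccNUMA-friendly OpenMP code */
void init_first_touch(double *a, long n) {
  long i;
#pragma omp parallel for schedule(static)
  for (i = 0; i < n; i++)
    a[i] = 0.0;        /* page is mapped into the touching thread's LD */
}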

However, even with MPI you can run into the ccNUMA trap when the system’s memory is filled with something else before your code starts to run. This could be, e.g., file system buffer cache. In order to pinpoint the problem, Michael has done an interesting test on one of our dual-socket (dual-LD) Opteron nodes and, for comparison, on a dual-socket Woodcrest system. The former is ccNUMA while the latter is UMA. The test performs the following steps in a loop:

  1. Drop the FS caches by “echo 1 > /proc/sys/vm/drop_caches”. Btw, this facility has existed since Linux kernel 2.6.16. Earlier kernels may have similar features, but those are not “official”. This operation is equivalent to using the bcfree command on an SGI Altix.
  2. Write a file of some size (increasing with iteration count) to the local disk. The maximum file size equals the amount of memory.
  3. sync the cache so that there are no flush operations in the background (but the cache is still filled).
  4. Run a vector triad benchmark with 4 MPI processes that fills exactly half of the node’s memory.

As the triad is purely bandwidth-bound, its performance depends crucially on the memory pages being allocated in the same locality domains as the processors that use them. However, the presence of a huge buffer cache prevents this:

The code can get an aggregate performance of roughly 500 MFlop/s on the Opteron node and close to 400 MFlop/s on the Woodcrest. We see that performance is unharmed by the buffer cache on the UMA system, but there are huge fluctuations on the ccNUMA node. Minimum performance is reached when the cache size is about 2 GB, which is half of the installed memory. In this case the kernel has filled one LD with buffer cache and all MPI processes map their pages in the second LD. We end up with the well-known congestion problem – a single LD must service the bandwidth demands of all cores. As the file size grows even further the effect vanishes: once all memory is filled with FS cache, any user-allocated page displaces a cache page and access becomes local again.

By default, Linux keeps the buffer cache even if that forces an application to map its memory pages in a foreign LD because nothing is left in the local one. You can prevent this, of course, by using appropriate calls from the libnuma library, but you have to know about this. Most users don’t.
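For illustration, a sketch of the libnuma route (my example, not from the post): numa_alloc_local() requests memory in the calling thread’s own locality domain; whether the kernel then reclaims buffer-cache pages instead of spilling over to a foreign LD still depends on the kernel version and its settings.

#include <stddef.h>
#include <numa.h>

/* allocate n doubles in the local LD via libnuma (link with -lnuma) */
double *alloc_on_local_ld(size_t n) {
  if (numa_available() < 0)
    return NULL;                  /* no NUMA support on this system */
  return (double *)numa_alloc_local(n * sizeof(double));
}
/* release later with numa_free(ptr, n * sizeof(double)); */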

On the admin’s side we see that it’s a good idea to drop the cache whenever a cluster’s compute node becomes free. Users can’t do this themselves because you have to be root. As a makeshift, however, a user can flush and drop all cache pages by running a “sweeper” code that allocates and touches all the node’s memory before the real application starts.
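Such a sweeper can be as simple as the following sketch (mine, not from the post; in practice one would leave some headroom so as not to provoke swapping or the OOM killer):

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
  size_t total = (size_t)sysconf(_SC_PHYS_PAGES) * (size_t)sysconf(_SC_PAGE_SIZE);
  size_t chunk = 64UL << 20;                     /* grab memory in 64 MB pieces     */
  void **blocks = malloc((total / chunk + 1) * sizeof(void *));
  size_t nblocks = 0, grabbed = 0;
  while (grabbed + chunk < total) {
    void *p = malloc(chunk);
    if (p == NULL) break;                        /* stop when nothing is left       */
    memset(p, 0, chunk);                         /* touching the pages evicts cache */
    blocks[nblocks++] = p;
    grabbed += chunk;
  }
  while (nblocks > 0) free(blocks[--nblocks]);   /* give the memory back            */
  free(blocks);
  return 0;
}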

Summing up, users and admins must be aware of such effects. Imagine what happens with a job that uses hundreds of ccNUMA nodes on one of which there’s a huge FS cache…

Sun UltraSPARC T2 under test

Thanks to Sun Microsystems and the kind people from RWTH Aachen we had access to a pre-production UltraSPARC T2 (a.k.a. “Niagara 2”) system for some tests. We didn’t manage to get the whole RRZE benchmark suite running but could produce some pretty interesting low-level stuff. The peculiar way the N2 addresses its four built-in memory controllers leads to severe congestion if the mutual alignment of memory streams is unfortunate. This can be seen, e.g. with Lattice-Boltzmann codes and, of course, also with STREAM.

That said, if you know what you’re doing and somehow manage to get what we have learned into applications, the Niagara2 is a pretty interesting processor. In a single socket it has a nominal memory bandwidth of >60 GB/s (40 read, 20 write) of which about a third can actually be measured. At a peak performance of 11 GFlop/s (8 cores at 1.4 GHz), this makes for a pretty impressive machine balance – roughly 20 GB/s over 11 GFlop/s, i.e. close to 2 bytes per flop – which is far beyond any other cache-based CPU. And finally, the multi-threaded architecture is much less strange and hard to grasp than, e.g., the Cell design.

If you want to know more, here’s a presentation we prepared for the recent “SunDay” at RRZE. Please bear in mind that all of this is still preliminary data and will have to be confirmed on production hardware.

rrze-n2-ea.pdf

All about the N2’s microarchitecture can be found in this neat document:
http://opensparc-t2.sunsource.net/specs/OpenSPARCT2_Core_Micro_Arch.pdf

IMB multi-mode Ping-Pong demystified?

Everybody knows the ubiquitous PingPong benchmark, first published in the Pallas MPI benchmarks and today contained in Intel’s IMB suite. Using one process on each node and plotting bandwidth versus message size, you see the well-known behaviour – saturating bandwidth for large messages and latency domination for small ones (blue curve – measurements were done on our Woody cluster between nodes that are connected to a common IB switch nodeboard, i.e. with DDR speed):

If you run this code in multi-mode, i.e. two or more processes on one node sending and two or more on the other node receiving, the bandwidth pattern changes considerably (black squares). There is a significant “overshoot” for medium-sized messages and a sudden drop at about 1 MByte. Eventually the saturation bandwidth is the same as in the simple case.

It turns out that one can explain this, at least for the 2-2 case, by assuming that MPI (or the IB driver or whoever) splits messages that are larger than a certain limit (e.g., 1 MByte) into chunks and transmits/receives those chunks alternating between the two connections. Using this model and fitting the parameters we can predict the multi-mode Ping-Pong bandwidth quite well (orange curve). In the 3-3 case, however, things get more complicated (green curve) and there is an additional plateau. I’m not sure whether one can really extend the model to encompass this effect as well.
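To make the idea concrete, here is a toy version of such a model (my own guess at its structure – the actual fit is in the write-up linked below, and T_lat, B and the split limit are parameters to be fitted). The simple ping-pong follows the usual B_eff(N) = N / (T_lat + N/B); for the 2-2 case, messages below the split limit travel over both connections concurrently, while above it the chunks of a message alternate between the connections so that effectively only one of them is active at a time:

/* toy model for the 2-2 multi-mode ping-pong; N is the message size in
   bytes per process pair, the return value the aggregate bandwidth     */
double mm_pingpong_bw(double N, double T_lat, double B, double limit) {
  if (N <= limit)
    return 2.0 * N / (T_lat + N / B);        /* connections overlap: up to ~2*B  */
  else
    return 2.0 * N / (T_lat + 2.0 * N / B);  /* chunks alternate: saturates at B */
}

This toy version reproduces an overshoot for medium-sized messages, a sudden drop at the split limit, and the same saturation bandwidth as the simple case; it does not capture the extra plateau of the 3-3 measurement.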

If you are interested, the gory details have been written up: mm-pingpong.pdf