Georg Hager's Blog

Random thoughts on High Performance Computing


Allocatable array alignment with Intel Fortran compilers

Aligning heap data in C is simple: Just use the venerable memalign() or, better, the POSIX-standard posix_memalign() function instead of malloc() and you’re done. Intel compilers also feature special library calls to achieve the same thing, but you don’t really need them (you do need compiler support, though, for stack data, structures, etc.). Everyone familiar with current x86 architectures knows what properly aligned data can do for you: Packed aligned loads and non-temporal stores become possible. Even though the compiler can still employ aligned data movement by itself in some cases via loop peeling, one may want to align all references properly to enable the use of the #pragma vector aligned sledgehammer (why don’t they provide an argument list for this directive?).
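For illustration, a minimal C++ sketch of the C-side recipe (the array size is arbitrary, and error handling is pared down):

#include <stdlib.h>
#include <cstdio>

int main() {
  void* p = 0;
  // alignment must be a power of two and a multiple of sizeof(void*)
  if (posix_memalign(&p, 16, 100000 * sizeof(double)) != 0)
    return 1;
  double* array = static_cast<double*>(p);
  std::printf("address: %p\n", p); // the low four bits should be zero
  array[0] = 1.0;
  free(p); // memory from posix_memalign() is released with free()
  return 0;
}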

In Fortran there is no standard way to have allocatable data automatically aligned on a particular address boundary (the default alignment is 8 bytes). The Intel compiler, however, provides a special directive to do just that. In order to enforce 16-byte alignment you can write:

 double precision, allocatable, dimension(:) :: array
!DEC$ ATTRIBUTES ALIGN: 16 :: array

! ...

 allocate(array(100000))

Although the compiler docs say at some point that this doesn’t work for allocatables, it does, at least for versions 9.1 and 10.1 (I’ve checked by printing out the address explicitly).

This should enable the same vectorization stunts as in C without too much hassle.

Energy efficient HPC?

Energy efficient HPC seems to be the trend of the hour – everyone talks about strategies to reduce power consumption in high performance systems. BlueGene/P gets a lot of credit for being so power efficient (at least if you consider the common but obviously insane LINPACK metric). Let us do some simple calculations in order to pinpoint the truth behind the hype.

An average federal HPC center in Germany takes about 1-1.5 MW to run today. That’s the equivalent of roughly ten sports cars running 24/7, or maybe 1000 (European) standard cars under average utilization.

What’s the number of standard cars in just a single big city like Munich?

Another comparison: Assume that the 30 million households in Germany each run their (quite power-efficient) 100 W TV set for two hours a day. If they all left the TV off for just a single day, that would save 6 million kilowatt-hours of electrical energy. The HLRB2 federal supercomputer at LRZ Munich could run for the better part of a year on this single day of abstinence.
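For the record, the arithmetic (taking a draw of the order of 1 MW, in line with the numbers above):

\[ 3\times 10^{7} \times 100\,\mathrm{W} \times 2\,\mathrm{h} = 6\times 10^{6}\,\mathrm{kWh}, \qquad \frac{6\times 10^{6}\,\mathrm{kWh}}{0.7\ldots 1\,\mathrm{MW}} \approx 6000\ldots 8600\,\mathrm{h}, \]

i.e., somewhere between eight months and a year.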

What would you rather spend your high-quality electrical energy on: Supplying redundant non-information (a.k.a. standard TV entertainment) to millions of people who don’t need it anyway or giving a couple of hundred scientists a powerful research tool?

I’m not saying that power saving strategies are of no use in HPC. The sheer problem of getting the required power to a small spot like an HPC center is quite a challenge. Heat production is slowly getting out of hand and needs to be considered. However, arguing that “green HPC” could make a significant contribution to environmental protection is nothing but a convenient device for getting funding for some obscure research projects.

So there.

Intel Threading Building Blocks (TBB 2.0) on RRZE systems

The current version of Intel’s TBB library is now installed on Woody and the Altix. TBB is a C++ framework for shared-memory parallel programming. If you are fed up with marrying C++ and OpenMP, TBB might be the right choice. Use it by loading the tbb/2.0 or tbb/2.0-debug module, respectively. On Cluster32 you can access the modules from Woody by typing, e.g., module use /apps/modules/modulefiles/libraries, but remember that we have only provided a 64-bit version. The library was compiled with the Intel compiler, and we don’t guarantee that it will work with anything else. Documentation can be found on the TBB website:

http://threadingbuildingblocks.org/documentation.php
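To give a flavour of the programming model, here is a minimal sketch assuming TBB 2.0’s interface (array size, scaling factor, and grainsize are arbitrary choices); after loading the module, something like g++ -O2 scale.cpp -ltbb should build it:

#include <cstddef>
#include <tbb/task_scheduler_init.h>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

// loop body: each task scales one chunk of the index range
struct Scale {
  double* a;
  double s;
  void operator()(const tbb::blocked_range<std::size_t>& r) const {
    for (std::size_t i = r.begin(); i != r.end(); ++i)
      a[i] *= s;
  }
};

int main() {
  tbb::task_scheduler_init init; // required in TBB 2.0 before any parallel call
  const std::size_t n = 1000000;
  double* a = new double[n];
  for (std::size_t i = 0; i < n; ++i) a[i] = 1.0;
  Scale body = { a, 2.0 };
  // grainsize 10000 is an arbitrary choice; tune it to the loop body
  tbb::parallel_for(tbb::blocked_range<std::size_t>(0, n, 10000), body);
  delete[] a;
  return 0;
}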

We are very much interested in sharing experiences with all users of the library. So if you have comments or questions, please contact hpc@rrze.

ccNUMA and Linux buffer cache

It may seem surprising to some, but ccNUMA has hit the mass market and will forcefully continue to do so by the end of 2008 with Intel’s Nehalem processor. And although ccNUMA bears the potential of vastly improved bandwidth scalability, many users and sysadmins meet it with ignorance. Alas, their ignorance is all too often vindicated by the fact that they are right – sometimes.

This is because the vast majority of parallel application codes uses MPI. If you run one MPI process per core and the kernel does an average job of maintaining affinity, you will benefit from ccNUMA without the hassle of paying attention to correct parallel page placement. The latter is mandatory with memory-bound OpenMP code and sometimes not easy to implement, in particular when the problem is not as regular as, say, a simple dense matrix-vector multiply.
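To make the placement issue concrete, here is a hedged C++/OpenMP sketch (my own, not from any particular code) of the standard first-touch recipe for the regular case, a dense matrix-vector multiply:

// Pages are mapped into the locality domain of the thread that touches them
// first, so the matrix must be initialized with the same static schedule
// that the compute loop uses later.
void mvm(double* y, double* m, const double* x, long rows, long cols) {
  #pragma omp parallel for schedule(static)
  for (long r = 0; r < rows; ++r)      // parallel first touch of the matrix
    for (long c = 0; c < cols; ++c)
      m[r * cols + c] = 1.0;

  #pragma omp parallel for schedule(static)
  for (long r = 0; r < rows; ++r) {    // same schedule: each row stays local
    double sum = 0.0;
    for (long c = 0; c < cols; ++c)
      sum += m[r * cols + c] * x[c];
    y[r] = sum;
  }
}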

However, even with MPI you can run into the ccNUMA trap when the system’s memory is filled with something else before your code starts to run. This could be, e.g., file system buffer cache. In order to pinpoint the problem, Michael has done an interesting test on one of our dual-socket (dual-LD) Opteron nodes and, for comparison, on a dual-socket Woodcrest system. The former is ccNUMA while the latter is UMA. The test performs the following steps in a loop:

  1. Drop the FS caches by “echo 1 > /proc/sys/vm/drop_caches”. By the way, this facility has existed since Linux kernel 2.6.16; earlier kernels may have similar features, but those are not “official”. This operation is equivalent to using the bcfree command on an SGI Altix.
  2. Write a file of some size (increasing with iteration count) to the local disk. The maximum file size equals the amount of memory.
  3. sync the cache so that there are no flush operations in the background (but the cache is still filled).
  4. Run a vector triad benchmark (see the sketch below) with 4 MPI processes that fills exactly half of the node’s memory.
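The kernel at the heart of the benchmark is just this (a bare C++ sketch; MPI setup, timing, and repetitions omitted):

// vector triad: 2 flops per iteration, four concurrent memory streams
void triad(double* a, const double* b, const double* c,
           const double* d, long n) {
  for (long i = 0; i < n; ++i)
    a[i] = b[i] + c[i] * d[i];
}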

As the triad is purely bandwidth-bound, its performance depends crucially on the memory pages being allocated in the same locality domains as the processors that use them. However, the presence of a huge buffer cache prevents this:

The code reaches an aggregate performance of roughly 500 MFlop/s on the Opteron node and close to 400 MFlop/s on the Woodcrest. We see that performance is unharmed by the buffer cache on the UMA system, but there are huge fluctuations on the ccNUMA node. Minimum performance is reached when the cache size is about 2 GB, i.e., half of the installed memory. In this case the kernel has filled one LD with buffer cache, and all MPI processes map their pages in the second LD. We end up with the well-known congestion problem: a single LD must service the bandwidth demands of all cores. As the file size grows even further, the effect vanishes, because once all memory is filled with FS cache, any user-allocated page drops a cache page and access becomes local again.

Linux, by default, keeps the buffer cache even if that forces an application to map memory pages from a foreign LD because nothing is left in the local LD. You can prevent this, of course, by using appropriate calls from the libnuma library, but you have to know about this. Most users don’t.
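As a hedged sketch (sizes are illustrative, and the strictness setting may behave differently across kernel versions), enforcing local placement with libnuma could look like this; link with -lnuma:

#include <numa.h>
#include <cstddef>
#include <cstdio>

int main() {
  if (numa_available() < 0) {
    std::fprintf(stderr, "no NUMA support on this system\n");
    return 1;
  }
  numa_set_strict(1); // reclaim local pages rather than silently going off-node
  const std::size_t n = 1UL << 27; // illustrative size (assumes a 64-bit build)
  // request pages in the calling process' local locality domain
  double* a = static_cast<double*>(numa_alloc_local(n * sizeof(double)));
  if (a == 0) return 2;
  for (std::size_t i = 0; i < n; ++i) a[i] = 0.0; // touch to map the pages
  numa_free(a, n * sizeof(double));
  return 0;
}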

On the admin’s side we see that it’s a good idea to drop the cache whenever a cluster’s compute node becomes free. Users can’t do this themselves because you have to be root. As a makeshift, however, a user can flush and drop all cache pages by running a “sweeper” code that allocates and touches all the node’s memory before the real application starts.
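A minimal version of such a sweeper might look like this (the memory size is a placeholder that must be adapted to the node, and the code assumes a 64-bit build):

#include <cstddef>
#include <cstdlib>
#include <cstring>

int main() {
  // placeholder: set this to (roughly) the node's installed memory
  const std::size_t bytes = 4UL * 1024 * 1024 * 1024;
  char* p = static_cast<char*>(std::malloc(bytes));
  if (p == 0) return 1;
  std::memset(p, 0, bytes); // touching every page makes the kernel drop clean cache pages
  std::free(p);
  return 0;
}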

Summing up, users and admins must be aware of such effects. Imagine what happens with a job that uses hundreds of ccNUMA nodes on one of which there’s a huge FS cache…

Restricting member function calls by numeric template parameters

Thanks to Johannes for this interesting problem.

Let’s say you have a class template B, where T is an arbitrary type and C is an integer template parameter:

template <class T, int C> class B;

B has two member functions with the same name:

template <class T, int C> void B<T,C>::member(T _t);
template <class T, int C> void B<T,C>::member(T _t, int _i);

How do you make sure that the first member (the one with just a single argument) can only be called for instances of the class template with C==1? This is supposed to happen at compile time (runtime would be easy, of course).

One could (partially) specialize the whole class for C==1, but that generates a whole lot of code bloat. Another solution would be to have a base class with only the two-argument member and a derived class (inheriting from B) implementing the single-argument member. This is also unsatisfactory because the derived class must have a name different from the base class.

A more elegant solution is to declare a private class template inside B which gets instantiated when the single-argument member is called:

template <class T, int C> class B {
  template  <int U> class BX {
  public:
    BX(B<T,U>* p) {}
  };
public:
  B();
  // no problem here
  void member(T _x, int _i);
  // only valid if C==1
  void member(T _x) {
    BX<2-C> bx(this);
    ...
  }
  ...
};

Only if C==1 will a call to the single-argument member compile, because only then does BX<2-C> get instantiated with a pointer of the matching type B<T,1>. If C!=1, the compiler spits out an error message saying that it can’t find a constructor with the appropriate argument. In this special case we could just have used BX<1>, but I wanted to show that any simple integer expression will do (see below).

Admittedly, the error message is a little clumsy, but it does the job. This example can easily be generalized – remember that the ternary operator is also evaluated at compile time. All you have to do is provide a function of C that is equal to C for all permitted values of C, and different from C otherwise.
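As a hedged illustration of that generalization (my own sketch, reusing the names from above), here is a complete version that permits the single-argument call for C==1 or C==2:

template <class T, int C> class B {
  template <int U> class BX {
  public:
    BX(B<T,U>* p) {} // only instantiable with a matching B<T,U>*
  };
public:
  void member(T _x, int _i) {} // always available
  void member(T _x) {
    // the ternary equals C exactly for the permitted values 1 and 2
    BX< (C == 1 || C == 2) ? C : C + 1 > bx(this);
  }
};

int main() {
  B<double,2> b2;
  b2.member(1.0);    // compiles: C==2 is permitted
  B<double,3> b3;
  b3.member(1.0, 0); // compiles: the two-argument member is unrestricted
  // b3.member(1.0); // error: no BX constructor takes a B<double,3>*
  return 0;
}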

I’ve checked that this trick works using the Intel 10.1 and GNU 4.1.3 compilers.

Sun SPARC Enterprise T5120 at RRZE

As of this week we are the proud owners of an UltraSPARC T2 system (1 socket, 8 cores, 1.2 GHz, 32 GB). This one is a little slower than the 1.4 GHz machine we had access to at RWTH Aachen, but the characteristics are much the same. After some intensive benchmarking, this machine will be made available in our test cluster for all HPC users to evaluate.

Sun UltraSPARC T2 under test

Thanks to Sun Microsystems and the kind people at RWTH Aachen, we had access to a pre-production UltraSPARC T2 (a.k.a. “Niagara 2”) system for some tests. We didn’t manage to get the whole RRZE benchmark suite running, but we could produce some pretty interesting low-level results. The peculiar way the T2 addresses its four built-in memory controllers leads to severe congestion if the mutual alignment of memory streams is unfortunate. This can be seen, e.g., with lattice-Boltzmann codes and, of course, also with STREAM.

That said, if you know what you’re doing and somehow manage to carry what we have learned over into applications, the Niagara 2 is a pretty interesting processor. A single socket has a nominal memory bandwidth of more than 60 GB/s (40 read, 20 write), of which about a third can actually be measured. At a peak performance of 11 GFlop/s (8 cores at 1.4 GHz), this makes for a pretty impressive machine balance, far beyond that of any other cache-based CPU. And finally, the multi-threaded architecture is much less strange and hard to grasp than, e.g., the Cell design.
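For the record, the balance arithmetic (reading “about a third” of the nominal 60 GB/s as roughly 20 GB/s of measurable bandwidth):

\[ B_m \approx \frac{20\,\mathrm{GB/s}}{11\,\mathrm{GFlop/s}} \approx 1.8\ \text{bytes/flop} \approx 0.23\ \text{DP words/flop}. \]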

If you want to know more, here’s a presentation we prepared for the recent “SunDay” at RRZE. Please bear in mind that all of this is still preliminary data and will have to be confirmed on production hardware.

rrze-n2-ea.pdf

All about the N2’s microarchitecture can be found in this neat document:
http://opensparc-t2.sunsource.net/specs/OpenSPARCT2_Core_Micro_Arch.pdf

IMB multi-mode Ping-Pong demystified?

Everybody knows the ubiquitous PingPong benchmark, first published with the Pallas MPI benchmarks and today contained in Intel’s IMB suite. Using one process on each node and plotting bandwidth versus message size, you see the well-known behaviour – saturating bandwidth for large messages and latency domination for small ones (blue curve; measurements were done on our Woody cluster between nodes connected to a common IB switch nodeboard, i.e., at DDR speed):

If you run this benchmark in multi-mode, i.e., with two or more processes sending on one node and two or more receiving on the other, the bandwidth pattern changes considerably (black squares). There is a significant “overshoot” for medium-sized messages and a sudden drop at about 1 MByte. Eventually, the saturation bandwidth is the same as in the simple case.

It turns out that one can explain this, at least for the 2-2 case, by assuming that MPI (or the IB driver, or whoever) splits messages larger than a certain limit (e.g., 1 MByte) into chunks and transmits/receives those chunks alternating between the two connections. Using this model and fitting its parameters, we can predict the multi-mode Ping-Pong bandwidth quite well (orange curve). In the 3-3 case, however, things get more complicated (green curve) and there is an additional plateau. I’m not sure whether one can really extend the model to encompass this effect as well.

If you are interested, the gory details have been written up: mm-pingpong.pdf