Georg Hager's Blog

Random thoughts on High Performance Computing

OpenMP, ccNUMA and C++
If you are interested in programming with C++ and OpenMP, the just-finished diploma thesis of Holger Stengel might be interesting for you (in German – available on request). It studies ccNUMA effects in C++ and ways to circumvent them. To whet your appetite, there is a nice English poster with most of the results: poster_cppnuma.pdf

This whole work was kicked off by some of the problems I encountered during my PhD thesis, where I had parallelized a C++ code from condensed matter physics. At that time, nobody had even thought about what would happen if standard C++ elements (arrays of objects, std::vector<>, etc.) were used on a ccNUMA machine with OpenMP. Another inspiration came from Matt Austern’s article about Segmented Iterators and Hierarchical Algorithms. The segmented iterator described in this paper could be useful for many purposes, of which NUMA placement is only one. In the thesis we implemented a version in which data placement can be controlled exactly via configurable padding.
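
To illustrate the core problem, here is a minimal sketch of my own (not code from the thesis), assuming the usual Linux first-touch page placement: the thread that first writes to a page determines which NUMA node the page is allocated on. A std::vector constructor initializes all elements from a single thread, so all of its pages end up on one node, whereas a raw array combined with a parallel first-touch loop gets distributed across the nodes that will later use it.

  // Sketch (my own illustration, not from the thesis):
  // first-touch page placement on a ccNUMA system.
  #include <vector>

  int main() {
    const long N = 100 * 1000 * 1000;   // large enough to span many pages

    // std::vector<double>(N) zero-initializes its elements in the
    // constructor, i.e. from one thread only -> every page of v ends
    // up on that thread's NUMA node.
    std::vector<double> v(N);

    // Raw allocation plus parallel first touch: each thread writes
    // (and thereby places) exactly the chunk it will work on later.
    double *a = new double[N];
  #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; ++i)
      a[i] = 0.0;

    // Subsequent parallel loops with the same static schedule now
    // access mostly node-local memory.
  #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; ++i)
      a[i] = 2.0 * a[i] + v[i];

    delete[] a;
    return 0;
  }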

I would be glad to continue working on this topic with another diploma, bachelor’s, or master’s student. If you are hooked, feel free to contact me.

Array summation benchmark

A question came up on the OpenMP mailing list today concerning scalability of simple array summation on an Opteron processor. I have done some tests with the following code, using the Intel C++ compiler version 9.1:

#pragma omp parallel for private(j) reduction(+: sum)
#pragma vector always
  for (j = 0; j < N; j++){
    sum += array2[j];
  }
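
For anyone who wants to reproduce this, a self-contained harness around the kernel might look roughly like the following sketch (the omp_get_wtime() timer, the problem size, and the repetition count are illustrative assumptions, not the original setup):

  // Sketch of a benchmark harness around the summation kernel
  // (illustrative only; sizes and timer are my own choices).
  #include <cstdio>
  #include <omp.h>

  int main() {
    const long N = 1000000;      // array length; vary to scan cache levels
    const int REPEAT = 100;      // outer loop so small N takes measurable time
    double *array2 = new double[N];

    // Parallel first-touch initialization for proper NUMA placement
  #pragma omp parallel for
    for (long j = 0; j < N; ++j)
      array2[j] = 1.0;

    volatile double sink = 0.0;  // keep the compiler from dropping the work
    double t0 = omp_get_wtime();
    for (int r = 0; r < REPEAT; ++r) {
      double sum = 0.0;
  #pragma omp parallel for reduction(+: sum)
      for (long j = 0; j < N; ++j)
        sum += array2[j];
      sink = sink + sum;
    }
    double elapsed = omp_get_wtime() - t0;

    std::printf("%.2f MFlop/s\n", (double)N * REPEAT / elapsed / 1.0e6);
    delete[] array2;
    return 0;
  }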

There is an outer repetition loop around this kernel (as in the sketch above) to ensure that for small sizes we actually see the cache effect. Here is the result:

The number of threads (1T, 2T, …) is indicated. In the case of the Opteron system, this was a 2-socket dual-core 2 GHz box, and the 2-thread data was measured on one (1S) or two (2S) sockets, respectively. Proper NUMA placement was implemented. The “Conroe” system is my standard Core2 workstation.

Data from purely serial runs (no -openmp) is shown for reference. In contrast to low-level benchmarks like STREAM or the vector triad, which have more read streams and at least one write stream, there seems to be a lot of “headroom” for the second thread even at large N on an Opteron socket.

Benchmarking fun with calloc() and zero pages

In the course of our yearly lecture on “Programming Techniques for Supercomputers (PTfS)” we hand out homework assignments to the students. During the first weeks of the term, these are simple loop kernel benchmarks that are supposed to sharpen the students’ eye for basic performance bottlenecks like memory bandwidth, latency, and pipelining. The first mistake that 50% of all people writing such benchmarks make is forgetting to initialize their arrays. To prevent this, someone used the convenient calloc() function. Memory allocated this way is zeroed out and will contain floating-point zeroes (strictly speaking, a floating-point zero does not have to be represented by an all-zero bit pattern, but in practice this works). The code looked roughly like this (simplified):

  double *a = (double*)calloc(N, sizeof(double));
  double *b = (double*)calloc(N, sizeof(double));
  // ... same for c and d  
  double start_t = get_walltime();
  for (int i = 0; i < N; ++i)
    a[i] = b[i] + c[i] * d[i];
  double wctime = get_walltime() - start_t;
  // now report walltime and performance

The goal was to benchmark the well-known vector triad, which is limited by memory bandwidth on all computer architectures for large N (beyond cache sizes). On the system under consideration, we would have expected just below 200 MFlop/s. To our great surprise, the benchmark yielded a blazing 1.4 GFlop/s! Something was obviously wrong.

At first we suspected some ingenious compiler optimization that somehow eliminated the whole loop and jumped to the end right away. This could easily be ruled out by making sure the results were actually used and by checking that the absolute runtime depends on N and other parameters in the expected way. Interestingly, if another (redundant) initialization loop is added after the calloc()s,

  for (int i = 0; i < N; ++i)
    b[i] = c[i] = d[i] = 0.0;

performance goes down to the expected level, although the array contents are bitwise identical to those in the first version of the code. What finally left us completely baffled was the assembly code generated for the kernel loop: there was no difference at all between the two versions.

In the end, Michael found the solution. I must frankly admit that I would never have figured it out by myself – here we go:

When memory is allocated with calloc(), the amount requested is not allocated right away. Instead, all pages belonging to the memory block are mapped, by some MMU magic (links below), onto a single page containing all zeroes. If such pages are only read (which was true for arrays b, c and d in the original version of the benchmark), the data is provided from that single zero page, which – of course – fits into cache. So much for memory-bound loop kernels. As soon as a page gets written to (no matter how), a page fault occurs, a “real” page is mapped, and the contents of the zero page are copied into it. This is called copy-on-write, a well-known optimization technique (which I have even taught multiple times in my C++ lectures). After that, the zero-page trick no longer works for that page, and this is why performance was so much lower after inserting the – supposedly redundant – initialization loop.
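
A minimal way to see this effect in isolation might look like the following sketch (my own illustration, not the original PTfS benchmark; the std::chrono timing and the array size are assumptions). The first read loop runs over pages that are still backed by the zero page; after the explicit write, copy-on-write has mapped real pages and the same loop has to stream from main memory:

  // Sketch: read time of a calloc'ed array before and after the
  // first write (copy-on-write / zero-page effect).
  #include <chrono>
  #include <cstdio>
  #include <cstdlib>

  static double time_sum(const double *b, long n, double *sum) {
    auto t0 = std::chrono::steady_clock::now();
    double s = 0.0;
    for (long i = 0; i < n; ++i)
      s += b[i];
    auto t1 = std::chrono::steady_clock::now();
    *sum = s;   // use the result so the loop is not optimized away
    return std::chrono::duration<double>(t1 - t0).count();
  }

  int main() {
    const long N = 50 * 1000 * 1000;   // well beyond any cache size
    double *b = (double*)calloc(N, sizeof(double));
    double sum;

    // 1st read: all pages of b still point to the shared zero page
    double t1 = time_sum(b, N, &sum);
    std::printf("untouched calloc'ed array: %.3f s (sum=%g)\n", t1, sum);

    // Write once: copy-on-write faults map real, distinct pages
    for (long i = 0; i < N; ++i)
      b[i] = 0.0;

    // 2nd read: now the loop really streams from main memory
    double t2 = time_sum(b, N, &sum);
    std::printf("after explicit init:       %.3f s (sum=%g)\n", t2, sum);

    free(b);
    return 0;
  }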

It’s fascinating how many hours you can spend in front of your monitor staring at 20 lines of code.

Cheers!

Links:

“Copy on write” page from the Kernel Analysis HOWTO: http://tldp.org/HOWTO/KernelAnalysis-HOWTO-10.html#ss10.4

Wikipedia entry (mentions the calloc() issue explicitly): http://en.wikipedia.org/wiki/Copy-on-write

Gauss Centre for Supercomputing founded

Recently, the three national supercomputer centers in Germany (HLRS, LRZ, NIC) joined forces and founded the GCS, the “Gauss Centre for Supercomputing” (see the autumn 2006 edition of inSiDE). Apart from giving the German supercomputing community a voice in the upcoming process of building a European HPC infrastructure, it is the declared goal of this alliance to “synchronize and optimize [the centers’] existing successful support structures within the GCS”, something that raises great expectations among the scientific users.

It will be the latter point on which success or failure of this new construct will be judged.