# ccNUMA and Linux buffer cache

It may seem surprising to some, but ccNUMA has hit the mass market and will forcefully continue to do so by the end of 2008 with Intel’s Nehalem processor. And although ccNUMA bears the potential of vastly improved bandwidth scalability, many users and sysadmins meet it with ignorance. Alas, their ignorance is all too often vindicated by the fact that they are right – sometimes.

This is because the vast majority of parallel application codes uses MPI. If you run one MPI process per core and the kernel does an average job of maintaining affinity, you will benefit from ccNUMA without the hassle of paying attention to correct parallel page placement. The latter is mandatory with memory-bound OpenMP code and sometimes not easy to implement, in particular when the problem is not as regular as, say, a simple dense matrix-vector multiply.

However, even with MPI you can run into the ccNUMA trap when the system’s memory is filled with something else before your code starts to run. This could be, e.g., file system buffer cache. In order to pinpoint the problem, Michael has done an interesting test on one of our dual-socket (dual-LD) Opteron nodes and, for comparison, on a dual-socket Woodcrest system. The former is ccNUMA while the latter is UMA. The test performs the following steps in a loop:

1. Drop the FS caches by “echo 1 > /proc/sys/vm/drop_caches“. Btw, this facility exists since Linux kernel 2.6.16. Earlier kernels may have similar features, but those are not “official”. This operation is equivalent to using the bcfree command on an SGI Altix.
2. Write a file of some size (increasing with iteration count) to the local disk. The maximum file size equals the amount of memory.
3. sync the cache so that there are no flush operations in the background (but the cache is still filled).
4. Run a vector triad benchmark with 4 MPI processes that fills exactly half of the node’s memory.

As the triad is purely bandwidth-bound, its performance depends crucially on the memory pages being allocated in the same locality domains as the processors that use them. However, the presence of a huge buffer cache prevents this:

The code can get an aggregate performance of roughly 500 MFlop/s on the Opteron node and close to 400 MFlop/s on the Woodcrest. We see that performance is unharmed by the buffer cache on the UMA system, but there are huge fluctuations on the ccNUMA node. Minimum performance is reached if cache size is about 2 GB which is half of the installed memory. In this case the kernel has filled one LD with buffer cache and all MPI processes map their pages in the second LD. We end up with the well-known congestion problem – a single LD must service the bandwidth demands of all cores. With the file size growing even further the effect vanishes because if all memory is filled by FS cache, any user-allocated page will drop a cache page and access becomes local again.

Linux, by default, keeps the cache even if an application is forced to map memory pages from a foreign LD if there is nothing left in the local LD. You can prevent this, of course, by using appropriate calls from the libnuma library, but you have to know about this. Most users don’t.

On the admin’s side we see that it’s a good idea to drop the cache whenever a cluster’s compute node becomes free. Users can’t do this themselves because you have to be root. As a makeshift, however, a user can flush and drop all cache pages by running a “sweeper” code that allocates and touches all the node’s memory before the real application starts.

Summing up, users and admins must be aware of such effects. Imagine what happens with a job that uses hundreds of ccNUMA nodes on one of which there’s a huge FS cache…

# OpenMP, ccNUMA and C++

OpenMP, ccNUMA and C++
If you are interested in programming with C++ and OpenMP, the just-finished diploma thesis of Holger Stengel might be interesting for you (in German – available on request). It studies ccNUMA effects in C++ and ways to circumvent them. To fuel your appetite, there is a nice English poster with most of the results: poster_cppnuma.pdf

This whole work was kicked off by some of the problems I had encountered during my PhD thesis where I had parallelized a C++ code from condensed matter physics. At that time, nobody had even thought about what would happen if standard C++ elements (arrays of objects, std::vector<> etc.) were used on a ccNUMA machine with OpenMP. Another inspiration came from Matt Austern‘s article about Segmented Iterators and Hierarchical Algorithms. The segmented iterator described in this paper could by useful for many purposes, of which NUMA placement is only one. In the thesis we implemented a version in which you could exactly control data placement by configurable padding.

I would be glad to continue on this topic with another diploma/bachelor/masters student. If you are hooked, feel free to contact me.

# Array summation benchmark

A question came up on the OpenMP mailing list today concerning scalability of simple array summation on an Opteron processor. I have done some tests with the following code, using the Intel C++ compiler version 9.1:

#pragma omp parallel for private(j) reduction(+: sum)
#pragma vector always
for (j = 0; j < N; j++){
sum += array2[j];
}

There is a loop around that to ensure that for small sizes we actually see the cache effect. Here is the result:

The number of threads (1T, 2T,…) is indicated. In case of the Opteron system, this was a 2-socket dual-core 2GHz box and the 2-thread data was correspondingly measured on one (1S) or two (2S) sockets, respectively. Proper NUMA placement was implemented. The “Conroe” system is my standard Core2 workstation.

Data on purely serial runs (no -openmp) is shown for reference. In contrast to low-level benchmarks like the stream or vector triads which have more read streams and at least one write stream, there seems to be a lot of “headroom” for the second thread even for large N on an Opteron socket.