Benchmarking fun with calloc() and zero pages

In the course of our yearly lecture on “Programming Techniques for Supercomputers (PTfS)” we hand out homework assignments to the students. During the first weeks of the term these are simple loop kernel benchmarks that are supposed to sharpen the students’ view of basic performance bottlenecks such as memory bandwidth, latency, and pipelining. The first bug that 50% of all people doing such benchmarks encounter is that they forget to initialize their arrays. So, to prevent this, someone used the convenient calloc() function. Memory allocated in this way is zeroed out and will contain floating-point zeroes (strictly speaking, a floating-point zero does not have to be represented by an all-zero bit pattern, but in practice this works). The code looked roughly like this (simplified):

  double *a = (double*)calloc(N, sizeof(double));
  double *b = (double*)calloc(N, sizeof(double));
  double *c = (double*)calloc(N, sizeof(double));
  double *d = (double*)calloc(N, sizeof(double));

  double start_t = get_walltime();
  for(int i=0; i<N; ++i)
    a[i] = b[i] + c[i] * d[i];
  double wctime = get_walltime() - start_t;
  // now report walltime and performance

The goal was to benchmark the well-known vector triad which is limited by memory bandwidth on all computer architectures for large N (beyond cache sizes). On the system under consideration, we would have expected just below 200 MFlop/s. To our great surprise, the benchmark yielded a blazing performance of 1.4 GFlop/s! Something was obviously wrong.
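
The expected figure can be reconstructed with a simple bandwidth estimate (the memory bandwidth value below is my assumption about the machine; the post does not state it): each iteration performs 2 flops (one multiply, one add) and – counting the write-allocate transfer for a – moves five 8-byte words.

```latex
P \;=\; \frac{2\ \text{flops}}{5 \times 8\ \text{bytes}} \cdot b_\text{mem}
  \;\approx\; \frac{2\ \text{flops}}{40\ \text{bytes}} \cdot 4\ \text{GB/s}
  \;=\; 200\ \text{MFlop/s}
```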

At first we suspected some ingenious compiler optimization that somehow eliminated the whole loop and jumped to the end right away. This could be ruled out easily by making sure the results are actually used and by checking that the absolute runtime depends on N and other parameters in the expected way. Interestingly, if another (redundant) initialization loop is added after the calloc()s,

  for(int i=0; i<N; ++i)
    b[i] = c[i] = d[i] = 0.0;

performance goes down to the expected level, although the array contents are bitwise identical compared to the first version of the code. And what finally left us completely baffled was the assembly code generated for the kernel loop: no difference at all between the two versions.

In the end, Michael found the solution. I must frankly admit that I never would have figured that out by myself – here we go:

When memory is allocated using calloc(), the requested amount is not actually allocated right away. Instead, all pages that belong to the memory block are mapped to a single page containing all zeroes by some MMU magic (links below). As long as such pages are only read (which was true for arrays b, c and d in the original version of the benchmark), the data is served from that single zero page, which – of course – fits into cache. So much for memory-bound loop kernels. As soon as a page gets written to (no matter how), a fault occurs, a “real” page is mapped, and the contents of the zero page are copied into it. This is called copy-on-write, a well-known optimization technique (which I have even taught multiple times in my C++ lectures). After that, the zero-read trick no longer works for that page, and this is why performance was so much lower after inserting the – supposedly redundant – init loop.

It’s fascinating how many hours you can spend in front of your monitor staring at 20 lines of code.

Cheers!

“Copy on write” page from the Kernel Analysis HOWTO: http://tldp.org/HOWTO/KernelAnalysis-HOWTO-10.html#ss10.4

Wikipedia entry (mentions the calloc() issue explicitly): http://en.wikipedia.org/wiki/Copy-on-write