The McCalpin STREAM benchmark: How to do it right and interpret the results

The STREAM benchmark [1] is more than 20 years old, and it has become the de facto standard for measuring the maximum achievable memory bandwidth of processors. However, some of the published numbers are misinterpreted. This is partly because STREAM is, contrary to expectations, not a “black box” benchmark that you can trust to yield the right answer. In order to understand the STREAM results, it is necessary to grasp the following concepts:

1. Machine topology and thread affinity
2. Write-allocate transfers

Most of the mistakes people make when taking STREAM measurements are based on a mis- or non-understanding of these two concepts.

The STREAM loops

Leaving out the gory details, the STREAM benchmark works as follows [correction 2019-03-18: spurious factor 8 removed]:

double precision A(ndim),B(ndim),C(ndim),D(ndim),t
! initialize here
t = timestamp()
C(:) = A(:)
print *,"COPY:  ",ndim*16.d0/(timestamp()-t)/1.d6," Mbyte/s"
t = timestamp()
B(:) = s * C(:)
print *,"SCALE: ",ndim*16.d0/(timestamp()-t)/1.d6," Mbyte/s"
t = timestamp()
C(:) = A(:) + B(:)
t = timestamp()
A(:) = B(:) + s * C(:)


The real code runs each loop a number of times, reports min/max/average times, and computes the bandwidth based on the fastest run per loop. Also it tries to figure out the clock granularity and other low-level stuff. Of course, all loops are parallelized with OpenMP using standard “parallel do” without any specified OpenMP schedule.1 And finally, proper parallel first touch initialization is employed for best performance on ccNUMA systems. Let’s not make a fuss about why exactly these four loops were chosen when STREAM was developed. The rules laid down by Dr. Bandwidth say that each array must have a size of at least four times the outer-level cache of the machine.

Runtime variations and affinity

All of this is usually not your concern. What you get is four numbers: one bandwidth measurement for each of the four loops. For the first test I compiled the C version with GCC 7.3.0 with options -Ofast -fargument-noalias -mavx2 -fopenmp on a dual-socket AMD Epyc 7451 node (48 physical cores, two SMT threads per core). The ndim parameter was set to 88000000, i.e., the complete working set was about 2 GiB. I did not set the OMP_NUM_THREADS variable, nor did I enforce any thread affinity. The compiler or the runtime should do this, right? This is the result of running the benchmark 50 times (statistics by me):

Copy  Mean: 79386.4 +- 11999.8 | Median: 76768.1 Min: 67028.9 Max: 140198.5
Scale Mean: 84607.9 +- 31959.9 | Median: 82606.8 Min: 46424.2 Max: 142902.3
Add   Mean: 59882.3 +- 21599.1 | Median: 51740.6 Min: 46641.9 Max: 151342.3
Triad Mean: 64959.2 +- 29583.2 | Median: 52831.1 Min: 47525.7 Max: 155740.6

This machine has a peak (theoretical) memory bandwidth of 341 GByte/s. The first thing that leaps to the eye is that even the maximum readings (last column) fall short of the theoretical limit by more than 2x. Also there seems to be a large variation across measurements. Knowing that few performance measurements in high performance computing are actually normally distributed [2] we can safely ignore the mean and standard deviation; the “poor man’s” solution here is to report at least minimum, maximum, and median, as I have done.

First, let’s take control of thread affinity so that the results won’t fluctuate that much. Our hope that the runtime would do the right thing here was deceived, so we use likwid-pin to run 48 threads on the physical cores only. The first run above was with 96 threads [the default set by the GNU OpenMP runtime], and we know that SMT will usually do no good to a memory-bound code):

\$ likwid-pin -c N:0-47 ./stream

These are the results:

Copy  Mean: 240636.6 +- 672.7 | Median: 240602.2 Min: 238609.3 Max: 242429.4
Scale Mean: 175613.1 +- 127.2 | Median: 175615.0 Min: 175359.4 Max: 175892.2
Add   Mean: 186221.1 +- 197.9 | Median: 186209.7 Min: 185819.1 Max: 186653.1
Triad Mean: 186519.8 +- 173.3 | Median: 186523.4 Min: 186111.9 Max: 186885.4

Write-allocate transfers

Well, at least the variation across runs is gone, but we now see two (maybe three) different performance levels: about 240 GByte/s for Copy, about 186 GByte/s for Add and Triad, and about 176 GByte/s for Scale. Where do these come from?

We could take the easy way out and say that this machine was obviously built for Copy-like streaming patterns. Or we could man up and take a look at the assembly code, which reveals that the Scale, Add, and Triad loops are actually compiled code but Copy isn’t: It is substituted by a call to the memcpy() function, obviously in hopes that whoever implemented memcpy() has done a good job. As a side note, to avoid this kind of interference we can compile with the option -ffreestanding (-fno-builtin will do the same), which assumes that there is no libc and thus must produce actual code for the copy loop. With this option, the Copy performance indeed goes down to the level of Scale.

But why? What does memcpy() do to speed up the loop so much? This brings us to the second point above, the write-allocate. Since the registers of a core can usually only talk to the L1 cache directly, machines with write-back caches have to bring a cache line into the cache whenever a write miss to that cache line occurs. Hence, for every store without a previous read to the same cache line we have to count an additional read. The bandwidth numbers  that the benchmark code above assumes are based on the minimum data traffic of 16 bytes per iteration for Copy and Scale, and 24 bytes per iteration for Add and Triad. If the write-allocate applies, the correct numbers are 24 and 32 bytes, respectively, and the output of the benchmark must be corrected by a factor of 1.5 (Copy/Scale) or 1.33 (Add/Triad). This line of thinking is what John McCalpin calls “hardware byte counting” [3], and for me it is the only valid option. I want to know how many bytes the memory interface can transfer per second, and if that means I have to account for write-allocates, so be it.

Since $176\times 1.5=264$ and $186\times 1.33=248$, the true memory bandwidth for Copy/Scale and Add/Triad was 264 GByte/s and 248 GByte/s, respectively. Most modern CPUs have a rather “symmetric” memory interface, meaning that it doesn’t matter much for the achievable bandwidth if you write only, read only, or do both (IBM Power and Intel Xeon Knights Landing are exceptions). A small deviation (6% in our case) is normal.

Still we don’t know how memcpy() somehow managed to be better than the compiler. Wild speculation is not unheard of in such cases; simple math shows that it cannot just be “better loop code,” or “less threading overhead,” or “better prefetching”; the data transfer volume must have been reduced somehow.

Nontemporal stores

X86 CPUs have learned to circumvent the write-allocate quite early by supporting nontemporal (NT), or streaming store instructions. A streaming store can work around the restriction that registers can only talk to the inner cache: It stores directly from the register to memory (in fact it’s a little more complicated than that, but the details do not matter). If this is possible, the data traffic volumes hard-coded into the STREAM source are correct. Streaming stores are what memcpy() uses under the hood in our case. This does not seem to work on all x86 machines; if you know why sometimes memcpy() fails to use NT stores, please educate me.

There are some restrictions: Intel’s NT store instructions come only in SIMD-vectorized variants, so if a loop cannot be vectorized it may be very hard to use them. In simple cases the compiler can employ them automatically. Gcc doesn’t, but icc (with options -Ofast -qopenmp -ffreestanding) does and boosts the STREAM bandwidth quite a bit:

Copy  Mean: 280000.5 +- 2443.9 | Median: 280483.5 Min: 268142.9 Max: 280816.9
Scale Mean: 279588.6 +- 624.4  | Median: 279699.7 Min: 276452.6 Max: 280031.3
Add   Mean: 277762.8 +- 424.3  | Median: 277530.9 Min: 277239.9 Max: 278635.2
Triad Mean: 277654.9 +- 374.4  | Median: 277474.4 Min: 277205.2 Max: 278372.5

The 280 GByte/s are 82% of the theoretical peak bandwidth, which is OK for a machine of this type. The option -qopt-streaming-stores never|always can influence the compiler’s behavior, but it will never use streaming stores if it cannot prove that they are safe. With -qopt-streaming-stores never, the icc-compiled STREAM performance is on par with gcc.

Cache line claim

Streaming stores are not the only way to avoid write-allocates. Some processors (not x86) have instructions to claim the ownership of a cache line in the cache without actually reading it (in the PowerPC ISA, the “data cache block touch (dcbz)” instruction exists for that). I have never seen a compiler use these, but the principle can also be cast into hardware: Once the core detects that a full cache line will be overwritten, it can claim the line in the cache without reading it, all automatically. ARM processors are capable of this, which means that even if the compiler doesn’t use streaming stores, write-allocates do not occur with STREAM (if the feature is activated). There is no guarantee, though, that the hardware does the right thing in more complicated cases than STREAM, which is giving me headaches right now when benchmarking stencil codes on a Cavium ThunderX2 processor.

Summary

One might think that proper default affinity should be a no-brainer for current compilers, especially with OpenMP code. This is not true: You have to take control of thread-core affinity. In case of STREAM, running one thread per physical core is usually best. The benchmark itself has no means of knowing whether or not write-allocate transfers happen, so it assumes the best case of 16 bytes per iteration for Copy and Scale, and 24 bytes per iteration for Add and Triad, respectively. It is the benchmarker’s task to figure out if a correction factor must be applied to get the real bandwidth. If you take the benchmark’s output for the truth, chances are that the capabilities of your machine are underestimated. I am deliberately not giving a table of “optimal” compiler switches and affinity settings for different compiler and hardware combinations here. You figure that out for yourself, you learn something.

Of course, all of this has been known for a long time, and there are several articles on the web (some by Dr. Bandwidth himself) that give all the necessary details and advice. It is therefore all the more surprising that blatant mistakes are still being made when presenting STREAM numbers. STREAM is not an automatic benchmark framework that you can run and expect to give the right answer. Some brain is required.

1 Intentionally or not, this choice and the popularity of the STREAM benchmark make sure that, although the default OpenMP schedule is implementation dependent, all compilers will actually choose “static” in practice. Anything else would have the potential to mess up STREAM results in very weird ways.

[1] McCalpin, John D., 1995: “Memory Bandwidth and Machine Balance in Current High Performance Computers”, IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, December 1995.

[2] T. Hoefler and R. Belli: Scientific benchmarking of parallel computing systems: twelve ways to tell the masses when reporting performance results. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’15). ACM, New York, NY, USA, Article 73, 12 pages. DOI: 10.1145/2807591.2807644