Thomas Zeiser

Switching from one compiler version to another can have a significant influence on performance, but even moving one patch level ahead may change your performance …

The Vector-TRIAD benchmark (a(:)=b(:)+c(:)*d(:) according to Schoenauer) was run on the new Woodcrest cluster at RRZE, which consists of HP DL140G3 boxes. The performance is given in MFlop/s for a loop length of 8388608. Each value is the aggregate performance of the 4 MPI processes running on one node in saturation mode.
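For readers who have not seen the kernel before, here is a minimal sketch in Python of what is being measured and how a MFlop/s figure follows from a timing. It is an illustration only, not the benchmark code used for the numbers in this post (that is compiled Fortran). The function names, the small loop length, and the repetition count are assumptions made for the example.

```python
import time


def vector_triad(a, b, c, d):
    """Schoenauer vector triad: a(i) = b(i) + c(i) * d(i)."""
    for i in range(len(a)):
        a[i] = b[i] + c[i] * d[i]


def mflops(n, repetitions, seconds):
    """Each loop iteration performs two flops: one multiply and one add."""
    return 2.0 * n * repetitions / seconds / 1.0e6


if __name__ == "__main__":
    n = 100_000        # illustrative; the post uses a loop length of 8388608
    repetitions = 5    # illustrative repetition count
    a = [0.0] * n
    b = [1.0] * n
    c = [2.0] * n
    d = [0.5] * n

    start = time.perf_counter()
    for _ in range(repetitions):
        vector_triad(a, b, c, d)
    elapsed = time.perf_counter() - start

    # Interpreted Python is far slower than compiled Fortran; only the
    # counting of flops carries over to the real benchmark.
    print(f"{mflops(n, repetitions, elapsed):.1f} MFlop/s")
```

The absolute number printed here says nothing about the hardware; the point is only that the reported MFlop/s is 2 flops per iteration times the number of iterations, divided by the run time.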

SNOOP filter of the 5000X chipset enabled

Performance in MFlop/s for loop length 8388608
on 2-socket Woodcrest node with 4 MPI processes
compiler| default | USE_COMMON
--------+---------+-----------
9.1-033 | 374.0   | 352.5
9.1-039 | 374.1   | 352.3
9.1-045 | 359.0 ! | 352.4
10.0-13 | 373.4   | 377.6 !
10.0-17 | 373.7   | 352.0

SNOOP filter of the 5000X chipset disabled (switching the snoop filter off only works with the latest BIOS, v1.12, released on April 16!)

compiler| default | USE_COMMON
--------+---------+-----------
9.1-033 | 376     | 332
9.1-039 | 376     | 331
9.1-045 | 341  !! | 331
10.0-13 | 376     | 380 !!
10.0-17 | 376     | 331

The “default” version always refers to arrays which were known at compile time; “USE_COMMON” means that the arrays have additionally been put into a common block.
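To quantify the entries marked with exclamation marks, the following sketch computes how far each compiler deviates from the typical value in its column. The numbers are copied from the table with the snoop filter disabled. Using the column median as the baseline is a choice made for this example, not something taken from the measurements themselves.

```python
from statistics import median

# MFlop/s with the snoop filter disabled: (default, USE_COMMON)
RESULTS = {
    "9.1-033": (376, 332),
    "9.1-039": (376, 331),
    "9.1-045": (341, 331),
    "10.0-13": (376, 380),
    "10.0-17": (376, 331),
}


def deviations(results, column):
    """Percent deviation of each compiler from the median of one column."""
    baseline = median(row[column] for row in results.values())
    return {
        compiler: 100.0 * (row[column] - baseline) / baseline
        for compiler, row in results.items()
    }


if __name__ == "__main__":
    for column, name in enumerate(("default", "USE_COMMON")):
        print(name)
        for compiler, percent in deviations(RESULTS, column).items():
            print(f"  {compiler}: {percent:+.1f} %")
```

With these inputs, 9.1-045 comes out about 9.3 % below the median in the default column, and 10.0-13 about 14.8 % above the median in the USE_COMMON column.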

For reference, the STREAM values in MB/s for 4 OpenMP threads (with added NONTEMPORAL directives, Array size = 20000000, Offset = 0) are also given:

Function | Snoop filter on | Snoop filter off
         | Rate (MB/s)     | Rate (MB/s)
---------+-----------------+-----------------
Copy     | 7492.0444       | 6178.7991
Scale    | 7485.3591       | 6174.8763
Add      | 6145.5004       | 6180.6296
Triad    | 6152.6559       | 6189.2369

The results were more or less identical when using fce-9.1.039 and fce-9.1.45!
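To put the MFlop/s numbers of the Vector-TRIAD next to the STREAM rates, one can convert them into memory traffic. The sketch below assumes 8-byte reals, which this post does not state explicitly. Per iteration the vector triad loads three words and stores one; if the store is an ordinary (not non-temporal) store, the target cache line is additionally read first (write-allocate), giving five words of actual traffic. MB means 10^6 bytes, as in STREAM. Note that the STREAM Triad is a different kernel with only two loads and one store.

```python
BYTES_PER_WORD = 8      # assumption: double precision reals
FLOPS_PER_ITERATION = 2  # one multiply and one add


def triad_traffic_mb_s(mflops, write_allocate):
    """Memory traffic in MB/s implied by a vector triad MFlop/s value."""
    million_iterations_per_s = mflops / FLOPS_PER_ITERATION
    # 3 loads + 1 store, plus 1 extra load when the store write-allocates
    words = 5 if write_allocate else 4
    return million_iterations_per_s * words * BYTES_PER_WORD


if __name__ == "__main__":
    for mflops in (374.0, 376.0):
        counted = triad_traffic_mb_s(mflops, write_allocate=False)
        actual = triad_traffic_mb_s(mflops, write_allocate=True)
        print(f"{mflops} MFlop/s: {counted} MB/s counted, "
              f"{actual} MB/s including write-allocate")
```

For the 374.0 MFlop/s entry this gives 5984.0 MB/s of counted traffic and 7480.0 MB/s when the write-allocate transfer is included.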

Reasons for performance differences: The main reason for the performance differences is unrolling (add -unroll0 to avoid it; thanks to Intel for pointing this out). The 10.0 compilers seem to be much more aggressive when doing optimizations and vectorization. By default, the 10.0 compiler may now use non-temporal stores automatically in certain cases. If you want to avoid that, use -opt-streaming-stores never. Even if non-temporal stores are disabled on the command line, the compiler directive "vector nontemporal" will still be respected. A directive to avoid non-temporal stores for a specific loop only is not (yet) available.
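To check which of the two effects matters for a given compiler version, one can build the benchmark once per flag combination and compare the results. The sketch below only assembles the command lines and does not run them. The flags -unroll0 and -opt-streaming-stores never are the ones mentioned above; the source file name triad.f90, the executable names, and the empty list of base flags are placeholders for this example, not the actual build setup used for the measurements.

```python
from itertools import product

COMPILER = "ifort"     # Intel Fortran compiler driver, assumed to be on PATH
SOURCE = "triad.f90"   # hypothetical source file name
BASE_FLAGS = []        # fill in the optimization flags you normally use

UNROLL = {
    "unroll-default": [],
    "unroll-off": ["-unroll0"],
}
STREAMING = {
    "nt-default": [],
    "nt-never": ["-opt-streaming-stores", "never"],
}


def build_commands():
    """Return one compile command per combination of the two flag sets."""
    commands = {}
    for (u_name, u_flags), (s_name, s_flags) in product(
        UNROLL.items(), STREAMING.items()
    ):
        exe = f"triad_{u_name}_{s_name}"
        commands[exe] = [
            COMPILER, *BASE_FLAGS, *u_flags, *s_flags, "-o", exe, SOURCE
        ]
    return commands


if __name__ == "__main__":
    for command in build_commands().values():
        print(" ".join(command))
```

Running each of the four resulting executables under the same conditions shows whether a performance change comes from unrolling, from automatic non-temporal stores, or from both.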

Some comments by Thomas Zeiser about HPC@RRZE and other things

Vector-TRIAD on woody using different versions of the Intel EM64T Fortran compiler