Thomas Zeiser

Some comments by Thomas Zeiser about HPC@RRZE and other things

Vector-TRIAD on woody using different versions of the Intel EM64T Fortran compiler

Switching from one compiler version to another can have a significant influence on performance, and even moving up a single patch level may change your results.

The Vector-TRIAD benchmark (a(:)=b(:)+c(:)*d(:), following Schoenauer) was run on the new Woodcrest cluster at RRZE, which consists of HP DL140G3 boxes. Performance is given in MFlop/s for a loop length of 8388608. Each value is the aggregate performance of 4 MPI processes running on one node in saturation mode.
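To relate the MFlop/s figures to memory bandwidth, the following minimal Python sketch spells out the triad kernel and converts a performance value into memory traffic. It rests on assumptions that are not stated in this post: 8-byte double-precision elements, 2 flops per iteration, and 3 loads plus 1 store per iteration (plus one extra load of a(:) if the store triggers a write-allocate). The function names are purely illustrative.

```python
# Assumptions (not taken from the benchmark source):
#   - 8-byte double-precision array elements
#   - 2 flops per iteration (one multiply, one add)
#   - 3 loads + 1 store per iteration, +1 load of a(:) with write-allocate
BYTES_PER_ELEMENT = 8
FLOPS_PER_ITERATION = 2


def triad(b, c, d):
    """Reference kernel: a(i) = b(i) + c(i) * d(i)."""
    return [bi + ci * di for bi, ci, di in zip(b, c, d)]


def mflops_to_mbytes(mflops, write_allocate=False):
    """Convert a triad MFlop/s figure into memory traffic in MB/s (10**6 bytes/s)."""
    streams = 5 if write_allocate else 4
    million_iterations_per_s = mflops / FLOPS_PER_ITERATION
    return million_iterations_per_s * streams * BYTES_PER_ELEMENT


print(triad([1.0, 2.0], [3.0, 4.0], [5.0, 6.0]))      # [16.0, 26.0]
print(mflops_to_mbytes(374.0))                        # 5984.0
print(mflops_to_mbytes(374.0, write_allocate=True))   # 7480.0
```

Under these assumptions, 374 MFlop/s corresponds to roughly 6.0 GB/s of traffic without write-allocate and roughly 7.5 GB/s with it.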

SNOOP filter of the 5000X chipset enabled

Performance in MFlop/s for loop length 8388608
on 2-socket Woodcrest node with 4 MPI processes
compiler| default | USE_COMMON
--------+---------+-----------
9.1-033 | 374.0   | 352.5
9.1-039 | 374.1   | 352.3
9.1-045 | 359.0 ! | 352.4
10.0-13 | 373.4   | 377.6 !
10.0-17 | 373.7   | 352.0

SNOOP filter of the 5000X chipset disabled (switching the snoop filter off requires the latest BIOS, v1.12, released on April 16!)

compiler| default | USE_COMMON
--------+---------+-----------
9.1-033 | 376     | 332
9.1-039 | 376     | 331
9.1-045 | 341  !! | 331
10.0-13 | 376     | 380 !!
10.0-17 | 376     | 331

The “default” version always refers to arrays that are known at compile time; “USE_COMMON” means that the arrays have additionally been put into a common block.
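The outliers marked with "!" can also be found mechanically. The sketch below compares every entry of the snoop-filter-disabled table with the median of its column and reports deviations above a threshold. The 2 % threshold and the function name are arbitrary choices for illustration, not part of the original measurements.

```python
from statistics import median

# MFlop/s values copied from the table with the snoop filter disabled.
RESULTS = {
    "9.1-033": {"default": 376, "USE_COMMON": 332},
    "9.1-039": {"default": 376, "USE_COMMON": 331},
    "9.1-045": {"default": 341, "USE_COMMON": 331},
    "10.0-13": {"default": 376, "USE_COMMON": 380},
    "10.0-17": {"default": 376, "USE_COMMON": 331},
}


def outliers(results, variant, threshold_percent=2.0):
    """Return {compiler: percent deviation from the column median} for large deviations."""
    column_median = median(row[variant] for row in results.values())
    flagged = {}
    for compiler, row in results.items():
        percent = 100.0 * (row[variant] - column_median) / column_median
        if abs(percent) >= threshold_percent:  # threshold is an arbitrary choice
            flagged[compiler] = round(percent, 1)
    return flagged


print(outliers(RESULTS, "default"))     # {'9.1-045': -9.3}
print(outliers(RESULTS, "USE_COMMON"))  # {'10.0-13': 14.8}
```

With these numbers, 9.1-045 is about 9 % slower than the other versions in the default variant, while 10.0-13 is about 15 % faster in the USE_COMMON variant.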

 

For reference, the STREAM values in MB/s for 4 OpenMP threads (with added NONTEMPORAL directives, Array size = 20000000, Offset = 0) are also given:

Function | Snoop filter on (MB/s) | Snoop filter off (MB/s)
---------+------------------------+------------------------
Copy     | 7492.0444              | 6178.7991
Scale    | 7485.3591              | 6174.8763
Add      | 6145.5004              | 6180.6296
Triad    | 6152.6559              | 6189.2369

The STREAM results were more or less identical with fce-9.1.039 and fce-9.1.045!
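The effect of the snoop filter on the STREAM numbers is easier to see as a relative change. The following small sketch computes it from the rates in the table above; the helper name is only illustrative.

```python
# STREAM rates in MB/s copied from the table above: (snoop filter on, snoop filter off).
STREAM_RATES = {
    "Copy": (7492.0444, 6178.7991),
    "Scale": (7485.3591, 6174.8763),
    "Add": (6145.5004, 6180.6296),
    "Triad": (6152.6559, 6189.2369),
}


def snoop_filter_gain(rate_on, rate_off):
    """Percent change in bandwidth when the snoop filter is enabled."""
    return 100.0 * (rate_on - rate_off) / rate_off


gains = {name: snoop_filter_gain(on, off) for name, (on, off) in STREAM_RATES.items()}
for name, gain in gains.items():
    print(f"{name:<6} {gain:+.1f} %")
```

Copy and Scale gain roughly 21 % with the snoop filter enabled, whereas Add and Triad change by less than 1 %.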

Reasons for performance differences: The main cause is loop unrolling (add -unroll0 to avoid it; thanks to Intel for pointing this out). The 10.0 compilers also seem to be much more aggressive in optimization and vectorization. In particular, the 10.0 compiler may now use non-temporal stores automatically in certain cases. To prevent that, use -opt-streaming-stores never. Even when non-temporal stores are disabled on the command line, the compiler directive vector nontemporal is still respected. A directive that disables non-temporal stores for a single loop only is not (yet) available.