Switching from one compiler version to another can have a significant influence on performance, but even moving one patch level ahead may change your performance …
The Vector-TRIAD benchmark (a(:)=b(:)+c(:)*d(:) according to Schoenauer) was run on the new Woodcrest cluster at RRZE, which consists of HP DL140G3 boxes. The performance is given in MFlop/s for a loop length of 8388608. Each value is the aggregate over 4 MPI processes running on one node in saturation mode.
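As a rough illustration of what is being measured, the kernel and the MFlop/s figure can be sketched in C. This is not the benchmark code behind the numbers here (the expression above is Fortran); the function names `vector_triad` and `triad_mflops`, the use of double precision, and the count of 2 floating-point operations per element are assumptions made for this sketch.

```c
#include <stddef.h>

/* Vector triad a(:) = b(:) + c(:) * d(:), written as a plain C loop.
   Double precision is an assumption; the text does not state the data type. */
void vector_triad(double *a, const double *b, const double *c,
                  const double *d, size_t n)
{
    size_t i;
    for (i = 0; i < n; ++i)
        a[i] = b[i] + c[i] * d[i];
}

/* MFlop/s for `repeats` sweeps over `n` elements that took `seconds`,
   counting one multiply and one add per element (assumed convention). */
double triad_mflops(size_t n, long repeats, double seconds)
{
    return 2.0 * (double)n * (double)repeats / seconds / 1.0e6;
}
```

In a real measurement the sweep would be repeated many times and timed with a wall-clock timer, and the per-process results would then be summed over the 4 MPI processes.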
SNOOP filter of the 5000X chipset enabled
Performance in MFlop/s for loop length 8388608 on 2-socket Woodcrest node with 4 MPI processes

compiler| default | USE_COMMON
--------+---------+-----------
9.1-033 | 374.0   | 352.5
9.1-039 | 374.1   | 352.3
9.1-045 | 359.0 ! | 352.4
10.0-13 | 373.4   | 377.6 !
10.0-17 | 373.7   | 352.0
SNOOP filter of the 5000X chipset disabled (switching the snoop filter off only works with the latest BIOS (v1.12) released on April 16!)
compiler| default | USE_COMMON
--------+---------+-----------
9.1-033 | 376     | 332
9.1-039 | 376     | 331
9.1-045 | 341 !!  | 331
10.0-13 | 376     | 380 !!
10.0-17 | 376     | 331
The “default” version always refers to arrays which were known at compile time; “USE_COMMON” means that the arrays have additionally been put into a common block.
For reference, the STREAM values in MB/s for 4 OpenMP threads (with added NONTEMPORAL directives, array size = 20000000, offset = 0) are also given:
           Snoopfilter on | Snoopfilter off
Function   Rate (MB/s)    | Rate (MB/s)
Copy:      7492.0444      | 6178.7991
Scale:     7485.3591      | 6174.8763
Add:       6145.5004      | 6180.6296
Triad:     6152.6559      | 6189.2369
The results were more or less identical when using fce-9.1.039 and fce-9.1.45!
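To put the MFlop/s numbers of the vector triad next to the MB/s numbers of STREAM, a simple conversion can help. The sketch below is only arithmetic under stated assumptions: 8-byte (double precision) words, 2 floating-point operations per loop iteration, and a caller-chosen number of words moved per iteration. The function name `triad_mflops_to_mbytes` is hypothetical.

```c
/* Convert a triad rate in MFlop/s to memory traffic in MB/s (10^6 bytes/s).
   words_per_iter: 8-byte words moved per loop iteration. Assumption:
   4 for three loads plus one store, 5 if the store additionally causes
   a write-allocate read. Two flops per iteration are assumed. */
double triad_mflops_to_mbytes(double mflops, int words_per_iter)
{
    double miter_per_s = mflops / 2.0;  /* 10^6 loop iterations per second */
    return miter_per_s * (double)words_per_iter * 8.0;
}
```

Under these assumptions, an entry of 374.0 MFlop/s corresponds to 5984 MB/s when counting 4 words per iteration, or 7480 MB/s when counting 5.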
Reasons for performance differences: The main reason for the performance differences is unrolling (add -unroll0 to avoid it; thanks to Intel for pointing this out). The 10.0 compilers seem to be much more aggressive when doing optimizations and vectorization. By default, non-temporal stores may now be used automatically in certain cases by the 10.0 compiler. If you want to avoid that, use -opt-streaming-stores never. Even if non-temporal stores were disabled via the command line, the compiler directive vector nontemporal will still be respected. A directive to avoid non-temporal stores for a specific loop only is not (yet) available.
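The Intel C/C++ compilers offer the same directive as `#pragma vector nontemporal`. A minimal sketch of requesting non-temporal stores for one specific loop might look as follows. The function name `triad_nontemporal` is hypothetical, and the build line in the comment is an assumption that simply combines the two flags named above with the `icc` driver.

```c
#include <stddef.h>

/* Hypothetical build line, combining the flags discussed in the text:
 *   icc -c -unroll0 -opt-streaming-stores never triad.c
 * Compilers that do not know the pragma ignore it (possibly with a
 * warning), so the loop computes the same result either way. */
void triad_nontemporal(double *a, const double *b, const double *c,
                       const double *d, size_t n)
{
    size_t i;
    /* Ask the Intel compiler to use non-temporal (streaming) stores for a[]. */
#pragma vector nontemporal
    for (i = 0; i < n; ++i)
        a[i] = b[i] + c[i] * d[i];
}
```

Since the directive is respected even with -opt-streaming-stores never, this combination would restrict non-temporal stores to the loops that are marked explicitly.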