

# Performance patterns and hardware metrics on modern multicore processors: Best practices for performance engineering

Jan Treibig, Georg Hager, Gerhard Wellein

Erlangen Regional Computing Center (RRZE), University of Erlangen-Nuremberg, Erlangen, Germany

PROPER Workshop at Euro-Par 2012 August 28, 2012

Rhodes Island, Greece



### **Hardware performance metrics**



- ... are ubiquitous as a starting point for performance analysis (including automatic analysis)
- ... are supported by many tools
- ... are often reduced to cache misses (what could be worse than cache misses?)

### Reality:

- Modern parallel computing is plagued by bottlenecks
- There are typical performance patterns that cover a large part of possible performance behaviors. A "performance pattern" combines:
  - HPM signatures
  - Scaling behavior
  - Other sources of information



- LIKWID: Lightweight command line tools for Linux
- Help to face the challenges without getting in the way
- Focus on x86 architecture
- Philosophy:
  - Simple
  - Efficient
  - Portable
  - Extensible





Open source project (GPL v2):

http://code.google.com/p/likwid/

### **Overview of LIKWID tools**



### Topology and Affinity:

- likwid-topology
- likwid-pin
- likwid-mpirun

### Performance Profiling/Benchmarking:

- likwid-perfctr
- likwid-bench
- likwid-powermeter

### Probing performance behavior with likwid-perfctr



- How do we find out about the performance properties and requirements of a parallel code?
  - Profiling via advanced tools is often overkill
- A coarse overview is often sufficient
  - likwid-perfctr (similar to "perfex" on IRIX, "hpmcount" on AIX, "lipfpm" on Linux/Altix)
  - Simple end-to-end measurement of hardware performance metrics
  - Operating modes:
    - Wrapper
    - Stethoscope
    - Timeline
    - Marker API
  - Preconfigured and extensible metric groups, list with

```
likwid-perfctr -a
```

```
BRANCH: Branch prediction miss rate/ratio
CACHE: Data cache miss rate/ratio
CLOCK: Clock of cores
DATA: Load to store ratio
FLOPS_DP: Double Precision MFlops/s
FLOPS_SP: Single Precision MFlops/s
FLOPS_X87: X87 MFlops/s
L2: L2 cache bandwidth in MBytes/s
L2CACHE: L2 cache miss rate/ratio
L3: L3 cache bandwidth in MBytes/s
L3CACHE: L3 cache miss rate/ratio
MEM: Main memory bandwidth in MBytes/s
TLB: TLB miss rate/ratio
```

### likwid-perfctr

### Example usage with preconfigured metric group



```
$ env OMP_NUM_THREADS=4 likwid-perfctr -C N:0-3 -t intel -g FLOPS_DP ./stream.exe
CPU type:   Intel Core Lynnfield processor
CPU clock:  2.93 GHz
Measuring group FLOPS_DP
YOUR PROGRAM OUTPUT
| Event                                | core 0      | core 1      | core 2      | core 3      |
| INSTR_RETIRED_ANY                    | 1.97463e+08 | 2.31001e+08 | 2.30963e+08 | 2.31885e+08 |
| CPU_CLK_UNHALTED_CORE                | …56999e+08  | 9.58401e+08 | 9.58637e+08 |             |
| FP_COMP_OPS_EXE_SSE_FP_PACKED        | 4.00294e+07 | 3.08927e+07 | 3.08866e+07 |             |
| FP_COMP_OPS_EXE_SSE_FP_SCALAR        | 882         |             |             |             |
| FP_COMP_OPS_EXE_SSE_SINGLE_PRECISION |             |             |             |             |
| FP_COMP_OPS_EXE_SSE_DOUBLE_PRECISION | 4.00303e+07 | 3.08927e+07 | 3.08866e+07 |             |
| Metric                   | core 0     | core 1 | core 2  | core 3   |
| Runtime [s]              | 0.326242   |        |         | 0.326358 |
| CPI                      | 4.84647    |        | 4.15061 | 4.12849  |
| DP MFlops/s (DP assumed) | 245.399    |        | 189.024 | 189.304  |
| Packed MUOPS/s           | 122.698    |        |         | 94.6519  |
| Scalar MUOPS/s           | 0.00270351 |        |         |          |
| SP MUOPS/s               |            |        |         |          |
| DP MUOPS/s               |            |        |         |          |
```

In this output, the first two events (INSTR_RETIRED_ANY, CPU_CLK_UNHALTED_CORE) are always measured, the four FP_COMP_OPS_EXE events are the configured metrics of this group, and the second table lists the derived metrics.



- To measure only parts of an application, a marker API is available.
- The API only turns counters on and off. The counters are still configured by the likwid-perfctr application.
- Multiple named regions can be measured.
- Results of multiple calls are accumulated.
- Inclusive and overlapping regions are allowed.

```
#include <likwid.h>

likwid_markerInit(); // must be called from serial region
likwid_markerStartRegion("Compute");
. . .
likwid_markerStopRegion("Compute");
likwid_markerStartRegion("postprocess");
. . .
likwid_markerStopRegion("postprocess");
likwid_markerClose(); // must be called from serial region
```
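The accumulation semantics of named regions can be illustrated with a small stand-alone Python sketch. It is not LIKWID code and reads no hardware counters: the hypothetical `RegionTimer` class only accumulates wall-clock time per region name, to show how repeated and nested regions add up.

```python
import time
from collections import defaultdict


class RegionTimer:
    """Hypothetical stand-in for a marker API: accumulates time per named region."""

    def __init__(self):
        self.total = defaultdict(float)  # accumulated seconds per region
        self.calls = defaultdict(int)    # completed start/stop pairs per region
        self._start = {}                 # start time of currently open regions

    def start(self, name):
        self._start[name] = time.perf_counter()

    def stop(self, name):
        elapsed = time.perf_counter() - self._start.pop(name)
        self.total[name] += elapsed      # results of multiple calls accumulate
        self.calls[name] += 1


timer = RegionTimer()
for _ in range(3):
    timer.start("Compute")
    timer.start("postprocess")           # nested inside "Compute"
    sum(i * i for i in range(10_000))
    timer.stop("postprocess")
    timer.stop("Compute")

for name in ("Compute", "postprocess"):
    print(f"{name}: {timer.calls[name]} calls, {timer.total[name]:.6f} s")
```

Each region reports three calls, and the enclosing region never reports less time than the region nested inside it.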

### likwid-perfctr

### Group files



```
SHORT PSTI

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0  FP_COMP_OPS_EXE_SSE_FP_PACKED
PMC1  FP_COMP_OPS_EXE_SSE_FP_SCALAR
PMC2  FP_COMP_OPS_EXE_SSE_SINGLE_PRECISION
PMC3  FP_COMP_OPS_EXE_SSE_DOUBLE_PRECISION
UPMC0 UNC_QMC_NORMAL_READS_ANY
UPMC1 UNC_QMC_WRITES_FULL_ANY
UPMC2 UNC_QHL_REQUESTS_REMOTE_READS
UPMC3 UNC_QHL_REQUESTS_LOCAL_READS

METRICS
Runtime [s] FIXC1*inverseClock
CPI FIXC1/FIXC0
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
DP MFlops/s (DP assumed) 1.0E-06*(PMC0*2.0+PMC1)/time
Packed MUOPS/s 1.0E-06*PMC0/time
Scalar MUOPS/s 1.0E-06*PMC1/time
SP MUOPS/s 1.0E-06*PMC2/time
DP MUOPS/s 1.0E-06*PMC3/time
Memory bandwidth [MBytes/s] 1.0E-06*(UPMC0+UPMC1)*64/time;
Remote Read BW [MBytes/s] 1.0E-06*(UPMC2)*64/time;

LONG
Formula:
DP MFlops/s = (FP_COMP_OPS_EXE_SSE_FP_PACKED*2 + FP_COMP_OPS_EXE_SSE_FP_SCALAR) / runtime.
```

- Groups are architecture-specific
- They are defined in simple text files
- Code is generated on recompile
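The METRICS formulas of a group file are plain arithmetic on the raw counts. As a rough illustration, the following Python sketch re-evaluates three of them by hand, using per-core values from the FLOPS_DP example output earlier in this talk. The function names are hypothetical, and using the reported Runtime [s] as the `time` variable is an assumption.

```python
def cpi(fixc1, fixc0):
    """CPI = FIXC1/FIXC0 (core cycles per retired instruction)."""
    return fixc1 / fixc0


def dp_mflops(pmc0, pmc1, time_s):
    """DP MFlops/s (DP assumed) = 1.0E-06*(PMC0*2.0+PMC1)/time."""
    return 1.0e-06 * (pmc0 * 2.0 + pmc1) / time_s


def packed_muops(pmc0, time_s):
    """Packed MUOPS/s = 1.0E-06*PMC0/time."""
    return 1.0e-06 * pmc0 / time_s


# Core 2: CPU_CLK_UNHALTED_CORE and INSTR_RETIRED_ANY from the example output.
cpi_core2 = cpi(fixc1=9.58637e08, fixc0=2.30963e08)

# Core 0: packed and scalar SSE counts; Runtime [s] stands in for `time` (assumption).
mflops_core0 = dp_mflops(pmc0=4.00294e07, pmc1=882.0, time_s=0.326242)
packed_core0 = packed_muops(pmc0=4.00294e07, time_s=0.326242)

print(f"CPI (core 2):            {cpi_core2:.5f}")
print(f"DP MFlops/s (core 0):    {mflops_core0:.3f}")
print(f"Packed MUOPS/s (core 0): {packed_core0:.3f}")
```

The results reproduce the reported values (4.15061, 245.399 and 122.698) to within 0.01.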

### **Performance patterns (1)**



| Pattern | Performance behavior | Metric signature |
|---|---|---|
| Load imbalance | Saturating/sub-linear speedup | Different amount of "work" on the cores (FLOPS_DP, FLOPS_SP, FLOPS_AVX); note that instruction count is not reliable! |
| BW saturation in outer-level cache | Saturating speedup across cores of OL cache group | OLC bandwidth meets BW of suitable streaming benchmark (L3) |
| Memory BW saturation | Saturating speedup across cores on a memory interface | Memory BW meets BW of suitable streaming benchmark (MEM) |
| Strided or erratic data access | Simple BW performance model much too optimistic | Low BW utilization / Low cache hit ratio, frequent CL evicts or replacements (CACHE, DATA, MEM) |
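The "Load imbalance" signature (different amounts of work on the cores) can be condensed into a single number. The sketch below is one possible way to do that, not a LIKWID feature: the hypothetical `imbalance` function returns the maximum per-core work divided by the mean, and the counts are made-up illustration values.

```python
def imbalance(work_per_core):
    """Return max/mean of per-core work; 1.0 means perfectly balanced."""
    mean = sum(work_per_core) / len(work_per_core)
    return max(work_per_core) / mean


# Hypothetical per-core packed FP operation counts (not measured data).
balanced = [3.1e07, 3.1e07, 3.1e07, 3.1e07]
skewed = [4.0e07, 3.1e07, 3.1e07, 3.1e07]

print(f"balanced: {imbalance(balanced):.3f}")
print(f"skewed:   {imbalance(skewed):.3f}")
```

Feeding it FP operation counts rather than retired instructions follows the note in the table that instruction count is not reliable.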

### **Performance patterns (2)**



| Pattern | Performance behavior | Metric signature |
|---|---|---|
| Bad instruction mix | Performance insensitive to problem size vs. cache levels | Large ratio of instructions retired to FP instructions if the useful work is FP / Many cycles per instruction (CPI) if the problem is large-latency arithmetic / Scalar instructions dominating in data-parallel loops (FLOPS_DP, FLOPS_SP, CPI) |
| Limited instruction throughput | Large discrepancy from simple performance model based on LD/ST and arithmetic throughput | Low CPI near theoretical limit if instruction throughput is the problem / Static code analysis predicting large pressure on single execution port / High CPI due to bad pipelining (FLOPS_DP, FLOPS_SP, DATA) |
| Micro-architectural anomalies | Large discrepancy from performance model | Relevant events are very hardware-specific, e.g., stalls due to 4k memory aliasing, conflict misses, unaligned vs. aligned LD/ST, requeue events. Code review required, with architectural features in mind. |
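Two of the "Bad instruction mix" signatures are simple ratios of raw counts. The sketch below computes them for the core 0 counts of the six-core likwid-perfctr run shown later in this talk. The function name is hypothetical, and treating the FP_COMP_OPS_EXE packed and scalar counts as "FP instructions" is an approximation.

```python
def instruction_mix(instr_retired, fp_packed, fp_scalar):
    """Return (instructions per FP operation, scalar share of FP operations)."""
    fp_total = fp_packed + fp_scalar
    return instr_retired / fp_total, fp_scalar / fp_total


# Core 0: INSTR_RETIRED_ANY, FP_COMP_OPS_EXE_SSE_FP_PACKED, ..._FP_SCALAR.
ratio, scalar_share = instruction_mix(1.83124e10, 3.45348e09, 2.93108e07)

print(f"instructions per FP operation: {ratio:.2f}")
print(f"scalar share of FP operations: {scalar_share:.2%}")
```

For this loop, packed operations dominate, so the scalar share is below one percent.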

### **Performance patterns (3)**



| Pattern | Performance behavior | Metric signature |
|---|---|---|
| Synchronization overhead | Speedup going down as more cores are added / No speedup with small problem sizes / Cores busy but low FP performance | Large non-FP instruction count (growing with number of cores used) / Low CPI (FLOPS_DP, FLOPS_SP, CPI) |
| False sharing of cache lines | Small speedup or slowdown when adding cores | Frequent (remote) CL evicts (CACHE) |
| Bad ccNUMA page placement | Bad or no scaling across NUMA domains | Unbalanced bandwidth on memory interfaces / High remote traffic (MEM) |
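The "High remote traffic" signature can be estimated from the two QHL read events that appear in the group file example (UPMC2 remote reads, UPMC3 local reads). The following sketch is an assumption about how one might combine them, with hypothetical function names and made-up counts. Only the bandwidth formula is taken from the group file.

```python
def remote_read_share(remote_reads, local_reads):
    """Fraction of read requests served from a remote NUMA domain."""
    total = remote_reads + local_reads
    return remote_reads / total if total else 0.0


def remote_read_bw_mbytes(remote_reads, time_s, cache_line_bytes=64):
    """Remote Read BW [MBytes/s] = 1.0E-06*(UPMC2)*64/time."""
    return 1.0e-06 * remote_reads * cache_line_bytes / time_s


# Hypothetical counts for one socket (not measured data).
remote, local, time_s = 3.0e08, 1.0e08, 2.0

share = remote_read_share(remote, local)
bandwidth = remote_read_bw_mbytes(remote, time_s)

print(f"remote read share: {share:.0%}")
print(f"remote read BW:    {bandwidth:.0f} MBytes/s")
```

A share well above what the intended page placement predicts points towards the "Bad ccNUMA page placement" pattern.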

### The problem of instructions retired (1)



- Instructions retired / CPI may not be a good indication of useful workload, at least for numerical / FP intensive codes.
- Floating Point Operations Executed is often a better indicator
- Waiting / "Spinning" in barrier generates a high instruction count



### The problem of instructions retired (2)

    !$OMP PARALLEL DO
    DO I = 1, N
      DO J = 1, N
        x(I) = x(I) + A(J,I) * y(J)
      ENDDO
    ENDDO
    !$OMP END PARALLEL DO

| Event | core 0 | core 1 | core 2 | core 3 | core 4 | core 5 |
|---|---|---|---|---|---|---|
| INSTR_RETIRED_ANY | 1.83124e+10 | …74784e+10 | …8453e+10 | 1.66794e+10 | …85e+10 | …91736e+10 |
| CPU_CLK_UNHALTED_CORE | 2.24797e+10 | 2.23789e+10 | 2.23802e+10 | 2.23808e+10 | 2.23799e+10 | 2.23805e+10 |
| CPU_CLK_UNHALTED_REF | 2.04416e+10 | 2.03445e+10 | 2.03456e+10 | 2.03462e+10 | 2.03453e+10 | 2.03459e+10 |
| FP_COMP_OPS_EXE_SSE_FP_PACKED | 3.45348e+09 | 3.43035e+09 | 3.37573e+09 | 3.39272e+09 | 3.26132e+09 | 3.2377e+09 |
| FP_COMP_OPS_EXE_SSE_FP_SCALAR | 2.93108e+07 | 3.06063e+07 | 2.9704e+07 | 2.96507e+07 | 2.41141e+07 | 2.37397e+07 |
| FP_COMP_OPS_EXE_SSE_SINGLE_PRECISION | 19 | 0 | 0 | 0 | 0 | 0 |
| FP_COMP_OPS_EXE_SSE_DOUBLE_PRECISION | 3.48279e+09 | | | 3.42237e+09 | …43e+09 | …26144e+09 |

| Metric | core 0 | core 1 | core 2 | core 3 | core 4 | core 5 |
|---|---|---|---|---|---|---|
| Runtime [s] | 8.4293 | 8.39157 | 8.39206 | …3923 | 8.3919 | 8.39218 |
| Clock [MHz] | 2932.7 | 2933.5 | 2933.51 | …33.51 | 2933.5 | 2933.51 |
| CPI | 1.2275 | 1.28037 | 1.32857 | …34182 | 1.26666 | 1.16726 |
| DP MFlops/s | 850.72 | 845.212 | 831.703 | …5.865 | 802.952 | 797.113 |
| Packed MUOPS/s | 423.56 | 420.729 | 414.03 | …6.114 | 399.99 | 397.101 |
| Scalar MUOPS/s | 3.5949 | 3.75383 | 3.64317 | …63663 | 2.9575 | 2.91165 |
| SP MUOPS/s | 2.33033e… | 0 | 0 | 0 | | 0 |
| DP MUOPS/s | 427.16 | 424.483 | 417.673 | 419.751 | 402.95 | 400.013 |

Higher CPI but better performance: for example, core 0 (CPI 1.2275, 850.72 DP MFlops/s) versus core 5 (CPI 1.16726, 797.113 DP MFlops/s).
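The point that a low CPI does not imply high useful performance can be checked directly against the per-core output above. The sketch below uses the CPI and DP MFlops/s values of cores 0, 1, 2, 4 and 5 from that output. The variable names are illustrative.

```python
# Per-core values read from the likwid-perfctr output above (cores 0, 1, 2, 4, 5).
cpi = {0: 1.2275, 1: 1.28037, 2: 1.32857, 4: 1.26666, 5: 1.16726}
dp_mflops = {0: 850.72, 1: 845.212, 2: 831.703, 4: 802.952, 5: 797.113}

lowest_cpi_core = min(cpi, key=cpi.get)
slowest_core = min(dp_mflops, key=dp_mflops.get)

print(f"lowest CPI:         core {lowest_cpi_core} ({cpi[lowest_cpi_core]})")
print(f"lowest DP MFlops/s: core {slowest_core} ({dp_mflops[slowest_core]})")

# Cores with a higher CPI than the lowest-CPI core that still deliver more MFlops/s.
counterexamples = [core for core in cpi
                   if cpi[core] > cpi[lowest_cpi_core]
                   and dp_mflops[core] > dp_mflops[lowest_cpi_core]]
print("higher CPI but better performance:", counterexamples)
```

The core with the "best" CPI is also the one with the lowest floating-point rate, which is consistent with extra non-FP instructions (for example, spinning in a barrier) lowering CPI without adding useful work.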

### **Example 1:**

### Abstraction penalties in C++ code



C++ codes that suffer from abstraction overhead (inlining problems, complex abstractions) execute many more instructions overall relative to the arithmetic instructions.

- Often (but not always) "good" (i.e., low) CPI → "Bad instruction mix" pattern
- Lower bandwidth
- Instruction throughput limited
- High-level optimizations complex or impossible → "Strided access" pattern

Example: Matrix-matrix multiply with expression template frameworks on a 2.93 GHz Westmere core

|             | Total retired instructions [10^11] | CPI  | Memory bandwidth [MB/s] | MFlops/s |
|-------------|------------------------------------|------|-------------------------|----------|
| Classic     | 12.5                               | 0.44 | 5300                    | 1250     |
| Boost uBLAS | 10.1                               | 4.6  | 630                     | 156      |
| Eigen3      | 2.1                                | 0.41 | 371                     | 8555     |
| Blaze/DGEMM | 2.0                                | 0.32 | 531                     | 11260    |
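Instruction count and CPI only become meaningful together. As a back-of-the-envelope sketch, the runtime of each variant can be estimated as instructions × CPI / clock frequency. This assumes a single core running at the nominal 2.93 GHz for the whole run, and the helper name is hypothetical.

```python
CLOCK_HZ = 2.93e9  # nominal Westmere core clock from the example; assumed constant

# (retired instructions, CPI) per variant, taken from the table above.
variants = {
    "Classic": (12.5e11, 0.44),
    "Boost uBLAS": (10.1e11, 4.6),
    "Eigen3": (2.1e11, 0.41),
    "Blaze/DGEMM": (2.0e11, 0.32),
}


def estimated_runtime(instructions, cpi, clock_hz=CLOCK_HZ):
    """Runtime estimate: cycles = instructions * CPI, time = cycles / clock."""
    return instructions * cpi / clock_hz


runtimes = {name: estimated_runtime(*counts) for name, counts in variants.items()}
for name, seconds in sorted(runtimes.items(), key=lambda item: item[1]):
    print(f"{name:12s} {seconds:8.1f} s")
```

In this estimate, Boost uBLAS retires fewer instructions than the classic version yet comes out roughly eight times slower, so neither instruction count nor CPI alone ranks the variants.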

### **Example 2:**

### Image reconstruction by backprojection





- Simple roofline analysis
  - → Memory-bound algorithm → "Memory BW saturation" pattern
- Closer look via likwid-perfctr MEM group and IACA tool
  - → "Limited instruction throughput" pattern
- Work reduction optimization
  - → "Load imbalance" pattern identified by likwid-perfctr FLOPS\_SP group → corrected by round-robin schedule
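Why a round-robin schedule can correct load imbalance is easy to see on a toy workload. The sketch below is not the backprojection code: it assumes a hypothetical cost that grows linearly with the item index and compares contiguous block assignment with round-robin assignment of items to threads.

```python
def block_schedule(n_items, n_threads):
    """Contiguous chunks: thread t gets items [t*chunk, (t+1)*chunk)."""
    chunk = -(-n_items // n_threads)  # ceiling division
    return [list(range(t * chunk, min((t + 1) * chunk, n_items)))
            for t in range(n_threads)]


def round_robin_schedule(n_items, n_threads):
    """Cyclic assignment: thread t gets items t, t + n_threads, ..."""
    return [list(range(t, n_items, n_threads)) for t in range(n_threads)]


def loads(schedule, cost):
    """Total cost assigned to each thread."""
    return [sum(cost(i) for i in items) for items in schedule]


def cost(i):
    """Hypothetical work per item: grows linearly with the index."""
    return i + 1


block_loads = loads(block_schedule(12, 4), cost)
rr_loads = loads(round_robin_schedule(12, 4), cost)

print("block:      ", block_loads, "max/mean =", max(block_loads) / (sum(block_loads) / 4))
print("round-robin:", rr_loads, "max/mean =", max(rr_loads) / (sum(rr_loads) / 4))
```

With this cost model the slowest thread carries 33 work units under block assignment but only 24 under round-robin, out of 78 in total.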

### **Conclusions**



- Automatic analysis is useful for the beginner, but will never match an experienced analyst

- Performance patterns are more than simple numbers
  - Scaling behavior
  - Bottleneck saturation
  - HPM signatures
- The set presented here is just a suggestion; it will have to be tested against more codes
- Power/energy patterns are still missing, but will have to be included



# Thank you.



### References



- J. Treibig, G. Hager, and G. Wellein: LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments. Proceedings of PSTI2010, the First International Workshop on Parallel Software Tools and Tool Infrastructures, San Diego CA, September 13, 2010. DOI: 10.1109/ICPPW.2010.38
- K. Iglberger, G. Hager, J. Treibig, and U. Rüde: Expression Templates Revisited: A Performance Analysis of Current ET Methodologies. SIAM Journal on Scientific Computing 34(2), C42-C69 (2012). DOI: 10.1137/110830125
- J. Treibig, G. Hager, H. G. Hofmann, J. Hornegger, and G. Wellein: Pushing the limits for medical image reconstruction on recent standard multicore processors. International Journal of High Performance Computing Applications, (online first) DOI: 10.1177/1094342012442424