

**BERKELEY LAB** 

# **Roofline: A Throughput Oriented Performance Model**

Lenny Oliker

Jack Deslippe, Tuomas Koskela, Samuel Williams Lawrence Berkeley National Laboratory, USA

Roman Belenov, Zakhar Matveev, Philippe Thierry Intel Corporation





### Why Use Performance Models or Tools?

- Identify performance bottlenecks
- Motivate software optimizations
- **Determine when we're done optimizing** 
  - Assess performance relative to machine capabilities ٠
  - Motivate need for algorithmic changes ٠
- Predict performance on future machines / architectures
  - Sets realistic expectations on performance for future procurements
  - Used for HW/SW Co-Design to ensure future architectures are well-suited for the computational needs of today's applications.





- Many different components can contribute to kernel run time.
- Some are characteristics of the application, some are characteristics of the machine, and some are both (memory access pattern + caches).

**#FP operations** Flop/s Cache data movement Cache GB/s DRAM data movement DRAM GB/s PCIe data movement PCIe bandwidth Depth OMP Overhead MPI Message Size Network Bandwidth MPI Send: Wait ratio Network Gap #MPI Wait's Network Latency



Can't think about all these terms all the time for every application...





Because there are so many components, performance models often conceptualize the system as being dominated by one or more of these components.

> Roofline **#FP operations** Flop/s Model Cache data movement Cache GB/s DRAM data movement DRAM GB/s PCIe data movement PCIe bandwidth Depth OMP Overhead MPI Message Size Network Bandwidth MPI Send: Wait ratio Network Gap #MPI Wait's Network Latency

Williams et al, "Roofline: An Insightful Visual Performance Model For Multicore Architectures", CACM, 2009.



Because there are so many components, performance models often conceptualize the system as being dominated by one or more of these components.

**#FP operations** Flop/s Cache data movement Cache GB/s DRAM data movement DRAM GB/s LogCA PCIe data movement PCIe bandwidth Depth OMP Overhead MPI Message Size Network Bandwidth MPI Send: Wait ratio Network Gap #MPI Wait's Network Latency

Bin Altaf et al, "LogCA: A High-Level Performance Model for Hardware Accelerators", ISCA, 2017.



Because there are so many components, performance models often conceptualize the system as being dominated by one or more of these components.

> **#FP operations** Flop/s Cache data movement Cache GB/s DRAM data movement DRAM GB/s PCIe data movement PCIe bandwidth Depth OMP Overhead MPI Message Size Network Bandwidth MPI Send: Wait ratio Network Gap #MPI Wait's Network Latency

Alexandrov, et al, "LogGP: incorporating long messages into the LogP model - one step closer towards a realistic model for parallel computation", SPAA, 1995.





Because there are so many components, performance models often conceptualize the system as being dominated by one or more of these components.

**#FP operations** Flop/s Cache data movement Cache GB/s DRAM data movement DRAM GR/P Depth OMP C Size Network B dependencial of the second sec PCIe data movement PCIe band MPI Message Size Network B and MPI Send: Wait ratio Network LogF #MPI Wait's Network Latericy

Culler, et al, "LogP: a practical model of parallel computation", CACM, 1996







**BERKELEY LAB** 

BERKELEY NATIONAL LABORATORY

# **Roofline Model:** Arithmetic Intensity and Bandwidth



### **Performance Models / Simulators**

- Historically, many performance models and simulators tracked latencies to predict performance (i.e. counting cycles)
- The last two decades saw a number of latency-hiding techniques...
  - Out-of-order execution (hardware discovers parallelism to hide latency)
  - HW stream prefetching (hardware speculatively loads data)
  - Massive thread parallelism (independent threads satisfy the latency-bandwidth product) ullet
- Effective latency hiding has resulted in a shift from a latency-limited computing regime to a **throughput-limited computing regime**





## **Roofline Model**

- **Roofline Model** is a throughput-oriented performance model...
  - Tracks rates not times
  - Augmented with Little's Law (concurrency = latency\*bandwidth)
  - Independent of ISA and architecture (applies to CPUs, GPUs, Google TPUs<sup>1</sup>, etc...)

|  |                                                                                                                                              |                                                                                                                                                                                                                                                                                                                                                                                                                                                    | a crd.lbl.gov                                                                                                                                                                                                                                                                                                                                                                                                    |
|--|----------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|  |                                                                                                                                              |                                                                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                  |
|  |                                                                                                                                              | PERFORMANCE AND ALGORITHMS RESE<br>Home » Performance and Algorithms Research »                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                  |
|  | Performance and Algorithn                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                  |
|  | PERFORMANCE<br>AND<br>ALGORITHMS<br>RESEARCH<br>Auto-tuning<br>BeBOP<br>EDGAR<br>HipGISAXS<br>HPGMG<br>ScIDAC<br>TOP500<br>Previous Projects | Roofline Bed<br>Roofline is a visually intuitive performan<br>multicore, manycore, or accelerator pro<br>assess the quality of attained performan<br>performance figure. One can examine t<br>limitations.<br>Arthenetic Internetic Definition the<br>Compared to the performant<br>ransform. If out of place on a write a<br>would have an arithmetic intensity of<br>limit FFT arithmetic intensity of<br>arithmetic intensity grow very quickly | the model used to bound the perform<br>cessor architectures. Rather than sim<br>noe by combining locality, bandwidth,<br>he resultant Roofline figure in order to<br>e model is Arithmetic Intensity. Arithm<br>vector-vector increment (x[]+=y[]) th<br>to f the vector size. Conversely, F<br>illocate cache architecture, the trai<br>(0.104*logN and would grow slow)<br>aps 2 flops per byte. Finally, BLAS |
|  | Facebook                                                                                                                                     | 0.1-1.0 flops pe                                                                                                                                                                                                                                                                                                                                                                                                                                   | thmetic I                                                                                                                                                                                                                                                                                                                                                                                                        |
|  |                                                                                                                                              | Des files Martal                                                                                                                                                                                                                                                                                                                                                                                                                                   |                                                                                                                                                                                                                                                                                                                                                                                                                  |

https://crd.lbl.gov/departments/computer-science/PAR/research/roofline

**Roofline Mode** 





- One could hope to always attain peak performance (Flop/s)
- However, finite locality (reuse) and bandwidth limit performance.
- Assume:
  - Idealized processor/caches
  - Cold start (data in DRAM)

**#FP ops / Peak GFlop/s #Bytes / Peak GB/s** Time = max  $\prec$ 





- One could hope to always attain peak performance (Flop/s)
- However, finite locality (reuse) and bandwidth limit performance.
- Assume:
  - Idealized processor/caches
  - Cold start (data in DRAM)



1 / Peak GFlop/s #Bytes / #FP ops / Peak GB/s Time  $= \max$ **#FP ops** 



- One could hope to always attain peak performance (Flop/s)
- However, finite locality (reuse) and bandwidth limit performance.
- Assume:
  - Idealized processor/caches
  - Cold start (data in DRAM)



## Peak GFlop/s (#FP ops / #Bytes) \* Peak GB/s #FP ops Time = min



- One could hope to always attain peak performance (Flop/s)
- However, finite locality (reuse) and bandwidth limit performance.
- Assume:
  - Idealized processor/caches
  - Cold start (data in DRAM)

Note, Arithmetic Intensity (AI) = Flops / Bytes (as presented to DRAM)





- Plot Roofline bound using Arithmetic Intensity as the x-axis
- Log-log scale makes it easy to doodle, extrapolate performance along Moore's Law, etc...
- Kernels with AI less than machine balance are ultimately DRAM bound (we'll refine this later...)





## **Roofline Example #1**

- Typical machine balance is 5-10 flops per byte...
  - 40-80 flops per double to exploit compute capability ٠
  - Artifact of technology and money ٠
  - Unlikely to improve •
- Consider STREAM Triad...

#pragma omp parallel for for(i=0;i<N;i++){</pre> Z[i] = X[i] + alpha\*Y[i];

- 2 flops per iteration ٠
- Transfer 24 bytes per iteration (read X[i], Y[i], write Z[i]) ٠
- AI = 0.083 flops per byte == Memory bound ٠





### **Roofline Example #2**

### Conversely, 7-point constant coefficient stencil...

- 7 flops ٠
- 8 memory references (7 reads, 1 store) per point ٠
- Cache can filter all but 1 read and 1 write per point ٠
- AI = 0.44 flops per byte == memory bound, •

```
but 5x the flop rate
```

```
#pragma omp parallel for
for(k=1;k<dim+1;k++){</pre>
for(j=1;j<dim+1;j++){</pre>
for(i=1;i<dim+1;i++){</pre>
  new[k][j][i] = -6.0*old[k ][j
                                    ][i
                      + old[k ][j
                                    ][i-1]
                      + old[k
                               _1[i
                                    ][i+1]
                      + old[k
                               ][i-1][i
                      + old[k ][j+1][i
                      + old[k-1][i
                                    lΓi
                      + old[k+1][j
                                    ][i
                                          1:
}}}
```





- Real processors have multiple levels of memory
  - Registers
  - L1, L2, L3 cache
  - MCDRAM/HBM (KNL/GPU device memory)
  - DDR (main memory)
  - NVRAM (non-volatile memory)
- Applications can have locality in each level
  - Unique data movements imply unique Al's
  - Moreover, each level will have a unique bandwidth



- Construct superposition of Rooflines...
  - Measure a bandwidth
  - Measure AI for each level of memory
  - Although an loop nest may have multiple Al's and multiple bounds (flops, L1, L2, ... DRAM)...
  - ... performance is bound by the minimum





- Construct superposition of Rooflines...
  - Measure a bandwidth
  - Measure AI for each level of memory
  - Although an loop nest may have multiple Al's and multiple bounds (flops, L1, L2, ... DRAM)...
  - ... performance is bound by the minimum





- Construct superposition of Rooflines...
  - Measure a bandwidth
  - Measure AI for each level of memory
  - Although an loop nest may have multiple Al's and multiple bounds (flops, L1, L2, ... DRAM)...
  - ... performance is bound by the minimum





- Construct superposition of Rooflines...
  - Measure a bandwidth
  - Measure AI for each level of memory
  - Although an loop nest may have multiple Al's and multiple bounds (flops, L1, L2, ... DRAM)...
  - ... performance is bound by the minimum







**BERKELEY LAB** 

BERKELEY NATIONAL LABORATORY

# **Roofline Model: Modeling In-core Performance Effects**





### Data, Instruction, Thread-Level Parallelism...

Modern CPUs use several techniques to increase per core Flop/s 

### **Fused Multiply Add**

- $w = x^*y + z$  is a common idiom in kinez algebra
- Ra  $\bullet$ SOL se a ada (FMA)
- ∧ J chains the mutiply and add in a single pipeline so that it can complete FMA/cycle

### **Vector Instructions**

- Many HPC codes apply the same operation to a vector of elements
- Vendors provide vector instructions that apply the same operation to 2, 4, 8, 16 elements...

x [0:7] \*y [0:7] + z [0:7]

Vector FPUs complete 8 vector operations/cycle

### **Deep pipelines**

- is substantial.
- increase GHz
- FP\_bandwidth



The hardware for a FMA

Breaking a single FMA up into several smaller operations and pipelining them allows vendors to

Little's Law applies... need FP\_Latency \* independent instructions



### Data, Instruction, Thread-Level Parallelism...

- If every instruction were an ADD (instead) of FMA), performance would drop by 2x on KNL or 4x on Haswell
- Similarly, if one failed to vectorize, performance would drop by another 8x on KNL and 4x on Haswell
- Lack of threading (or load imbalance) will reduce performance by another 64x on KNL.







### **Superscalar vs. Instruction mix**

- Define in-core ceilings based on instruction mix...
- e.g. Haswell
  - 4-issue superscalar ullet
  - Only 2 FP data paths
  - Requires 50% of the instructions to be FP to get peak performance







### **Superscalar vs. Instruction mix**

- Define in-core ceilings based on instruction mix...
- e.g. Haswell
  - 4-issue superscalar
  - Only 2 FP data paths
  - Requires 50% of the instructions to be FP to get peak performance
- e.g. KNL
  - 2-issue superscalar
  - 2 FP data paths
  - Requires 100% of the instructions to be FP to get peak performance







### **Superscalar vs. instruction mix**

- Define in-core ceilings based on instruction mix...
- e.g. Haswell
  - 4-issue superscalar
  - Only 2 FP data paths
  - Requires 50% of the instructions to be FP to get peak performance
- e.g. KNL
  - 2-issue superscalar
  - 2 FP data paths
  - Requires 100% of the instructions to be FP to get peak performance







### Superscalar vs. instruction mix

- Define in-core ceilings based on instruction mix...
- e.g. Haswell
  - 4-issue superscalar
  - Only 2 FP data paths
  - Requires 50% of the instructions to be FP to get peak performance
- e.g. KNL
  - 2-issue superscalar
  - 2 FP data paths
  - Requires 100% of the instructions to be FP to get peak performance







### **Divides and other Slow FP instructions**

- FP Divides (sqrt, rsqrt, ...) might support only limited pipelining
- As such, their throughput is substantially lower than FMA's
- If divides constitute even if 3% of the flop's come from divides, performance can be cut in half.
- Penalty varies substantially between architectures and generations (e.g. IVB, HSW, KNL, ...)









**BERKELEY LAB** 

CE BERKELEY NATIONAL LABORATORY

# **Roofline Model:** Modeling Cache Effects





 Naively, we can bound AI using only compulsory cache misses







- Naively, we can bound AI using only compulsory cache misses
- However, write allocate caches can lower Al



 $AI = \frac{\#Flop's}{Compulsory Misses + Write Allocates}$ 



- Naively, we can bound AI using only compulsory cache misses
- However, write allocate caches can lower Al
- Cache capacity misses can have a huge penalty



 $AI = \frac{\#Flop's}{Compulsory Misses + Write Allocates + Capacity Misses}$ 



- Naively, we can bound AI using only compulsory cache misses
- However, write allocate caches can lower Al
- Cache capacity misses can have a huge penalty
- Compute bound became memory bound









**BERKELEY LAB** 

CE BERKELEY NATIONAL LABORATORY

# **Roofline Model:** General Strategy Guide





Broadly speaking, there are three approaches to improving performance:





- Broadly speaking, there are three approaches to improving performance:
- Maximize in-core performance (e.g. get compiler to vectorize)





- Broadly speaking, there are three approaches to improving performance:
- Maximize in-core performance (e.g. get compiler to vectorize)
- Maximize memory bandwidth (e.g. NUMA-aware allocation)





- Broadly speaking, there are three approaches to improving performance:
- Maximize in-core performance (e.g. get compiler to vectorize)
- Maximize memory bandwidth (e.g. NUMA-aware allocation)
- Minimize data movement (increase AI)







**BERKELEY LAB** 

NATIONAL LABORATORY

# Constructing a Roofline Model requires answering some questions...



### **Questions can overwhelm users...**

What is my Properties of the target machine

### (Benchmarking)

FMA on my machine?

> What is my machine's **DDR GB/s?** L2 GB/s?

How much data did my kernel actually move? **Properties of an** application's execution How many flop (Instrumentation) do? How much did that divide hurt?



### **Fundamental** properties of the kernel constrained **by** hardware







LAWRENCE BERKELEY NATIONAL LABORATORY

# We need tools...





## **Node Characterization?**

- "Marketing Numbers" can be deceptive...
  - Pin BW vs. real bandwidth
  - TurboMode / Underclock for AVX
  - compiler failings on high-AI loops.
- LBL developed the Empirical Roofline Toolkit (ERT)...
  - Characterize CPU/GPU systems
  - Peak Flop rates
  - Bandwidths for each level of memory
  - MPI+OpenMP/CUDA == multiple GPUs







### **Instrumentation with Performance Counters?**

- Characterizing applications with performance counters can be problematic...
  - **x** Flop Counters can be broken/missing in production processors
  - **x** Vectorization/Masking can complicate counting Flop's
  - **x** Counting Loads and Stores doesn't capture cache reuse while counting cache misses doesn't account for prefetchers.
  - **X** DRAM counters (Uncore PMU) might be accurate, but...
    - x are privileged and thus nominally inaccessible in user mode
    - may need vendor (e.g. Cray) and center (e.g. NERSC) approved X OS/kernel changes





### **Forced to Cobble Together Tools...**

- Use tools known/observed to work on NERSC's Cori (KNL, HSW)...
  - Used Intel SDE (Pin binary instrumentation + ٠ emulation) to create software Flop counters
  - Used Intel VTune performance tool (NERSC/Cray ٠ approved) to access uncore counters
- Accurate measurement of Flop's (HSW) and DRAM data movement (HSW and KNL)
- Used by NESAP (NERSC KNL application) readiness project) to characterize apps on Cori...

| HOME ABOUT SCIENCE AT NE                      | Home + For Users + Applica                            |
|-----------------------------------------------|-------------------------------------------------------|
| FOR USERS                                     | Home # For Osers # Applica                            |
| Live Status                                   | MEASURING                                             |
| User Announcements                            | In Ericounity                                         |
| My NERSC                                      | A differentia laterality in a sec                     |
| Getting Started                               | Arithmetic intensity is a m<br>amount of memory acces |
| Connecting to NERSC                           | ratio (F/B). This applicatio                          |
| Accounts & Allocations                        | Emulator Toolkit (SDE) a                              |
| Computational Systems                         | on using VTune can be fo                              |
| Storage & File Systems Application Deformance | Performance Model.                                    |
| Application Performance<br>NESAP              |                                                       |
| Application Porting and                       | Historically, processor ma                            |
| Performance                                   | calculation. Some modern<br>provide counters for FLO  |
| IXPUG                                         | memory accesses, and V                                |
| Performance and Debugging                     | mannuny autoaaaaa, ahd v                              |
| Tools                                         | The SDE dynamic instruc                               |
| Measuring Arithmetic                          | instruction length, instruc                           |
| Data & Analytics                              | with SDE. In general the<br>Edison and Cori Phase 1.  |
| Job Logs & Statistics                         | Edison and Con Phase 1.                               |
| Training & Tutorials                          | This application note prov                            |
| Software                                      | critical for real application                         |
| Policies                                      | more than a few minutes.                              |
| User Surveys                                  |                                                       |
| NERSC Users Group                             | An example command lin                                |
| ≥ Help                                        |                                                       |
| Staff Blogs                                   | \$ srun -n 4 -c 6 sde -ivt                            |
| Request Repository Mailing<br>List            |                                                       |
|                                               |                                                       |
|                                               | Where:                                                |
|                                               | <ul> <li>-ivb is used to tar</li> </ul>               |
| Out-of-hours Status                           | <ul> <li>-d specifies to on</li> </ul>                |
| and Password help                             | · -iform 1 turns on                                   |
| Call operations:                              | <ul> <li>-omix specifies th</li> </ul>                |
|                                               |                                                       |
|                                               | <ul> <li>-i specifies that ex</li> </ul>              |
|                                               | <ul> <li>-global_region will</li> </ul>               |
|                                               |                                                       |
|                                               |                                                       |
| Account Support<br>https://nim.nersc.gov      |                                                       |
|                                               | An example command lin                                |

### http://www.nersc.gov/users/application-performance/measuring-arithmetic-intensity/

NERSC is LBL's production computing division CRD is LBL's Computational Research Division NESAP is NERSC's KNL application readiness project LBL is part of SUPER (DOE SciDAC3 Computer Science Institute)









### **Initial Roofline Analysis of NESAP Codes**







## **Evaluation of LIKWID**

- LIKWID provides easy to use wrappers for measuring performance counters...
  - Works on NERSC production systems  $\checkmark$
  - Minimal overhead (<1%)  $\checkmark$
  - Scalable in distributed memory (MPI-friendly)  $\checkmark$
  - Fast, high-level characterization  $\checkmark$
  - No detailed timing breakdown or optimization advice X
  - Limited by quality of hardware performance counter X implementation (garbage in/garbage out)
- Useful tool that complements other tools





## Intel Advisor

### Includes Roofline Automation...

- Automatically instruments applications (one dot per loop nest/function)
- Computes FLOPS and AI for each function
- ✓ AVX-512 support that incorporates masks
- Integrated Cache Simulator<sup>1</sup> (hierarchical roofline / multiple Al's)
- Automatically benchmarks target system (calculates ceilings)
- Full integration with existing Advisor capabilities



### http://www.nersc.gov/users/training/events/roofline-training-1182017-1192017

<sup>1</sup>Technology Preview, not in official product roadmap so far.



## **Tracking Optimization Behavior**



BerkeleyGW is a material science application that is dominated by dense linear algebra, including distributed matrix multiplication, inversion, diagonalization, and contraction and fast fourier transforms (FFT).



BerkeleyGW: Optimization process for Kernel-C (Sigma

1. Refactor (3 Loops for MPI, OpenMP, Vectors) 2. Add OpenMP 3. Initial Vectorization (loop reordering, conditional removal) 4. Cache-Blocking 5. Improved Vectorization (Divides) 6. Hyper-threading





# Acknowledgements

- This material is based upon work supported by the Advanced Scientific Computing Research Program in the U.S. Department of Energy, Office of Science, under Award Number DE-AC02-05CH11231.
- This material is based upon work supported by the DOE RAPIDS SciDAC Institute.
- This research used resources of the National Energy Research Scientific Computing Center (NERSC), which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-05CH11231.
- Special Thanks to:

**BERKELEY LAB** 

ENCE BERKELEY NATIONAL LABORATORY

- Zakhar Matveev, Intel Corporation •
- Roman Belenov, Intel Corporation



