

# Node-Level Performance Engineering https://tiny.cc/NLPE-SC23

Georg Hager, Thomas Gruber, Gerhard Wellein Erlangen National High Performance Computing Center (NHR@FAU)

SC23 Full-Day Tutorial Monday, November 13, 2023





Friedrich-Alexander-Universität Erlangen-Nürnberg

# **Node-Level Performance Engineering**

### https://tiny.cc/NLPE-SC23

Georg Hager, Thomas Gruber, Gerhard Wellein Erlangen National High Performance Computing Center (NHR@FAU)

SC23 Full-Day Tutorial Monday, November 13, 2023

# Agenda

#### Part I

- Introduction to compute node architecture
- Performance tools 1: topology and affinity
- Microbenchmarking as a tool
- Demo
- Introduction to the Roofline model
- Performance tools 2: hardware performance counters
- Demo

#### Part II

- Case study: tall & skinny matrix-matrix multiplication
- Case study: Stencil codes
- Demo
- Case study: sparse matrix-vector multiplication
- Programming for Single Instruction Multiple Data (SIMD) parallelism
- Programming for ccNUMA





Friedrich-Alexander-Universität Erlangen-Nürnberg

# Prelude: Scalability 4 teh win!













# Questions to ask in high performance computing

- Do I understand the performance behavior of my code?
  - Does the performance behave in accordance with a model I have made?
- What is the optimal performance for my code on a given machine?
  High Performance Computing == Computing at the bottleneck
- Can I change my code so that the "optimal performance" gets higher?
  Circumventing/ameliorating the impact of the bottleneck
- My model yields wrong predictions what's wrong?
  - This is the good case, because you learn something
  - Performance monitoring / microbenchmarking may help clear up the situation





Friedrich-Alexander-Universität Erlangen-Nürnberg

# Modern computer architecture

### An introduction for software developers



# Multi-core today: Intel Xeon Sapphire Rapids (2023)

- Xeon "Sapphire Rapids" (Platinum/Gold/Silver/Bronze): Up to 60 cores running at 1.7+ GHz (+ "Turbo Mode" 4.8 GHz),
- Simultaneous Multithreading
   → reports as 120-way chip
- "Intel 7" process / up to 350 W
- Multi-die package (4 chips)
- Clock frequency: flexible <sup>(i)</sup>



https://www.techpowerup.com/292204/intel-sapphire-rapids-xeon-4-tile-mcm-annotated

# Multi-core today: Intel Xeon Sapphire Rapids (2023)

- Xeon "Sapphire Rapids" (Platinum/Gold/Silver/Bronze): Up to 60 cores running at 1.7+ GHz (+ "Turbo Mode" 4.8 GHz),
- Simultaneous Multithreading
   → reports as 120-way chip
- "Intel 7" process / up to 350 W
- Multi-die package (4 chips)
- Clock frequency: flexible <sup>(i)</sup>

Optional: "Sub-NUMA Clustering" (SNC) mode boot option

→ One memory domain per die



https://www.techpowerup.com/292204/intel-sapphire-rapids-xeon-4-tile-mcm-annotated

# Multi-core today: Intel Xeon Sapphire Rapids (2023)

- Xeon "Sapphire Rapids" (Platinum/Gold/Silver/Bronze): Up to 60 cores running at 1.7+ GHz (+ "Turbo Mode" 4.8 GHz),
- Simultaneous Multithreading
   → reports as 120-way chip
- "Intel 7" process / up to 350 W
- Multi-die package (4 chips)
- Clock frequency: flexible <sup>(i)</sup>

Optional: "Sub-NUMA Clustering" (SNC) mode boot option

→ One memory domain per die





https://www.techpowerup.com/292204/intel-sapphire-rapids-xeon-4-tile-mcm-annotated

### General-purpose cache-based microprocessor core

- Implements "Stored Program Computer" concept (Turing 1936)
- Similar designs on all modern systems
- (Still) multiple potential bottlenecks

The clock cycle is the "heartbeat" of the core







Friedrich-Alexander-Universität Erlangen-Nürnberg

# **In-core features**

### Pipelining, Superscalarity, SIMD, SMT







#### Superscalarity: Multiple instructions per cycle

|  | Fetch Instruction 1<br>from L1I  | Decede<br>Decede        |                          |
|--|----------------------------------|-------------------------|--------------------------|
|  | Fetch Instruction 5<br>from L1I  | Decode<br>Instruction 1 | - Free and -             |
|  | Fetch Instruction 9<br>from L1I  | Decode<br>Instruction 5 | Execute<br>Instruction 1 |
|  | Fetch Instruction 13<br>from L1I | Decode<br>Instruction 9 | Execute<br>Instruction 5 |





Instruction 9

Execute

Instruction 5

Fetch Instruction 13

from L11

#### Single Instruction Multiple Data:

Multiple operations per instruction



#### Node-Level Performance Engineering





#### Single Instruction Multiple Data:

Multiple operations per instruction



#### Simultaneous Multi-Threading: Multiple instruction sequences in parallel





























Single instruction takes 5 cycles (latency)



Throughput:

1 instruction per cycle after pipeline is full

→ Speedup by factor 5



# Simultaneous multi-threading (SMT)



### Simultaneous multi-threading (SMT)



## SIMD processing

- Single Instruction Multiple Data (SIMD) operations allow the execution of the same operation on "wide" registers from a single instruction
- Adding two registers holding double precision floating point operands:



# Single-core DP floating-point performance







Friedrich-Alexander-Universität Erlangen-Nürnberg

# **Example: The sum reduction**



### A "simple" example: The sum reduction

```
for (int i=0; i<N; i++) {
    sum += a[i];
}</pre>
```

...in single precision on an AVXcapable core (ADD latency = 3 cy)

How fast can this loop possibly run with data in the L1 cache?

```
for (int i=0; i<N; i++) {
    sum += a[i];
}</pre>
```

...in single precision on an AVXcapable core (ADD latency = 3 cy)

How fast can this loop possibly run with data in the L1 cache?

- Loop-carried dependency on summation variable
- Execution stalls at every ADD until previous ADD is complete

→No pipelining?→No SIMD?









```
for (int i=0; i<N; i+=3) {
    s1 += a[i+0];
    s2 += a[i+1];
    s3 += a[i+2];
}
sum = sum + s1+s2+s3;</pre>
```



Scalar code, 3-way "modulo variable expansion" LOAD r1.0  $\leftarrow$  0 LOAD  $r2.0 \leftarrow 0$ LOAD r3.0  $\leftarrow$  0 i ← 1 loop: LOAD r4.0  $\leftarrow$  a(i) LOAD r5.0  $\leftarrow$  a(i+1) LOAD r6.0  $\leftarrow$  a(i+2) ADD  $r1.0 \leftarrow r1.0 + r4.0 \# scalar ADD$ ADD r2.0  $\leftarrow$  r2.0 + r5.0 # scalar ADD ADD  $r3.0 \leftarrow r3.0 + r6.0 \# scalar ADD$ i+=3 →? loop result  $\leftarrow$  r1.0+r2.0+r3.0

```
for (int i=0; i<N; i+=3) {
    s1 += a[i+0];
    s2 += a[i+1];
    s3 += a[i+2];
}
sum = sum + s1+s2+s3;</pre>
```



Scalar code, 3-way "modulo variable expansion" LOAD r1.0  $\leftarrow$  0 LOAD  $r2.0 \leftarrow 0$ LOAD r3.0  $\leftarrow$  0 i ← 1 loop: LOAD r4.0  $\leftarrow$  a(i) LOAD r5.0  $\leftarrow$  a(i+1) LOAD r6.0  $\leftarrow$  a(i+2) ADD  $r1.0 \leftarrow r1.0 + r4.0 \# scalar ADD$ ADD r2.0  $\leftarrow$  r2.0 + r5.0 # scalar ADD ADD  $r3.0 \leftarrow r3.0 + r6.0 \# scalar ADD$ i+=3 →? loop result  $\leftarrow$  r1.0+r2.0+r3.0

```
for (int i=0; i<N; i+=3) {
    s1 += a[i+0];
    s2 += a[i+1];
    s3 += a[i+2];
}
sum = sum + s1+s2+s3;</pre>
```



```
SIMD vectorization (8-way MVE) x pipelining (3-way MVE)
```

```
LOAD [r1.0,...,r1.7] \leftarrow [0,...,0]
LOAD [r2.0,...,r2.7] \leftarrow [0,...,0]
LOAD [r3.0,...,r3.7] \leftarrow [0,...,0]
i \leftarrow 1
```

```
for (int i=0; i<N; i+=24) {
   s10 += a[i+0]; s20 += a[i+8]; s30 += a[i+16];
   s11 += a[i+1]; s21 += a[i+9]; s31 += a[i+17];
   s12 += a[i+2]; s22 += a[i+10]; s32 += a[i+18];
   s13 += a[i+3]; s23 += a[i+11]; s33 += a[i+19];
   s14 += a[i+4]; s24 += a[i+12]; s34 += a[i+20];
   s15 += a[i+5]; s25 += a[i+13]; s35 += a[i+21];
   s16 += a[i+6]; s26 += a[i+14]; s36 += a[i+22];
   s17 += a[i+7]; s27 += a[i+15]; s37 += a[i+23];
}</pre>
```

```
sum = sum + s10+s11+...+s37;
```

| → ADD peak | s10 | s20 | s30 |
|------------|-----|-----|-----|
|            | s11 | s21 | s31 |
|            | s12 | s22 | s32 |
|            | s13 | s23 | s33 |
|            | s14 | s24 | s34 |
|            | s15 | s25 | s35 |
|            | s16 | s26 | s36 |
|            | s17 | s27 | s37 |

| LOAD $[r4.0,,r4.7] \leftarrow [a(i),,a(i+7)]$<br>LOAD $[r5.0,,r5.7] \leftarrow [a(i+8),,a(i+15)]$<br>LOAD $[r6.0,,r6.7] \leftarrow [a(i+16),,a(i+23)]$ | # | SIMD | LOAD |
|--------------------------------------------------------------------------------------------------------------------------------------------------------|---|------|------|
| ADD $r1 \leftarrow r1 + r4$ # SIMD ADD<br>ADD $r2 \leftarrow r2 + r5$ # SIMD ADD<br>ADD $r3 \leftarrow r3 + r6$ # SIMD ADD                             |   |      |      |
| i+=24 →? loop<br>result ← r1.0+r1.1++r3.6+r3.7                                                                                                         |   |      |      |

loop:

#### Questions

When can this performance actually be achieved?

- When can this performance actually be achieved?
  - No data transfer bottlenecks
  - No other in-core bottlenecks
    - Need to execute (3 LOADs + 3 ADDs + 1 increment + 1 compare + 1 branch) in 3 cycles

- When can this performance actually be achieved?
  - No data transfer bottlenecks
  - No other in-core bottlenecks
    - Need to execute (3 LOADs + 3 ADDs + 1 increment + 1 compare + 1 branch) in 3 cycles
- What does the compiler do?

- When can this performance actually be achieved?
  - No data transfer bottlenecks
  - No other in-core bottlenecks
    - Need to execute (3 LOADs + 3 ADDs + 1 increment + 1 compare + 1 branch) in 3 cycles
- What does the compiler do?
  - If allowed and capable, the compiler will do this automatically

- When can this performance actually be achieved?
  - No data transfer bottlenecks
  - No other in-core bottlenecks
    - Need to execute (3 LOADs + 3 ADDs + 1 increment + 1 compare + 1 branch) in 3 cycles
- What does the compiler do?
  - If allowed and capable, the compiler will do this automatically
- Is the compiler allowed to do this at all?

- When can this performance actually be achieved?
  - No data transfer bottlenecks
  - No other in-core bottlenecks
    - Need to execute (3 LOADs + 3 ADDs + 1 increment + 1 compare + 1 branch) in 3 cycles
- What does the compiler do?
  - If allowed and capable, the compiler will do this automatically
- Is the compiler allowed to do this at all?
  - Not according to language standards
  - High optimization levels can violate language standards

- When can this performance actually be achieved?
  - No data transfer bottlenecks
  - No other in-core bottlenecks
    - Need to execute (3 LOADs + 3 ADDs + 1 increment + 1 compare + 1 branch) in 3 cycles
- What does the compiler do?
  - If allowed and capable, the compiler will do this automatically
- Is the compiler allowed to do this at all?
  - Not according to language standards
  - High optimization levels can violate language standards
- What about the "accuracy" of the result?

- When can this performance actually be achieved?
  - No data transfer bottlenecks
  - No other in-core bottlenecks
    - Need to execute (3 LOADs + 3 ADDs + 1 increment + 1 compare + 1 branch) in 3 cycles
- What does the compiler do?
  - If allowed and capable, the compiler will do this automatically
- Is the compiler allowed to do this at all?
  - Not according to language standards
  - High optimization levels can violate language standards
- What about the "accuracy" of the result?
  - Good question ;-)





Friedrich-Alexander-Universität Erlangen-Nürnberg

# Memory Hierarchy

### In-cache performance (L2, L3) Main memory performance



You can either build a small and fast memory or a large and slow memory.



### Purpose of many optimizations is to load data from fast memory

Caches help with getting instructions and data to the CPU "fast"

How does data travel from memory to the CPU and back?

CPU registers

Cache



### Data transfers in a memory hierarchy

Caches help with getting instructions and data to the CPU "fast" How does data travel from memory to the CPU and back?

- Remember: Caches are organized in cache lines (e.g., 64 bytes)
- Only complete cache lines are transferred between memory hierarchy levels (except registers)
- Registers can only "talk" to the L1 cache

- MISS: Load or store instruction does not find the data in a cache level
  - $\rightarrow$  CL transfer required







Caches help with getting instructions and data to the CPU "fast" How does data travel from memory to the CPU and back?

- Remember: Caches are organized in cache lines (e.g., 64 bytes)
- Only complete cache lines are transferred between memory hierarchy levels (except registers)
- Registers can only "talk" to the L1 cache
- MISS: Load or store instruction does not find the data in a cache level
  - $\rightarrow$  CL transfer required

Example: Array copy A(:)=C(:)



**CPU** registers

LD C(1)



## Data transfers in a memory hierarchy

Caches help with getting instructions and data to the CPU "fast" How does data travel from memory to the CPU and back?

- Remember: Caches are organized in cache lines (e.g., 64 bytes)
- Only complete cache lines are transferred between memory hierarchy levels (except registers)
- Registers can only "talk" to the L1 cache
- MISS: Load or store instruction does not find the data in a cache level
  - $\rightarrow$  CL transfer required
  - Example: Array copy A(:)=C(:)







### Data transfers in a memory hierarchy

Caches help with getting instructions and data to the CPU "fast" How does data travel from memory to the CPU and back?

- Remember: Caches are organized in cache lines (e.g., 64 bytes)
- Only complete cache lines are transferred between memory hierarchy levels (except registers)
- Registers can only "talk" to the L1 cache

Example: Array copy A(:)=C(:)

- MISS: Load or store instruction does not find the data in a cache level
  - $\rightarrow$  CL transfer required





**SC23** 

Caches help with getting instructions and data to the CPU "fast"

How does data travel from memory to the CPU and back?

- Remember: Caches are organized in cache lines (e.g., 64 bytes)
- Only complete cache lines are transferred between memory hierarchy levels (except registers)
- Registers can only "talk" to the L1 cache
- MISS: Load or store instruction does not find the data in a cache level
  - $\rightarrow$  CL transfer required



Caches help with getting instructions and data to the CPU "fast"

How does data travel from memory to the CPU and back?

- Remember: Caches are organized in cache lines (e.g., 64 bytes)
- Only complete cache lines are transferred between memory hierarchy levels (except registers)
- Registers can only "talk" to the L1 cache
- MISS: Load or store instruction does not find the data in a cache level
  - $\rightarrow$  CL transfer required



Caches help with getting instructions and data to the CPU "fast"

How does data travel from memory to the CPU and back?

- Remember: Caches are organized in cache lines (e.g., 64 bytes)
- Only complete cache lines are transferred between memory hierarchy levels (except registers)
- Registers can only "talk" to the L1 cache
- MISS: Load or store instruction does not find the data in a cache level
  - $\rightarrow$  CL transfer required



Caches help with getting instructions and data to the CPU "fast"

How does data travel from memory to the CPU and back?

- Remember: Caches are organized in cache lines (e.g., 64 bytes)
- Only complete cache lines are transferred between memory hierarchy levels (except registers)
- Registers can only "talk" to the L1 cache
- MISS: Load or store instruction does not find the data in a cache level
  - $\rightarrow$  CL transfer required



Caches help with getting instructions and data to the CPU "fast" How does data travel from memory to the CPU and back?

- Remember: Caches are organized in cache lines (e.g., 64 bytes)
- Only complete cache lines are transferred between memory hierarchy levels (except registers)
- Registers can only "talk" to the L1 cache
- MISS: Load or store instruction does not find the data in a cache level
  - $\rightarrow$  CL transfer required



Caches help with getting instructions and data to the CPU "fast" How does data travel from memory to the CPU and back?

- Remember: Caches are organized in cache lines (e.g., 64 bytes)
- Only complete cache lines are transferred between memory hierarchy levels (except registers)
- Registers can only "talk" to the L1 cache
- MISS: Load or store instruction does not find the data in a cache level
  - $\rightarrow$  CL transfer required



Caches help with getting instructions and data to the CPU "fast" How does data travel from memory to the CPU and back?

- Remember: Caches are organized in cache lines (e.g., 64 bytes)
- Only complete cache lines are transferred between memory hierarchy levels (except registers)
- Registers can only "talk" to the L1 cache
- MISS: Load or store instruction does not find the data in a cache level
  - $\rightarrow$  CL transfer required







Friedrich-Alexander-Universität Erlangen-Nürnberg

## Multicore

### Node topology and performance



Putting the cores & caches together AMD Epyc 7742 64-Core Processor («Rome»)

- Core features:
  - Two-way SMT
  - Two 256-bit SIMD FMA units (AVX2)
     →16 flops/cycle (actually 24 because 2 ADDs can be done alongside)
  - 32 KiB L1 data cache per core
  - 512 KiB L2 cache per core



#### Core features:

- Two-way SMT
- Two 256-bit SIMD FMA units (AVX2)
   →16 flops/cycle (actually 24 because 2 ADDs can be done alongside)
- 32 KiB L1 data cache per core
- 512 KiB L2 cache per core
- 64 cores per socket hierarchically built up from
  - 16 CCX with 4 cores and 16 MiB of L3 cache
  - 2 CCX form 1 CCD (silicon die)
  - 8 CCDs connected to IO device "Infinity Fabric" (memory controller & PCIe)



#### Core features:

- Two-way SMT
- Two 256-bit SIMD FMA units (AVX2)
   →16 flops/cycle (actually 24 because 2 ADDs can be done alongside)
- 32 KiB L1 data cache per core
- 512 KiB L2 cache per core
- 64 cores per socket hierarchically built up from
  - 16 CCX with 4 cores and 16 MiB of L3 cache
  - 2 CCX form 1 CCD (silicon die)
  - 8 CCDs connected to IO device "Infinity Fabric" (memory controller & PCIe)
- 8 channels of DDR4-3200 per IO device
  - MemBW: 8 ch x 8 byte x 3.2 GHz = 204.8 GB/s
- ccNUMA-feature (Boot time option):
  - Node Per Socket (NPS)=1, 2 or 4



#### Core features:

- Two-way SMT
- Two 256-bit SIMD FMA units (AVX2)
   →16 flops/cycle (actually 24 because 2 ADDs can be done alongside)
- 32 KiB L1 data cache per core
- 512 KiB L2 cache per core
- 64 cores per socket hierarchically built up from
  - 16 CCX with 4 cores and 16 MiB of L3 cache
  - 2 CCX form 1 CCD (silicon die)
  - 8 CCDs connected to IO device "Infinity Fabric" (memory controller & PCIe)
- 8 channels of DDR4-3200 per IO device
  - MemBW: 8 ch x 8 byte x 3.2 GHz = 204.8 GB/s
- ccNUMA-feature (Boot time option):
  - Node Per Socket (NPS)=1, 2 or 4
  - NPS=4  $\rightarrow$  4 ccNUMA domains



Parallel and shared resources within a shared-memory node



### **Parallel resources:**

- -

### **Shared resources:**

- .

Parallel and shared resources within a shared-memory node

GPU #1 Ρ Ρ Ρ P Ρ Ρ Ρ Ρ L1D L1D L1D L1D L1D L1D L1D L1D L2 L2 L2 L2 L2 L2 L2 L2 Other I/O L3 L3 coherent Memory Interface **Memory Interface PCIe link** link GPU #2 Memory Memory

### **Parallel resources:**

- Execution/SIMD units 1

### **Shared resources:**

Parallel and shared resources within a shared-memory node



### **Parallel resources:**

- Execution/SIMD units 1
- Cores
- .

### **Shared resources:**

Parallel and shared resources within a shared-memory node



#### **Parallel resources:**

Execution/SIMD units 1

- Cores 2
- Inner cache levels

- **Shared resources:**

Parallel and shared resources within a shared-memory node



#### **Parallel resources:**

- Execution/SIMD units 1
- Cores
- Inner cache levels 3
- Sockets / ccNUMA domains

### **Shared resources:**

Parallel and shared resources within a shared-memory node



#### **Parallel resources:**

- Execution/SIMD units 1
- Cores
- Inner cache levels 3
- Sockets / ccNUMA domains
- Multiple accelerators 5

### **Shared resources:**

Parallel and shared resources within a shared-memory node



#### **Parallel resources:**

- Execution/SIMD units 1
- Cores
- Inner cache levels 3
- Sockets / ccNUMA domains
- Multiple accelerators 5

### **Shared resources:**

Outer cache level per socket

Parallel and shared resources within a shared-memory node



#### **Parallel resources:**

- Execution/SIMD units 1
- Cores
- Inner cache levels 3
- Sockets / ccNUMA domains
- Multiple accelerators 5

#### **Shared resources:**

- Outer cache level per socket
- Memory bus per socket 7

Parallel and shared resources within a shared-memory node



#### **Parallel resources:**

- Execution/SIMD units 1
- Cores
- Inner cache levels 3
- Sockets / ccNUMA domains
- Multiple accelerators 5

#### **Shared resources:**

- Outer cache level per socket
- Memory bus per socket 7
- Intersocket link 8

Parallel and shared resources within a shared-memory node



#### **Parallel resources:**

- Execution/SIMD units 1
- Cores
- Inner cache levels 3
- Sockets / ccNUMA domains
- Multiple accelerators 5

#### **Shared resources:**

- Outer cache level per socket
- Memory bus per socket 7
- Intersocket link 8
- PCIe bus(es) 9

Parallel and shared resources within a shared-memory node



#### **Parallel resources:**

- Execution/SIMD units 1
- Cores
- Inner cache levels 3
- Sockets / ccNUMA domains
- Multiple accelerators 5

### **Shared resources:**

- Outer cache level per socket
- Memory bus per socket 7
- Intersocket link 8
- PCIe bus(es) 9
- Other I/O resources 10

Parallel and shared resources within a shared-memory node



### **Parallel resources:**

- Execution/SIMD units 1
- Cores
- Inner cache levels 3
- Sockets / ccNUMA domains
- Multiple accelerators 5

### **Shared resources:**

- Outer cache level per socket
- Memory bus per socket 7
- Intersocket link 8
- PCIe bus(es) 9
- Other I/O resources 10

## How does your application react to all of those details?





Friedrich-Alexander-Universität Erlangen-Nürnberg

# **GPGPU** accelerators

NVIDIA "Hopper" H100 vs. AMD Zen4 "Genoa"



- 80 B Transistors
- ~ 1.8 GHz clock speed
- ~ 144 "SM" units
  - 128 SP "cores" each (FMA)
  - 64 DP "cores" each (FMA)
  - 4 "Tensor Cores" each
  - 2:1 SP:DP performance
- ~ 34 TFlop/s DP peak (FP64)
- 50 MiB L2 Cache
- 80 GB HBM3
- MemBW ~ 3300 GB/s (theoretical)
- MemBW ~ 3000 GB/s (measured)



- 80 B Transistors
- ~ 1.8 GHz clock speed
- ~ 144 "SM" units
  - 128 SP "cores" each (FMA)
  - 64 DP "cores" each (FMA)
  - 4 "Tensor Cores" each
  - 2:1 SP:DP performance
- ~ 34 TFlop/s DP peak (FP64)
- 50 MiB L2 Cache
- 80 GB HBM3
- MemBW ~ 3300 GB/s (theoretical)
- MemBW ~ 3000 GB/s (measured)



- 80 B Transistors
- ~ 1.8 GHz clock speed
- ~ 144 "SM" units
  - 128 SP "cores" each (FMA)
  - 64 DP "cores" each (FMA)
  - 4 "Tensor Cores" each
  - 2:1 SP:DP performance
- ~ 34 TFlop/s DP peak (FP64)
- 50 MiB L2 Cache
- 80 GB HBM3
- MemBW ~ 3300 GB/s (theoretical)
- MemBW ~ 3000 GB/s (measured)



$$P_{peak}^{DP} = n_{SM} \cdot n_{core} \cdot n_{FP} \cdot f$$
# SMs
# CUDA
# FP
ops/cy

- 80 B Transistors
- ~ 1.8 GHz clock speed
- ~ 144 "SM" units
  - 128 SP "cores" each (FMA)
  - 64 DP "cores" each (FMA)
  - 4 "Tensor Cores" each
  - 2:1 SP:DP performance
- ~ 34 TFlop/s DP peak (FP64)
- 50 MiB L2 Cache
- 80 GB HBM3
- MemBW ~ 3300 GB/s (theoretical)
- MemBW ~ 3000 GB/s (measured)



$$P_{peak}^{DP} = n_{SM} \cdot n_{core} \cdot n_{FP} \cdot f$$

$$n_{SM} = 144$$

$$n_{core} = 64$$

$$n_{FP} = 2 \frac{\text{flops}}{\text{cy}}$$

$$f = 1.8 \frac{\text{Gcy}}{\text{S}}$$

# Trading single thread performance for parallelism: *GPGPUs vs. CPUs*

| GPU vs. CPU<br>light speed estimate<br>(per processor chip) | Control            | ALU<br>ALU | ALU ALU |                                |
|-------------------------------------------------------------|--------------------|------------|---------|--------------------------------|
|                                                             | DRAM               |            |         | DRAM                           |
|                                                             | CI                 | PU         |         | GPU                            |
|                                                             | 2 x AMD EPYC S     | 9654 "G    | enoa"   | NVidia Tesla H100 SXM "Hopper" |
| Cores@Clock                                                 | 2 x 96 @ 2         | 2.4 GHz    |         | 144 SMs @ ~1.8 GHz             |
| FP32 Performance/core                                       | 76.8 GF            | lop/s      |         | ~ 230 GFlop/s                  |
| Threads@STREAM                                              | ~ 24               | 4          |         | ~ 100000                       |
| FP32 peak                                                   | 14.7 TF            | lop/s      |         | ~ 67 TFlop/s                   |
| Stream BW (meas.)                                           | 2 x 360            | GB/s       |         | ~ 3000 GB/s                    |
| Transistors / TDP                                           | ~ 2x 80 (?) Billic | on / 2x 3  | 60 W    | 80 Billion/700 W               |

## **Conclusions about architecture**

- Performance is a result of
  - How many instructions you require to implement an algorithm
  - How efficiently those instructions are executed on a processor
  - Runtime contribution of the triggered data transfers
- Modern computer architecture has a rich "topology"
- Node-level hardware parallelism takes many forms
  - Sockets/devices CPU: 1-4 or more, GPGPU: 1-8
  - Cores moderate (CPU: 20-128, GPGPU: 10-100)
  - SIMD moderate (CPU: 2-16) to massive (GPGPU: 10's-100's)
  - Superscalarity (CPU: 2-6)
- Exploiting performance: parallelism + bottleneck awareness
  - "High Performance Computing" == computing at a bottleneck
- Performance of programs is sensitive to architecture





Friedrich-Alexander-Universität Erlangen-Nürnberg

# **Multicore Performance and Tools**

## Part 1: Topology, affinity control, clock speed



# **Tools for Node-level Performance Engineering**

## Node Information

/proc/cpuinfo, numactl, hwloc, likwid-topology, likwid-powermeter

- Affinity control and data placement OpenMP and MPI runtime environments, hwloc, numactl, likwid-pin
- Runtime Profiling Compilers, gprof, perf, HPCToolkit, Intel Amplifier, gprof-ng, ...
- Performance Analysis Intel VTune, likwid-perfctr, PAPI-based tools, HPCToolkit, perf
- Microbenchmarking

STREAM, likwid-bench, lmbench, uarch-bench





Friedrich-Alexander-Universität Erlangen-Nürnberg

DEMO

# **Reporting topology**

## likwid-topology



Node-Level Performance Engineering



# Output of likwid-topology

| CPU name: In                  | tel(R) Xeon(R)        | Platinum 83 | 60Y CPU @ 2 | .40GHz       |                |                            |
|-------------------------------|-----------------------|-------------|-------------|--------------|----------------|----------------------------|
| CPU type: In                  | tel Icelake SP        | processor   |             |              |                |                            |
| CPU stepping                  | r: 6                  |             |             |              |                |                            |
| *******                       | *****                 | *********   | ********    | ***********  | *****          | * * *                      |
| Hardware Thr                  | ead Topology          |             |             |              |                |                            |
|                               | *****                 | *********   | ********    | ************ | ****           | * * *                      |
| Sockets:                      | 2                     |             |             |              |                |                            |
| Cores per so                  | ocket: 36             |             |             |              |                |                            |
| Threads per                   | core: 1               |             |             |              |                |                            |
| micado per                    | 0010. 1               |             |             |              |                |                            |
|                               | Thread                | Core        | Die         | Socket       | Available      |                            |
| HWThread                      |                       | Core<br>0   | Die<br>0    | Socket<br>0  | Available<br>* |                            |
| HWThread                      | Thread                |             |             |              |                | All physical               |
| HWThread<br>0<br>2            | Thread<br>0           | 0           | 0           | 0            | *              | All physical processor IDs |
| HWThread<br>0<br>1<br>2<br>[] | Thread<br>0<br>0      | 0<br>1<br>2 | 0<br>0      | 0<br>0       | *              |                            |
| HWThread<br>0<br>1<br>2       | Thread<br>0<br>0      | 0<br>1      | 0<br>0      | 0<br>0       | *              |                            |
| HWThread<br>0<br>1<br>2<br>[] | Thread<br>0<br>0<br>0 | 0<br>1<br>2 | 0<br>0<br>0 | 0<br>0<br>0  | *<br>*<br>*    |                            |

optional

| Output o                             | t likwid-topology                                                                                                                                                     |
|--------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| ****                                 | **************************************                                                                                                                                |
| Cache Topology<br>****************** | ****                                                                                                                                                                  |
| Level:                               | 1                                                                                                                                                                     |
| Size:                                | 48 kB                                                                                                                                                                 |
| Cache groups:                        | (0)(1)(2)(3)(4)(5)(64)(65)(66)(67)(68)(69)(70)(71)                                                                                                                    |
| Level:                               | 2                                                                                                                                                                     |
| Size:                                | 1.25 MB                                                                                                                                                               |
| Cache groups:                        | (0)(1)(2)(3)(4)(5)(64)(65)(66)(67)(68)(69)(70)(71)                                                                                                                    |
| Level:                               | 3                                                                                                                                                                     |
| Size:                                | 54 MB                                                                                                                                                                 |
| Туре:                                | Unified cache                                                                                                                                                         |
| Associativity:                       | 12                                                                                                                                                                    |
| Number of sets:                      | 73728 Additional cache info                                                                                                                                           |
| Cache line size:                     | 64 with -c option                                                                                                                                                     |
| Cache type:                          | Non Inclusive                                                                                                                                                         |
| Shared by threads:                   | 36                                                                                                                                                                    |
| Cache groups:                        | ( 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 23 24 25 26 27 28 29 30 31 32 33 34 35 )<br>( 36 37 38 39 40 41 42 43 44 45 46 47 48 59 60 61 62 63 64 65 66 67 68 69 70 71 ) |

# Output of likwid-topology

| Output of likwid-topology              |                                                           |                                                     |  |  |  |
|----------------------------------------|-----------------------------------------------------------|-----------------------------------------------------|--|--|--|
| ************************************** | *********************                                     |                                                     |  |  |  |
|                                        | *****                                                     |                                                     |  |  |  |
| NUMA domains:                          | 4 4                                                       | Output similar to                                   |  |  |  |
| Domain:                                | 0                                                         |                                                     |  |  |  |
| Processors:                            | (01234567891011121314151617)                              |                                                     |  |  |  |
| Distances:                             | 10 11 20 20                                               | Sockets: 2                                          |  |  |  |
| Free memory:                           | 119059 MB                                                 | Threads per core:1                                  |  |  |  |
| Total memory:                          | 128553 МВ                                                 |                                                     |  |  |  |
| <br>Domain:<br>-                       | 1                                                         | Sub-NUMA clustering (SNC)<br>enabled, SMT disabled! |  |  |  |
|                                        | ( 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 ) |                                                     |  |  |  |
|                                        | 11 10 20 20<br>100106 ND                                  |                                                     |  |  |  |
| Free memory:                           |                                                           |                                                     |  |  |  |
| Total memory:                          | 129020 МВ                                                 |                                                     |  |  |  |
| Domain:                                | 2                                                         |                                                     |  |  |  |
| Processors:                            | ( 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 ) |                                                     |  |  |  |
| Distances:                             | 20 20 10 11                                               |                                                     |  |  |  |
| Free memory:                           | 128033 MB                                                 |                                                     |  |  |  |
| Total memory:                          | 128978 MB                                                 |                                                     |  |  |  |
| Domain:                                | 3                                                         |                                                     |  |  |  |
| Processors:                            | ( 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 ) |                                                     |  |  |  |
| Distances:                             | 20 20 11 10                                               |                                                     |  |  |  |
| Free memory:                           | 128719 MB                                                 |                                                     |  |  |  |
| —                                      | 129017 MB                                                 |                                                     |  |  |  |





Friedrich-Alexander-Universität Erlangen-Nürnberg

DEMO

# Enforcing thread/process affinity under Linux OS

## likwid-pin



https://youtu.be/PSJKNQaqwB0

## DAXPY test on A64FX

### Anarchy vs. thread pinning



## DAXPY test on A64FX

### Anarchy vs. thread pinning



## DAXPY test on A64FX

### Anarchy vs. thread pinning





- Eliminating performance variation
- Making use of architectural features
- Avoiding resource contention



# More thread/process affinity ("pinning") options

- Highly OS-dependent system calls but available on all systems
  - Linux: sched\_setaffinity()
  - Windows: SetThreadAffinityMask()
- Hwloc project (<u>http://www.open-mpi.de/projects/hwloc/</u>)
- Support for "semi-automatic" pinning
  - All modern compilers with OpenMP support OpenMP 4.0 (OMP\_PLACES, OMP\_PROC\_BIND)
  - CPUset reduction utils: taskset or numactl
  - Job scheduler like SLURM
- Affinity awareness in MPI libraries (OpenMPI, Intel MPI, ...)
- Or likwid-pin and likwid-mpirun

https://youtu.be/IKW0kRLnhyc

# Overview likwid-pin

- Pins processes and threads to specific cores without touching code
- Directly supports pthreads, gcc OpenMP, Intel OpenMP
- Based on combination of wrapper tool together with overloaded pthread library

   → binary must be dynamically linked!
- Supports logical core numbering within topological entities (thread domains)
- Simple usage with physical (kernel) core IDs:
- \$ likwid-pin -c 0-3,4,6 ./myApp parameters
- \$ OMP\_NUM\_THREADS=4 likwid-pin -c 0-9 ./myApp params
- Simple usage with logical IDs ("thread groups expressions"):
- \$ likwid-pin -c S0:0-7 ./myApp params
- \$ likwid-pin -c C1:0-2 ./myApp params

# LIKWID terminology: Thread group syntax

- The OS numbers all processors (hardware threads) on a node
- The numbering is enforced at boot time by the BIOS
- LIKWID introduces thread domains consisting of hardware threads sharing a topological entity (e.g. socket or shared cache)
- A thread domain is defined by a single character + index

----+ +----+ +----+ +----+ Example for likwid-pin: likwid-pin -c S0:0-3 ./a.out Thread group expressions may be chained with @: likwid-pin -c S0:0-2@S1:0-2 ./a.out

-----+ +-----+ +-----+ +-----+ || +-----+ +-----+ +-----+

-----+ +-----+ +-----+ +-----+ || +-----+ +-----+ +-----+

| 1 9 | | 2 10 | | 3 11 | || | 4 12 | | 5 13 | | 6 14 | | 7 15 |

48

Physical cores first!

4 | 1 5 | 2 6 | 3 7 | +----+ +----+ +----+ +----+









## Available thread domains/unit prefixes (LIKWID 5.2)



## Available thread domains/unit prefixes (LIKWID 5.2)



## Example: likwid-pin with Intel OpenMP

#### Running the STREAM benchmark with likwid-pin:

\$ likwid-pin -c S0:0-3 ./stream



Processor: smallest entity able to run a thread or task (hardware thread) Place: one or more processors  $\rightarrow$  thread pinning is done place by place Free migration of the threads on a place between the processors of that place.

Or use explicit numbering, e.g. 8 places, each consisting of 4 processors:

- OMP\_PLACES="{0,1,2,3}, {4,5,6,7}, {8,9,10,11}, ... {28,29,30,31}"
- OMP\_PLACES="{0:4}, {4:4}, {8:4}, ... {28:4}"
- OMP\_PLACES="{0:4}:8:4"

Processor: smallest entity able to run a thread or task (hardware thread) Place: one or more processors  $\rightarrow$  thread pinning is done place by place Free migration of the threads on a place between the processors of that place.

| OMP_PLACES                | Place ==                        |
|---------------------------|---------------------------------|
| threads                   | Hardware thread (hyper-thread)  |
| cores                     | All HW threads of a single core |
| sockets                   | All HW threads of a socket      |
| abstract_name(num_places) | Restrict # of places available  |

Or use explicit numbering, e.g. 8 places, each consisting of 4 processors:

- OMP\_PLACES="{0,1,2,3},{4,5,6,7},{8,9,10,11}, ... {28,29,30,31}"
- OMP\_PLACES="{0:4}, {4:4}, {8:4}, ... {28:4}"
- OMP\_PLACES="{0:4}:8:4"

Processor: smallest entity able to run a thread or task (hardware thread) Place: one or more processors  $\rightarrow$  thread pinning is done place by place Free migration of the threads on a place between the processors of that place.

| abstract name  | OMP_PLACES                           | Place ==                        |
|----------------|--------------------------------------|---------------------------------|
| aboliactinanie | threads                              | Hardware thread (hyper-thread)  |
|                | cores                                | All HW threads of a single core |
|                | sockets                              | All HW threads of a socket      |
|                | <pre>abstract_name(num_places)</pre> | Restrict # of places available  |

Or use explicit numbering, e.g. 8 places, each consisting of 4 processors:

- OMP\_PLACES="{0,1,2,3}, {4,5,6,7}, {8,9,10,11}, ... {28,29,30,31}"
- OMP\_PLACES="{0:4}, {4:4}, {8:4}, ... {28:4}" >
- OMP\_PLACES=" { 0 : 4 } : 8 : 4 "

<lower-bound>:<number of entries>[:<stride>]

Processor: smallest entity able to run a thread or task (hardware thread) Place: one or more processors  $\rightarrow$  thread pinning is done place by place Free migration of the threads on a place between the processors of that place.

| abstract name  | OMP_PLACES                           | Place ==                        |
|----------------|--------------------------------------|---------------------------------|
| aboliactitatio | threads                              | Hardware thread (hyper-thread)  |
|                | cores                                | All HW threads of a single core |
|                | sockets                              | All HW threads of a socket      |
|                | <pre>abstract_name(num_places)</pre> | Restrict # of places available  |

Or use explicit numbering, e.g. 8 places, each consisting of 4 processors:

- OMP\_PLACES="{0,1,2,3},{4,5,6,7},{8,9,10,11}, ... {28,29,30,31}"
- OMP\_PLACES="{0:4}, {4:4}, {8:4}, ... {28:4}" >
- OMP\_PLACES="{0:4}:8:4"

Caveat: Actual behavior is implementation defined!

<lower-bound>:<number of entries>[:<stride>]

## OMP\_PROC\_BIND variable / proc\_bind() clause

#### Determines how places are used for pinning:

| OMP_PROC_BIND | Meaning                                                                                                     |
|---------------|-------------------------------------------------------------------------------------------------------------|
| FALSE         | Affinity disabled                                                                                           |
| TRUE          | Affinity enabled, implementation defined strategy                                                           |
| CLOSE         | Threads bind to consecutive places                                                                          |
| SPREAD        | Threads are evenly scattered among places                                                                   |
| MASTER        | Threads bind to the same place as the master thread that was running before the parallel region was entered |

If there are more threads than places, consecutive threads are put into individual places ("balanced")

## Some simple OMP\_PLACES examples

Intel Xeon w/ SMT, 2x10 cores, 1 thread per physical core, fill 1 socket OMP\_NUM\_THREADS=10 OMP\_PLACES=cores OMP\_PROC\_BIND=close Always prefer abstract places instead of HW thread IDs!

Intel Xeon, 2 sockets, 4 threads per socket (no binding within socket!) OMP\_NUM\_THREADS=8 OMP\_PLACES=sockets OMP\_PROC\_BIND=close # spread will also do

Intel Xeon, 2 sockets, 4 threads per socket, binding to cores OMP\_NUM\_THREADS=8 OMP\_PLACES=cores OMP\_PROC\_BIND=spread

# MPI startup and hybrid pinning: likwid-mpirun

- How do you manage affinity with MPI or hybrid MPI/threading?
- In the long run a unified standard is needed
- Till then, likwid-mpirun provides a portable/flexible solution
- The examples here are for Intel MPI/OpenMP programs, but are also applicable to other threading models

Pure MPI:

\$ likwid-mpirun -np 16 -nperdomain S:2 ./a.out

Hybrid:

\$ likwid-mpirun -np 16 -pin S0:0,1\_S1:0,1 ./a.out

## likwid-mpirun 1 MPI process per socket

\$ likwid-mpirun -np 4 -pin S0:0-5\_S1:0-5 ./a.out

\$ likwid-mpirun -np 4 -nperdomain S:1 6 ./a.out

| 32kB     32kB     32kB     32kB     32kB       256kB     256kB     256kB     256kB     256kB | 32kB       32kB       32kB       32kB       32kB       32kB         256kB       256kB       256kB       256kB       256kB       256kB |
|----------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------|
| 12 MB                                                                                        | 12 MB                                                                                                                                 |
|                                                                                              |                                                                                                                                       |
|                                                                                              |                                                                                                                                       |
|                                                                                              |                                                                                                                                       |
|                                                                                              |                                                                                                                                       |
| 32kB         32kB         32kB         32kB         32kB                                     | 32kB         32kB         32kB         32kB         32kB         32kB                                                                 |
|                                                                                              |                                                                                                                                       |

Intel MPI + compiler:

OMP\_NUM\_THREADS=6 mpirun -ppn 2 -np 4 -env I\_MPI\_PIN\_DOMAIN socket -env KMP\_AFFINITY scatter ./a.out





Friedrich-Alexander-Universität Erlangen-Nürnberg

## Microbenchmarking for architectural exploration

Probing of the memory hierarchy

Saturation effects

OpenMP barrier overhead



## Motivation for Microbenchmarking as a tool

- Isolate small kernels to:
  - Separate influences
  - Determine specific machine capabilities (light speed)
  - Gain experience about software/hardware interaction
  - Determine programming model overhead
  - • •
- Possibilities:
  - Readymade benchmark collections (epcc OpenMP, IMB)
  - STREAM benchmark for memory bandwidth
  - Implement own benchmarks (difficult and error prone)
  - likwid-bench tool: Offers collection of benchmarks and framework for rapid development of assembly code kernels

#### likwid-bench

- Microbenchmarking in high-level language is often difficult
- Solution: assembly-based microbenchmarking framework
  - e.g., likwid-bench
  - \$ likwid-bench -t triad\_avx512\_fma -W S0:28kB:1

benchmark type topological entity (see likwid-pin) working set

# of threads

Classic benchmark: Schönauer Triad a[i] = b[i] + d[i] \* c[i]
 This kernel is limited by data transfer performance for all memory levels on all architectures, ever!



Node-Level Performance Engineering



Node-Level Performance Engineering





Node-Level Performance Engineering

- How does the bandwidth scale across cores?
- Are there any bottlenecks?
- How large are the caches?

```
likwid-bench \
  -t triad_avx512_fma
  -W S0:$size:$threads:1:2
```

- Scan \$size and \$threads
- Pin threads in chunks of 1 with distance of 2 (skip SMT threads)

- How does the bandwidth scale across cores?
- Are there any bottlenecks?
- How large are the caches?

- Scan \$size and \$threads
- Pin threads in chunks of 1 with distance of 2 (skip SMT threads)



- How does the bandwidth scale across cores?
- Are there any bottlenecks?
- How large are the caches?

- Scan \$size and \$threads
- Pin threads in chunks of 1 with distance of 2 (skip SMT threads)



- How does the bandwidth scale across cores?
- Are there any bottlenecks?
- How large are the caches?

- Scan \$size and \$threads
- Pin threads in chunks of 1 with distance of 2 (skip SMT threads)



- How does the bandwidth scale across cores?
- Are there any bottlenecks?
- How large are the caches?

- Scan \$size and \$threads
- Pin threads in chunks of 1 with distance of 2 (skip SMT threads)



- How does the bandwidth scale across cores?
- Are there any bottlenecks?
- How large are the caches?

- Scan \$size and \$threads
- Pin threads in chunks of 1 with distance of 2 (skip SMT threads)























Node-Level Performance Engineering



Node-Level Performance Engineering

### Memory bandwidth saturation (read-only)



Node-Level Performance Engineering

### Memory bandwidth saturation (read-only)



Node-Level Performance Engineering

69

# The OpenMP-parallel vector triad benchmark

#### OpenMP worksharing in the benchmark loop

```
S = getTimeStamp();
#pragma omp parallel
ł
    for(int j = 0; j < iter; j++) {</pre>
      #pragma omp for
      for (int i=0; i<N; i++) {</pre>
         a[i] = b[i] + d[i] * c[i];
                               Implicit barrier
E = getTimeStamp();
```









Typical barrier cost

- ~ 10000 cy full node
- Scales with log(#cores)
- Depends on positions of threads (topology)

# Conclusions from the microbenchmarks

- Microbenchmarks can yield surprisingly deep insights
- Affinity matters!
  - Almost all performance properties depend on the position of
    - Data
    - Threads/processes
  - Consequences
    - Know where your threads are running
    - Know where your data is (see later for that)
- Bandwidth bottlenecks are ubiquitous
- Synchronization overhead may be an issue
  - ... and depends on the system topology!
  - Many-core poses new challenges in terms of synchronization





Friedrich-Alexander-Universität Erlangen-Nürnberg

## "Simple" performance modeling: The Roofline Model

#### Loop-based performance modeling: Execution vs. data transfer

R.W. Hockney and I.J. Curington:  $f_{1/2}$ : A parameter to characterize memory and communication bottlenecks. Parallel Computing 10, 277-286 (1989). DOI: 10.1016/0167-8191(89)90100-2

W. Schönauer: <u>Scientific Supercomputing: Architecture and Use of Shared and Distributed Memory Parallel Computers</u>. Self-edition (2000)

S. Williams: <u>Auto-tuning Performance on Multicore Computers</u>. UCB Technical Report No. UCB/EECS-2008-164. PhD thesis (2008)

#### A simple performance model for loops

Simplistic view of the hardware:



#### A simple performance model for loops



# Naïve Roofline Model

How fast can tasks be processed at most? P [flop/s]

#### The bottleneck is either

- The execution of work:
- The data path:

 $P_{\text{peak}}$  [flop/s]  $I \cdot b_S$  [flop/byte x byte/s]

#### This is the "Naïve Roofline Model"

- High intensity: P limited by execution
- Low intensity: P limited by data transfer
- "Knee" at P<sub>peak</sub> = I · b<sub>S</sub>: Best use of resources
- Roofline is an "optimistic" model (think "light speed")

# Naïve Roofline Model

How fast can tasks be processed at most? P [flop/s]

#### The bottleneck is either

- The execution of work:
- The data path:

 $P_{\text{peak}}$  [flop/s]  $I \cdot b_S$  [flop/byte x byte/s]

 $P = \min(P_{\text{peak}}, I \cdot b_S)$ 

This is the "Naïve Roofline Model"

- High intensity: P limited by execution
- Low intensity: P limited by data transfer
- "Knee" at P<sub>peak</sub> = I · b<sub>S</sub>:
   Best use of resources
- Roofline is an "optimistic" model (think "light speed")

# Naïve Roofline Model

How fast can tasks be processed at most? P [flop/s]

The bottleneck is either

- The execution of work:
- The data path:

 $P_{\text{peak}}$  [flop/s]  $I \cdot b_S$  [flop/byte x byte/s]



#### Apply the naive Roofline model in practice

- Machine parameter #1:
- Machine parameter #2:
- Code characteristic:

Peak performance: $P_{peak} \left[ \frac{F}{s} \right]$ Memory bandwidth: $b_S \left[ \frac{B}{s} \right]$ 

Computational intensity:  $I = \frac{F}{B}$ 

**SC23** 

#### Apply the naive Roofline model in practice

- Machine parameter #1:
- Machine parameter #2:
- Code characteristic:

Peak performance: Memory bandwidth:

Computational intensity: I



Machine model

#### Apply the naive Roofline model in practice

- Machine parameter #1:
- Machine parameter #2:
- Code characteristic:

Peak performance: $P_{peak} \begin{bmatrix} F \\ s \end{bmatrix}$ MachMemory bandwidth: $b_S \begin{bmatrix} B \\ s \end{bmatrix}$ MachComputational intensity: $I \begin{bmatrix} F \\ B \end{bmatrix}$ Applie

Machine model

Application model

#### Apply the naive Roofline model in practice

- Machine parameter #1:
- Machine parameter #2:
- Code characteristic:



0,25

1/64

1/32

1/16

#### Apply the naive Roofline model in practice

- Machine parameter #1:
- Machine parameter #2:
- Code characteristic:



1/2

2

1/4

Computational intensity I [F/B]

1/8

#### Apply the naive Roofline model in practice

- Machine parameter #1:
- Machine parameter #2:
- Code characteristic:

Peak performance: $P_{peak} \begin{bmatrix} F\\ s \end{bmatrix}$ Machine modelMemory bandwidth: $b_S \begin{bmatrix} B\\ s \end{bmatrix}$ Application modelComputational intensity: $I \begin{bmatrix} F\\ B \end{bmatrix}$ Application modelties: $a_{4}$  $P_{peak}$ 

Machine properties:  $P_{peak} = 4 \frac{\text{GF}}{\text{S}}$ Performance P [GF/s] 1<sup>30 55</sup>  $\boldsymbol{b}_{\boldsymbol{S}} = 10 \frac{\text{GB}}{\text{GB}}$ 0,5 0,25 Application property: I 1/8 1/21/64 1/32 1/16 1/42 Computational intensity I [F/B]

#### Apply the naive Roofline model in practice

- Machine parameter #1:
- Machine parameter #2:
- Code characteristic:



#### Apply the naive Roofline model in practice

- Machine parameter #1:
- Machine parameter #2:
- Code characteristic:



# Prerequisites for the Roofline Model

- Data transfer and core execution overlap perfectly!
  - Either the limit is core execution or it is data transfer
- Slowest limiting factor "wins"; all others are assumed to have no impact
  - If two bottlenecks are "close," no interaction is assumed
- Data access latency is ignored, i.e. perfect streaming mode
  Achievable bandwidth is the limit
- Chip must be able to saturate the bandwidth bottleneck(s)
  Always model the full chip







- Compare capabilities of different machines
- Compare performance expectations for different loops



- Compare capabilities of different machines
- Compare performance expectations for different loops



- Compare capabilities of different machines
- Compare performance expectations for different loops



- Compare capabilities of different machines
- Compare performance expectations for different loops



- Compare capabilities of different machines
- Compare performance expectations for different loops



- Compare capabilities of different machines
- Compare performance expectations for different loops



- Compare capabilities of different machines
- Compare performance expectations for different loops

- Roofline always provides upper bound but is it realistic?
  - Simple case: Loop kernel has loop-carried dependecncies → cannot achieve peak
  - Other bandwidth bottlenecks may apply



P<sub>max</sub> = Applicable peak performance of a loop, assuming that data comes from the level 1 cache (this is not necessarily P<sub>peak</sub>)
 → e.g., P<sub>max</sub> = 176 GFlop/s

- P<sub>max</sub> = Applicable peak performance of a loop, assuming that data comes from the level 1 cache (this is not necessarily P<sub>peak</sub>)
   → e.g., P<sub>max</sub> = 176 GFlop/s
- 2. *I* = Computational intensity ("work" per byte transferred) over the slowest data path utilized (code balance  $B_C = I^{-1}$ ) → e.g., *I* = 0.167 Flop/Byte →  $B_C = 6$  Byte/Flop

- P<sub>max</sub> = Applicable peak performance of a loop, assuming that data comes from the level 1 cache (this is not necessarily P<sub>peak</sub>)
   → e.g., P<sub>max</sub> = 176 GFlop/s
- 2. *I* = Computational intensity ("work" per byte transferred) over the slowest data path utilized (code balance  $B_C = I^{-1}$ ) → e.g., *I* = 0.167 Flop/Byte →  $B_C = 6$  Byte/Flop

- 1.  $P_{\text{max}}$  = Applicable peak performance of a loop, assuming that data comes from the level 1 cache (this is not necessarily  $P_{\text{peak}}$ )  $\rightarrow$  e.g.,  $P_{\text{max}}$  = 176 GFlop/s
- 2. *I* = Computational intensity ("work" per byte transferred) over the slowest data path utilized (code balance  $B_C = I^{-1}$ ) → e.g., *I* = 0.167 Flop/Byte →  $B_C = 6$  Byte/Flop
- 3.  $b_{\rm S}$  = Applicable (saturated) peak bandwidth of the slowest data path utilized  $\rightarrow$  e.g.,  $b_{\rm S}$  = 56 GByte/s

- 1.  $P_{\text{max}}$  = Applicable peak performance of a loop, assuming that data comes from the level 1 cache (this is not necessarily  $P_{\text{peak}}$ )  $\rightarrow$  e.g.,  $P_{\text{max}}$  = 176 GFlop/s
- 2. *I* = Computational intensity ("work" per byte transferred) over the slowest data path utilized (code balance  $B_C = I^{-1}$ ) → e.g., *I* = 0.167 Flop/Byte →  $B_C = 6$  Byte/Flop
- 3.  $b_s$  = Applicable (saturated) peak bandwidth of the slowest data path utilized  $\rightarrow$  e.g.,  $b_s$  = 56 GByte/s

Performance limit:

$$P = \min(P_{\max}, I \cdot b_S) = \min\left(P_{\max}, \frac{b_S}{B_C}\right)$$
[Byte/Flop]

## A refined Roofline Model

- 1.  $P_{\text{max}}$  = Applicable peak performance of a loop, assuming that data comes from the level 1 cache (this is not necessarily  $P_{\text{peak}}$ )  $\rightarrow$  e.g.,  $P_{\text{max}}$  = 176 GFlop/s
- 2. *I* = Computational intensity ("work" per byte transferred) over the slowest data path utilized (code balance  $B_C = I^{-1}$ ) → e.g., *I* = 0.167 Flop/Byte →  $B_C = 6$  Byte/Flop
- 3.  $b_{\rm S}$  = Applicable (saturated) peak bandwidth of the slowest data path utilized  $\rightarrow$  e.g.,  $b_{\rm S}$  = 56 GByte/s

Performance limit:

$$P = \min(P_{\max}, I \cdot b_S) = \min\left(P_{\max}, \frac{b_S}{B_C}\right)$$
 [Byte/Flop]

R.W. Hockney and I.J. Curington:  $f_{1/2}$ : A parameter to characterize memory and communication bottlenecks.

Parallel Computing 10, 277-286 (1989). DOI: 10.1016/0167-8191(89)90100-2

W. Schönauer: Scientific Supercomputing: Architecture and Use of Shared and Distributed Memory Parallel Computers. Self-edition (2000)

S. Williams: Auto-tuning Performance on Multicore Computers. UCB Technical Report No. UCB/EECS-2008-164. PhD thesis (2008)

Node-Level Performance Engineering

Flop" is not the only

useful unit of work!

## Refined Roofline models: graphical representation

### Multiple ceilings may apply

- Different bandwidths / data paths
   → different inclined ceilings
- Different P<sub>max</sub>
   → different flat ceilings

In fact,  $P_{max}$  should always come from code analysis; generic ceilings are usually impossible to attain



## Hardware features of (some) Intel Xeon processors

| Microarchitecture       | Ivy Bridge EP             | Broadwell EP              | Cascade Lake SP                    | Ice Lake SP                      |
|-------------------------|---------------------------|---------------------------|------------------------------------|----------------------------------|
| Introduced              | 09/2013                   | 03/2016                   | 04/2019                            | 06/2021                          |
| Cores                   | ≤ 12                      | ≤ 22                      | ≤ 28                               | ≤ 40                             |
| LD/ST throughput per cy | /:                        |                           |                                    |                                  |
| AVX(2), AVX512          | 1 LD + ½ ST               | 2 LD + 1 ST               | 2 LD + 1 ST                        | 2 LD + 1 ST                      |
| SSE/scalar              | 2 LD    1 LD & 1 ST       | 2 LD + 1 31               | 2 LD + 1 31                        | 2 LD + 1 31                      |
| ADD throughput          | 1 / cy                    | 1 / cy                    | 2 / cy                             | 2 / cy                           |
| MUL throughput          | 1 / cy                    | 2 / cy                    | 2 / cy                             | 2 / cy                           |
| FMA throughput          | N/A                       | 2 / cy                    | 2 / cy                             | 2 / cy                           |
| L1-L2 data bus          | 32 B/cy                   | 64 B/cy                   | 64 B/cy                            | 64 B/cy                          |
| L2-L3 data bus          | 32 B/cy                   | 32 B/cy                   | 16+16 B/cy                         | 16+16 B/cy                       |
| L1/L2 per core          | 32 KiB / 256 KiB          | 32 KiB / 256 KiB          | 32 KiB / 1 MiB                     | 48 KiB / 1.25 MiB                |
| LLC                     | 2.5 MiB/core<br>inclusive | 2.5 MiB/core<br>inclusive | 1.375 MiB/core<br>exclusive/victim | 1.5 MiB/core<br>exclusive/victim |
| Memory                  | 4ch DDR3                  | 4ch DDR3                  | 6ch DDR4                           | 8ch DDR4                         |
| Memory BW (meas.)       | ~ 48 GB/s                 | ~ 62 GB/s                 | ~ 115 GB/s                         | ~ 160 GB/s                       |

<u>manual.html</u>

ntel-64-and-ia-32-architectures-optimization-reference-

#### Example: do i=1,N; s=s+a(i); enddo



#### Example: do i=1,N; s=s+a(i); enddo



#### Example: do i=1,N; s=s+a(i); enddo



#### Example: do i=1,N; s=s+a(i); enddo



#### Example: do i=1,N; s=s+a(i); enddo



#### Example: do i=1,N; s=s+a(i); enddo



#### Example: do i=1,N; s=s+a(i); enddo



#### Example: do i=1,N; s=s+a(i); enddo



#### Example: do i=1,N; s=s+a(i); enddo



#### Example: do i=1,N; s=s+a(i); enddo



#### Example: do i=1,N; s=s+a(i); enddo



#### Example: do i=1,N; s=s+a(i); enddo







 Hit the BW bottleneck by good serial code (e.g., plain Python → Fortran)



- Hit the BW bottleneck by good serial code (e.g., plain Python → Fortran)
- 2. Increase intensity to make better use of BW bottleneck (e.g., spatial loop blocking)



- Hit the BW bottleneck by good serial code (e.g., plain Python → Fortran)
- 2. Increase intensity to make better use of BW bottleneck (e.g., spatial loop blocking)
- 3. Increase intensity and go from memory bound to core bound (e.g., temporal blocking)



- Hit the BW bottleneck by good serial code (e.g., plain Python → Fortran)
- 2. Increase intensity to make better use of BW bottleneck (e.g., spatial loop blocking)
- 3. Increase intensity and go from memory bound to core bound (e.g., temporal blocking)
- 4. Hit the core bottleneck by good serial code (e.g., -fno-alias, SIMD intrinsics)







Friedrich-Alexander-Universität Erlangen-Nürnberg

# Diagnostic / phenomenological Roofline modeling



- What if we cannot predict the intensity/balance?
  - Code very complicated
  - Code not available
  - Parameters unknown
  - Doubts about correctness of analysis

- What if we cannot predict the intensity/balance?
  - Code very complicated
  - Code not available
  - Parameters unknown
  - Doubts about correctness of analysis



- What if we cannot predict the intensity/balance?
  - Code very complicated
  - Code not available
  - Parameters unknown
  - Doubts about correctness of analysis
- Measure data volume V<sub>meas</sub> (and work N<sub>meas</sub>)
  - Hardware performance counters
  - Tools: likwid-perfctr, PAPI, Intel Vtune,...



- What if we cannot predict the intensity/balance?
  - Code very complicated
  - Code not available
  - Parameters unknown
  - Doubts about correctness of analysis
- Measure data volume V<sub>meas</sub> (and work N<sub>meas</sub>)
  - Hardware performance counters
  - Tools: likwid-perfctr, PAPI, Intel Vtune,...
- Insights + benefits
  - Compare analytic model and measurement  $\rightarrow$  validate model
  - Can be applied (semi-)automatically
  - Useful in performance monitoring of user jobs on clusters





Kernel 1

Multiple bandwidth bottlenecks  $\rightarrow$  need *I* for each one ( $I_{mem}$ ,  $I_{L3}$ ,  $I_{L2}$ , ...)



Kernel 1

Multiple bandwidth bottlenecks  $\rightarrow$  need *I* for each one ( $I_{mem}$ ,  $I_{L3}$ ,  $I_{L2}$ , ...)



Multiple bandwidth bottlenecks  $\rightarrow$  need *I* for each one ( $I_{mem}$ ,  $I_{L3}$ ,  $I_{L2}$ , ...)





Performance close to memory BW ceiling but far away from others

→ indicates **memory bound** 

Multiple bandwidth bottlenecks  $\rightarrow$  need *I* for each one ( $I_{mem}$ ,  $I_{L3}$ ,  $I_{L2}$ , ...)



#### Kernel 1

- Performance close to memory BW ceiling but far away from others
  - $\rightarrow$  indicates **memory bound**

### Kernel 2

Multiple bandwidth bottlenecks  $\rightarrow$  need *I* for each one ( $I_{mem}$ ,  $I_{L3}$ ,  $I_{L2}$ , ...)



#### Kernel 1

 Performance close to memory BW ceiling but far away from others
 → indicates memory bound

### Kernel 2

- Performance not near any BW ceiling
- Performance close to scalar peak ceiling
   indicates scalar core-bound peak code

Multiple bandwidth bottlenecks  $\rightarrow$  need *I* for each one ( $I_{mem}$ ,  $I_{L3}$ ,  $I_{L2}$ , ...)



#### Kernel 1

 Performance close to memory BW ceiling but far away from others
 → indicates memory bound

### Kernel 2

- Performance not near any BW ceiling
- Performance close to scalar peak ceiling
   indicates scalar core-bound peak code

Multiple bandwidth bottlenecks  $\rightarrow$  need *I* for each one ( $I_{mem}$ ,  $I_{L3}$ ,  $I_{L2}$ , ...)



#### Kernel 1

 Performance close to memory BW ceiling but far away from others
 → indicates memory bound

### Kernel 2

- Performance not near any BW ceiling
- Performance close to scalar peak ceiling
   indicates scalar core-bound peak code

### Kernel 3

Performance not anywhere near any ceiling
 → There must be an (as yet) unknown in-core performance limit P<sub>max</sub>

# Roofline and performance monitoring of clusters

#### Two cluster jobs...



Cluster monitoring framework: ClusterCockpit <u>https://clustercockpit.org</u> Come visit Booth #1311 (LRZ)

# Roofline and performance monitoring of clusters

#### Two cluster jobs...



Cluster monitoring framework: ClusterCockpit <u>https://clustercockpit.org</u> Come visit Booth #1311 (LRZ)

## **Roofline conclusion**

- Roofline = simple first-principle model for upper performance limit of datastreaming loops
  - Machine model  $(P_{max}, b_S)$  + application model (I)
  - Conditions apply, extensions exist
- Two modes of operation
  - Predictive: Calculate I, calculate upper limit, validate model, optimize, iterate
  - Diagnostic: Measure I and P, compare with roof
- Challenge of predictive modeling: Getting  $P_{max}$  and I right





Friedrich-Alexander-Universität Erlangen-Nürnberg

# Performance analysis with hardware metrics

likwid-perfctr



## Probing performance behavior

- How do we find out about the performance properties and requirements of a parallel code?
   Profiling via advanced tools is often overkill
- A coarse overview is often sufficient: likwid-perfctr
- Simple measurement of hardware performance metrics
- Preconfigured and extensible metric groups, list with likwid-perfctr -a: BRANCH: Branch pred
- Operating modes:
  - Wrapper
  - Stethoscope
  - Timeline
  - Marker API

BRANCH: Branch prediction miss rate/ratio CLOCK: Clock frequency of cores DATA: Load to store ratio FLOPS\_DP: Double Precision MFlops/s FLOPS\_SP: Single Precision MFlops/s L2: L2 cache bandwidth in MBytes/s L2CACHE: L2 cache miss rate/ratio L3: L3 cache bandwidth in MBytes/s L3CACHE: L3 cache miss rate/ratio MEM: Main memory bandwidth in MBytes/s ENERGY: Power and energy consumption

# Best practices for Performance profiling

Focus on resource utilization and instruction mix! Metrics to measure:

- Operation throughput (Flops/s)
- Overall instruction throughput (CPI or IPC)
- Instruction breakdown:
  - FP instructions
  - loads and stores
  - branch instructions
  - other instructions
- Instruction breakdown to SIMD width (scalar, SSE, AVX, AVX512 for X86). (only arithmetic instructions on most architectures)

All above metrics can be acquired using performance groups: MEM DP, MEM SP, BRANCH, DATA, L2, L3

- Data volumes and bandwidths to
  - main memory (GB and GB/s)
  - cache levels (GB and GB/s)

Useful diagnostic metrics are:

- Clock frequency (GHz)
- Power (W)

| PU name:                                 | Intel(R)     | Xeon(R) C                      | PU E5-2695 v3 | @ 2.30GHz [ | ]                      |            |
|------------------------------------------|--------------|--------------------------------|---------------|-------------|------------------------|------------|
| <<< PROGRAM                              | OUTPUT >>>>  |                                |               |             |                        |            |
| roup 1: L2                               |              |                                |               |             |                        |            |
| Eve                                      | nt           | Counter                        | Core 14       | Core 15     | Core 16                | Core 17    |
| INSTR RET                                | IRED ANY     | <br>  FIXC0                    | 1298031144    | 1965945005  | 1854182290             | 1862521357 |
| CPU CLK UNH                              | ALTED CORE   | FIXC1                          | 2353698512    | 2894134935  | 2894645261             | 2895023739 |
| CPU CLK UN                               | HALTED REF   | FIXC2                          | 2057044629    | 2534405765  | 2535218217             | 2535560434 |
| LID_REPL                                 | ACEMENT      | PMC0                           | 212900444     | 200544877   | 200389272              | 200387671  |
| L2_TRANS                                 | L1D_WB       | PMC1                           | 112464863     | 99931184    | 99982371               | 99976697   |
| ICACHE_                                  | MISSES       | PMC2                           | 21265         | 26233       | 12646                  | 12363      |
| . statistics                             | output omi   | tted]                          | ++            | ++          | ++                     |            |
|                                          | Metric       | 1                              | Core 14       | Core 15     | Core 16                | Core 17    |
| Runti                                    | me (RDTSC)   | +<br>[s]                       | 1.1314        | 1.1314      | 1.1314                 | 1.1314     |
| Runtime unhalted [s]                     |              | 1.0234                         | 1.2583        | 1.2586      | 1.2587                 |            |
|                                          |              | 2631.6699                      | 2626.4367     | 2626.0579   | 2626.0468              |            |
| CPI                                      |              | 1                              | 1.8133        | 1.4721      | 1.5611                 | 1.5544     |
| L2D load bandwidth [MBytes/s]            |              | Bytes/s]                       | 12042.7388    | 11343.8446  | 11335.0428             | 11334.9523 |
| L2D load data volume [GBytes]            |              |                                | 13.6256       | 12.8349     | 12.8249                | 12.8248    |
|                                          |              | L2D evict bandwidth [MBytes/s] |               |             | 5655.5146              | 5655.1937  |
| L2D load d                               |              | Bytes/s]                       |               |             |                        |            |
| L2D load d<br>L2D evict b                | andwidth [M] |                                | -             | 6.3956      | 6.3989                 | 6.3985     |
| L2D load d<br>L2D evict b<br>L2D evict d | andwidth [M] | [GBytes]                       | -             |             | 6.3989  <br>16991.2728 |            |

| PU name: Intel(R)                                                                                                                                      | Xeon(R) C                                                                  | PU E5-2695 v3                                                                 | 3 @ 2.30GHz [                                                                             | ]                                                                                         |                                                                               |
|--------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------|-------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------|
| <<< PROGRAM OUTPUT >>>>                                                                                                                                |                                                                            | — Alw                                                                         |                                                                                           |                                                                                           |                                                                               |
| roup 1: L2                                                                                                                                             |                                                                            | measu                                                                         |                                                                                           | +                                                                                         | +                                                                             |
| Event                                                                                                                                                  | Counter                                                                    | Core 14                                                                       | Core 15                                                                                   | Core 16                                                                                   | Core 17                                                                       |
| INSTR RETIRED ANY                                                                                                                                      | FIXC0                                                                      | 1298031144                                                                    | 1965945005                                                                                | 1854182290                                                                                | 1862521357                                                                    |
| CPU_CLK_UNHALTED_CORE                                                                                                                                  | FIXC1                                                                      | 2353698512                                                                    | 2894134935                                                                                | 2894645261                                                                                | 2895023739                                                                    |
| CPU_CLK_UNHALTED_REF                                                                                                                                   | FIXC2                                                                      | 2057044629                                                                    | 2534405765                                                                                | 2535218217                                                                                | 2535560434                                                                    |
| LID_REPLACEMENT                                                                                                                                        | PMC0                                                                       | 212900444                                                                     | 200544877                                                                                 | 200389272                                                                                 | 200387671                                                                     |
| L2_TRANS_L1D_WB                                                                                                                                        | PMC1                                                                       | 112464863                                                                     | 99931184                                                                                  | 99982371                                                                                  | 99976697                                                                      |
| ICACHE_MISSES                                                                                                                                          | PMC2                                                                       | 21265                                                                         | 26233                                                                                     | 12646                                                                                     | 12363                                                                         |
|                                                                                                                                                        |                                                                            |                                                                               |                                                                                           |                                                                                           |                                                                               |
| statistics output omit<br><br>Metric                                                                                                                   | ted]<br>+<br>                                                              | Core 14                                                                       | +<br>Core 15                                                                              | +<br>Core 16                                                                              | Core 17                                                                       |
| <br>Metric                                                                                                                                             | ++<br> <br>+                                                               |                                                                               |                                                                                           | +                                                                                         |                                                                               |
| Metric<br>Runtime (RDTSC)                                                                                                                              | +<br> <br>+<br>[s]                                                         | 1.1314                                                                        | 1.1314                                                                                    | 1.1314                                                                                    | 1.1314                                                                        |
| Metric<br>Runtime (RDTSC)  <br>Runtime unhalted                                                                                                        | +<br> <br>+<br>[s]                                                         | 1.1314<br>1.0234                                                              | 1.1314  <br>1.2583                                                                        | 1.1314  <br>1.2586                                                                        | 1.1314<br>1.2587                                                              |
| Metric<br>Runtime (RDTSC)  <br>Runtime unhalted  <br>Clock [MHz]                                                                                       | +<br> <br>+<br>[s]                                                         | 1.1314<br>1.0234<br>2631.6699                                                 | 1.1314  <br>1.2583  <br>2626.4367                                                         | 1.1314  <br>1.2586  <br>2626.0579                                                         | 1.1314<br>1.2587<br>2626.0468                                                 |
| Metric<br>Runtime (RDTSC)  <br>Runtime unhalted  <br>Clock [MHz]<br>CPI                                                                                | <br> +<br>[s]  <br>[s]  <br>                                               | 1.1314<br>1.0234<br>2631.6699<br>1.8133                                       | 1.1314  <br>1.2583  <br>2626.4367  <br>1.4721                                             | 1.1314  <br>1.2586  <br>2626.0579  <br>1.5611                                             | 1.1314<br>1.2587<br>2626.0468<br>1.5544                                       |
| Metric<br>Runtime (RDTSC)  <br>Runtime unhalted  <br>Clock [MHz]<br>CPI<br>L2D load bandwidth [ME                                                      | [s]  <br>[s]  <br>[s]  <br>[s]  <br> <br>Bytes/s]                          | 1.1314<br>1.0234<br>2631.6699<br>1.8133<br>12042.7388                         | 1.1314  <br>1.2583  <br>2626.4367  <br>1.4721  <br>11343.8446                             | 1.1314  <br>1.2586  <br>2626.0579  <br>1.5611  <br>11335.0428                             | 1.1314<br>1.2587<br>2626.0468<br>1.5544<br>11334.9523                         |
| Metric<br>Runtime (RDTSC)  <br>Runtime unhalted  <br>Clock [MHz]<br>CPI                                                                                | [s]  <br>[s]  <br>[s]  <br>Bytes/s]  <br>[GBytes]                          | 1.1314<br>1.0234<br>2631.6699<br>1.8133<br>12042.7388<br>13.6256              | 1.1314  <br>1.2583  <br>2626.4367  <br>1.4721  <br>11343.8446  <br>12.8349                | 1.1314  <br>1.2586  <br>2626.0579  <br>1.5611  <br>11335.0428  <br>12.8249                | 1.1314<br>1.2587<br>2626.0468<br>1.5544<br>11334.9523<br>12.8248              |
| Metric<br>Runtime (RDTSC)  <br>Runtime unhalted  <br>Clock [MHz]<br>CPI<br>L2D load bandwidth [ME<br>L2D load data volume                              | [s]  <br>[s]  <br>[s]  <br>Bytes/s]  <br>[GBytes]  <br>Bytes/s]            | 1.1314<br>1.0234<br>2631.6699<br>1.8133<br>12042.7388<br>13.6256<br>6361.5883 | 1.1314  <br>1.2583  <br>2626.4367  <br>1.4721  <br>11343.8446  <br>12.8349                | 1.1314  <br>1.2586  <br>2626.0579  <br>1.5611  <br>11335.0428  <br>12.8249                | 1.1314<br>1.2587<br>2626.0468<br>1.5544<br>11334.9523<br>12.8248              |
| Metric<br>Runtime (RDTSC)  <br>Runtime unhalted  <br>Clock [MHz]<br>CPI<br>L2D load bandwidth [ME<br>L2D load data volume  <br>L2D evict bandwidth [ME | [s]  <br>[s]  <br>[s]  <br>[s]  <br>[GBytes/s]  <br>[GBytes]  <br>[GBytes] | 1.1314<br>1.0234<br>2631.6699<br>1.8133<br>12042.7388<br>13.6256<br>6361.5883 | 1.1314  <br>1.2583  <br>2626.4367  <br>1.4721  <br>11343.8446  <br>12.8349  <br>5652.6192 | 1.1314  <br>1.2586  <br>2626.0579  <br>1.5611  <br>11335.0428  <br>12.8249  <br>5655.5146 | 1.1314<br>1.2587<br>2626.0468<br>1.5544<br>11334.9523<br>12.8248<br>5655.1937 |

| < PROGRAM OUTPUT >>>><br>up 1: L2                                                                                                                                                            | ·/                                                                                           | — Alwa<br>measu<br>Intel C                                                                             | red for                                                                                                |                                                                                                        | red metrics<br>group)                                                                    |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------|
| Event   C                                                                                                                                                                                    | Counter                                                                                      | Core 14                                                                                                | Core 15                                                                                                | Core 16                                                                                                | Core 17                                                                                  |
| INSTR RETIRED ANY                                                                                                                                                                            | FIXC0                                                                                        | 1298031144                                                                                             | 1965945005                                                                                             | 1854182290                                                                                             | 1862521357                                                                               |
| PU_CLK_UNHALTED_CORE                                                                                                                                                                         | FIXCL                                                                                        | 2353698512                                                                                             | 2894134935                                                                                             | 2894645261                                                                                             | 2895023739                                                                               |
| CPU CLE UNHALTED REF                                                                                                                                                                         | FIXC2                                                                                        | 2057044629                                                                                             | 2534405765                                                                                             | 2535218217                                                                                             | 2535560434                                                                               |
| LID_REPLACEMENT                                                                                                                                                                              | PMC0                                                                                         | 212900444                                                                                              | 200544877                                                                                              | 200389272                                                                                              | 200387671                                                                                |
| L2_TRANS_L1D_WB                                                                                                                                                                              | PMC1                                                                                         | 112464863                                                                                              | 99931184                                                                                               | 99982371                                                                                               | 99976697                                                                                 |
| ICACHE MISSES                                                                                                                                                                                | PMC2                                                                                         | 21265                                                                                                  | 1 26233                                                                                                | 12646                                                                                                  | 12363                                                                                    |
| +                                                                                                                                                                                            |                                                                                              | +                                                                                                      | +                                                                                                      | +                                                                                                      | +                                                                                        |
| statistics output omitte<br>Metric                                                                                                                                                           |                                                                                              | Core 14                                                                                                | Core 15                                                                                                | ++<br>Core 16                                                                                          | Core 17                                                                                  |
| statistics output omitte                                                                                                                                                                     | ed]<br> <br>                                                                                 | ++                                                                                                     | ·++                                                                                                    | ++                                                                                                     | +                                                                                        |
| statistics output omitte<br>Metric                                                                                                                                                           | ed]<br> <br> <br> <br>                                                                       | ++<br>Core 14                                                                                          | Core 15                                                                                                | ++<br>Core 16                                                                                          | +<br>Core 17                                                                             |
| statistics output omitte<br>Metric<br>Runtime (RDTSC) [s]                                                                                                                                    | ed]<br> <br> <br> <br>                                                                       | Core 14  <br>1.1314                                                                                    | Core 15  <br>1.1314  <br>1.2583                                                                        | Core 16  <br>1.1314                                                                                    | Core 17                                                                                  |
| statistics output omitte<br>Metric<br>Runtime (RDTSC) [s]<br>Runtime unhalted [s]                                                                                                            | ed]<br> <br> <br> <br>                                                                       | Core 14  <br>1.1314  <br>1.0234                                                                        | Core 15  <br>1.1314  <br>1.2583                                                                        | Core 16  <br>1.1314  <br>1.2586                                                                        | Core 17<br>1.1314<br>1.2587                                                              |
| Statistics output omitte<br>Metric<br>Runtime (RDTSC) [s]<br>Runtime unhalted [s]<br>Clock [MHz]                                                                                             | ed]<br> <br> <br> <br> <br> <br> <br>                                                        | Core 14  <br>1.1314  <br>1.0234  <br>2631.6699                                                         | Core 15  <br>1.1314  <br>1.2583  <br>2626.4367  <br>1.4721                                             | Core 16  <br>1.1314  <br>1.2586  <br>2626.0579                                                         | Core 17<br>1.1314<br>1.2587<br>2626.0468                                                 |
| statistics output omitte<br>Metric<br>Runtime (RDTSC) [s]<br>Runtime unhalted [s]<br>Clock [MHz]<br>CPI                                                                                      | ed]<br> <br>                                    | Core 14  <br>1.1314  <br>1.0234  <br>2631.6699  <br>1.8133                                             | Core 15  <br>1.1314  <br>1.2583  <br>2626.4367  <br>1.4721                                             | Core 16  <br>1.1314  <br>1.2586  <br>2626.0579  <br>1.5611                                             | Core 17<br>1.1314<br>1.2587<br>2626.0468<br>1.5544                                       |
| statistics output omitte<br>Metric<br>Runtime (RDTSC) [s]<br>Runtime unhalted [s]<br>Clock [MHz]<br>CPI<br>L2D load bandwidth [MByt                                                          | ed]<br> <br> | Core 14  <br>1.1314  <br>1.0234  <br>2631.6699  <br>1.8133  <br>12042.7388  <br>13.6256                | Core 15  <br>1.1314  <br>1.2583  <br>2626.4367  <br>1.4721  <br>11343.8446  <br>12.8349                | Core 16  <br>1.1314  <br>1.2586  <br>2626.0579  <br>1.5611  <br>11335.0428  <br>12.8249                | Core 17<br>1.1314<br>1.2587<br>2626.0468<br>1.5544<br>11334.9523<br>12.8248              |
| statistics output omitte<br>Metric<br>Runtime (RDTSC) [s]<br>Runtime unhalted [s]<br>Clock [MHz]<br>CPI<br>L2D load bandwidth [MByt<br>L2D load data volume [GB                              | ed]<br> <br> | Core 14  <br>1.1314  <br>1.0234  <br>2631.6699  <br>1.8133  <br>12042.7388  <br>13.6256  <br>6361.5883 | Core 15  <br>1.1314  <br>1.2583  <br>2626.4367  <br>1.4721  <br>11343.8446  <br>12.8349                | Core 16  <br>1.1314  <br>1.2586  <br>2626.0579  <br>1.5611  <br>11335.0428  <br>12.8249                | Core 17<br>1.1314<br>1.2587<br>2626.0468<br>1.5544<br>11334.9523<br>12.8248              |
| statistics output omitte<br>Metric<br>Runtime (RDTSC) [s]<br>Runtime unhalted [s]<br>Clock [MHz]<br>CPI<br>L2D load bandwidth [MByt<br>L2D load data volume [GB<br>L2D evict bandwidth [MByt | ed]<br> <br> | Core 14  <br>1.1314  <br>1.0234  <br>2631.6699  <br>1.8133  <br>12042.7388  <br>13.6256  <br>6361.5883 | Core 15  <br>1.1314  <br>1.2583  <br>2626.4367  <br>1.4721  <br>11343.8446  <br>12.8349  <br>5652.6192 | Core 16  <br>1.1314  <br>1.2586  <br>2626.0579  <br>1.5611  <br>11335.0428  <br>12.8249  <br>5655.5146 | Core 17<br>1.1314<br>1.2587<br>2626.0468<br>1.5544<br>11334.9523<br>12.8248<br>5655.1937 |



### likwid-perfctr stethoscope mode

likwid-perfctr counts events on hardware threads it has no notion of what kind of code is running (if any)

This allows you to "listen" to what is currently happening, without any overhead:

- \$ likwid-perfctr -c N:0-11 -g FLOPS\_DP -S 10s
- It can be used as cluster/server monitoring tool
- A frequent use is to measure a certain part of a long running parallel application from outside

### likwid-perfctr stethoscope example

Using Roofline for monitoring "live" jobs on a cluster Based on measured BW and Flop/s data via likwid-perfctr



Cluster monitoring framework: ClusterCockpit <u>https://clustercockpit.org</u> Come visit Booth #1311 (LRZ)

### likwid-perfctr with MarkerAPI

- The MarkerAPI can restrict measurements to code regions
- The API only reads counters.
   The configuration of the counters is still done by likwid-perfctr
- Multiple named regions allowed, accumulation over multiple calls
- Inclusive and overlapping regions allowed
- Caveat: Marker API can cause overhead; do not call too frequently!
- Fortran API exists

```
#include <likwid-marker.h>
LIKWID_MARKER_INIT; // must be called from serial region
....
LIKWID_MARKER_START("Compute");
....
LIKWID_MARKER_STOP("Compute");
....
LIKWID_MARKER_START("Postprocess");
....
LIKWID_MARKER_STOP("Postprocess");
....
LIKWID_MARKER_CLOSE; // must be called from serial region
```

#### likwid-perfctr with MarkerAPI: source code transformations



### Compiling, linking, and running with marker API

#### Compile:

cc -I /path/to/likwid.h -DLIKWID\_PERFMON -c program.c

#### Link:

```
cc -L /path/to/liblikwid program.o -o program -llikwid
```

#### Run:

```
likwid-perfctr -C <CPULIST> -g <GROUP> -m ./program
```

#### MPI/hybrid:

likwid-mpirun -np 4 -pin <PINEXPR> -g <GROUP> -m ./program

#### $\rightarrow$ One separate block of output for every marked region

### Compiling, linking, and running with marker API



 $\rightarrow$  One separate block of output for every marked region

### Compiling, linking, and running with marker API



 $\rightarrow$  One separate block of output for every marked region

# Summary of hardware performance monitoring

- Useful only if you know what you are looking for
  - PM bears potential of acquiring massive amounts of data for nothing!
- Resource-based metrics are most useful
  - Cache lines transferred, work executed, loads/stores, cycles
  - Instructions, CPI, cache misses may be misleading
- Caveat: Processor work != user work
  - Waiting time in libraries (OpenMP, MPI) may cause lots of instructions
  - $\rightarrow$  distorted application characteristic
- Another very useful application of PM: validating performance models!
  - Roofline is data centric  $\rightarrow$  measure data volume through memory hierarchy





Friedrich-Alexander-Universität Erlangen-Nürnberg

### Case study:

### Tall & Skinny Matrix-Transpose Times Tall & Skinny Matrix (TSMTTSM) Multiplication



- Block of vectors → Tall & Skinny Matrix (e.g.  $10^7 \times 10^1$  dense matrix)
- Row-major storage format (see SpMVM)
- Block vector subspace orthogonalization procedure requires, e.g., computation of scalar product between vectors of two blocks

■ → TSMTTSM Mutliplication



#### Assume: $\alpha = 1$ ; $\beta = 0$

General rule for dense matrix-matrix multiply: Use vendor-optimized GEMM, (e.g., Intel MKL<sup>1</sup>):

$$C_{mn} = \sum_{k=1}^{N} A_{mk} B_{kn}$$
,  $m = 1...M, n = 1...N$ 

Matrix sizes: Square (SQ): M=N=K=15,000 Tall&Skinny (TS): M=N=16 ; K=10,000,000

General rule for dense matrix-matrix multiply: Use vendor-optimized GEMM, (e.g., Intel MKL<sup>1</sup>):  $\kappa$ 

$$C_{mn} = \sum_{k=1}^{N} A_{mk} B_{kn}$$
,  $m = 1...M, n = 1...N$ 

| System                | P <sub>peak</sub> [GF/s] | b <sub>S</sub> [GB/s] | Size | Perf.     | Efficiency |
|-----------------------|--------------------------|-----------------------|------|-----------|------------|
| Intel Xeon E5 2660 v2 | 176 GF/s                 | 52 GB/s               | SQ   | 160 GF/s  | 91%        |
| 10c@2.2 GHz           | 170 GF/S                 | 52 GD/S               | TS   | 16.6 GF/s | 6%         |
| Intel Xeon E5 2697 v3 |                          | 65 GB/s               | SQ   | 550 GF/s  | 95%        |
| 14c@2.6GHz            | 582 GF/s                 | 00 GB/S               | TS   | 22.8 GF/s | 4%         |

Matrix sizes: Square (SQ): M=N=K=15,000 Tall&Skinny (TS): M=N=16 ; K=10,000,000

General rule for dense matrix-matrix multiply: Use vendor-optimized GEMM, (e.g., Intel MKL<sup>1</sup>):  $\kappa$ 



General rule for dense matrix-matrix multiply: Use vendor-optimized GEMM, (e.g., Intel MKL<sup>1</sup>):  $\kappa$ 



### **TSMTTSM Roofline model**

Computational intensity #flops

#bytes (slowest data path)



### **TSMTTSM Roofline model**

Computational intensity  $I = \frac{\# flops}{\# bytes (slowest data path)}$ 



Optimistic model (minimum data transfer) assuming  $M = N \ll K$  and double precision:

$$I_d \approx \frac{2KMN}{8(KM+KN)}\frac{F}{B} = \frac{M}{8}\frac{F}{B}$$

complex double:

$$I_z \approx \frac{8KMN}{16(KM+KN)} \frac{F}{B} = \frac{M}{4} \frac{F}{B}$$

### **TSMTTSM Roofline performance prediction**

Now choose 
$$M = N = 16 \Rightarrow I_d \approx \frac{16}{8} \frac{F}{B}$$
 and  $I_z \approx \frac{16}{4} \frac{F}{B}$ , i.e.  $B_d \approx 0.5 \frac{B}{F}$ ,  $B_z \approx 0.25 \frac{B}{F}$   
Intel Xeon E5 2660 v2 ( $b_S = 52 \frac{GB}{s}$ )  $\Rightarrow P = 104 \frac{GF}{s}$  (double)  
Measured (MKL): 16.6  $\frac{GF}{s}$ 

Intel Xeon E5 2697 v3 ( $b_S = 65 \frac{GB}{s}$ )  $\rightarrow P = 240 \frac{GF}{s}$  (double complex) Measured (MKL): 22.8  $\frac{GF}{s}$ 

### **TSMTTSM Roofline performance prediction**

Now choose 
$$M = N = 16 \Rightarrow I_d \approx \frac{16}{8} \frac{F}{B}$$
 and  $I_z \approx \frac{16}{4} \frac{F}{B}$ , i.e.  $B_d \approx 0.5 \frac{B}{F}$ ,  $B_z \approx 0.25 \frac{B}{F}$   
Intel Xeon E5 2660 v2 ( $b_S = 52 \frac{GB}{s}$ )  $\Rightarrow P = 104 \frac{GF}{s}$  (double)  
Measured (MKL): 16.6  $\frac{GF}{s}$ 

Intel Xeon E5 2697 v3 ( $b_S = 65 \frac{GB}{s}$ )  $\rightarrow P = 240 \frac{GF}{s}$  (double complex) Measured (MKL): 22.8  $\frac{GF}{s}$ 

→ Potential speedup: 6–10x vs. MKL

```
i #pragma omp parallel
2 {
   double c_tmp[n*m] = \{0.\};
3
4
5 #pragma omp for
   for (int row = 0; row < k-1; row+=2) {
6
      for (int bcol = 0; bcol < n; bcol++) {
8 #pragma simd
        for (int acol = 0; acol < m; acol++) {</pre>
9
          c_tmp[bcol*m+acol] +=
10
            a[(row+0)*m + acol] * b[(row+0)*n + bcol] +
11
            a[(row+1)*m + acol] * b[(row+1)*n + bcol];
12
        }
13
      }
14
    }
15
16
17 #pragma omp critical
    for (int bcol = 0; bcol < n; bcol++) {
18
19 #pragma simd
      for (int acol = 0; acol < m; acol++) {</pre>
20
        c[bcol*m+acol] += c_tmp[bcol*m+acol];
21
      }
22
23
   }
24 }
```

```
i #pragma omp parallel
2 {
    double c_tmp[n*m] = \{0.\};
                                               Long Loop (k): Parallel
3
4
5 #pragma omp for
   for (int row = 0; row < k-1; row+=2) {
6
      for (int bcol = 0; bcol < n; bcol++) {
8 #pragma simd
        for (int acol = 0; acol < m; acol++) {</pre>
9
          c_tmp[bcol*m+acol] +=
10
            a[(row+0)*m + acol] * b[(row+0)*n + bcol] +
11
            a[(row+1)*m + acol] * b[(row+1)*n + bcol];
12
        }
13
14
      }
    }
15
16
17 #pragma omp critical
    for (int bcol = 0; bcol < n; bcol++) {
18
19 #pragma simd
      for (int acol = 0; acol < m; acol++) {</pre>
20
        c[bcol*m+acol] += c_tmp[bcol*m+acol];
21
      }
22
23
   }
24 }
```

```
Thread-local copy of small (results) matrix
i#pragma omp parallel
2 {
    double c_tmp[n*m] = \{0,\};
                                               Long Loop (k): Parallel
3
4
5 #pragma omp for
    for (int row = 0; row < k-1; row+=2) {
      for (int bcol = 0; bcol < n; bcol++) {
8 #pragma simd
        for (int acol = 0; acol < m; acol++) {</pre>
9
          c_tmp[bcol*m+acol] +=
10
            a[(row+0)*m + acol] * b[(row+0)*n + bcol] +
11
            a[(row+1)*m + acol] * b[(row+1)*n + bcol];
12
        }
13
14
      }
    3
15
16
17 #pragma omp critical
    for (int bcol = 0; bcol < n; bcol++) {
18
19 #pragma simd
      for (int acol = 0; acol < m; acol++) {</pre>
20
        c[bcol*m+acol] += c_tmp[bcol*m+acol];
21
      }
22
   }
23
24 }
```











Not shown: Inner Loop boundaries (n,m) known at compile time (kernel generation), k assumed to be even

### TSMTTSM MKL vs. "hand crafted" (OPT)

TS: M=N=16 ; K=10,000,000

| System                | P <sub>peak</sub> / b <sub>S</sub> | Version | Performance | <b>RLM Efficiency</b> |
|-----------------------|------------------------------------|---------|-------------|-----------------------|
| Intel Xeon E5 2660 v2 | 176 GF/s                           | TS OPT  | 98 GF/s     | 94 %                  |
| 10c@2.2 GHz           | 52 GB/s                            | TS MKL  | 16.6 GF/s   | 16 %                  |
| Intel Xeon E5 2697 v3 | 582 GF/s                           | TS OPT  | 159 GF/s    | 66 %                  |
| 14c@2.6GHz            | 65 GB/s                            | TS MKL  | 22.8 GF/s   | 9.5 %                 |

### TSMTTSM MKL vs. "hand crafted" (OPT)

TS: M=N=16 ; K=10,000,000

| System                | P <sub>peak</sub> / b <sub>S</sub> | Version | Performance | <b>RLM Efficiency</b> |
|-----------------------|------------------------------------|---------|-------------|-----------------------|
| Intel Xeon E5 2660 v2 | 176 GF/s                           | TS OPT  | 98 GF/s     | 94 %                  |
| 10c@2.2 GHz           | 52 GB/s                            | TS MKL  | 16.6 GF/s   | 16 %                  |
| Intel Xeon E5 2697 v3 | 582 GF/s                           | TS OPT  | 159 GF/s    | 66 %                  |
| 14c@2.6GHz            | 65 GB/s                            | TS MKL  | 22.8 GF/s   | 9.5 %                 |



# **TSMTTSM** conclusion

- Typical example of model-guided optimization
- "Invisible" P<sub>max</sub> ceiling with Intel MKL
- Hand-coded implementation ran much closer to limit
- Caveat: this is to exemplify the method; current MKL versions might have improved!





Friedrich-Alexander-Universität Erlangen-Nürnberg

# Case study: A Jacobi smoother

#### The basics in two dimensions



# **Stencil schemes**

- Stencil schemes frequently occur in PDE solvers on regular lattice structures
- Basically it is a sparse matrix vector multiply (spMVM) embedded in an iterative scheme (outer loop)
- ... but the regular access structure allows for matrix-free coding

```
do iter = 1, max_iterations
  Perform sweep over regular grid: y(:) ← x(:)
  Swap y ←→ x
enddo
```

- Complexity of implementation and performance depends on
  - stencil operator, e.g. Jacobi-type, Gauss-Seidel-type, ...
  - discretization, e.g. 7-pt or 27-pt in 3D,...











Appropriate performance metric: "Lattice site updates per second" [LUP/s] (here: Multiply by 4 FLOP/LUP to get FLOP/s rate)













**y(:**, :) : 1 WR+ 1 RD

 $\rightarrow$  B<sub>c</sub> = 5 Words / LUP = 40 B / LUP (assuming double precision)









Questions:









Friedrich-Alexander-Universität Erlangen-Nürnberg

# Case study: A Jacobi smoother

#### Layer conditions



Worst case: Cache not large enough to hold 3 layers (rows) of grid (assume "Least Recently Used" replacement strategy)



Worst case: Cache not large enough to hold 3 layers (rows) of grid (assume "Least Recently Used" replacement strategy)

|  |     |      |      |  |  |  | S    |
|--|-----|------|------|--|--|--|------|
|  |     | miss |      |  |  |  | cell |
|  | hit |      | miss |  |  |  | alo  |
|  |     | miss |      |  |  |  | Ĩ    |
|  |     |      |      |  |  |  |      |

k

Worst case: Cache not large enough to hold 3 layers (rows) of grid (assume "Least Recently Used" replacement strategy)



k

Worst case: Cache not large enough to hold 3 layers (rows) of grid (assume "Least Recently Used" replacement strategy)



Worst case: Cache not large enough to hold 3 layers (rows) of grid (assume "Least Recently Used" replacement strategy)

|  |  |     | miss |      |  |  |  |
|--|--|-----|------|------|--|--|--|
|  |  | hit |      | miss |  |  |  |
|  |  |     | miss |      |  |  |  |
|  |  |     |      |      |  |  |  |



Worst case: Cache not large enough to hold 3 layers (rows) of grid (assume "Least Recently Used" replacement strategy)

|  |     | miss |      |  |  |  |
|--|-----|------|------|--|--|--|
|  | hit |      | miss |  |  |  |
|  |     | miss |      |  |  |  |
|  |     |      |      |  |  |  |



Reduce inner (j-) loop dimension successively

|  |     | miss |      |  |               |       |       |     |
|--|-----|------|------|--|---------------|-------|-------|-----|
|  | hit |      | miss |  |               |       |       |     |
|  |     | miss |      |  |               |       |       |     |
|  |     |      |      |  |               |       |       |     |
|  |     |      |      |  | <b>x (0:j</b> | max1+ | 1,0:k | max |
|  |     |      |      |  |               |       |       |     |
|  |     |      |      |  |               |       |       |     |
|  |     | miss |      |  |               |       |       |     |
|  | hit |      | miss |  |               |       |       |     |
|  |     | miss |      |  |               |       |       |     |
|  |     |      |      |  |               |       |       |     |
|  |     |      |      |  |               |       |       |     |
|  |     |      |      |  |               |       |       |     |
|  |     |      |      |  |               |       |       |     |
|  |     |      |      |  |               |       |       |     |
|  |     | miss |      |  |               |       |       |     |

hit

hit

hit



Reduce inner (j-) loop dimension successively



Best case: 3 "layers" of grid fit into the cache!





enddo











Layer condition:

- Does not depend on outer loop length (kmax)
- No strict guideline (cache associativity, data traffic for y not included)
- Needs to be adapted for other stencils (e.g., long-range stencils)

3 \* jmax \* 8B < CacheSize/2 Layer condition fulfilled?



# Analyzing the data flow: Layer condition







Friedrich-Alexander-Universität Erlangen-Nürnberg

# Case study: A Jacobi smoother

#### Optimization by spatial blocking



- How can we enforce a layer condition for all domain sizes ?
- Idea: Spatial blocking
  - Reuse elements of x () as long as they stay in cache
  - Sweep can be executed in any order, e.g. compute blocks in j-direction

- How can we enforce a layer condition for all domain sizes ?
- Idea: Spatial blocking
  - Reuse elements of x () as long as they stay in cache
  - Sweep can be executed in any order, e.g. compute blocks in j-direction

```
"Spatial Blocking" of j-loop:
```

- How can we enforce a layer condition for all domain sizes ?
- Idea: Spatial blocking
  - Reuse elements of x () as long as they stay in cache
  - Sweep can be executed in any order, e.g. compute blocks in j-direction

```
"Spatial Blocking" of j-loop:
```

- How can we enforce a layer condition for all domain sizes ?
- Idea: Spatial blocking
  - Reuse elements of x () as long as they stay in cache
  - Sweep can be executed in any order, e.g. compute blocks in j-direction

```
"Spatial Blocking" of j-loop:
```

Determine for given CacheSize an appropriate jblock value:

jblock < CacheSize / 48B</pre>

Split domain into subblocks:

| Split<br>domain into<br>subblocks: |  |  |  |  |  |  |
|------------------------------------|--|--|--|--|--|--|
|                                    |  |  |  |  |  |  |
|                                    |  |  |  |  |  |  |
|                                    |  |  |  |  |  |  |
|                                    |  |  |  |  |  |  |







Node-Level Performance Engineering



Node-Level Performance Engineering



Intel Compiler 2022.1.0 Intel Xeon Platinum 8360Y ("IcelakeSP"@2.4 GHz)





Node-Level Performance Engineering





#### Validating the model: Memory code balance





Intel Compiler 2022.1.0 Intel Xeon Platinum 8360Y ("IcelakeSP"@2.4 GHz)

#### Validating the model: Memory code balance



40 Measured main memory code balance  $(B_C)$  [Byte/LUP] 32 24 Blocking factor still a 16 little too large CS=inf 8 CS=1.25MB jblock=400K CS=54MB 0 1x10<sup>7</sup> 1x10<sup>6</sup> 1000 100000 10000 jmax

Intel Compiler 2022.1.0 Intel Xeon Platinum 8360Y ("IcelakeSP"@2.4 GHz)

#### Validating the model: Memory code balance



Node-Level Performance Engineering





a) Long-range r = 2: 5 layers (2r + 1)



- a) Long-range r = 2: 5 layers (2r + 1)
- b) Asymmetric: 3 layers



- a) Long-range r = 2: 5 layers (2r + 1)
- b) Asymmetric: 3 layers
- c) 2D box: 3 layers





Friedrich-Alexander-Universität Erlangen-Nürnberg

# Case study: A Jacobi smoother

#### **OpenMP** parallelization



Straightforward OpenMP work sharing:

Straightforward OpenMP work sharing:

Straightforward OpenMP work sharing:

■ Caveat: LC must be fulfilled per thread → shared cache causes smaller blocks!

#### Straightforward OpenMP work sharing:



■ Caveat: LC must be fulfilled per thread → shared cache causes smaller blocks!



# OpenMP parallelization and blocking for a shared cache



# OpenMP parallelization and blocking for a shared cache



# OpenMP parallelization and blocking for a shared cache



- We have made sense of the memory-bound performance vs. problem size
  - "Layer conditions" lead to predictions of code balance
  - "What part of the data comes from where" is a crucial question
  - The model works only if the bandwidth is "saturated"
    - In-cache modeling is more involved

- We have made sense of the memory-bound performance vs. problem size
  - "Layer conditions" lead to predictions of code balance
  - "What part of the data comes from where" is a crucial question
  - The model works only if the bandwidth is "saturated"
    - In-cache modeling is more involved
- Avoiding slow data paths == re-establishing the most favorable layer condition
  - Improved code showed the predicted speedup
  - Optimal blocking factor can be estimated

- We have made sense of the memory-bound performance vs. problem size
  - "Layer conditions" lead to predictions of code balance
  - "What part of the data comes from where" is a crucial question
  - The model works only if the bandwidth is "saturated"
    - In-cache modeling is more involved
- Avoiding slow data paths == re-establishing the most favorable layer condition
  - Improved code showed the predicted speedup
  - Optimal blocking factor can be estimated
- Food for thought
  - Higher dimensions (beyond 2D)?
  - Multi-dimensional loop blocking would it make sense?
  - Can we choose a "better" OpenMP loop schedule?
  - What about temporal blocking?

- J. Hammer, G. Hager, J. Eitzinger, and G. Wellein: Automatic Loop Kernel Analysis and Performance Modeling With Kerncraft. Proc. <u>PMBS15</u>, the 6th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, in conjunction with ACM/IEEE Supercomputing 2015 (<u>SC15</u>), November 16, 2015, Austin, TX. <u>DOI: 10.1145/2832087.2832092</u>, Preprint: <u>arXiv:1509.03778</u>
- H. Stengel, J. Treibig, G. Hager, and G. Wellein: Quantifying performance bottlenecks of stencil computations using the Execution-Cache-Memory model. Proc. <u>ICS15</u>, <u>DOI: 10.1145/2751205.2751240</u>, Preprint: <u>arXiv:1410.5010</u>
- M. Wittmann, G. Hager, T. Zeiser, J. Treibig, and G. Wellein: Chip-level and multi-node analysis of energy-optimized lattice-Boltzmann CFD simulations. Concurrency and Computation: Practice and Experience (2015). <u>DOI:10.1002/cpe.3489</u> Preprint: <u>arXiv:1304.7664</u>
- J. Treibig, G. Wellein and G. Hager: Efficient multicore-aware parallelization strategies for iterative stencil computations. Journal of Computational Science 2 (2), 130-137 (2011). <u>DOI 10.1016/j.jocs.2011.01.010</u>
- M. Wittmann, G. Hager, J. Treibig and G. Wellein: Leveraging shared caches for parallel temporal blocking of stencil codes on multicore processors and clusters. Parallel Processing Letters 20 (4), 359-376 (2010).
- G. Wellein, G. Hager, T. Zeiser, M. Wittmann and H. Fehske: Efficient temporal blocking for stencil computations by multicore-aware wavefront parallelization. Proc. COMPSAC 2009. DOI: 10.1109/COMPSAC.2009.82





Friedrich-Alexander-Universität Erlangen-Nürnberg

# Case study: Sparse Matrix-Vector Multiplication



# Sparse Matrix Vector Multiplication (SpMV)

- Key ingredient in some matrix diagonalization algorithms
  - Lanczos, Davidson, Jacobi-Davidson
- Store only N<sub>nz</sub> nonzero elements of matrix and RHS, LHS vectors with N<sub>r</sub> (number of matrix rows) entries
- "Sparse": N<sub>nz</sub> ~ N<sub>r</sub>
- Average number of nonzeros per row:  $N_{nzr} = N_{nz}/N_r$



# Sparse Matrix Vector Multiplication (SpMV)

- Key ingredient in some matrix diagonalization algorithms
  - Lanczos, Davidson, Jacobi-Davidson
- Store only N<sub>nz</sub> nonzero elements of matrix and RHS, LHS vectors with N<sub>r</sub> (number of matrix rows) entries
- "Sparse": N<sub>nz</sub> ~ N<sub>r</sub>
- Average number of nonzeros per row:  $N_{nzr} = N_{nz}/N_r$



### **SpMVM** characteristics

- For large problems, SpMV is inevitably memory-bound
  - Intra-socket saturation effect on modern multicores
- SpMV is easily parallelizable in shared and distributed memory
  - Load balancing
  - Communication overhead
- Data storage format is crucial for performance properties
  - Most useful general format on CPUs: Compressed Row Storage (CRS)
  - Depending on compute architecture







val[] stores all the nonzeros (length N<sub>nz</sub>)





- val[] stores all the nonzeros (length N<sub>nz</sub>)
- col\_idx[] stores the column index of each nonzero (length N<sub>nz</sub>)





- val[] stores all the nonzeros (length N<sub>nz</sub>)
- col\_idx[] stores the column index of each nonzero (length N<sub>nz</sub>)
- row\_ptr[] stores the starting index of each new row in val[] (length: N<sub>r</sub>)



# Case study: Sparse matrix-vector multiply

- Strongly memory-bound for large data sets
  - Streaming, with partially indirect access:

```
do i = 1,Nr
do j = row_ptr(i), row_ptr(i+1) - 1
C(i) = C(i) + val(j) * B(col_idx(j))
enddo
enddo
```

Usually many spMVMs required to solve a problem

Now let's look at some performance measurements...

### Case study: Sparse matrix-vector multiply

- Strongly memory-bound for large data sets
  - Streaming, with partially indirect access:

```
!$OMP parallel do schedule(???)
do i = 1,Nr
  do j = row_ptr(i), row_ptr(i+1) - 1
    C(i) = C(i) + val(j) * B(col_idx(j))
  enddo
enddo
!$OMP end parallel do
```

Usually many spMVMs required to solve a problem

Now let's look at some performance measurements...

- Strongly memory-bound for large data sets → saturating performance across cores on the chip
- Performance seems to depend on the matrix
- Can we explain this?
- Is there a "light speed" for SpMV?

Optimization?



- Strongly memory-bound for large data sets → saturating performance across cores on the chip
- Performance seems to depend on the matrix
- Can we explain this?
- Is there a "light speed" for SpMV?

Optimization?



```
do i = 1, N<sub>r</sub>
  do j = row_ptr(i), row_ptr(i+1) - 1
    C(i) = C(i) + val(j) * B(col_idx(j))
    enddo
enddo
```

```
real*8 val(N<sub>nz</sub>)
integer*4 col_idx(N<sub>nz</sub>)
integer*4 row_ptr(N<sub>r</sub>)
real*8 C(N<sub>r</sub>)
real*8 B(N<sub>c</sub>)
```

```
Min. load traffic [B]: (8 + 4) N_{nz} + (4 + 8)N_r + 8 N_c

Min. store traffic [B]: 8 N_r

Total FLOP count [F]: 2 N_{nz}
```

```
do i = 1, N<sub>r</sub>
  do j = row_ptr(i), row_ptr(i+1) - 1
    C(i) = C(i) + val(j) * B(col_idx(j))
  enddo
enddo
```

real\*8 val(N<sub>nz</sub>)
integer\*4 col\_idx(N<sub>nz</sub>)
integer\*4 row\_ptr(N<sub>r</sub>)
real\*8 C(N<sub>r</sub>)
real\*8 B(N<sub>c</sub>)

Min. load traffic [B]: 
$$(8 + 4) N_{nz} + (4 + 8)N_r + 8 N_c$$
  
Min. store traffic [B]:  $8 N_r$   
Total FLOP count [F]:  $2 N_{nz}$ 

$$B_{C,min} = \frac{12 N_{nz} + 20 N_r + 8 N_c}{2 N_{nz}} \frac{B}{F} =$$

```
do i = 1, N<sub>r</sub> real*{
  do j = row_ptr(i), row_ptr(i+1) - 1 intege
  C(i) = C(i) + val(j) * B(col_idx(j)) intege
  enddo
  enddo
  real*{
```

real\*8 val(N<sub>nz</sub>)
integer\*4 col\_idx(N<sub>nz</sub>)
integer\*4 row\_ptr(N<sub>r</sub>)
real\*8 C(N<sub>r</sub>)
real\*8 B(N<sub>c</sub>)

Min. load traffic [B]: 
$$(8 + 4) N_{nz} + (4 + 8)N_r + 8 N_c$$
  
Min. store traffic [B]:  $8 N_r$   
Total FLOP count [F]:  $2 N_{nz}$ 

$$B_{C,min} = \frac{12 N_{nz} + 20 N_r + 8 N_c}{2 N_{nz}} \frac{B}{F} = \frac{12 + 20/N_{nzr} + 8/N_{nzc}}{2} \frac{B}{F}$$
Nonzeros per row  $(N_{nzr} = \frac{N_{nz}}{N_r})$  or column  $(N_{nzc} = \frac{N_{nz}}{N_c})$ 

real\*8 val(N<sub>nz</sub>)
integer\*4 col\_idx(N<sub>nz</sub>)
integer\*4 row\_ptr(N<sub>r</sub>)
real\*8 C(N<sub>r</sub>)
real\*8 B(N<sub>c</sub>)

Min. load traffic [B]: 
$$(8 + 4) N_{nz} + (4 + 8)N_r + 8 N_c$$
  
Min. store traffic [B]:  $8 N_r$   
Total FLOP count [F]:  $2 N_{nz}$ 

$$B_{C,min} = \frac{12 N_{nz} + 20 N_r + 8 N_c}{2 N_{nz}} \frac{B}{F} = \frac{12 + 20/N_{nzr} + 8/N_{nzc}}{2} \frac{B}{F}$$
Nonzeros per row  $(N_{nzr} = N_{nz}/N_r)$  or column  $(N_{nzc} = N_{nz}/N_c)$ 
Lower bound for code balance:  $B_{C,min} \ge 6 \frac{B}{F} \rightarrow I_{max} \le \frac{1}{6} \frac{F}{B}$ 



$$B_{C,min} = \frac{12 + 20/N_{nzr} + 8/N_{nzc}}{2} \frac{B}{F}$$



$$B_{C,min} = \frac{12 + 20/N_{nzr} + 8/N_{nzc}}{2} \frac{B}{F}$$

$$B_{C,min} = \frac{12 + 20/N_{nzr} + 8/N_{nzc}}{2} \frac{B}{F}$$
$$B_{C}(\alpha) = \frac{12 + 20/N_{nzr} + 8\alpha}{2} \frac{B}{F}$$



Parameter ( $\alpha$ ) quantifies additional traffic for **B(:)** (irregular access):

$$\alpha \ge \frac{1}{N_{nzc}}$$

 $\alpha N_{nzc} \geq 1$ 

$$B_{C,min} = \frac{12 + 20/N_{nzr} + 8/N_{nzc}}{2} \frac{B}{F}$$
$$B_{C}(\alpha) = \frac{12 + 20/N_{nzr} + 8\alpha}{2} \frac{B}{F}$$

Consider square matrices:  $N_{nzc} = N_{nzr}$  and  $N_c = N_r$ Note:  $B_C (1/N_{nzr}) = B_{C,min}$ 



Parameter ( $\alpha$ ) quantifies additional traffic for **B(:)** (irregular access):

$$\alpha \ge \frac{1}{N_{nzc}}$$

$$\alpha N_{nzc} \geq 1$$

# The " $\alpha$ effect"

- DP CRS code balance
- α quantifies the traffic for loading the RHS
  - $\alpha = 0 \rightarrow \text{RHS}$  is in cache
  - $\alpha = 1/N_{nzr}$   $\rightarrow$  RHS loaded once
  - $\alpha = 1 \rightarrow$  no cache
  - $\alpha > 1 \rightarrow$  Houston, we have a problem!
- "Target" performance =  $b_S/B_c$
- Caveat: Maximum memory BW may not be achieved with spMVM (see later)

$$B_C(\alpha) = \frac{12 + 20/N_{nzr} + 8\alpha}{2} \frac{B}{F}$$
$$= \left(6 + 4\alpha + \frac{10}{N_{nzr}}\right) \frac{B}{F}$$

# The " $\alpha$ effect"

- DP CRS code balance
- α quantifies the traffic for loading the RHS
  - $\alpha = 0 \rightarrow \text{RHS}$  is in cache
  - $\alpha = 1/N_{nzr}$   $\rightarrow$  RHS loaded once
  - $\alpha = 1 \rightarrow$  no cache
  - $\alpha > 1 \rightarrow$  Houston, we have a problem!
- "Target" performance =  $b_S/B_c$
- Caveat: Maximum memory BW may not be achieved with spMVM (see later)
- Can we predict  $\alpha$ ?
- Not in general
- Simple cases (banded, block-structured): Similar to layer condition analysis

 $\rightarrow$  Determine  $\alpha$  by measuring the actual memory traffic ( $\rightarrow$  measured code balance  $B_C^{meas}$ )

 $B_{C}(\alpha) = \frac{12 + 20/N_{nzr} + 8\alpha}{2} \frac{B}{F}$  $= \left(6 + 4\alpha + \frac{10}{N_{nzr}}\right) \frac{B}{F}$ 

$$B_C(\alpha) = \left(6 + 4\alpha + \frac{10}{N_{nzr}}\right) \frac{B}{F} = \frac{V_{meas}}{N_{nz} \cdot 2F} \quad (= B_C^{meas})$$

- V<sub>meas</sub> is the measured overall memory data traffic (using, e.g., likwid-perfctr)
- Solve for *α*:

$$\alpha = \frac{1}{4} \left( \frac{V_{meas}}{N_{nz} \cdot 2 \text{ bytes}} - 6 - \frac{10}{N_{nzr}} \right)$$

$$B_C(\alpha) = \left(6 + 4\alpha + \frac{10}{N_{nzr}}\right) \frac{B}{F} = \frac{V_{meas}}{N_{nz} \cdot 2F} \quad (= B_C^{meas})$$

- V<sub>meas</sub> is the measured overall memory data traffic (using, e.g., likwid-perfctr)
- Solve for α:

$$\alpha = \frac{1}{4} \left( \frac{V_{meas}}{N_{nz} \cdot 2 \text{ bytes}} - 6 - \frac{10}{N_{nzr}} \right)$$

Example: kkt\_power matrix from the UoF collection on one Intel SNB socket

• 
$$N_{nz} = 14.6 \cdot 10^6$$
,  $N_{nzr} = 7.1$ 

•  $V_{meas} \approx 258 \text{ MB}$ 

$$\rightarrow \alpha = 0.36, \, \alpha N_{nzr} = 2.5$$

 $\rightarrow$  RHS is loaded 2.5 times from memory

$$B_C(\alpha) = \left(6 + 4\alpha + \frac{10}{N_{nzr}}\right) \frac{B}{F} = \frac{V_{meas}}{N_{nz} \cdot 2F} \quad (= B_C^{meas})$$

- V<sub>meas</sub> is the measured overall memory data traffic (using, e.g., likwid-perfctr)
- Solve for α:

$$\alpha = \frac{1}{4} \left( \frac{V_{meas}}{N_{nz} \cdot 2 \text{ bytes}} - 6 - \frac{10}{N_{nzr}} \right)$$

Example: kkt\_power matrix from the UoF collection on one Intel SNB socket

• 
$$N_{nz} = 14.6 \cdot 10^6$$
,  $N_{nzr} = 7.1$ 

•  $V_{meas} \approx 258 \text{ MB}$ 

$$\rightarrow \alpha = 0.36, \, \alpha N_{nzr} = 2.5$$

 $\rightarrow$  RHS is loaded 2.5 times from memory

$$> \frac{B_C(\alpha)}{B_{C,min}} = 1.11$$

$$B_C(\alpha) = \left(6 + 4\alpha + \frac{10}{N_{nzr}}\right) \frac{B}{F} = \frac{V_{meas}}{N_{nz} \cdot 2F} \quad (= B_C^{meas})$$

- V<sub>meas</sub> is the measured overall memory data traffic (using, e.g., likwid-perfctr)
- Solve for α:

$$\alpha = \frac{1}{4} \left( \frac{V_{meas}}{N_{nz} \cdot 2 \text{ bytes}} - 6 - \frac{10}{N_{nzr}} \right)$$

Example: kkt\_power matrix from the UoF collection on one Intel SNB socket

• 
$$N_{nz} = 14.6 \cdot 10^6$$
,  $N_{nzr} = 7.1$ 

•  $V_{meas} \approx 258 \text{ MB}$ 

$$\rightarrow \alpha = 0.36, \, \alpha N_{nzr} = 2.5$$

 $\rightarrow$  RHS is loaded 2.5 times from memory

$$\frac{B_{C}(\alpha)}{B_{C,min}} = 1.11$$

$$\frac{11\% \text{ extra traffic } \rightarrow}{\text{optimization potential!}}$$

#### Three different sparse matrices

Benchmark system: Intel Xeon Ivy Bridge E5-2660v2, 2.2 GHz,  $b_S = 46.6 \text{ GB/s}$ 

→ Roofline:  $P_{opt} = {}^{b_S} / {}_{B_{C,min}}$ 

| Matrix    | Ν         | N <sub>nzr</sub> | B <sub>C,min</sub> [B/F] | $P_{opt}$ [GF/s] |
|-----------|-----------|------------------|--------------------------|------------------|
| DLR1      | 278,502   | 143              | 6.1                      | 7.64             |
| scai1     | 3,405,035 | 7.0              | 8.0                      | 5.83             |
| kkt_power | 2,063,494 | 7.08             | 8.0                      | 5.83             |







•  $b_S = 46.6 \text{ GB/s}$ ,  $B_c = 6 \text{ B/F}$ 

Maximum spMVM performance:

 $P_{max} = 7.8 \,\mathrm{GF/s}$ 

 DLR1 causes (almost) minimum CRS code balance (as expected)



•  $b_S = 46.6 \,\text{GB/s}$ ,  $B_c = 6 \,\text{B/F}$ 

Maximum spMVM performance:

 $P_{max} = 7.8 \,\mathrm{GF/s}$ 

- DLR1 causes (almost) minimum CRS code balance (as expected)
- scai1 measured balance:

 $B_c^{meas} \approx 8.5 \text{ B/F} > B_{C,min}$  (6% higher than min)

 $\rightarrow$  good BW utilization, slightly non-optimal  $\alpha$ 





#### Investigating the load imbalance with kkt\_power



#### Investigating the load imbalance with kkt\_power



#### Investigating the load imbalance with kkt\_power



#### SpMV node performance model – CPU



Matrices taken from: C. L. Alappat, N. Meyer, J. Laukemann, T. Gruber, G. Hager, G. Wellein, and T. Wettig: *ECM modeling and performance tuning* of SpMV and Lattice QCD on A64FX. Concurrency and Computation: Practice and Experience, e6512 (2021). DOI: <u>10.1002/cpe.6512</u>

#### SpMV node performance model – CPU



Matrices taken from: C. L. Alappat, N. Meyer, J. Laukemann, T. Gruber, G. Hager, G. Wellein, and T. Wettig: *ECM modeling and performance tuning of SpMV and Lattice QCD on A64FX.* Concurrency and Computation: Practice and Experience, e6512 (2021). DOI: <u>10.1002/cpe.6512</u>

#### Roofline analysis for spMVM

- Conclusion from the Roofline analysis
  - The roofline model does not "work" for spMVM due to the RHS traffic uncertainties
  - We have "turned the model around" and measured the actual memory traffic to determine the RHS overhead
  - Result indicates:
    - 1. how much actual traffic the RHS generates
    - 2. how efficient the RHS access is (compare BW with max. BW)
    - 3. how much optimization potential we have with matrix reordering
- Do not forget about load balancing!
- SpMV is not the end of the story:  $A \times \{x^1, x^2, ...\}, A^p x, ...$
- Consequence: Modeling is not always 100% predictive. It's all about *learning more* about performance properties!

### Some publications

- C. Alappat, J. Thies, G. Hager, H. Fehske, and G. Wellein: Algebraic Temporal Blocking for Sparse Iterative Solvers on Multi-Core CPUs. Submitted. Preprint: <u>arXiv:2309.02228</u>
- C. L. Alappat, G. Hager, O. Schenk, and G. Wellein: Level-based Blocking for Sparse Matrices: Sparse Matrix-Power-Vector Multiplication. IEEE Transactions on Parallel and Distributed Systems 34(2), 581-597 (2023), DOI: <u>10.1109/TPDS.2022.3223512</u>
- C. L. Alappat, N. Meyer, J. Laukemann, T. Gruber, G. Hager, G. Wellein, and T. Wettig: *ECM modeling and performance tuning of SpMV and Lattice QCD on A64FX.* Concurrency and Computation: Practice and Experience, e6512 (2021). Available with Open Access. DOI: <u>10.1002/cpe.65</u>
- C. L. Alappat, G. Hager, O. Schenk, J. Thies, A. Basermann, A. R. Bishop, H. Fehske, and G. Wellein: A Recursive Algebraic Coloring Technique for Hardware-Efficient Symmetric Sparse Matrix-Vector Multiplication. ACM Trans. Parallel Comput. 7(3), Article 19 (June 2020), 37 pages. Available with Open Access. DOI: <u>10.1145/3399732</u>.
- M. Kreutzer, G. Hager, G. Wellein, H. Fehske, and A. R. Bishop: A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units.
   SIAM Journal on Scientific Computing 36(5), C401–C423 (2014). DOI: 10.1137/130930352





Friedrich-Alexander-Universität Erlangen-Nürnberg

optional

# Sparse Matrix-Vector Multiplication on GPGPUs



### What about GPUs?

- GPUs need
  - Enough work per kernel launch in order to leverage their parallelism
  - Coalesced access to memory (consecutive threads in a warp should access consecutive memory addresses)



### What about GPUs?

- GPUs need
  - Enough work per kernel launch in order to leverage their parallelism
  - Coalesced access to memory (consecutive threads in a warp should access consecutive memory addresses)
- Plain CRS for SpMV on GPUs is not a good idea
  - 1. Short inner loop
  - 2. Different amount of work per thread
  - 3. Non-coalesced memory access



### What about GPUs?

- GPUs need
  - Enough work per kernel launch in order to leverage their parallelism
  - Coalesced access to memory (consecutive threads in a warp should access consecutive memory addresses)
- Plain CRS for SpMV on GPUs is not a good idea
  - 1. Short inner loop
  - 2. Different amount of work per thread
  - 3. Non-coalesced memory access
- Remedy: Use SIMD/SIMT-friendly storage format
  - ELLPACK, SELL-C-σ, DIA, ESB,...



### CRS SpMV in CUDA (y = Ax)

```
template <typename VT, typename IT>
global static void
spmv csr(const ST num rows,
          const IT * RESTRICT row_ptrs, const IT * RESTRICT col_idxs,
          const VT * RESTRICT values, const VT * RESTRICT x,
                                                  VT * RESTRICT \mathbf{v})
{
    ST row = threadIdx.x + blockDim.x * blockIdx.x; // 1 thread per row
    if (row < num rows) {</pre>
        VT sum{};
        for (IT j = row_ptrs[row]; j < row_ptrs[row + 1]; ++j) {</pre>
             sum += values[j] * x[col idxs[j]];
        y[row] = sum;
                                                          B_c(\alpha) = \left(6 + 4\alpha + \frac{6}{N_m}\right)\frac{B}{F}
```

No write-allocate on GPUs for consecutive stores

CRS (1 thread per row)







- Strong "α effect" large deviation from optimal α for many matrices
  - Many cache lines touched b/c every thread handles one row → bad cache usage



- Strong "α effect" large deviation from optimal α for many matrices
  - Many cache lines touched b/c every thread handles one row → bad cache usage
- Mediocre memory bandwidth usage (« 1400 GB/s) in many cases
  - Non-coalesced memory access
  - Imbalance across rows/threads of warps

### SELL-C- $\sigma$

Idea

M. Kreutzer et al.: A Unified Sparse Matrix Data Format For Efficient General Sparse Matrix-vector Multiplication On Modern Processors With Wide SIMD Units, SIAM SISC 2014, DOI: <u>10.1137/130930352</u>

- Sort rows according to length within sorting scope  $\sigma$
- Store nonzeros column-major in zero-padded chunks of height C





### SELL-C- $\sigma$

Idea

M. Kreutzer et al.: A Unified Sparse Matrix Data Format For Efficient General Sparse Matrix-vector Multiplication On Modern Processors With Wide SIMD Units, SIAM SISC 2014, DOI: <u>10.1137/130930352</u>

- Sort rows according to length within sorting scope  $\sigma$
- Store nonzeros column-major in zero-padded chunks of height C



### SELL-C- $\sigma$ SpMV in CUDA (y=Ax)

```
ST row = threadIdx.x + blockDim.x * blockIdx.x;
ST c = row / C; // the no. of the chunk
ST idx = row % C; // index inside the chunk
```

```
if (row < n_chunks * C) {
    VT tmp{};
    IT cs = chunk_ptrs[c]; // points to start indices of chunks</pre>
```

```
for (ST j = 0; j < chunk_lengths[c]; ++j) {
    tmp += values[cs + idx] * x[col_idxs[cs + idx]];
    cs += C;
}
y[row] = tmp;</pre>
```



$$B_{SELL}(\alpha, \beta, N_{nzr}) = \left(\frac{1}{\beta} \left(\frac{8+4}{2}\right) + \frac{8\alpha + \beta(8+4/C)/N_{nzr}}{2}\right) \frac{\text{bytes}}{\text{flop}}$$
$$= \left(\frac{6}{\beta} + 4\alpha + \frac{\beta(4+2/C)}{N_{nzr}}\right) \frac{\text{bytes}}{\text{flop}}$$
$$\text{Optimal } \alpha = \frac{\beta}{N_{nzr}}$$









When measuring  $B_C^{meas}$ , take care to use the "useful" number of flops (excluding zero padding) for work

### How to choose the parameters *C* and $\sigma$ on GPUs?

- *C* 
  - n × warp size to allow good utilization of GPU threads and cache lines

#### • *o*

- As small as possible, as large as necessary
- Large  $\sigma$  reduces zero padding (brings  $\beta$  closer to 1)
- Sorting alters RHS access pattern  $\rightarrow \alpha$  depends on  $\sigma$



### SpMV node performance model – GPU

CRS (1 thread per row)

NVIDIA Ampere A100



$$b_{S} = 1400 \text{ GB/s}$$

## SpMV node performance model – GPU



NVIDIA Ampere A100

 $b_{S} = 1400 \text{ GB/s}$ 

## SpMV node performance model – GPU









Friedrich-Alexander-Universität Erlangen-Nürnberg

## Single Instruction Multiple Data (SIMD) processing



#### SIMD terminology

#### A word on terminology

- SIMD == "one instruction → several operations"
- "SIMD width" == number of operands that fit into a register
- No statement about parallelism among those operations
- Original vector computers: long registers, pipelined execution, but no parallelism (within the instruction)

#### SIMD terminology

#### A word on terminology

- SIMD == "one instruction → several operations"
- "SIMD width" == number of operands that fit into a register
- No statement about parallelism among those operations
- Original vector computers: long registers, pipelined execution, but no parallelism (within the instruction)



#### SIMD terminology

#### A word on terminology

- SIMD == "one instruction → several operations"
- "SIMD width" == number of operands that fit into a register
- No statement about parallelism among those operations
- Original vector computers: long registers, pipelined execution, but no parallelism (within the instruction)



#### Today

- x86: most SIMD instructions fully parallel
  - "Short Vector SIMD"
  - Some exceptions on some architectures (e.g., vdivpd)
- NEC Tsubasa: 32-way parallelism but SIMD width = 256 (DP)

```
for (int j=0; j<size; j++) {
        A[j] = B[j] + C[j];
}</pre>
```

#### **Register width**

1 operand









```
for (int j=0; j<size; j++) {
        A[j] = B[j] + C[j];
}</pre>
```

**Register width** 

1 operand





```
for (int j=0; j<size; j++) {
        A[j] = B[j] + C[j];
}</pre>
```

**Register width** 

1 operand









```
for (int j=0; j<size; j++) {
        A[j] = B[j] + C[j];
}</pre>
```

#### **Register width**

1 operand







```
for (int j=0; j<size; j++) {
        A[j] = B[j] + C[j];
}</pre>
```

## **Register width**

1 operand









```
for (int j=0; j<size; j++) {
        A[j] = B[j] + C[j];
}</pre>
```

**Register width** 

1 operand



```
for (int j=0; j<size; j++) {
        A[j] = B[j] + C[j];
}</pre>
```

**Register width** 

1 operand



```
for (int j=0; j<size; j++) {
        A[j] = B[j] + C[j];
}</pre>
```

**Register width** 

1 operand



```
for (int j=0; j<size; j++) {
        A[j] = B[j] + C[j];
}</pre>
```

**Register width** 

1 operand



```
for (int j=0; j<size; j++) {
        A[j] = B[j] + C[j];
}</pre>
```

## Register width

1 operand



2 operands (SSE)



4 operands (AVX)



8 operands (AVX512)







```
for (int j=0; j<size; j++) {
        A[j] = B[j] + C[j];
}</pre>
```

Register width

1 operand



2 operands (SSE)



4 operands (AVX)



8 operands (AVX512)



```
for (int j=0; j<size; j++) {
        A[j] = B[j] + C[j];
}</pre>
```

Register width

1 operand



2 operands (SSE)



4 operands (AVX)



8 operands (AVX512)



```
for (int j=0; j<size; j++) {
        A[j] = B[j] + C[j];
}</pre>
```

Register width

1 operand



2 operands (SSE)



4 operands (AVX)



8 operands (AVX512)



```
for (int j=0; j<size; j++) {
        A[j] = B[j] + C[j];
}</pre>
```

## Register width

1 operand



2 operands (SSE)



4 operands (AVX)



8 operands (AVX512)







```
for (int j=0; j<size; j++) {
        A[j] = B[j] + C[j];
}</pre>
```

Register width

1 operand



2 operands (SSE)



4 operands (AVX)



8 operands (AVX512)



```
for (int j=0; j<size; j++) {
        A[j] = B[j] + C[j];
}</pre>
```

## Register width

1 operand



2 operands (SSE)



4 operands (AVX)



8 operands (AVX512)



```
for (int j=0; j<size; j++) {
        A[j] = B[j] + C[j];
}</pre>
```

Register width

1 operand



2 operands (SSE)



4 operands (AVX)



8 operands (AVX512)

#### **SIMD** execution



Best code requires vectorized

LOADs, STOREs, and arithmetic!

## Data types in 32-byte SIMD registers

Supported data types depend on actual SIMD instruction set



```
for(int i=0; i<n; i++)
    C[i]= A[i] + B[i];</pre>
```









## SIMD processing: Roadblocks

No SIMD vectorization for loops with data dependencies:

```
for(int i=1; i<n; i++)
        A[i] = A[i-1] * s;</pre>
```

## SIMD processing: Roadblocks

No SIMD vectorization for loops with data dependencies:

```
for(int i=1; i<n; i++)
        A[i] = A[i-1] * s;</pre>
```

"Pointer aliasing" may prevent vectorization

```
void f(double *A, double *B, double *C, int n) {
    for(int i=0; i<n; ++i)
        C[i] = A[i] + B[i];
}</pre>
```

C/C++ allows: A=&C[-1] and  $B=\&C[-2] \rightarrow C[i]=C[i-1]+C[i-2]$  $\rightarrow$  data dependency  $\rightarrow$  no SIMD

# SIMD processing: Roadblocks

No SIMD vectorization for loops with data dependencies:

```
for(int i=1; i<n; i++)
        A[i] = A[i-1] * s;</pre>
```

"Pointer aliasing" may prevent vectorization

```
void f(double *A, double *B, double *C, int n) {
    for(int i=0; i<n; ++i)
        C[i] = A[i] + B[i];
}</pre>
```

C/C++ allows: A=&C[-1] and B=&C[-2] → C[i]=C[i-1]+C[i-2] → data dependency → no SIMD

If pointer aliasing does not occur in code, tell the compiler: -fno-alias (Intel), -Msafeptr (PGI), -fargument-noalias (gcc) restrict keyword (C only!): void f(double \*restrict A, double \*restrict B, double \*restrict C, int n) {...}

# How to leverage SIMD: your options

**Options:** 

- The compiler does it for you (but: aliasing, alignment, language, abstractions)
- Compiler directives (pragmas) OpenMP 4.0++ has ample support
- Alternative programming models for compute kernels (OpenCL, ispc)
- Intrinsics (restricted to C/C++)
- Implement directly in assembly

Options:

- The compiler does it for you (but: aliasing, alignment, language, abstractions)
- Compiler directives (pragmas) OpenMP 4.0++ has ample support
- Alternative programming models for compute kernels (OpenCL, ispc)
- Intrinsics (restricted to C/C++)
- Implement directly in assembly

Example: x86 SIMD (SSE) intrinsics

```
#include <x86intrin.h>
...
for (int j=0; j<size; j+=16) {
   t0 = _mm_loadu_ps(data+j);
   t1 = _mm_loadu_ps(data+j+4);
   t2 = _mm_loadu_ps(data+j+8);
   t3 = _mm_loadu_ps(data+j+12);
   sum0 = _mm_add_ps(sum0, t0);
   sum1 = _mm_add_ps(sum1, t1);
   sum2 = _mm_add_ps(sum2, t2);
   sum3 = _mm_add_ps(sum3, t3);</pre>
```

# Vectorization compiler options (Intel)

- The compiler will vectorize starting with –O2
- To enable specific SIMD extensions use the -x option: -xSSE2, -xSSE3, -xSSE3, -xSSE4.1, -xSSE4.2, -xAVX, ...
- -xAVX on Sandy/Ivy Bridge processors
- -xCORE-AVX2 on Haswell/Broadwell
- -xCORE-AVX512 on Skylake (certain models) and later

Recommended option:

- -xHost will optimize for the architecture you compile on
- To really enable 512-bit SIMD with current Intel compilers you need to set -qopt-zmm-usage=high

Optional

# User-mandated vectorization (OpenMP 4)

- Since OpenMP 4.0 SIMD features are a part of the OpenMP standard
- #pragma omp simd enforces vectorization
- Essentially a standardized "go ahead, no dependencies here!"
   Do not lie to the compiler!

# User-mandated vectorization (OpenMP 4)

- Since OpenMP 4.0 SIMD features are a part of the OpenMP standard
- #pragma omp simd enforces vectorization
- Essentially a standardized "go ahead, no dependencies here!"
   Do not lie to the compiler!

```
for (int j=0; j<n; j++) {
    #pragma omp simd reduction(+:b[j:1])
    for (int i=0; i<n; i++) {
        b[j] += a[j][i];
    }
}</pre>
```

# User-mandated vectorization (OpenMP 4)

- Since OpenMP 4.0 SIMD features are a part of the OpenMP standard
- #pragma omp simd enforces vectorization
- Essentially a standardized "go ahead, no dependencies here!"
   Do not lie to the compiler!
- Prerequisites
  - Countable loop

for (int j=0; j<n; j++) {
 #pragma omp simd reduction(+:b[j:1])
 for (int i=0; i<n; i++) {
 b[j] += a[j][i];
 }
}</pre>

- Innermost loop
- Must conform to for-loop style of OpenMP worksharing constructs
- There are additional clauses: reduction, simdlen, private, collapse, ...

# Limits of the SIMD benefit

Why does SIMD usually not give the expected speedup? → Analyze time contributions for data and execution



for(int i=0; i<size; i++)
 sum += data[i];</pre>

# Limits of the SIMD benefit

Why does SIMD usually not give the expected speedup? → Analyze time contributions for data and execution

for(int i=0; i<size; i++)
 sum += data[i];</pre>



# Rules and guidelines for vectorizable loops

#### 1. Inner loop

- 2. Countable (loop length can be determined at loop entry)
- 3. Single entry and single exit
- 4. Straight line code (no conditionals) unless masks can be used
- 5. No function calls (exceptions: SIMD declared functions, intrinsic math)

#### Better performance with:

- 1. Simple inner loops with unit stride (contiguous data access)
- 2. Minimize indirect addressing
- 3. Align data structures to SIMD width boundary (minor impact)

In C use the **restrict** keyword and/or **const** qualifiers and/or compiler options to rule out array/pointer aliasing

# SIMD conclusions

- Short-vector SIMD = data-parallel execution on the instruction level
- Best option: make the compiler employ SIMD instructions
- SIMD is an in-core feature
  - Boosts work per cycle in core (peak performance)
  - The further away the data, the less benefit
  - If the code is memory bound, you may not even care





Friedrich-Alexander-Universität Erlangen-Nürnberg

# Efficient parallel programming on ccNUMA nodes

# Performance characteristics of ccNUMA nodes First touch placement policy



## ccNUMA – The "other affinity"

#### ccNUMA:

- Whole memory is transparently accessible by all processors
- but physically distributed across multiple locality domains (LDs)
- with varying bandwidth and latency
- and potential contention (shared memory paths)
- How do we make sure that memory access is always as "local" and "distributed" as possible?

**Note:** Page placement is implemented in units of OS pages (often 4 KiB, possibly more)



## ccNUMA – The "other affinity"

#### ccNUMA:

- Whole memory is transparently accessible by all processors
- but physically distributed across multiple locality domains (LDs)
- with varying bandwidth and latency
- and potential contention (shared memory paths)
- How do we make sure that memory access is always as "local" and "distributed" as possible?

**Note:** Page placement is implemented in units of OS pages (often 4 KiB, possibly more)



## ccNUMA – The "other affinity"

#### ccNUMA:

- Whole memory is transparently accessible by all processors
- but physically distributed across multiple locality domains (LDs)
- with varying bandwidth and latency
- and potential contention (shared memory paths)
- How do we make sure that memory access is always as "local" and "distributed" as possible?

**Note:** Page placement is implemented in units of OS pages (often 4 KiB, possibly more)



#### ccNUMA:

- Whole memory is transparently accessible by all processors
- but physically distributed across multiple locality domains (LDs)
- with varying bandwidth and latency
- and potential contention (shared memory paths)
- How do we make sure that memory access is always as "local" and "distributed" as possible?



#### ccNUMA:

- Whole memory is transparently accessible by all processors
- but physically distributed across multiple locality domains (LDs)
- with varying bandwidth and latency
- and potential contention (shared memory paths)
- How do we make sure that memory access is always as "local" and "distributed" as possible?



#### ccNUMA:

- Whole memory is transparently accessible by all processors
- but physically distributed across multiple locality domains (LDs)
- with varying bandwidth and latency
- and potential contention (shared memory paths)
- How do we make sure that memory access is always as "local" and "distributed" as possible?



#### ccNUMA:

- Whole memory is transparently accessible by all processors
- but physically distributed across multiple locality domains (LDs)
- with varying bandwidth and latency
- and potential contention (shared memory paths)
- How do we make sure that memory access is always as "local" and "distributed" as possible?



#### How much does nonlocal access cost?

Example: AMD "Naples" dual-socket system (8 chips, 2 sockets, 48 cores): *STREAM Triad bandwidth measurements* [Gbyte/s]

| CPU nod  | e 0  | 1    | 2    | 3    | 4    | 5    | 6    | 7    |
|----------|------|------|------|------|------|------|------|------|
| MEM node | 22.4 | 24.4 | 24.0 | 24.0 | 10.0 | 10.6 | 10.7 | 10.0 |
| 0        | 32.4 | 21.4 | 21.8 | 21.9 | 10.6 | 10.6 | 10.7 | 10.8 |
| 1        | 21.5 | 32.4 | 21.9 | 21.9 | 10.6 | 10.5 | 10.7 | 10.6 |
| 2        | 21.8 | 21.9 | 32.4 | 21.5 | 10.6 | 10.6 | 10.8 | 10.7 |
| 3        | 21.9 | 21.9 | 21.5 | 32.4 | 10.6 | 10.6 | 10.6 | 10.7 |
| 4        | 10.6 | 10.7 | 10.6 | 10.6 | 32.4 | 21.4 | 21.9 | 21.9 |
| 5        | 10.6 | 10.6 | 10.6 | 10.6 | 21.4 | 32.4 | 21.9 | 21.9 |
| 6        | 10.6 | 10.7 | 10.6 | 10.6 | 21.9 | 21.9 | 32.3 | 21.4 |
| 7        | 10.7 | 10.6 | 10.6 | 10.6 | 21.9 | 21.9 | 21.4 | 32.5 |



#### numact1 can influence the way a binary maps its memory pages:

numactl --membind=<nodes> a.out # map pages only on <nodes> --preferred=<node> a.out # map pages on <node> # and others if <node> is full --interleave=<nodes> a.out # map pages round robin across # all <nodes>

Examples:

```
for m in `seq 0 7`; do ccNUMA map scan
for c in `seq 0 7`; do for Naples system
env OMP_NUM_THREADS=6 \
    numactl --membind=$m likwid-pin -c M${c}:0-5 ./stream
done
done
```

numactl --interleave=0-7 likwid-pin -c E:N:8:1:12 ./stream

#### numact1 can influence the way a binary maps its memory pages:

numactl --membind=<nodes> a.out # map pages only on <nodes> --preferred=<node> a.out # map pages on <node> # and others if <node> is full --interleave=<nodes> a.out # map pages round robin across # all <nodes>

Examples:

```
for m in `seq 0 7`; do ccNUMA map scan
for c in `seq 0 7`; do for Naples system
env OMP_NUM_THREADS=6 \
    numactl --membind=$m likwid-pin -c M${c}:0-5 ./stream
done
done
```

numactl --interleave=0-7 likwid-pin -c E:N:8:1:12 ./stream

#### But what is the default without numactl?

### A memory page gets mapped into the local memory of the processor that first touches it!

(Except if there is not enough local memory available)

Caveat: "to touch" means "to write," not "to allocate"

A memory page gets mapped into the local memory of the processor that first touches it! (Except if there is not enough local memory available)

Caveat: "to touch" means "to write," not "to allocate"Example:

double \*huge = (double\*)malloc(N\*sizeof(double));

```
for(i=0; i<N; i++) // or i+=PAGE_SIZE/sizeof(double)
huge[i] = 0.0;</pre>
```

It is sufficient to touch a single item to map the entire page

A memory page gets mapped into the local memory of the processor that first touches it! (Except if there is not enough local memory available)



It is sufficient to touch a single item to map the entire page

A memory page gets mapped into the local memory of the processor that first touches it! (Except if there is not enough local memory available)



It is sufficient to touch a single item to map the entire page

#### Simplest case: explicit initialization

```
integer,parameter :: N=1000000
double precision A(N), B(N)
A=0.d0
!$OMP parallel do
do i = 1, N
 B(i) = function (A(i))
end do
!$OMP end parallel do
```

#### Simplest case: explicit initialization



```
integer,parameter :: N=1000000
double precision A(N), B(N)
!$OMP parallel
!$OMP do schedule(static)
do i = 1, N
 A(i)=0.d0
end do
!$OMP end do
. . .
!$OMP do schedule(static)
do i = 1, N
 B(i) = function (A(i))
end do
!$OMP end do
!$OMP end parallel
```

Sometimes initialization is not so obvious: I/O cannot be easily parallelized, so "localize" arrays before I/O

```
integer,parameter :: N=1000000
allocate(A(N), B(N))
READ (1000) A
!$OMP parallel do
do i = 1, N
 B(i) = function (A(i))
end do
!$OMP end parallel do
```

Sometimes initialization is not so obvious: I/O cannot be easily parallelized, so "localize" arrays before I/O



### Coding for Data Locality

- Required condition: OpenMP loop schedule of initialization must be the same as in all computational loops
  - Only choice: **static**! Specify explicitly on all NUMA-sensitive loops, just to be sure...
  - Imposes some constraints on possible optimizations (e.g. load balancing)
  - Presupposes that all worksharing loops with the same loop length have the same threadchunk mapping
  - If dynamic scheduling/tasking is unavoidable, the problem cannot be solved completely if a team of threads spans more than one LD
    - Static parallel first touch is still a good idea
- How about global objects?
  - Initialized before main() is called
  - If communication vs. computation is favorable, might consider properly placed copies of global data
- C++: Arrays of objects and std::vector<> are by default initialized sequentially
  - STL allocators provide an elegant solution

### Diagnosing bad locality

- If your code is cache bound, you might not notice any locality problems
- Otherwise, bad locality limits scalability (whenever a ccNUMA node boundary is crossed)
   Just an indication, not a proof yet
- Running with numactl --interleave might give you a hint
  - See later
- Consider using performance counters
  - likwid-perfctr can be used to measure non-local memory accesses
  - Example:
    - \$ likwid-perfctr -g NUMA -C M0:0-4@M1:0-4 ./a.out

### Diagnosing bad locality

- If your code is cache bound, you might not notice any locality problems
- Otherwise, bad locality limits scalability (whenever a ccNUMA node boundary is crossed)
   Just an indication, not a proof yet
- Running with numactl --interleave might give you a hint
  - See later
- Consider using performance counters
  - Ikwid-perfctr can be used to measure non-local memory accesses
  - Example:
    - \$ likwid-perfctr -g NUMA -C M0:0-4@M1:0-4 ./a.out



### Using performance counters for diagnosis

- Intel Ice Lake SP node (running 2x32 threads): measure inter-socket traffic
  - \$ likwid-perfctr -g UPI -C S0:0@S1:0 ./a.out
- Output:

| Metric                            | •       |            | •       | HWThread 32 |
|-----------------------------------|---------|------------|---------|-------------|
| Runtime (RDTSC) [s]               | -+-<br> | 12.3681    | -+-<br> | 12.3681     |
| Runtime unhalted [s]              | Ι       | 12.1108    | Ι       | 8.2227      |
| Clock [MHz]                       | Ι       | 3281.3537  | Ι       | 3103.6518   |
| CPI                               | Ι       | 5.4670     | Ι       | 35.5873     |
| Received data bandwidth [MByte/s] | Ι       | 22127.2106 | Ι       | 21358.7412  |
| Received data volume [GByte]      | Ι       | 273.6708   | Ι       | 264.1663    |
| Sent data bandwidth [MByte/s]     | Ι       | 21358.7391 | Ι       | 22127.2191  |
| Sent data volume [GByte]          | Ι       | 264.1663   | I       | 273.6709    |
| Total data bandwidth [MByte/s]    | Ι       | 43485.9496 | Ι       | 43485.9603  |
| Total data volume [GByte]         | Ι       | 537.8370   | Ι       | 537.8372    |

#### Caveat: NUMA metrics vary strongly between CPU models

### Using performance counters for diagnosis

- Intel Ice Lake SP node (running 2x32 threads): measure inter-socket traffic
  - \$ likwid-perfctr -g UPI -C S0:0@S1:0 ./a.out
- Output:

| +                                 | +          | -++         |  |
|-----------------------------------|------------|-------------|--|
| Metric                            | HWThread 0 | HWThread 32 |  |
| Runtime (RDTSC) [s]               | 12.3681    | 12.3681     |  |
| Runtime unhalted [s]              | 12.1108    | 8.2227      |  |
| Clock [MHz]                       | 3281.3537  | 3103.6518   |  |
| CPI                               | 5.4670     | 35.5873     |  |
| Received data bandwidth [MByte/s] | 22127.2106 | 21358.7412  |  |
| Received data volume [GByte]      | 273.6708   | 264.1663    |  |
| Sent data bandwidth [MByte/s]     | 21358.7391 | 22127.2191  |  |
| Sent data volume [GByte]          | 264.1663   | 273.6709    |  |
| Total data bandwidth [MByte/s]    | 43485.9496 | 43485.9603  |  |
| Total data volume [GByte]         | 537.8370   | 537.8372    |  |
| <b>_</b>                          | <b></b>    |             |  |

#### Caveat: NUMA metrics vary strongly between CPU models

About half of the overall memory traffic is caused by the remote domain!

# OpenMP STREAM triad on a dual AMD Epyc 7451 ("Naples") (6 cores per LD)

- 1. Parallel init: Correct parallel initialization
- 2. LDO: Force data into LDO via numactl -m 0
- 3. Interleaved: numactl --interleave <LD range>



- Experiment: memory-bound Jacobi solver with sequential data initialization
  - No parallel data placement at all!
  - Expect no scaling across LDs
- Convergence threshold δ determines the runtime
  - The smaller  $\delta$ , the longer the run



- Experiment: memory-bound Jacobi solver with sequential data initialization
  - No parallel data placement at all!
  - Expect no scaling across LDs
- Convergence threshold δ determines the runtime
  - The smaller  $\delta$ , the longer the run
- Observation
  - No scaling across LDs for large δ (runtime 0.5 s)
  - Scaling gets better with smaller δ up to almost perfect efficiency ε (runtime 91 s)



- Experiment: memory-bound Jacobi solver with sequential data initialization
  - No parallel data placement at all!
  - Expect no scaling across LDs
- Convergence threshold δ determines the runtime
  - The smaller  $\delta$ , the longer the run
- Observation
  - No scaling across LDs for large δ (runtime 0.5 s)
  - Scaling gets better with smaller δ up to almost perfect efficiency ε (runtime 91 s)



- Experiment: memory-bound Jacobi solver with sequential data initialization
  - No parallel data placement at all!
  - Expect no scaling across LDs
- Convergence threshold δ determines the runtime
  - The smaller  $\delta$ , the longer the run
- Observation
  - No scaling across LDs for large δ (runtime 0.5 s)
  - Scaling gets better with smaller δ up to almost perfect efficiency ε (runtime 91 s)
- Conclusion
  - Something seems to "heal" the bad access locality on a time scale of tens of seconds



Linux kernel supports automatic page migration

```
$ cat /proc/sys/kernel/numa_balancing
0
$ echo 1 > /proc/sys/kernel/numa balancing  # activate
```

- Active on all current Linux distributions, some performance impact for single core execution
- Parameters control aggressiveness

\$ 11 /proc/sys/kernel/numa\* -rw-r--r-- 1 root root 0 Jun 26 09:16 numa\_balancing -rw-r--r-- 1 root root 0 Jun 26 09:16 numa\_balancing\_scan\_delay\_ms -rw-r--r-- 1 root root 0 Jun 26 09:16 numa\_balancing\_scan\_period\_max\_ms -rw-r--r-- 1 root root 0 Jun 26 09:16 numa\_balancing\_scan\_period\_min\_ms -rw-r--r-- 1 root root 0 Jun 26 09:16 numa\_balancing\_scan\_size\_mb

Default behavior is "take it slow"

Do not rely on it! Parallel first touch is still a good idea!

- Identify the problem
  - Is ccNUMA an issue in your code?
  - Simple test: run with numactl --interleave
  - Consider performance counters if available

- Identify the problem
  - Is ccNUMA an issue in your code?
  - Simple test: run with numactl --interleave
  - Consider performance counters if available
- Apply first-touch placement in initialization loops
  - Consider loop lengths and static scheduling
  - C++ and global/static objects may require special care

- Identify the problem
  - Is ccNUMA an issue in your code?
  - Simple test: run with numactl --interleave
  - Consider performance counters if available
- Apply first-touch placement in initialization loops
  - Consider loop lengths and static scheduling
  - C++ and global/static objects may require special care
- NUMA balancing is active on many Linux systems today
  - Automatic page migration
  - Slow process, may take many seconds (configurable)
  - Not a silver bullet
  - Still a good idea to do parallel first touch

- Identify the problem
  - Is ccNUMA an issue in your code?
  - Simple test: run with numactl --interleave
  - Consider performance counters if available
- Apply first-touch placement in initialization loops
  - Consider loop lengths and static scheduling
  - C++ and global/static objects may require special care
- NUMA balancing is active on many Linux systems today
  - Automatic page migration
  - Slow process, may take many seconds (configurable)
  - Not a silver bullet
  - Still a good idea to do parallel first touch
- If dynamic scheduling cannot be avoided
  - Consider round-robin placement as a quick (but non-ideal) fix
  - OpenMP 5.0 has some data affinity support

## **Tutorial conclusion**

- Know your system (node) architecture
- Enforce affinity



- Back-of-the-envelope models are extremely useful
- Modeling is not always predictive
- Bottleneck awareness rules
- Performance is not about tools. Use your brain!