

# Performance Engineering on Multi- and Manycores

<u>Georg Hager</u>, Gerhard Wellein HPC Services, Erlangen Regional Computing Center (RRZE)

Tutorial @ SAHPC 2012 December 1-3, 2012 KAUST, Thuwal Saudi Arabia Where can I find those gorgeous slides?

# http://goo.gl/cTSKL

**Or:** <u>http://blogs.fau.de/hager/tutorials/sahpc-2012/</u>

# Is there a book or anything?

Georg Hager and Gerhard Wellein: Introduction to High Performance Computing for Scientists and Engineers

CRC Press, 2010 ISBN 978-1439811924 356 pages

# Fun and facts for HPC: <u>http://blogs.fau.de/hager/</u>







# The Plan



- Motivation
- Performance Engineering
  - Performance modeling
  - The Performance Engineering process
- Modern architectures
  - Multicore
  - Accelerators
  - Programming models
- Data access
- Performance properties of multicore systems
  - Saturation
  - Scalability
  - Synchronization
- Case study: OpenMP-parallel sparse MVM

- Basic performance modeling: Roofline
  - Theory
  - Case study: 3D Jacobi solver and guided optimizations
  - Modeling erratic access

## Some more architecture

- Simultaneous multithreading (SMT)
- ccNUMA
- Putting cores to good use
  - Asynchronous communication in spMVM
- A simple power model for multicore
  - Power-efficient code execution
- Conclusions

#### SAHPC 2012 Tutorial

# The Plan



### Motivation

- Performance Engineering
  - Performance modeling
  - The Performance Engineering process
- Modern architectures
  - Multicore
  - Accelerators
  - Programming models
- Data access
- Performance properties of multicore systems
  - Saturation
  - Scalability
  - Synchronization
- Case study: OpenMP-parallel sparse MVM

- Basic performance modeling: Roofline
  - Theory
  - Case study: 3D Jacobi solver and guided optimizations
  - Modeling erratic access
- Some more architecture
  - Simultaneous multithreading (SMT)
  - ccNUMA
- Putting cores to good use
  - Asynchronous communication in spMVM
- A simple power model for multicore
  - Power-efficient code execution
- Conclusions

### SAHPC 2012 Tutorial



# Motivation 1: Scalability 4 the win!



# Lore 1

# In a world of highly parallel computer architectures only highly scalable codes will survive

Lore 2

Single core performance no longer matters since we have so many of them and use scalable codes

### Scalability Myth: Code scalability is the key issue





#### SAHPC 2012 Tutorial

### Scalability Myth: Code scalability is the key issue





#### SAHPC 2012 Tutorial



# Motivation 2: The 200x GPGPU speedup story

### Accelerator myth: The 200x speedup story...





#### **Dense Matrix-Vector-Multiplication (N=4500)**

SAHPC 2012 Tutorial

## **Sparse matrix-vector multiply**

M. Kreutzer et al., LSPP12 DOI: 10.1109/IPDPSW.2012.211



### GPGPU speedup: 1.6x,...,2.1x (no PCIe data transfer!)

SAHPC 2012 Tutorial

# The Plan



- Motivation
- Performance Engineering
  - Performance modeling
  - The Performance Engineering process
- Modern architectures
  - Multicore
  - Accelerators
  - Programming models
- Data access
- Performance properties of multicore systems
  - Saturation
  - Scalability
  - Synchronization
- Case study: OpenMP-parallel sparse MVM

- Basic performance modeling: Roofline
  - Theory
  - Case study: 3D Jacobi solver and guided optimizations
  - Modeling erratic access
- Some more architecture
  - Simultaneous multithreading (SMT)
  - ccNUMA
- Putting cores to good use
  - Asynchronous communication in spMVM
- A simple power model for multicore
  - Power-efficient code execution
- Conclusions

### SAHPC 2012 Tutorial



# **The Performance Engineering process**

Model building Our definition

# How model-building works: Physics



### **Newtonian mechanics**







Relativistic quantum field theory

 $\vec{F} = m\vec{a}$ 

Fails @ small scales!

 $U(1)_Y \otimes SU(2)_L \otimes SU(3)_c$ 

SAHPC 2012 Tutorial



# The Performance Engineering (PE) process:



# The performance model is the central component – if the model fails to predict the measurement, you learn something!

### The analysis has to be done for every loop / basic block!

# The Plan



### Motivation

### Performance Engineering

- Performance modeling
- The Performance Engineering process
- Modern architectures
  - Multicore
  - Accelerators
  - Programming models
- Data access
- Performance properties of multicore systems
  - Saturation
  - Scalability
  - Synchronization
- Case study: OpenMP-parallel sparse MVM

- Basic performance modeling: Roofline
  - Theory
  - Case study: 3D Jacobi solver and guided optimizations
  - Modeling erratic access
- Some more architecture
  - Simultaneous multithreading (SMT)
  - ccNUMA
- Putting cores to good use
  - Asynchronous communication in spMVM
- A simple power model for multicore
  - Power-efficient code execution
- Conclusions

### SAHPC 2012 Tutorial



# Multicore processor and system architecture

### **Basics of machine characteristics**





#### SAHPC 2012 Tutorial





# But: P=5.4 GF/s (dp) for serial, non-SIMD code



# Yesterday (2006): Dual-socket Intel "Core2" node:



Uniform Memory Architecture (UMA)

Flat memory ; symmetric MPs

But: system "anisotropy"

# Today: Dual-socket Intel (Westmere) node:



Cache-coherent Non-Uniform Memory Architecture (ccNUMA)

HT / QPI provide scalable bandwidth at the price of ccNUMA architectures: *Where does my data finally end up?* 

# On AMD it is even more complicated $\rightarrow$ ccNUMA within a socket!

rr 🖻 🖃

• Up to 16 cores (8 Bulldozer modules) in a single socket



SAHPC 2012 Tutorial





- Two 8- (integer-) core chips per socket @ 2.3 GHz (3.3 @ turbo)
- Separate DDR3 memory interface per chip
  - ccNUMA on the socket!
- Shared FP unit per pair of integer cores ("module")
  - "256-bit" FP unit
  - SSE4.2, AVX, FMA4
- 16 kB L1 data cache per core
- 2 MB L2 cache per module
- 8 MB L3 cache per chip (6 MB usable)

#### SAHPC 2012 Tutorial



# Interlude: A glance at current accelerator technology

# **NVIDIA Kepler GK110 Block Diagram**



## Architecture

- 7.1B Transistors
- 15 SMX units
- > 1 TFLOP DP peak
- 1.5 MB L2 Cache
- 384-bit GDDR5
- PCI Express Gen3
- 3:1 SP:DP performance

|                   |  | LD/ST | SFU              | Core | Core | Core                | DP Uni  | t Core | Core | Core | DP Unit           |
|-------------------|--|-------|------------------|------|------|---------------------|---------|--------|------|------|-------------------|
|                   |  |       |                  |      | PCI  | Express 3.0 Host Ir | terface |        |      |      |                   |
| Memory Controller |  |       |                  |      |      |                     |         |        |      |      | Mamory Controller |
| Memory Controller |  |       |                  |      |      |                     |         |        |      |      |                   |
| Memory Controller |  |       | SWX <sup>1</sup> |      |      | SWX                 |         |        |      |      | Memory Control    |

© NVIDIA Corp. Used with permission.

# Intel Xeon Phi block diagram

# Architecture

- 3B Transistors
- 60+ cores
- 512 bit SIMD
- ≈ 1 TFLOP DP peak
- 0.5 MB L2/core
- GDDR5
- 2:1 SP:DP performance





### **Comparing accelerators**



### Intel Xeon Phi

- 60+ IA32 cores each with 512 Bit SIMD FMA unit → 480/960 SIMD DP/SP tracks
- Clock Speed: ~1000 MHz
- Transistor count: ~3 B (22nm)
- Power consumption: ~250 W
- Peak Performance (DP): ~ 1 TF/s
- Memory BW: ~250 GB/s (GDDR5)
- Threads to execute: 60-240+
- Programming:
   Fortran/C/C++ +OpenMP + SIMD



# NVIDIA Kepler K20

- 15 SMX units each with 192 "cores"
  - → 960/2880 DP/SP "cores" in total



- Clock Speed: ~700 MHz
- Transistor count: 7.1 B (28nm)
- Power consumption: ~250 W
- Peak Performance (DP): ~ 1.3 TF/s
- Memory BW: ~ 250 GB/s (GDDR5)
- Threads to execute: 10.000+
- Programming: CUDA, OpenCL, (OpenACC)

| TOP7: "Stampede" at Texas Center | <b>TOP500</b> | TOP1: "Titan" at Oak Ridge National |
|----------------------------------|---------------|-------------------------------------|
| for Advanced Computing           | rankings      | Laboratory                          |

# **Trading single thread performance for parallelism:** *GPGPUs vs. CPUs*



# GPU vs. CPU light speed estimate:

- 1. Compute bound: 2-10x
- 2. Memory Bandwidth: 1-5x



|                                                                                   | Intel Core i5 – 2500<br>("Sandy Bridge") | Intel Xeon E5-2680 DP<br>node ("Sandy Bridge") | NVIDIA K20x<br>("Kepler") |  |  |  |  |
|-----------------------------------------------------------------------------------|------------------------------------------|------------------------------------------------|---------------------------|--|--|--|--|
| Cores@Clock                                                                       | 4 @ 3.3 GHz                              | 2 x 8 @ 2.7 GHz                                | 2880 @ 0.7 GHz            |  |  |  |  |
| Performance+/core                                                                 | 52.8 GFlop/s                             | 43.2 GFlop/s                                   | 1.4 GFlop/s               |  |  |  |  |
| Threads@STREAM                                                                    | <4                                       | <16                                            | >8000?                    |  |  |  |  |
| Total performance+                                                                | 210 GFlop/s                              | 691 GFlop/s                                    | 4,000 GFlop/s             |  |  |  |  |
| Stream BW                                                                         | 18 GB/s                                  | 2 x 40 GB/s                                    | 168 GB/s (ECC=1)          |  |  |  |  |
| Transistors / TDP                                                                 | 1 Billion* / 95 W                        | 2 x (2.27 Billion/130W)                        | 7.1 Billion/250W          |  |  |  |  |
| + Single Precision * Includes on-chip GPU and PCI-Express Complete compute device |                                          |                                                |                           |  |  |  |  |

#### SAHPC 2012 Tutorial

# Parallel programming models

on multicore multisocket nodes

# Shared-memory (intra-node)

- Good old MPI (current standard: 2.2)
- OpenMP (current standard: 3.0)
- POSIX threads
- Intel Threading Building Blocks (TBB)
- Cilk+, OpenCL, StarSs,... you name it

# Distributed-memory (inter-node)

- MPI (current standard: 2.2)
- PVM (gone)

# Hybrid

- Pure MPI
- MPI+OpenMP
- MPI + any shared-memory model
- MPI (+OpenMP) + CUDA/OpenCL/...

All models require awareness of *topology* and *affinity* issues for getting best performance out of the machine!



### **Parallel programming models:** *Pure MPI*







#### SAHPC 2012 Tutorial

# Parallel programming models:

Hybrid MPI+OpenMP on a multicore multisocket cluster





# The Plan



- Motivation
- Performance Engineering
  - Performance modeling
  - The Performance Engineering process
- Modern architectures
  - Multicore
  - Accelerators
  - Programming models

## Data access

- Performance properties of multicore systems
  - Saturation
  - Scalability
  - Synchronization
- Case study: OpenMP-parallel sparse MVM

- Basic performance modeling: Roofline
  - Theory
  - Case study: 3D Jacobi solver and guided optimizations
  - Modeling erratic access
- Some more architecture
  - Simultaneous multithreading (SMT)
  - ccNUMA
- Putting cores to good use
  - Asynchronous communication in spMVM
- A simple power model for multicore
  - Power-efficient code execution
- Conclusions

### SAHPC 2012 Tutorial



# **Data access on modern processors**

Characterization of memory hierarchies General performance properties of multicore processors

### Latency and bandwidth in modern computer environments





SAHPC 2012 Tutorial

# Interlude: Data transfers in a memory hierarchy



- How does data travel from memory to the CPU and back?
- Example: Array copy A(:)=C(:)





- Report performance for different N
- Choose NITER so that accurate time measurement is possible
- This kernel is limited by data transfer performance for all memory levels on all current architectures!





## The Plan



- Motivation
- Performance Engineering
  - Performance modeling
  - The Performance Engineering process
- Modern architectures
  - Multicore
  - Accelerators
  - Programming models
- Data access
- Performance properties of multicore systems
  - Saturation
  - Scalability
  - Synchronization
- Case study: OpenMP-parallel sparse MVM

- Basic performance modeling: Roofline
  - Theory
  - Case study: 3D Jacobi solver and guided optimizations
  - Modeling erratic access
- Some more architecture
  - Simultaneous multithreading (SMT)
  - ccNUMA
- Putting cores to good use
  - Asynchronous communication in spMVM
- A simple power model for multicore
  - Power-efficient code execution
- Conclusions

#### SAHPC 2012 Tutorial



# General remarks on the performance properties of multicore multisocket systems





## Parallel and shared resources within a shared-memory node



#### Parallel resources:

- Execution/SIMD units 1
- Cores 2
- Inner cache levels 3
- Sockets / memory domains 4
- Multiple accelerators 5

#### **Shared resources:**

- Outer cache level per socket 6
- Memory bus per socket 7
- Intersocket link 8
- PCIe bus(es) 9
- Other I/O resources 10

## How does your application react to all of those details?

## The parallel vector triad benchmark (Near-)Optimal code on (Cray) x86 machines



```
call get walltime(S)
!$OMP parallel private(j)
                                               "outer parallel": Avoid thread team restart at
do j=1,R
                                               every workshared loop
  if (N.ge.CACHE LIMIT) then
!DIR$ LOOP INFO cache nt(A)
!$OMP <del>parallel</del> do
    do i=1,N
                                          Large-N version
      A(i) = B(i) + C(i) * D(i)
                                          (nontemporal stores)
    enddo
!$OMP end parallel do
  else
!DIR$ LOOP INFO cache(A)
!$OMP <del>parallel</del> do
    do i=1,N
                                          Small-N version
      A(i) = B(i) + C(i) * D(i)
                                          (standard stores)
    enddo
!$OMP end <del>parallel</del> do
  endif
  ! prevent loop interchange
  if(A(N2).lt.0) call dummy(A,B,C,D)
enddo
!$OMP end parallel
```

```
call get_walltime(E)
```



SAHPC 2012 Tutorial



## The parallel vector triad benchmark

Intra-chip scaling on Cray XE6 Interlagos node

SAHPC 2012 Tutorial



## The parallel vector triad benchmark

Nontemporal stores on Cray XE6 Interlagos node





## The parallel vector triad benchmark

Topology dependence on Cray XE6 Interlagos node





#### 25000 OpenMP T=8 Memory Memory OpenMP T=16 OpenMP T=24 ιοιλ Ιυτειταce OpenMP T=32 20000 ۲S Performance [MFlop/s] CI CS CI CS CI CS C1 C2 C1 C2 C1 C2 C1 C2 15000 10|L1| L2 Memory Interfa Memory Interface Memory Memory 10000 sync overhead grows with core/chip count 5000 bandwidth (up to 8000 cy here) scalability across memory 0 interfaces $10^{2}$ $10^{1}$ $10^{4}$ 10<sup>5</sup> $10^{6}$ Loop length N

## The parallel vector triad benchmark

Inter-chip scaling on Cray XE6 Interlagos node

## What will it look like on many-cores?



## Go figure.





## Bandwidth saturation effects in cache and memory

A look at different processors

## **Bandwidth limitations: Main Memory**

Scalability of shared data paths inside a NUMA domain (V-Triad)





SAHPC 2012 Tutorial

### **Bandwidth limitations: Outer-level cache**

Scalability of shared data paths in L3 cache







## Some data on OpenMP synchronization overhead



**!\$OMP PARALLEL** ...

\$0MP BARRIER

!\$OMP DO

•••

**!\$OMP ENDDO !\$OMP END PARALLEL**  Threads are synchronized at **explicit** AND **implicit** barriers. These are a main source of overhead in OpenMP progams.

Determine costs via modified OpenMP Microbenchmarks testcase (epcc)

## On x86 systems there is no hardware support for synchronization!

- Next slide: Test OpenMP Barrier performance...
- for different compilers
- and different topologies:
  - shared cache
  - shared socket
  - between sockets
- and different thread counts
  - 2 threads
  - full domain (chip, socket, node)

## Thread synchronization overhead on AMD Interlagos

OpenMP barrier overhead in CPU cycles



| 2 Threads      | Cray 8.03 | GCC 4.6.2   | PGI 11.8    | Intel 12.1.3 |
|----------------|-----------|-------------|-------------|--------------|
| Shared L2      | 258       | 3995        | 1503        | 128623       |
| Shared L3      | 698       | 2853        | 1076        | 128611       |
| Same<br>socket | 879       | 2785        | 1297        | 128695       |
| Other socket   | 940       | 2740 / 4222 | 1284 / 1325 | 128718       |

•••

Intel compiler barrier very expensive on Interlagos

OpenMP & Cray compiler 🙂

| Full domain | Cray 8.03 | GCC 4.6.2 | PGI 11.8 | Intel 12.1.3 |
|-------------|-----------|-----------|----------|--------------|
| Shared L3   | 2272      | 27916     | 5981     | 151939       |
| Socket      | 3783      | 49947     | 7479     | 163561       |
| Node        | 7663      | 167646    | 9526     | 178892       |

## **Thread synchronization overhead on Intel CPUs**

pthreads vs. OpenMP vs. Spin loop



| 2 Threads                                                         |  | Q9550 (shared L2) |               |              | i7 920 (shared L3) |                  |  |
|-------------------------------------------------------------------|--|-------------------|---------------|--------------|--------------------|------------------|--|
| pthreads_barrier_wait                                             |  | 23739             |               |              | 6511               |                  |  |
| omp barrier gcc 4.3.3                                             |  | 22603             |               |              | 7333               |                  |  |
| omp barrier icc 11.0                                              |  | 399               |               |              | 469                |                  |  |
| Spin loop                                                         |  | 231               |               | 270          |                    |                  |  |
|                                                                   |  |                   |               |              |                    |                  |  |
| Nehalem 2 Threads                                                 |  | ared SMT th       | reads         | shared       | I L3               | different socket |  |
| pthreads_barrier_wait                                             |  | 23352             |               | 479          | 6                  | 49237            |  |
| omp barrier (icc 11.0)                                            |  | 2761              | 7             | 479          |                    | 1206             |  |
| Spin loop                                                         |  | 17388             | $\mathcal{T}$ | <b>7</b> 267 |                    | 787              |  |
| pthreads → OS kernel call Syncing SMT threads is expensive        |  |                   |               |              |                    |                  |  |
| Spin loop does fine for shared cache sync OpenMP & Intel compiler |  |                   |               |              |                    |                  |  |



## Understanding MPI communication in multicore environments

## Intra-node vs. inter-node MPI

MPI Cartesian topologies and rank-subdomain mapping



 Common misconception: Intranode MPI is infinitely fast compared to internode

## Reality

- Intranode latency is much smaller than internode
- Intranode asymptotic bandwidth is surprisingly comparable to internode
- Difference in saturation behavior

## Other issues

- Mapping between ranks, subdomains and cores with Cartesian MPI topologies
- Overlapping intranode with internode communication

## **MPI and Multicores**

## Clusters: Unidirectional internode Ping-Pong bandwidth





SAHPC 2012 Tutorial

#### Performance Engineering

ъ

## **MPI and Multicores**

## Clusters: Unidirectional intranode Ping-Pong bandwidth





Mapping problem for most efficient communication paths!?

SAHPC 2012 Tutorial



Example: Stencil solver with halo exchange



- **Goal:** Reduce inter-node halo traffic
- Subdomains exchange halo with neighbors
  - Populate a node's ranks with "maximum neighboring" subdomains
  - This minimizes a node's communication surface
- Shouldn't MPI\_CART\_CREATE (w/ reorder) take care of this?

**MPI** rank-subdomain mapping in Cartesian topologies:

A 3D stencil solver and the growing number of cores per node





SAHPC 2012 Tutorial

## **MPI rank-subdomain mapping:**

3D stencil solver – measurements for 8ppn and 4ppn GBE vs. IB



SAHPC 2012 Tutorial

#### Intranode MPI

- May not be as fast as you think...
- Becomes more important as core counts increase
- May not be handled optimally by your MPI library

## Rank-core mapping may be crucial for best performance

- Reduce inter-node traffic
- Most MPIs do not recognize this
- Some (e.g., Cray) can give you hints toward optimal placement





## Affinity matters!

- Almost all performance properties depend on the position of
  - Data
  - Threads/processes
- Consequences
  - Know the topology of your machine
  - Know where your threads are running
  - Know where your data is

## Bandwidth bottlenecks are ubiquitous

- Bad scaling is not always a bad thing
- Do you exhaust your bottlenecks?

## Synchronization overhead may be an issue

... and also depends on affinity!

## The Plan



- Motivation
- Performance Engineering
  - Performance modeling
  - The Performance Engineering process
- Modern architectures
  - Multicore
  - Accelerators
  - Programming models
- Data access
- Performance properties of multicore systems
  - Saturation
  - Scalability
  - Synchronization

## Case study: OpenMP-parallel sparse MVM

- Basic performance modeling: Roofline
  - Theory
  - Case study: 3D Jacobi solver and guided optimizations
  - Modeling erratic access
- Some more architecture
  - Simultaneous multithreading (SMT)
  - ccNUMA
- Putting cores to good use
  - Asynchronous communication in spMVM
- A simple power model for multicore
  - Power-efficient code execution
- Conclusions

#### SAHPC 2012 Tutorial



## Case study: OpenMP-parallel sparse matrix-vector multiplication

A simple (but sometimes not-so-simple) example for bandwidth-bound code and saturation effects in memory

## **Sparse matrix-vector multiply (sMVM)**

- **FFEE**
- Key ingredient in some matrix diagonalization algorithms
  - Lanczos, Davidson, Jacobi-Davidson
- Store only N<sub>nz</sub> nonzero elements of matrix and RHS, LHS vectors with N<sub>r</sub> (number of matrix rows) entries
- "Sparse": N<sub>nz</sub> ~ N<sub>r</sub>







- val[] stores all the nonzeros (length N<sub>nz</sub>)
- col\_idx[] stores the column index of each nonzero (length N<sub>nz</sub>)
- row\_ptr[] stores the starting
  index of each new row in val[]
  (length: N<sub>r</sub>)



## **Case study: Sparse matrix-vector multiply**



## Strongly memory-bound for large data sets

Streaming, with partially indirect access:

```
!$OMP parallel do
do i = 1,N<sub>r</sub>
  do j = row_ptr(i), row_ptr(i+1) - 1
    c(i) = c(i) + val(j) * b(col_idx(j))
  enddo
enddo
!$OMP end parallel do
```

- Usually many spMVMs required to solve a problem
- MPI parallelization possible and well-studied

 Following slides: Performance data on one 24-core AMD Magny Cours node

## **Bandwidth-bound parallel algorithms:** Sparse MVM



- Data storage format is crucial for performance properties
  - Most useful general format: Compressed Row Storage (CRS)
  - SpMVM is easily parallelizable in shared and distributed memory
- For large problems, spMVM is inevitably memory-bound
  - Intra-LD saturation effect on modern multicores

- MPI-parallel spMVM is often communication-bound
  - See later part for what we can do about this...



## **Application: Sparse matrix-vector multiply**

Strong scaling on one XE6 Magny-Cours node



## Case 1: Large matrix



SAHPC 2012 Tutorial

## **Application: Sparse matrix-vector multiply**

Strong scaling on one XE6 Magny-Cours node



## Case 2: Medium size



## **Application: Sparse matrix-vector multiply**

Strong scaling on one Magny-Cours node



## Case 3: Small size



SAHPC 2012 Tutorial

- If the problem is "large", bandwidth saturation on the socket is a reality
  - → There are "spare cores"
  - Very common performance pattern
- What to do with spare cores?
  - Use them for other tasks, such as MPI communication
  - Let them idle → saves energy with minor loss in time to solution

## Can we predict the saturated performance?

- Bandwidth-based performance modeling!
- What is the significance of the indirect access?<sup>2</sup> Can it be modeled?
- Can we predict the saturation point?
  - ... and why is this important?





# The Plan



- Motivation
- Performance Engineering
  - Performance modeling
  - The Performance Engineering process
- Modern architectures
  - Multicore
  - Accelerators
  - Programming models
- Data access
- Performance properties of multicore systems
  - Saturation
  - Scalability
  - Synchronization
- Case study: OpenMP-parallel sparse MVM

- Basic performance modeling: Roofline
  - Theory
  - Case study: 3D Jacobi solver and guided optimizations
  - Modeling erratic access
- Some more architecture
  - Simultaneous multithreading (SMT)
  - ccNUMA
- Putting cores to good use
  - Asynchronous communication in spMVM
- A simple power model for multicore
  - Power-efficient code execution
- Conclusions

### SAHPC 2012 Tutorial



# Basic performance modeling and "motivated optimizations"

The Roofline Model Case study: The Jacobi smoother



# **The Roofline Model**



## The Roofline Model – A tool for more insight

- 1. Determine the applicable peak performance of a loop, assuming that data comes from L1 cache
- 2. Determine the computational intensity (flops per byte transferred) over the slowest data path utilized
- 3. Determine the applicable peak bandwidth of the slowest data path utilized



### SAHPC 2012 Tutorial









## **Bandwidth-bound (simple case)**

- Accurate traffic calculation (writeallocate, strided access, ...)
- Practical ≠ theoretical BW limits
- Erratic access patterns

## **Core-bound (may be complex)**

- Multiple bottlenecks: LD/ST, arithmetic, pipelines, SIMD, execution ports
- See next slide...



### SAHPC 2012 Tutorial

## **Complexities of in-core execution**



### Multiple bottlenecks:

- L1 Icache bandwidth
- Decode/retirement throughput
- Port contention (direct or indirect)
- Arithmetic pipeline stalls (dependencies)
- Overall pipeline stalls (branching)
- L1 Dcache bandwidth (LD/ST throughput)
- Scalar vs. SIMD execution

Register pressure

. . .

Alignment issues

SAHPC 2012 Tutorial





- Code balance (B<sub>c</sub>) quantifies the requirements of the code
  - Reciprocal of comp. intensity



- b<sub>s</sub> = achievable bandwidth over the slowest data path
  - E.g., measured by suitable microbenchmark (STREAM, ...)

Lightspeed for absolute performance:
 (*P*<sub>max</sub> : "applicable" peak performance)



- Example: Vector triad A(:)=B(:)+C(:) \*D(:) on 2.3 GHz Interlagos
  - B<sub>c</sub> = (4+1) Words / 2 Flops = 2.5 W/F (including write allocate)

 $b_{\rm S}/B_c = 1.7$  GF/s (1.2 % of peak performance)



- The balance metric formalism is based on some (crucial) assumptions:
  - There is a clear concept of "work" vs. "traffic"
    - "work" = flops, updates, iterations...
    - "traffic" = required data to do "work"
  - Attainable bandwidth of code = input parameter! Determine effective bandwidth via simple streaming benchmarks to model more complex kernels and applications
  - Data transfer and core execution overlap perfectly!
  - Slowest data path is modeled only; all others are assumed to be infinitely fast
  - If data transfer is the limiting factor, the bandwidth of the slowest data path can be utilized to 100% ("saturation")
  - Latency effects are ignored, i.e. perfect streaming mode



# Case study: A 3D Jacobi smoother

The basics in two dimensions Performance analysis and modeling



# - Laplace equation in 2D: $\Delta \Phi = 0$

# Solve with Dirichlet boundary conditions using Jacobi iteration scheme:

```
double precision, dimension(0:imax+1,0:kmax+1,0:1) :: phi
   integer :: t0,t1
   t0 = 0; t1 = 1
   do it = 1, itmax ! choose suitable number of sweeps
     do k = 1, kmax
                                                            Reuse when computing
       do i = 1, imax
                                                            phi(i+2,k,t1)
          ! four flops, one store, four loads
          phi(i,k,t1) = (phi(i+1,k,t0) + phi(i-1,k,t0))
                          + phi(i, k+1, t0) + phi(i, k-1, t0) ) * 0.25
       enddo
     enddo
                               Naive balance (incl. write allocate):
     ! swap arrays
            ; t0=t1 ; t1=i
                            phi(:,:,t0):3LD+
     i
   enddo
                               phi(:,:,t1):1 ST+1LD
                               \rightarrow B<sub>c</sub> = 5 W / 4 FLOPs = 1.25 W / F
WRITE ALLOCATE:
LD + ST phi(i,k,t1)
```



### Modern cache subsystems may further reduce memory traffic



If cache is large enough to hold at least 2 rows (shaded region): Each phi(:,:,t0) is loaded once from main memory and re-used 3 times from cache:

phi(:,:,t0): 1 LD + phi(:,:,t1): 1 ST+ 1LD  $\rightarrow B_c = 3 W / 4 F = 0.75 W / F$ 

If cache is too small to hold one row: phi(:,:,t0): 2 LD + phi(:,:,t1): 1 ST+ 1LD  $\rightarrow B_c = 5 W / 4 F = 1.25 W / F$ 



### Alternative implementation ("Macho FLOP version")

- MFlops/sec increases by 7/4 but time to solution remains the same
- Better metric (for many iterative stencil schemes): Lattice Site Updates per Second (LUPs/sec)

2D Jacobi example: Compute LUPs/sec metric via

$$P[LUPs / s] = \frac{it_{\max} \cdot i_{\max} \cdot k_{\max}}{T_{\text{wall}}}$$

# $2D \rightarrow 3D$



### 3D sweep:

- Best case balance: 1 LD phi(i,j,k+1,t0) 1 ST + 1 write allocate phi(i,j,k,t1) 6 flops →  $B_c = 0.5$  W/F (24 bytes/update)
- No 2-layer condition but 2 rows fit: B<sub>c</sub> = 5/6 W/F (40 bytes/update)
- Worst case (2 rows do not fit): B<sub>c</sub> = 7/6 W/F (56 bytes/update)

### **3D Jacobi solver**

### Performance of vanilla code on one Interlagos chip (8 cores)





SAHPC 2012 Tutorial



- We have made sense of the memory-bound performance vs. problem size
  - "Layer conditions" lead to predictions of code balance
  - Achievable memory bandwidth is input parameter

- The model works only if the bandwidth is "saturated"
  - In-cache modeling is more involved

 Optimization == reducing the code balance by code transformations

See below



# **Data access optimizations**

Case study: Optimizing a Jacobi solver Case study: Erratic RHS access for sparse MVM



# Case study: 3D Jacobi solver

## Spatial blocking for improved cache re-use



### Remember the 3D Jacobi solver on Interlagos?





SAHPC 2012 Tutorial

# ггее

## Assumptions:

- cache can hold 32 elements (16 for each array)
- Cache line size is 4 elements
- Perfect eviction strategy for source array



This element is needed for three more updates; but 29 updates happen before this element is used for the last time

## Assumptions:

- cache can hold 32 elements (16 for each array)
- Cache line size is 4 elements
- Perfect eviction strategy for source array



This element is needed for three more updates but has been evicted

### SAHPC 2012 Tutorial





- divide system into blocks
- update block after block
- same performance as if three complete rows of the systems fit into cache



- Spatial blocking reorders traversal of data to account for the data update rule of the code
- →Elements stay sufficiently long in cache to be fully reused
- Spatial blocking improves temporal locality!
  (Continuous access in inner loop ensures spatial locality)

(Continuous access in inner loop ensures spatial locality)



This element remains in cache until it is fully used (only 6 updates happen before last use of this element)

## Jacobi iteration (3D): Spatial blocking





## Guidelines:

- Blocking of inner loop levels (traversing continuously through main memory)
- Blocking sizes large enough to fulfill "layer condition"
- Cache size is a hard limit!
- Blocking loops may have some impact on ccNUMA page placement (see later)

## 3D Jacobi solver (problem size 400<sup>3</sup>)

Blocking different loop levels (8 cores Interlagos)





SAHPC 2012 Tutorial

### **3D Jacobi solver**

Spatial blocking + nontemporal stores





SAHPC 2012 Tutorial



# Case study: Erratic RHS access in sparse MVM

### "Modeling" indirect access





Sparse MVM in double precision w/ CRS:



- DP CRS code balance
  - κ quantifies extra traffic for loading RHS more than once
  - Naive performance = b<sub>S</sub>/B<sub>CRS</sub>
  - Determine k by measuring performance and actual memory bandwidth

G. Schubert, G. Hager, H. Fehske and G. Wellein: *Parallel sparse matrix-vector multiplication as a test case for hybrid MPI+OpenMP programming*. Workshop on Large-Scale Parallel Processing (LSPP 2011), May 20th, 2011, Anchorage, AK. <u>DOI:10.1109/IPDPS.2011.332</u>, Preprint: <u>arXiv:1101.0091</u>

## $\kappa$ is determined by the sparsity pattern and the cache



### Analysis for HMeP matrix on Nehalem EP socket

- BW used by spMVM kernel = 18.1 GB/s  $\rightarrow$  should get  $\approx$  2.66 Gflop/s spMVM performance if  $\kappa = 0$
- Measured spMVM performance = 2.25 Gflop/s
- Solve 2.25 Gflop/s =  $b_S/B_{CRS}$  for  $\kappa \approx 2.5$

→ 37.5 extra bytes per row
→ RHS is loaded 6 times from memory
→ about 33% of BW goes into RHS



 Conclusion: Even if the roofline/bandwidth model does not work 100%, we can still learn something from the deviations

Optimization? Perhaps you can reorganize the matrix



## ... on the example of spMVM with HMeP matrix





### Assumes one of two bottlenecks

- 1. In-core execution
- 2. Bandwidth of a single hierarchy level
- Latency effects are not modeled → pure data streaming assumed
- Data transfer and in-core time overlap 100%
- In-core execution is sometimes hard to model
- Saturation effects in multicore chips are not explained
  - ECM model gives more insight

G. Hager, J. Treibig, J. Habich and G. Wellein: Exploring performance and power properties of modern multicore chips via simple machine models. Submitted. Preprint: arXiv:1208.2908



### SAHPC 2012 Tutorial



- There is no substitute for knowing what's going on between your code and the hardware
- Make sense of performance behavior through sensible application of performance models
  - However, there is no "golden formula" to do it all for you automagically
  - If the model does not work properly, you learn something new

## Model inputs:

- Code analysis/inspection
- Hardware counter data
- Microbenachmark analysis
- Architectural features

## Simple models work best; do not try to make it more complex than necessary

SAHPC 2012 Tutorial

# The Plan



- Motivation
- Performance Engineering
  - Performance modeling
  - The Performance Engineering process
- Modern architectures
  - Multicore
  - Accelerators
  - Programming models
- Data access
- Performance properties of multicore systems
  - Saturation
  - Scalability
  - Synchronization
- Case study: OpenMP-parallel sparse MVM

- Basic performance modeling: Roofline
  - Theory
  - Case study: 3D Jacobi solver and guided optimizations
  - Modeling erratic access

## Some more architecture

- Simultaneous multithreading (SMT)
- ccNUMA
- Putting cores to good use
  - Asynchronous communication in spMVM
- A simple power model for multicore
  - Power-efficient code execution
- Conclusions

### SAHPC 2012 Tutorial



# **Boosting core efficiency: Simultaneous multithreading (SMT)**

Principles and performance impact SMT vs. independent instruction streams Facts and fiction SMT Makes a single physical core appear as two or more "logical" cores → multiple threads/processes run concurrently



## SMT principle (2-way example):



# **SMT** impact

- **FF2E**
- SMT is primarily suited for increasing processor throughput
  - With multiple threads/processes running concurrently
- Scientific codes tend to utilize chip resources quite well
  - Standard optimizations (loop fusion, blocking, ...)
  - High data and instruction-level parallelism
  - Exceptions do exist

# SMT is an important topology issue

- SMT threads share almost all core resources
  - Pipelines, caches, data paths
- Affinity matters!
- If SMT is not needed
  - pin threads to physical cores
  - or switch it off via BIOS etc.



# **SMT** impact

- SMT adds another layer of topology (inside the physical core)
- Caveat: SMT threads share all caches!
- Possible benefit: Better pipeline throughput
  - Filling otherwise unused pipelines
  - Filling pipeline bubbles with other thread's executing instructions:



- Beware: Executing it all in a single thread (if possible) may reach the same goal without SMT:





### Intel Sandy Bridge (desktop) 4-core; 3.5 GHz; SMT MULT Pipeline depth: 5 stages $\rightarrow$ 1 F / 5 cycles for recursive update



SAHPC 2012 Tutorial

# Simultaneous recursive updates with SMT



### Intel Sandy Bridge (desktop) 4-core; 3.5 GHz; SMT MULT Pipeline depth: 5 stages → 1 F / 5 cycles for recursive update



### 5 independent updates on a single thread do the same job!

SAHPC 2012 Tutorial



### Intel Sandy Bridge (desktop) 4-core; 3.5 GHz; SMT Pure update benchmark can be vectorized $\rightarrow$ 2 F / cycle (store limited)



SAHPC 2012 Tutorial

#### SAHPC 2012 Tutorial

# SMT myths: Facts and fiction (1)

Myth: "If the code is compute-bound, then the functional units should be saturated and SMT should show no improvement."



- 1. A compute-bound loop does not necessarily saturate the pipelines; dependencies can cause a lot of bubbles, which may be filled by SMT threads.
- 2. If a pipeline is already full, SMT will not improve its utilization





# SMT myths: Facts and fiction (2)

- Myth: "If the code is memory-bound, SMT should help because it can fill the bubbles left by waiting for data from memory."
- Truth:
  - If the maximum memory bandwidth is already reached, SMT will not help since the relevant resource (bandwidth) is exhausted.
     If the maximum memory bandwidth is already reached, SMT will not 7000
     2 F/cycle
     4(i)=A(i)\*s [SIMD]
  - 2. If the relevant bottleneck is not exhausted, SMT may help since it can fill bubbles in the LOAD pipeline.

This applies also to other "relevant bottlenecks!"



#### Performance Engineering

 $10^{7}$ 



# **SMT** myths: Facts and fiction (3)



 Myth: "SMT can help bridge the latency to memory (more outstanding references)."

### Truth:

Outstanding references may or may not be bound to SMT threads; they may be a resource of the memory interface and shared by all threads. The benefit of SMT with memory-bound code is usually due to better utilization of the pipelines so that less time gets "wasted" in the cache hierarchy.

See also the "ECM Performance Model" later on.





| Functional parallelization                          | × × |
|-----------------------------------------------------|-----|
| FP-only parallel loop code                          | × 🗹 |
| Frequent thread synchronization                     | ×   |
| Code sensitive to cache size                        | ×   |
| Strongly memory-bound code                          | ×   |
| Independent pipeline-unfriendly instruction streams | V   |



# Beyond the chip boundary: Efficient parallel programming on ccNUMA nodes

Performance characteristics of ccNUMA nodes First touch placement policy ccNUMA locality and erratic access

## ccNUMA:

- Whole memory is transparently accessible by all processors
- but physically distributed
- with varying bandwidth and latency
- and potential contention (shared memory paths)
- How do we make sure that memory access is always as "local" and "distributed" as possible?



 Page placement is implemented in units of OS pages (often 4kB, possibly more)

### Cray XE6 Interlagos node 4 chips, two sockets, 8 threads per ccNUMA domain

### ccNUMA map: Bandwidth penalties for remote access

- Run 8 threads per ccNUMA domain (1 chip)
- Place memory in different domain  $\rightarrow$  4x4 combinations
- STREAM triad benchmark using nontemporal stores







### numactl can influence the way a binary maps its memory pages:

```
numactl --membind=<nodes> a.out  # map pages only on <nodes>
    --preferred=<node> a.out  # map pages on <node>
    # and others if <node> is full
    --interleave=<nodes> a.out  # map pages round robin across
    # all <nodes>
```

### Examples:

### But what is the default without numactl?



Golden Rule" of ccNUMA:

# A memory page gets mapped into the local memory of the processor that first touches it!

- Except if there is not enough local memory available
- This might be a problem, see later
- Caveat: "touch" means "write", not "allocate"
- Example:

Memory not mapped here yet

double \*huge = (double\*)malloc(N\*sizeof(double));



It is sufficient to touch a single item to map the entire page

# **Coding for ccNUMA data locality**



### Most simple case: explicit initialization



# **Coding for ccNUMA data locality**



 Sometimes initialization is not so obvious: I/O cannot be easily parallelized, so "localize" arrays before I/O



#### SAHPC 2012 Tutorial



- Required condition: OpenMP loop schedule of initialization must be the same as in all computational loops
  - Only choice: static! Specify explicitly on all NUMA-sensitive loops, just to be sure...
  - Imposes some constraints on possible optimizations (e.g. load balancing)
  - Presupposes that all worksharing loops with the same loop length have the same thread-chunk mapping
  - If dynamic scheduling/tasking is unavoidable, more advanced methods may be in order

### How about global objects?

- Better not use them
- If communication vs. computation is favorable, might consider properly placed copies of global data
- std::vector in C++ is initialized serially by default
  - STL allocators provide an elegant solution

**Coding for Data Locality:** *Placement of static arrays or arrays of objects* 

- **FFBE**
- Speaking of C++: Don't forget that constructors tend to touch the data members of an object. Example:

```
class D {
  double d;
public:
  D(double d=0.0) throw() : d(d) {}
  inline D operator+(const D& o) throw() {
    return D(d+o.d);
  }
  inline D operator*(const D& o) throw() {
    return D(d*o.d);
  }
};
                \rightarrow placement problem with
                   D^* array = new D[1000000];
```



 Placement of objects is then done automatically by the C++ runtime via "placement new"

**Coding for Data Locality:** 

NUMA allocator for parallel first touch in **std::vector**<>



```
template <class T> class NUMA Allocator {
public:
  T* allocate(size_type numObjects, const void
               *localityHint=0) {
    size type ofs,len = numObjects * sizeof(T);
    void *m = malloc(len);
    char *p = static cast<char*>(m);
    int i,pages = len >> PAGE BITS;
#pragma omp parallel for schedule(static) private(ofs)
    for(i=0; i<pages; ++i) {</pre>
      ofs = static cast<size t>(i) << PAGE BITS;</pre>
      p[ofs]=0;
    }
    return static cast<pointer>(m);
};
           Application:
```

vector<double,NUMA\_Allocator<double> > x(1000000)



- If your code is cache-bound, you might not notice any locality problems
- Otherwise, bad locality limits scalability at very low CPU numbers (whenever a node boundary is crossed)
  - If the code makes good use of the memory interface
  - But there may also be a general problem in your code...
- Try running with numactl --interleave ...
  - If performance goes up  $\rightarrow$  ccNUMA problem!
- Consider using performance counters
  - LIKWID-perfctr can be used to measure nonlocal memory accesses
  - Example for Intel Nehalem (Core i7):

env OMP\_NUM\_THREADS=8 likwid-perfctr -g MEM -C N:0-7 ./a.out

# Using performance counters for diagnosing bad ccNUMA access locality





# If all fails...



- Even if all placement rules have been carefully observed, you may still see nonlocal memory traffic. Reasons?
  - Program has erratic access patters → may still achieve some access parallelism (see later)
  - OS has filled memory with buffer cache data:

| # numactlha    | ardware # idle node! |  |
|----------------|----------------------|--|
| available: 2 r | nodes (0-1)          |  |
| node 0 size: 2 | 2047 MB              |  |
| node 0 free: 9 | 906 MB               |  |
| node 1 size: 1 | 1935 MB              |  |
| node 1 free: 1 | 1798 MB              |  |
|                |                      |  |

top - 14:18:25 up 92 days, 6:07, 2 users, load average: 0.00, 0.02, 0.00 Mem: 4065564k total, 1149400k used, 2716164k free, 43388k buffers Swap: 2104504k total, 2656k used, 2101848k free, 1038412k cached

## ccNUMA problems beyond first touch: Buffer cache

# OS uses part of main memory for disk buffer (FS) cache

- If FS cache fills part of memory, apps will probably allocate from foreign domains
- non-local access!
- "sync" is not sufficient to drop buffer cache blocks



# Remedies

- Drop FS cache pages after user job has run (admin's job)
  - seems to be automatic after aprun has finished on Crays
- User can run "sweeper" code that allocates and touches all physical memory before starting the real application
- numactl tool or aprun can force local allocation (where applicable)
- Linux: There is no way to limit the buffer cache size in standard kernels

## ccNUMA problems beyond first touch: Buffer cache



# Real-world example: ccNUMA and the Linux buffer cache Benchmark:

- 1. Write a file of some size from LD0 to disk
- 2. Perform bandwidth benchmark using all cores in LD0 and maximum memory installed in LD0

Result: By default, Buffer cache is given priority over local page placement → restrict to local

domain if possible!



#### SAHPC 2012 Tutorial

### ccNUMA placement and erratic access patterns



 Sometimes access patterns are just not nicely grouped into contiguous chunks:

```
double precision :: r, a(M)
!$OMP parallel do private(r)
do i=1,N
    call RANDOM_NUMBER(r)
    ind = int(r * M) + 1
    res(i) = res(i) + a(ind)
enddo
!OMP end parallel do
```

 Or you have to use tasking/dynamic scheduling:

```
!$OMP parallel
!$OMP single
do i=1,N
    call RANDOM_NUMBER(r)
    if(r.le.0.5d0) then
!$OMP task
      call do_work_with(p(i))
!$OMP end task
    endif
enddo
!$OMP end single
!$OMP end parallel
```

In both cases page placement cannot easily be fixed for perfect parallel access



- Worth a try: Interleave memory across ccNUMA domains to get at least some parallel access
  - 1. Explicit placement:



Fine-grained program-controlled placement via libnuma (Linux)
using, e.g., numa\_alloc\_interleaved\_subset(),
numa alloc interleaved() and others

### The curse and blessing of interleaved placement: OpenMP STREAM triad on 4-socket (48 core) Magny Cours node



- Parallel init: Correct parallel initialization
- LD0: Force data into LD0 via numact1 -m 0
- Interleaved: numactl --interleave <LD range>



SAHPC 2012 Tutorial



- ccNUMA is present on all standard cluster architectures
- With pure MPI (and proper affinity control) you should be fine
  - However, watch out for buffer cache
- With threading, you may be fine with one process per ccNUMA domain
- Thread groups spanning more than one domain may cause problems
  - Employ first touch placement ("Golden Rule")
  - Experiment with round-robin placement
- If access patterns are totally erratic, round-robin may be your only choice
  - But there are advanced solutions ("locality queues")

# The Plan



- Motivation
- Performance Engineering
  - Performance modeling
  - The Performance Engineering process
- Modern architectures
  - Multicore
  - Accelerators
  - Programming models
- Data access
- Performance properties of multicore systems
  - Saturation
  - Scalability
  - Synchronization
- Case study: OpenMP-parallel sparse MVM

- Basic performance modeling: Roofline
  - Theory
  - Case study: 3D Jacobi solver and guided optimizations
  - Modeling erratic access
- Some more architecture
  - Simultaneous multithreading (SMT)
  - ccNUMA

# Putting cores to good use

- Asynchronous communication in spMVM
- A simple power model for multicore
  - Power-efficient code execution
- Conclusions

#### SAHPC 2012 Tutorial



# Case study: Asynchronous MPI communication in sparse MVM

What to do with spare cores

### **Distributed-memory parallelization of spMVM**







Variant 1: "Vector mode" without overlap

- Standard concept for "hybrid MPI+OpenMP"
- Multithreaded computation (all threads)
- Communication only outside of computation



 Benefit of threaded MPI process only due to message aggregation and (probably) better load balancing

G. Hager, G. Jost, and R. Rabenseifner: *Communication Characteristics and Hybrid MPI/OpenMP Parallel Programming on Clusters of Multi-core SMP Nodes*.In: Proceedings of the Cray Users Group Conference 2009 (CUG 2009), Atlanta, GA, USA, May 4-7, 2009. <u>PDF</u>



Variant 2: "Vector mode" with naïve overlap ("good faith hybrid")

- Relies on MPI to support async nonblocking PtP
- Multithreaded computation (all threads)
- Still simple programming
- Drawback: Result vector is written twice to memory
  - modified performance model





- Variant 3: "Task mode" with dedicated communication thread
- Explicit overlap, more complex to implement
- One thread missing in team of compute threads
  - But that doesn't hurt here...
  - Using tasking seems simpler but may require some work on NUMA locality

# Drawbacks

- Result vector is written twice to memory
- No simple OpenMP worksharing (manual, tasking)



R. Rabenseifner and G. Wellein: *Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures.* International Journal of High Performance Computing Applications **17**, 49-62, February 2003. <u>DOI:10.1177/1094342003017001005</u>

### **Performance results for the HMeP matrix**





- Dominated by communication (and some load imbalance for large #procs)
- Single-node Cray performance cannot be maintained beyond a few nodes
- Task mode pays off esp. with one process (12 threads) per node
- Task mode overlap (over-)compensates additional LHS traffic

### Performance results for the sAMG matrix





- Much less communication-bound
- XE6 outperforms Westmere cluster, can maintain good node performance
- Hardly any discernible difference as to # of threads per process
- If pure MPI is good enough, don't bother going hybrid!



- Do not rely on asynchronous MPI progress
- Sparse MVM leaves resources (cores) free for use by communication threads
- Simple "vector mode" hybrid MPI+OpenMP parallelization is not good enough if communication is a real problem
- "Task mode" hybrid can truly hide communication and overcompensate penalty from additional memory traffic in spMVM
- Comm thread can share a core with comp thread via SMT and still be asynchronous
- If pure MPI scales ok and maintains its node performance according to the node-level performance model, don't bother going hybrid

## Extension to multi-GPGPU is possible

See references

# The Plan



- Motivation
- Performance Engineering
  - Performance modeling
  - The Performance Engineering process
- Modern architectures
  - Multicore
  - Accelerators
  - Programming models
- Data access
- Performance properties of multicore systems
  - Saturation
  - Scalability
  - Synchronization
- Case study: OpenMP-parallel sparse MVM

- Basic performance modeling: Roofline
  - Theory
  - Case study: 3D Jacobi solver and guided optimizations
  - Modeling erratic access
- Some more architecture
  - Simultaneous multithreading (SMT)
  - ccNUMA
- Putting cores to good use
  - Asynchronous communication in spMVM
- A simple power model for multicore
  - Power-efficient code execution
- Conclusions

### SAHPC 2012 Tutorial



# A simple power model for the Sandy Bridge processor

Assumptions Validation using simple benchmarks

G. Hager, J. Treibig, J. Habich and G. Wellein: Exploring performance and power properties of modern multicore chips via simple machine models. Submitted. Preprint: <u>arXiv:1208.2908</u>



- Goal: Establish model for chip power and program energy consumption with respect to
  - Clock speed
  - Number of cores used
  - Single-thread program performance
- Choose different characteristic benchmark applications to measure a chip's power behavior
  - Matrix-matrix-multiply ("DGEMM"): "Hot" code, well scalable
  - Ray tracer: Sensitive to SMT execution (15% speedup), well scalable
  - 2D Jacobi solver: 4000x4000 grid, strong saturation on the chip
    - AVX variant
    - Scalar variant

### Measure characteristics of those apps and establish a power model

SAHPC 2012 Tutorial



### Sandy Bridge EP (8-core) processor:



SAHPC 2012 Tutorial



### Sandy Bridge EP (8-core) processor:



SAHPC 2012 Tutorial





#### **Performance Engineering**

8

# FFIE

## **Assumptions:**

- 1. Power is a quadratic polynomial in the clock frequency
- 2. Dynamic power is linear in the number of active cores t
- 3. Performance is linear in the number of cores until it hits a bottleneck (← ECM model)
- 4. Performance is linear in the clock frequency unless it hits a bottleneck
- 5. Energy to solution is power dissipation divided by performance

Model:

$$E = \frac{W_0 + (W_1 f + W_2 f^2)t}{\min((1 + \Delta v)tP_0, P_{\max})}$$

where  $f = (1 + \Delta \nu) f_0$ 



$$E = \frac{W_0 + (W_1 f + W_2 f^2)t}{\min((1 + \Delta v)tP_0, P_{\max})}$$

1. If there is no saturation, use all available cores to minimize *E* 



#### SAHPC 2012 Tutorial



$$E = \frac{W_0 + (W_1 f + W_2 f^2)t}{\min((1 + \Delta v)tP_0, P_{\max})}$$

2. There is an optimal frequency  $f_{opt}$  at which *E* is minimal in the non-saturated case, with

$$f_{\text{opt}} = \sqrt{\frac{W_0}{W_2 t}}$$
, hence it depends on the baseline power

→ "Clock race to idle" if baseline accommodates whole system! → May have to look at other metrics, e.g., C = E/P

$$\frac{\partial C}{\partial \Delta v} = -\frac{2W_0 + W_1 ft}{(f/f_0)^3 P_0^2} < 0$$



$$E = \frac{W_0 + (W_1 f + W_2 f^2)t}{\min((1 + \Delta v)tP_0, P_{\max})}$$

3. If there is saturation, *E* is minimal at the saturation point



SAHPC 2012 Tutorial



$$E = \frac{W_0 + (W_1 f + W_2 f^2)t}{\min((1 + \Delta v)tP_0, P_{\max})}$$

4. If there is saturation, absolute minimum *E* is reached if the saturation point is at the number of available cores



#### SAHPC 2012 Tutorial



$$E = \frac{W_0 + (W_1 f + W_2 f^2)t}{\min((1 + \Delta v)tP_0, P_{\max})}$$

### 5. Making code execute faster on the core saves energy since

- The time to solution is smaller if the code scales ("Code race to idle")
- We can use fewer cores to reach saturation if there is a bottleneck



#### SAHPC 2012 Tutorial

### Model validation with the benchmark apps







- Simple assumptions lead to surprising conclusions
- Performance saturation plays a key role
- "Clock race to idle" can be proven quantitatively
- "Code race to idle" (optimization saves energy) is a trivial result
  - Better: "Optimization makes better use of the energy budget"

- Possible extensions to the power model
  - Allow for per-core frequency setting (coming with Intel Haswell)
  - Accommodate load imbalance & sync overhead

# The Plan



- Motivation
- Performance Engineering
  - Performance modeling
  - The Performance Engineering process
- Modern architectures
  - Multicore
  - Accelerators
  - Programming models
- Data access
- Performance properties of multicore systems
  - Saturation
  - Scalability
  - Synchronization
- Case study: OpenMP-parallel sparse MVM

- Basic performance modeling: Roofline
  - Theory
  - Case study: 3D Jacobi solver and guided optimizations
  - Modeling erratic access
- Some more architecture
  - Simultaneous multithreading (SMT)
  - ccNUMA
- Putting cores to good use
  - Asynchronous communication in spMVM
- A simple power model for multicore
  - Power-efficient code execution
- Conclusions

### SAHPC 2012 Tutorial

# What I have left out



- LIKWID: Lightweight multicore peformance tools
  - http://code.google.com/p/likwid
- Multicore-specific properties of MPI communication
- Sparse MVM on multiple GPGPUs: Performance modeling for viability analysis
  - See references
- Exploting shared caches for temporal blocking of stencil codes
- Execution-Cache-Memory (ECM) model
  - Predictive model for multicore scaling
  - Goes well with the power model

### ■ ... and much more ⊗

# **Tutorial conclusion**



### Multicore architecture == multiple complexities

- Affinity matters  $\rightarrow$  pinning/binding is essential
- Bandwidth bottlenecks  $\rightarrow$  inefficiency is often made on the chip level
- Topology dependence of performance features  $\rightarrow$  know your hardware!

### Put cores to good use

- Bandwidth bottlenecks  $\rightarrow$  surplus cores  $\rightarrow$  functional parallelism!?
- Shared caches → fast communication/synchronization → better implementations/algorithms?
- Leave surplus cores idle to save energy

## Simple modeling techniques help us

- ... understand the limits of our code on the given hardware
- ... identify optimization opportunities and hence save energy
- I learn more, especially when they do not work!



## Code:

```
double precision, dimension(10000000) :: a,b
```

```
do i=1,N
    s=s+a(i)*b(i)
enddo
```

# **GPGPU:** 2880 cores, $P_{\text{peak}}$ = 1.3 Tflop/s, $b_{\text{S}}$ =160 Gbyte/s

# **Optimal** performance?



Jan Treibig Johannes Habich Moritz Kreutzer Markus Wittmann Thomas Zeiser Michael Meier Faisal Shahzad Gerald Schubert





Bundesministerium für Bildung und Forschung

> hpcADD SKALB

# THANK YOU.

SAHPC 2012 Tutorial

# **Author Biographies**

- Georg Hager holds a PhD in computational physics from the University of Greifswald. He has been working with high performance systems since 1995, and is now a senior research scientist in the HPC group at Erlangen Regional Computing Center (RRZE). Recent research includes architecture-specific optimization for current microprocessors, performance modeling on processor and system levels, and the efficient use of hybrid parallel systems. See his blog at <u>http://blogs.fau.de/hager</u> for current activities, publications, and talks.
- Gerhard Wellein holds a PhD in solid state physics from the University of Bayreuth and is a professor at the Department for Computer Science at the University of Erlangen. He leads the HPC group at Erlangen Regional Computing Center (RRZE) and has more than ten years of experience in teaching HPC techniques to students and scientists from computational science and engineering programs. His research interests include solving large sparse eigenvalue problems, novel parallelization approaches, performance modeling, and architecture-specific optimization.









Book:

 G. Hager and G. Wellein: Introduction to High Performance Computing for Scientists and Engineers. CRC Computational Science Series, 2010. ISBN 978-1439811924

Papers:

- G. Hager, J. Treibig, J. Habich and G. Wellein: Exploring performance and power properties of modern multicore chips via simple machine models. Submitted. Preprint: <u>arXiv:1208.2908</u>
- J. Treibig, G. Hager and G. Wellein: Performance patterns and hardware metrics on modern multicore processors: Best practices for performance engineering. Workshop on Productivity and Performance (PROPER 2012) at Euro-Par 2012, August 28, 2012, Rhodes Island, Greece. Preprint: <u>arXiv:1206.3738</u>
- M. Kreutzer, G. Hager, G. Wellein, H. Fehske, A. Basermann and A. R. Bishop: Sparse Matrix-vector Multiplication on GPGPU Clusters: A New Storage Format and a Scalable Implementation. Workshop on Large-Scale Parallel Processing 2012 (LSPP12), DOI: 10.1109/IPDPSW.2012.211
- J. Treibig, G. Hager, H. Hofmann, J. Hornegger and G. Wellein: Pushing the limits for medical image reconstruction on recent standard multicore processors. International Journal of High Performance Computing Applications, (published online before print). <u>DOI: 10.1177/1094342012442424</u>



Papers continued:

 G. Wellein, G. Hager, T. Zeiser, M. Wittmann and H. Fehske: Efficient temporal blocking for stencil computations by multicore-aware wavefront parallelization. Proc. COMPSAC 2009.

DOI: 10.1109/COMPSAC.2009.82

- M. Wittmann, G. Hager, J. Treibig and G. Wellein: Leveraging shared caches for parallel temporal blocking of stencil codes on multicore processors and clusters. Parallel Processing Letters 20 (4), 359-376 (2010).
   DOI: 10.1142/S0129626410000296. Preprint: <u>arXiv:1006.3148</u>
- J. Treibig, G. Hager and G. Wellein: LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments. Proc. <u>PSTI2010</u>, the First International Workshop on Parallel Software Tools and Tool Infrastructures, San Diego CA, September 13, 2010. <u>DOI: 10.1109/ICPPW.2010.38</u>. Preprint: <u>arXiv:1004.4431</u>
- G. Schubert, H. Fehske, G. Hager, and G. Wellein: Hybrid-parallel sparse matrix-vector multiplication with explicit communication overlap on current multicore-based systems. Parallel Processing Letters 21(3), 339-358 (2011).
   <u>DOI: 10.1142/S0129626411000254</u>
- J. Treibig, G. Wellein and G. Hager: Efficient multicore-aware parallelization strategies for iterative stencil computations. Journal of Computational Science 2 (2), 130-137 (2011). <u>DOI 10.1016/j.jocs.2011.01.010</u>

# References



Papers continued:

- K. Iglberger, G. Hager, J. Treibig, and U. Rüde: <u>Expression Templates Revisited: A</u> <u>Performance Analysis of Current ET Methodologies</u>. SIAM Journal on Scientific Computing 34(2), C42-C69 (2012). <u>DOI: 10.1137/110830125</u>, Preprint: <u>arXiv:1104.1729</u>
- K. Iglberger, G. Hager, J. Treibig, and U. Rüde: High Performance Smart Expression Template Math Libraries. 2nd International Workshop on New Algorithms and Programming Models for the Manycore Era (<u>APMM 2012</u>) at <u>HPCS 2012</u>, July 2-6, 2012, Madrid, Spain. <u>DOI:</u> <u>10.1109/HPCSim.2012.6266939</u>
- J. Habich, T. Zeiser, G. Hager and G. Wellein: Performance analysis and optimization strategies for a D3Q19 Lattice Boltzmann Kernel on nVIDIA GPUs using CUDA. Advances in Engineering Software and Computers & Structures 42 (5), 266–272 (2011). <u>DOI:</u> <u>10.1016/j.advengsoft.2010.10.007</u>
- J. Treibig, G. Hager and G. Wellein: Multicore architectures: Complexities of performance prediction for Bandwidth-Limited Loop Kernels on Multi-Core Architectures. <u>DOI: 10.1007/978-3-642-13872-0\_1</u>, Preprint: <u>arXiv:0910.4865</u>.
- G. Hager, G. Jost, and R. Rabenseifner: Communication Characteristics and Hybrid MPI/OpenMP Parallel Programming on Clusters of Multi-core SMP Nodes. In: Proceedings of the Cray Users Group Conference 2009 (CUG 2009), Atlanta, GA, USA, May 4-7, 2009. <u>PDF</u>
- R. Rabenseifner and G. Wellein: Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures. International Journal of High Performance Computing Applications 17, 49-62, February 2003. DOI:10.1177/1094342003017001005



# **Backup material**



# **Probing node topology**

- Standard tools
- likwid-topology

## Topology =

- Where in the machine does core #n reside? And do I have to remember this awkward numbering anyway?
- Which cores share which cache levels?
- Which hardware threads ("logical cores") share a physical core?
- Linux
  - cat /proc/cpuinfo is of limited use
  - Core numbers may change across kernels and BIOSes even on identical hardware
  - numactl --hardware prints ccNUMA node information
  - Information on caches is harder to obtain

| \$ numact | lha     | rdware           |
|-----------|---------|------------------|
| availabl  | .e: 4 n | odes (0-3)       |
| node 0 c  | pus: 0  | 1 2 3 4 5        |
| node 0 s  | ize: 8  | 189 MB           |
| node 0 f  | ree: 3  | 824 MB           |
| node 1 c  | pus: 6  | 7 8 9 10 11      |
| node 1 s  | ize: 8  | 192 MB           |
| node 1 f  | ree: 2  | 8 MB             |
| node 2 c  | pus: 1  | 8 19 20 21 22 23 |
| node 2 s  | ize: 8  | 192 MB           |
| node 2 f  | ree: 8  | 036 MB           |
| node 3 c  | pus: 1  | 2 13 14 15 16 17 |
| node 3 s  | ize: 8  | 192 MB           |
| node 3 f  | ree: 7  | 840 MB           |



## Likwid Lightweight Performance Tools

- Lightweight command line tools for Linux
- Help to face the challenges without getting in the way
- Focus on X86 architecture
- Philosophy:
  - Simple
  - Efficient
  - Portable
  - Extensible



Open source project (GPL v2):

http://code.google.com/p/likwid/





- Based on cpuid information
- Functionality:
  - Measured clock frequency
  - Thread topology
  - Cache topology
  - Cache parameters (-c command line switch)
  - ASCII art output (-g command line switch)
- Currently supported (more under development):
  - Intel Core 2 (45nm + 65 nm)
  - Intel Nehalem + Westmere (Sandy Bridge in beta phase)
  - AMD K10 (Quadcore and Hexacore)
  - AMD K8
  - Linux OS

### Output of likwid-topology -g

on one node of Cray XE6 "Hermit"

|               |          | agos processor                       | ****        | **** |
|---------------|----------|--------------------------------------|-------------|------|
| Hardware Thre |          | ****                                 |             |      |
| Sockets:      | 2        |                                      | *****       | **** |
| Cores per so  | cket: 16 | 5                                    |             |      |
| Threads per o | core: 1  |                                      |             |      |
| HWThread      |          |                                      | Socket      |      |
| 0             | 0        | 0                                    | 0           |      |
| 1             | 0        | 1                                    | 0           |      |
| 2             | 0        | 2                                    | 0           |      |
| 3             | 0        | 3                                    | 0           |      |
| []            | 0        | •                                    |             |      |
| 16            | 0        | 0                                    | 1           |      |
| 17            | 0        | 1                                    | 1           |      |
| 18            | 0        | 2<br>3                               | 1           |      |
| 19            | 0        | 3                                    | 1           |      |
| []            |          |                                      |             |      |
|               |          | 7 8 9 10 11 12 1<br>) 21 22 23 24 25 |             | 31 ) |
| ****          | *****    | ****                                 | ****        | **** |
| Cache Topolo  | ах       | ****                                 |             |      |
| Level: 1      |          |                                      |             |      |
| Size: 16 kl   | —        |                                      |             |      |
|               |          | ) (2) (3) (                          |             |      |
|               |          | 17) (18) (19                         | ) (20) (21) | (22  |
| 20) (29)      | (30)(31) |                                      |             |      |

LL5J

SAHPC 2012 Tutorial

# **Output of likwid-topology continued**



Level: 2 Size: 2 MB Cache groups: (01)(23)(45)(67)(89)(1011)(1213)(1415)(1617)(18 19) (2021) (2223) (2425) (2627) (2829) (3031) \_\_\_\_\_ Level: 3 Size: 6 MB Cache groups: (01234567) (89101112131415) (1617181920212223) (242526 27 28 29 30 31 ) NUMA Topology NUMA domains: 4 \_\_\_\_\_ Domain 0: Processors: 0 1 2 3 4 5 6 7 Memory: 7837.25 MB free of total 8191.62 MB \_\_\_\_\_ Domain 1: Processors: 8 9 10 11 12 13 14 15 Memory: 7860.02 MB free of total 8192 MB \_\_\_\_\_ Domain 2: Processors: 16 17 18 19 20 21 22 23 Memory: 7847.39 MB free of total 8192 MB \_\_\_\_\_ Domain 3: Processors: 24 25 26 27 28 29 30 31 Memory: 7785.02 MB free of total 8192 MB \_\_\_\_\_

# **Output of likwid-topology continued**



|                                         | ******                   | ******                 | ****                                   | ******                   | **                       | *****      | ***                                 | *****                    | ***                    | ****                |                                     |                          |                                                                          |                                |                          |                                   |                             |                   |                                     |                |                                                   |            |                              |                           |                                        |                                  |                               |                                  |
|-----------------------------------------|--------------------------|------------------------|----------------------------------------|--------------------------|--------------------------|------------|-------------------------------------|--------------------------|------------------------|---------------------|-------------------------------------|--------------------------|--------------------------------------------------------------------------|--------------------------------|--------------------------|-----------------------------------|-----------------------------|-------------------|-------------------------------------|----------------|---------------------------------------------------|------------|------------------------------|---------------------------|----------------------------------------|----------------------------------|-------------------------------|----------------------------------|
| ket 0:                                  |                          |                        |                                        |                          |                          |            |                                     |                          |                        |                     |                                     |                          |                                                                          |                                |                          |                                   |                             |                   |                                     |                |                                                   |            |                              |                           |                                        |                                  |                               |                                  |
| + -                                     | +                        | · +                    | -+ +                                   |                          | . +                      |            | + +                                 | +                        | + +-                   |                     | + +                                 | +                        | +.                                                                       | +                              | +-                       | +                                 | +                           |                   | + +                                 | +              | + +                                               |            | + +                          | +                         | -+                                     | +                                | -+                            | +                                |
| 0                                       | 1                        | 1 2                    | i i                                    | 3                        | i.                       | 4          | I I                                 | 5                        |                        | 6                   |                                     | 7 1                      | Ť.                                                                       | 8                              | i.                       | 9                                 | i.                          | 10                | i i                                 | 11             | Ϊİ                                                | 12         | i i                          | 13                        | i.                                     | 14                               | i.                            | ,<br>15                          |
| + -                                     | +                        | +                      | -+ +                                   |                          | + +                      |            | + +                                 | +                        | + +-                   |                     | + +                                 | +                        | +-                                                                       | +                              | +-                       | +                                 | +                           |                   | + +                                 | +              | + +                                               |            | + +                          | +                         | -+ -                                   | +                                | -+ -                          | +                                |
| + +                                     | +                        | +                      | -+ +                                   | +                        |                          |            |                                     |                          |                        |                     |                                     |                          |                                                                          |                                |                          |                                   |                             |                   |                                     |                |                                                   |            |                              | •                         |                                        |                                  | •                             | •                                |
| 16kB                                    |                          |                        |                                        |                          |                          |            | • •                                 |                          |                        |                     |                                     | 16kB                     |                                                                          |                                |                          |                                   |                             |                   |                                     |                |                                                   |            |                              |                           |                                        | •                                | - C.                          | •                                |
| + +                                     |                          | •                      |                                        |                          |                          |            |                                     |                          |                        |                     |                                     |                          |                                                                          |                                |                          |                                   |                             |                   |                                     |                |                                                   |            |                              |                           |                                        |                                  | •                             |                                  |
| 2ME                                     |                          | · +                    | 2MB                                    |                          | - +                      |            | <br>2мв                             |                          | · +-                   |                     | 2MB                                 | +                        | +                                                                        |                                | <br>ИВ                   | +                                 | +                           |                   | 2MB                                 |                | + +                                               |            | 2ME                          |                           | .+ .                                   | +                                | 2M                            |                                  |
|                                         |                          | · +                    |                                        |                          | - +                      |            |                                     |                          | - +-                   |                     |                                     | , I<br>+                 | +                                                                        |                                | _                        | +                                 | +                           |                   |                                     |                | 1 1<br>+ +                                        |            |                              | -                         | -+ -                                   | ۱<br>+                           |                               | _                                |
|                                         |                          |                        |                                        |                          |                          |            |                                     |                          |                        |                     |                                     |                          |                                                                          |                                |                          |                                   |                             |                   |                                     |                |                                                   |            |                              |                           |                                        | •                                |                               |                                  |
|                                         |                          |                        |                                        | e                        | бмв                      | i.         |                                     |                          |                        |                     |                                     | 1                        | Т                                                                        |                                |                          |                                   |                             |                   |                                     |                | 6MB                                               | 3          |                              |                           |                                        |                                  |                               |                                  |
|                                         |                          |                        |                                        |                          |                          |            |                                     |                          |                        |                     |                                     | +                        | +-                                                                       |                                |                          |                                   |                             |                   |                                     |                |                                                   |            |                              |                           |                                        |                                  |                               |                                  |
| ket 1:                                  |                          |                        |                                        |                          |                          |            |                                     |                          |                        |                     |                                     |                          |                                                                          |                                |                          |                                   |                             |                   |                                     |                |                                                   |            |                              |                           |                                        |                                  |                               |                                  |
|                                         |                          |                        |                                        |                          |                          |            |                                     |                          | ы <b>н</b> .           |                     | <u>ь</u> ц                          |                          |                                                                          |                                |                          |                                   |                             |                   |                                     |                |                                                   |            |                              |                           |                                        | +                                |                               |                                  |
| + +                                     |                          |                        | -+ +                                   |                          | + +                      |            |                                     |                          | • •                    |                     |                                     |                          |                                                                          |                                |                          |                                   | · ·                         |                   | + +                                 |                | + +                                               |            | + +                          | •                         | -+ -                                   | •                                | -+ -                          | •                                |
| 16                                      | 17                       | 18                     | -+ +                                   | +<br>19                  | +                        | 20         |                                     | 21                       | i i                    | 22                  | i i                                 | 23 I                     | ÷.                                                                       | 24                             | i.                       | 25 I                              | i.                          | 26                | i i                                 | 27             | i i                                               | 28         | + +                          | 29                        | ļ                                      | 30                               | 1                             | +<br>  31                        |
| 16    <br>+ +                           | 17                       | 18<br>+                | i i<br>-+ +                            | 19  <br>+                | <br>- +                  | 20         | · ·<br>   <br>+ +                   | 21  <br>+                | <br>  +-               | 22                  | <br>+ +                             | 23  <br>+                | <br> +-                                                                  | 24  <br>+                      | i<br>+-                  | 25  <br>+                         | i<br>+                      | 26<br>            | i i<br>+ +                          | 27             | i i<br>+ +                                        | 28         | <br>+ +                      | 29<br>+                   | .+ .                                   | 30<br>+                          | <br>-+ ·                      | ,<br>  31<br>+                   |
| 16    <br>+ +<br>+ +                    | 17  <br>+                | 18<br>+                | <br>-+ +<br>-+ +                       | 19  <br>+                | <br> <br>  +             | 20         | <br>   <br>+ +<br>+ +               | 21  <br>+                | <br>  +-               | 22                  | <br>   <br>+ +<br>+ +               | 23  <br>+                | <br> ++<br> ++                                                           | 24  <br>+                      | <br>+-<br>+-             | 25  <br>+                         | <br>+                       | 26<br>            | <br>+ +<br>+ +                      | 27             | <br>+ +<br>+ +                                    | 28         | <br>+ +<br>+ +               | 29<br>+                   | <br>-+ .                               | 30<br>+                          | <br> -+                       | , 31<br>+                        |
| 16    <br>+ +<br>16kB                   | 17  <br>+<br>16kB        | 18<br>+<br>+<br>  16kB | <br>-+ +<br>-+ +<br>                   | 19  <br>+<br>16kB        | - +<br>- +               | 20<br>     | · · ·<br>+ +<br>+ +                 | 21  <br>+<br>16kB        | · ·<br>   <br>  +-<br> | 22<br>16kB          | <br>   <br>+ +<br>                  | 23  <br>+<br>16kB        | ·<br>+·<br>+·                                                            | 24  <br>+<br>16kB              | <br>+-<br>+-<br>         | 25  <br>+<br>16kB                 | <br>+<br>  1                | 26<br><br>6kB     | · · ·<br>+ +<br>+ +                 | 27<br><br>16kB | · · ·<br>+ +<br>+ +                               | 28<br>16kB | · · ·<br>+ ·<br>+ ·          | 29<br>+<br>+<br>16kB      | <br>-+ .<br>                           | 30<br>+<br>+<br>  16kB           | <br> -+                       | 31<br>+<br>+<br>  16k            |
| 16    <br>+ +<br>16kB                   | 17  <br>                 | 18<br>+<br>+<br>  16kB | <br>-+ +<br>-+ +<br>                   | 19  <br>+<br>16kB        | <br>+ +<br> <br>         | 20<br>16kB | + +<br>+ +<br>                      | 21  <br>+<br>16kB        | · · ·<br>· +-<br>· +-  | 22<br>16kB          | + +<br>+ +<br>                      | 23  <br>+<br>16kB        | · +· +· +· +· +·                                                         | 24  <br>+<br>16kB  <br>+       | <br>+-<br>+-<br>         | 25  <br>+<br>16kB  <br>+          | <br>+<br>  1<br>+           | 26<br><br>6kB<br> | <br>+ +<br>+ +<br>                  | 27<br><br>16kB | <br>+ +<br>+ +<br>                                | 28<br>16kB | <br>+ +<br>+ +<br>           | 29<br>+<br>+<br>16kB<br>+ | <br>-+ ·<br>-+ ·                       | 30<br>+<br>  16kB<br>+           | <br>+ -<br>-+ -<br>3  <br>+ - | 31<br>+<br>+<br>  16k:           |
| 16    <br>+ +<br>16kB                   | 17  <br>                 | 18<br>+<br>  16kB      | <br>-+ +<br>-+ +<br>                   | 19  <br>+<br>16kB  <br>+ | <br>+ +<br> <br>         | 20<br>16kB | + +<br>+ +<br>                      | 21  <br>+<br>16kB  <br>+ | · · ·<br>· +-<br>· +-  | 22<br>16kB          | + +<br>+ +<br>                      | 23  <br>+<br>16kB  <br>+ | · +· +· +· +· +·                                                         | 24  <br>+<br>16kB  <br>+       | <br>+-<br>+-<br>         | 25  <br>+<br>16kB  <br>+          | <br>+<br>  1<br>+           | 26<br><br>6kB<br> | <br>+ +<br>+ +<br>                  | 27<br><br>16kB | <br>+ +<br>+ +<br>                                | 28<br>16kB | <br>+ +<br>+ +<br>           | 29<br>+<br>16kB           | <br>-+ ·<br>-+ ·                       | 30<br>+<br>  16kB<br>+           | <br>+ -<br>-+ -<br>3  <br>+ - | 31<br>+<br>+<br>  16k:<br>+      |
| 16    <br>+ +<br>16kB    <br>+ +        | 17  <br>+<br>16kB  <br>+ | 18<br>+<br>  16kB      | <br>-+ +<br>-+ +<br>   <br>-+ +<br>2MB | 19  <br>                 | + +<br>+ +<br>+ +<br>+ + | 20<br>16kB | <br>+ +<br>+ +<br>   <br>+ +<br>2MB | 21  <br>+<br>16kB  <br>+ |                        | 22  <br>16kB  <br>2 | <br>+ +<br>+ +<br>   <br>+ +<br>2MB | 23  <br>+<br>16kB  <br>+ | · + + + + + + + + + + + + + + + + + + +                                  | 24  <br>+<br>16kB  <br>+<br>21 | <br>+-<br> <br>+-<br>(B  | 25  <br>+<br>16kB  <br>+<br>      | <br>+<br>  1<br>+<br>       | 26<br><br>6kB<br> | <br>+ +<br>+ +<br>   <br>+ +<br>2MB | 27<br><br>16kB | <br>+ +<br>+ +<br>   <br>+ +<br>+ +               | 28<br>16kB | <br>+ +<br>   <br>+ +<br>2ME | 29<br>+<br>  16kB<br>+    | -+ -<br>-+ -<br> <br>-+ -<br> <br>-+ - | 30<br>+<br>  16kB<br>+           | <br>-+ -<br>3  <br>-+ -<br>2M | 31<br>+<br>+<br>  16k:<br>+<br>B |
| 16    <br>+ +<br>16kB    <br>+ +<br>2ME | 17  <br>+<br>16kB  <br>+ | 18<br>+<br>  16kB<br>+ | <br>-+ +<br>-+ +<br>   <br>-+ +<br>2MB | 19  <br>                 | + +<br>+ +<br>+ +<br>+ + | 20<br>16kB | <br>+ +<br>+ +<br>   <br>+ +<br>2MB | 21  <br>+<br>16kB  <br>+ |                        | 22  <br>16kB  <br>2 | <br>+ +<br>+ +<br>   <br>+ +<br>2MB | 23  <br>                 | $\begin{array}{c} \cdot \\ + \\ + \\ + \\ + \\ + \\ + \\ + \\ + \\ + \\$ | 24  <br>+<br>16kB  <br>+<br>21 | <br>+-<br> <br>+-<br>(1B | 25  <br>+<br>16kB  <br>+<br> <br> | <br>+<br>  1<br>+<br> <br>+ | 26<br><br>6kB<br> | <br>+ +<br>   <br>+ +<br>2MB        | 27<br>16kB     | <br>+ +<br>   <br>+ +<br>   <br>+ +<br>   <br>+ + | 28<br>16kB | <br>+ +<br>   <br>+ +<br>2MB | 29<br><br>16kB<br>        | -+ ·<br>-+ ·<br>-+ ·<br>-+ ·           | 30<br>+<br>  16kE<br>+<br> <br>+ | <br>-+ -<br>3  <br>-+ -<br>2M | 31<br>+<br>  16k:<br>+<br>B<br>B |



# Enforcing thread/process-core affinity under the Linux OS

Standard tools and OS affinity facilities under program control

likwid-pin

# **Motivation: STREAM benchmark on 12-core Intel Westmere**







### SAHPC 2012 Tutorial

### **Generic thread/process-core affinity under Linux** *Overview*



- taskset [OPTIONS] [MASK | -c LIST ] \
   [PID | command [args]...]
- taskset binds processes/threads to a set of CPUs. Examples:

```
taskset 0x0006 ./a.out
taskset -c 4 33187
mpirun -np 2 taskset -c 0,2 ./a.out # doesn't always work
```

- Processes/threads can still move within the set!
- Alternative: let process/thread bind itself by executing syscall #include <sched.h> int sched\_setaffinity(pid\_t pid, unsigned int len, unsigned long \*mask);
- Disadvantage: which CPUs should you bind to on a non-exclusive machine?
- Still of value on multicore/multisocket cluster nodes, UMA or ccNUMA



Complementary tool: numactl

Example: numactl --physcpubind=0,1,2,3 command [args] Bind process to specified physical core numbers

Example: numactl --cpunodebind=1 command [args] Bind process to specified ccNUMA node(s)

- Many more options (e.g., interleave memory across nodes)
  - $\rightarrow$  see section on ccNUMA optimization
- Diagnostic command (see earlier): numactl --hardware
- Again, this is not suitable for a shared machine



# Highly OS-dependent system calls

But available on all systems

```
Linux: sched_setaffinity(), PLPA (see below) → hwloc
Solaris: processor_bind()
Windows: SetThreadAffinityMask()
```

- Support for "semi-automatic" pinning in some compilers/environments
  - Intel compilers > V9.1 (KMP\_AFFINITY environment variable)
  - PGI, Pathscale, GNU
  - SGI Altix dplace (works with logical CPU numbers!)
  - Generic Linux: taskset, numactl, likwid-pin (see below)

# Affinity awareness in MPI libraries

- SGI MPT
- OpenMPI
- Intel MPI
- •



If combined with OpenMP, issues may arise

#### SAHPC 2012 Tutorial

#### Likwid-pin Overview



- Part of the LIKWID tool suite: <u>http://code.google.com/p/likwid</u>
- Pins processes and threads to specific cores without touching code
- Directly supports pthreads, gcc OpenMP, Intel OpenMP
  - Detects OpenMP implementation automatically
- Based on combination of wrapper tool together with overloaded pthread library 
   → binary must be dynamically linked!
- Can also be used as a superior replacement for taskset

- Usage examples:
  - Physical numbering: likwid-pin -c 0,2,4-6 ./myApp parameters
  - Logical numbering (4 cores on socket 0) with "skip mask" specified: likwid-pin -s 3 -c S0:0-3 ./myApp parameters



## Running the STREAM benchmark with likwid-pin:





- Core numbering may vary from system to system even with identical hardware
  - Likwid-topology delivers this information, which can then be fed into likwidpin
- Alternatively, likwid-pin can abstract this variation and provide a purely logical numbering (physical cores first)



Across all cores in the node:

OMP\_NUM\_THREADS=8 likwid-pin -c N:0-7 ./a.out

Across the cores in each socket and across sockets in each node: OMP\_NUM\_THREADS=8 likwid-pin -c S0:0-3@S1:0-3 ./a.out

# Likwid-pin Using logical core numbering







- How do you manage affinity with MPI or hybrid MPI/threading?
- In the long run a unified standard is needed
- Till then, likwid-mpirun provides a portable/flexible solution
- The examples here are for Intel MPI/OpenMP programs, but are also applicable to other threading models

# Pure MPI:

\$ likwid-mpirun -np 16 -nperdomain S:2 ./a.out

# Hybrid:

\$ likwid-mpirun -np 16 -pin S0:0,1\_S1:0,1 ./a.out

#### likwid-mpirun 1 MPI process per node



#### likwid-mpirun -np 2 -pin N:0-11 ./a.out



#### Intel MPI+compiler:

OMP\_NUM\_THREADS=12 mpirun -ppn 1 -np 2 -env KMP\_AFFINITY scatter ./a.out

#### SAHPC 2012 Tutorial

#### likwid-mpirun 1 MPI process per socket



#### likwid-mpirun -np 4 -pin S0:0-5\_S1:0-5 ./a.out



Intel MPI+compiler:

```
OMP_NUM_THREADS=6 mpirun -ppn 2 -np 4 \
-env I_MPI_PIN_DOMAIN socket -env KMP_AFFINITY scatter ./a.out
```



- Iikwid-mpirun can optionally set up likwid-perfctr for you
- \$ likwid-mpirun -np 16 -nperdomain S:2 -perf FLOPS\_DP \
   -marker -mpi intelmpi ./a.out
- likwid-mpirun generates an intermediate perl script which is called by the native MPI start mechanism
- According the MPI rank the script pins the process and threads
- If you use perfctr after the run for each process a file in the format Perf-<hostname>-<rank>.txt

Its output which contains the perfctr results.

 In the future analysis scripts will be added which generate reports of the raw data (e.g. as html pages)





# Best practices for using hardware performance metrics

likwid-perfctr

# **Probing performance behavior**

# **FF==**

# How do we find out about the performance properties and requirements of a parallel code?

Profiling via advanced tools is often overkill

# A coarse overview is often sufficient

- likwid-perfctr (similar to "perfex" on IRIX, "hpmcount" on AIX, "lipfpm" on Linux/Altix)
- Simple end-to-end measurement of hardware performance metrics
- Operating modes:
  - Wrapper
  - Stethoscope
  - Timeline
  - Marker API
- Preconfigured and extensible metric groups, list with
   likwid-perfctr -a

```
BRANCH: Branch prediction miss rate/ratio
CACHE: Data cache miss rate/ratio
CLOCK: Clock of cores
DATA: Load to store ratio
FLOPS_DP: Double Precision MFlops/s
FLOPS_SP: Single Precision MFlops/s
FLOPS_X87: X87 MFlops/s
L2: L2 cache bandwidth in MBytes/s
L2CACHE: L2 cache miss rate/ratio
L3: L3 cache bandwidth in MBytes/s
L3CACHE: L3 cache miss rate/ratio
MEM: Main memory bandwidth in MBytes/s
TLB: TLB miss rate/ratio
```

# **likwid-perfctr** *Example usage with preconfigured metric group*





#### SAHPC 2012 Tutorial



# Things to look at (in roughly this order)

- Load balance (flops, instructions, BW)
- In-socket memory BW saturation
- Shared cache BW saturation
- Flop/s, loads and stores per flop metrics
- SIMD vectorization
- CPI metric
- # of instructions, branches, mispredicted branches

### Caveats

- Load imbalance may not show in CPI or # of instructions
  - Spin loops in OpenMP barriers/MPI blocking calls
  - Looking at "top" or the Windows Task Manager does not tell you anything useful
- In-socket performance saturation may have various reasons
- Cache miss metrics are overrated
  - If I really know my code, I can often calculate the misses
  - Runtime and resource utilization is much more important

# likwid-perfctr Identify load imbalance...



- Instructions retired / CPI may not be a good indication of useful workload – at least for numerical / FP intensive codes....
- Floating Point Operations Executed is often a better indicator
- Waiting / "Spinning" in barrier generates a high instruction count





#### env OMP\_NUM\_THREADS=6 likwid-perfctr -t intel -C S0:0-5 -g FLOPS\_DP ./a.out





Iikwid-perfctr counts events on cores; it has no notion of what kind of code is running (if any)

This enables to listen on what currently happens without any overhead:

likwid-perfctr -c N:0-11 -g FLOPS\_DP -s 10

- It can be used as cluster/server monitoring tool
- A frequent use is to measure a certain part of a long running parallel application from outside



Iikwid-perfctr supports time resolved measurements of full node: likwid-perfctr -c N:0-11 -g MEM -d 50ms > out.txt



SAHPC 2012 Tutorial

# likwid-perfctr Marker API



- To measure only parts of an application a marker API is available.
- The API only turns counters on/off. The configuration of the counters is still done by likwid-perfctr application.
- Multiple named regions can be measured
- Results on multiple calls are accumulated
- Inclusive and overlapping Regions are allowed

```
likwid_markerInit(); // must be called from serial region
likwid_markerStartRegion("Compute");
....
likwid_markerStopRegion("Compute");
likwid_markerStartRegion("postprocess");
....
likwid_markerStopRegion("postprocess");
```

likwid\_markerClose(); // must be called from serial region

# **likwid-perfctr** *Group files*



SHORT PSTI EVENTSET FIXCO INSTR RETIRED ANY FIXC1 CPU CLK UNHALTED CORE FIXC2 CPU CLK UNHALTED REF FP COMP OPS EXE SSE FP PACKED PMC0 FP COMP OPS EXE SSE FP SCALAR PMC1 FP COMP OPS EXE SSE SINGLE PRECISION PMC2 FP COMP OPS EXE SSE DOUBLE PRECISION PMC3 UPMCO UNC QMC NORMAL READS ANY UPMC1 UNC QMC WRITES FULL ANY UPMC2 UNC QHL REQUESTS REMOTE READS UPMC3 UNC QHL REQUESTS LOCAL READS METRICS Runtime [s] FIXC1\*inverseClock CPI FIXC1/FIXC0 Clock [MHz] 1.E-06\*(FIXC1/FIXC2)/inverseClock DP MFlops/s (DP assumed) 1.0E-06\*(PMC0\*2.0+PMC1)/time Packed MUOPS/s 1.0E-06\*PMC0/time Scalar MUOPS/s 1.0E-06\*PMC1/time SP MUOPS/s 1.0E-06\*PMC2/time DP MUOPS/s 1.0E-06\*PMC3/time Memory bandwidth [MBytes/s] 1.0E-06\*(UPMC0+UPMC1)\*64/time; Remote Read BW [MBytes/s] 1.0E-06\*(UPMC2)\*64/time; LONG Formula:

Groups are architecture-specific

- They are defined in simple text files
- Code is generated on recompile of likwid
- likwid-perfctr -a outputs list of groups
- For every group an extensive documentation is available

DP MFlops/s = (FP\_COMP\_OPS\_EXE\_SSE\_FP\_PACKED\*2 + FP\_COMP\_OPS\_EXE\_SSE\_FP\_SCALAR) / runtime.



# Measuring energy consumption with LIKWID

| Measuring e<br>likwid-powern | ГГ⊒Е                                                                 |  |
|------------------------------|----------------------------------------------------------------------|--|
| •                            | s Intel RAPL interface (Sandy Bridge)<br>unning average power limit" |  |
| CPU name:                    |                                                                      |  |
| CPU clock:                   | 3.49 GHz                                                             |  |
| Base clock:                  | 3500.00 MHz                                                          |  |
| Minimal clock:               | 1600.00 MHz                                                          |  |
| Turbo Boost Ste              | eps:                                                                 |  |
| C1 3900.00 MHz               |                                                                      |  |
| C2 3800.00 MHz               |                                                                      |  |
| C3 3700.00 MHz               |                                                                      |  |
| C4 3600.00 MHz               |                                                                      |  |
| Thermal Spec Po              | ower: 95 Watts                                                       |  |
| Minimum Power:               | : 20 Watts                                                           |  |
| Maximum Power:               | : 95 Watts                                                           |  |
| Maximum Time V               | Nindow: 0.15625 micro sec                                            |  |
|                              |                                                                      |  |

SAHPC 2012 Tutorial

## **Example:**

#### A medical image reconstruction code on Sandy Bridge







# Sandy Bridge EP (8 cores, 2.7 GHz base freq.)

| Runtime [s] | Power [W]                      |                                                               | Energy [J]                                                                        |
|-------------|--------------------------------|---------------------------------------------------------------|-----------------------------------------------------------------------------------|
| 90.43       | 90                             | Fas<br><b>∳</b> le                                            | 8110                                                                              |
| 29.63       | 93                             | ss e                                                          | 2750                                                                              |
| 22.61       | 102                            | code                                                          | 2300                                                                              |
| 18.42       | 111                            |                                                               | 2040                                                                              |
|             | <b>90.43</b><br>29.63<br>22.61 | 90.43       90         29.63       93         22.61       102 | 90.43       90       ♥ less energy         29.63       93         22.61       102 |

SAHPC 2012 Tutorial