For final slides and example code see:

# http://goo.gl/0AARRW



# Node-Level Performance Engineering

Georg Hager, Jan Treibig, Gerhard Wellein

Erlangen Regional Computing Center (RRZE) and Department of Computer Science University of Erlangen-Nuremberg

SC14 full-day tutorial November 17, 2014 New Orleans, LA, USA



qrme.com



## **Agenda**



| ВНа | - | Preliminaries                                                      | 08:30 |  |  |
|-----|---|--------------------------------------------------------------------|-------|--|--|
|     |   | Introduction to multicore architecture                             |       |  |  |
|     |   | <ul><li>Cores, caches, chips, sockets, ccNUMA, SIMD</li></ul>      |       |  |  |
|     |   | Multicore tools                                                    | 10:00 |  |  |
| 5   |   | Microbenchmarking for architectural exploration                    | 10:30 |  |  |
|     |   | <ul> <li>Streaming benchmarks</li> </ul>                           |       |  |  |
|     |   | Roadblocks for scalability: Saturation effects and OpenMP overhead | d     |  |  |
| GHa |   | Node-level performance modeling (part I)                           |       |  |  |
|     |   | The Roofline Model and sparse MVM                                  | 12:00 |  |  |
|     |   | Lunch break                                                        |       |  |  |
|     |   | Node-level performance modeling (part II)                          | 13:30 |  |  |
|     |   | Case study: 3D Jacobi solver and model-guided optimization         |       |  |  |
|     |   | Optimal resource utilization                                       |       |  |  |
|     |   | <ul><li>SIMD parallelism</li></ul>                                 | 15:00 |  |  |
| 4   |   | <ul><li>ccNUMA</li></ul>                                           | 15:30 |  |  |
|     |   | <ul><li>Simultaneous multi-threading (SMT)</li></ul>               |       |  |  |
|     | _ | Pattern-driven performance engineering                             | 17:00 |  |  |

#### **A conversation**



From a student seminar on "Efficient programming of modern multi- and manycore processors"

**Student**: I have implemented this algorithm on the GPGPU, and it

solves a system with 26546 unknowns in 0.12 seconds,

so it is really fast.

**Me**: What makes you think that 0.12 seconds is fast?

**Student**: It is fast because my baseline C++ code on the CPU is about

20 times slower.



# Prelude: Scalability 4 the win!

#### Scalability Myth: Code scalability is the key issue



```
!$OMP PARALLEL DO
do k = 1 , Nk
  do j = 1 , Nj; do i = 1 , Ni
      y(i,j,k) = b*(x(i-1,j,k) + x(i+1,j,k) + x(i,j-1,k) +
                        x(i,j+1,k) + x(i,j,k-1) + x(i,j,k+1)
     enddo; enddo
enddo
                                   3D Stencil Update
!$OMP END PARALLEL DO
                                      ("Jacobi")
Changing only the compile
options makes this code
                                   ■ Version 1
                           Speed-Up
                                     Version 2
scalable on an 8-core chip
                                                          Prepared for
                                                          the highly
                                                          parallel era!
                                                           -03 - xAVX
      Memory
```

6

#cores

#### Scalability Myth: Code scalability is the key issue





#### Questions to ask in high performance computing



- Do I understand the performance behavior of my code?
  - Does the performance match a model I have made?
- What is the optimal performance for my code on a given machine?
  - High Performance Computing == Computing at the bottleneck
- Can I change my code so that the "optimal performance" gets higher?
  - Circumventing/ameliorating the impact of the bottleneck
- My model does not work what's wrong?
  - This is the good case, because you learn something
  - Performance monitoring / microbenchmarking may help clear up the situation

#### **How model-building works: Physics**



#### **Newtonian mechanics**



Fails @ small scales!



Relativistic quantum field theory

 $U(1)_{V} \otimes SU(2)_{L} \otimes SU(3)_{C}$ 



# There is no alternative to knowing what is going on between your code and the hardware

Without performance modeling, optimizing code is like stumbling in the dark



# Introduction: Modern node architecture

Multi- and manycore chips and nodes

A glance at basic core fatures

Caches and data transfers through the memory hierarchy

**Memory organization** 

**Accelerators** 

**Programming models** 

#### Multi-Core: Intel Xeon 2600 (2012)



Xeon 2600 "Sandy Bridge EP":8 cores running at 2.7 GHz (max 3.2 GHz)



- Simultaneous Multithreading→ reports as 16-way chip
- 2.3 Billion Transistors / 32 nm
- Die size: 435 mm²



2-socket server



#### General-purpose cache based microprocessor core





- Implements "Stored Program Computer" concept (Turing 1936)
- Similar designs on all modern systems
- (Still) multiple potential bottlenecks



#### Pipelining of arithmetic/functional units



#### Idea:

- Split complex instruction into several simple / fast steps (stages)
- Each step takes the same amount of time, e.g. a single cycle
- Execute different steps on different instructions at the same time (in parallel)

#### • Allows for shorter cycle times (simpler logic circuits), e.g.:

- floating point multiplication takes 5 cycles, but
- processor can work on 5 different multiplications simultaneously
- one result at each cycle after the pipeline is full

#### Drawback:

- Pipeline must be filled startup times (#Instructions >> pipeline steps)
- Efficient use of pipelines requires large number of independent instructions → instruction level parallelism
- Requires complex instruction scheduling by compiler/hardware softwarepipelining / out-of-order
- Pipelining is widely used in modern computer architectures

#### 5-stage Multiplication-Pipeline: A(i)=B(i)\*C(i); i=1,...,N





First result is available after 5 cycles (=latency of pipeline)! Wind-up/-down phases: Empty pipeline stages

#### **Pipelining: The Instruction pipeline**



 Besides arithmetic & functional unit, instruction execution itself is pipelined also, e.g.: one instruction performs at least 3 steps:



- Branches can stall this pipeline! (Speculative Execution, Predication)
- Each unit is pipelined itself (e.g., Execute = Multiply Pipeline)

#### **Superscalar Processors – Instruction Level Parallelism**



Multiple units enable use of Instrucion Level Parallelism (ILP): Instruction stream is "parallelized" on the fly



- Issuing m concurrent instructions per cycle: m-way superscalar
- Modern processors are 3- to 6-way superscalar & can perform 2 or 4 floating point operations per cycles

#### **Core details: Simultaneous multi-threading (SMT)**







#### Core details: SIMD processing



- Single Instruction Multiple Data (SIMD) operations allow the concurrent execution of the same operation on "wide" registers
- x86 SIMD instruction sets:
  - SSE: register width = 128 Bit → 2 double precision floating point operands
  - AVX: register width = 256 Bit → 4 double precision floating point operands

Adding two registers holding double precision floating point



#### Registers and caches: Data transfers in a memory hierarchy



- How does data travel from memory to the CPU and back?
- Remember: Caches are organized in cache lines (e.g., 64 bytes)
- Only complete cache lines are transferred between memory hierarchy levels (except registers)
- MISS: Load or store instruction does not find the data in a cache level
   → CL transfer required

Example: Array copy A(:)=C(:)



#### Commodity cluster nodes: From UMA to ccNUMA

Basic architecture of commodity compute cluster nodes



#### Yesterday (2006): UMA



Uniform Memory Architecture (UMA)

Flat memory; symmetric MPs

But: system "anisotropy"

#### Today: ccNUMA



Cache-coherent Non-Uniform Memory Architecture (ccNUMA)

HT / QPI provide scalable bandwidth at the price of ccNUMA architectures: Where does my data finally end up?

#### There is no single driving force for chip performance!





Floating Point (FP) Performance:

$$P = n_{core} * F * S * v$$

core

number of cores:

F

FP instructions per cycle: 2 (1 MULT and 1 ADD)

"Sandy Bridge EP" socket 4,6,8 core variants available

FP ops / instruction: 4 (dp) / 8 (sp) (256 Bit SIMD registers – "AVX")

V

S

Clock speed:

~2.7 GHz

TOP500 rank 1 (1995)

P = 173 GF/s (dp) / 346 GF/s (sp)

But: P=5.4 GF/s (dp) for serial, non-SIMD code



#### Interlude:

A glance at current accelerator technology

# **NVIDIA Kepler GK110 Block Diagram**



#### **Architecture**

- 7.1B Transistors
- 15 "SMX" units
  - 192 (SP) "cores" each
- > 1 TFLOP DP peak
- 1.5 MB L2 Cache
- 384-bit GDDR5
- PCI Express Gen3
- 3:1 SP:DP performance



© NVIDIA Corp. Used with permission.

## Intel Xeon Phi block diagram



#### **Architecture**

- 3B Transistors
- 60+ cores
- 512 bit SIMD
- ≈ 1 TFLOP DP
  peak
- 0.5 MB
   L2/core
- GDDR5
- 2:1 SP:DP performance



#### **Comparing accelerators**



#### Intel Xeon Phi

- 60+ IA32 cores each with 512 Bit SIMD FMA unit → 480/960 SIMD DP/SP tracks
- Clock Speed: ~1000 MHz
- Transistor count: ~3 B (22nm)
- Power consumption: ~250 W



- Peak Performance (DP): ~ 1 TF/s
- Memory BW: ~250 GB/s (GDDR5)
- Threads to execute: 60-240+
- Programming: Fortran/C/C++ +OpenMP + SIMD
- TOP7: "Stampede" at Texas Center for Advanced Computing

#### NVIDIA Kepler K20

■ 15 SMX units each with 192 "cores" → 960/2880 DP/SP "cores"



- Clock Speed: ~700 MHz
- Transistor count: 7.1 B (28nm)
- Power consumption: ~250 W
- Peak Performance (DP): ~ 1.3 TF/s
- Memory BW: ~ 250 GB/s (GDDR5)
- Threads to execute: 10,000+
- Programming: CUDA, OpenCL, (OpenACC)
- TOP1: "Titan" at Oak Ridge National Laboratory

**TOP500** 

rankings

Nov 2012

# Trading single thread performance for parallelism: *GPGPUs vs. CPUs*



**GPU vs. CPU light speed estimate:** 

1. Compute bound: 2-10x

2. Memory Bandwidth: 1-5x





|                    | Intel Core i5 – 2500<br>("Sandy Bridge") | Intel Xeon E5-2680 DP node ("Sandy Bridge") | NVIDIA K20x<br>("Kepler") |
|--------------------|------------------------------------------|---------------------------------------------|---------------------------|
| Cores@Clock        | 4 @ 3.3 GHz                              | 2 x 8 @ 2.7 GHz                             | 2880 @ 0.7 GHz            |
| Performance+/core  | 52.8 GFlop/s                             | 43.2 GFlop/s                                | 1.4 GFlop/s               |
| Threads@STREAM     | <4                                       | <16                                         | >8000?                    |
| Total performance+ | 210 GFlop/s                              | 691 GFlop/s                                 | 4,000 GFlop/s             |
| Stream BW          | 18 GB/s                                  | 2 x 40 GB/s                                 | 168 GB/s (ECC=1)          |
| Transistors / TDP  | 1 Billion* / 95 W                        | 2 x (2.27 Billion/130W)                     | 7.1 Billion/250W          |

<sup>+</sup> Single Precision

Complete compute device

<sup>\*</sup> Includes on-chip GPU and PCI-Express



# Node topology and programming models

## Parallelism in a modern compute node



Parallel and shared resources within a shared-memory node



#### Parallel resources:

- Execution/SIMD units 1
- Cores 2
- Inner cache levels 3
- Sockets / ccNUMA domains 4
- Multiple accelerators 5

#### **Shared resources:**

- Outer cache level per socket 6
- Memory bus per socket 7
- Intersocket link 8
- PCle bus(es)
- Other I/O resources 10

How does your application react to all of those details?

#### **Parallel programming models**

#### on modern compute nodes



#### Shared-memory (intra-node)

- Good old MPI (current standard: 3.0)
- OpenMP (current standard: 4.0)
- POSIX threads
- Intel Threading Building Blocks (TBB)
- Cilk+, OpenCL, StarSs,... you name it

#### "Accelerated"

- OpenMP 4.0
- CUDA
- OpenCL
- OpenACC

#### Distributed-memory (inter-node)

- MPI (current standard: 3.0)
- PVM (gone)
- Hybrid
  - Pure MPI + X, X == <you name it>

All models require awareness of topology and affinity issues for getting best performance out of the machine!

#### **Parallel programming models:**

#### Pure MPI





- → Very simple programming model
- → MPI "knows what to do"!?

#### Performance issues

- Intranode vs. internode MPI
- Node/system topology







#### **Parallel programming models:**

Pure threading on the node



#### Machine structure is invisible to user

- → Very simple programming model
- Threading SW (OpenMP, pthreads, TBB,...) should know about the details

#### Performance issues

- Synchronization overhead
- Memory access
- Node topology





#### Parallel programming models: Lots of choices

Hybrid MPI+OpenMP on a multicore multisocket cluster



#### One MPI process / node

One MPI process / socket: OpenMP threads on same socket: "blockwise"

> OpenMP threads pinned "round robin" across cores in node

Two MPI processes / socket
OpenMP threads
on same socket



#### **Conclusions about architecture**



- Modern computer architecture has a rich "topology"
- Node-level hardware parallelism takes many forms
  - Sockets/devices CPU: 1-8, GPGPU: 1-6
  - Cores moderate (CPU: 4-16) to massive (GPGPU: 1000's)
  - SIMD moderate (CPU: 2-8) to massive (GPGPU: 10's-100's)
  - Superscalarity (CPU: 2-6)
- Exploiting performance: parallelism + bottleneck awareness
  - "High Performance Computing" == computing at a bottleneck
- Performance of programming models is sensitive to architecture
  - Topology/affinity influences overheads
  - Standards do not contain (many) topology-aware features
  - Apart from overheads, performance features are largely independent of the programming model



#### **Multicore Performance and Tools**

# **Probing node topology**

- Standard tools
- likwid-topology

### How do we figure out the node topology?



#### Topology =

- Where in the machine does core #n reside? And do I have to remember this awkward numbering anyway?
- Which cores share which cache levels?
- Which hardware threads ("logical cores") share a physical core?

#### Linux

- cat /proc/cpuinfo is of limited use
- Core numbers may change across kernels and BIOSes even on identical hardware
- numactl --hardware prints ccNUMA node information
- Information on caches is harder to obtain

```
$ numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 8189 MB
node 0 free: 3824 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 8192 MB
node 1 free: 28 MB
node 2 cpus: 18 19 20 21 22 23
node 2 size: 8192 MB
node 3 cpus: 12 13 14 15 16 17
node 3 size: 8192 MB
```

node 3 free: 7840 MB

## How do we figure out the node topology?



LIKWID tool suite:

Like
I
Knew
What
I'm
Doing

Open source tool collection (developed at RRZE):

http://code.google.com/p/likwid



J. Treibig, G. Hager, G. Wellein: *LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments.*PSTI2010, Sep 13-16, 2010, San Diego, CA http://arxiv.org/abs/1004.4431

### **Likwid Tool Suite**



#### Command line tools for Linux:

- easy to install
- works with standard linux 2.6 kernel
- simple and clear to use
- supports Intel and AMD CPU



#### Current tools:

- likwid-topology: Print thread and cache topology
- likwid-pin: Pin threaded application without touching code
- likwid-perfctr: Measure performance counters
- likwid-mpirun: mpirun wrapper script for easy LIKWID integration
- likwid-bench: Low-level bandwidth benchmark generator tool
- ... some more

# Output of likwid-topology -g

on one node of Cray XE6 "Hermit"



```
CPU type: AMD Interlagos processor
Hardware Thread Topology
Sockets:
Cores per socket: 16
Threads per core: 1
HWThread
            Thread
                         Core
                                      Socket
0
                         0
                                      0
                                      0
[\ldots]
16
                                      1
                                      1
17
18
19
[...]
Socket 0: ( 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 )
Socket 1: ( 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 )
******************
Cache Topology
Level: 1
Size: 16 kB
Cache groups: (0)(1)(2)(3)(4)(5)(6)(7)(8)(9)(10)(11)(12)(13
) (14) (15) (16) (17) (18) (19) (20) (21) (22) (23) (24) (25) (26) (27) (
28 ) (29 ) (30 ) (31 )
```

# **Output of likwid-topology continued**



```
Level: 2
Size:
      2 MB
Cache groups: (01)(23)(45)(67)(89)(1011)(1213)(1415)(1617)(18
19 ) ( 20 21 ) ( 22 23 ) ( 24 25 ) ( 26 27 ) ( 28 29 ) ( 30 31 )
Level: 3
Size: 6 MB
Cache groups: (0 1 2 3 4 5 6 7 ) (8 9 10 11 12 13 14 15 ) (16 17 18 19 20 21 22 23 ) (24 25 26
27 28 29 30 31 )
********************
NUMA Topology
**********************
NUMA domains: 4
Domain 0:
Processors: 0 1 2 3 4 5 6 7
Memory: 7837.25 MB free of total 8191.62 MB
Domain 1:
Processors: 8 9 10 11 12 13 14 15
Memory: 7860.02 MB free of total 8192 MB
Domain 2:
Processors: 16 17 18 19 20 21 22 23
Memory: 7847.39 MB free of total 8192 MB
Domain 3:
Processors: 24 25 26 27 28 29 30 31
Memory: 7785.02 MB free of total 8192 MB
```

# **Output of likwid-topology continued**



| + +                        | + ++                                | +                     | +           | -+ +-      | +<br>5 I      | +                   | -+             | +                           | + +                   | ·+                   | +                 | +<br>9 I               | +<br>I 10                | -+                 | +<br>  11                | + -       | +<br>l 12                | + +                            | 13   | +-                        | +<br>14 I | + +-                                    | 15         |
|----------------------------|-------------------------------------|-----------------------|-------------|------------|---------------|---------------------|----------------|-----------------------------|-----------------------|----------------------|-------------------|------------------------|--------------------------|--------------------|--------------------------|-----------|--------------------------|--------------------------------|------|---------------------------|-----------|-----------------------------------------|------------|
| + +                        | + ++                                | +                     | · +         | -+ +-      | ا د           | +                   | -+             | +                           | 1 1<br>+ +            | ۱ ۰<br>+             | . +               | ı<br>+                 | +                        | -+                 | +                        | -+ ·      | 12<br>+                  | + +                            |      | <br> - +-                 |           | <br> - +-                               |            |
| + +                        | + ++                                | +                     | +           | -+ +-      | +             | +                   | -+             | +                           | + +                   | +                    | +                 | +                      | +                        | -+                 | +                        | +         | +                        | + +                            |      | +-                        |           | + +-                                    |            |
| 16kB     16kB              | 16kB                                | 16kB                  | 16kB        |            | 16kB          |                     |                | •                           |                       |                      | •                 | •                      | •                        |                    | 16kB                     | 1         | 16kB                     | 1 1                            | 16kB |                           | 16kB      |                                         | 16kB       |
| + +                        | .+ +                                | +                     | · +         | -+ +-      | +             | •                   |                | +                           |                       | •                    |                   |                        | •                        | -+                 |                          | -+ -      | +<br>+                   | + +                            |      | - +-<br>- +-              |           | - +                                     |            |
| 2MB                        | 2MB   2MB   2MB                     |                       |             |            | i             | 2MB                 |                |                             | 2MB                   |                      |                   | 2MB                    |                          | Ĺ                  | I                        | 2MB       |                          | 2MB                            |      |                           |           |                                         |            |
|                            | + +                                 |                       | •           |            |               | •                   |                |                             |                       |                      |                   |                        | •                        |                    |                          | •         | +                        |                                |      | •                         |           |                                         |            |
|                            |                                     |                       | MB          |            |               |                     |                |                             | + +                   |                      |                   |                        |                          |                    |                          | 6М        |                          |                                |      |                           |           |                                         |            |
|                            | 0МВ                                 |                       |             |            |               |                     | · +            |                             |                       |                      |                   |                        |                          |                    |                          |           |                          |                                |      |                           |           |                                         |            |
|                            |                                     |                       |             |            |               |                     |                |                             | + +<br>               |                      |                   |                        |                          |                    |                          |           |                          |                                |      |                           |           |                                         |            |
| ket 1:                     |                                     |                       |             |            |               |                     |                |                             | + +<br>               |                      |                   |                        |                          |                    |                          |           |                          |                                |      |                           |           |                                         |            |
| ket 1:<br>+ +<br>16     17 |                                     | +                     | <br><br>. + |            | <br>+<br>21 I | +                   | •              | <br><br>+<br>I 23           |                       |                      |                   |                        |                          | <br>               | <br><br>+<br>I 27        |           |                          |                                |      | ++-                       | 30        | <br><br>+ +-                            |            |
| 16   17                    | + ++<br>                            | •                     |             |            |               | 22                  | i              | 23                          | Ϊİ                    | 24                   | <br>- +           | 25                     | +                        | <br><br>-+         | <br>+<br>  27            | +         | +                        | <br>+ +                        |      | + +-                      |           | <br>+ +-                                | 31         |
| 16     17<br>+ +           | + ++                                | +                     | +           | -+ +-      | ·+            | 22                  | <br> -+<br> -+ | 23<br>+                     | <br>   <br>+ +<br>+ + | 24                   | <br>              | 25                     | +<br>  26<br>+           | <br>+<br>+         | +<br>  27<br>+           | -+ ·      | 28<br>                   | <br>+ +<br>                    | 29   | + +-                      | 30        | <br>+ +-<br>   <br>+ +-                 | 31         |
| 16   17                    | + ++                                | +                     | +           | -+ +-      | ·+            | 22<br>  +<br>  16kE | -+             | 23<br>+<br>+<br>  16kB      | <br>   <br>+ +<br>    | 24  <br>24  <br>16kB | +<br>    +<br>  + | 25  <br>+              | +<br>  26<br>+<br>  16kE | <br>-+<br>-+<br>-+ | +<br>  27<br>+<br>  16kB | -+ ·      | +<br>  28<br>+<br>  16kB | <br>+ +<br>                    | 29   | + +-                      | 30        | <br>+ +-<br>   <br>+ +-                 | 31         |
| 16     17<br>+ +           | + ++<br>+ ++<br>    16kB            | +                     | 16kB        | -+ +-<br>1 | ·+            | 22<br>  +<br>  16kE | -+             | 23<br>+                     |                       | 24  <br>             | +                 | 25  <br>+<br>+<br>16kB | +<br>  26<br>+<br>  16kE | <br> <br> +<br> +  | +<br>  27<br>+<br>  16kB | -+ · -+ · | +<br>  28<br>+<br>  16kB | <br>+ +<br>     <br>+ +<br>+ + | 29   | + +-<br>   <br>  + +-<br> | 30  <br>  | + + · · · · · · · · · · · · · · · · · · | 31<br>16kB |
| 16   17                    | + ++<br>+ ++<br>    16kB  <br>-+ ++ | +<br>+<br>  16kB<br>+ | 16kB        | -+ +-<br>1 | 16kB          | 22<br>  +<br>  16kE | -+             | 23<br>+<br>+<br>  16kB<br>+ |                       | 24  <br>             | +                 | 25  <br>+<br>+<br>16kB | +<br>  26<br>+<br>  16kE | <br> <br> +<br> +  | +<br>  27<br>+<br>  16kB | -+ · -+ · | +<br>  28<br>+<br>  16kB | <br>+ +<br>     <br>+ +<br>+ + | 29   | + +-<br>   <br>  + +-<br> | 30  <br>  | + + · · · · · · · · · · · · · · · · · · | 31<br>16kB |



# **Enforcing thread/process-core affinity under the Linux OS**

- Standard tools and OS affinity facilities under program control
- likwid-pin

# **Example: STREAM benchmark on 16-core Sandy Bridge:**

Anarchy vs. thread pinning







- Eliminating performance variation
- Making use of architectural features
- Avoiding resource contention



#### More thread/Process-core affinity ("pinning") options



- Highly OS-dependent system calls
  - But available on all systems

```
Linux: sched_setaffinity()

Solaris: processor_bind()

Windows: SetThreadAffinityMask()
```

■ PLPA → hwloc

- Support for "semi-automatic" pinning in some compilers/environments
  - All modern compilers with OpenMP support
  - Generic Linux: taskset, numactl, likwid-pin (see below)
  - OpenMP 4.0 (see OpenMP tutorial)
- Affinity awareness in MPI libraries
  - SGI MPT
  - OpenMPI
  - Intel MPI

• ...

#### Overview



- Pins processes and threads to specific cores without touching code
- Directly supports pthreads, gcc OpenMP, Intel OpenMP
- Based on combination of wrapper tool together with overloaded pthread library → binary must be dynamically linked!
- Can also be used as a superior replacement for taskset
- Supports logical core numbering within a node and within an existing CPU set
  - Useful for running inside CPU sets defined by someone else, e.g., the MPI start mechanism or a batch system

#### Usage examples:

- likwid-pin -c 0,2,4-6 ./myApp parameters
- likwid-pin -c S0:0-3 ./myApp parameters

Example: Intel OpenMP



### Running the STREAM benchmark with likwid-pin:

```
$ export OMP NUM THREADS=4
$ likwid-pin -c 0,1,4,5 ./stream
[likwid-pin] Main PID -> core 0 - OK —
                                                           Main PID always
                                                                pinned
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLE PRECISION word
[... some STREAM output omitted ...]
The *best* time for each test is used
*EXCLUDING* the first and last iterations
[pthread wrapper] PIN MASK: 0->1 1->4 2->5
                                                          Skip shepherd
[pthread wrapper] SKIP MASK: 0x1 —
[pthread wrapper 0] Notice: Using libpthread.so.0
                                                              thread
       threadid 1073809728 -> SKIP
[pthread wrapper 1] Notice: Using libpthread.so.0
        threadid 1078008128 -> core 1 - OK
[pthread wrapper 2] Notice: Using libpthread.so.0
       threadid 1082206528 -> core 4 - OK
                                                            Pin all spawned
[pthread wrapper 3] Notice: Using libpthread.so.0
                                                             threads in turn
       threadid 1086404928 -> core 5 - OK
[... rest of STREAM output omitted ...]
```

#### Using logical core numbering



- Core numbering may vary from system to system even with identical hardware
  - Likwid-topology delivers this information, which can then be fed into likwidpin
- Alternatively, likwid-pin can abstract this variation and provide a purely logical numbering (physical cores first)



- Across all cores in the node:
  OMP NUM THREADS=8 likwid-pin -c N:0-7 ./a.out
- Across the cores in each socket and across sockets in each node:
  OMP NUM THREADS=8 likwid-pin -c S0:0-3@S1:0-3 ./a.out

Using logical core numbering





#### **Advanced options for pinning: Expressions**



 Expressions are more powerful in situations where the pin mask would be very long or clumsy

#### Compact pinning:

```
likwid-pin -c E:<thread domain>:<number of threads>\
  [:<chunk size>:<stride>] ...
```

#### Scattered pinning across all domains of the designated type:

```
likwid-pin -c <domaintype>:scatter
```

Examples:

```
likwid-pin -c E:N:8 ... # equivalent to N:0-7
likwid-pin -c E:N:120:2:4 ... # Phi: 120 threads, 2 per core
```

Scatter across all NUMA domains:

```
likwid-pin -c M:scatter
```

# Intel KMP AFFINITY environment variable



• KMP\_AFFINITY=[<modifier>,...]<type>[,<permute>][,<offset>]

- modifier
  - granularity=<specifier> takes the following specifiers: fine, thread, and core
  - norespect
  - noverbose
  - proclist={proc-list>}
  - respect
  - verbose

type (required)

- compact
- disabled
- explicit (GOMP\_CPU\_AFFINITY)
- none
- scatter

**OS processor IDs** 

Respect an OS affinity mask in place

- Default:
  - noverbose,respect,granularity=core
- KMP AFFINITY=verbose, none to list machine topology map

## Intel KMP AFFINITY examples



# KMP\_AFFINITY=granularity=fine,compact



Package means chip/socket

# KMP\_AFFINITY=granularity=fine,scatter



#### Intel KMP AFFINITY permute example



KMP\_AFFINITY=granularity=fine,compact,1,0



KMP\_AFFINITY=granularity=core,compact



Threads may float within core

#### **GNU GOMP AFFINITY**



■ GOMP\_AFFINITY=3,0-2 used with 6 threads



Round robin oversubscription

Always operates with OS processor IDs



# Multicore performance tools: Probing performance behavior

likwid-perfctr

#### Basic approach to performance analysis



- Runtime profile / Call graph (gprof)
- Instrument those parts which consume a significant part of runtime
- 3. Find performance signatures

#### **Possible signatures:**

- Bandwidth saturation
- Instruction throughput limitation (real or language-induced)
- Latency impact (irregular data access, high branch ratio)
- Load imbalance
- ccNUMA issues (data access across ccNUMA domains)
- Pathologic cases (false cacheline sharing, expensive operations)

# **Probing performance behavior**



- How do we find out about the performance properties and requirements of a parallel code?
  - Profiling via advanced tools is often overkill
- A coarse overview is often sufficient
  - likwid-perfctr (similar to "perfex" on IRIX, "hpmcount" on AIX, "lipfpm" on Linux/Altix)
  - Simple end-to-end measurement of hardware performance metrics
  - "Marker" API for starting/stopping counters
  - Multiple measurement region support
  - Preconfigured and extensible metric groups, list with

likwid-perfctr -a

BRANCH: Branch prediction miss rate/ratio

CACHE: Data cache miss rate/ratio

CLOCK: Clock of cores

DATA: Load to store ratio

FLOPS\_DP: Double Precision MFlops/s FLOPS\_SP: Single Precision MFlops/s

FLOPS\_X87: X87 MFlops/s

L2: L2 cache bandwidth in MBytes/s
L2CACHE: L2 cache miss rate/ratio
L3: L3 cache bandwidth in MBytes/s
L3CACHE: L3 cache miss rate/ratio

MEM: Main memory bandwidth in MBytes/s

TLB: TLB miss rate/ratio

#### Example usage with preconfigured metric group



\$ env OMP NUM THREADS=4 likwid-perfctr -C N:0-3 -g FLOPS DP ./stream.exe CPU type: Intel Core Lynnfield processor CPU clock: 2.93 GHz **Configured metrics Always** Measuring group FLOPS DP (this group) measured YOUR PROGRAM OUTPUT core 3 core 1 Event core 0 core 2 INSTR RETIRED ANY 1.97463e+08 | 2.31001e+08 | 2.30963e+08 | 2.31885e+08 9.56999e+08 | CPU CLK UNHALTED CORE 9.58401e+08 | 9.58637e+08 FF COMP OF EYF SSE FF PACKED 4.00294e+07 3.08927e+07 I 3.08866e+07 FP COMP OPS EXE SSE FP SCALAR 882 FP COMP OPS EXE SSE SINGLE PRECISION FR COMP OPS EXE SSE DOUBLE PRECISION 4.00303e+07 | 3.08927e+07 | 3.08866e+07 | 3.08904e+07 core 0 Metric Runtime [s] 0.326242 0.326801 | 0.326358 **Derived** CPI 4.84647 4.15061 1 4.12849 | 4.14891 | DP MFlops/s (DP assumed) | 245.399 189.024 | 189.108 | I 189.304 metrics Packed MUOPS/s 122.698 94.5121 94.6519 Scalar MUOPS/s 0.00270351 SP MUOPS/s DP MUOPS/s 1 94.554 94.5121 122.701

#### Best practices for runtime counter analysis



# Things to look at (in roughly this order)

- Load balance (flops, instructions, BW)
- In-socket memory BW saturation
- Shared cache BW saturation
- Flop/s, loads and stores per flop metrics
- SIMD vectorization
- CPI metric
- # of instructions, branches, mispredicted branches

#### **Caveats**

- Load imbalance may not show in CPI or # of instructions
  - Spin loops in OpenMP barriers/MPI blocking calls
  - Looking at "top" or the Windows Task
     Manager does not tell you anything useful
- In-socket performance saturation may have various reasons
- Cache miss metrics are overrated
  - If I really know my code, I can often calculate the misses
  - Runtime and resource utilization is much more important

#### Marker API



- A marker API is available to restrict measurements to code regions
- The API only turns counters on/off. The configuration of the counters is still done by likwid-perfctr
- Multiple named regions support, accumulation over multiple calls
- Inclusive and overlapping regions allowed

```
#include <likwid.h>
                                   // must be called from serial region
LIKWID MARKER INIT;
#pragma omp parallel
  LIKWID MARKER THREADINIT; // only reqd. if measuring multiple threads
LIKWID MARKER START ("Compute");
LIKWID MARKER STOP("Compute");
                                                             Activate macros with
                                                              -DLIKWID PERFMON
LIKWID MARKER START ("Postprocess");
LIKWID MARKER STOP("Postprocess");
                                   // must be called from serial region
LIKWID MARKER CLOSE;
```



# **Measuring energy consumption** with LIKWID

# Measuring energy consumption

likwid-powermeter and likwid-perfctr-g ENERGY



- Implements Intel RAPL interface (Sandy Bridge)
- RAPL = "Running average power limit"

CPU name: Intel Core SandyBridge processor

CPU clock: 3.49 GHz

\_\_\_\_\_

Base clock: 3500.00 MHz Minimal clock: 1600.00 MHz

Turbo Boost Steps:

C1 3900.00 MHz

C2 3800.00 MHz

C3 3700.00 MHz

C4 3600.00 MHz

\_\_\_\_\_

Thermal Spec Power: 95 Watts

Minimum Power: 20 Watts

Maximum Power: 95 Watts

Maximum Time Window: 0.15625 micro sec

\_\_\_\_\_

# **Example:**

A medical image reconstruction code on Sandy Bridge





Sandy Bridge EP (8 cores, 2.7 GHz base freq.)

| Test case          | Runtime [s] | Power [W] |       | Energy [J] |
|--------------------|-------------|-----------|-------|------------|
| 8 cores, plain C   | 90.43       | 90        | Fas   | 8110       |
| 8 cores, SSE       | 29.63       | 93        | ter o | 2750       |
| 8 cores (SMT), SSE | 22.61       | 102       | code  | 2300       |
| 8 cores (SMT), AVX | 18.42       | 111       |       | 2040       |



# Microbenchmarking for architectural exploration

Probing of the memory hierarchy
Saturation effects in cache and memory
Typical OpenMP overheads

# Latency and bandwidth in modern computer environments





#### The parallel vector triad benchmark

A "swiss army knife" for microbenchmarking



#### Simple streaming benchmark:

```
double precision, dimension(N) :: A,B,C,D
A=1.d0; B=A; C=A; D=A

do j=1,NITER
    do i=1,N
        A(i) = B(i) + C(i) * D(i)
    enddo
    if(.something.that.is.never.true.) then
        call dummy(A,B,C,D)
    endif
enddo
```

- Report performance for different N
- Choose NITER so that accurate time measurement is possible
- This kernel is limited by data transfer performance for all memory levels on all current architectures!

# A(:)=B(:)+C(:)\*D(:) on one Sandy Bridge core (3 GHz)





# A(:)=B(:)+C(:)\*D(:) on one Sandy Bridge core (3 GHz): Observations and further questions





See later for answers!

#### The throughput-parallel vector triad benchmark



#### **Every core runs its own, independent triad benchmark**

```
double precision, dimension(:), allocatable :: A,B,C,D
!$OMP PARALLEL private(i,j,A,B,C,D)
allocate (A(1:N), B(1:N), C(1:N), D(1:N))
A=1.d0; B=A; C=A; D=A
do j=1,NITER
  do i=1,N
    A(i) = B(i) + C(i) * D(i)
  enddo
  if(.something.that.is.never.true.) then
    call dummy (A,B,C,D)
  endif
enddo
!$OMP END PARALLEL
```

### > pure hardware probing, no impact from OpenMP overhead

# Throughput vector triad on Sandy Bridge socket (3 GHz)





# Attainable memory bandwidth: Comparing architectures





#### **Bandwidth limitations: Outer-level cache**

Scalability of shared data paths in L3 cache





#### The OpenMP-parallel vector triad benchmark



#### OpenMP work sharing in the benchmark loop

```
double precision, dimension(:), allocatable :: A,B,C,D
allocate (A(1:N), B(1:N), C(1:N), D(1:N))
A=1.d0; B=A; C=A; D=A
!$OMP PARALLEL private(i,j)
do j=1,NITER
!$OMP DO
  do i=1,N
    A(i) = B(i) + C(i) * D(i)
  enddo
                          Implicit barrier
!SOMP END DO
  if (.something.that.is.never.true.) then
    call dummy (A,B,C,D)
  endif
enddo
!$OMP END PARALLEL
```

# OpenMP vector triad on Sandy Bridge socket (3 GHz)







# OpenMP performance issues on multicore

Synchronization (barrier) overhead

#### Welcome to the multi-/many-core era

Synchronization of threads may be expensive!



!\$OMP PARALLEL ...

•••

!\$OMP BARRIER

!\$OMP DO

•••

!\$OMP ENDDO

!\$OMP END PARALLEL

Threads are synchronized at **explicit** AND **implicit** barriers. These are a main source of overhead in OpenMP progams.

Determine costs via modified OpenMP

Microbenchmarks testcase (epcc)

#### On x86 systems there is no hardware support for synchronization!

- Next slides: Test OpenMP Barrier performance...
- for different compilers
- and different topologies:
  - shared cache
  - shared socket
  - between sockets
- and different thread counts
  - 2 threads
  - full domain (chip, socket, node)

# Thread synchronization overhead on SandyBridge-EP

Barrier overhead in CPU cycles



| 2 Threads    | Intel 13.1.0 | GCC 4.7.0 | GCC 4.6.1 |
|--------------|--------------|-----------|-----------|
| Shared L3    | 384          | 5242      | 4616      |
| SMT threads  | 2509         | 3726      | 3399      |
| Other socket | 1375         | 5959      | 4909      |





| Full domain | Intel 13.1.0 | GCC 4.7.0 | GCC 4.6.1 |
|-------------|--------------|-----------|-----------|
| Socket      | 1497         | 14546     | 14418     |
| Node        | 3401         | 34667     | 29788     |
| Node +SMT   | 6881         | 59038     | 58898     |

#### Thread synchronization overhead on Intel Xeon Phi

Barrier overhead in CPU cycles



2 threads on distinct cores: 1936

| 3                  | <del>66</del> | SMT1  | SMT2  | SMT3  | SMT4  |
|--------------------|---------------|-------|-------|-------|-------|
| One core Full chip |               | n/a   | 1597  | 2825  | 3557  |
|                    |               | 10604 | 12800 | 15573 | 18490 |

Still the pain may be much larger, as more work can be done in one cycle on Phi compared to a full Sandy Bridge node

3.75x cores (16 vs 60) on Phi2x more operations per cycle on Phi

- $\rightarrow$  2 · 3.75 = 7.5x more work done on Xeon Phi per cycle
- 2.7x more barrier penalty (cycles) on Phi 3x smaller clock speed (assuming 3 GHz SNB)
- → One barrier causes 2.7 · 7.5 / 3 ≈ 7x more pain ②.

#### **Conclusions from the microbenchmarks**



- Affinity matters!
  - Almost all performance properties depend on the position of
    - Data
    - Threads/processes
  - Consequences
    - Know where your threads are running
    - Know where your data is

Bandwidth bottlenecks are ubiquitous

- Synchronization overhead may be an issue
  - ... and also depends on affinity!
  - Many-core poses new challenges in terms of synchronization







# "Simple" performance modeling: The Roofline Model<sup>(1)</sup>

Loop-based performance modeling: Execution vs. data transfer

Example: array summation Example: A 3D Jacobi solver Model-guided optimization

# Prelude: Modeling customer dispatch in a bank



Revolving door throughput: b<sub>S</sub> [customers/sec]





















## Prelude: Modeling customer dispatch in a bank



#### How fast can tasks be processed? P [tasks/sec]

#### The bottleneck is either

- The service desks (max. tasks/sec):  $P_{\text{max}}$
- The revolving door (max. customers/sec):  $I \cdot b_S$



#### The Roofline Model<sup>1,2</sup>



- 1.  $P_{\text{max}}$  = Applicable peak performance of a loop, assuming that data comes from L1 cache (this is not necessarily  $P_{\text{peak}}$ )
- 2. I = Computational intensity ("work" per byte transferred) over the slowest data path utilized ("the bottleneck")
  - Code balance  $B_C = I^{-1}$
- 3.  $b_s$  = Applicable peak bandwidth of the slowest data path utilized

# **Expected performance:**

nce: 
$$[F/B] \quad [B/s]$$

$$P = \min(P_{\text{max}}, I \cdot b_S)$$

<sup>&</sup>lt;sup>1</sup>W. Schönauer: Scientific Supercomputing: Architecture and Use of Shared and Distributed Memory Parallel Computers. (2000)

<sup>&</sup>lt;sup>2</sup> S. Williams: Auto-tuning Performance on Multicore Computers. UCB Technical Report No. UCB/EECS-2008-164. PhD thesis (2008)

# **Preliminary: Estimating** *P*<sub>max</sub>



How to perform a instruction throughput analysis on the example of Intel's port based scheduler model



First-order assumption: All instructions in a loop are fed independently to the various ports/pipelines

Complex cases (dependencies, hazards): Add penalty cycles / use tools (Intel IACA, Intel Amplifier)

## Throughput capabilities of the Intel Sandy Bridge core



#### Per cycle with AVX

- 1 load instruction (256 bits) AND ½ store instruction (128 bits)
- 1 AVX MULT and 1 AVX ADD instruction
   (4 DP / 8 SP flops each)

| Port 0 | Port 1 | Port 2 | Port 3 | Port 4 | Port 5 |
|--------|--------|--------|--------|--------|--------|
| ALU    | ALU    | LOAD   | LOAD   | STORE  | ALU    |
| FMUL   | FADD   | AGU    | AGU    |        | FSHUF  |
|        |        |        |        |        | JUMP   |

## Per cycle with SSE or scalar

- 2 load instruction OR 1 load and 1 store instruction
- 1 MULT and 1 ADD instruction

#### Overall maximum of 4 micro-ops

In practice, 3 is more realistic

# Example: Estimate $P_{\text{max}}$ of vector triad on SandyBridge



```
double *A, *B, *C, *D;
for (int i=0; i<N; i++) {
    A[i] = B[i] + C[i] * D[i]
}</pre>
```

How many cycles to process one AVX-vectorized iteration (one core)?

→ Equivalent to 4 scalar iterations

```
Cycle 1: LOAD + ½ STORE + MULT + ADD
```

Cycle 2: LOAD + ½ STORE

Cycle 3: LOAD Answer: 3 cycles

# Example: Estimate $P_{\text{max}}$ of vector triad on SandyBridge



```
double *A, *B, *C, *D;
for (int i=0; i<N; i++) {
    A[i] = B[i] + C[i] * D[i]
}</pre>
```

# What is the performance in GFlops/s and the bandwidth in GBytes/s?

One AVX iteration (3 cycles) does  $4 \times 2 = 8$  flops:

$$\frac{3.0 \cdot 10^9 \text{ cy/s}}{3 \text{ cy}} \cdot 4 \text{ updates} \cdot \frac{2 \text{ flops}}{\text{update}} = 8 \frac{\text{Gflops}}{\text{s}}$$

$$4 \cdot 10^9 \frac{\text{updates}}{\text{s}} \cdot 32 \frac{\text{bytes}}{\text{update}} = 128 \frac{\text{Gbyte}}{\text{s}}$$



# P<sub>max</sub> + bandwidth limitations: The vector triad



Example: Vector triad A(:)=B(:)+C(:)\*D(:)
on a 3 GHz 8-core Sandy Bridge chip (AVX vectorized)

- $b_{\rm S}$  = **40 GB/s**
- B<sub>c</sub> = (4+1) Words / 2 Flops = 2.5 W/F (including write allocate)

$$\rightarrow$$
 / = 0.4 F/W = 0.05 F/B

$$\rightarrow$$
 /·  $b_S$  = 2.0 GF/s (1.04 % of peak performance)

- P<sub>peak</sub> = 192 Gflop/s (8 FP units x (4+4) Flops/cy x 3.0 GHz)
- $P_{\text{max}} = 8 \times 8 \text{ Gflop/s} = 64 \text{ Gflop/s} (33\% \text{ peak})$

$$P = \min(P_{\text{max}}, I \cdot b_S) = \min(64,2.0) \text{ GFlop/s}$$
  
= 2.0 GFlop/s

# A not so simple Roofline example



Example: do i=1,N; s=s+a(i); enddo

in single precision on a 2.2 GHz Sandy Bridge socket @ "large" N



#### Applicable peak for the summation loop



#### Plain scalar code, no SIMD

```
LOAD r1.0 ← 0

i ← 1

loop:

LOAD r2.0 ← a(i)

ADD r1.0 ← r1.0+r2.0

++i →? loop

result ← r1.0
```



→ 1/24 of ADD peak

#### Applicable peak for the summation loop



#### Scalar code, 3-way unrolling

LOAD r1.0 
$$\leftarrow$$
 0  
LOAD r2.0  $\leftarrow$  0  
LOAD r3.0  $\leftarrow$  0  
i  $\leftarrow$  1

#### loop:

LOAD r4.0 
$$\leftarrow$$
 a(i)  
LOAD r5.0  $\leftarrow$  a(i+1)  
LOAD r6.0  $\leftarrow$  a(i+2)

ADD 
$$r1.0 \leftarrow r1.0 + r4.0$$
  
ADD  $r2.0 \leftarrow r2.0 + r5.0$   
ADD  $r3.0 \leftarrow r3.0 + r6.0$ 

#### **ADD pipes utilization:**



→ 1/8 of ADD peak

#### Applicable peak for the summation loop



#### SIMD-vectorized, 3-way unrolled

```
LOAD [r1.0,...,r1.7] \leftarrow [0,...,0]

LOAD [r2.0,...,r2.7] \leftarrow [0,...,0]

LOAD [r3.0,...,r3.7] \leftarrow [0,...,0]

i \leftarrow 1
```

#### loop:

```
LOAD [r4.0,...,r4.7] \leftarrow [a(i),...,a(i+7)]

LOAD [r5.0,...,r5.7] \leftarrow [a(i+8),...,a(i+15)]

LOAD [r6.0,...,r6.7] \leftarrow [a(i+16),...,a(i+23)]
```

ADD r1 
$$\leftarrow$$
 r1 + r4  
ADD r2  $\leftarrow$  r2 + r5  
ADD r3  $\leftarrow$  r3 + r6

#### ADD pipes utilization:



→ ADD peak

#### Input to the roofline model



... on the example of in single precision

do i=1,N; s=s+a(i); enddo



#### **Prerequisites for the Roofline Model**



- The roofline formalism is based on some (crucial) assumptions:
  - There is a clear concept of "work" vs. "traffic"
    - "work" = flops, updates, iterations...
    - "traffic" = required data to do "work"
  - Attainable bandwidth of code = input parameter! Determine effective bandwidth via simple streaming benchmarks to model more complex kernels and applications
  - Data transfer and core execution overlap perfectly!
    - Either the limit is core execution or it is data transfer
  - Slowest data path is modeled only; all others are assumed to be infinitely fast
  - If data transfer is the limiting factor, the bandwidth of the slowest data path can be utilized to 100% ("saturation")
  - Latency effects are ignored, i.e. perfect streaming mode

## **Exercise: Dense matrix-vector multiplication in DP (AVX)**



do 
$$i=1,N$$

do  $i=1,N$ 

tmp =  $c(i)$ 

do  $j=1,N$ 
 $c(i)=c(i)+A(j,i)*b(j)$ 

enddo

enddo

 $c(i) = tmp$ 

enddo

enddo

enddo

enddo

- Assume N ≈ 5000
- Applicable peak performance?
- Relevant data path?
- Computational Intensity?

## Typical code optimizations in the Roofline Model



- Hit the BW bottleneck by good serial code (e.g., Perl → Fortran)
- Increase intensity to make better use of BW bottleneck (e.g., loop blocking → see later)
- 3. Increase intensity and go from memory-bound to core-bound (e.g., temporal blocking)
- Hit the core bottleneck by good serial code (e.g., -fno-alias → see later)
- Shift P<sub>max</sub> by accessing additional hardware features or using a different algorithm/implementation (e.g., scalar → SIMD)



#### Shortcomings of the roofline model



- Saturation effects in multicore chips are not explained
  - Reason: "saturation assumption"
  - Cache line transfers and core execution do sometimes not overlap perfectly
  - It is not sufficient to measure single-core STREAM to make it work
  - Only increased "pressure" on the memory interface can saturate the bus
     → need more cores!
- In-cache performance is not correctly predicted
- The ECM performance model gives more insight:

G. Hager, J. Treibig, J. Habich, and G. Wellein: Exploring performance and power properties of modern multicore chips via simple machine models. Concurrency and Computation: Practice and Experience (2013).

DOI: 10.1002/cpe.3180 Preprint: arXiv:1208.2908

http://youtu.be/Z8a513NCFjs





**Case study: Sparse Matrix Vector Multiplication** 

# **Sparse Matrix Vector Multiplication (spMVM)**



- Key ingredient in some matrix diagonalization algorithms
  - Lanczos, Davidson, Jacobi-Davidson
- Store only N<sub>nz</sub> nonzero elements of matrix and RHS, LHS vectors with N<sub>r</sub> (number of matrix rows) entries
- "Sparse": N<sub>nz</sub> ~ N<sub>r</sub>



#### **SpMVM** characteristics



- For large problems, spMVM is inevitably memory-bound
  - Intra-socket saturation effect on modern multicores
- SpMVM is easily parallelizable in shared and distributed memory
- Data storage format is crucial for performance properties
  - Most useful general format on CPUs: Compressed Row Storage (CRS)
  - Depending on compute architecture

#### **CRS** matrix storage scheme





- val[] stores all the nonzeros (length N<sub>nz</sub>)
- col\_idx[] stores the column index of each nonzero (length N<sub>nz</sub>)
- row\_ptr[] stores the starting index of each new row in val[] (length: N<sub>r</sub>)



# **Case study: Sparse matrix-vector multiply**



- Strongly memory-bound for large data sets
  - Streaming, with partially indirect access:

```
!$OMP parallel do
do i = 1,Nr
do j = row_ptr(i), row_ptr(i+1) - 1
   c(i) = c(i) + val(j) * b(col_idx(j))
   enddo
enddo
!$OMP end parallel do
```

- Usually many spMVMs required to solve a problem
- Following slides: Performance data on one 24-core AMD Magny Cours node

#### **Application: Sparse matrix-vector multiply**

Strong scaling on one XE6 Magny-Cours node



#### Case 1: Large matrix



#### **Application: Sparse matrix-vector multiply**

Strong scaling on one XE6 Magny-Cours node



#### Case 2: Medium size



# **Application: Sparse matrix-vector multiply**

Strong scaling on one Magny-Cours node



#### Case 3: Small size



#### **Example: SpMVM node performance model**



Sparse MVM in double precision w/ CRS data storage:

do i = 1,
$$N_{\Gamma}$$
  
do j = row\_ptr(i), row\_ptr(i+1) - 1  
 $C(i) = C(i) + val(j) * B(col_idx(j))$   
enddo  
enddo

- DP CRS comp. intensity
  - α quantifies traffic for loading RHS
    - $\alpha = 0 \rightarrow RHS$  is in cache
    - $\alpha = 1/N_{nzr} \rightarrow RHS$  loaded once
    - $\alpha = 1 \rightarrow \text{no cache}$
    - $\alpha > 1 \rightarrow$  Houston, we have a problem!
  - "Expected" performance = b<sub>S</sub> x I<sub>CRS</sub>
  - Determine α by measuring performance and actual memory traffic
    - Maximum memory BW may not be achieved with spMVM

#### **Determine RHS traffic**



$$I_{CRS}^{DP} = \frac{2}{8+4+8\alpha+16/N_{nzr}} \frac{\text{flops}}{\text{byte}} = \frac{N_{nz} \cdot 2 \text{ flops}}{V_{meas}}$$

- $V_{meas}$  is the measured overall memory data traffic (using, e.g., likwid-perfctr)
- Solve for  $\alpha$ :

$$\alpha = \frac{1}{4} \left( \frac{V_{meas}}{N_{nz} \cdot 2 \text{ bytes}} - 6 - \frac{8}{N_{nzr}} \right)$$

Example: kkt\_power matrix from the UoF collection on one Intel SNB socket

$$N_{nz} = 14.6 \cdot 10^6$$
,  $N_{nzr} = 7.1$ 

■ 
$$V_{meas} \approx 258 \text{ MB}$$

• 
$$\rightarrow \alpha = 0.43, \ \alpha N_{nzr} = 3.1$$

- → RHS is loaded 3.1 times from memory
- and:

$$\frac{I_{CRS}^{DP}(1/N_{nzr})}{I_{CRS}^{DP}(\alpha)} = 1.15$$

15% extra traffic → optimization potential!



#### Roofline analysis for spMVM



- Conclusion from Roofline analysis
  - The roofline model does not work 100% for spMVM due to the RHS traffic uncertainties
  - We have "turned the model around" and measured the actual memory traffic to determine the RHS overhead
  - Result indicates:
    - 1. how much actual traffic the RHS generates
    - 2. how efficient the RHS access is (compare BW with max. BW)
    - 3. how much optimization potential we have with matrix reordering

Consequence: Modeling is not always 100% predictive. It's all about *learning more* about performance properties!



# Case study: A Jacobi smoother

#### The basics in two dimensions

**Layer conditions** 

Validating the model in 3D

Optimization by spatial blocking in 3D

#### Stencil schemes



- Stencil schemes frequently occur in PDE solvers on regular lattice structures
- Basically it is a sparse matrix vector multiply (spMVM) embedded in an iterative scheme (outer loop)
- but the regular access structure allows for matrix free coding

```
do iter = 1, max_iterations

Perform sweep over regular grid: y(:) 
Swap y 
x
```

enddo

- Complexity of implementation and performance depends on
  - stencil operator, e.g. Jacobi-type, Gauss-Seidel-type, ...
  - spatial extent, e.g. 7-pt or 25-pt in 3D,...

## Jacobi-type 5-pt stencil in 2D



Lattice Update (LUP)



Appropriate performance metric: "Lattice Updates per second" [LUP/s]

(here: Multiply by 4 FLOP/LUP to get FLOP/s rate)

## Jacobi 5-pt stencil in 2D: data transfer analysis



```
Available in cache
    LD+ST y(i,k)
                                       (used 2 updates before) ,
      (incl. write
       allocate)
                                                            LD x(j+1,k)
    do k=1,kmax
      do j=1,jmax
        y(j,k) = const * ( x(j-1,k) + x(j+1,k) &
                                   x(j,k-1) + x(j,k+1))
      enddo
    enddo
                                         LD \times (j,k-1)
                                                            LD x(j,k+1)
Naive balance (incl. write allocate):
x(:,:):3LD+
y(:,:):1 ST+1LD
\rightarrow B<sub>c</sub> = 5 Words / LUP = 40 B / LUP (assuming double precision)
```

## Jacobi 5-pt stencil in 2D: Single core performance







## Case study: A Jacobi smoother

The basics in two dimensions

**Layer conditions** 

Validating the model in 3D

Optimization by spatial blocking in 3D

#### **Analyzing the data flow**





#### **Analyzing the data flow**



Worst case: Cache not large enough to hold 3 layers (rows) of grid (+assume "Least Recently Used" replacement strategy)



## **Analyzing the data flow**





#### **Analyzing the data flow: Layer condition**





#### **Layer condition:**

- Does not depend on outer loop length (kmax)
- No strict guideline (cache associativity data traffic for y not included)
- Needs to be adapted for other stencils (e.g., 3D 7-pt stencil)

## Analyzing the data flow: Layer condition (2D 5-pt Jacobi)



```
y: (1 LD + 1 ST) / LUP
                                                                x: 1 LD / LUP
           do k=1, kmax
             do j=1,jmax
               y(j,k) = const * (x(j-1,k) + x(j+1,k) &
                               + x(j,k-1) + x(j,k+1))
YES
             enddo
                                                              B_C = 24 B / LUP
           enddo
 3 * jmax * 8B < CacheSize/2
     "Layer condition" fulfilled?
                                      y: (1 LD + 1 ST) / LUP
           do k=1, kmax
             do j=1,jmax
NO
               y(j,k) = const * (x(j-1,k) + x(j+1,k) &
                               + x(j,k-1) + x(j,k+1))
             enddo
                           x: 3 LD / LUP
                                                             B_C = 40 B / LUP
           enddo
```

#### **Analyzing the data flow: Layer condition (2D 5-pt Jacobi)**



- Establish layer condition for all domain sizes
- Idea: Spatial blocking
  - Reuse elements of x () as long as they stay in cache
  - Sweep can be executed in any order, e.g. compute blocks in j-direction

#### → "Spatial Blocking" of j-loop:

→ Determine for given CacheSize an appropriate jblock value:

```
jblock < CacheSize / 48 B</pre>
```

## Establish the layer condition by blocking



Split up domain into subblocks:

e.g. block size = 5



## **Establish the layer condition by blocking**





## Establish layer condition by spatial blocking





#### Layer condition & spatial blocking: Memory code balance





#### Layer condition & spatial blocking: L3 cache balance





#### From 2D to 3D





- \* x(0:imax+1, 0:jmax+1,0:kmax+1) Assume i-direction
  contiguous in main memory (Fortran notation)
- Stay at 2D picture and consider one cell of j-k plane as a contiguous slab of elements in i-direction: x (0:imax, j, k)

## Layer condition: From 2D 5-pt to 3D 7-pt Jacobi-type stencil



**2D** 

 $B_C = 24 B / LUP$ 

**3D** 

 $B_C = 24 B / LUP$ 

## 3D 7-pt Jacobi-type Stencil (sequential)



```
"Layer condition"
         3*jmax*imax*8B < CS/2
                                                       "Layer condition" OK →
                                                 5 accesses to x() served by cache
do k=1, kmax
   do j=1, jmax \sqrt{\phantom{a}}
      do i=1,imax
         y(i,j,k) = const.* (x(i-1,j,k) +x(i+1,j,k) & +x(i,j-1,k) +x(i,j+1,k) & +x(i,j,k-1) +x(i,j,k+1) )
      enddo
   enddo
enddo
```

#### **Question:**

Does parallelization/multi-threading change the layer condition?

## Jacobi Stencil - OpenMP parallelization (I)



! \$OMP PARALLEL DO SCHEDULE (STATIC)
do k=1,kmax

Basic guideline:
Parallelize outermost loop

enddo

Equally large chunks in k-direction→ "Layer condition" for each thread

```
"Layer condition": nthreads * 3 * jmax*imax * 8B < CS/2
```

```
Layer condition (cubic domain; CacheSize=25 MB)

1 thread: imax=jmax < 720 → 10 threads: imax=jmax < 230
```

## Jacobi Stencil – OpenMP parallelization (II)



```
"Layer condition": nthreads *3*jmax*imax*8B < CS/2
 !$OMP PARALLEL DO SCHEDULE(STATIC)
 do k=1, kmax
   do j=1, jmax
                                                       B_C = 24 B / LUP
      do i=1,imax
         y(i,j,k) = 1/6. *(x(i-1,j,k) +x(i+1,j,k) &
                             + x(i,j-1,k) + x(i,j+1,k) &
                             + x(i,j,k-1) + x(i,j,k+1)
      enddo
                                              Roofline model:
   enddo
                                        P = \min(P_{\text{max}}, b_{\text{S}} / (24 \text{ B/LUP}))
 enddo
                                    What is P_{\text{max}} here? \rightarrow homework!
Intel® Xeon® Processor E5-2690 v2
10 cores@3 GHz
```

```
CS = 25 MB (L3)
      = 48 \text{ GB/s}
b_{s}
```

**Best performance:** P = 2000 MLUP/s



## Case study: A Jacobi smoother

The basics in two dimensions

**Layer conditions** 

Validating the model in 3D

Optimization by spatial blocking in 3D

## Jacobi Stencil - OpenMP parallelization (I)







## Case study: A Jacobi smoother

The basics in two dimensions
Layer conditions
Validating the model in 3D
Spatial blocking in 3D

## Jacobi Stencil – simple spatial blocking



```
Ensure layer condition by choosing jblock appropriately (cubic domain): jblock < CS/(imax* nthreads* 48B)
```

Test system: Intel® Xeon® Processor E5-2690 v2 (10 cores / 3 GHz), as before

## Jacobi Stencil - simple spatial blocking





#### **Conclusions from the Jacobi example**



- We have made sense of the memory-bound performance vs. problem size
  - "Layer conditions" lead to predictions of code balance
  - "What part of the data comes from where" is a crucial question
  - The model works only if the bandwidth is "saturated"
    - In-cache modeling is more involved
- Avoiding slow data paths == re-establishing the most favorable layer condition
- Improved code showed the speedup predicted by the model
- Optimal blocking factor can be estimated
  - Be guided by the cache size the layer condition
  - No need for exhaustive scan of "optimization space"
- Food for thought
  - 2D/3D loop blocking would it make sense?
  - Can we choose a "better" OpenMP loop schedule?
  - What would change if we parallelized inner loops (j or i)?



# Coding for SingleInstructionMultipleData processing

## SIMD processing – Basics



- Single Instruction Multiple Data (SIMD) operations allow the concurrent execution of the same operation on "wide" registers.
- x86 SIMD instruction sets:
  - SSE: register width = 128 Bit → 2 double precision floating point operands
  - AVX: register width = 256 Bit → 4 double precision floating point operands
- Adding two registers holding double precision floating point operands



## SIMD processing – Basics



Steps (done by the compiler) for "SIMD processing"

```
for(int i=0; i<n;i++)
C[i]=A[i]+B[i];
```

"Loop unrolling"

```
for(int i=0; i<n;i+=4) {
        C[i] =A[i] +B[i];
        C[i+1]=A[i+1]+B[i+1];
        C[i+2]=A[i+2]+B[i+2];
        C[i+3]=A[i+3]+B[i+3];}
//remainder loop handling</pre>
```

Load 256 Bits starting from address of A[i] to register R0

Add the corresponding 64 Bit entries in R0 and R1 and store the 4 results to R2

Store R2 (256 Bit) to address starting at C[i]

```
LABEL1:

VLOAD R0 ← A[i]

VLOAD R1 ← B[i]

V64ADD[R0,R1] → R2

VSTORE R2 → C[i]

i←i+4

i<(n-4)? JMP LABEL1

//remainder loop handling
```

## SIMD processing – Basics



No SIMD vectorization for loops with data dependencies:

```
for(int i=0; i<n;i++)
A[i]=A[i-1]*s;
```

"Pointer aliasing" may prevent SIMDfication

```
void f(double *A, double *B, double *C, int n) {
    for(int i=0; i<n; ++i)
        C[i] = A[i] + B[i];
}</pre>
```

- C/C++ allows that  $A \rightarrow \&C[-1]$  and  $B \rightarrow \&C[-2]$  $\rightarrow$  C[i] = C[i-1] + C[i-2]: dependency  $\rightarrow$  No SIMD
- If "pointer aliasing" is not used, tell it to the compiler:

```
-fno-alias (Intel), -Msafeptr (PGI), -fargument-noalias (gcc)
- restrict keyword (C only!):
void f(double restrict *A, double restrict *B, double restrict *C, int n) {...}
```



## Reading x86 assembly code and exploting SIMD parallelism

Understanding SIMD execution by inspecting assembly code

SIMD vectorization how-to

Intel compiler options and features for SIMD

#### Why and how?



#### Why check the assembly code?

- Sometimes the only way to make sure the compiler "did the right thing"
  - Example: "LOOP WAS VECTORIZED" message is printed, but Loads & Stores may still be scalar!
- Get the assembler code (Intel compiler):

```
icc -S -O3 -xHost triad.c -o a.out
```

Disassemble Executable:

```
objdump -d ./a.out | less
```

#### The x86 ISA is documented in:

Intel Software Development Manual (SDM) 2A and 2B AMD64 Architecture Programmer's Manual Vol. 1-5

#### Basics of the x86-64 ISA



```
16 general Purpose Registers (64bit):
```

```
rax, rbx, rcx, rdx, rsi, rdi, rsp, rbp, r8-r15
alias with eight 32 bit register set:
eax, ebx, ecx, edx, esi, edi, esp, ebp
```

#### Floating Point SIMD Registers:

```
xmm0-xmm15 SSE (128bit) alias with 256-bit registers ymm0-ymm15 AVX (256bit)
```

#### **SIMD** instructions are distinguished by:

AVX (VEX) prefix: v

Operation: mul, add, mov

Modifier: nontemporal (nt), unaligned (u), aligned (a), high (h)

Width: scalar (s), packed (p)

Data type: single (s), double (d)

## Case Study: Simplest code for the summation of the elements of a vector (single precision)



```
float sum = 0.0;
                                      To get object code use
for (int j=0; j<size; j++) {</pre>
                                      objdump -d on object file or
    sum += data[j];
                                      executable or compile with -s
                          AT&T syntax:
                          addss 0(%rdx,%rax,4),%xmm0
Instruction code:
401d08:
           f3 Of 58 O4 82
                                       addss
                                               xmm0, [rdx + rax * 4]
401d0d:
           48 83 c0 01
                                       add
                                                rax,1
401d11: 39 c7
                                               edi, eax
                                       cmp
401d13:
           77 f3
                                       jа
                                               401d08
                            (final sum
                            across xmm0
                            omitted)
 Instruction
                 Opcodes
                                               Assembly
  address
                                                 code
```

#### Summation code (single precision): Improvements



```
1:
        xmm0, [rsi + rax * 4]
addss
                                    1:
add
        rax, 1
                     3 cycles add
                                    vaddps ymm0,...,[rsi + rax * 4]
       eax,edi
cmp
                     pipeline
                     latency
                                    vaddps ymm1,...,[rsi + rax * 4 + 32]
js 1b
                                    vaddps ymm2,...,[rsi + rax * 4 + 64]
     Unrolling with sub-sums to break up
                                    vaddps ymm3,...,[rsi + rax * 4 + 96]
     register dependency
                                    add rax, 32
                                        eax,edi
                                    cmp
1:
                                    js 1b
addss xmm0, [rsi + rax *
                                           AVX SIMD vectorization
addss xmm1, [rsi + rax * 4 + 4]
addss xmm2, [rsi + rax * 4 + 8]
addss xmm3, [rsi + rax * 4 + 12]
add
     rax, 4
cmp eax,edi
js 1b
```

#### **How to leverage SIMD**



#### **Alternatives:**

- The compiler does it for you (but: aliasing, alignment, language)
- Compiler directives (pragmas)
- Alternative programming models for compute kernels (OpenCL, ispc)
- Intrinsics (restricted to C/C++)
- Implement directly in assembler

#### To use intrinsics the following headers are available:

- mmintrin.h (SSE)
- pmmintrin.h (SSE2)
- immintrin.h (AVX)
- \*86intrin.h (all instruction set extensions)
- See next slide for an example

## Example: array summation using C intrinsics (SSE, single precision)



```
m128 sum0, sum1, sum2, sum3;
                                    sum0 = mm add ps(sum0, sum1);
m128 t0, t1, t2, t3;
                                    sum0 = mm add ps(sum0, sum2);
float scalar sum;
                                    sum0 = mm add ps(sum0, sum3);
sum0 = mm setzero ps();
                                    sum0 = mm hadd ps(sum0, sum0);
sum1 = mm setzero ps();
                                    sum0 = mm hadd ps(sum0, sum0);
sum2 = mm setzero ps();
                                    mm store ss(&scalar sum, sum0);
sum3 = mm setzero ps();
for (int j=0; j<size; j+=16) {
   t0 = mm loadu ps(data+j);
                                              summation of
   t1 = mm loadu ps(data+j+4);
                                              partial results
   t2 = mm loadu ps(data+j+8);
                                        core loop
    t3 = mm loadu ps(data+j+12);
                                        (bulk)
    sum0 = mm add ps(sum0, t0);
    sum1 = mm add ps(sum1, t1);
    sum2 = mm add ps(sum2, t2);
    sum3 = mm add ps(sum3, t3);
```

#### **Example: array summation from intrinsics, instruction code**



```
14:
      0f 57 c9
                                       %xmm1,%xmm1
                                xorps
17:
      31 c0
                                xor
                                       %eax,%eax
19:
      0f 28 d1
                                movaps %xmm1, %xmm2
      0f 28 c1
                                movaps %xmm1, %xmm0
1c:
      0f 28 d9
                                movaps %xmm1, %xmm3
1f:
22:
      66 Of 1f 44 00 00
                                       0x0(%rax,%rax,1)
                                nopw
28:
      0f 10 3e
                                movups (%rsi),%xmm7
      Of 10 76 10
                                movups 0x10(%rsi),%xmm6
2b:
2f:
      Of 10 6e 20
                                movups 0x20(%rsi),%xmm5
      Of 10 66 30
                                movups 0x30(%rsi),%xmm4
33:
      83 c0 10
                                add
                                       $0x10, %eax
37:
3a:
      48 83 c6 40
                                add
                                       $0x40,%rsi
      0f 58 df
                                addps %xmm7,%xmm3
3e:
      0f 58 c6
                                addps %xmm6,%xmm0
41:
                                addps %xmm5, %xmm2
      0f 58 d5
44:
47:
      0f 58 cc
                                addps %xmm4,%xmm1
      39 c7
                                       %eax,%edi
4a:
                                cmp
      77 da
                                       28 <compute sum SSE+0x18>
4c:
                                jа
                                                                          Loop body
      0f 58 c3
                                addps
                                       %xmm3,%xmm0
<u>4e:</u>
      0f 58 c2
51:
                                addps
                                       %xmm2,%xmm0
      0f 58 c1
                                addps
                                       %xmm1,%xmm0
54:
57:
      f2 Of 7c c0
                                haddps %xmm0, %xmm0
      f2 Of 7c c0
5b:
                                haddps %xmm0,%xmm0
5f:
      с3
                                retq
```

#### **Vectorization and the Intel compiler**



- Intel compiler will try to use SIMD instructions when enabled to do so
  - "Poor man's vector computing"
  - Compiler can emit messages about vectorized loops (not by default):

```
plain.c(11): (col. 9) remark: LOOP WAS VECTORIZED.
```

- Use option -vec\_report3 to get full compiler output about which loops were vectorized and which were not and why (data dependencies!)
- Some obstructions will prevent the compiler from applying vectorization even if it is possible
- You can use source code directives to provide more information to the compiler

#### **Vectorization compiler options**



- The compiler will vectorize starting with -02.
- To enable specific SIMD extensions use the –x option:
  - -xSSE2 vectorize for SSE2 capable machines Available SIMD extensions: SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX

-xAVX on Sandy Bridge processors

#### Recommended option:

-xHost will optimize for the architecture you compile on

On AMD Opteron: use plain -o3 as the -x options may involve CPU type checks.

#### **Vectorization compiler options**



#### Controlling non-temporal stores (part of the SIMD extensions)

-opt-streaming-stores always|auto|never

**always** use NT stores, assume application is memory

bound (use with caution!)

auto compiler decides when to use NT stores

**never** do not use NT stores unless activated by

source code directive

#### Rules for vectorizable loops



- Countable
- 2. Single entry and single exit
- 3. Straight line code
- 4. No function calls (exception intrinsic math functions)

#### **Better performance with:**

- 1. Simple inner loops with unit stride
- Minimize indirect addressing
- 3. Align data structures (SSE 16 bytes, AVX 32 bytes)
- 4. In C use the restrict keyword for pointers to rule out aliasing

#### **Obstacles for vectorization:**

- 1. Non-contiguous memory access
- Data dependencies

#### **Vectorization source code directives**



- Fine-grained control of loop vectorization
- Use !DEC\$ (Fortran) or #pragma (C/C++) sentinel to start a compiler directive
- #pragma vector always vectorize even if it seems inefficient (hint!)
- #pragma novector do not vectorize even if possible
- #pragma vector nontemporal use NT stores when allowed (i.e. alignment conditions are met)
- #pragma vector aligned specifies that all array accesses are aligned to 16-byte boundaries (DANGEROUS! You must not lie about this!)

#### **User mandated vectorization**



- Since Intel Compiler 12.0 the simd pragma is available
- #pragma simd enforces vectorization where the other pragmas fail
- Prerequesites:
  - Countable loop
  - Innermost loop
  - Must conform to for-loop style of OpenMP worksharing constructs
- There are additional clauses: reduction, vectorlength, private
- Refer to the compiler manual for further details

```
#pragma simd reduction(+:x)
for (int i=0; i<n; i++) {
    x = x + A[i];
}</pre>
```

 NOTE: Using the #pragma simd the compiler may generate incorrect code if the loop violates the vectorization rules!

#### **x86 Architecture:**

#### SIMD and Alignment



#### Alignment issues

- Alignment of arrays with SSE (AVX) should be on 16-byte (32-byte) boundaries to allow packed aligned loads and NT stores (for Intel processors)
  - AMD has a scalar nontemporal store instruction
- Otherwise the compiler will revert to unaligned loads and not use NT stores – even if you say vector nontemporal
- Modern x86 CPUs have less (not zero) impact for misaligned LD/ST, but Xeon Phi relies heavily on it!
- How is manual alignment accomplished?
- Dynamic allocation of aligned memory (align = alignment boundary):



## Efficient parallel programming on ccNUMA nodes

Performance characteristics of ccNUMA nodes
First touch placement policy
C++ issues
ccNUMA locality and dynamic scheduling
ccNUMA locality beyond first touch

#### ccNUMA performance problems

"The other affinity" to care about



#### ccNUMA:

- Whole memory is transparently accessible by all processors
- but physically distributed
- with varying bandwidth and latency
- and potential contention (shared memory paths)
- How do we make sure that memory access is always as "local" and "distributed" as possible?



 Page placement is implemented in units of OS pages (often 4kB, possibly more)

#### **Cray XE6 Interlagos node**

4 chips, two sockets, 8 threads per ccNUMA domain



#### ccNUMA map: Bandwidth penalties for remote access

- Run 8 threads per ccNUMA domain (1 chip)
- Place memory in different domain → 4x4 combinations
- STREAM triad benchmark using nontemporal stores





#### numactl as a simple ccNUMA locality tool:

How do we enforce some locality of access?



numactl can influence the way a binary maps its memory pages:

#### Examples:

```
for m in `seq 0 3`; do
    for c in `seq 0 3`; do
    env OMP_NUM_THREADS=8 \
        numactl --membind=$m --cpunodebind=$c ./stream
    enddo
enddo
```

```
env OMP_NUM_THREADS=4 numactl --interleave=0-3 \
likwid-pin -c N:0,4,8,12 ./stream
```

But what is the default without numactl?

#### ccNUMA default memory locality



"Golden Rule" of ccNUMA:

## A memory page gets mapped into the local memory of the processor that first touches it!

- Except if there is not enough local memory available
- This might be a problem, see later
- Caveat: "touch" means "write", not "allocate"
- Example:

Memory not mapped here yet

It is sufficient to touch a single item to map the entire page

#### Coding for ccNUMA data locality



#### Most simple case: explicit initialization

```
integer, parameter :: N=10000000
double precision A(N), B(N)
A=0.d0
!$OMP parallel do
do i = 1, N
 B(i) = function (A(i))
end do
!$OMP end parallel do
```

```
integer, parameter :: N=10000000
double precision A(N),B(N)
!$OMP parallel
!$OMP do schedule(static)
do i = 1, N
 A(i) = 0.d0
end do
!$OMP end do
!$OMP do schedule(static)
do i = 1, N
 B(i) = function (A(i))
end do
!$OMP end do
!$OMP end parallel
```

#### Coding for ccNUMA data locality



 Sometimes initialization is not so obvious: I/O cannot be easily parallelized, so "localize" arrays before I/O



```
integer, parameter :: N=10000000
double precision A(N),B(N)
!$OMP parallel
!$OMP do schedule(static)
do i = 1, N
 A(i) = 0.d0
end do
!$OMP end do
!$OMP single
READ (1000) A
!$OMP end single
!$OMP do schedule(static)
do i = 1, N
 B(i) = function (A(i))
end do
!$OMP end do
!$OMP end parallel
```

#### **Coding for Data Locality**



- Required condition: OpenMP loop schedule of initialization must be the same as in all computational loops
  - Only choice: static! Specify explicitly on all NUMA-sensitive loops, just to be sure...
  - Imposes some constraints on possible optimizations (e.g. load balancing)
  - Presupposes that all worksharing loops with the same loop length have the same thread-chunk mapping
  - If dynamic scheduling/tasking is unavoidable, more advanced methods may be in order
    - See below
- How about global objects?
  - Better not use them
  - If communication vs. computation is favorable, might consider properly placed copies of global data
- C++: Arrays of objects and std::vector<> are by default initialized sequentially
  - STL allocators provide an elegant solution

#### **Coding for Data Locality:**

Placement of static arrays or arrays of objects



Don't forget that constructors tend to touch the data members of an object. Example:

```
class D {
  double d;
public:
  D(double _d=0.0) throw() : d(_d) {}
  inline D operator+(const D& o) throw() {
    return D(d+o.d);
  }
  inline D operator*(const D& o) throw() {
    return D(d*o.d);
  }
...
};
```

```
→ placement problem with
  D* array = new D[1000000];
```

#### **Coding for Data Locality:**

#### Parallel first touch for arrays of objects



Solution: Provide overloaded D::operator new[]

```
void* D::operator new[](size t n) {
  char *p = new char[n];  // allocate
                                                 parallel first
                                                 touch
  size t i,j;
#pragma omp parallel for private(j) schedule(...)
  for (i=0; i < n; i += size of (D))
    for(j=0; j<sizeof(D); ++j)</pre>
      p[i+j] = 0;
  return p;
void D::operator delete[](void* p) throw() {
  delete [] static cast<char*>p;
```

Placement of objects is then done automatically by the C++ runtime via "placement new"

#### **Coding for Data Locality:**

NUMA allocator for parallel first touch in std::vector<>



```
template <class T> class NUMA Allocator {
public:
  T* allocate(size_type numObjects, const void
               *localityHint=0) {
    size type ofs,len = numObjects * sizeof(T);
    void *m = malloc(len);
    char *p = static cast<char*>(m);
    int i,pages = len >> PAGE BITS;
#pragma omp parallel for schedule(static) private(ofs)
    for(i=0; i<pages; ++i) {</pre>
      ofs = static cast<size t>(i) << PAGE BITS;</pre>
      p[ofs]=0;
    return static cast<pointer>(m);
           Application:
```

vector<double, NUMA Allocator<double> > x(10000000)

#### **Diagnosing Bad Locality**



- If your code is cache-bound, you might not notice any locality problems
- Otherwise, bad locality limits scalability at very low CPU numbers (whenever a node boundary is crossed)
  - If the code makes good use of the memory interface
  - But there may also be a general problem in your code...
- Running with numactl --interleave might give you a hint
  - See later
- Consider using performance counters
  - LIKWID-perfctr can be used to measure nonlocal memory accesses
  - Example for Intel Westmere dual-socket system (Core i7, hex-core):

```
env OMP_NUM_THREADS=12 likwid-perfctr -g MEM -C N:0-11 ./a.out
```

## Using performance counters for diagnosing bad ccNUMA access locality



Intel Westmere EP node (2x6 cores):



#### If all fails...

- Even if all placement rules have been carefully observed, you may still see nonlocal memory traffic. Reasons?
  - Program has erratic access patters → may still achieve some access parallelism (see later)
  - OS has filled memory with buffer cache data:

# numactl --hardware # idle node!

```
available: 2 nodes (0-1)
node 0 size: 2047 MB
node 0 free: 906 MB
node 1 size: 1935 MB
node 1 free: 1798 MB

top - 14:18:25 up 92 days, 6:07, 2 users, load average: 0.00, 0.02, 0.00
Mem: 4065564k total, 1149400k used, 2716164k free, 43388k buffers
Swap: 2104504k total, 2656k used, 2101848k free, 1038412k cached
```

#### ccNUMA problems beyond first touch:

#### Buffer cache

## optional

#### OS uses part of main memory for disk buffer (FS) cache

- If FS cache fills part of memory, apps will probably allocate from foreign domains
- → non-local access!
- "sync" is not sufficient to drop buffer cache blocks



#### Remedies

- Drop FS cache pages after user job has run (admin's job)
  - seems to be automatic after aprun has finished on Crays
- User can run "sweeper" code that allocates and touches all physical memory before starting the real application
- numactl tool or aprun can force local allocation (where applicable)
- Linux: There is no way to limit the buffer cache size in standard kernels

#### The curse and blessing of interleaved placement:

OpenMP STREAM on a Cray XE6 Interlagos node



- Parallel init: Correct parallel initialization
- LD0: Force data into LD0 via numactl -m 0
- Interleaved: numactl --interleave <LD range>



#### The curse and blessing of interleaved placement:

OpenMP STREAM triad on 4-socket (48 core) Magny Cours node



- Parallel init: Correct parallel initialization
- LD0: Force data into LD0 via numactl -m 0
- Interleaved: numactl --interleave <LD range>



#### **Summary on ccNUMA issues**



- Identify the problem
  - Is ccNUMA an issue in your code?
  - Simple test: run with numactl --interleave
- Apply first-touch placement
  - Look at initialization loops
  - Consider loop lengths and static scheduling
  - C++ and global/static objects may require special care
- If dynamic scheduling cannot be avoided
  - Distribute the data anyway, just do not use sequential placement!

 Not shown here: OS file buffer cache may impact proper placement



#### Simultaneous multithreading (SMT)

Principles and performance impact
SMT vs. independent instruction streams
Facts and fiction

### SMT Makes a single physical core appear as two or more "logical" cores → multiple threads/processes run concurrently



SMT principle (2-way example):



#### **SMT** impact



- SMT is primarily suited for increasing processor throughput
  - With multiple threads/processes running concurrently
- Scientific codes tend to utilize chip resources quite well
  - Standard optimizations (loop fusion, blocking, ...)
  - High data and instruction-level parallelism
  - Exceptions do exist

#### SMT is an important topology issue

- SMT threads share almost all core resources
  - Pipelines, caches, data paths
- Affinity matters!
- If SMT is not needed
  - pin threads to physical cores
  - or switch it off via BIOS etc.



#### **SMT** impact

- SMT adds another layer of topology (inside the physical core)
- Caveat: SMT threads share all caches!
- Possible benefit: Better pipeline throughput
  - Filling otherwise unused pipelines
  - Filling pipeline bubbles with other thread's executing instructions:





Unrelated work in other thread can fill the pipeline bubbles

Westmere EP

Beware: Executing it all in a single thread (if possible) may reach the same goal without SMT:

**Memory** 

stalls until previous MULT

is over

#### Simultaneous recursive updates with SMT



Intel Sandy Bridge (desktop) 4-core; 3.5 GHz; SMT MULT Pipeline depth: 5 stages → 1 F / 5 cycles for recursive update



#### Simultaneous recursive updates with SMT



Intel Sandy Bridge (desktop) 4-core; 3.5 GHz; SMT MULT Pipeline depth: 5 stages → 1 F / 5 cycles for recursive update







5 independent updates on a single thread do the same job!

#### Simultaneous recursive updates with SMT



Intel Sandy Bridge (desktop) 4-core; 3.5 GHz; SMT

Pure update benchmark can be vectorized → 2 F / cycle (store limited)



#### **Recursive update:**

- SMT can fill pipeline bubles
- A single thread can do so as well
- Bandwidth does not increase through SMT
- SMT can not replace SIMD!

#### **SMT** myths: Facts and fiction (1)



Myth: "If the code is compute-bound, then the functional units should be saturated and SMT should show no improvement."

#### Truth

- 1. A compute-bound loop does not necessarily saturate the pipelines; dependencies can cause a lot of bubbles, which may be filled by SMT threads.
- 2. If a pipeline is already full, SMT will not improve its utilization

# Thread 0: do i=1,N A(i)=A(i-1)\*c B(i)=B(i-1)\*d enddo Thread 1: do i=1,N A(i)=A(i-1)\*c B(i)=B(i-1)\*d enddo



#### SMT myths: Facts and fiction (2)



- Myth: "If the code is memory-bound, SMT should help because it can fill the bubbles left by waiting for data from memory."
- Truth:
  - 1. If the maximum memory bandwidth is already reached, SMT will not
    - help since the relevant resource (bandwidth) is exhausted.
  - 2. If the relevant bottleneck is not exhausted, SMT may help since it can fill bubbles in the LOAD pipeline.

This applies also to other "relevant bottlenecks!"



### Things to remember



### **Goals for optimization:**

- 1. Map your work to an instruction mix with highest throughput using the most effective instructions.
- 2. Reduce data volume over slow data paths fully utilizing available bandwidth.
- 3. Avoid possible hazards/overhead which prevent reaching goals one and two.



# Pattern-driven Performance Engineering

Basics of Benchmarking Performance Patterns Signatures

# **Basics of optimization**



- Define relevant test cases
- 2. Establish a sensible performance metric
- Acquire a runtime profile (sequential)
- 4. Identify hot kernels (Hopefully there are any!)
- 5. Carry out optimization process for each kernel



#### **Motivation:**

- Understand observed performance
- Learn about code characteristics and machine capabilities
- Deliberately decide on optimizations

# **Best practices for benchmarking**



#### Preparation

- Reliable timing (minimum time which can be measured?)
- Document code generation (flags, compiler version)
- Get access to an exclusive system
- System state (clock speed, turbo mode, memory, caches)
- Consider to automate runs with a script (shell, python, perl)

### Doing

- Affinity control
- Check: Is the result reasonable?
- Is result deterministic and reproducible?
- Statistics: Mean, Best ?
- Basic variants: Thread count, affinity, working set size

# Thinking in bottlenecks



- A bottleneck is a performance limiting setting
- Microarchitectures expose numerous bottlenecks

#### **Observation 1:**

Most applications face a single bottleneck at a time!

#### **Observation 2:**

There is a limited number of relevant bottlenecks!

# **Performance Engineering Process: Analysis**





Step 1 Analysis: Understanding observed performance

# Performance Engineering Process: Modeling





Step 2 Formulate Model: Validate pattern and get quantitative insight

# **Performance Engineering Process: Optimization**





Step 3 Optimization: Improve utilization of available resources

# Performance pattern classification



1. Maximum resource utilization (computing at a bottleneck)

Hazards (something "goes wrong")

3. Work related (too much work or too inefficiently done)

# Patterns (I): Bottlenecks & hazards



| Pattern                               |                      | Performance behavior                                                                     | Metric signature, LIKWID performance group(s)                                                                              |
|---------------------------------------|----------------------|------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------|
| Bandwidth saturation Jacobi           |                      | Saturating speedup across cores sharing a data path                                      | Bandwidth meets BW of suitable streaming benchmark (MEM, L3)                                                               |
| ALU saturation In-L1 sum optimal code |                      | Throughput at design limit(s)                                                            | Good (low) CPI, integral ratio of cycles to specific instruction count(s) (FLOPS_*, DATA, CPI)                             |
| Inefficient<br>data<br>access         | Excess data volume   | Simple bandwidth performance model much too optimistic                                   | Low BW utilization / Low cache hit ratio, frequent CL evicts or replacements (CACHE, DATA, MEM)                            |
|                                       | Latency-bound access |                                                                                          |                                                                                                                            |
| Micro-architectural anomalies         |                      | Large discrepancy from simple performance model based on LD/ST and arithmetic throughput | Relevant events are very hardware-specific, e.g., memory aliasing stalls, conflict misses, unaligned LD/ST, requeue events |

# Patterns (II): Hazards



| Pattern                                   | Performance behavior                                                                        | Metric signature, LIKWID performance group(s)                                                          |
|-------------------------------------------|---------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------|
| False sharing of cache lines              | Large discrepancy from performance model in parallel case, bad scalability                  | Frequent (remote) CL evicts (CACHE)                                                                    |
| Bad ccNUMA page placement                 | Bad or no scaling across NUMA domains, performance improves with interleaved page placement | Unbalanced bandwidth on memory interfaces / High remote traffic (MEM)                                  |
| In-L1 sum w/o unrolling Pipelining issues | In-core throughput far from design limit, performance insensitive to data set size          | (Large) integral ratio of cycles to specific instruction count(s), bad (high) CPI (FLOPS_*, DATA, CPI) |
| Control flow issues                       | See above                                                                                   | High branch rate and branch miss ratio (BRANCH)                                                        |

# Patterns (III): Work-related



| Pattern                          |                          | Performance behavior                                                                                                 | Metric signature, LIKWID performance group(s)                                                                       |
|----------------------------------|--------------------------|----------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------|
| Load imbalance / serial fraction |                          | Saturating/sub-linear speedup                                                                                        | Different amount of "work" on the cores (FLOPS_*); note that instruction count is not reliable!                     |
| Synchronization overhead         |                          | Speedup going down as more cores are added / No speedup with small problem sizes / Cores busy but low FP performance | Large non-FP instruction count (growing with number of cores used) / Low CPI (FLOPS_*, CPI)                         |
| Instruction overhead             |                          | Low application performance, good scaling across cores, performance insensitive to problem size                      | Low CPI near theoretical limit / Large non-FP instruction count (constant vs. number of cores) (FLOPS_*, DATA, CPI) |
| Code composition                 | Expensive instructions   | Similar to instruction overhead  C/C++ aliasing problem                                                              | Many cycles per instruction (CPI) if the problem is large-latency arithmetic                                        |
|                                  | Ineffective instructions |                                                                                                                      | Scalar instructions dominating in data-parallel loops (FLOPS_*, CPI)                                                |

### **Patterns conclusion**



- Pattern signature = performance behavior + hardware metrics
- Patterns are applies hotspot (loop) by hotspot
- Patterns map to typical execution bottlenecks
- Patterns are extremely helpful in classifying performance issues
  - The first pattern is always a hypothesis
  - Validation by tanking data (more performance behavior, HW metrics)
  - Refinement or change of pattern
- Performance models are crucial for most patterns
  - Model follows from pattern

### **Tutorial conclusion**



- Multicore architecture == multiple complexities
  - Affinity matters → pinning/binding is essential
  - Bandwidth bottlenecks → inefficiency is often made on the chip level
  - Topology dependence of performance features → know your hardware!
- Put cores to good use
  - Bandwidth bottlenecks → surplus cores → functional parallelism!?
  - Shared caches → fast communication/synchronization → better implementations/algorithms?
- Simple modeling techniques and patterns help us
  - ... understand the limits of our code on the given hardware
  - ... identify optimization opportunities
  - ... learn more, especially when they do not work!
- Simple tools get you 95% of the way
  - e.g., with the LIKWID tool suite





Moritz Kreutzer
Markus Wittmann
Thomas Zeiser
Michael Meier
Holger Stengel
Thomas Röhl
Faisal Shahzad
Salah Saleh







# THANK YOU.

### **Presenter Biographies**



Georg Hager holds a PhD in computational physics from the University of Greifswald. He is a senior research scientist in the HPC group at Erlangen Regional Computing Center (RRZE). Recent research includes architecture-specific optimization for current microprocessors, performance modeling on processor and system levels, and the efficient use of hybrid parallel systems. His textbook "Introduction to High Performance Computing for Scientists and Engineers" is required or recommended reading in many HPC-related courses around the world. See his blog at <a href="http://blogs.fau.de/hager">http://blogs.fau.de/hager</a> for current activities, publications, and talks.



Jan Treibig holds a PhD in Computer Science from the University of Erlangen. He is now a postdoctoral researcher in the HPC Services group at Erlangen Regional Computing Center (RRZE). His current research revolves around architecture-specific and low-level optimization for current processor architectures, performance modeling on processor and system levels, and programming tools. He is the developer of LIKWID, a collection of lightweight performance tools. In his daily work he is involved in all aspects of user support in High Performance Computing: training, code parallelization, profiling and optimization, and the evaluation of novel computer architectures.



• **Gerhard Wellein** holds a PhD in solid state physics from the University of Bayreuth and is a professor at the Department for Computer Science at the University of Erlangen. He leads the HPC group at Erlangen Regional Computing Center (RRZE) and has more than ten years of experience in teaching HPC techniques to students and scientists from computational science and engineering programs. His research interests include solving large sparse eigenvalue problems, novel parallelization approaches, performance modeling, and architecture-specific optimization.



#### **Abstract**



- SC14 tutorial: Node-Level Performance Engineering
- Presenter(s): Georg Hager, Jan Treibig, Gerhard Wellein

#### ABSTRACT:

The advent of multi- and manycore chips has led to a further opening of the gap between peak and application performance for many scientific codes. This trend is accelerating as we move from petascale to exascale. Paradoxically, bad node-level performance helps to "efficiently" scale to massive parallelism, but at the price of increased overall time to solution. If the user cares about time to solution on any scale, optimal performance on the node level is often the key factor. We convey the architectural features of current processor chips, multiprocessor nodes, and accelerators, as far as they are relevant for the practitioner. Peculiarities like SIMD vectorization, shared vs. separate caches, bandwidth bottlenecks, and ccNUMA characteristics are introduced, and the influence of system topology and affinity on the performance of typical parallel programming constructs is demonstrated. Performance engineering and performance patterns are suggested as powerful tools that help the user understand the bottlenecks at hand and to assess the impact of possible code optimizations. A cornerstone of these concepts is the roofline model, which is described in detail, including useful case studies, limits of its applicability, and possible refinements.

#### References



#### Books:

 G. Hager and G. Wellein: <u>Introduction to High Performance Computing for Scientists and</u> Engineers. CRC Computational Science Series, 2010. ISBN 978-1439811924

#### Papers:

- H. Stengel, J. Treibig, G. Hager, and G. Wellein: Quantifying performance bottlenecks of stencil computations using the Execution-Cache-Memory model. Submitted. Preprint: <u>arXiv:1410.5010</u>
- M. Kreutzer, G. Hager, G. Wellein, A. Pieper, A. Alvermann, and H. Fehske: Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems. Submitted. Preprint:arXiv:1410.5242
- M. Kreutzer, G. Hager, G. Wellein, H. Fehske, and A. R. Bishop: A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units. SIAM Journal on Scientific Computing (SISC) 36(5), C401–C423 (2014). DOI: 10.1137/130930352 Preprint: arXiv:1307.6209
- G. Hager, J. Treibig, J. Habich and G. Wellein: Exploring performance and power properties of modern multicore chips via simple machine models. Concurrency and Computation: Practice and Experience (2013).
  - DOI: 10.1002/cpe.3180 Preprint: arXiv:1208.2908
- J. Treibig, G. Hager and G. Wellein: Performance patterns and hardware metrics on modern multicore processors: Best practices for performance engineering. Workshop on Productivity and Performance (PROPER 2012) at Euro-Par 2012, August 28, 2012, Rhodes Island, Greece. Preprint: <u>arXiv:1206.3738</u>

### References



#### Papers continued:

- M. Kreutzer, G. Hager, G. Wellein, H. Fehske, A. Basermann and A. R. Bishop: Sparse Matrix-vector Multiplication on GPGPU Clusters: A New Storage Format and a Scalable Implementation. Workshop on Large-Scale Parallel Processing 2012 (LSPP12), DOI: 10.1109/IPDPSW.2012.211
- J. Treibig, G. Hager, H. Hofmann, J. Hornegger and G. Wellein: Pushing the limits for medical image reconstruction on recent standard multicore processors. International Journal of High Performance Computing Applications, (published online before print).
   DOI: 10.1177/1094342012442424
- G. Wellein, G. Hager, T. Zeiser, M. Wittmann and H. Fehske: Efficient temporal blocking for stencil computations by multicore-aware wavefront parallelization. Proc. COMPSAC 2009. DOI: 10.1109/COMPSAC.2009.82
- M. Wittmann, G. Hager, J. Treibig and G. Wellein: Leveraging shared caches for parallel temporal blocking of stencil codes on multicore processors and clusters. Parallel Processing Letters 20 (4), 359-376 (2010).
  - DOI: 10.1142/S0129626410000296. Preprint: arXiv:1006.3148
- J. Treibig, G. Hager and G. Wellein: LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments. Proc. <u>PSTI2010</u>, the First International Workshop on Parallel Software Tools and Tool Infrastructures, San Diego CA, September 13, 2010. <u>DOI:</u> 10.1109/ICPPW.2010.38. Preprint: <u>arXiv:1004.4431</u>

### References



#### Papers continued:

- G. Schubert, H. Fehske, G. Hager, and G. Wellein: Hybrid-parallel sparse matrix-vector multiplication with explicit communication overlap on current multicore-based systems. Parallel Processing Letters 21(3), 339-358 (2011).
   DOI: 10.1142/S0129626411000254
- J. Treibig, G. Wellein and G. Hager: Efficient multicore-aware parallelization strategies for iterative stencil computations. Journal of Computational Science 2 (2), 130-137 (2011). <u>DOI</u> 10.1016/j.jocs.2011.01.010
- J. Habich, T. Zeiser, G. Hager and G. Wellein: Performance analysis and optimization strategies for a D3Q19 Lattice Boltzmann Kernel on nVIDIA GPUs using CUDA. Advances in Engineering Software and Computers & Structures 42 (5), 266–272 (2011). <u>DOI:</u> 10.1016/j.advengsoft.2010.10.007
- J. Treibig, G. Hager and G. Wellein: Multicore architectures: Complexities of performance prediction for Bandwidth-Limited Loop Kernels on Multi-Core Architectures.
   DOI: 10.1007/978-3-642-13872-0\_1, Preprint: arXiv:0910.4865.
- G. Hager, G. Jost, and R. Rabenseifner: Communication Characteristics and Hybrid MPI/OpenMP Parallel Programming on Clusters of Multi-core SMP Nodes. In: Proceedings of the Cray Users Group Conference 2009 (CUG 2009), Atlanta, GA, USA, May 4-7, 2009. PDF
- R. Rabenseifner and G. Wellein: Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures. International Journal of High Performance Computing Applications 17, 49-62, February 2003. DOI:10.1177/1094342003017001005