

# Performance-oriented programming on multicore-based systems, with a focus on the Cray XE6/XC30

Georg Hager<sup>(a)</sup>, Jan Treibig<sup>(a)</sup>, and Gerhard Wellein<sup>(a,b)</sup>

- (a)HPC Services, Erlangen Regional Computing Center (RRZE)
- (b) Department for Computer Science Friedrich-Alexander-University Erlangen-Nuremberg

Cray XE6 optimization workshop, March17-20, 2014, HLRS

# The Rules™



# There is no alternative to knowing what is going on between your code and the hardware

Without performance modeling, optimizing code is like stumbling in the dark

# **Agenda**



- Basics of multicore processor and node architecture
- Multicore performance and tools
  - Affinity enforcement
  - Performance counter measurements
  - Basics and best practice for performance counter profiling
- Microbenchmarking for architectural exploration
- Roadblocks for scalability on multicore chips
  - Scaling properties and typical OpenMP overhead
  - Bandwidth saturation in cache and main memory
- Simple Performance Modeling: The Roofline model
- Optimal utilization of parallel resources
  - Programming for SIMD parallelism
  - Programming in ccNUMA environments
- Case study: The roofline model for a 3D Jacobi solver
  - Understanding performance characteristics
  - Model-guided optimization
- Case study: sparse matrix-vector multiplication
  - What we can do if the roofline model "does not work"



# Multicore processor and system architecture – an overview

**Performance composition** 

Memory organization: UMA vs. ccNUMA

**Simultaneous Multi-Threading (SMT)** 

**Data paths in HPC systems** 

Memory access

Single Instruction Multiple Data (SIMD)

**Topology and programming models** 

# There is no longer a single driving force for chip performance!





4,6,8 core variants available

# Floating Point (FP) Performance:

$$P = n_{core} * F * S * v$$

number of cores:

F

S

**FP** instructions per cycle: (1 MULT and 1 ADD)

"Sandy Bridge EP" socket

FP ops / instruction: 4 (dp) / 8 (sp)

(256 Bit SIMD registers – "AVX")

**Clock speed:** 

2.5 GHz

**TOP500** rank 1 (1996)

P = 160 GF/s (dp) / 320 GF/s (sp)

But: P=5 GF/s (dp) for serial, non-SIMD code

#### From UMA to ccNUMA





## Yesterday (2006): Dual-socket Intel "Core2" node:



**Uniform Memory Architecture (UMA)** 

Flat memory; symmetric MPs

But: system "anisotropy"

#### **Today:** Dual-socket Intel (Westmere) node:



Cache-coherent Non-Uniform Memory Architecture (ccNUMA)

HT / QPI provide scalable bandwidth at the price of ccNUMA architectures: Where does my data finally end up?

On AMD it is even more complicated -> ccNUMA within a socket!

#### Back to the 2-chip-per-case age

16 core AMD Interlagos – a 2x8-core ccNUMA socket





■ WHY? → Shared resources are hard to scale:

2 x 2 memory channels vs. 1 x 4 memory channels per socket

# Cray XE6 (Hermit) "Interlagos" 16-core dual socket node





- Two 8- (integer-) core chips per socket @ 2.3 GHz (3.3 @ turbo)
- Separate DDR3 memory interface per chip
  - ccNUMA on the socket!
- Shared FP unit per pair of integer cores ("module")
  - 2 128bit FMA FP units
  - SSE4.2, AVX, FMA4
- 16 kB L1 data cache per core
- 2 MB L2 cache per module
- 8 MB L3 cache per chip (6 MB usable)

# SMT Makes a single physical core appear as two or more "logical" cores → multiple threads/processes run concurrently



SMT principle (2-way example):



#### **Another flavor of "SMT"**

# AMD Interlagos / Bulldozer



Up to 16 cores (8 Bulldozer modules) in a single socket



## Cray XC30 "SandyBridge-EP" 8-core dual socket node





- 8 cores per socket 2.7 GHz
   (3.5 @ turbo)
- DDR3 memory interface with 4 channels per chip
- Two-way SMT
- Two 256-bit SIMD FP units
  - SSE4.2, AVX
- 32 kB L1 data cache per core
- 256 kB L2 cache per core
- 20 MB L3 cache per chip

# Latency and bandwidth in modern computer environments





# Interlude: Data transfers in a memory hierarchy



- How does data travel from memory to the CPU and back?
- Example: Array copy A(:)=C(:)





# SIMD-processing – Basics



- Single Instruction Multiple Data (SIMD) operations allow the concurrent execution of the same operation on "wide" registers.
- x86 SIMD instruction sets:
  - SSE: register width = 128 Bit → 2 double precision floating point operands
  - AVX: register width = 256 Bit → 4 double precision floating point operands
- Adding two registers holding double precision floating point operands



# Challenges of modern compute nodes



15

Heterogeneous programming SIMD + OpenMP + MPI + CUDA, OpenCL,...



# Parallelism in modern computer systems



Parallel and shared resources within a shared-memory node



#### Parallel resources:

- Execution/SIMD units
- Cores
- Inner cache levels
- Sockets / memory domains
- Multiple accelerators

#### **Shared resources:**

- Outer cache level per socket
- Memory bus per socket
- Intersocket link 8
- PCle bus(es)
- Other I/O resources

How does your application react to all of those details?

# Parallel programming models

on multicore multisocket nodes



- Shared-memory (intra-node)
  - Good old MPI (current standard: 2.2)
  - OpenMP (current standard: 3.0)
  - POSIX threads
  - Intel Threading Building Blocks
  - Cilk++, OpenCL, StarSs,... you name it
- Distributed-memory (inter-node)
  - MPI (current standard: 2.2)
  - PVM (gone)
- Hybrid
  - Pure MPI
  - MPI+OpenMP
  - MPI + any shared-memory model

All models require awareness of topology and affinity issues for getting best performance out of the machine!

## **Parallel programming models:**

Pure MPI





#### **Parallel programming models:**

Pure threading on the node



#### Machine structure is invisible to user

- → Very simple programming model
- Threading SW (OpenMP, pthreads, TBB,...) should know about the details

#### Performance issues

- Synchronization overhead
- Memory access
- Node topology







# **Multicore Performance and Tools**

# **Probing node topology**

- Standard tools
- likwid-topology

# How do we figure out the node topology?



#### Topology =

- Where in the machine does core #n reside? And do I have to remember this awkward numbering anyway?
- Which cores share which cache levels?
- Which hardware threads ("logical cores") share a physical core?

#### Linux

- cat /proc/cpuinfo is of limited use
- Core numbers may change across kernels and BIOSes even on identical hardware
- numactl --hardware prints ccNUMA node information
- Information on caches is harder to obtain

```
$ numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 8189 MB
node 0 free: 3824 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 8192 MB
node 1 free: 28 MB
node 2 cpus: 18 19 20 21 22 23
node 2 size: 8192 MB
```

node 3 size: 8192 MB node 3 free: 7840 MB

node 2 free: 8036 MB

# How do we figure out the node topology?



LIKWID tool suite:

Like

Knew

**What** 

ľm

**Doing** 

Open source tool collection (developed at RRZE):

http://code.google.com/p/likwid



J. Treibig, G. Hager, G. Wellein: *LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments.* Accepted for PSTI2010, Sep 13-16, 2010, San Diego, CA http://arxiv.org/abs/1004.4431

# **Likwid Tool Suite**



#### Command line tools for Linux:

- easy to install
- works with standard linux 2.6 kernel
- simple and clear to use
- supports Intel and AMD CPU



#### Current tools:

- likwid-topology: Print thread and cache topology
- likwid-pin: Pin threaded application without touching code
- likwid-perfctr: Measure performance counters
- likwid-mpirun: mpirun wrapper script for easy LIKWID integration
- likwid-bench: Low-level bandwidth benchmark generator tool
- ... some more

# Output of likwid-topology -g

on one node of Cray XE6 "Hermit"



```
CPU type: AMD Interlagos processor
Hardware Thread Topology
Sockets:
Cores per socket: 16
Threads per core: 1
HWThread
            Thread
                         Core
                                      Socket
0
                         0
                                      0
                                      0
[\ldots]
16
                                      1
                                      1
17
18
19
[...]
Socket 0: ( 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 )
Socket 1: ( 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 )
*******************
Cache Topology
Level: 1
Size: 16 kB
Cache groups: (0)(1)(2)(3)(4)(5)(6)(7)(8)(9)(10)(11)(12)(13
) (14) (15) (16) (17) (18) (19) (20) (21) (22) (23) (24) (25) (26) (27) (
28 ) (29 ) (30 ) (31 )
```

# **Output of likwid-topology continued**



```
Level: 2
Size:
      2 MB
Cache groups: (01)(23)(45)(67)(89)(1011)(1213)(1415)(1617)(18
19 ) ( 20 21 ) ( 22 23 ) ( 24 25 ) ( 26 27 ) ( 28 29 ) ( 30 31 )
Level: 3
Size: 6 MB
Cache groups: (0 1 2 3 4 5 6 7 ) (8 9 10 11 12 13 14 15 ) (16 17 18 19 20 21 22 23 ) (24 25 26
27 28 29 30 31 )
*******************
NUMA Topology
**********************
NUMA domains: 4
Domain 0:
Processors: 0 1 2 3 4 5 6 7
Memory: 7837.25 MB free of total 8191.62 MB
Domain 1:
Processors: 8 9 10 11 12 13 14 15
Memory: 7860.02 MB free of total 8192 MB
Domain 2:
Processors: 16 17 18 19 20 21 22 23
Memory: 7847.39 MB free of total 8192 MB
Domain 3:
Processors: 24 25 26 27 28 29 30 31
Memory: 7785.02 MB free of total 8192 MB
```

# **Output of likwid-topology continued**



|                            | ****              | *****                    | *****            | ***       | *****           | ****                 | *****                         | ****                | *                     |                   |                           |                        |                        |                             |                            |                        |                       |                       |               |                            |                    |             |              |        |
|----------------------------|-------------------|--------------------------|------------------|-----------|-----------------|----------------------|-------------------------------|---------------------|-----------------------|-------------------|---------------------------|------------------------|------------------------|-----------------------------|----------------------------|------------------------|-----------------------|-----------------------|---------------|----------------------------|--------------------|-------------|--------------|--------|
| cket 0:                    |                   |                          |                  |           |                 |                      |                               |                     |                       |                   |                           |                        |                        |                             |                            |                        |                       |                       |               |                            |                    |             |              |        |
| 0                          | +<br>L            | ++<br>  2                | 3                | -+ +<br>  | <br><br>  4     | + +-·<br>            | +<br>5                        | +<br>               | 6                     | +                 | - +<br>  8                | +<br>+                 | +<br>  9               | + +<br>    10               | +<br>                      | +                      | -+ ·                  | +<br>  12             | -+ +<br>      | ++<br>  13                 | <br>- +            | +<br>14     | +            | <br>5  |
| + +                        | +                 | ++                       | +                | -+ +      | +<br>+          | + +                  | +                             | •                   |                       | +                 | •                         |                        | •                      |                             | +                          | +                      | -+ ·                  | +                     | -+ +<br>-+ +  | ++                         | + +                | +           | +            |        |
| 16kB     16                | +                 | 16kB  <br>++             | +                | -+ +      |                 | + +                  | +                             | +                   | +                     | 16kB<br>+         | +                         | +                      | +                      | + +                         | +                          | +                      | -+ -                  | +                     | -+ +          | 16kB  <br>++<br>           | + +                | 16kB  <br>+ | +            |        |
| 2MB                        | 2MB<br>+          |                          | <br> -+ +        | •         |                 |                      | 2MB                           |                     |                       | <br> -<br>        | 2MB                       |                        | <br>+ +                |                             | MB                         |                        | <br>+                 | 2MB                   |               |                            |                    |             |              |        |
| 6MB                        |                   |                          |                  |           |                 |                      |                               |                     |                       |                   | +<br>   <br>  +           | 6MB                    |                        |                             |                            |                        |                       |                       |               |                            |                    |             |              |        |
|                            |                   |                          |                  |           |                 |                      |                               |                     |                       |                   |                           |                        |                        |                             |                            |                        |                       |                       |               |                            |                    |             |              |        |
| ket 1:                     |                   |                          |                  |           |                 |                      |                               |                     |                       |                   |                           |                        |                        |                             |                            |                        |                       |                       |               |                            |                    |             |              |        |
| + +                        | <br>+             | ++                       |                  | -+ +      |                 |                      | •                             | •                   |                       | +                 | •                         |                        | •                      |                             |                            | •                      |                       | +                     | -+ +          | ++                         |                    | +           | +            |        |
| 16   1 1                   | <br>+<br>7  <br>+ | ++<br>++<br>  18  <br>++ | 19<br>  +        |           | •               | i i<br>+ +           | 21  <br>+                     | 2                   | 2                     | 23                | 24<br>- +                 | i<br>+                 | 25<br>+                |                             | i<br>+                     | 27<br>+                | <br>-+ ·<br> <br>-+ · | <br>+<br>  28<br>+    | <br>-+ +<br>  | ++<br>  29  <br>++         |                    | 30          | <br>  3<br>+ | 1      |
| 16     1                   | cB                | ++<br>++<br>  16kB       | +<br>+<br>  16kB | i         | <br><br>  16kB  | <br>  + +<br>      1 | 21  <br>+<br>+<br>16kB        | 2<br>+<br>  16      | 2  <br>+<br>kB        | 23<br>+<br>  16kB | 24<br>  +<br>  16k        | <br> +<br> +<br> B     | 25<br>+<br>  16kB      | 26<br>+ +<br>+ +<br>    16k | <br> +<br> +<br> B         | 27<br>+<br>  16kB      | Ĺ                     | +<br>+<br>  16kB      | Ü             | ++<br>++<br>  16kB         | ı                  | +<br>16kB   | 16           | kB     |
| 16   1<br>+ +<br>16kB   16 | cB                | ++<br>  16kB  <br>++     | +<br>+<br>  16kB | <br> -+ + | +<br>  16kB<br> |                      | 21  <br>+<br>+<br>16kB  <br>+ | 2<br>+<br>  16<br>+ | 2  <br>+<br>kB  <br>+ | 23<br>+<br>  16kB | 24<br>  +<br>  16k<br>  + | <br>+<br>+<br>B  <br>+ | 25<br>+<br>  16kB<br>+ |                             | <br> +<br> +<br> B  <br> + | 27<br>+<br>  16kB<br>+ | <br> -+               | +<br>+<br>  16kB<br>+ | <br> -+ +<br> | ++<br>++<br>  16kB  <br>++ | <br> - +-<br> - +- | 16kB        | 16           | kB<br> |



# **Enforcing thread/process-core affinity under the Linux OS**

- Standard tools and OS affinity facilities under program control
- likwid-pin
- aprun (Cray)

#### **Example: STREAM benchmark on 12-core Intel Westmere:**

Anarchy vs. thread pinning





# **Generic thread/process-core affinity under Linux**

#### Overview



taskset binds processes/threads to a set of CPUs. Examples:

```
taskset 0x0006 ./a.out
taskset -c 4 33187
mpirun -np 2 taskset -c 0,2 ./a.out # doesn't always work
```

- Processes/threads can still move within the set!
- Disadvantage: which CPUs should you bind to on a non-exclusive machine?
- Still of value on multicore/multisocket cluster nodes, UMA or ccNUMA

# **Generic thread/process-core affinity under Linux**



Complementary tool: numactl

```
Example: numactl --physcpubind=0,1,2,3 command [args] Bind process to specified physical core numbers
```

```
Example: numactl --cpunodebind=1 command [args]
Bind process to specified ccNUMA node(s)
```

- Many more options (e.g., interleave memory across nodes)
  - → see section on ccNUMA optimization
- Diagnostic command (see earlier): numactl --hardware
- Again, this is not suitable for a shared machine

# More thread/Process-core affinity ("pinning") options



- Highly OS-dependent system calls
  - But available on all systems

Linux: sched setaffinity(), PLPA (see below) → hwloc

Solaris: processor bind()

Windows: SetThreadAffinityMask()

. . .

- Support for "semi-automatic" pinning in some compilers/environments
  - Intel compilers > V9.1 (KMP AFFINITY environment variable)
  - PGI, Pathscale, GNU
  - SGI Altix dplace (works with logical CPU numbers!)
  - Generic Linux: taskset, numactl, likwid-pin (see below)
- Affinity awareness in MPI libraries
  - SGI MPT
  - OpenMPI
  - Intel MPI

• ...

Example for program trolled affinity: Using PKIPPED affinity!

#### Overview



- Pins processes and threads to specific cores without touching code
- Directly supports pthreads, gcc OpenMP, Intel OpenMP
- Based on combination of wrapper tool together with overloaded pthread library → binary must be dynamically linked!
- Can also be used as a superior replacement for taskset
- Supports logical core numbering within a node and within an existing CPU set
  - Useful for running inside CPU sets defined by someone else, e.g., the MPI start mechanism or a batch system

#### Usage examples:

- likwid-pin -c 0,2,4-6 ./myApp parameters
- likwid-pin -c S0:0-3 ./myApp parameters

Example: Intel OpenMP



#### Running the STREAM benchmark with likwid-pin:

```
$ export OMP NUM THREADS=4
$ likwid-pin -s 0x1 -c 0,1,4,5 ./stream
[likwid-pin] Main PID -> core 0 - OK —
                                                           Main PID always
                                                                pinned
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLE PRECISION word
[... some STREAM output omitted ...]
The *best* time for each test is used
*EXCLUDING* the first and last iterations
[pthread wrapper] PIN MASK: 0->1 1->4 2->5
                                                          Skip shepherd
[pthread wrapper] SKIP MASK: 0x1 —
[pthread wrapper 0] Notice: Using libpthread.so.0
                                                              thread
       threadid 1073809728 -> SKIP
[pthread wrapper 1] Notice: Using libpthread.so.0
        threadid 1078008128 -> core 1 - OK
[pthread wrapper 2] Notice: Using libpthread.so.0
       threadid 1082206528 -> core 4 - OK
                                                            Pin all spawned
[pthread wrapper 3] Notice: Using libpthread.so.0
                                                             threads in turn
       threadid 1086404928 -> core 5 - OK
[... rest of STREAM output omitted ...]
```

#### Using logical core numbering



- Core numbering may vary from system to system even with identical hardware
  - Likwid-topology delivers this information, which can then be fed into likwidpin
- Alternatively, likwid-pin can abstract this variation and provide a purely logical numbering (physical cores first)



- Across all cores in the node:
  OMP NUM THREADS=8 likwid-pin -c N:0-7 ./a.out
- Across the cores in each socket and across sockets in each node:
  OMP NUM THREADS=8 likwid-pin -c S0:0-3@S1:0-3 ./a.out

## Using logical core numbering





# aprun on Cray



- See Cray workshop slides
- aprun supports only physical core numbering
  - This is OK since the cores are always numbered consecutively on Crays
  - Use -ss switch to restrict allocation to local NUMA domain (see later for more on ccNUMA)
  - Use -d \$OMP NUM THREADS or similar for MPI+OMP hybrid code
- See later on how using multiple cores per module/chip/socket affects performance



# Multicore performance tools: Probing performance behavior

likwid-perfctr

### Basic approach to performance analysis



- Runtime profile / Call graph (gprof)
- Instrument those parts which consume a significant part of runtime
- 3. Find performance signatures

### Possible signatures:

- Bandwidth saturation
- Instruction throughput limitation (real or language-induced)
- Latency impact (irregular data access, high branch ratio)
- Load imbalance
- ccNUMA issues (data access across ccNUMA domains)
- Pathologic cases (false cacheline sharing, expensive operations)

## **Probing performance behavior**



- How do we find out about the performance properties and requirements of a parallel code?
  - Profiling via advanced tools is often overkill
- A coarse overview is often sufficient
  - likwid-perfctr (similar to "perfex" on IRIX, "hpmcount" on AIX, "lipfpm" on Linux/Altix)
  - Simple end-to-end measurement of hardware performance metrics
  - "Marker" API for starting/stopping counters
  - Multiple measurement region support
  - Preconfigured and extensible metric groups, list with

```
likwid-perfctr -a
```

BRANCH: Branch prediction miss rate/ratio

CACHE: Data cache miss rate/ratio

CLOCK: Clock of cores

DATA: Load to store ratio

FLOPS\_DP: Double Precision MFlops/s FLOPS SP: Single Precision MFlops/s

FLOPS X87: X87 MFlops/s

L2: L2 cache bandwidth in MBytes/s L2CACHE: L2 cache miss rate/ratio L3: L3 cache bandwidth in MBytes/s L3CACHE: L3 cache miss rate/ratio

MEM: Main memory bandwidth in MBytes/s

TLB: TLB miss rate/ratio

### Example usage with preconfigured metric group



\$ env OMP NUM THREADS=4 likwid-perfctr -C N:0-3 -g FLOPS DP ./stream.exe CPU type: Intel Core Lynnfield processor CPU clock: 2.93 GHz **Configured metrics Always** Measuring group FLOPS DP (this group) measured YOUR PROGRAM OUTPUT Event core 0 core 1 core 2 core 3 INSTR RETIRED ANY 1.97463e+08 | 2.31001e+08 | 2.30963e+08 | 2.31885e+08 CPU CLK UNHALTED CORE 9.56999e+08 l 9.58401e+08 | 9.58637e+08 I 9.57338e+08 I FF COMP OF EYE SSE FF PACKED 4.00294e+07 l 3.08927e+07 | 3.08866e+07 | 3.08904e+07 FP COMP OPS EXE SSE FP SCALAR 882 FP COMP OPS EXE SSE SINGLE PRECISION FR COMP OPS EXE SSE DOUBLE PRECISION 4.00303e+07 | 3.08927e+07 | 3.08866e+07 | 3.08904e+07 core 0 Metric Runtime [s] 0.326242 0.326801 | 0.326358 **Derived** CPI 4.84647 4.15061 1 4.12849 | 4.14891 | DP MFlops/s (DP assumed) | 245.399 189.024 | 189.108 | 189.304 metrics Packed MUOPS/s 122.698 94.5121 94.6519 Scalar MUOPS/s 0.00270351 SP MUOPS/s DP MUOPS/s 94.5121 122.701

### Best practices for runtime counter analysis



# Things to look at (in roughly this order)

- Load balance (flops, instructions, BW)
- In-socket memory BW saturation
- Shared cache BW saturation
- Flop/s, loads and stores per flop metrics
- SIMD vectorization
- CPI metric
- # of instructions, branches, mispredicted branches

#### **Caveats**

- Load imbalance may not show in CPI or # of instructions
  - Spin loops in OpenMP barriers/MPI blocking calls
  - Looking at "top" or the Windows Task
     Manager does not tell you anything useful
- In-socket performance saturation may have various reasons
- Cache miss metrics are overrated
  - If I really know my code, I can often calculate the misses
  - Runtime and resource utilization is much more important

### Identify load imbalance...



- Instructions retired / CPI may not be a good indication of useful workload – at least for numerical / FP intensive codes....
- Floating Point Operations Executed is often a better indicator
- Waiting / "Spinning" in barrier generates a high instruction count



### ... and load-balanced codes



env OMP NUM THREADS=6 likwid-perfctr -C S0:0-5 -g FLOPS DP ./a.out

| 1                                                                                                                                                                                                  | L                                                                     |                                                                              | L                                                                                          | L                                                                                           |                                                  |                                                               |                                                                                            | - 4                |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------|------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------|--------------------------------------------------|---------------------------------------------------------------|--------------------------------------------------------------------------------------------|--------------------|
| Event                                                                                                                                                                                              | core 0                                                                | core 1                                                                       | core 2                                                                                     | core 3                                                                                      | cor                                              | e 4                                                           | core 5                                                                                     | ļ                  |
| INSTR_RETIRED_ANY CPU_CLK_UNHALTED_CORE CPU_CLK_UNHALTED_REF FP_COMP_OPS_EXE_SSE_FP_PACKED FP_COMP_OPS_EXE_SSE_FP_SCALAR FP_COMP_OPS_EXE_SSE_SINGLE_PRECISION FP_COMP_OPS_EXE_SSE_DOUBLE_PRECISION | 2.24797e+10  <br>  2.04416e+10  <br>  3.45348e+09  <br>  2.93108e+07  | 1.74784e+10<br>2.23789e+10<br>2.03445e+10<br>3.43035e+09<br>3.06063e+07<br>0 | 1.68453e+10<br>2.23802e+10<br>2.03456e+10<br>3.37573e+09<br>2.9704e+07<br>0<br>3.40543e+09 | 1.66794e+10<br>2.23808e+10<br>2.03462e+10<br>3.39272e+00<br>2.96507e+00<br>0<br>3.42237e+00 | 0   2.237<br>0   2.034<br>0   3.261<br>7   2.411 | 85e+10  <br>99e+10  <br>53e+10  <br>32e+09  <br>41e+07  <br>0 | 1.91736e+10<br>2.23805e+10<br>2.03459e+10<br>3.2377e+09<br>2.37397e+07<br>0<br>3.26144e+09 | )  <br>            |
| Higher CPI but better performance                                                                                                                                                                  | Metric<br>  Metric<br>  Runtime [s]<br>  Clock [MHz]                  | core 0<br>  8.4293<br>  2932.7                                               | 8   8.39157<br>3   2933.5                                                                  | <br>  8.39206  <br>  2933.51                                                                | core 3<br>8.3923<br>2933.51                      | core 4<br>  core 4<br>  8.39193<br>  2933.51                  | . i 2933.51 i                                                                              | +<br> <br> -<br> - |
| !\$OMP PARALLEL DO DO I = 1, N                                                                                                                                                                     | DP MFlops/s   Packed MUOPS/   Scalar MUOPS/   SP MUOPS/s   DP MUOPS/s | 850.72<br>s 423.56                                                           | 7   845.212<br>6   420.729<br>4   3.75383<br>-06   0                                       | 831.703  <br>  414.03  <br>  3.64317  <br>  0                                               | 835.865<br>416.114<br>3.63663<br>0<br>419.751    | 802.952<br>  399.997<br>  2.95757<br>  0<br>  402.955         | 797.113  <br>  397.101  <br>  2.91165  <br>  0                                             |                    |

DO J = 1, N

x(I) = x(I) + A(J,I) \* y(J)

**ENDDO** 

**ENDDO** 

!\$OMP END PARALLEL DO

### **Detecting latency-bound codes**

Example: graph and tree data structures



| Metric                    | Red-Black tree | Optimized data structure |
|---------------------------|----------------|--------------------------|
| Instructions retired      | 1.34268e+11    | 1.28581e+11              |
| CPI                       | 9.0176         | 0.71887                  |
| L3-MEM data volume [GB]   | 301            | 3.22                     |
| TLB misses                | 3.71447e+09    | 4077                     |
| Branch rate               | 36%            | 8.5%                     |
| Branch mispredicted ratio | 7.8%           | 0.000013%                |
| Memory bandwidth [GB/s]   | 10.5           | 1.1                      |

Useful likwid-perfctr groups: L3, L3CACHE, MEM, TLB, BRANCH

High CPI, near perfect scaling if using SMT threads (Intel).

Note: Latency bound code can still produce significant aggregated bandwidth.

## Language-induced problems



- The object-oriented programming paradigm implements functionality resulting in many calls to small functions
- The ability of the compiler to inline functions (and still generate the best possible machine code) is limited
- Frequent pattern with complex C++ codes

### Symptoms:

- Low ("good") CPI
- Low resource utilization (Flops/s, bandwidth)
- Orders of magnitude more general purpose than arithmetic floating point instructions
- High branch rate

### Solution:

- Use basic data types and plain arrays in compute intensive loops
- Use plain C-like code
- Keep things simple do not obstruct the compiler's view on the code



# Microbenchmarking for architectural exploration

The vector triad Serial, throughput, and parallel benchmarks

### The parallel vector triad benchmark

A "swiss army knife" for microbenchmarking



### Simple streaming benchmark:

```
double precision, dimension(N) :: A,B,C,D
A=1.d0; B=A; C=A; D=A

do j=1,NITER
    do i=1,N
        A(i) = B(i) + C(i) * D(i)
    enddo
    if(.something.that.is.never.true.) then
        call dummy(A,B,C,D)
    endif
enddo
```

- Report performance for different N
- Choose NITER so that accurate time measurement is possible
- This kernel is limited by data transfer performance for all memory levels on all current architectures!

### A(:)=B(:)+C(:)\*D(:) on one Sandy Bridge core (3 GHz)





### A(:)=B(:)+C(:)\*D(:) on one Sandy Bridge core (3 GHz)





### The throughput-parallel vector triad benchmark



Every core runs its own, independent triad benchmark

```
double precision, dimension(:), allocatable :: A,B,C,D
!$OMP PARALLEL private(i,j,A,B,C,D)
allocate (A(1:N), B(1:N), C(1:N), D(1:N))
A=1.d0; B=A; C=A; D=A
do j=1, NITER
  do i=1,N
    A(i) = B(i) + C(i) * D(i)
  enddo
  if (.something.that.is.never.true.) then
    call dummy (A,B,C,D)
  endif
enddo
!$OMP END PARALLEL
```

pure hardware probing, no impact from OpenMP overhead

### Throughput vector triad on Sandy Bridge socket (3 GHz)





### The OpenMP-parallel vector triad benchmark



OpenMP work sharing in the benchmark loop

```
double precision, dimension(:), allocatable :: A,B,C,D
allocate (A(1:N), B(1:N), C(1:N), D(1:N))
A=1.d0; B=A; C=A; D=A
!$OMP PARALLEL private(i,j)
do j=1,NITER
!$OMP DO
  do i=1,N
    A(i) = B(i) + C(i) * D(i)
  enddo
                          Implicit barrier
!SOMP END DO
  if (.something.that.is.never.true.) then
    call dummy (A,B,C,D)
  endif
enddo
!$OMP END PARALLEL
```

### OpenMP vector triad on Sandy Bridge socket (3 GHz)







# OpenMP performance issues on multicore

Synchronization (barrier) overhead

### Welcome to the multi-/many-core era

Synchronization of threads may be expensive!



!\$OMP PARALLEL ...
...
!\$OMP BARRIER

!\$OMP DO

•••

!\$OMP ENDDO

!\$OMP END PARALLEL

Threads are synchronized at **explicit** AND **implicit** barriers. These are a main source of overhead in OpenMP progams.

Determine costs via modified OpenMP

Microbenchmarks testcase (epcc)

### On x86 systems there is no hardware support for synchronization!

- Next slide: Test OpenMP Barrier performance...
- for different compilers
- and different topologies:
  - shared cache
  - shared socket
  - between sockets
- and different thread counts
  - 2 threads
  - full domain (chip, socket, node)

## Thread synchronization overhead on Interlagos

Barrier overhead in CPU cycles



| 2 Threads    | Cray 8.03 | GCC 4.6.2   | PGI 11.8    | Intel 12.1.3 |
|--------------|-----------|-------------|-------------|--------------|
| Shared L2    | 258       | 3995        | 1503        | 128623       |
| Shared L3    | 698       | 2853        | 1076        | 128611       |
| Same socket  | 879       | 2785        | 1297        | 128695       |
| Other socket | 940       | 2740 / 4222 | 1284 / 1325 | 128718       |



Intel compiler barrier very expensive on Interlagos

OpenMP & Cray compiler •••



| Full domain | Cray 8.03 | GCC 4.6.2 | PGI 11.8 | Intel 12.1.3 |
|-------------|-----------|-----------|----------|--------------|
| Shared L3   | 2272      | 27916     | 5981     | 151939       |
| Socket      | 3783      | 49947     | 7479     | 163561       |
| Node        | 7663      | 167646    | 9526     | 178892       |

# Thread synchronization overhead on SandyBridge-EP Barrier overhead in CPU cycles



| 2 Threads    | Intel 13.1.0 | GCC 4.7.0 | GCC 4.6.1 |
|--------------|--------------|-----------|-----------|
| Shared L3    | 384          | 5242      | 4616      |
| SMT threads  | 2509         | 3726      | 3399      |
| Other socket | 1375         | 5959      | 4909      |





| Full domain | Intel 13.1.0 | GCC 4.7.0 | GCC 4.6.1 |
|-------------|--------------|-----------|-----------|
| Socket      | 1497         | 14546     | 14418     |
| Node        | 3401         | 34667     | 29788     |
| Node SMT    | 6881         | 59038     | 58898     |

### Thread synchronization overhead on Intel Xeon Phi

Barrier overhead in CPU cycles



2 threads on distinct cores: 1936

| 36        | SMT1  | SMT2  | SMT3  | SMT4  |
|-----------|-------|-------|-------|-------|
| One core  | n/a   | 1597  | 2825  | 3557  |
| Full chip | 10604 | 12800 | 15573 | 18490 |

That does not look bad for 240 threads!

Still the pain may be much larger, as more work can be done in one cycle on Phi compared to a full Sandy Bridge node

- 3.75 x cores (16 vs 60) on Phi
- 2 x more operations per cycle on Phi
- 2.7 x more barrier penalty (cycles) on Phi



7.5 x more work done on Xeon Phi per cycle

One barrier causes 2.7 x 7.5 = 20x more pain ©.



# Simple performance modeling: The Roofline Model

### The Roofline Model<sup>1,2</sup>



- 1.  $P_{\text{max}}$  = Applicable peak performance of a loop, assuming that data comes from L1 cache
- 2. I = Computational intensity ("work" per byte transferred) over the slowest data path utilized ("the bottleneck")
  - Code balance  $B_C = I^{-1}$
- 3.  $b_S = Applicable peak bandwidth of the slowest data path utilized$

### **Expected performance:**

$$P = \min(P_{\max}, I \cdot b_S)$$

<sup>1</sup>W. Schönauer: Scientific Supercomputing: Architecture and Use of Shared and Distributed Memory Parallel Computers. (2000)

<sup>&</sup>lt;sup>2</sup> S. Williams: <u>Auto-tuning Performance on Multicore Computers</u>. UCB Technical Report No. UCB/EECS-2008-164. PhD thesis (2008)

## A simple Roofline example



Example: do i=1,N; s=s+a(i); enddo in double precision on hypothetical 3 GHz CPU, 4-way SIMD, N large



### Applicable peak for the summation loop



### Plain scalar code, no SIMD

```
LOAD r1.0 ← 0

i ← 1

loop:

LOAD r2.0 ← a(i)

ADD r1.0 ← r1.0+r2.0

++i →? loop

result ← r1.0
```



### Applicable peak for the summation loop



### Scalar code, 4-way unrolling

```
LOAD r1.0 \leftarrow 0
LOAD r2.0 \leftarrow 0
LOAD r3.0 \leftarrow 0
LOAD r4.0 \leftarrow 0
i ← 1
loop:
  LOAD r5.0 \leftarrow a(i)
  LOAD r6.0 \leftarrow a(i+1)
  LOAD r7.0 \leftarrow a(i+2)
  LOAD r8.0 \leftarrow a(i+3)
  ADD r1.0 \leftarrow r1.0+r5.0
  ADD r2.0 \leftarrow r2.0+r6.0
  ADD r3.0 \leftarrow r3.0+r7.0
  ADD r4.0 \leftarrow r4.0+r8.0
   i+=4 \rightarrow ? loop
result \leftarrow r1.0+r2.0+r3.0+r4.0
```

#### **ADD** pipes utilization:



→ 1/4 of ADD peak

### Applicable peak for the summation loop

result \( \tau \) r1.0+r1.1+...+r4.2+r4.3



### SIMD-vectorized, 4-way unrolled **ADD** pipes utilization: LOAD $[r1.0,...,r1.3] \leftarrow [0,0]$ LOAD $[r2.0,...,r2.3] \leftarrow [0,0]$ LOAD [r3.0,...,r3.3] $\leftarrow$ [0,0] LOAD $[r4.0,...,r4.3] \leftarrow [0,0]$ i ← 1 → ADD peak loop: LOAD $[r5.0,...,r5.3] \leftarrow [a(i),...,a(i+3)]$ LOAD $[r6.0, ..., r6.3] \leftarrow [a(i+4), ..., a(i+7)]$ LOAD $[r7.0, ..., r7.3] \leftarrow [a(i+8), ..., a(i+11)]$ LOAD [r8.0,...,r8.3] $\leftarrow$ [a(i+12),...,a(i+15)] ADD r1 $\leftarrow$ r1+r5 ADD r2 $\leftarrow$ r2+r6 ADD r3 ← r3+r7 ADD r4 $\leftarrow$ r4+r8

 $i+=16 \rightarrow ? loop$ 

### Input to the roofline model



... on the example of do i=1,N; s=s+a(i); enddo



### A very bandwidth-bound kernel



### Example: Vector triad A(:)=B(:)+C(:)\*D(:) on 2.3 GHz Interlagos

- $b_{\rm S} = 34 \, {\rm GB/s}$
- B<sub>c</sub> = (4+1) Words / 2 Flops = 2.5 W/F (including write allocate)

$$\rightarrow$$
 / = 0.4 F/W = 0.05 F/B

### Lightspeed:

 $l \cdot b_S = 1.7$  GF/s (1.2 % of peak performance)

### **Assumptions for the Roofline Model**



- The balance metric formalism is based on some (crucial) assumptions:
  - There is a clear concept of "work" vs. "traffic"
    - "work" = flops, updates, iterations...
    - "traffic" = required data to do "work"
  - Attainable bandwidth of code = input parameter! Determine effective bandwidth via simple streaming benchmarks to model more complex kernels and applications
  - Data transfer and core execution overlap perfectly!
  - Slowest data path is modeled only; all others are assumed to be infinitely fast
  - If data transfer is the limiting factor, the bandwidth of the slowest data path can be utilized to 100% ("saturation")
  - Latency effects are ignored, i.e. perfect streaming mode

### Factors to consider in the roofline model



### **Bandwidth-bound (simple case)**

- Accurate traffic calculation (writeallocate, strided access, ...)
- Practical ≠ theoretical BW limits
- Erratic access patterns



### **Core-bound (may be complex)**

- Multiple bottlenecks: LD/ST, arithmetic, pipelines, SIMD, execution ports
- See next slide...



### **Complexities of in-core execution**



### Multiple bottlenecks:

- L1 Icache bandwidth
- Decode/retirement throughput
- Port contention (direct or indirect)
- Arithmetic pipeline stalls (dependencies)
- Overall pipeline stalls (branching)
- L1 Dcache bandwidth (LD/ST throughput)
- Scalar vs. SIMD execution
- · ..
- Register pressure
- Alignment issues



### Shortcomings of the roofline model



### Saturation effects in multicore chips are not explained

- Reason: "saturation assumption"
- Cache line transfers and core execution do sometimes not overlap perfectly
- Only increased "pressure" on the memory interface can saturate the bus
   → need more cores!

### ECM model gives more insight

G. Hager, J. Treibig, J. Habich, and G. Wellein: Exploring performance and power properties of modern multicore chips via simple machine models. Concurrency and Computation: Practice and Experience, DOI: 10.1002/cpe.3180 (2013).





## Optimal utilization of parallel resources

Hardware-software interaction SIMD parallelism

### **Computer Architecture**

The evil of hardware optimizations





Flexible, but optimization is hard!

Architect's view:

Make the common case fast!



**EDSAC 1949** 



**ENIAC 1948** 

Provide improvements for relevant software What are the technical opportunities? Economical concerns Multi-way special purpose



What is your relevant aspect of the architecture?

## **Hardware- Software Co-Design?**

From algorithm to execution



Reality:

**Algorithm** 

**Programming language** 



Hardware = Black Box

## The machine view:

ISA (Machine code)





#### **Basic Resources**

## Instruction throughput and data movement



75

#### 1. Instruction execution

This is the primary resource of the processor. All efforts in hardware design are targeted towards increasing the instruction throughput.

#### 2. Data transfer bandwidth

Data transfers are a consequence of instruction execution and therefore a secondary resource. Maximum bandwidth is determined by the request rate of executed instructions and technical limitations (bus width, speed).

**Real machine:** Processors are imperfect and have technical limitations. This results in hazards preventing to fully exploit the elementary resources.

# Things to remember



## **Goals for optimization:**

- 1. Map your work to an instruction mix with highest throughput using the most effective instructions.
- 2. Reduce data volume over slow data paths fully utilizing available bandwidth.
- 3. Avoid possible hazards/overhead which prevent reaching goals one and two.



# Coding for SingleInstructionMultipleData-processing



- Single Instruction Multiple Data (SIMD) operations allow the concurrent execution of the same operation on "wide" registers.
- x86 SIMD instruction sets:
  - SSE: register width = 128 Bit → 2 double precision floating point operands
  - AVX: register width = 256 Bit → 4 double precision floating point operands
- Adding two registers holding double precision floating point operands





Steps (done by the compiler) for "SIMD processing"

```
for(int i=0; i<n;i++)
C[i]=A[i]+B[i];
```

"Loop unrolling"

```
for(int i=0; i<n;i+=4) {
        C[i] =A[i] +B[i];
        C[i+1]=A[i+1]+B[i+1];
        C[i+2]=A[i+2]+B[i+2];
        C[i+3]=A[i+3]+B[i+3];}
//remainder loop omitted</pre>
```

Load 256 Bits starting from address of A[i] to register R0

Add the corresponding 64 Bit entries in R0 and R1 and store the 4 results to R2

Store R2 (256 Bit) to address starting at C[i]

```
LABEL1:

VLOAD R0 ← A[i]

VLOAD R1 ← B[i]

V64ADD[R0,R1] → R2

VSTORE R2 → C[i]

i←i+4

i<(n-4)? JMP LABEL1

//remainder loop omitted
```



No SIMD-processing for loops with data dependencies

```
for(int i=0; i<n; i++)
A[i]=A[i-1]*s;
```

"Pointer aliasing" may prevent compiler from SIMD-processing

```
void scale_shift(double *A, double *B, double *C, int n) {
    for(int i=0; i<n; ++i)
        C[i] = A[i] + B[i];
}</pre>
```

- C/C++ allows that  $A \rightarrow \&C[-1]$  and  $B \rightarrow \&C[-2]$  $\rightarrow$  C[i] = C[i-1] + C[i-2]: dependency  $\rightarrow$  No SIMD-processing
- If no "Pointer aliasing" is used, tell the compiler, e.g. use
  -fno-alias switch for Intel compiler → SIMD-processing



## SIMD processing of a vector norm

```
s=0.0;
for(int i=0; i<n; i++)
    s = s + A[i]*A[i];</pre>
```

Data dependency on s must be resolved for SIMD-processing



```
s0=0.0;
s1=0.0;
s2=0.0;
s3=0.0;
for(int i=0; i<n; i+=4) {
    s0 = s0+ A[i] *A[i];
    s1 = s1+ A[i+1] *A[i+1];
    s2 = s2+ A[i+2] *A[i+2];
    s3 = s3+ A[i+3] *A[i+3];
}
    R0    R1    R2
//remainder omitted
s=s0+s1+s2+s3
```

Compiler does transformation -

if programmer allows it to do so! (-03 instead of -01)

```
...
V64MULT(R1,R2) → R1
V64ADD(R0,R1) → R0
...
```



# Reading x86 assembly code

## Basic approach to check the instruction code



Get the assembler code (Intel compiler):

```
icc -S -O3 -xHost triad.c -o triad.s
```

Disassemble Executable:

```
objdump -d ./cacheBench | less
```

- Things to check for:
  - Is the code vectorized? Search for pd/ps suffix.

```
mulpd, addpd, vaddpd, vmulpd
```

Is the data loaded with 16 byte moves?

```
movapd, movaps, vmovupd
```

For memory-bound code: Search for nontemporal stores:

```
movntpd, movntps
```

#### The x86 ISA is documented in:

Intel Software Development Manual (SDM) 2A and 2B AMD64 Architecture Programmer's Manual Vol. 1-5

#### Basics of the x86-64 ISA



- Instructions have 0 to 2 operands
- Operands can be registers, memory references or immediates
- Opcodes (binary representation of instructions) vary from 1 to 17 bytes
- There are two syntax forms: Intel (left) and AT&T (right)
- Addressing Mode: BASE + INDEX \* SCALE + DISPLACEMENT
- C: A[i] equivalent to \*(A+i) (a pointer has a type: A+i\*8)

```
movaps [rdi + rax*8+48], xmm3
add rax, 8
js 1b
```

```
movaps %xmm4, 48(%rdi,%rax,8)
addq $8, %rax
js ..B1.4
```

```
      401b9f: 0f 29 5c c7 30
      movaps %xmm3,0x30(%rdi,%rax,8)

      401ba4: 48 83 c0 08
      add $0x8,%rax

      401ba8: 78 a6
      js 401b50 <triad_asm+0x4b>
```

#### Basics of the x86-64 ISA II



```
16 general Purpose Registers (64bit):
```

```
rax, rbx, rcx, rdx, rsi, rdi, rsp, rbp, r8-r15
alias with eight 32 bit register set:
eax, ebx, ecx, edx, esi, edi, esp, ebp
```

#### Floating Point SIMD Registers:

```
xmm0-xmm15 SSE (128bit) alias with 256bit registers ymm0-ymm15 AVX (256bit)
```

#### SIMD instructions are distinguished by:

AVX (VEX) prefix: v

Operation: mul, add, mov

Modifier: non temporal (nt), unaligned (u), aligned (a), high (h)

Data type: single (s), double (d)

#### Basics of x86-64 ABI



- Regulations how functions are called on binary level
- Differs between 32 bit / 64 bit and Operating Systems

#### **x86-64 on Linux:**

- Integer or address parameters are passed in the order:
  rdi, rsi, rdx, rcx, r8, r9
- Floating Point parameters are passed in the order xmm0-xmm7
- Registers which must be preserved across function calls: rbx, rbp, r12-r15
- Return values are passed in rax/rdx and xmm0/xmm1

# **Case Study: summation**



```
float sum = 0.0;

for (int j=0; j<size; j++) {
    sum += data[j];
}</pre>
```

To get object code use objdump -d on object file or executable or compile with -s

#### **Instruction code:**

```
401d08: f3 0f 58 04 82
401d0d: 48 83 c0 01
401d11: 39 c7
401d13: 77 f3

Instruction address

Opcodes
```

```
addss (%rdx,%rax,4),%xmm0
add $0x1,%rax
cmp %eax,%edi
ja 401d08

Assembly code
```

## **Summation code variants**



```
1:
        xmm0, [rsi + rax * 4]
addss
                                      1:
add
        rax, 1
                     3 cycles add
                                      addps xmm0, [rsi + rax * 4]
       eax,edi
cmp
                     pipeline
                     latency
                                      addps xmm1, [rsi + rax * 4 + 16]
js 1b
                                      addps xmm2, [rsi + rax * 4 + 32]
     Unrolling with sub sums to break up
                                      addps xmm3, [rsi + rax * 4 + 48]
     register dependency
                                      add rax, 16
                                           eax,edi
                                       cmp
1:
                                       js 1b
addss xmm0, [rsi + rax * 4]
                                     SSE SIMD vectorization
addss xmm1, [rsi + rax * 4 + 4]
addss xmm2, [rsi + rax * 4 + 8]
addss xmm3, [rsi + rax * 4 + 12]
add
      rax, 4
cmp eax,edi
js 1b
```

# SIMD-processing – Sequential





## SIMD-processing – Full chip (all cores)

Influence of SMT



Bandwidth saturation is the primary performance limitation on the chip level!

Full scaling using SMT due to bubbles in pipeline

Conclusion: If the code saturates the bottleneck, all variants are acceptable!

All variants saturate the memory bandwidth



## **How to leverage SIMD**



- The compiler does it for you (aliasing, alignment, language)
- Compiler directives (pragmas)
- Alternative programming models for compute kernels (OpenCL, ispc)
- Intrinsics (restricted to C/C++)
- Implement directly in assembler

To use intrinsics the following headers are available. To enable instruction sets often additional flags are necessary:

- xmmintrin.h (SSE)
- pmmintrin.h (SSE2)
- immintrin.h (AVX)
- \*86intrin.h (all instruction set extensions)
- See next slide for an example

## **Example: array summation using C intrinsics**



```
m128 sum0, sum1, sum2, sum3;
                                   sum0 = mm add ps(sum0, sum1);
m128 t0, t1, t2, t3;
                                   sum0 = mm add ps(sum0, sum2);
float scalar sum;
                                    sum0 = mm add ps(sum0, sum3);
sum0 = mm setzero ps();
                                   sum0 = mm hadd ps(sum0, sum0);
sum1 = mm setzero ps();
                                    sum0 = mm hadd ps(sum0, sum0);
sum2 = mm setzero ps();
                                   mm store ss(&scalar sum, sum0);
sum3 = mm setzero ps();
for (int j=0; j<size; j+=16) {
   t0 = mm loadu ps(data+j);
   t1 = mm loadu ps(data+j+4);
   t2 = mm loadu ps(data+j+8);
   t3 = mm loadu ps(data+j+12);
   sum0 = mm add ps(sum0, t0);
   sum1 = mm add ps(sum1, t1);
   sum2 = mm add ps(sum2, t2);
   sum3 = mm add ps(sum3, t3);
```

#### **Example: array summation from intrinsics, instruction code**



```
14:
      0f 57 c9
                                       %xmm1,%xmm1
                                xorps
17:
      31 c0
                                xor
                                       %eax,%eax
19:
      0f 28 d1
                                movaps %xmm1, %xmm2
      0f 28 c1
                                movaps %xmm1, %xmm0
1c:
      0f 28 d9
                                movaps %xmm1, %xmm3
1f:
22:
      66 Of 1f 44 00 00
                                       0x0(%rax,%rax,1)
                                nopw
28:
      0f 10 3e
                                movups (%rsi),%xmm7
      Of 10 76 10
                                movups 0x10(%rsi),%xmm6
2b:
2f:
      Of 10 6e 20
                                movups 0x20(%rsi),%xmm5
      Of 10 66 30
                                movups 0x30(%rsi),%xmm4
33:
      83 c0 10
                                add
                                       $0x10, %eax
37:
3a:
      48 83 c6 40
                                add
                                       $0x40,%rsi
      0f 58 df
                                addps %xmm7,%xmm3
3e:
      0f 58 c6
                                addps %xmm6,%xmm0
41:
                                addps %xmm5, %xmm2
      0f 58 d5
44:
47:
      0f 58 cc
                                addps %xmm4,%xmm1
      39 c7
                                       %eax,%edi
4a:
                                cmp
      77 da
                                       28 <compute sum SSE+0x18>
4c:
                                jа
                                                                          Loop body
      0f 58 c3
                                addps
                                       %xmm3,%xmm0
<u>4e:</u>
      0f 58 c2
51:
                                addps
                                       %xmm2,%xmm0
      0f 58 c1
                                addps
                                       %xmm1,%xmm0
54:
57:
      f2 Of 7c c0
                                haddps %xmm0, %xmm0
      f2 Of 7c c0
5b:
                                haddps %xmm0, %xmm0
5f:
      с3
                                retq
```

# Vectorization and the Intel compiler



- Intel compiler will try to use SIMD instructions when enabled to do so
  - "Poor man's vector computing"
  - Compiler will emit messages about vectorized loops:

```
plain.c(11): (col. 9) remark: LOOP WAS VECTORIZED.
```

- Use option -vec\_report3 to get full compiler output about which loops were vectorized and which were not and why (data dependencies!)
- Some obstructions will prevent the compiler from applying vectorization even if it is possible
- You can use source code directives to provide more information to the compiler

## **Vectorization compiler options**



- The compiler will vectorize starting with -02.
- To enable specific SIMD extensions use the –x option:
  - -xSSE2 vectorize for SSE2 capable machines Available SIMD extensions: SSE2, SSE3, SSE4.1, SSE4.2, AVX
  - -xAVX on Sandy Bridge processors

#### Recommend option:

-xHost will optimize for the architecture you compile on

On AMD Opteron: use plain -o3 as the -x options may involve CPU type checks.

# **Vectorization compiler options**



### Controlling non-temporal stores

-opt-streaming-stores always|auto|never

**always** use NT stores, assume application is memory

bound (use with caution!)

auto compiler decides when to use NT stores

**never** do not use NT stores unless activated by

source code directive

## Rules for vectorizable loops



- 1. Countable
- 2. Single entry and single exit
- 3. Straight line code
- 4. No function calls (exception intrinsic math functions)

## **Better performance with:**

- 1. Simple inner loops with unit stride
- 2. Minimize indirect addressing
- 3. Align data structures (SSE 16 bytes, AVX 32 bytes)
- 4. In C use the restrict keyword for pointers to rule out aliasing

#### **Obstacles for vectorization:**

- Non-contiguous memory access
- Data dependencies

## **Vectorization source code directives**



- Fine-grained control of loop vectorization
- Use !DEC\$ (Fortran) or #pragma (C/C++) sentinel to start a compiler directive
- #pragma vector always vectorize even if it seems inefficient (hint!)
- #pragma novector do not vectorize even if possible
- #pragma vector nontemporal use NT stores when allowed (i.e. alignment conditions are met)
- #pragma vector aligned specifies that all array accesses are aligned to 16-byte boundaries (DANGEROUS! You must not lie about this!)

### **User mandated vectorization**



- Starting with Intel Compiler 12.0 the simd pragma is available
- #pragma simd enforces vectorization where the other pragmas fail
- Prerequesites:
  - Countable loop
  - Innermost loop
  - Must conform to for-loop style of OpenMP worksharing constructs
- There are additional clauses: reduction, vectorlength, private
- Refer to the compiler manual for further details
- NOTE: Using the #pragma simd the compiler may generate incorrect code if the loop violates the vectorization rules!

```
#pragma simd reduction(+:x)
  for (int i=0; i<n; i++) {
      x = x + A[i];
  }</pre>
```

## **x86 Architecture:**

## SIMD and Alignment



#### Alignment issues

- Alignment of arrays in SSE calculations should be on 16-byte boundaries to allow packed loads and NT stores (for Intel processors)
  - AMD has a scalar nontemporal store instruction
- Otherwise the compiler will revert to unaligned loads and not use NT stores – even if you say vector nontemporal
- How is manual alignment accomplished?
- Dynamic allocation of aligned memory (align = alignment boundary):



# Efficient parallel programming on ccNUMA nodes

Performance characteristics of ccNUMA nodes
First touch placement policy
C++ issues
ccNUMA locality and dynamic scheduling
ccNUMA locality beyond first touch

## **ccNUMA** performance problems

"The other affinity" to care about



#### ccNUMA:

- Whole memory is transparently accessible by all processors
- but physically distributed
- with varying bandwidth and latency
- and potential contention (shared memory paths)
- How do we make sure that memory access is always as "local" and "distributed" as possible?



 Page placement is implemented in units of OS pages (often 4kB, possibly more)

### **Cray XE6 Interlagos node**

4 chips, two sockets, 8 threads per ccNUMA domain



## ccNUMA map: Bandwidth penalties for remote access

- Run 8 threads per ccNUMA domain (1 chip)
- Place memory in different domain → 4x4 combinations
- STREAM triad benchmark using nontemporal stores





#### ccNUMA locality tool numactl:

How do we enforce some locality of access?



numactl can influence the way a binary maps its memory pages:

```
numactl --membind=<nodes> a.out # map pages only on <nodes>
--preferred=<node> a.out # map pages on <node>
# and others if <node> is full
--interleave=<nodes> a.out # map pages round robin across
# all <nodes>
```

#### Examples:

But what is the default without numactl?

# ccNUMA default memory locality



"Golden Rule" of ccNUMA:

# A memory page gets mapped into the local memory of the processor that first touches it!

- Except if there is not enough local memory available
- This might be a problem, see later
- Caveat: "touch" means "write", not "allocate"
- Example:

Memory not mapped here yet

It is sufficient to touch a single item to map the entire page

# **Coding for ccNUMA data locality**



## Most simple case: explicit initialization

```
integer, parameter :: N=10000000
double precision A(N), B(N)
A=0.d0
!$OMP parallel do
do i = 1, N
 B(i) = function (A(i))
end do
!$OMP end parallel do
```

```
integer, parameter :: N=10000000
double precision A(N),B(N)
!$OMP parallel
!$OMP do schedule(static)
do i = 1, N
 A(i) = 0.d0
end do
!$OMP end do
!$OMP do schedule(static)
do i = 1, N
 B(i) = function (A(i))
end do
!$OMP end do
!$OMP end parallel
```

# Coding for ccNUMA data locality



 Sometimes initialization is not so obvious: I/O cannot be easily parallelized, so "localize" arrays before I/O



```
integer, parameter :: N=10000000
double precision A(N),B(N)
!$OMP parallel
!$OMP do schedule(static)
do i = 1, N
 A(i) = 0.d0
end do
!$OMP end do
!$OMP single
READ (1000) A
!$OMP end single
!$OMP do schedule(static)
do i = 1, N
 B(i) = function (A(i))
end do
!$OMP end do
!$OMP end parallel
```

# **Coding for Data Locality**



- Required condition: OpenMP loop schedule of initialization must be the same as in all computational loops
  - Only choice: static! Specify explicitly on all NUMA-sensitive loops, just to be sure...
  - Imposes some constraints on possible optimizations (e.g. load balancing)
  - Presupposes that all worksharing loops with the same loop length have the same thread-chunk mapping
    - Guaranteed by OpenMP 3.0 only for loops in the same enclosing parallel region and static schedule
    - In practice, it works with any compiler even across regions
  - If dynamic scheduling/tasking is unavoidable, more advanced methods may be in order
- How about global objects?
  - Better not use them
  - If communication vs. computation is favorable, might consider properly placed copies of global data
  - In C++, STL allocators provide an elegant solution (see hidden slides)

# **Diagnosing Bad Locality**



- If your code is cache-bound, you might not notice any locality problems
- Otherwise, bad locality limits scalability at very low CPU numbers (whenever a node boundary is crossed)
  - If the code makes good use of the memory interface
  - But there may also be a general problem in your code...
- Consider using performance counters
  - LIKWID-perfctr can be used to measure nonlocal memory accesses
  - Example for Intel Nehalem (Core i7):

```
env OMP NUM THREADS=8 likwid-perfctr -g MEM -C N:0-7 ./a.out
```

# Using performance counters for diagnosing bad ccNUMA access locality



#### Intel Nehalem EP node:



#### If all fails...



- Even if all placement rules have been carefully observed, you may still see nonlocal memory traffic. Reasons?
  - Program has erratic access patters → may still achieve some access parallelism (see later)
  - OS has filled memory with buffer cache data:

# numactl --hardware # idle node!

```
available: 2 nodes (0-1)
node 0 size: 2047 MB
node 0 free: 906 MB
node 1 size: 1935 MB
node 1 free: 1798 MB

top - 14:18:25 up 92 days, 6:07, 2 users, load average: 0.00, 0.02, 0.00
Mem: 4065564k total, 1149400k used, 2716164k free, 43388k buffers
Swap: 2104504k total, 2656k used, 2101848k free, 1038412k cached
```

#### ccNUMA problems beyond first touch:

#### Buffer cache



# OS uses part of main memory for disk buffer (FS) cache

- If FS cache fills part of memory, apps will probably allocate from foreign domains
- → non-local access!
- "sync" is not sufficient to drop buffer cache blocks



#### Remedies

- Drop FS cache pages after user job has run (admin's job)
  - seems to be automatic after aprun has finished on Crays
- User can run "sweeper" code that allocates and touches all physical memory before starting the real application
- numact1 tool or aprun can force local allocation (where applicable)
- Linux: There is no way to limit the buffer cache size in standard kernels

#### ccNUMA problems beyond first touch:

Buffer cache



# Real-world example: ccNUMA and the Linux buffer cache Benchmark:

1. Write a file of some size from LD0 to disk

2. Perform bandwidth benchmark using all cores in LD0 and maximum memory available in LD0

Result: By default, Buffer cache is given priority over local page placement

→ restrict to local domain if possible!



#### ccNUMA placement and erratic access patterns



 Sometimes access patterns are just not nicely grouped into contiguous chunks:

```
double precision :: r, a(M)
!$OMP parallel do private(r)
do i=1,N
    call RANDOM_NUMBER(r)
    ind = int(r * M) + 1
    res(i) = res(i) + a(ind)
enddo
!OMP end parallel do
```

Or you have to use tasking/dynamic scheduling:

```
!$OMP parallel
!$OMP single
do i=1,N
   call RANDOM_NUMBER(r)
   if(r.le.0.5d0) then
!$OMP task
     call do_work_with(p(i))
!$OMP end task
   endif
enddo
!$OMP end single
!$OMP end parallel
```

In both cases page placement cannot easily be fixed for perfect parallel access

#### ccNUMA placement and erratic access patterns



- Worth a try: Interleave memory across ccNUMA domains to get at least some parallel access
  - 1. Explicit placement:

```
!$OMP parallel do schedule(static,512)
do i=1,M
   a(i) = ...
enddo
!$OMP end parallel do
```

Observe page alignment of array to get proper placement!

2. Using global control via numactl:

```
numactl --interleave=0-3 ./a.out
```

This is for all memory, not just the problematic arrays!

Fine-grained program-controlled placement via libnuma (Linux) using, e.g., numa\_alloc\_interleaved\_subset(), numa\_alloc\_interleaved() and others

#### The curse and blessing of interleaved placement:

OpenMP STREAM on a Cray XE6 Interlagos node



- Parallel init: Correct parallel initialization
- LD0: Force data into LD0 via numactl -m 0
- Interleaved: numactl --interleave <LD range>





# Case study: A 3D Jacobi smoother

The basics in two dimensions

Roofline performance analysis and modeling

#### A Jacobi smoother



- Laplace equation in 2D:  $\Delta \Phi = 0$ 

Solve with Dirichlet boundary conditions using Jacobi iteration scheme:

```
double precision, dimension(0:imax+1,0:kmax+1,0:1) :: phi
   integer :: t0,t1
   t0 = 0; t1 = 1
   do it = 1, itmax ! choose suitable number of sweeps
     do k = 1, kmax
                                                            Reuse when computing
       do i = 1, imax
                                                            phi(i+2,k,t1)
          ! four flops, one store, four loads
          phi(i,k,t1) = (phi(i+1,k,t0) + phi(i-1,k,t0)
                           + phi(i, k+1, t0) + phi(i, k-1, t0) ) \star 0.25
       enddo
     enddo
                               Naive balance (incl. write allocate):
     ! swap arrays
          0 ; t0=t1 ; t1=i
                             phi(:,:,t0):3 LD +
   enddo
                               phi(:,:,t1):1 ST+1LD
                               \rightarrow B<sub>C</sub> = 5 W / 4 FLOPs = 1.25 W / F
WRITE ALLOCATE:
LD + ST phi(i,k,t1)
```

#### **Balance metric: 2 D Jacobi**



- Modern cache subsystems may further reduce memory traffic
  - → "layer conditions"



If cache is large enough to hold at least 2 rows (shaded region): Each phi(:,:,t0) is loaded once from main memory and re-used 3 times from cache:

```
phi(:,:,t0): 1 LD + phi(:,:,t1): 1 ST+ 1LD \rightarrow B<sub>C</sub> = 3 W / 4 F = 0.75 W / F
```

If cache is too small to hold one row: phi(:,:,t0): 2 LD + phi(:,:,t1): 1 ST+ 1LD $\rightarrow B_C = 5 W / 4 F = 1.25 W / F$ 

#### Performance metrics: 2D Jacobi



Alternative implementation ("Macho FLOP version")

```
do k = 1, kmax

do i = 1, imax

phi(i,k,t1) = 0.25 * phi(i+1,k,t0) + 0.25 * phi(i-1,k,t0)

+ 0.25 * phi(i,k+1,t0) + 0.25 * phi(i,k-1,t0)

enddo

enddo
```

- MFlops/sec increases by 7/4 but time to solution remains the same
- Better metric (for many iterative stencil schemes):
   Lattice Site Updates per Second (LUPs/sec)

2D Jacobi example: Compute LUPs/sec metric via

$$P[LUPs/s] = \frac{it_{\text{max}} \cdot i_{\text{max}} \cdot k_{\text{max}}}{T_{\text{wall}}}$$



3D sweep:

```
do k=1,kmax
  do j=1,jmax
    do i=1,imax
      phi(i,j,k,t1) = 1/6. *(phi(i-1,j,k,t0)+phi(i+1,j,k,t0) &
                            + phi(i,j-1,k,t0)+phi(i,j+1,k,t0) &
                            + phi(i,j,k-1,t0)+phi(i,j,k+1,t0))
    enddo
  enddo
enddo
Best case balance: 1 LD
                                              phi(i,j,k+1,t0)
                     1 ST + 1 write allocate phi(i,j,k,t1)
                     6 flops
\rightarrow B<sub>c</sub> = 0.5 W/F (24 bytes/LUP)
```

- No 2-layer condition but 2 rows fit: B<sub>c</sub> = 5/6 W/F (40 bytes/LUP)
- Worst case (2 rows do not fit): B<sub>c</sub> = 7/6 W/F (56 bytes/LUP)

#### 3D Jacobi solver

Performance of vanilla code on one Interlagos chip (8 cores)





# **Conclusions from the Jacobi example**



- We have made sense of the memory-bound performance vs. problem size
  - "Layer conditions" lead to predictions of code balance
  - Achievable memory bandwidth is input parameter

- The model works only if the bandwidth is "saturated"
  - In-cache modeling is more involved

- Optimization == reducing the code balance by code transformations
  - See below



# **Data access optimizations**

**Case study: Optimizing the 3D Jacobi solver** 

# Remember the 3D Jacobi solver on Interlagos?





# Jacobi iteration (2D): No spatial Blocking



#### Assumptions:

- cache can hold 32 elements (16 for each array)
- Cache line size is 4 elements
- Perfect eviction strategy for source array



This element is needed for three more updates; but 29 updates happen before this element is used for the last time

# Jacobi iteration (2D): No spatial blocking



#### Assumptions:

- cache can hold 32 elements (16 for each array)
- Cache line size is 4 elements
- Perfect eviction strategy for source array



This element is needed for three more updates but has been evicted

# Jacobi iteration (2D): Spatial Blocking



- divide system into blocks
- update block after block
- same performance as if three complete rows of the systems fit into cache



# Jacobi iteration (2D): Spatial Blocking



- Spatial blocking reorders traversal of data to account for the data update rule of the code
- →Elements stay sufficiently long in cache to be fully reused
- → Spatial blocking improves temporal locality!

(Continuous access in inner loop ensures spatial locality)



This element remains in cache until it is fully used (only 6 updates happen before last use of this element)

#### Jacobi iteration (3D): Spatial blocking



#### Guidelines:

- Blocking of inner loop levels (traversing continuously through main memory)
- Blocking sizes large enough to fulfill "layer condition"
- Cache size is a hard limit!
- Blocking loops may have some impact on ccNUMA page placement

#### 3D Jacobi solver (problem size 400<sup>3</sup>)

Blocking different loop levels (8 cores Interlagos)





#### 3D Jacobi solver (problem size 400<sup>3</sup>)

Calculating the optimal block size (8 cores Interlagos)



- Interlagos chip: aggregate L2 size of 8 MB (say 4 MB to be safe)
- Static OpenMP scheduling → 0.5 MB cache per core
- Layer condition with j-loop blocking:

2 layers of size N x b<sub>i</sub> must fit into the cache



# Jacobi iteration (3D): Nontemporal stores



- Intel x86: NT stores are packed SIMD stores with 16-byte aligned address
  - Sometimes hard to apply
- AMD x86: Scalar NT stores without alignment restrictions available
- Options for using NT stores
  - Let the compiler decide → unreliable
  - Use compiler options
    - Intel: -opt-streaming-stores never|always|auto
  - Use compiler directives
    - Intel: !DIR\$ vector [non]temporal
    - Cray: !DIR\$ LOOP\_INFO cache[\_nt](...)
- Compiler must be able to "prove" that the use of SIMD and NT stores is "safe"!
  - "line update kernel" concept: Make critical loop its own subroutine

#### Jacobi iteration (3D): Nontemporal stores for Cray



Line update kernel (separate compilation unit or -fno-inline):

```
subroutine jacobi line(d,s,top,bottom,front,back,n)
  integer :: n,i,start
  double precision, dimension(*) :: d,s,top,bottom,front,back
 double precision, parameter :: oos=1.d0/6.d0
!DIR$ LOOP INFO cache nt(d)
   do i=2,n-1
      d(i) = oos*(s(i-1)+s(i+1)+top(i)+bottom(i)+front(i)+back(i))
    enddo
end subroutine
  Main loop:
do joffset=1,jmax,jblock
  do k=1,kmax
    do j=joffset, min(jmax,joffset+jblock-1)
      call jacobi line (phi (1,j,k,t1), phi (1,j,k,t0), phi (1,j,k-1,t0), &
                         phi(1,j,k+1,t0), phi(1,j-1,k,t0), phi(1,j+1,k,t0)
                         ,size)
    enddo
  enddo
enddo
```

#### 3D Jacobi solver

Spatial blocking + nontemporal stores





#### **Conclusions from the Jacobi optimization example**



- "What part of the data comes from where" is a crucial question
- Avoiding slow data paths == re-establishing the layer condition
- Improved code showed the speedup predicted by the model
- Optimal blocking factor can be predicted
  - Be guided by the cache size the layer condition
  - No need for exhaustive scan of "optimization space"



# Case study: OpenMP-parallel sparse matrix-vector multiplication

A simple (but sometimes not-so-simple) example for bandwidth-bound code and saturation effects in memory

# Sparse matrix-vector multiply (spMVM)



- Key ingredient in some matrix diagonalization algorithms
  - Lanczos, Davidson, Jacobi-Davidson
- Important for sparse solvers (CG,...)
- Store only N<sub>nz</sub> nonzero elements of matrix and RHS, LHS vectors with N<sub>r</sub> (number of matrix rows) entries
- "Sparse": N<sub>nz</sub> ~ N<sub>r</sub>



# **CRS** matrix storage scheme





- val[] stores all the nonzeros (length N<sub>nz</sub>)
- col\_idx[] stores the column index of each nonzero (length N<sub>nz</sub>)
- row\_ptr[] stores the starting index of each new row in val[] (length: N<sub>r</sub>)



# **Case study: Sparse matrix-vector multiply**



- Important kernel in many applications (matrix diagonalization, solving linear systems)
- Strongly memory-bound for large data sets
  - Streaming + partially indirect access:

```
!$OMP parallel do
do i = 1,Nr
do j = row_ptr(i), row_ptr(i+1) - 1
   c(i) = c(i) + val(j) * b(col_idx(j))
   enddo
enddo
!$OMP end parallel do
```

- Usually many spMVMs required to solve a problem
- Code balance / computational intensity? (erratic RHS access!)
- Saturation / scaling behavior?

Strong scaling on one XE6 Magny-Cours node



#### **Case 1: Large matrix**



Strong scaling on one XE6 Magny-Cours node



**Case 1: Large matrix** 





Strong scaling on one XE6 Magny-Cours node



#### Case 2: Medium size



Strong scaling on one Magny-Cours node



#### Case 3: Small size



# Roofline performance model for CRS spMVM



# Code balance for double precision FP and 4-byte index:

do i = 1,
$$N_r$$
  
do j = row\_ptr(i), row\_ptr(i+1) - 1  
 $C(i) = C(i) + val(j) * B col_idx(j)$   
enddo  
enddo



M. Kreutzer, G. Hager, G. Wellein, H. Fehske, and A. R. Bishop: *A unified sparse matrix data format for modern processors with wide SIMD units*. Submitted. Preprint: <u>arXiv:1307.6209</u>

# The $\alpha$ parameter



#### **Corner case scenarios:**

- 1.  $\alpha = 0$   $\rightarrow$  RHS in cache
- 2.  $\alpha = \frac{1}{N_{nzc}}$   $\rightarrow$  Load RHS vector exactly once

If 
$$N_{nzc} \gg 1$$
, RHS traffic is insignificant:  $\bar{P} = \frac{b\beta}{6 \text{ bytes/flop}}$ 

- 3.  $\alpha \approx 1$   $\rightarrow$  Each RHS load goes to memory
- 4.  $\alpha > 1$   $\rightarrow$  Houston, we've got a problem  $\odot$

Determine  $\alpha$  by measuring actual spMVM memory traffic (HPM)

#### **Determine RHS traffic**



# $V_{meas}$ is the measured overall memory data traffic (using, e.g., likwid-perfctr)

#### Determine $\alpha$ :

$$\alpha = \frac{1}{4} \left( \frac{V_{meas}}{N_{nz} \cdot 2 \text{ bytes}} - 6 - \frac{8}{N_{nzr}} \right)$$

#### Example: kkt\_power matrix on one Intel SNB socket

$$N_{nz} = 14.6 \cdot 10^6$$
,  $N_{nzr} = 7.1$ 

■ 
$$V_{meas} \approx 258 \text{ MB}$$

• 
$$\rightarrow \alpha = 0.43, \, \alpha N_{nzr} = 3.1$$

- → RHS is loaded 3.1 times from memory
- and:



$$\frac{B_{CRS}^{DP}(\alpha)}{B_{CRS}^{DP}(1/N_{nzc})} = 1.15$$

15% extra traffic

→ optimization potential!

# **Conclusions from the spMVM example**



- spMVM shows "typical" bandwidth-bound scaling behavior
- Roofline is good for a first shot at modeling
- Deviations are to be expected
  - Erratic RHS access
  - Saturation bandwidth is lower than the maximum.
- Deviations can be used to learn more about the code execution
  - How much excess memory traffic is generated from the indirect access?

#### **Conclusions**



# There is no alternative to knowing what is going on between your code and the hardware

Without performance modeling, optimizing code is like stumbling in the dark