



### Part I: Introduction to compute node architecture

General Concept Single Core: Pipelining, Superscalarity, SMT, SIMD Memory hierarchy and Data Transfers GPU vs. CPU



- Xeon "Skylake SP" (Platinum/Gold/Silver/Bronze):
  - Up to 28 cores running at 2+ GHz (+ "Turbo Mode": 3.8+ GHz)
    - Reincarnated as "Cascade Lake" in 2018

- Simultaneous Multithreading (SMT)
   → reports as 56-way chip
- 8 Billion Transistors / 14 nm
- Die size: ~500 mm<sup>2</sup>





- Cluster on Die (analogous to SNC)
- Up to 24 cores

#### Node-Level Performance Engineering





#### General-purpose cache based microprocessor core





Stored-program computer

- 1. Implements "Stored Program Computer" concept (Turing 1936)
- 2. Similar designs on all modern systems
- 3. (Still) multiple potential bottlenecks

The clock cycle is the "heartbeat" of the core







## Part I: Introduction to compute node architecture

Single core: Pipelining, Superscalarity, SMT, SIMD





### Idea:

- Split complex instruction into several simple / fast steps (stages)
- Each step takes the same time, e.g., one cycle
- Execute different steps on different instructions at the same time (in parallel)
- Allows for shorter cycle times (simpler logic circuits), e.g.:
  - floating point multiplication takes 5 cycles, but
  - processor can work on 5 different multiplications simultaneously
  - one result at each cycle after the pipeline is full
- Drawback:
  - Pipeline must be filled sufficient # of independent instructions required
  - Requires complex instruction scheduling by compiler/hardware
    - software-pipelining / out-of-order execution

### Pipelining is widely used in modern computer architectures





First result is available after 5 cycles (=latency of pipeline)! Wind-up/-down phases: Empty pipeline stages



- Multiple units enable use of Instruction Level Parallelism (ILP): Instruction stream is "parallelized" on the fly
- Instructions from different loop iterations retired at the same time



- Issuing m concurrent instructions per cycle: m-way superscalar
- Modern processors are 4- to 6-way superscalar & can perform 2 floating-point instructions per cycle







- x86 SIMD instruction sets:
  - SSE: register width = 128 Bit  $\rightarrow$  2 double precision floating point operands
  - AVX(/2): register width = 256 Bit  $\rightarrow$  4 double precision floating point operands
  - AXV512: you get it.
- Adding two registers holding double precision floating point operands



FRIEDRICH-ALEXAN





### A "simple" example: The sum reduction



s = 0.0
do i = 1,N
 s = s + a(i)
enddo

...In single precision (32 Bit) on an AVXcapable core (ADD latency = 3 cy)

How fast can this loop possibly run with data in the L1 cache?

- Loop-carried dependency on summation variable
- Execution stalls at every ADD until previous ADD is complete
- → No pipelining? → No SIMD?



### Plain scalar code, no SIMD

do i = 1,N
 s = s + a(i)
enddo



#### **ADD** pipes utilization:



 $\rightarrow$  1/24 of ADD peak

#### Applicable peak for the sum reduction (II)





do i = 1,N,3
 s1 = s1 + a(i+0)
 s2 = s2 + a(i+1)
 s3 = s3 + a(i+2)
enddo
s = s + s1+s2+s3



 $\rightarrow$  1/8 of ADD peak



```
do i = 1, N, 24
SIMD-vectorization (8-way MVE) x
                                                        s10=s10+a(i+0); s20=s20+a(i+8) ; s30=s30+a(i+16)
                                                        s11=s11+a(i+1); s21=s21+a(i+9); s31=s31+a(i+17)
         pipelining (3-way MVE)
                                                        s12=s12+a(i+2); s22=s22+a(i+10); s32=s32+a(i+18)
                                                        s13=s13+a(i+3); s23=s23+a(i+11); s33=s33+a(i+19)
                                                        s14=s14+a(i+4); s24=s24+a(i+12); s34=s34+a(i+20)
                                                        s15=s15+a(i+5); s25=s25+a(i+13); s35=s35+a(i+21)
                                                        s16=s16+a(i+6); s26=s26+a(i+14); s36=s36+a(i+22)
LOAD [r1.0,...,r1.7] \leftarrow [0,...,0]
                                                        s17=s17+a(i+7); s27=s27+a(i+15); s37=s37+a(i+23)
LOAD [r2.0,...,r2.7] \leftarrow [0,...,0]
                                                       enddo
LOAD [r3.0, ..., r3.7] \leftarrow [0, ..., 0]
                                                       s = s + s10 + s11 + \dots + s37
i \leftarrow 1
                                                                                   s10
                                                                                               s30
                                                                                         s20
loop:
                                                                                         s21
                                                                                               s31
                                                                                    s11
   LOAD [r4.0,...,r4.7] \leftarrow [a(i),...,a(i+7)]
                                                              # SIMD LOAD
   LOAD [r5.0,...,r5.7] \leftarrow [a(i+8),...,a(i+15)]
                                                                                   s12
                                                                                         s22
                                                                                               s32
                                                              #
                                                                 SIMD
                                                                              peak
  LOAD [r6.0, ..., r6.7] \leftarrow [a(i+16), ..., a(i+23)]
                                                              #
                                                                 STMD
                                                                                   s13
                                                                                         s23
                                                                                               s33
                                                                              ADD
   ADD r1 \leftarrow r1 + r4
                             # STMD ADD
                                                                                   s14
                                                                                         s24
                                                                                               s34
   ADD r2 \leftarrow r2 + r5 \# SIMD ADD
                                                                              \mathbf{\Lambda}
   ADD r3 \leftarrow r3 + r6
                             # SIMD ADD
                                                                                    s15
                                                                                         s25
                                                                                               s35
                                                                                    s16
                                                                                         s26
                                                                                               s36
   i+=24 →? loop
result \leftarrow r1.0+r1.1+...+r3.6+r3.7
                                                                                               s37
                                                                                    s17
                                                                                         s27
```



### Questions

- When can this performance actually be achieved?
  - No data transfer bottlenecks
  - No other in-core bottlenecks
    - Need to execute (3 LOADs + 3 ADDs + 1 increment + 1 compare + 1 branch) in 3 cycles
- What does the compiler do?
  - If allowed and capable, the compiler will do this automatically
- Is the compiler allowed to do this at all?
  - Not according to language standards
  - High optimization levels can violate language standards
- What about the "accuracy" of the result?
  - Good question ;-)





## Part I: Introduction to compute node architecture

Memory Hierarchy and Data Transfers





How does data travel from memory to the CPU and back?

- Remember: Caches are organized in cache lines (e.g., 64 bytes)
- Only complete cache lines are transferred between memory hierarchy levels (except L1 ←→registers)
- Cache MISS: Load or store instruction does not find the data in a cache level
   → CL transfer required
- Example: Array copy A(:)=C(:)
- "Store miss"  $\rightarrow$  write allocate (WA)
- Techniques exists to avoid WA



#### Putting the cores & caches together AMD Epyc 7742 64-Core Processor («Rome»)



#### Processor "Chip" (Socket) Memory Memory ΤΤΤΤΤΤΤ т т т т т т т т P P Ρ Ρ Ρ P Ρ 16 MIB L 3 8 8 d d d d d d 1 1 1 1 ΤΤΤΤΤΤ ΤΤΤΤΤΤ Ρ Ρ Ρ Ρ Ρ Ρ 16 MiB L3 16 MiB L3 8 8 10 MIB F3 ET BIM 91 d d d d d d d 1 1 Ρ Ρ Ρ Ρ Ρ Ρ 16 MiB L3 16 MIB L3 8 8 ET RIM 91 d d d d d d 1 1 1 1 111111 TTTTTT TTTTTT Ρ Ρ Р PP Р 16 MIB L3 8 8 16 MiB I 3 d d d d d d d

#### Core features:

- Two-way SMT
- Two 256-bit SIMD FMA units (AVX2)
   →16 flops/cycle
- 32 KiB L1 data cache per core
- 512 KiB L2 cache per core
- 64 cores per socket hierarchically built up from
  - I6 CCX with 4 cores and 16 MB of L3 cache
  - 2 CCX form 1 CCD (silicon die)
  - 8 CCDs connected to IO device "Infinity Fabric" (memory controller & PCIe)
- 8 channels of DDR4-3200 per IO device
  - MemBW: 8 ch x 8 byte x 3.2 GHz = 204.8 GB/s
- NUMA-feature (Boot time option):
  - Node Per Socket (NPS)=1, 2 or 4
  - NPS=4  $\rightarrow$  4 UMA domains

Memory

IW

IW

Memory





## Part I: Introduction to compute node architecture

GPU vs CPU



### Nvidia A100 "Ampere" SXM4 specs



#### Architecture

- 54.2 B Transistors
- ~ 1.4 GHz clock speed
- ~ 108 "SM" units \_\_\_\_
  - 64 SP "cores" each (FMA)
  - 32 DP "cores" each (FMA)
  - 4 "Tensor Cores" each
  - 2:1 SP:DP performance
- 9.7 TFlop/s DP peak (FP64)
- 40 MiB L2 Cache
- 40 GB (5120-bit) HBM2
- MemBW ~ 1555 GB/s (theoretical)
- MemBW ~ 1400 GB/s (measured)

 Image: Contract Context Contract Contract Contract Contract Co

© Nvidia

$$P_{peak}^{DP} = n_{SM} \cdot n_{core} \cdot n_{FP} \cdot f$$

$$n_{SM} = 108$$

$$n_{core} = 32$$

$$n_{FP} = 2\frac{\text{flops}}{\text{cy}}$$

$$f = 1.4\frac{\text{Gcy}}{\text{S}}$$

# Trading single thread performance for parallelism: *GPGPUs vs. CPUs*



| GPU vs. CPU<br>light speed estimate<br>(per processor chip) | Control                  | ALU<br>ALU    | ALU<br>ALU                    |  |
|-------------------------------------------------------------|--------------------------|---------------|-------------------------------|--|
| MemBW $\sim 7 - 10x$ Peak $\sim 4 - 8x$                     | СРО                      |               | GPU                           |  |
|                                                             | 2 x AMD EPYC 7742 "Rome" |               | NVidia Tesla A100<br>"Ampere" |  |
| Cores@Clock                                                 | 2 x 64 @ 2.25 GHz        |               | 108 SMs @ ~1.4 GHz            |  |
| FP32 Performance/core                                       | 72 GFlop/s               |               | ~179 GFlop/s                  |  |
| Threads@STREAM                                              | ~16                      |               | > 100000                      |  |
| FP32 peak9.2 TFlop/s                                        |                          | ~19.5 TFlop/s |                               |  |
| Stream BW (meas.)                                           | 2 x 190 GB/s             |               | 1400 GB/s                     |  |
| Transistors / TDP                                           | ~2x40 Billion / 2x225 W  |               | 54 Billion/400 W              |  |



- Modern computer architecture has a rich "topology"
- Node-level hardware parallelism takes many forms
  - Sockets/devices CPU: 1-8, GPGPU: 1-6
  - Cores moderate (CPU: 4-64) to massive (GPGPU: 10's-100's)
  - SIMD moderate (CPU: 2-8) to massive (GPGPU: 10's-100's)
  - Superscalarity (CPU: 2-6)
- Exploiting performance: parallelism + bottleneck awareness
  - "High Performance Computing" == computing at a bottleneck

Note:

- Performance features are largely independent of the programming model
- But programming models are sensitive to architecture
  - Topology/affinity influences overheads
  - Standards do not contain (many) topology-aware features