### ERLANGEN REGIONAL COMPUTING CENTER



# Performance Engineering for Stencil Updates on Modern Processors

Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein

Erlangen Regional Computing Center & Dept. of Computer Science Friedrich-Alexander-University Erlangen-Nuremberg, Germany



Partially funded by DFG Priority Programme1648



RIEDRICH-ALEXANDER NIVERSITÄT RLANGEN-NÜRNBERG

## **Motivation**

- There are so many
  - potential stencil structures (2D/3D, long-/short-range,...),
  - text book optimizations (register / spatial / temporal blocking,...),
  - parameters for optimization (blocking / unrolling factor,...),
  - parameters for execution (OpenMP schedule, clock speed, #cores).
- Basic questions addressed by ECM model
  - What is the bottleneck of my stencil implementation?
     → Choose appropriate optimization technique
  - What is the next bottleneck I will hit and "how far" is it from the first?
     → This yields the performance potential of the optimization
  - Impact of processor frequency and #cores used on performance





# Agenda

- ECM model basics
  - STREAM like kernel
    - Why does a single core not get full BW?
  - Case study 1: 5-pt 2D stencil
- Case study 2: Long range 3D stencil
- Case study 3: 3D stencil with DIVide
- Summary



# **Execution-Cache-Memory (ECM) model**

#### Assumptions

- Single core execution time is determined by
  - In-core execution time &
  - **Data delay** in memory hierarchy
- Socket scaling in socket is linear until relevant shared bottleneck is hit

#### Insights

- Single-core performance & socket scaling
- Relevant bottlenecks & performance impact

#### Input

- Similar as for roofline (e.g. STREAM)
- Data transfer times of all memory levels



## **Example: Schönauer Vector Triad in L2 cache**

REPEAT[A(:) = B(:) + C(:) \* D(:)] @ double precision Analysis: Sandy Bridge core w/ AVX (unit of work: 1 Cache Line)



#### **Example: ECM model for Schönauer Vector Triad** A(:) = B(:) + C(:) \* D(:) with AVX



# Multicore scaling in the ECM model

#### Identify relevant **bandwidth (shared) bottlenecks**

- L3 cache
- Memory interface

#### Scale single-thread performance until first bottleneck is hit: P(t)=min(t\*P<sub>0</sub>,P<sub>roof</sub>), with P<sub>roof</sub>=min(P<sub>max</sub>,I \*b<sub>s</sub>)

Sandy Bridge: Scalable L3

 $\rightarrow$  Bottleneck (P<sub>roof</sub>): Memory









### ECM prediction vs. measurements for A(:)=B(:)+C(:)\*D(:)



### **Case study 1**

#### ECM modelling for stencil structures: 2D Jacobi (double precision) with SSE2 on SNB







| 1 | // Jacobi 2D line update                                    |
|---|-------------------------------------------------------------|
| 2 | <pre>for(int j=start; j<end; j++){<="" pre=""></end;></pre> |
| 3 | t1[i][j]= ( <u>t0[i-1][j]</u> +                             |
| 4 | t0[i+1][j] +                                                |
| 5 | t0[i][j-1] +                                                |
| 6 | t0[i][j+1] ) * 0.25;                                        |
| 7 | }                                                           |



#### 4-way unrolling → 8 LUP / iteration



**Instruction count** 

- 13 LOAD
- 4 STORE

- 12 ADD
- 4 MUL

| 1 | movups | (%rbp, %r15, 8), %xmml            |
|---|--------|-----------------------------------|
| 2 | movups | 16(%rbp,%r15,8), %xmm3            |
| 3 | movups | 32(%rbp,%r15,8), %xmm5            |
| 4 | movups | 48(%rbp,%r15,8), %xmm7            |
| 5 | addpd  | (%r9,%r15,8), %xmm1               |
| 6 | addpd  | 16(%r9,%r15,8), %xmm3             |
| 7 | addpd  | 32(%r9,%r15,8), %xmm5             |
| 8 | addpd  | 48(%r9,%r15,8), %xmm7             |
| 9 | addpd  | -8(%r10,%r15,8), %xmm1            |
| 0 | movups | 8(%r10,%r15,8), %xmm2             |
| 1 | movups | 24(%r10,%r15,8), %xmm4            |
| 2 | movups | 40(%r10,%r15,8), %xmm6            |
| 3 | addpd  | <pre>%xmm2, %xmm3</pre>           |
| 4 | addpd  | <pre>%xmm4, %xmm5</pre>           |
| 5 | addpd  | <pre>%xmm6, %xmm7</pre>           |
| 6 | addpd  | <pre>%xmm2, %xmm1</pre>           |
| 7 | addpd  | %xmm4, %xmm3                      |
| 8 | addpd  | <pre>%xmm6, %xmm5</pre>           |
| 9 | addpd  | 56(%r10,%r15,8), %xmm7            |
| 0 | mulpd  | %xmm0, %xmm1                      |
| 1 | mulpd  | %xmm0, %xmm3                      |
| 2 | mulpd  | %xmm0, %xmm5                      |
| 3 | mulpd  | %xmm0, %xmm7                      |
| 4 | movups | <pre>%xmm1, (%r11,%r15,8)</pre>   |
| 5 | movups | <pre>%xmm3, 16(%r11,%r15,8)</pre> |
| 6 | movups | <pre>%xmm5, 32(%r11,%r15,8)</pre> |
| 7 | movups | \$xmm7, 48(%r11,%r15,8)           |



#### Port utilization in cycles



2.3 GLUP/s

1.9 GLUP/s

- Situation 1: Data set fits into L1 cache
  - ECM prediction: (8 LUP / 12 cy) \* 3.5 GHz =
  - Measurement: 2.2 GLUP/s



- Situation 2: Data set fits into L2 cache
  - 3 transfer streams from L2 to L1 (LD T<sub>0</sub> + LD/ST T<sub>1</sub>)
  - ECM prediction: (8 LUP / (12+6) cy) \* 3.5 GHz = 1.5 GLUP/s
  - Measurement:





1.9 GLUP/s

"If the model fails, we learn something"

- ↓ L1 ↓ L2
- No concurrent data transfer between memory levels
- Data transfer may overlap with other in-core instructions

Measurement:



## Case study 2: Long range 3D stencil

ECM modelling for stencil structures: Long range 3D 25-pt stencil

- Background:
  - Three-dimensional Seismic simulations in oil industry
  - Finite-Difference-Time-Domain (FDTD) discretization

3

4

5

6

7

8

9 10

11

12 13

14

15

16

17

- Stencil operator is
  - Isotropic (symmetric),
  - With constant coefficients,
  - 2<sup>nd</sup> order in time
  - 8<sup>th</sup> order in space

Collaboration with D. Keyes & T. Malas (KAUST)

```
// 3D long-range line update (single precision)
for(i=4; i<nnx-4; i++) {</pre>
  lap = coef0
              * V(i,j,k)
      + coef[1] * ( V(i+1,j ,k ) + V(i-1,j ,k ) )
      + coef[1] * ( V(i , j+1, k ) + V(i , j-1, k
                                                ))
      + coef[1] * ( V(i ,j ,k+1) + V(i ,j ,k-1) )
      + coef[2] * ( V(i+2,j ,k ) + V(i-2,j ,k
                                                ))
      + coef[2] * ( V(i , j+2, k ) + V(i , j-2, k
                                                ))
      + coef[2] * ( V(i ,j ,k+2) + V(i ,j ,k-2) )
      + coef[3] * ( V(i+3,j ,k ) + V(i-3,j ,k
      + coef[3] * ( V(i ,j+3,k ) + V(i ,j-3,k
      + coef[3] * ( V(i ,j ,k+3) + V(i ,j ,k-3) )
      + coef[4] * ( V(i+4,j ,k ) + V(i-4,j ,k
                                                ))
      + coef[4] * ( V(i , j+4, k ) + V(i , j-4, k
      + coef[4] * ( V(i ,j ,k+4) + V(i ,j ,k-4) );
  U(i,j,k) = 2.f * V(i,j,k) - U(i,j,k) + ROC2(i,j,k) * lap;
```

#### Innermost loop shown only

## Case study 2: Long range 3D stencil



### Case study 2: Long range 3D stencil



# Case study 3: 3D stencil with DIVide

ECM modelling for stencil structures: Impact of complex arithmetic

#### Background:

- Three-dimensional Seismic simulations
- Unelastic Wave Propagation Code (http://hpgeoc.sdsc.edu/personnel.html)
- Finite-Difference-Time-Domain (FDTD) discretization



**DIV** operation

### **Case study 3: 3D stencil with DIV**



### Conclusions

- ECM allows good estimate/prediction of single core performance and socket scalability for "streaming kernels"
- ECM reveals computational/hardware bottlenecks and quantifies their impact on performance
- ECM allows to quantify impact of clock speed or vectorization
- ECM could be integrated into autotuning frameworks or DSLs

Work is funded by



DFG Priority Programme1648



**Bavarian Network for HPC** 

High Performance Computing is Computing at a Bottleneck!\*



## **References – ECM model basics & applications**

- G. Hager, J. Treibig, J. Habich, and G. Wellein: Exploring performance and power properties of modern multicore chips via simple machine models. (accepted) Concurrency and Computation: Practice and Experience, <u>DOI: 10.1002/cpe.3180</u> (2013). Preprint: <u>arXiv:1208.2908</u>
- M. Wittmann, G. Hager, T. Zeiser, J. Treibig, and G. Wellein: *Chip-level and multinode analysis of energy-optimized lattice-Boltzmann CFD simulations*. Submitted. Preprint: <u>arXiv:1304.7664</u>
- J. Treibig, G. Hager, and G. Wellein: Performance patterns and hardware metrics on modern multicore processors: Best practices for performance engineering. Proc. 5<sup>th</sup> Workshop on Productivity and Performance (PROPER 2012) at Euro-Par 2012. Euro-Par 2012: Parallel Processing Workshops, LNCS 7640, 451-460 (2013), Springer, ISBN 978-3-642-36948-3. DOI: 10.1007/978-3-642-36949-0\_50. Preprint: arXiv:1206.3738



