#### ERLANGEN REGIONAL COMPUTING CENTER



## MPI+X Programming Models on Future Systems – the Search for Lowest-Order Effects

Georg Hager, Erlangen Regional Computing Center (RRZE)

Programming Models on the Road to Exascale, ISC High Performance 2015, July 13, 2015, Frankfurt, Germany



FRIEDRICH-ALEXANDER UNIVERSITÄT ERLANGEN-NÜRNBERG

#### Outline

- Resource-aware software engineering
  - Hardware bottlenecks
  - What we need and what we get: resource balance (Kung)
  - Lowest-order thinking (excavator aerodynamics)
- MPI+X programming models
  - X = {}, threading, accelerator
  - Opportunities for addressing the lowest order
  - How to find the lowest order?





Resources are means to an end




















# Resource balance: what we need and what we get

Initial idea: code balance vs. machine balance










H.T. Kung: Memory requirements for balanced computer architectures. Proc. ISCA'86, DOI: 10.1145/17356.17362
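The balance idea can be illustrated with a few lines of arithmetic. The following is a minimal sketch, not taken from the slides: the kernel (a double-precision vector triad) and the machine numbers (50 GB/s, 500 Gflop/s) are hypothetical, and the function names are chosen for illustration only. Code balance is what the loop needs in bytes per flop, machine balance is what the hardware can deliver, and their ratio caps the achievable fraction of peak performance.

```python
def code_balance(bytes_per_iter: float, flops_per_iter: float) -> float:
    """Code balance B_c in bytes/flop: data traffic the loop requires per flop."""
    return bytes_per_iter / flops_per_iter


def machine_balance(bandwidth_gbs: float, peak_gflops: float) -> float:
    """Machine balance B_m in bytes/flop: data traffic the hardware sustains per flop."""
    return bandwidth_gbs / peak_gflops


def max_fraction_of_peak(b_m: float, b_c: float) -> float:
    """Upper limit on the fraction of peak performance, assuming data
    transfer and computation overlap perfectly."""
    return min(1.0, b_m / b_c)


# Hypothetical kernel: a[i] = b[i] + s * c[i] in double precision.
# Assumed traffic per iteration: 2 loads + 1 store + 1 write-allocate = 32 bytes, 2 flops.
triad_bc = code_balance(32.0, 2.0)      # 16 bytes/flop

# Hypothetical machine: 50 GB/s memory bandwidth, 500 Gflop/s peak.
bm = machine_balance(50.0, 500.0)       # 0.1 bytes/flop

fraction = max_fraction_of_peak(bm, triad_bc)
print(f"B_c = {triad_bc} B/flop, B_m = {bm} B/flop, at most {fraction:.2%} of peak")
```

With these assumed numbers the kernel asks for far more data per flop than the machine can supply, so memory bandwidth, not the floating-point units, sets the limit.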





#### Generalization of the balance concept: Lowest order

Limited resources impose upper (lower) performance (runtime) limits










DOI: 10.1145/1498765.1498785
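One way to read the generalized statement is that every limited resource contributes its own minimum runtime, and the largest of these is the lowest-order bound. The sketch below assumes perfect overlap between all resources; the resource names, the demands, and the capacities are hypothetical values chosen only to make the arithmetic visible.

```python
def runtime_lower_bound(demand: dict, capacity: dict):
    """Return (bound_in_seconds, limiting_resource).

    demand[r]   : amount of work placed on resource r (flops, bytes, ...)
    capacity[r] : rate at which resource r can process that work (per second)
    Assumes that all resources can be used fully concurrently.
    """
    times = {r: demand[r] / capacity[r] for r in demand}
    bottleneck = max(times, key=times.get)
    return times[bottleneck], bottleneck


# Hypothetical application step and hypothetical node characteristics.
demand = {"flops": 2.0e9, "memory_bytes": 3.2e10, "network_bytes": 1.0e8}
capacity = {"flops": 5.0e11, "memory_bytes": 5.0e10, "network_bytes": 1.0e10}

bound, bottleneck = runtime_lower_bound(demand, capacity)
print(f"runtime >= {bound:.3f} s, limited by {bottleneck}")
```

In this made-up case the memory interface dominates, so that is the resource worth looking at first; the other contributions are higher-order effects.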

Simple balance picture does not hold due to non-overlap



DOI: 10.1007/978-3-642-14390-8\_64

Stengel, Treibig, Hager, Wellein (2015), DOI: 10.1145/2751205.2751240
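The effect of non-overlap can be made concrete with a small calculation. The sketch below loosely follows the structure of the ECM model named later in the talk, under the assumption that in-core time splits into a part that overlaps with data transfers and a part that does not, and that the transfers through the memory hierarchy are serialized. All cycle counts are hypothetical.

```python
def full_overlap_cycles(t_core: float, transfers: list) -> float:
    """Simple balance picture: everything overlaps, the slowest part wins."""
    return max([t_core, *transfers])


def non_overlap_cycles(t_ol: float, t_nol: float, transfers: list) -> float:
    """Non-overlapping picture (ECM-like assumption):
    data transfers add up with the non-overlapping in-core time,
    and only the overlapping in-core time can hide behind that sum."""
    return max(t_ol, t_nol + sum(transfers))


# Hypothetical cycle counts for one unit of work (e.g., one cache line).
t_ol = 8.0                     # overlapping in-core work (arithmetic)
t_nol = 4.0                    # non-overlapping in-core work (loads/stores)
transfers = [4.0, 4.0, 10.0]   # L1-L2, L2-L3, L3-memory

optimistic = full_overlap_cycles(max(t_ol, t_nol), transfers)
realistic = non_overlap_cycles(t_ol, t_nol, transfers)
print(f"full overlap: {optimistic} cy, non-overlap: {realistic} cy")
```

With these assumed inputs the two pictures differ by more than a factor of two, which is why the simple balance estimate can be far too optimistic for a single core.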








Potential of overlap *is* limited by the minimum requirements of the software w.r.t. the hardware

| Computation     |                      |                       |          |
|-----------------|----------------------|-----------------------|----------|
| Memory transfer | Inter-cache transfer | Network communication | Disk I/O |








... and it should be limited/guided by lowest-order thinking!

Resource optimization ==

- exposing the lowest-order bottleneck
- reducing the impact of the bottleneck



























Getting to lowest order is only useful if it promises a significant return



... even if your programming model allows it!






#### Now what about the X?


 $X = \{ \}$ 

OpenMP, TBB, OmpSs, pthreads, Cilk(+)

CUDA, OpenCL

OpenACC, OpenMP4



some-library-that-does-the-trick











1. If a programming model (i.e., some X) lets me expose the lowest order, I'm fine with it
2. If some X allows me to expose the lowest order better than any other, it may be a better candidate

Examples:

- OpenMP tasking for communication/computation overlap
- DSL for exposing data parallelism and data flow in stencil algorithms
- OmpSs for extracting the critical path
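The first example, communication/computation overlap, can be sketched without any particular X. The code below is not OpenMP tasking and uses no real MPI: a Python worker thread with a `sleep` stands in for a halo exchange, and all function names are hypothetical. It only shows the structure that any suitable X has to let you express: start the communication, update the points that do not depend on it, then finish the boundary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_halo_exchange(left, right, delay=0.01):
    """Stand-in for a halo exchange with neighbors (hypothetical, no real MPI)."""
    time.sleep(delay)          # pretend the network takes a while
    return left, right


def smooth(u, i):
    """Three-point average at position i."""
    return (u[i - 1] + u[i] + u[i + 1]) / 3.0


def step_overlapped(local, left_neighbor, right_neighbor):
    """One smoothing step with communication hidden behind interior work.
    Assumes len(local) >= 1."""
    n = len(local)
    u = [0.0] + list(local) + [0.0]        # u[0] and u[n+1] are halo cells
    new = [0.0] * n
    with ThreadPoolExecutor(max_workers=1) as pool:
        halo = pool.submit(fake_halo_exchange, left_neighbor, right_neighbor)
        # Interior points do not touch the halo: update them while the
        # "communication" is in flight.
        for i in range(2, n):
            new[i - 1] = smooth(u, i)
        u[0], u[n + 1] = halo.result()     # wait for the halo data
    # Boundary points need the halo values.
    new[0] = smooth(u, 1)
    new[n - 1] = smooth(u, n)
    return new


def step_serial(local, left_neighbor, right_neighbor):
    """Reference: communicate first, then compute everything."""
    u = [left_neighbor] + list(local) + [right_neighbor]
    return [smooth(u, i) for i in range(1, len(local) + 1)]


data = [float(i * i) for i in range(8)]
result = step_overlapped(data, 1.0, 2.0)
reference = step_serial(data, 1.0, 2.0)
print(result == reference)
```

Whether the overlap pays off depends on the ratio of communication time to interior work; if communication is not the lowest-order contribution, the added complexity buys little.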













Profiling? Hardware counter measurements? Automatic tuning advice?

File foo.cc, line 56: Loop shows 50.0% L1 cache hit rate – consider optimization

- Performance modeling of hardware-software interaction!
  - Roofline model, ECM model, LogP model, ...
  - Performance patterns (Treibig, Hager, Wellein (2012), DOI: 10.1007/978-3-642-36949-0\_50)

Visit our ISC15 workshop "Performance Modeling: Methods & Applications" (Marriott, Room Gold 1+2)
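Of the models named above, LogP addresses the communication side. As a rough sketch, assuming the textbook LogP parameters (latency `L`, per-message overhead `o`, gap `g` between consecutive messages) and ignoring its limit on messages in flight, the time for one sender to deliver `k` small messages to one receiver can be estimated as follows. The parameter values are hypothetical.

```python
def logp_stream_time(k: int, L: float, o: float, g: float) -> float:
    """Estimated time until the k-th small message has been received.

    Sender overhead o, wire latency L, receiver overhead o for the last
    message; earlier messages are injected every max(g, o) time units.
    Simplification: the LogP capacity limit is not modeled.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return 2.0 * o + L + (k - 1) * max(g, o)


# Hypothetical network parameters in microseconds.
L_us, o_us, g_us = 1.0, 0.2, 0.4

single = logp_stream_time(1, L_us, o_us, g_us)
stream = logp_stream_time(10, L_us, o_us, g_us)
print(f"1 message: {single:.2f} us, 10 messages: {stream:.2f} us")
```

Comparing such an estimate with a measurement tells you whether latency, overhead, or injection rate is the lowest-order term, which a bare cache hit rate from a profiler cannot do.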



#### **Take-home messages**

- If it does the trick, it is a candidate
  - The trick being the full utilization of a bottleneck
- If it does the trick better than anything else, it may be worth serious consideration
- If it is sustainable, take it.
- What is the trick?
  - $\rightarrow$  A performance model will probably guide you!










**DFG Priority Programme 1648**

KONWIHR 

#### **Bavarian Network for HPC**

#### Thank You.

