

## Writing Efficient Programs in Fortran, C and C++: Selected Case Studies

<u>Georg Hager</u> Frank Deserno Dr. Frank Brechtefeld Dr. Gerhard Wellein

Regionales Rechenzentrum Erlangen HPC Services

## Agenda



 Case Study: Optimization of a Monte Carlo spin system simulation

#### Classic data access optimizations

- Case Study: Optimization of kernel loops
- Case Study: Optimization and parallelization of a Strongly Implicit Solver

#### Advanced Parallelization

 Case Study: Parallelization of a C++ sparse matrix-vector multiplication HPC





**Original Code** 

|   | Program Kernel:                                                                                                                       | cx<br>HP <b>C</b> |
|---|---------------------------------------------------------------------------------------------------------------------------------------|-------------------|
|   | IA=IZ(KL,KM,KN)<br>IL=IZ(KLL,KM,KN)<br>IR=IZ(KL,KMO,KN)<br>IO=IZ(KL,KMO,KN)<br>Load neighbours of a<br>random spin                    |                   |
|   | IU=IZ(KL,KMU,KN)<br>IS=IZ(KL,KM,KNS)<br>IN=IZ(KL,KM,KNN) calculate magnetic field                                                     |                   |
| C | edelz=iL+iR+iU+iO+iS+iN<br>CRITERION FOR FLIPPING THE SPIN                                                                            |                   |
|   | <pre>BF= 0.5d0*(1.d0+tanh(edelz/tt)) IF(YHE.LE.BF) then iz(kl,km,kn)=1 else iz(kl,km,kn)=-1 endif</pre> decide about spin orientation |                   |







## Optimization of Kernel Loops







21.07.2003

Georg Hager, RRZE

Optimization Case Studies



#### Optimization of Kernel Loops: Why preload is not always beneficial







- Now full register reuse for F() possible
- F() is loaded only once from memory
- Downside: small inner loop length







\_ remainder loop omitted!



#### Optimization of Kernel Loops: Architectural Issues



21.07.2003

Georg Hager, RRZE

Optimization Case Studies

23

#### Optimization of Kernel Loops: Architectural Issues

IA32: Optimal strategy is the same as for SR8000

- Very limited FP register set, stack-oriented
- Few integer registers
- Long P4 pipeline, but good performance with short loops
  - due to special L1-ICache (decoded instructions)?

#### IA64: Optimal strategy is the same as for SR8000

- Very bad performance for naive strategy
- Further unrolling (by 4) of middle loop helps
- But: Naive optimization with middle loop unrolling (16-fold) is also very close to optimum
  - Also some benefit on IA32, but not that much

# . .

| Op<br>Co  | otimization of Kernel Loops:<br>onclusions                                                                                                                                                                                                                                                                                                                                                                                                      |                                                                                                            | 2      |
|-----------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------|--------|
|           | <ul> <li>SR8000 is a RISC architecture features</li> <li>Vectorization abilities</li> <li>16 outstanding prefetches</li> <li>128 outstanding preloads</li> <li>Large bandwidth</li> <li>Long FP pipelines</li> <li>Careful data stream analysis is than on other RISC systems</li> <li>Sometimes PVP gets in the wat MIPS behaviour is as expected IA32/IA64 is still a mystery</li> <li>Complicated architecture (CIS deficiencies)</li> </ul> | , but has some particular<br>s more important on SR8000<br>ay of performance<br>d for typical RISC machine | r<br>r |
| 21.07.200 | 03 Georg Hager, RRZE                                                                                                                                                                                                                                                                                                                                                                                                                            | Optimization Case Studies                                                                                  | 25     |
|           | Case Study:<br>Optimization and P<br>of a Strongly Implic                                                                                                                                                                                                                                                                                                                                                                                       | arallelization                                                                                             |        |

### **CFD kernel: Strongly Implicit Solver**































21.07.2003



21.07.2003

Georg Hager, RRZE

**Optimization Case Studies** 



## Case Study: Parallelization of a Sparse MVM in C++

## Sparse MVM Procedure in DMRG

#### DMRG

- Density-Matrix Renormalization Group Algorithm
- Used for solving quantum problems in solid state physics and theoretical chemistry
- Alternative to expensive (resource-consuming) Exact Diagonalization
- Core of DMRG: Sparse matrix-vector multiplication (in Davidson diagonalization)
  - Dense matrices as matrix and vector components
  - Dominant operation at lowest level: dense matrix-matrix multiply (use optimized Level 3 BLAS!)
- Parallelization approaches:
  - Use parallel BLAS (no code changes)
  - Parallelize sparse MVM using OpenMP











## DMRG:

```
OpenMP Parallelization
    Implementation of parallel sparse MVM – pseudocode
    (prologue loops)
    // store all block references in block_array
    ics=0;
    for (\alpha=0; \alpha < number of hamiltonian terms; \alpha++) {
       term = hamiltonian_terms[α];
       for (k=0 ; k < term.number_of_blocks; k++) {</pre>
           block_array[ics]=&term[q];
           ics++;
       }
    }
    icsmax=ics;
    // set up lock lists
    for(i=0; i < MAX_NUMBER_OF_THREADS; i++)</pre>
       mm[i] = new Matrix // temp.matrix
    for (i=0; I < MAX NUMBER OF LOCKS; i++) {</pre>
       locks[i] = new omp_lock_t;
       omp_init_lock(locks[i]);
    }
21.07.2003
                   Georg Hager, RRZE
                                                 Optimization Case Studies
                                                                            47
```

```
DMRG:
OpenMP Parallelization
Implementation of parallel sparse MVM – pseudocode (main loop)
                                                                 HPC
// W: wavevector ; R: result
#pragma omp parallel private(mymat, li, ri, myid, ics)
{
   myid = omp get thread num();
   mytmat = mm[myid]; // temp thread local matrix
#pragma omp for
                                                Fused (\alpha_{k}) loop
   for (ics=0; ics< icsmax; ics++) {</pre>
      li = block_array[ics]->left_index;
      ri = block_array[ics]->right_index;
      mytmat = block_array[ics]->B.transpose() * W[ri];
      omp_set_lock(locks[li]);
                                                Protect each block of
      R[li] += block_array[ics]->A * mytmat;
                                               result vector R with
      omp_unset_lock(locks[li]);
                                                locks
   }
}
```

#### DMRG: OpenMP Parallelization



The parallel code is compliant to the OpenMP standard



 However: NO system did compile and produce correct results with the initial MVM implementation!

| IBM xIC V6.0            | OpenMP locks prevent omp for parallelization                                      | Fixed by IBM                               |
|-------------------------|-----------------------------------------------------------------------------------|--------------------------------------------|
| Intel efc V7<br>ifc V7  | Severe problems with orphaned omp<br>critical directives in class<br>constructors | Does not work                              |
| SUN forte7              | Does not allow<br>omp critical inside<br>C++ classes!                             | Does not work<br>(Forte 8 EA does<br>work) |
| SGI MIPSpro<br>7.3.1.3m | Complex data structures<br>can not be allocated inside<br>omp parallel regions    | Allocate everything outside loop           |

21.07.2003

Georg Hager, RRZE

Optimization Case Studies







