KONWIHR

Kompetenznetzwerk für wissenschaftliches Höchstleistungsrechnen in Bayern (Competence Network for Scientific High-Performance Computing in Bavaria)

Content

Performance tuning of high-order discontinuous Galerkin solvers for SuperMUC-NG

Applicant

Dr. Martin Kronbichler
Institute for Computational Mechanics
Technical University of Munich
Boltzmannstr. 15
85747 Garching b. München

Project overview

We propose new algorithmic components for our high-order discontinuous Galerkin codes that deliver substantially better performance on the SuperMUC-NG system. The discontinuous Galerkin method is used within a sophisticated incompressible flow solver designed for complex geometries, with advanced modeling features for turbulence. Our implementation relies on matrix-free evaluation of finite element operators, which achieves much higher application performance than traditional matrix-based kernels because of its higher arithmetic intensity. Nonetheless, the current implementation has come close to the hardware limits: with the optimal arithmetic complexity of sum factorization techniques and the optimizations we have developed over recent years, operator evaluation almost saturates the memory bandwidth of current Intel Haswell and Broadwell server processors. Besides operator evaluation, the other major cost factor is BLAS-1-type vector operations, which are inherently memory bound. At the same time, the projected SuperMUC-NG system with 48 cores per node will increase arithmetic throughput by around 3.3× compared to SuperMUC Phase 2, whereas memory bandwidth will increase by only 1.8×. Thus, an even larger share of our code is expected to be memory bound on the new system. In this project we propose to develop new algorithms that relax this limit in two ways: by fusing loops across several algorithmic components, and by computing more quantities on the fly, using novel approaches for the geometry representation, which is often the most pressing bottleneck. With these developments we expect a speedup of around 2.5× per node on the new system, or 1.5× per core, whereas the current code would likely see no increase in throughput per core.
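
The scaling argument above can be illustrated with a simple roofline-style estimate. The following is a minimal sketch, not part of the project's code: it normalizes the SuperMUC Phase 2 node to a peak of 1 and a bandwidth of 1, applies the 3.3× and 1.8× factors quoted above (read here as per-node figures), and assumes 28 cores per node for SuperMUC Phase 2, a figure not stated in the text. The function names and the notion of a "relative arithmetic intensity" are illustrative choices.

```python
# Roofline-style estimate of speedup, normalized to a SuperMUC Phase 2 node.
# All quantities are relative: the old node has peak = 1 and bandwidth = 1.

PEAK_RATIO = 3.3       # arithmetic throughput increase (from the text, read as per node)
BANDWIDTH_RATIO = 1.8  # memory bandwidth increase (from the text, read as per node)
CORES_NEW = 48         # cores per node on SuperMUC-NG (from the text)
CORES_OLD = 28         # ASSUMPTION: cores per node on SuperMUC Phase 2


def attainable(intensity, peak, bandwidth):
    """Roofline bound: performance is capped by compute or by memory traffic."""
    return min(peak, intensity * bandwidth)


def node_speedup(intensity):
    """Speedup per node for a kernel with the given relative arithmetic intensity.

    An intensity of 1.0 is the ridge point of the old node, i.e. the point
    where the kernel is limited equally by compute and by memory bandwidth.
    """
    old = attainable(intensity, 1.0, 1.0)
    new = attainable(intensity, PEAK_RATIO, BANDWIDTH_RATIO)
    return new / old


def core_speedup(intensity):
    """Speedup per core: node speedup divided by the growth in core count."""
    return node_speedup(intensity) / (CORES_NEW / CORES_OLD)


if __name__ == "__main__":
    for intensity in (0.25, 1.0, 1.5, 2.0):
        print(f"intensity {intensity:4.2f}: "
              f"node {node_speedup(intensity):.2f}x, "
              f"core {core_speedup(intensity):.2f}x")
```

Under these assumptions, a kernel that is memory bound on the old node gains only the bandwidth factor of 1.8× per node, which is close to 1× per core once the larger core count is accounted for. In this simplified model, the targeted 2.5× per node would correspond to raising the effective arithmetic intensity to roughly 1.4 times the old ridge point, which is the kind of shift that loop fusion and on-the-fly computation aim for.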