Two-part minisymposium at the SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP14), Portland, OR, February 18-21, 2014:

Optimizing Stencil-Based Algorithms

Abstract:

Stencil or stencil-like algorithms are the core of many numerical solvers and simulation codes. There is a vast literature on parallelizing and optimizing stencil codes on modern computer architectures, and work is ongoing in many directions. Hardware features like wide SIMD parallelism, (massive) threading, multi-level caches, and increasing core counts complicate matters and fuel the trend towards software abstractions and automatic tuning frameworks. We bring together experts who provide a comprehensive overview of the state of the art and ongoing work. Various approaches, from domain-specific languages to performance models, and from auto-tuning to hardware-specific optimizations, will be covered.
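For readers new to the term, a stencil update computes each grid point from a fixed pattern of neighboring points. The following is a minimal, unoptimized sketch of one sweep of a 2D five-point Jacobi stencil in plain Python. The function name `jacobi_sweep` and the 5x5 test grid are illustrative assumptions and are not taken from any of the talks.

```python
def jacobi_sweep(src):
    """Return one 5-point Jacobi sweep over the interior of a 2D grid.

    `src` is a list of equally long rows; boundary values are kept unchanged.
    """
    ny, nx = len(src), len(src[0])
    dst = [row[:] for row in src]  # copy so that boundaries carry over
    for j in range(1, ny - 1):
        for i in range(1, nx - 1):
            # Each interior point becomes the average of its four neighbors.
            dst[j][i] = 0.25 * (src[j - 1][i] + src[j + 1][i]
                                + src[j][i - 1] + src[j][i + 1])
    return dst


# Hypothetical 5x5 grid with a single nonzero value in the center.
grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 4.0
result = jacobi_sweep(grid)
```

After one sweep, the center value has been distributed to its four direct neighbors. Production stencil codes apply the same access pattern to much larger grids, which is where the cache, SIMD, and threading issues named in the abstract come into play.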

Organizers:

David E. Keyes, King Abdullah University of Science & Technology (KAUST), Saudi Arabia
Jan Treibig, Erlangen Regional Computing Center, Germany
Georg Hager, Erlangen Regional Computing Center, Germany
Gerhard Wellein, Erlangen Regional Computing Center, Germany

Part 1 (MS58)

Introduction to the minisymposium
David E. Keyes, King Abdullah University of Science & Technology (KAUST), Saudi Arabia
Alleviating memory bandwidth pressure with wavefront temporal blocking and diamond tiling
Tareq Malas, King Abdullah University of Science & Technology (KAUST), Saudi Arabia; Georg Hager and Gerhard Wellein, Erlangen Regional Computing Center, Germany; David E. Keyes, King Abdullah University of Science & Technology (KAUST), Saudi Arabia
Performance Engineering for Stencil Updates on Modern Processors
Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein, Erlangen Regional Computing Center, Germany
Abstract: We apply the recently introduced ECM (Execution-Cache-Memory) performance model for multicore processors to stencil and stencil-like algorithms. ECM is an extension of the Roofline Model and allows a prediction of single-core performance and saturation properties of streaming codes on multicore chips. This leads to deeper insight into performance behavior and energy consumption and enables a model-guided performance engineering approach, in which the concept of “optimal performance” is well-defined. Case studies for several short- and long-range stencil codes are presented.
Compiler-Automated Communication-Avoiding Optimization of Geometric Multigrid
Protonu Basu, University of Utah, USA; Samuel Williams and Brian Van Straalen, Lawrence Berkeley National Laboratory, USA; Anand Venkat, University of Utah, USA; Leonid Oliker, Lawrence Berkeley National Laboratory, USA; Mary Hall, University of Utah, USA
Abstract: We describe a compiler approach to introducing communication-avoiding optimizations in geometric multigrid (GMG), one of the most popular methods for solving partial differential equations. Communication-avoiding optimizations reduce vertical communication through the memory hierarchy and horizontal communication across processes or threads, usually at the expense of introducing redundant computation. We focus on applying these optimizations to the smooth operator, which successively reduces the error and accounts for the largest fraction of the GMG execution time. Our compiler technology applies a set of novel and known transformations to derive an implementation comparable to hand-written optimizations. An underlying autotuning system explores the tradeoff between reduced communication and increased computation, as well as tradeoffs in threading schemes, to automatically identify the best implementation for a particular architecture.
Automatic Generation of Algorithms and Data Structures for Geometric Multigrid
Harald Koestler and Sebastian Kuckuk, Universität Erlangen-Nürnberg, Germany
Abstract: Multigrid is one of the most efficient methods for the numerical solution of PDEs. However, the concrete multigrid algorithm and its implementation depend strongly on the underlying problem and hardware. Therefore, many different variants are necessary to cover all relevant cases. We try to generalize the data structures and multigrid components required to solve elliptic PDEs on Hierarchical Hybrid Grids (HHG) in order to formulate them in an intuitive domain-specific language and automatically generate them.
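
As background to the Roofline Model referred to in the performance engineering abstract of this part: in its simplest form, it bounds attainable performance by the smaller of the peak arithmetic rate and the product of memory bandwidth and computational intensity. The sketch below is only an illustration of that bound, not of the ECM model itself. The function name `roofline_gflops` and all machine numbers are hypothetical, and the traffic of 24 bytes per update for a double-precision five-point Jacobi stencil assumes ideal cache reuse plus a write-allocate transfer.

```python
def roofline_gflops(peak_gflops, bandwidth_gbytes_per_s, flops_per_byte):
    """Simple Roofline bound: min(peak performance, bandwidth * intensity)."""
    return min(peak_gflops, bandwidth_gbytes_per_s * flops_per_byte)


# Hypothetical machine: 100 GFlop/s peak, 40 GB/s memory bandwidth.
PEAK = 100.0
BANDWIDTH = 40.0

# Assumed 5-point Jacobi in double precision: 4 flops per update and
# 24 bytes of memory traffic (8 B load, 8 B store, 8 B write-allocate).
jacobi_intensity = 4.0 / 24.0

memory_bound = roofline_gflops(PEAK, BANDWIDTH, jacobi_intensity)

# A kernel with a much higher (made-up) intensity hits the compute roof.
compute_bound = roofline_gflops(PEAK, BANDWIDTH, 10.0)
```

Under these assumptions the stencil is limited by memory bandwidth to well below peak, which is why techniques that reduce memory traffic, such as temporal blocking, feature prominently in this minisymposium.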

Part 2 (MS66)

Stencil Computations: From Academia to Industry
Raul de la Cruz, Mauricio Hanzich, and Jose Maria Cela, Barcelona Supercomputing Center, Spain
Abstract: Synthetic benchmarks used in academia may give false impressions of optimization techniques, and stencil computations are no exception. When these techniques are applied to industrial codes, their impact can be deceiving. Furthermore, some academic optimizations cannot be easily implemented for industrial numerical codes. The industrial complexity can range considerably: from a simple 2nd-order spatial stencil to burdensome cases with 8th-order spatial stencils, dozens of variables, and staggered grids with many parameters.
Evaluating Compiler-driven Parallelization of Stencil Micro-applications on a GPU-enabled Cluster
Dmitry Mikushin and Olaf Schenk, Università della Svizzera italiana, Switzerland
Abstract: In this talk we will demonstrate how parallelization and further optimization of stencil codes for GPUs can be automated by compiler toolchains. Using a wave equation stencil as an example, hand-written naive and locality-optimized versions will be compared against compiler-generated parallel code in terms of roofline performance, tiling efficiency, JIT compilation, and other properties. Benchmark results for the KernelGen and PPCG auto-parallelizing compilers, as well as one commercial OpenACC compiler, will be presented for a set of 10 stencil micro-applications. Finally, we will show how automatically parallelized code for a very large wave propagation problem performs on the Piz Daint supercomputer.
Firedrake: a Multilevel Domain Specific Language Approach to Unstructured Mesh Stencil Computations
Gheorghe-Teodor Bercea, David Ham, Paul Kelly, Nicolas Loriant, Fabio Luporini, Lawrence Mitchell, and Florian Rathgeber, Imperial College London, United Kingdom
Abstract: How do we enable scientists to specify in a simple and mathematically expressive way the simulation they wish to perform, but still make efficient use of the diverse massively parallel hardware which will dominate supercomputing over the coming years? Firedrake is a multilayer abstraction package which generates unstructured mesh numerical partial differential equation solvers. It builds on the success of the Unified Form Language as an expressive symbolic language for the finite element method, and combines it with the runtime targeting of different parallel architectures provided by the PyOP2 system.
Tuning Sparse and Dense Matrix Operators in SeisSol
Alexander Breuer, Sebastian Rettenberger, and Alexander Heinecke, Technische Universität München, Germany; Christian Pelties, Ludwig-Maximilians-Universität München, Germany; Michael Bader, Technische Universität München, Germany
Abstract: In this talk we show recent advances in increasing the computational performance of the software package SeisSol, one of the leading codes for the simulation of earthquake scenarios. SeisSol uses the discontinuous Galerkin method combined with flexible unstructured tetrahedral meshes for spatial discretization and Arbitrary high order DERivatives (ADER) for time discretization. We present our node-level optimization strategy based on hardware-aware code generation for element-local, low-rank matrix operations ranging from sparse to near dense on Intel Sandy Bridge and Xeon Phi. The involved matrix sparsity patterns are known a priori, which allows us to eliminate all indirect matrix accesses and generate AVX vector instructions in the pre-compile phase, utilising the hardware at high performance.
