Georg Hager's Blog

Random thoughts on High Performance Computing

ISC12

Half-day tutorial at the International Supercomputing Conference 2012 (ISC12), Hamburg, Germany, June 17-21, 2012:

Performance-oriented programming on multicore-based Clusters with MPI, OpenMP, and hybrid MPI/OpenMP

 

Slides: ISC12-Tutorial-MC-hybrid-ALL-final.pdf

 

Authors

Georg Hager, Gabriele Jost, Rolf Rabenseifner, Jan Treibig

Erlangen Regional Computing Center
University of Erlangen-Nuremberg
Germany
{georg.hager,jan.treibig}@rrze.uni-erlangen.de

Advanced Micro Devices
USA
gabriele.jost@amd.com

High Performance Computing Center Stuttgart
Germany
rabenseifner@hlrs.de

 

Abstract

Most HPC systems are clusters of multicore, multisocket nodes. These systems are highly hierarchical, and several programming models can be used on them; the most popular are shared-memory parallel programming with OpenMP within a node, distributed-memory parallel programming with MPI across all cores of the cluster, and a combination of both. Obtaining good performance with any of these models requires considerable knowledge about the system architecture and the requirements of the application. The goal of this tutorial is to provide insights into performance limitations and guidelines for program optimization techniques on all levels of the hierarchy when using pure MPI, pure OpenMP, or a combination of both.

We cover peculiarities like shared vs. separate caches, bandwidth bottlenecks, and ccNUMA locality. Typical performance properties such as synchronization overhead, intranode MPI bandwidths and latencies, ccNUMA page placement, and bandwidth saturation (in cache and memory) are discussed in order to pinpoint the influence of system topology and thread affinity on the performance of parallel programming constructs. Techniques and tools for establishing process/thread placement and for measuring performance metrics are demonstrated in detail. We also analyze the strengths and weaknesses of various hybrid MPI/OpenMP programming strategies. Benchmark results and case studies on several platforms are presented.

Detailed description

Tutorial goals

Conventional programming models like MPI and OpenMP were not designed with highly hierarchical systems in mind. Straightforward porting of existing applications to clusters of complex multicore/multisocket shared memory nodes often leads to unsatisfactory and/or nonreproducible performance results. This is due to the fact that ignoring the specific features of those systems almost inevitably has a negative impact on program performance. Attendees will

  • gain awareness of the typical multicore-specific bottlenecks and obstacles one has to face when writing efficient parallel software with MPI and/or OpenMP
  • gain awareness of the “anisotropy” and “asymmetry” issues of multicore/multisocket nodes and the necessity to write efficient parallel software already on the node level
  • learn about a thorough methodology to deduce the qualitative performance behavior of applications from cache/socket topologies and low-level benchmark data
  • learn about tools for probing multicore and socket topology and about how to establish proper thread-core mapping for OpenMP, MPI, and hybrid MPI/OpenMP
  • learn about tools for measuring important application performance metrics (bandwidth, flops, cycles per instruction, load balance, …) in order to assess the resource requirements of a running application, and about the correct and useful interpretation of such data
  • learn about how to leverage the shared cache feature of modern multicore chips
  • get an overview of the thread-safety level of MPI libraries
  • get a systematic introduction into the options for running parallel programs on modern clusters, including a “beginner’s how-to” and examples from real systems
  • learn about subtle pitfalls and opportunities when employing hybrid MPI/OpenMP
  • see sample applications, benchmark codes, and performance data for different hardware platforms to illustrate all topics
  • get an overview of the basic architectural features of the latest Intel and AMD multicore processors, as a first insight into their efficiency in different application areas

Targeted audience

Everyone who writes efficient parallel software (MPI, OpenMP, hybrid) or runs computationally intensive parallel applications in typical multicore/multisocket environments

Content level

50% Introductory, 25% Intermediate, 25% Advanced

Audience prerequisites

Some knowledge about parallel programming with MPI and OpenMP in one of the dominant HPC languages (C/C++ or Fortran)

General content and detailed outline of the tutorial

There is a common misconception, especially among application programmers and software users, that the building blocks of modern HPC clusters are symmetric multiprocessors. The hierarchical and possibly asymmetric nature of these systems is often overlooked. The hierarchies arise from the inner structure of multicore sockets, from the way the sockets are connected to form shared memory nodes, and from the way the nodes are connected to each other. With this tutorial we aim to raise awareness of performance issues that arise from socket “anisotropy” as well as node and cluster topology, and we convey the knowledge required to write efficient software and circumvent performance bottlenecks.

We start by giving an overview of multicore processor architecture, as far as it is relevant in HPC. While putting multiple cores on a chip or in a package without shared resources apart from the memory connection is nothing but a peak-performance/bottleneck game, it introduces a new hierarchy level into system design, and has to be considered by programmers and software users alike. Additionally, the shared cache feature of more modern processors has brought about a real opportunity for advanced code optimizations. The tutorial will cover the following relevant issues:

 

  • Bandwidth bottlenecks. What are the consequences of having more than one core use the bandwidth of a single memory bus? Low-level benchmarks and application data will be used to demonstrate the typical bandwidth saturation effects in current multicore processors (a minimal benchmark sketch is shown after this list).
  • ccNUMA locality. There are hardly any UMA-type compute nodes left in current HPC systems. Although not strictly a multicore issue, ccNUMA locality is an important prerequisite of good node performance for memory-bound code. We will show how to employ ccNUMA page placement via the first-touch policy, via external tools, and under program control (see the first-touch sketch after this list). The question of how to deal with ccNUMA if dynamic scheduling or tasking cannot be avoided is given due attention.
  • Shared vs. separate caches, node topology. What are the pros and cons of having a shared cache used by multiple cores in a package, and how can system topology impact program performance? We will show the influence of shared caches and overall system topology on OpenMP overheads (synchronization, dynamic scheduling) and on intra-node MPI behavior, and point out how topology awareness can pay off. With a growing number of cores and (possibly) MPI processes per node, there is a lot of optimization potential in the mapping between MPI ranks and sub-domains when using domain decomposition; we will point out how this mapping can be improved if the MPI library is not aware of this problem. The influence and significance of hardware threading (also called “simultaneous multithreading”) is demonstrated. An often-used argument against shared caches is the bandwidth bottleneck they create. We will thoroughly analyze when and why a shared cache may pose a bandwidth problem on current multicore processors.
  • System topology and affinity control. How can the arrangement of hardware threads, cores, caches, and sockets (“packages”) be properly probed, and how can affinity (“pinning”) be enforced for OpenMP threads, MPI processes, and both at the same time? We will discuss tools and techniques to probe node topology and to establish thread/core (or software thread/hardware thread) affinity; a small affinity-checking sketch is shown after this list.
  • Leveraging shared caches for performance. Can shared caches be used to boost parallel performance beyond improved OpenMP and MPI overheads? We will demonstrate a strategy (coined “pipelined temporal blocking”) that explicitly uses shared caches on multicore chips to overlap memory traffic and useful computation, and show how mutual thread synchronization can be used to work around the large penalties for global OpenMP barriers.
  • Hybrid MPI/OpenMP programming. Parallel programming on clusters of multicore nodes must combine distributed-memory parallelization across the node interconnect with shared-memory parallelization inside each node. This tutorial analyzes the strengths and weaknesses of several parallel programming models on clusters of SMP nodes. Various hybrid MPI/OpenMP programming models are compared with pure MPI, with case studies and benchmark results presented for several platforms. Bandwidth and latency are shown for intra-socket, inter-socket, and inter-node communication. The affinity of processes, their threads, and their memory is a key factor. The thread safety status of several existing MPI libraries is also discussed (a minimal skeleton for requesting an MPI thread support level is shown after this list).
  • Hybrid programming pitfalls and opportunities. Possible pitfalls and opportunities of hybrid MPI/OpenMP programming are identified on a general level. The important role of functional decomposition for achieving explicitly asynchronous MPI communication is pointed out: sacrificing a single core for communication can be tolerable if the communication overhead can be hidden behind useful computation (see the overlap sketch after this list).
  • Performance analysis. There is an abundance of complex performance tools whose massive functionality all too often overwhelms the average user. We present simple tools that can be used to get a quick overview on the requirements of an application with respect to the underlying hardware, and show how to correctly interpret the data obtained.
  • Case studies. Numerous case studies, using low-level code and application benchmarks, are used to get our messages across.
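
To illustrate the bandwidth saturation effect mentioned in the “Bandwidth bottlenecks” item, here is a minimal OpenMP vector triad sketch; the array size, iteration count, and the simple byte counting (write-allocate traffic is ignored) are illustrative assumptions. Running it with increasing OMP_NUM_THREADS on a single socket typically shows the measured bandwidth leveling off well before all cores are in use.

/* minimal vector triad sketch; sizes and byte counting are illustrative */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N     20000000L   /* large enough not to fit in any cache */
#define NITER 10

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    double *d = malloc(N * sizeof(double));

    /* parallel first-touch initialization (see the ccNUMA sketch below) */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; ++i) {
        a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; d[i] = 3.0;
    }

    double start = omp_get_wtime();
    for (int it = 0; it < NITER; ++it) {
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; ++i)
            a[i] = b[i] + c[i] * d[i];
        if (a[N/2] < 0.0) printf("dummy\n");  /* prevent dead-code elimination */
    }
    double elapsed = omp_get_wtime() - start;

    /* 4 arrays * 8 bytes touched per loop iteration; write-allocate ignored */
    double gbytes = 4.0 * sizeof(double) * N * NITER / 1.0e9;
    printf("%d threads: %.2f GB/s\n", omp_get_max_threads(), gbytes / elapsed);

    free(a); free(b); free(c); free(d);
    return 0;
}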
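
The ccNUMA item above refers to page placement via first touch. The following sketch, with made-up array sizes, shows the essential pattern: the data must be initialized in parallel with the same access pattern (and the same static schedule) that the compute loops use later, because a memory page is mapped into the NUMA domain of the thread that writes it first.

/* ccNUMA-aware initialization via the first-touch policy (sketch) */
#include <stdlib.h>
#include <omp.h>

#define N 50000000L

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));

    /* BAD (for later parallel access): a serial initialization loop would
       map all pages into the NUMA domain of the master thread */

    /* GOOD: parallel first touch with the same schedule as the compute loop */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; ++i) {
        a[i] = 0.0;
        b[i] = 1.0;
    }

    /* compute loop: with static scheduling, each thread now works mostly
       on data that resides in its own NUMA domain */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; ++i)
        a[i] = 2.0 * b[i];

    free(a); free(b);
    return 0;
}

External tools such as numactl can also be used to influence placement (e.g., interleaving pages round-robin across domains), which is often a reasonable fallback when a clean first-touch initialization is impractical.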
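
For the topology and affinity item, a quick way to check whether pinning is in effect is to let every thread report the core it currently runs on. The sketch below uses the Linux-specific sched_getcpu() call; on other operating systems a different mechanism is needed.

/* affinity check: each OpenMP thread reports its current core (Linux) */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        #pragma omp critical
        printf("Thread %2d of %2d runs on core %d\n",
               omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    return 0;
}

The pinning itself can be established, for example, with an external wrapper such as likwid-pin (e.g., likwid-pin -c 0-3 ./a.out) or with compiler-specific environment variables like GOMP_CPU_AFFINITY or KMP_AFFINITY.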
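
The hybrid MPI/OpenMP item mentions the thread safety status of MPI libraries. A hybrid code should request the thread support level it needs at initialization and check what the library actually provides; the following skeleton assumes MPI_THREAD_FUNNELED, i.e., only the main thread makes MPI calls.

/* minimal hybrid MPI/OpenMP skeleton with thread support level check */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    int provided, rank;
    /* MPI_THREAD_FUNNELED: only the main thread will make MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (provided < MPI_THREAD_FUNNELED && rank == 0)
        printf("Warning: MPI library only provides thread level %d\n", provided);

    #pragma omp parallel
    {
        /* computational threads live here; under FUNNELED, MPI calls must be
           made outside the parallel region or by the master thread only */
        #pragma omp critical
        printf("Rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}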
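
Finally, the functional decomposition idea from the pitfalls-and-opportunities item can be sketched as follows; the array sizes and the dummy interior update are made up for illustration. Thread 0 performs a halo exchange while the remaining threads update the interior, so communication is hidden behind computation as long as the interior work takes at least as long as the message transfer.

/* sketch: sacrifice one thread for communication (MPI_THREAD_FUNNELED) */
#include <stdlib.h>
#include <mpi.h>
#include <omp.h>

#define NHALO 1000
#define N     1000000L

int main(int argc, char **argv) {
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *sendbuf = calloc(NHALO, sizeof(double));
    double *recvbuf = calloc(NHALO, sizeof(double));
    double *work    = calloc(N, sizeof(double));
    int right = (rank + 1) % size, left = (rank + size - 1) % size;

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nth = omp_get_num_threads();
        if (tid == 0) {
            /* communication thread (the main thread, as FUNNELED requires) */
            MPI_Sendrecv(sendbuf, NHALO, MPI_DOUBLE, right, 0,
                         recvbuf, NHALO, MPI_DOUBLE, left,  0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            /* the other threads split the interior work by hand; an OpenMP
               worksharing loop cannot be used inside this branch because it
               would not be encountered by the whole team */
            long chunk = N / (nth - 1);
            long lo = (long)(tid - 1) * chunk;
            long hi = (tid == nth - 1) ? N : lo + chunk;
            for (long i = lo; i < hi; ++i)
                work[i] = 2.0 * work[i] + 1.0;
        }
    } /* implicit barrier: halo exchange and interior update are both done */

    /* ... boundary cells that depend on recvbuf would be updated here ... */

    free(sendbuf); free(recvbuf); free(work);
    MPI_Finalize();
    return 0;
}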