Georg Hager's Blog

Random thoughts on High Performance Computing


HPC Book

Introduction to High Performance Computing for Scientists and Engineers

This page provides accompanying information for the book “Introduction to High Performance Computing for Scientists and Engineers” by Georg Hager and Gerhard Wellein, published by CRC Press, ISBN 978-1439811924, in CRC’s Computational Science Series.

The book

Teaching material

We are aware that our book is now recommended reading for many courses on HPC and related topics. Most of the material (lecture slides, exercises, solution slides) we use in our own lectures and courses is available for download via our “Moodle” learning management system. A password is required only if you want to enrol in a course; to just browse the material, log in as a guest user. You may also want to switch the site language to English (upper right corner).

There is also a list of all courses, lectures, and other teaching activities conducted by the authors.

List of errata

  • Clarification: Page 7, Section 1.2.2, first paragraph: G. Moore’s original statement in his 1965 paper was that the increase in transistor count would be a factor of two every 12 months; he later revised that estimate to be 24 months. See also the comment to bibliography entry [R35].
  • Error: Page 48, Section 2.3.2, lower code example: Line 3 should read
        if( then
  • Error: Page 90, Section 3.6.2, code example: Line 12 should read
        C(i) = C(i)+val(offset+i)*B(col_idx(offset+i))
  • Typo: Page 131, Section 5.3.7, second paragraph, end of sixth line: Read “… and even to the complex cache group …”.
  • Error: Page 145, Section 6.1.1, second paragraph, third line after the first code box: Read “… there are include files omp_lib.h and omp.h, respectively”.
  • Error: Page 149, Section 6.1.4, code example: The code of the subroutine func() is buggy. It should look like this:
      double precision FUNCTION func(v)
      double precision :: v
    !$OMP CRITICAL (prand)
      func = v + random_func()
    !$OMP END CRITICAL (prand)
      END FUNCTION func
  • Error: Page 163, Problem 6.3, last line in paragraph: Read “… would have made the names obsolete?”.
  • Typo: Page 193, Section 8.3.1, fourth line after code example: Missing space after “Figure”.
  • Error: Listing on page 228, line 41: Extra “0” parameter to MPI_Allreduce() is wrong and should be omitted.
  • Error: Page 240, Figure 10.5: Rank 0 does not call MPI_Recv().
  • Typo: Color inset after page 262, caption of Figure 7.1: “{STATIC}” should be “STATIC”.
  • Typo: Page 314, bibliography entry [O60]: Read “Poisson”.
  • Typo: Page 319, bibliography entry [W118]: Read “Evaluation”.

Sample code

With these sample codes we want to give readers an impression of how we do benchmarking. They are provided as-is, without any warranties. Getting a program straight so that you’re not embarrassed to show it around is harder than one might think!

ZIP file with all sources:

File(s) and descriptions:

  • timing.h, timing.c: Wallclock timing function in C, using the gettimeofday() POSIX-compliant library call.
  • triad.f90: Vector triad benchmark code, including Intel-specific compiler directives to optimize performance (array alignment, nontemporal stores).
  • dummy.c: Dummy function taking four double* arguments, used to keep the compiler from optimizing away benchmark loops.
  • Jacobi_MPI_vectormode.F: Fortran source for the 3D pure MPI or hybrid “vector mode” Jacobi solver. Compile with thread interoperability enabled if you want to run multi-threaded MPI processes (Intel MPI wrapper option -mt_mpi). Requires “long source lines” (Intel compiler option -132).
  • Jacobi_MPI_taskmode.F: Fortran source for the 3D pure MPI or hybrid “task mode” Jacobi solver. Compile with thread interoperability enabled (Intel MPI wrapper option -mt_mpi). Requires “long source lines” (Intel compiler option -132). Caveat: The performance numbers in the book were obtained in a very controlled and simple environment. You need to know what you are doing in order to get reasonable results. Take special care of thread/process affinity control (you may want to consider using likwid-mpirun from the LIKWID tool suite).
  • input.1200: Input file for the Jacobi solver code, to be piped into stdin of MPI rank 0. Some MPI implementations may need an option for this upon startup. The file has three lines (one per spatial dimension), each with three entries: the number of grid points in this dimension, the number of processes along this dimension (0 = let MPI decide), and the flag for periodic boundary conditions (t = periodic, f = open).

Annotated bibliography

The following list should enable readers to find the original publications more quickly. It corresponds almost exactly to the bibliography in the book. We have added links, abstracts (where available), and occasional comments to make it easier to judge the relevance of each reference. Note that a link to the arXiv preprint server may point to an early version of a paper that does not reflect its final state as published in a journal or book.

Some references have been added after the book went to print, but the original numbering has been retained: additional references are either labeled with a letter suffix (e.g., [P17a]) or appended at the end.

Quick links to the bibliography entries
Bibliography sections Entries
Standard works [S1] [S1a] [S2] [S3] [S4] [S5] [S5a] [S5b] [S6] [S7]
Parallel programming [P8] [P9] [P10] [P11] [P12] [P13] [P14] [P15] [P16] [P17] [P17a] [P18]
Tools [T19] [T20] [T21] [T22] [T23] [T24] [T25] [T26] [T27] [T28] [T29] [T30] [T31] [T32] [T33]
Computer architecture and design [R34] [R35] [R36] [R37] [R38] [R39] [R40]
Performance modeling [M41] [M42] [M43] [M44] [M45] [M46] [M47] [M48]
Numerical techniques and libraries [N49] [N50] [N51]
Optimization techniques [O52] [O53] [O54] [O55] [O56] [O57] [O58] [O59] [O60] [O61] [O62] [O63] [O64] [O65] [O65a] [O66] [O67] [O68] [O69] [O70] [O71] [O72] [O73] [O74] [O75]
Large-scale parallelism [L76] [L77] [L78]
Applications [A79] [A80] [A81] [A82] [A83] [A84] [A85] [A86] [A87] [A88] [A89] [A90]
C++ references [C91] [C92] [C93] [C94] [C95] [C96] [C96a] [C96b] [C97] [C98] [C99] [C100] [C101] [C102] [C103]
Vendor-specific information and documentation [V104] [V105] [V106] [V107] [V108] [V109] [V110] [V111] [V112] [V113] [V114] [V115] [V116] [V117]
Web sites and online resources [W118] [W119] [W120] [W121] [W122] [W123] [W123a] [W123b] [W124] [W125] [W126] [W127] [W128]
Computer history [H129] [H130] [H131]
Miscellaneous [132] [133] [134] [135] [136] [137] [138] [139]

Standard works

[S1] S. Goedecker and A. Hoisie. Performance Optimization of Numerically Intensive Codes (SIAM), 2001. ISBN 978-0898714845.

[S1a] The SGI Origin 2000 and Onyx2 Performance Tuning and Optimization Guide (SGI), 2001. SGI document number 007-3430-003. URL

Comment: This is still a great resource for learning the basics of serial and parallel code optimization, although those machines have long since disappeared. Just ignore the system-specific stuff.

[S2] R. Gerber, A. J. C. Bik, K. Smith and X. Tian. The Software Optimization Cookbook (Intel Press), 2nd ed., 2005. ISBN 978-0976483212.

[S3] K. Dowd and C. Severance. High Performance Computing (O’Reilly & Associates, Inc., Sebastopol, CA, USA), 1998. ISBN 156592312X.

[S4] K. R. Wadleigh and I. L. Crawford. Software Optimization for High-Performance Computing (Prentice Hall), 2000. ISBN 978-0130170088.

[S5] W. Schönauer. Scientific Supercomputing: Architecture and Use of Shared and Distributed Memory Parallel Computers (Self-edition), 2000. URL

Comment: This book, although outdated in terms of the covered system architectures, is still a fine reference for the use of simple performance models. The vector computers described have long since disappeared. The following two papers, although they are by no means “standard works”, provide a more current view on the performance properties of vector processors in comparison to “commodity” clusters.

[S5a] T. Zeiser, G. Hager and G. Wellein. Vector Computers in a World of Commodity Clusters, Massively Parallel Systems and Many-Core Many-Threaded CPUs: Recent Experience Based on an Advanced Lattice Boltzmann Flow Solver. In: W. E. Nagel et al., High Performance Computing in Science and Engineering ’08 (Springer-Verlag, Berlin, Heidelberg), 333–347, ISBN 978-3540883012. DOI: 10.1007/978-3-540-88303-6_24

Abstract: This report summarizes experience gained during the last year using the NEC SX-8 at HLRS and its wide range of competitors: commodity clusters with Infiniband interconnect, massively parallel systems (Cray XT4, IBM BlueGene L/P) and emerging many-core many-threaded CPUs (SUN Niagara2 processor). The observations are based on low-level benchmarks and the use of an advanced lattice Boltzmann flow solver developed in the framework of an international development consortium (ILBDC).

[S5b] T. Zeiser, G. Hager and G. Wellein. Benchmark analysis and application results for lattice Boltzmann simulations on NEC SX vector and Intel Nehalem systems. Parallel Processing Letters 19(4), (2009) 491–511. DOI: 10.1142/S0129626409000389

Abstract: Classic vector systems have all but vanished from recent TOP500 lists. Looking at the recently introduced NEC SX-9 series, we benchmark its memory subsystem using the low level vector triad and employ the kernel of an advanced lattice Boltzmann flow solver to demonstrate that classic vectors still combine excellent performance with a well-established optimization approach. To investigate the multi-node performance, the flow field in a real porous medium is simulated using the hybrid MPI/OpenMP parallel ILBDC lattice Boltzmann application code. Results for a commodity Intel Nehalem-based cluster are provided for comparison. Clusters can keep up with the vector systems, however, require massive parallelism and thus much more effort to provide a good domain decomposition.

[S6] T. G. Mattson, B. A. Sanders and B. L. Massingill. Patterns for Parallel Programming (Addison-Wesley), 2004. ISBN 978-0321228116.

[S7] D. H. Bailey. Highly parallel perspective: Twelve ways to fool the masses when giving performance results on parallel computers. Supercomputing Review 4(8), (1991) 54–55. ISSN 1048-6836. URL

Abstract: Many of us in the field of highly parallel scientific computing recognize that it is often quite difficult to match the run time performance of the best conventional supercomputers. This humorous article outlines twelve ways commonly used in scientific papers and presentations to artificially boost performance rates and to present these results in the “best possible light” compared to other systems.

Comment: Recently we have tried to update the points that Bailey made, to make them more compatible with today’s HPC landscape. You can read about it in Georg Hager’s blog, or take a look at a recent talk.

Parallel programming

[P8] S. Akhter and J. Roberts. Multi-Core Programming: Increasing Performance through Software Multithreading (Intel Press), 2006. ISBN 0976483246.

[P9] D. R. Butenhof. Programming with POSIX Threads (Addison-Wesley), 1997. ISBN 978-0201633924.

[P10] J. Reinders. Intel Threading Building Blocks: Outfitting C++ for Multi-Core Processor Parallelism (O’Reilly), 2007. ISBN 978-0596514808.

[P11] The OpenMP API specification for parallel programming. URL

[P12] B. Chapman, G. Jost and R. van der Pas. Using OpenMP (MIT Press), 2007. ISBN 978-0262533027.

[P13] W. Gropp, E. Lusk and A. Skjellum. Using MPI (MIT Press), 2nd ed., 1999. ISBN 0262571323.

[P14] W. Gropp, E. Lusk and R. Thakur. Using MPI-2 (MIT Press), 1999. ISBN 0262571331.

[P15] MPI: A Message-Passing Interface Standard. Version 2.2, September 2009. URL

[P16] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek and V. Sunderam. PVM: Parallel Virtual Machine (MIT Press), 1994. ISBN 0262571080. URL

[P17] R. W. Numrich and J. Reid. Co-array Fortran for Parallel Programming. SIGPLAN Fortran Forum 17(2), (1998) 1–31. ISSN 1061-7264. DOI: 10.1145/289918.289920

Abstract: Co-Array Fortran, formerly known as F--, is a small extension of Fortran 95 for parallel processing. A Co-Array Fortran program is interpreted as if it were replicated a number of times and all copies were executed asynchronously. Each copy has its own set of data objects and is termed an image. The array syntax of Fortran 95 is extended with additional trailing subscripts in square brackets to give a clear and straightforward representation of any access to data that is spread across images. References without square brackets are to local data, so code that can run independently is uncluttered. Only where there are square brackets, or where there is a procedure call and the procedure contains square brackets, is communication between images involved. There are intrinsic procedures to synchronize images, return the number of images, and return the index of the current image. We introduce the extension; give examples to illustrate how clear, powerful, and flexible it can be; and provide a technical definition.

Comment: See also the next entry (not listed in the book’s bibliography), which describes changes from the original definition of Co-Array Fortran. An interesting collection of documents and other info regarding CAF can be found at

[P17a] R. W. Numrich and J. Reid. Co-arrays in the next Fortran Standard. SIGPLAN Fortran Forum 24(2), (2005), 4–17. DOI: 10.1145/1080399.1080400

Abstract: The WG5 committee, at its meeting in Delft, May 2005, decided to include co-arrays in the next Fortran Standard. A special issue of Fortran Forum in August 1998 explained the feature, but since many of the details of the version adopted by WG5 differ from the 1998 version, it seems appropriate to describe it afresh in a self-contained article. A Fortran program containing co-arrays is interpreted as if it were replicated a fixed number of times and all copies were executed asynchronously. Each copy has its own set of data objects and is called an image. The array syntax of Fortran is extended with additional trailing subscripts in square brackets to give a clear and straightforward representation of access to data on other images. References without square brackets are to local data, so code that can run independently is uncluttered. Only where there are square brackets, or where there is a procedure call and the procedure contains square brackets, is communication between images involved. The additional syntax requires support in the compiler, but it has been designed to be easy to implement and to give the compiler scope both to apply its optimizations within each image and to optimize the communication between images. The extension includes intrinsic procedures to synchronize images, to return the number of images, to return the index of the current image, and to perform collective actions.

[P18] W. W. Carlson, J. M. Draper, D. E. Culler, K. Yelick, E. Brooks and K. Warren. Introduction to UPC and language specification. Tech. rep., IDA Center for Computing Sciences, Bowie, MD, 1999. URL

Abstract: UPC is a parallel extension of the C programming language intended for multiprocessors with a common global address space. A descendant of Split-C [CDG 93], AC [CaDr 95], and PCP [BrWa 95], UPC has two primary objectives: 1) to provide efficient access to the underlying machine, and 2) to establish a common syntax and semantics for explicitly parallel programming in C. The quest for high performance means in particular that UPC tries to minimize the overhead involved in communication among cooperating threads. When the underlying hardware enables a processor to read and write remote memory without intervention by the remote processor (as in the SGI/Cray T3D and T3E), UPC provides the programmer with a direct and easy mapping from the language to low-level machine instructions. At the same time, UPC’s parallel features can be mapped onto existing message-passing software or onto physically shared memory to make its programs portable from one parallel architecture to another. As a consequence, vendors who wish to implement an explicitly parallel C could use the syntax and semantics of UPC as a basis for a standard.


Tools

[T19] OProfile — A system profiler for Linux. URL

Abstract from the Web site: OProfile is a system-wide profiler for Linux systems, capable of profiling all running code at low overhead. OProfile is released under the GNU GPL. It consists of a kernel driver and a daemon for collecting sample data, and several post-profiling tools for turning data into information. OProfile leverages the hardware performance counters of the CPU to enable profiling of a wide variety of interesting statistics, which can also be used for basic time-spent profiling. All code is profiled: hardware and software interrupt handlers, kernel modules, the kernel, shared libraries, and applications. OProfile is currently in alpha status; however it has proven stable over a large number of differing configurations; it is being used on machines ranging from laptops to 16-way NUMA-Q boxes. As always, there is no warranty.

[T20] J. Treibig, G. Hager and G. Wellein. LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments. Accepted for the First International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI 2010). URL

Abstract: Exploiting the performance of today’s processors requires intimate knowledge of the microarchitecture as well as an awareness of the ever-growing complexity in thread and cache topology. LIKWID is a set of command-line utilities that addresses four key problems: Probing the thread and cache topology of a shared-memory node, enforcing thread-core affinity on a program, measuring performance counter metrics, and toggling hardware prefetchers. An API for using the performance counting features from user code is also included. We clearly state the differences to the widely used PAPI interface. To demonstrate the capabilities of the tool set we show the influence of thread pinning on performance using the well-known OpenMP STREAM triad benchmark, and use the affinity and hardware counter tools to study the performance of a stencil code specifically optimized to utilize shared caches on multicore chips.

[T21] Intel VTune Performance Analyzer. URL

[T22] PAPI: Performance Application Programming Interface. URL

Abstract from the Web site: The Performance API (PAPI) project specifies a standard application programming interface (API) for accessing hardware performance counters available on most modern microprocessors. These counters exist as a small set of registers that count Events, occurrences of specific signals related to the processor’s function. Monitoring these events facilitates correlation between the structure of source/object code and the efficiency of the mapping of that code to the underlying architecture. This correlation has a variety of uses in performance analysis including hand tuning, compiler optimization, debugging, benchmarking, monitoring and performance modeling. In addition, it is hoped that this information will prove useful in the development of new compilation technology as well as in steering architectural development towards alleviating commonly occurring bottlenecks in high performance computing.

[T23] Intel Thread Profiler. URL

[T24] D. Skinner. Performance monitoring of parallel scientific applications, 2005. URL

Abstract: This paper introduces an infrastructure for efficiently collecting performance profiles from parallel HPC codes. Integrated Performance Monitoring (IPM) brings together multiple sources of performance metrics into a single profile that characterizes the overall performance and resource usage of the application. IPM maintains low overhead by using a unique hashing approach which allows a fixed memory footprint and minimal CPU usage. IPM is open source, relies on portable software technologies and is scalable to thousands of tasks.

[T25] O. Zaki, E. Lusk, W. Gropp and D. Swider. Toward scalable performance visualization with Jumpshot. International Journal of High Performance Computing Applications 13(3), (1999) 277–288. DOI: 10.1177/109434209901300310

Abstract: Jumpshot is a graphical tool for understanding the performance of parallel programs. It is in the tradition of the upshot tool but contains a number of extensions and enhancements that make it suitable for large-scale parallel computations. Jumpshot takes as input a new, more flexible logfile format and comes with a library for generating such logfiles. An MPI profiling library is also included, enabling the automatic generation of such logfiles from MPI programs. Jumpshot is written in Java and can easily be integrated as an applet into browser-based computing environments. The most novel feature of Jumpshot is its automatic detection of anomalous durations, drawing the user’s attention to problem areas in a parallel execution. This capability is particularly useful in large-scale parallel computations containing many events.

[T26] Intel Trace Analyzer and Collector. URL

[T27] VAMPIR Performance optimization for MPI. URL

[T28] M. Geimer, F. Wolf, B. J. Wylie, E. Ábrahám, D. Becker and B. Mohr. The SCALASCA performance toolset architecture. Concurrency and Computation: Practice and Experience 22, (2010) 702–719. DOI: 10.1002/cpe.1556 URL (2008 workshop paper)

Abstract: SCALASCA is a performance toolset that has been specifically designed to analyze parallel application execution behavior on large-scale systems. It offers an incremental performance analysis procedure that integrates runtime summaries with in-depth studies of concurrent behavior via event tracing, adopting a strategy of successively refined measurement configurations. Distinctive features are its ability to identify wait states in applications with very large numbers of processes and combine these with efficiently summarized local measurements. In this article, we review the current toolset architecture, emphasizing its scalable design and the role of the different components in transforming raw measurement data into knowledge of application execution behavior. The scalability and effectiveness of SCALASCA are then surveyed from experience measuring and analyzing real-world applications on a range of computer systems.

[T29] M. Gerndt, K. Fürlinger and E. Kereku. Periscope: Advanced techniques for performance analysis. In: G. R. Joubert et al. (eds.), Parallel Computing: Current and Future Issues of High-End Computing (Proceedings of the International Conference ParCo 2005), vol. 33 of NIC Series. ISBN 3000173528. URL

Abstract: Performance analysis of applications on supercomputers require scalable tools. The Periscope environment applies a distributed automatic online analysis and thus scales to thousands of processors. This article gives an overview of the Periscope system, from the performance property specification, via the search process, to the integration with two monitoring systems. We also present first experimental results.

[T30] T. Klug, M. Ott, J. Weidendorfer and C. Trinitis. autopin — Automated optimization of thread-to-core pinning on multicore systems. Transactions on High-Performance Embedded Architectures and Compilers 3(4), (2008) 1–18. URL

Abstract: In this paper we present a framework for automatic detection and application of the best binding between threads of a running parallel application and processor cores in a shared memory system, by making use of hardware performance counters. This is especially important within the scope of multicore architectures with shared cache levels. We demonstrate that many applications from the SPEC OMP benchmark show quite sensitive runtime behavior depending on the thread/core binding used. In our tests, the proposed framework is able to find the best binding in nearly all cases. The proposed framework is intended to supplement job scheduling systems for better automatic exploitation of systems with multicore processors, as well as making programmers aware of this issue by providing measurement logs.

[T31] M. Meier. Pinning OpenMP threads by overloading pthread_create(). URL

Comment: This is the source code of the tool that inspired likwid-pin from the LIKWID tool suite.

[T32] Portable Linux processor affinity. URL

Comment: PLPA has been superseded by the new hwloc library, which provides topology functions as well.

[T33] Portable hardware locality (hwloc). URL

Computer architecture and design

[R34] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach (Morgan Kaufmann), 4th ed., 2006. ISBN 978-0123704900.

[R35] G. E. Moore. Cramming more components onto integrated circuits. Electronics 38(8), (1965) 114–117. URL

Abstract: With unit cost falling as the number of components per circuit rises, by 1975 economics may dictate squeezing as many as 65,000 components on a single silicon chip.

Comment: Moore’s Law from his original 1965 paper states that the number of transistors per chip should double every 12 months. Moore later corrected this to 24 months; however, he was well aware that the type of “chip” under investigation influences the prediction quite a lot. The following document provides a detailed analysis of the history of Moore’s Law and the development of semiconductor transistor density on different types of devices: R. P. Rumelt and O. Costa: Gordon Moore’s Law. (2002)

[R36] W. D. Hillis. The Connection Machine (MIT Press), 1989. ISBN 978-0262580977.

[R37] N. R. Mahapatra and B. Venkatrao. The Processor-Memory Bottleneck: Problems and Solutions. Crossroads 5, (1999) 2. ISSN 1528-4972. DOI: 10.1145/357783.331677

Abstract: The rate of improvement in microprocessor speed exceeds the rate of improvement in DRAM (Dynamic Random Access Memory) speed. So although the disparity between processor and memory speed is already an issue, downstream someplace it will be a much bigger one. Hence computer designers are faced with an increasing Processor – Memory Performance Gap [R34], which now is the primary obstacle to improved computer system performance. This article examines this problem as well as its various solutions.

[R38] M. J. Flynn. Some computer organizations and their effectiveness. IEEE Trans. Comput. C-21, (1972) 948. DOI: 10.1109/TC.1972.5009071

Abstract: A hierarchical model of computer organizations is developed, based on a tree model using request/service type resources as nodes. Two aspects of the model are distinguished: logical and physical. General parallel- or multiple-stream organizations are examined as to type and effectiveness, especially regarding intrinsic logical difficulties. The overlapped simplex processor (SISD) is limited by data dependencies. Branching has a particularly degenerative effect. The parallel processors [single-instruction stream-multiple-data stream (SIMD)] are analyzed. In particular, a nesting type explanation is offered for Minsky’s conjecture [that] the performance of a parallel processor increases as log M instead of M (the number of data stream processors). Multiprocessors (MIMD) are subjected to a saturation syndrome based on general communications lockout. Simplified queuing models indicate that saturation develops when the fraction of task time spent locked out (L/E) approaches 1/n, where n is the number of processors. Resources sharing in multiprocessors can be used to avoid several other classic organizational problems.

[R39] R. Kumar, D. M. Tullsen, N. P. Jouppi and P. Ranganathan. Heterogeneous chip multiprocessors. IEEE Computer 38(11), (2005) 32–38. DOI: 10.1109/MC.2005.379

Abstract: Heterogeneous (or asymmetric) chip multiprocessors present unique opportunities for improving system throughput, reducing processor power, and mitigating Amdahl’s law. On-chip heterogeneity allows the processor to better match execution resources to each application’s needs and to address a much wider spectrum of system loads — from low to high thread parallelism — with high efficiency.

[R40] D. P. Siewiorek, C. G. Bell and A. Newell (eds.). Computer Structures: Principles and Examples (McGraw-Hill), 2nd ed., 1982. ISBN 978-0070573024. URL

Comment: This book would probably better fit into the Computer history section.

Performance modeling

[M41] J. Treibig, G. Hager and G. Wellein. Multi-core architectures: Complexities of performance prediction and the impact of cache topology. In: S. Wagner et al. (eds.), High Performance Computing in Science and Engineering, Garching/Munich 2009 (Springer-Verlag, Berlin, Heidelberg). To appear. URL

Abstract: The balance metric is a simple approach to estimate the performance of bandwidth-limited loop kernels. However, applying the method to in-cache situations and modern multi-core architectures yields unsatisfactory results. This paper analyzes the influence of cache hierarchy design on performance predictions for bandwidth-limited loop kernels on current mainstream processors. We present a diagnostic model with improved predictive power, correcting the limitations of the simple balance metric. The importance of code execution overhead even in bandwidth-bound situations is emphasized. Finally we analyze the impact of synchronization overhead on multi-threaded performance with a special emphasis on the influence of cache topology.

[M42] S. Williams, A. Waterman and D. Patterson. Roofline: An insightful visual performance model for multicore architectures. Commun. ACM 52(4), (2009) 65–76. ISSN 0001-0782. DOI: 10.1145/1498765.1498785

[M43] P. F. Spinnato, G. van Albada and P. M. Sloot. Performance modeling of distributed hybrid architectures. IEEE Trans. Parallel Distrib. Systems 15(1), (2004) 81–92. DOI: 10.1109/TPDS.2004.1264788

Abstract: Hybrid architectures are systems where a high performance general purpose computer is coupled to one or more Special Purpose Devices (SPDs). Such a system can be the optimal choice for several fields of computational science. Configuring the system and finding the optimal mapping of the application tasks onto the hybrid machine often is not straightforward. Performance modeling is a tool to tackle and solve these problems. We have developed a performance model to simulate the behavior of a hybrid architecture consisting of a parallel multiprocessor where some nodes are the host of a GRAPE board. GRAPE is a very high performance SPD used in computational astrophysics. We validate our model on the architecture at our disposal, and show examples of predictions that our model can produce.

[M44] J. Treibig and G. Hager. Introducing a performance model for bandwidth-limited loop kernels. In: Proceedings of PPAM 2009, the Eighth International Conference on Parallel Processing and Applied Mathematics, Wroclaw, Poland, September 13–16, 2009. To appear. URL

Abstract: We present a performance model for bandwidth limited loop kernels which is founded on the analysis of modern cache based microarchitectures. This model allows an accurate performance prediction and evaluation for existing instruction codes. It provides an in-depth understanding of how performance for different memory hierarchy levels is made up. The performance of raw memory load, store and copy operations and a stream vector triad are analyzed and benchmarked on three modern x86-type quad-core architectures in order to demonstrate the capabilities of the model.

[M45] G. M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In: AFIPS ’67 (Spring): Proceedings of the April 18–20, 1967, Spring Joint Computer Conference (ACM, New York, NY, USA), 483–485. DOI: 10.1145/1465482.1465560

Abstract: For over a decade prophets have voiced the contention that the organization of a single computer has reached its limits and that truly significant advances can be made only by interconnection of a multiplicity of computers in such a manner as to permit cooperative solution. Variously the proper direction has been pointed out as general purpose computers with a generalized interconnection of memories, or as specialized computers with geometrically related memory interconnections and controlled by one or more instruction streams.

[M46] J. L. Gustafson. Reevaluating Amdahl’s law. Commun. ACM 31(5), (1988) 532–533. ISSN 0001-0782. DOI: 10.1145/42411.42415

Abstract: At Sandia National Laboratories, we are currently engaged in research involving massively parallel processing. There is considerable skepticism regarding the viability of massive parallelism; the skepticism centers around Amdahl’s law, an argument put forth by Gene Amdahl in 1967 [M45] that even when the fraction of serial work in a given problem is small, say, s, the maximum speedup obtainable from even an infinite number of parallel processors is only 1/s. We now have timing results for a 1024-processor system that demonstrate that the assumptions underlying Amdahl’s 1967 argument are inappropriate for the current approach to massive ensemble parallelism.

[M47] M. D. Hill and M. R. Marty. Amdahl’s Law in the multicore era. IEEE Computer 41(7), (2008) 33–38. DOI: 10.1109/MC.2008.209

Abstract: Augmenting Amdahl’s law with a corollary for multicore hardware makes it relevant to future generations of chips with multiple processor cores. Obtaining optimal multicore performance will require further research in both extracting more parallelism and making sequential cores faster.

[M48] X.-H. Sun and Y. Chen. Reevaluating Amdahl’s Law in the multicore era. Journal of Parallel and Distributed Computing 70(2), (2010) 183–188. ISSN 0743-7315. DOI: 10.1016/j.jpdc.2009.05.002

Abstract: Microprocessor architecture has entered the multicore era. Recently, Hill and Marty presented a pessimistic view of multicore scalability. Their analysis was based on Amdahl’s law (i.e. fixed-workload condition) and challenged readers to develop better models. In this study, we analyze multicore scalability under fixed-time and memory-bound conditions and from the data access (memory wall) perspective. We use the same hardware cost model of multicore chips used by Hill and Marty, but achieve very different and more optimistic performance models. These models show that there is no inherent, immovable upper bound on the scalability of multicore architectures. These results complement existing studies and demonstrate that multicore architectures are capable of extensive scalability.

Numerical techniques and libraries

[N49] R. Barrett, M. Berry, T. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine and H. van der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods (SIAM), 1993. ISBN 978-0898713282.

[N50] C. L. Lawson, R. J. Hanson, D. R. Kincaid and F. T. Krogh. Basic Linear Algebra Subprograms for Fortran usage. ACM Transactions on Mathematical Software 5(3), (1979) 308–323. ISSN 0098-3500. DOI: 10.1145/355841.355847

Abstract: A package of 38 low level subprograms for many of the basic operations of numerical linear algebra is presented. The package is intended to be used with Fortran. The operations in the package include dot product, elementary vector operation, Givens transformation, vector copy and swap, vector norm, vector scaling, and the determination of the index of the vector component of largest magnitude. The subprograms and a test driver are available in portable Fortran. Versions of the subprograms are also provided in assembly language for the IBM 360/67, the CDC 6600 and CDC 7600, and the Univac 1108.

[N51] W. H. Press, B. P. Flannery, S. A. Teukolsky and W. T. Vetterling. Numerical Recipes in FORTRAN77: The Art of Scientific Computing (v.1) (Cambridge University Press), 2nd ed., September 1992. ISBN 052143064X. URL

Optimization techniques

[O52] G. Wellein, G. Hager, T. Zeiser, M. Wittmann and H. Fehske. Efficient temporal blocking for stencil computations by multicore-aware wavefront parallelization. Annual International Computer Software and Applications Conference (COMPSAC09) 1, (2009) 579–586. ISSN 0730-3157.

Abstract: We present a pipelined wavefront parallelization approach for stencil-based computations. Within a fixed spatial domain successive wavefronts are executed by threads scheduled to a multicore processor chip with a shared outer level cache. By re-using data from cache in the successive wavefronts this multicore-aware parallelization strategy employs temporal blocking in a simple and efficient way. We use the Jacobi algorithm in three dimensions as a prototype for stencil-based computations and prove the efficiency of our approach on the latest generations of Intel’s x86 quad- and hexa-core processors.

[O53] M. Wittmann, G. Hager and G. Wellein. Multicore-aware parallel temporal blocking of stencil codes for shared and distributed memory. In: Workshop on Large-Scale Parallel Processing 2010 (IPDPS2010), Atlanta, GA, April 23, 2010. URL DOI: 10.1109/IPDPSW.2010.5470813

Abstract: New algorithms and optimization techniques are needed to balance the accelerating trend towards bandwidth-starved multicore chips. It is well known that the performance of stencil codes can be improved by temporal blocking, lessening the pressure on the memory interface. We introduce a new pipelined approach that makes explicit use of shared caches in multicore environments and minimizes synchronization and boundary overhead. For clusters of shared-memory nodes we demonstrate how temporal blocking can be employed successfully in a hybrid shared/distributed-memory environment.

[O54] G. Hager, T. Zeiser, J. Treibig and G. Wellein. Optimizing performance on modern HPC systems: Learning from simple kernel benchmarks. In: Proceedings of 2nd Russian-German Advanced Research Workshop on Computational Science and High Performance Computing, Stuttgart 2005 (Springer-Verlag, Berlin, Heidelberg). DOI: 10.1007/3-540-31768-6_23

Abstract: We discuss basic optimization and parallelization strategies for current cache-based microprocessors (Intel Itanium2, Intel Netburst and AMD64 variants) in single-CPU and shared memory environments. Using selected kernel benchmarks representing data intensive applications we focus on the effective bandwidths attainable, which is still suboptimal using current compilers. We stress the need for a subtle OpenMP implementation even for simple benchmark programs, to exploit the high aggregate memory bandwidth available nowadays on ccNUMA systems. If the quality of main memory access is the measure, classical vector systems such as the NEC SX6+ are still a class of their own and are able to sustain the performance level of in-cache operations of modern microprocessors even with arbitrarily large data sets.

[O55] A. Fog. Agner Fog’s software optimization resources. URL

[O56] G. Schubert, G. Hager and H. Fehske. Performance limitations for sparse matrix-vector multiplications on current multicore environments. In: S. Wagner et al. (eds.), High Performance Computing in Science and Engineering, Garching/Munich 2009 (Springer-Verlag, Berlin, Heidelberg). To appear. URL

Abstract: The increasing importance of multicore processors calls for a reevaluation of established numerical algorithms in view of their ability to profit from this new hardware concept. In order to optimize the existent algorithms, a detailed knowledge of the different performance-limiting factors is mandatory. In this contribution we investigate sparse matrix-vector multiplication, which is the dominant operation in many sparse eigenvalue solvers. Two conceptually different storage schemes and computational kernels have been conceived in the past to target cache-based and vector architectures, respectively. Starting from a series of microbenchmarks we apply the gained insight on optimized sparse MVM implementations, whose serial and OpenMP-parallel performance we review on state-of-the-art multicore systems.

[O57] D. J. Kerbyson, M. Lang and G. Johnson. Infiniband routing table optimizations for scientific applications. Parallel Processing Letters 18(4), (2008) 589–608.

Abstract: The achievable performance on Infiniband networks is governed by the latencies and bandwidths of communication channels as well as by contention within the network. Currently Infiniband uses static routing to transfer messages and thus does not take into account dynamic loading of the channels. By interrogating the network routing tables we quantify the contention that occurs for a number of communication patterns using a large-scale (1024 processor) system. Empirical data confirms our contention calculation almost exactly. Custom routing tables are defined that provide both optimum and worst-case performance for a large-range of communication patterns. Performance differences can be as large as 12× (from optimum to worst-case). Two large-scale applications show a run-time improvement of between 10–20% and up to a 40% improvement in just their communication time when using optimized routing tables. The approach taken is applicable to many Infiniband systems, and we expect the performance improvements to be even greater on larger-scale systems.

[O58] M. Wittmann and G. Hager. A proof of concept for optimizing task parallelism by locality queues. URL

Abstract: Task parallelism as employed by the OpenMP task construct, although ideal for tackling irregular problems or typical producer/consumer schemes, bears some potential for performance bottlenecks if locality of data access is important, which is typically the case for memory-bound code on ccNUMA systems. We present a programming technique which ameliorates adverse effects of dynamic task distribution by sorting tasks into locality queues, each of which is preferably processed by threads that belong to the same locality domain. Dynamic scheduling is fully preserved inside each domain, and is preferred over possible load imbalance even if non-local access is required. The effectiveness of the approach is demonstrated using a blocked six-point stencil solver as a toy model.

[O59] G. Hager, F. Deserno and G. Wellein. Pseudo-Vectorization and RISC Optimization Techniques for the Hitachi SR8000 Architecture. In: S. Wagner et al. (eds.), High Performance Computing in Science and Engineering Munich 2002 (Springer-Verlag, Berlin, Heidelberg), 425–442. ISBN 3540004742. URL

Abstract: We demonstrate optimization techniques for the Hitachi SR8000 architecture on CPU and SMP level using selected case studies from real-world codes. Special emphasis is given to a comparison with performance characteristics of other modern RISC and vector CPUs.

[O60] D. Barkai and A. Brandt. Vectorized multigrid Poisson solver for the CDC Cyber 205. Applied Mathematics and Computation 13, (1983) 215–227. DOI: 10.1016/0096-3003(83)90013-9

Abstract: The full multigrid (FMG) method is applied to the two dimensional Poisson equation with Dirichlet boundary conditions. This has been chosen as a relatively simple test case for examining the efficiency of fully vectorizing of the multigrid method. Data structure and programming considerations and techniques are discussed, accompanied by performance details.

[O61] M. Kowarschik. Data Locality Optimizations for Iterative Numerical Algorithms and Cellular Automata on Hierarchical Memory Architectures (SCS Publishing House), 2004. ISBN 3936150397. URL

Abstract: In order to mitigate the impact of the constantly widening gap between processor speed and main memory performance on the runtimes of application codes, today’s computer architectures commonly employ hierarchical memory designs including several levels of cache memories. Efficient program execution can only be expected if the underlying hierarchical memory architecture is respected. This is particularly true for numerically intensive codes.

Unfortunately, current compilers are unable to introduce sophisticated cache-based code transformations. As a consequence, much of the tedious and error-prone optimization effort is left to the software developer. In the case of a parallel numerical application running on a cluster of workstations, for example, hierarchical memory optimization represents a crucial task that must be addressed in addition to typical parallelization objectives such as efficient load balancing and the minimization of communication overhead.

The tuning of the utilization of the memory hierarchy covers a wide spectrum of techniques ranging from hardware-based technologies (e.g., data prefetching mechanisms) to compiler-based code transformations (e.g., restructuring loops) as well as fundamental algorithmic modifications. The latter may treat increased computational work for reduced memory costs, for example, and are a long way from being introduced automatically by a compiler. In general, achieving both high runtime efficiency and code portability represents a software engineering challenge that must be faced whenever designing numerical libraries.

In this thesis, we will present approaches towards the optimization of the data locality of implementations of grid-based numerical algorithms. In particular, we will concentrate on multigrid methods based on structured meshes as well as cellular automata in both 2D and 3D. As a representative example of cellular automata, we will consider the lattice Boltzmann method for simulating fluid flow problems.

Furthermore, we will discuss a novel approach towards inherently cache-aware multigrid methods. Our algorithm, which is based on the theory of the fully adaptive multigrid method, employs a patch-adaptive processing strategy. It is thus characterized by a high utilization of the cache hierarchy. Moreover, due to its adaptive update strategy, this scheme performs more robustly than classical multigrid algorithms in the case of singularly perturbed problems.

[O62] K. Datta, S. Kamil, S. Williams, L. Oliker, J. Shalf and K. Yelick. Optimization and performance modeling of stencil computations on modern microprocessors. SIAM Review 51(1), (2009) 129–159. DOI: 10.1137/070693199

Abstract: Stencil-based kernels constitute the core of many important scientific applications on block-structured grids. Unfortunately, these codes achieve a low fraction of peak performance, due primarily to the disparity between processor and main memory speeds. In this paper, we explore the impact of trends in memory subsystems on a variety of stencil optimization techniques and develop performance models to analytically guide our optimizations. Our work targets cache reuse methodologies across single and multiple stencil sweeps, examining cache-aware algorithms as well as cache-oblivious techniques on the Intel Itanium2, AMD Opteron, and IBM Power5. Additionally, we consider stencil computations on the heterogeneous multicore design of the Cell processor, a machine with an explicitly managed memory hierarchy. Overall our work represents one of the most extensive analyses of stencil optimizations and performance modeling to date. Results demonstrate that recent trends in memory system organization have reduced the efficacy of traditional cache-blocking optimizations. We also show that a cache-aware implementation is significantly faster than a cache-oblivious approach, while the explicitly managed memory on Cell enables the highest overall efficiency: Cell attains 88% of algorithmic peak while the best competing cache-based processor achieves only 54% of algorithmic peak performance.

[O63] J. Treibig, G. Wellein and G. Hager. Efficient multicore-aware parallelization strategies for iterative stencil computations. Submitted. URL

Abstract: Stencil computations consume a major part of runtime in many scientific simulation codes. As prototypes for this class of algorithms we consider the iterative Jacobi and Gauss-Seidel smoothers and aim at highly efficient parallel implementations for cache-based multicore architectures. Temporal cache blocking is a known advanced optimization technique, which can reduce the pressure on the memory bus significantly. We apply and refine this optimization for a recently presented temporal blocking strategy designed to explicitly utilize multicore characteristics. Especially for the case of Gauss-Seidel smoothers we show that simultaneous multi-threading (SMT) can yield substantial performance improvements for our optimized algorithm.

[O64] M. Müller. Some simple OpenMP optimization techniques. In: OpenMP Shared Memory Parallel Programming: International Workshop on OpenMP Applications and Tools, WOMPAT 2001, West Lafayette, IN, USA, July 30–31, 2001: Proceedings. 31–39. DOI: 10.1007/3-540-44587-0_4

Abstract: The purpose of this benchmark is to test the existence of certain optimization techniques in current OpenMP compilers. Examples are the removal of redundant synchronization constructs and effective constructs for alternative code. The effectiveness of the compiler generated code is measured by comparing different OpenMP constructs and compilers. If possible, we also compare with the hand coded “equivalent” solution.

[O65] G. Hager, T. Zeiser and G. Wellein. Data access optimizations for highly threaded multi-core CPUs with multiple memory controllers. In: Workshop on Large-Scale Parallel Processing 2008 (IPDPS2008), Miami, FL, April 18, 2008. DOI: 10.1109/IPDPS.2008.4536341 URL

Abstract: Processor and system architectures that feature multiple memory controllers are prone to show bottlenecks and erratic performance numbers on codes with regular access patterns. Although such effects are well known in the form of cache thrashing and aliasing conflicts, they become more severe when memory access is involved. Using the new Sun UltraSPARC T2 processor as a prototypical multi-core design, we analyze performance patterns in low-level and application benchmarks and show ways to circumvent bottlenecks by careful data layout and padding.

Comment: The following reference is an extended version of this paper and covers the T2+ dual-socket system as well.

[O65a] G. Hager, T. Zeiser and G. Wellein. Data access characteristics and optimizations for Sun UltraSPARC T2 and T2+ systems. Parallel Processing Letters 18(4), (2008) 471–490. DOI: 10.1142/S0129626408003521

Abstract: Processor and system architectures that feature multiple memory controllers and/or ccNUMA characteristics are prone to show bottlenecks and erratic performance numbers on scientific codes. Although cache thrashing, aliasing conflicts, and ccNUMA locality and contention problems are well known for many types of systems, they take on peculiar forms on the new Sun UltraSPARC T2 and T2+ processors, which we use here as prototypical multi-core designs. We analyze performance patterns in low-level and application benchmarks and put some emphasis on a comparison of performance features between T2 and its successor. Furthermore we show ways to circumvent bottlenecks by careful data layout, placement and padding.

[O66] S. Williams, L. Oliker, R. W. Vuduc, J. Shalf, K. A. Yelick and J. Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. Parallel Computing 35(3), (2009) 178–194. DOI: 10.1016/j.parco.2008.12.006

Abstract: We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore specific optimization methodologies for important scientific computations. In this work, we examine sparse matrix-vector multiply (SpMV) — one of the most heavily used kernels in scientific computing — across a broad spectrum of multicore designs. Our experimental platform includes the homogeneous AMD quad-core, AMD dual-core, and Intel quad-core designs, the heterogeneous STI Cell, as well as one of the first scientific studies of the highly multithreaded Sun Victoria Falls (a Niagara2 SMP). We present several optimization strategies especially effective for the multicore environment, and demonstrate significant performance improvements compared to existing state-of-the-art serial and parallel SpMV implementations. Additionally, we present key insights into the architectural trade-offs of leading multicore design strategies, in the context of demanding memory-bound numerical algorithms.

[O67] C. Terboven, D. an Mey, D. Schmidl, H. Jin and T. Reichstein. Data and thread affinity in OpenMP programs. In: MAW’08: Proceedings of the 2008 workshop on Memory access on future processors (ACM, New York, NY, USA). ISBN 978-1605580913, 377–384. DOI: 10.1145/1366219.1366222

Abstract: The slogan of last year’s International Workshop on OpenMP was “A Practical Programming Model for the Multi-Core Era”, although OpenMP still is fully hardware architecture agnostic. As a consequence the programmer is left alone with bad performance if threads and data happen to live apart. In this work we examine the programmer’s possibilities to improve data and thread affinity in OpenMP programs for several toy applications and present how to apply the lessons learned on larger application codes. We filled a gap by implementing explicit data migration on Linux providing a next touch mechanism.

[O68] B. Chapman, F. Bregier, A. Patil and A. Prabhakar. Achieving performance under OpenMP on ccNUMA and software distributed shared memory systems. Concurrency Comput.: Pract. Exper. 14, (2002) 713–739. DOI: 10.1002/cpe.646

Abstract: OpenMP is emerging as a viable high-level programming model for shared memory parallel systems. It was conceived to enable easy, portable application development on this range of systems, and it has also been implemented on cache-coherent Non-Uniform Memory Access (ccNUMA) architectures. Unfortunately, it is hard to obtain high performance on the latter architecture, particularly when large numbers of threads are involved. In this paper, we discuss the difficulties faced when writing OpenMP programs for ccNUMA systems, and explain how the vendors have attempted to overcome them. We focus on one such system, the SGI Origin 2000, and perform a variety of experiments designed to illustrate the impact of the vendor’s efforts. We compare codes written in a standard, loop-level parallel style under OpenMP with alternative versions written in a Single Program Multiple Data (SPMD) fashion, also realized via OpenMP, and show that the latter consistently provides superior performance. A carefully chosen set of language extensions can help us translate programs from the former style to the latter (or to compile directly, but in a similar manner). Syntax for these extensions can be borrowed from HPF, and some aspects of HPF compiler technology can help the translation process. It is our expectation that an extended language, if well compiled, would improve the attractiveness of OpenMP as a language for high-performance computation on an important class of modern architectures.

[O69] R. Rabenseifner and G. Wellein. Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures. Int. J. High Perform. Comp. Appl. 17(1), (2003) 49–62. DOI: 10.1177/1094342003017001005

Abstract: Most HPC systems are clusters of shared memory nodes. Parallel programming must combine the distributed memory parallelization on the node interconnect with the shared memory parallelization inside each node. The hybrid MPI+OpenMP programming model is compared with pure MPI, compiler based parallelization, and other parallel programming models on hybrid architectures. The paper focuses on bandwidth and latency aspects, and also on whether programming paradigms can separate the optimization of communication and computation. Benchmark results are presented for hybrid and pure MPI communication. This paper analyzes the strengths and weaknesses of several parallel programming models on clusters of SMP nodes.

[O70] R. Rabenseifner, G. Hager and G. Jost. Hybrid MPI/OpenMP parallel programming on clusters of multi-core SMP nodes. In: D. E. Baz, F. Spies and T. Gross (eds.), Proceedings of the 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing, PDP 2009, Weimar, Germany, 18–20 February 2009 (IEEE Computer Society). ISBN 978-0769535449, 427–436. DOI: 10.1109/PDP.2009.43

Abstract: Today most systems in high-performance computing (HPC) feature a hierarchical hardware design: Shared memory nodes with several multi-core CPUs are connected via a network infrastructure. Parallel programming must combine distributed memory parallelization on the node interconnect with shared memory parallelization inside each node. We describe potentials and challenges of the dominant programming models on hierarchically structured hardware: Pure MPI (Message Passing Interface), pure OpenMP (with distributed shared memory extensions) and hybrid MPI+OpenMP in several flavors. We pinpoint cases where a hybrid programming model can indeed be the superior solution because of reduced communication needs and memory consumption, or improved load balance. Furthermore we show that machine topology has a significant impact on performance for all parallelization strategies and that topology awareness should be built into all applications in the future. Finally we give an outlook on possible standardization goals and extensions that could make hybrid programming easier to do with performance in mind.

[O71] G. Hager, G. Jost and R. Rabenseifner. Communication characteristics and hybrid MPI/OpenMP parallel programming on clusters of multi-core SMP nodes. In: Proceedings of CUG 09, May 4–7 2009, Atlanta, GA. URL

Abstract: Hybrid MPI/OpenMP and pure MPI on clusters of multi-core SMP nodes involve several mismatch problems between the parallel programming models and the hardware architectures. Measurements of communication characteristics between cores on the same socket, on the same SMP node, and between SMP nodes on several platforms (including Cray XT4 and XT5) show that machine topology has a significant impact on performance for all parallelization strategies and that topology awareness should be built into all applications in the future. We describe potentials and challenges of the dominant programming models on hierarchically structured hardware. Case studies with the multizone NAS parallel benchmarks on several platforms demonstrate the opportunities of hybrid programming.

[O72] H. Stengel. Parallel programming on hybrid hardware: Models and applications. Master’s thesis, Georg Simon Ohm University of Applied Sciences, Nuremberg, 2010. URL

[O73] M. Frigo and V. Strumpen. Cache oblivious stencil computations. In: ICS ’05: Proceedings of the 19th annual international conference on Supercomputing (ACM, New York, NY, USA). ISBN 1595931678, 361–366. DOI: 10.1145/1088149.1088197

Abstract: We present a cache oblivious algorithm for stencil computations, which arise for example in finite-difference methods. Our algorithm applies to arbitrary stencils in n-dimensional spaces. On an “ideal cache” of size Z, our algorithm saves a factor of O(Z1/n) cache misses compared to a naive algorithm, and it exploits temporal locality optimally throughout the entire memory hierarchy.

[O74] M. Frigo and V. Strumpen. The memory behavior of cache oblivious stencil computations. J. Supercomput. 39(2), (2007) 93–112. ISSN 0920-8542. DOI: 10.1007/s11227-007-0111-y

Abstract: We present and evaluate a cache oblivious algorithm for stencil computations, which arise for example in finite-difference methods. Our algorithm applies to arbitrary stencils in n-dimensional spaces. On an “ideal cache” of size Z, our algorithm saves a factor of O(Z1/n) cache misses compared to a naive algorithm, and it exploits temporal locality optimally throughout the entire memory hierarchy. We evaluate our algorithm in terms of the number of cache misses, and demonstrate that the memory behavior agrees with our theoretical predictions. Our experimental evaluation is based on a finite-difference solution of a heat diffusion problem, as well as a Gauss-Seidel iteration and a 2-dimensional LBMHD program, both reformulated as cache oblivious stencil computations.

[O75] T. Zeiser, G. Wellein, A. Nitsure, K. Iglberger, U. Rüde and G. Hager. Introducing a parallel cache oblivious blocking approach for the lattice Boltzmann method. Progress in CFD 8(1–4), (2008) 179–188. DOI: 10.1504/PCFD.2008.018088

Abstract: In this report we propose a parallel cache oblivious spatial and temporal blocking algorithm for the lattice Boltzmann method in three spatial dimensions. The algorithm has originally been proposed by Frigo et al. (1999) and divides the space-time domain of stencil-based methods in an optimal way, independently of any external parameters, e.g., cache size. In view of the increasing gap between processor speed and memory performance this approach offers a promising path to increase cache utilisation. We find that even a straightforward cache oblivious implementation can reduce memory traffic at least by a factor of two if compared to a highly optimised standard kernel and improves scalability for shared memory parallelisation. Due to the recursive structure of the algorithm we use an unconventional parallelisation scheme based on task queuing.

Large-scale parallelism

[L76] A. Hoisie, O. Lubeck and H. Wasserman. Performance and scalability analysis of teraflop-scale parallel architectures using multidimensional wave-front applications. Int. J. High Perform. Comp. Appl. 14, (2000) 330. DOI: 10.1177/109434200001400405

Abstract: The authors develop a model for the parallel performance of algorithms that consist of concurrent, two-dimensional wavefronts implemented in a message-passing environment. The model, based on a LogGP machine parameterization, combines the separate contributions of computation and communication wavefronts. The authors validate the model on three important supercomputer systems, on up to 500 processors. They use data from a deterministic particle transport application taken from the ASCI workload, although the model is general to any wavefront algorithm implemented on a 2-D processor domain. They also use the validated model to make estimates of performance and scalability of wavefront algorithms on 100 TFLOPS computer systems expected to be in existence within the next decade as part of the ASCI program and elsewhere. In this context, the authors analyze two problem sizes. Their model shows that on the largest such problem (1 billion cells), interprocessor communication performance is not the bottleneck. Single-node efficiency is the dominant factor.

[L77] F. Petrini, D. J. Kerbyson and S. Pakin. The case of the missing supercomputer performance: Achieving optimal performance on the 8,192 processors of ASCI Q. In: SC ’03: Proceedings of the 2003 ACM/IEEE conference on Supercomputing (IEEE Computer Society, Washington, DC, USA). ISBN 1581136951, 55. URL

Abstract: In this paper we describe how we improved the effective performance of ASCI Q, the world’s second-fastest supercomputer, to meet our expectations. Using an arsenal of performance-analysis techniques including analytical models, custom microbenchmarks, full applications, and simulators, we succeeded in observing a serious — but previously undetected — performance problem. We identified the source of the problem, eliminated the problem, and “closed the loop” by demonstrating up to a factor of 2 improvement in application performance. We present our methodology and provide insight into performance analysis that is immediately applicable to other large-scale supercomputers.

[L78] D. J. Kerbyson and A. Hoisie. Analysis of wavefront algorithms on large-scale two-level heterogeneous processing systems. In: Proceedings of the Workshop on Unique Chips and Systems (UCAS2), IEEE Int. Symposium on Performance Analysis of Systems and Software (ISPASS), Austin, TX, 2006.


Applications

[A79] H. Fehske, R. Schneider and A. Weisse (eds.). Computational Many-Particle Physics, vol. 739 of Lecture Notes in Physics (Springer), 2008. ISBN 978-3540746850. DOI: 10.1007/978-3-540-74686-7

Abstract: Complicated many-particle problems abound in nature and in research alike. Plasma physics, statistical physics and condensed matter physics, as primary examples, are all heavily dependent on efficient methods for solving such problems. Addressing graduate students and young researchers, this book presents an overview and introduction to state-of-the-art numerical methods for studying interacting classical and quantum many-particle systems. A broad range of techniques and algorithms are covered, and emphasis is placed on their implementation on modern high-performance computers.

[A80] A. Fava, E. Fava and M. Bertozzi. MPIPOV: A parallel implementation of POV-Ray based on MPI. In: Proc. EuroPVM/MPI’99, vol. 1697 of Lecture Notes in Computer Science (Springer), 426–433. DOI: 10.1007/3-540-48158-3_53

Abstract: The work presents an MPI parallel implementation of Pov-Ray, a powerful public domain ray tracing engine. The major problem in ray tracing is the large amount of CPU time needed for the elaboration of the image. With this parallel version it is possible to reduce the computation time or to render, with the same elaboration time, more complex or detailed images. The program was tested successfully on ParMa2, a low-cost cluster of personal computers running Linux operating system. The results are compared with those obtained with a commercial multiprocessor machine, a Silicon Graphics Onyx2 parallel processing system based on an Origin CC-NUMA architecture.

[A81] B. Freisleben, D. Hartmann and T. Kielmann. Parallel raytracing: A case study on partitioning and scheduling on workstation clusters. In: Proc. 30th International Conference on System Sciences 1997, Hawaii (IEEE), 596–605. DOI: 10.1109/HICSS.1997.10008

Abstract: In this paper, a case study is presented which is aimed at investigating the performance of several parallel versions of the POV-Ray raytracing package implemented on a workstation cluster using the MPI message passing library. Based on a manager/worker scheme, variants of workload partitioning and message scheduling strategies, in conjunction with different task granularities, are evaluated with respect to their runtime behavior. The results indicate that dynamic, adaptive strategies are required to cope with both the unbalanced workload characteristics of the parallel raytracing application and the different computational capabilities of the machines in a workstation cluster environment.

[A82] G. Wellein, G. Hager, A. Basermann and H. Fehske. Fast sparse matrix-vector multiplication for TFlops computers. In: J. Palma et al. (eds.), High Performance Computing for Computational Science — VECPAR2002, LNCS 2565 (Springer-Verlag, Berlin, Heidelberg). ISBN 3540008527, 287–301. DOI: 10.1007/3-540-36569-9_18

Abstract: Eigenvalue problems involving very large sparse matrices are common to various fields in science. In general, the numerical core of iterative eigenvalue algorithms is a matrix-vector multiplication (MVM) involving the large sparse matrix. We present three different programming approaches for parallel MVM on present day supercomputers. In addition to a pure message-passing approach, two hybrid parallel implementations are introduced based on simultaneous use of message-passing and shared-memory programming models. For a modern SMP cluster (HITACHI SR8000) performance and scalability of the hybrid implementations are discussed and compared with the pure message-passing approach on massively-parallel systems (CRAY T3E), vector computers (NEC SX5e) and distributed shared-memory systems (SGI Origin3800).

[A83] G. Hager, E. Jeckelmann, H. Fehske and G. Wellein. Parallelization strategies for density matrix renormalization group algorithms on shared-memory systems. J. Comput. Phys. 194(2), (2004) 795–808. DOI: 10.1016/

Abstract: Shared-memory (SMP) parallelization strategies for density matrix renormalization group (DMRG) algorithms enable the treatment of complex systems in solid state physics. We present two different approaches by which parallelization of the standard DMRG algorithm can be accomplished in an efficient way. The methods are illustrated with DMRG calculations of the two-dimensional Hubbard model and the one-dimensional Holstein-Hubbard model on contemporary SMP architectures. The parallelized code shows good scalability up to at least eight processors and allows us to solve problems which exceed the capability of sequential DMRG calculations.

[A84] M. Kinateder, G. Wellein, A. Basermann and H. Fehske. Jacobi-Davidson algorithm with fast matrix vector multiplication on massively parallel and vector supercomputers. In: E. Krause and W. Jäger (eds.), High Performance Computing in Science and Engineering ’00 (Springer-Verlag, Berlin, Heidelberg), 188–204. URL

Abstract: The exact diagonalization of very large sparse matrices is a numerical problem common to various fields in science and engineering. We present an advanced eigenvalue algorithm — the so-called Jacobi-Davidson algorithm — in combination with an efficient parallel matrix vector multiplication. This implementation allows the calculation of several specified eigenvalues with high accuracy on modern supercomputers, such as CRAY T3E and NEC SX-4. Exemplarily the numerical technique is applied to analyze the ground state and spectral properties of the three-quarter filled Peierls-Hubbard Hamiltonian in relation to recent resonant Raman experiments on MX chain [-PtCl-] complexes.

[A85] H. Fehske, A. Alvermann and G. Wellein. Quantum transport within a background medium: Fluctuations versus correlations. In: S. Wagner et al. (eds.), High Performance Computing in Science and Engineering, Garching/Munich 2007 (Springer-Verlag, Berlin, Heidelberg). ISBN 978-3540691815, 649–668. DOI: 10.1007/978-3-540-69182-2_50

Abstract: We investigate transport within some background medium by means of an effective lattice model with a novel form of fermion-boson coupling. The bosons correspond to local fluctuations of the background. The model captures the principal transport mechanisms that apply to a great variety of physical systems, and can be applied, e.g. in the one-particle sector, to describe the motion of lattice and spin polarons, or the dynamics of a particle coupled to a bath. Performing large-scale numerical simulations on the HLRB-II at LRZ Munich, based on highly efficient variational Lanczos and Chebyshev moment expansion techniques, we analyse the newly proposed model by exactly calculating the single quasiparticle effective mass, ground-state dispersion and spectral function, as well as the Drude weight and the optical conductivity for an infinite one-dimensional system. Moreover, for the half-filled band case, we establish a metal-insulator quantum phase transition by analysing the particle-particle/boson correlations and photoemission spectra.

[A86] T. Pohl, F. Deserno, N. Thürey, U. Rüde, P. Lammers, G. Wellein and T. Zeiser. Performance evaluation of parallel large-scale lattice Boltzmann applications on three supercomputing architectures. In: SC ’04: Proceedings of the 2004 ACM/IEEE conference on Supercomputing, Pittsburgh, PA. DOI: 10.1109/SC.2004.37

Abstract: Computationally intensive programs with moderate communication requirements such as CFD codes suffer from the standard slow interconnects of commodity “off the shelf” (COTS) hardware. We will introduce different large-scale applications of the Lattice Boltzmann Method (LBM) in fluid dynamics, material science, and chemical engineering and present results of the parallel performance on different architectures. It will be shown that a high speed communication network in combination with an efficient CPU is mandatory in order to achieve the required performance. An estimation of the necessary CPU count to meet the performance of 1 TFlop/s will be given as well as a prediction as to which architecture is the most suitable for LBM. Finally, ratios of costs to application performance for tailored HPC systems and COTS architectures will be presented.

[A87] C. Körner, T. Pohl, U. Rüde, N. Thürey and T. Zeiser. Parallel Lattice Boltzmann Methods for CFD Applications. In: Numerical Solution of Partial Differential Equations on Parallel Computers (Springer-Verlag, Berlin, Heidelberg). ISBN 3540290761, 439–465. DOI: 10.1007/3-540-31619-1_13

Abstract: The lattice Boltzmann method (LBM) has evolved to a promising alternative to the well-established methods based on finite elements/volumes for computational fluid dynamics simulations. Ease of implementation, extensibility, and computational efficiency are the major reasons for LBM’s growing field of application and increasing popularity. In this paper we give a brief introduction to the involved theory and equations for LBM, present various techniques to increase the single-CPU performance, outline the parallelization of a standard LBM implementation, and show performance results. In order to demonstrate the straightforward extensibility of LBM, we then focus on an application in material science involving fluid flows with free surfaces. We discuss the required extensions to handle this complex scenario, and the impact on the parallelization technique.

[A88] G. Allen, T. Dramlitsch, I. Foster, N. T. Karonis, M. Ripeanu, E. Seidel and B. Toonen. Supporting efficient execution in heterogeneous distributed computing environments with Cactus and Globus. In: Supercomputing ’01: Proceedings of the 2001 ACM/IEEE conference on Supercomputing (ACM, New York, NY, USA). ISBN 158113293X, 52–52. DOI: 10.1145/582034.582086

Abstract: Improvements in the performance of processors and networks make it both feasible and interesting to treat collections of workstations, servers, clusters, and supercomputers as integrated computational resources, or Grids. However, the highly heterogeneous and dynamic nature of such Grids can make application development difficult. Here we describe an architecture and prototype implementation for a Grid-enabled computational framework based on Cactus, the MPICH-G2 Grid-enabled message-passing library, and a variety of specialized features to support efficient execution in Grid environments. We have used this framework to perform record-setting computations in numerical relativity, running across four supercomputers and achieving scaling of 88% (1140 CPUs) and 63% (1500 CPUs). The problem size we were able to compute was about five times larger than any other previous run. Further, we introduce and demonstrate adaptive methods that automatically adjust computational parameters during run time, to increase dramatically the efficiency of a distributed Grid simulation, without modification of the application and without any knowledge of the underlying network connecting the distributed computers.

[A89] G. Hager, H. Stengel, T. Zeiser and G. Wellein. RZBENCH: Performance evaluation of current HPC architectures using low-level and application benchmarks. In: S. Wagner et al. (eds.), High Performance Computing in Science and Engineering, Garching/Munich 2007 (Springer-Verlag, Berlin, Heidelberg). ISBN 978-3540691815, 485–501. URL DOI: 10.1007/978-3-540-69182-2_39

Abstract: RZBENCH is a benchmark suite that was specifically developed to reflect the requirements of scientific supercomputer users at the University of Erlangen-Nuremberg (FAU). It comprises a number of application and low-level codes under a common build infrastructure that fosters maintainability and expandability. This paper reviews the structure of the suite and briefly introduces the most relevant benchmarks. In addition, some widely known standard benchmark codes are reviewed in order to emphasize the need for a critical review of often-cited performance results. Benchmark data is presented for the HLRB-II at LRZ Munich and a local InfiniBand Woodcrest cluster as well as two uncommon system architectures: A bandwidth-optimized InfiniBand cluster based on single socket nodes (“Port Townsend”) and an early version of Sun’s highly threaded T2 architecture (“Niagara 2”).

[A90] D. Kaushik, S. Balay, D. Keyes and B. Smith. Understanding the performance of hybrid MPI/OpenMP programming model for implicit CFD codes. In: Parallel CFD 2009 — 21st International Conference on Parallel Computational Fluid Dynamics, Moffett Field, CA, USA, May 18–22, 2009, Proceedings. ISBN 978-0578023335, 174–177.

C++ references

[C91] A. Fog. Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms. URL

Comment: An extremely well-written, comprehensive work about serial code optimization on modern (mostly x86) processors. It is also valuable for programmers who are not especially interested in C++, since many optimization techniques work independently of the programming language.

[C92] D. Bulka and D. Mayhew. Efficient C++: Performance Programming Techniques (Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA), 1999. ISBN 0201379503.

[C93] S. Meyers. Effective C++: 55 Specific Ways to Improve Your Programs and Designs (3rd Edition) (Addison-Wesley Professional), 2005. ISBN 0321334876.

[C94] S. Meyers. More Effective C++: 35 New Ways to Improve Your Programs and Designs (Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA), 1995. ISBN 020163371X.

[C95] S. Meyers. Effective STL: 50 Specific Ways to Improve Your Use of the Standard Template Library (Addison-Wesley Longman Ltd., Essex, UK), 2001. ISBN 0201749629.

[C96] T. Veldhuizen. Expression templates. C++ Report 7(5), (1995) 26–31.

Comment: This paper seems to have vanished from the Web, but it can be found in “C++ Gems” by S. Lippman (see next reference). The paper on Template Metaprogramming is also worth reading, since expression templates build on those concepts.
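Comment: To illustrate the core idea behind expression templates (a hand-rolled minimal sketch, not code from the references above): the overloaded operators do no arithmetic themselves but build lightweight expression objects, so the elementwise loop runs only once, at assignment time, without temporary vectors for subexpressions.

```cpp
#include <cstddef>
#include <vector>

// Expression object representing the elementwise sum of two operands.
// It stores references only; no temporary vector is allocated.
template <typename L, typename R>
struct Sum {
    const L& l;
    const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
    std::size_t size() const { return l.size(); }
};

// A simple numeric vector that can be assigned from any expression.
struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n, double v = 0.0) : data(n, v) {}
    double  operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i)       { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Evaluating e.g. r = a + b + c runs a single loop over the
    // whole expression tree, fusing all additions per element.
    template <typename E>
    Vec& operator=(const E& e) {
        for (std::size_t i = 0; i < size(); ++i) data[i] = e[i];
        return *this;
    }
};

// Overloaded + only builds the expression type; no work is done here.
template <typename L, typename R>
Sum<L, R> operator+(const L& l, const R& r) { return {l, r}; }
```

The unconstrained operator+ is for brevity only; a production implementation would restrict it to vector and expression types, as the papers above discuss.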

[C96a] T. Veldhuizen. Expression templates. In: S. Lippman (ed.), C++ Gems (Cambridge University Press, Cambridge, UK), 1998. ISBN 0135705819.

[C96b] T. Veldhuizen. Template Metaprogramming.

[C97] J. Härdtlein, A. Linke and C. Pflaum. Fast expression templates. In: V. S. Sunderam, G. D. van Albada, P. M. Sloot and J. Dongarra (eds.), Computational Science – ICCS 2005, 5th International Conference, Atlanta, GA, USA, May 22–25, 2005, Proceedings, Part II. 1055–1063. DOI: 10.1007/11428848_133

Abstract: Expression templates (ET) can significantly reduce the implementation effort of mathematical software. For some compilers, especially for those of supercomputers, however, it can be observed that classical ET implementations do not deliver the expected performance. This is because aliasing of pointers in combination with the complicated ET constructs becomes much more difficult. Therefore, we introduced the concept of enumerated variables, which are provided with an additional integer template parameter. Based on this new implementation of ET we obtain a C++ code whose performance is very close to the handcrafted C code. The performance results of these so-called Fast ET are presented for the Hitachi SR8000 supercomputer and the NEC SX6, both with automatic vectorization and parallelization. Additionally we studied the combination of Fast ET and OpenMP on a high performance Opteron cluster.

[C98] A. Aue. Improving performance with custom pool allocators for STL. C/C++ Users Journal, (2005) 1–13. URL

Abstract: Anthony presents a highly flexible and configurable replacement for C++’s std::allocator for use with node-based standard containers.

[C99] M. H. Austern. Segmented iterators and hierarchical algorithms. In: M. Jazayeri, R. Loos and D. R. Musser (eds.), International Seminar on Generic Programming, Dagstuhl Castle, Germany, April 27–May 1, 1998, Selected Papers, vol. 1766 of Lecture Notes in Computer Science (Springer). ISBN 3540410902, 80–90. URL DOI: 10.1007/3-540-39953-4_7

Abstract: Many data structures are naturally segmented. Generic algorithms that ignore that feature, and that treat every data structure as a uniform range of elements, are unnecessarily inefficient. A new kind of iterator abstraction, in which segmentation is explicit, makes it possible to write hierarchical algorithms that exploit segmentation.

[C100] H. Stengel. C++-Programmiertechniken für High Performance Computing auf Systemen mit nichteinheitlichem Speicherzugriff unter Verwendung von OpenMP. Diploma thesis, Georg-Simon-Ohm University of Applied Sciences Nuremberg, 2007. URL

Comment: This thesis is in German, but the poster (also downloadable from the Web site) is in English.

[C101] C. Terboven and D. an Mey. OpenMP and C++. In: Proceedings of IWOMP2006 — International Workshop on OpenMP, Reims, France, June 12–15, 2006. DOI: 10.1007/978-3-540-68555-5_25. Talk at ParCo’07: URL

Abstract: In this paper we present our experiences parallelizing the C++ programs DROPS and FIRE with OpenMP. Though the OpenMP specification includes C++, several shortcomings and missing features can be noticed in both the current OpenMP compilers and the specification. We propose solutions of how to overcome these problems and formulate wishes for the future OpenMP specification 3.0.

Comment: The final OpenMP 3.0 specification has been published in 2008.

[C102] British Standards Institute. The C++ Standard: Incorporating Technical Corrigendum 1: BS ISO (John Wiley & Sons, New York, London, Sydney), 2nd ed., 2003. ISBN 0470846747.

[C103] M. H. Austern. What are allocators good for? C/C++ Users Journal, December 2000. URL

Abstract: Most of us who use the C++ Standard library tend to forget about allocators, those mysterious things specified by default template parameters for STL containers. In most situations you will not need to call an allocator explicitly or write one of your own. But there are occasions when you might want to substitute your own custom allocator for the default version, for example, to allocate objects from a special memory pool. In this column Matt Austern discusses what you can use allocators for and how you can define your own.

Vendor-specific information and documentation

Comment: The vendors’ manuals and optimization guides are constantly in a state of flux, since they have to reflect the latest hardware and software developments. Sometimes they are only available for customers.

[V104] Intel 64 and IA-32 Architectures Optimization Reference Manual (Intel Press), 2009. URL

[V105] Software Optimization Guide for AMD64 Processors (AMD), 2005. URL

[V106] Software Optimization Guide for AMD Family 10h and 12h Processors (AMD), 2010. URL

[V107] A. J. C. Bik. The Software Vectorization Handbook: Applying Intel Multimedia Extensions for Maximum Performance (Intel Press), 2004. ISBN 978-0974364926.

[V108] Hyper-Threading technology. Intel Technology Journal 6(1), (2002) 1–66. ISSN 1535766X. URL

Comment: This is still a good read, although some details have changed over the years.

[V109] R. Gerber and A. Binstock. Programming with Hyper-Threading Technology (Intel Press), 2004. ISBN 0971786143.

[V110] SUPER-UX Performance Tuning Guide (NEC Corporation), 2006.

[V111] Optimizing Applications on Cray X1 Series Systems (Cray Inc.), 2007. URL

Comment: The full set of Cray’s hardware and software documentation can be found at

[V112] Intel C++ intrinsics reference, 2007. URL

[V113] W. A. Triebel, J. Bissell and R. Booth. Programming Itanium-Based Systems (Intel Press), 2001. ISBN 978-0970284624.

[V114] N. Adiga et al. An overview of the BlueGene/L supercomputer. In: SC ’02: Proceedings of the 2002 ACM/IEEE conference on Supercomputing (IEEE Computer Society Press, Los Alamitos, CA, USA), 1–22. URL

Abstract: This paper gives an overview of the BlueGene/L Supercomputer. This is a jointly funded research partnership between IBM and the Lawrence Livermore National Laboratory as part of the United States Department of Energy ASCI Advanced Architecture Research Program. Application performance and scaling studies have recently been initiated with partners at a number of academic and government institutions, including the San Diego Supercomputer Center and the California Institute of Technology. This massively parallel system of 65,536 nodes is based on a new architecture that exploits system-on-a-chip technology to deliver target peak processing power of 360 teraFLOPS (trillion floating-point operations per second). The machine is scheduled to be operational in the 2004–2005 time frame, at price/performance and power consumption/performance targets unobtainable with conventional architectures.

[V115] IBM Journal of Research and Development staff. Overview of the IBM Blue Gene/P project. IBM J. Res. Dev. 52(1/2), (2008) 199–220. ISSN 00188646. URL

Abstract: On June 26, 2007, IBM announced the Blue Gene/P system as the leading offering in its massively parallel Blue Gene supercomputer line, succeeding the Blue Gene/L system. The Blue Gene/P system is designed to scale to at least 262,144 quad-processor nodes, with a peak performance of 3.56 petaflops. More significantly, the Blue Gene/P system enables this unprecedented scaling via architectural and design choices that maximize performance per watt, performance per square foot, and mean time between failures. This paper describes our vision of this petascale system, that is, a system capable of delivering more than a quadrillion (1015) floating-point operations per second. We also provide an overview of the system architecture, packaging, system software, and initial benchmark results.

[V116] C. Sosa and B. Knudson. IBM System Blue Gene Solution: Blue Gene/P Application Development. IBM Redbooks. 2009. ISBN 978-0738433332. URL

Abstract: This IBM Redbooks publication is one in a series of IBM books written specifically for the IBM System Blue Gene/P Solution. The Blue Gene/P system is the second generation of a massively parallel supercomputer from IBM in the IBM System Blue Gene Solution series. In this book, we provide an overview of the application development environment for the Blue Gene/P system. We intend to help programmers understand the requirements to develop applications on this high-performance massively parallel supercomputer.

In this book, we explain instances where the Blue Gene/P system is unique in its programming environment. We also attempt to look at the differences between the IBM System Blue Gene/L Solution and the Blue Gene/P Solution. In this book, we do not delve into great depth about the technologies that are commonly used in the supercomputing industry, such as Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), nor do we try to teach parallel programming. References are provided in those instances for you to find more information if necessary.

Prior to reading this book, you must have a strong background in high-performance computing (HPC) programming. The high-level programming languages that we use throughout this book are C/C++ and Fortran95. Previous experience using the Blue Gene/L system can help you better understand some concepts in this book that we do not extensively discuss. However, several IBM Redbooks publications about the Blue Gene/L system are available for you to obtain general information about the Blue Gene/L system. We recommend that you refer to “IBM Redbooks” on page 371 for a list of those publications.

[V117] Cray XT5 Supercomputer. URL

Comment: The Cray XT line has in the meantime been developed further. See for more information.

Web sites and online resources

[W118] Standard Performance Evaluation Corporation. URL

[W119] J. D. McCalpin. STREAM: Sustainable memory bandwidth in high performance computers. Tech. rep., University of Virginia, Charlottesville, VA, 1991–2007. A continually updated technical report. URL

[W120] J. Treibig. LIKWID: Linux tools to support programmers in developing high performance multi-threaded programs. URL

[W121] Top500 supercomputer sites. URL

[W122] HPC Challenge Benchmark. URL

[W123] A. J. van der Steen and J. Dongarra. Overview of recent supercomputers, 2008. URL

Abstract: In this report we give an overview of high-performance computers which are currently available or will become available within a short time frame from vendors; no attempt is made to list all machines that are still in the development phase. The machines are described according to their macro-architectural class. Shared and distributed-memory SIMD and MIMD machines are discerned. The information about each machine is kept as compact as possible. Moreover, no attempt is made to quote price information as this is often even more elusive than the performance of a system. In addition, some general information about high-performance computer architectures and the various processors and communication networks employed in these systems is given in order to better appreciate the systems information given in this report.

Comment: This report is continuously updated. See the next references for more current versions.

[W123a] A. J. van der Steen. Overview of recent supercomputers, 2009. URL, PDF version at: URL

[W123b] A. J. van der Steen. Overview of recent supercomputers, 2010. PDF

[W124] Intel MPI benchmarks. URL

[W125] MPICH2 homepage. URL

[W126] OpenMPI: A high performance message passing library. URL

[W127] Intel MPI library. URL

[W128] MPI forum. URL

Computer history

[H129] R. Rojas and U. Hashagen (eds.). The First Computers: History and Architectures (MIT Press, Cambridge, MA, USA), 2002. ISBN 0262681374.

[H130] K. Zuse. The Computer — My Life (Springer), 1993. ISBN 978-3540564539.

[H131] P. E. Ceruzzi. A History of Modern Computing (MIT Press), 2nd ed., 2003. ISBN 978-0262532037.


[132] S. E. Raasch and S. K. Reinhardt. The impact of resource partitioning on SMT processors. In: International Conference on Parallel Architectures and Compilation Techniques, PACT 2003 (IEEE Computer Society, Los Alamitos, CA, USA). ISSN 1089-795X, 15. DOI: 10.1109/PACT.2003.1237998

Abstract: Simultaneous multithreading (SMT) increases processor throughput by multiplexing resources among several threads. Despite the commercial availability of SMT processors, several aspects of this resource sharing are not well understood. For example, academic SMT studies typically assume that resources are shared dynamically, while industrial designs tend to divide resources statically among threads. This study seeks to quantify the performance impact of resource partitioning policies in SMT machines, focusing on the execution portion of the pipeline. We find that for storage resources, such as the instruction queue and reorder buffer, statically allocating an equal portion to each thread provides good performance, in part by avoiding starvation. The enforced fairness provided by this partitioning obviates sophisticated fetch policies to a large extent. SMT’s potential ability to allocate storage resources dynamically across threads does not appear to be of significant benefit. In contrast, static division of issue bandwidth has a negative impact on throughput. SMT’s ability to multiplex bursty execution streams dynamically onto shared function units contributes to its overall throughput. Finally, we apply these insights to SMT support in clustered architectures. Assigning threads to separate clusters eliminates inter-cluster communication; however, in some circumstances, the resulting partitioning of issue bandwidth cancels out the performance benefit of eliminating communication.

[133] N. Anastopoulos and N. Koziris. Facilitating efficient synchronization of asymmetric threads on hyper-threaded processors. In: IEEE International Symposium on Parallel and Distributed Processing (IPDPS) 2008. ISSN 1530-2075, 1–8. DOI: 10.1109/IPDPS.2008.4536358

Abstract: So far, the privileged instructions MONITOR and MWAIT introduced with Intel Prescott core, have been used mostly for inter-thread synchronization in operating systems code. In a hyper-threaded processor, these instructions offer a “performance-optimized” way for threads involved in synchronization events to wait on a condition. In this work, we explore the potential of using these instructions for synchronizing application threads that execute on hyper-threaded processors, and are characterized by workload asymmetry. Initially, we propose a framework through which one can use MONITOR/MWAIT to build condition wait and notification primitives, with minimal kernel involvement. Then, we evaluate the efficiency of these primitives in a bottom-up manner: at first, we quantify certain performance aspects of the primitives that reflect the execution model under consideration, such as resource consumption and responsiveness, and we compare them against other commonly used implementations. As a further step, we use our primitives to build synchronization barriers. Again, we examine the same performance issues as before, and using a pseudo-benchmark we evaluate the efficiency of our implementation for fine-grained inter-thread synchronization. In terms of throughput, our barriers yielded 12% better performance on average compared to Pthreads, and 26% compared to a spin-loops-based implementation, for varying levels of threads asymmetry. Finally, we test our barriers in a real-world scenario, and specifically, in applying thread-level Speculative Precomputation on four applications. For this multithreaded execution scheme, our implementation provided up to 7% better performance compared to Pthreads, and up to 40% compared to spin-loops-based barriers.

[134] J. D. McCalpin. Memory bandwidth and machine balance in current high performance computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, December 1995. URL

Abstract: The ratio of CPU speed to memory speed in current high-performance computers is growing rapidly, with significant implications for the design and implementation of algorithms in scientific computing. I present the results of a broad survey of memory bandwidth and machine balance for a large variety of current computers, including uniprocessors, vector processors, shared-memory systems, and distributed-memory systems. The results are analyzed in terms of the sustainable data transfer rates for uncached unit-stride vector operations for each machine, and for each class.

[135] D. Monniaux. The pitfalls of verifying floating-point computations. ACM Trans. Program. Lang. Syst. 30(3), (2008) 1–41. ISSN 0164-0925. URL DOI: 10.1145/1353445.1353446

Abstract: Current critical systems commonly use a lot of floating-point computations, and thus the testing or static analysis of programs containing floating-point operators has become a priority. However, correctly defining the semantics of common implementations of floating-point is tricky, because semantics may change with many factors beyond source-code level, such as choices made by compilers. We here give concrete examples of problems that can appear and solutions to implement in analysis software.

[136] M. Bull. Measuring synchronization and scheduling overheads in OpenMP. In: First European Workshop on OpenMP — EWOMP99, Lund University, Lund, Sweden, Sep 30–Oct 1, 1999. URL

Abstract: Overheads due to synchronisation and loop scheduling are an important factor in determining the performance of shared memory parallel programs. We present a set of benchmarks to measure these classes of overhead for language constructs in OpenMP. Results are presented for three different hardware platforms, each with its own implementation of OpenMP. Significant differences are observed, which suggest possible means of improving performance.

[137] R. Thakur, R. Rabenseifner and W. Gropp. Optimization of collective communication operations in MPICH. Int. J. High Perform. Comp. Appl. 19(1), (2005) 49–66. DOI: 10.1177/1094342005051521

Abstract: We describe our work on improving the performance of collective communication operations in MPICH for clusters connected by switched networks. For each collective operation, we use multiple algorithms depending on the message size, with the goal of minimizing latency for short messages and minimizing bandwidth use for long messages. Although we have implemented new algorithms for all MPI (Message Passing Interface) collective operations, because of limited space we describe only the algorithms for allgather, broadcast, all-to-all, reduce-scatter, reduce, and allreduce. Performance results on a Myrinet-connected Linux cluster and an IBM SP indicate that, in all cases, the new algorithms significantly outperform the old algorithms used in MPICH on the Myrinet cluster, and, in many cases, they outperform the algorithms used in IBM’s MPI on the SP. We also explore in further detail the optimization of two of the most commonly used collective operations, allreduce and reduce, particularly for long messages and nonpower-of-two numbers of processes. The optimized algorithms for these operations perform several times better than the native algorithms on a Myrinet cluster, IBM SP, and Cray T3E. Our results indicate that to achieve the best performance for a collective communication operation, one needs to use a number of different algorithms and select the right algorithm for a particular message size and number of processes.

[138] B. Goglin. High Throughput Intra-Node MPI Communication with Open-MX. In: Proceedings of the 17th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP2009) (IEEE Computer Society Press, Weimar, Germany), 173–180. URL DOI: 10.1109/PDP.2009.20

Abstract: The increasing number of cores per node in high-performance computing requires an efficient intra-node MPI communication subsystem. Most existing MPI implementations rely on two copies across a shared memory-mapped file. Open-MX offers a single-copy mechanism that is tightly integrated in its regular communication stack, making it transparently available to the MX backend of many MPI layers. We describe this implementation and its offloaded copy backend using I/OAT hardware. Memory pinning requirements are then discussed, and overlapped pinning is introduced to enable the start of Open-MX intra-node data transfer earlier. Performance evaluation shows that this local communication stack performs better than MPICH2 and Open-MPI for large messages, reaching up to 70% better throughput in micro-benchmarks when using I/OAT copy offload. Thanks to a single-copy being involved, the Open-MX intra-node communication throughput also does not heavily depend on cache sharing between processing cores, making these performance improvements easier to observe in real applications.

[139] A. Kleen. Developer Central Open Source: numactl and libnuma. URL