Performance Engineering for HPC: Implementation, Processes, and Case Studies
June 22, 2017, 09:00-18:00
Frankfurt Marriott, Room “Gold 3”
Workshop co-chairs: Georg Hager (RRZE), Gerhard Wellein (FAU), Matthias S. Müller (RWTH Aachen)
The days of mystic “black-box” performance engineering (PE) of computer programs are gone. Modern tools have entered the scene, endowing developers with an unprecedented level of analysis of code performance. However, as we face more and more complex system architectures, HPC experts have an even more vital role to play when it comes to code optimization and parallelization. Making sense of performance data and taking the right action for the problem at hand are still daunting tasks. Automatic frameworks may provide local solutions but do not deliver deeper insight for long-term performance-aware code development in a universe of increasing hardware diversity and code intricacy. Consequently, computing centers and HPC developer communities provide human assistance to support end users at various levels of sophistication.
This workshop gives an overview of PE activities at computing centers and CSE research communities, highlighting structured, process-oriented approaches. The presentations span a range of topics, from structural issues in providing the right level of service to application programmers conducting actual performance optimizations. The workshop will thus be of wide interest to decision makers, HPC experts, tool developers, and programmers alike.
(click on titles for slides, where available)
| Torsten Hoefler, ETH Zürich | Scientific benchmarking of parallel computer systems |
| Jesus Labarta, Barcelona Supercomputing Center | Performance POP up |
| Christian Bischof, University of Darmstadt | The Hessian competence center for high performance computing |
| Jan Eitzinger, Erlangen Regional Computing Center | Components for practical performance engineering in a computing center environment: The ProPE project |
| Matthias Noack, Zuse Institut Berlin | Performance engineering on Cray XC40 with Xeon Phi KNL |
| Harald Köstler, Chair for System Simulation, FAU Erlangen-Nürnberg | Metaprogramming for CSE Applications meets performance engineering |
| Dirk Pleiter, Jülich Supercomputing Centre | Performance engineering for lattice quantum chromodynamics simulations |
| Robert Henschel, Indiana University | From job submission support to advanced performance tuning of parallel applications, a case study from a university with an open access policy to high performance computing |
| Cyrus Proctor, Texas Advanced Computing Center | Pragmatic performance: A survey of optimization support at the Texas Advanced Computing Center |
| Jens Jägersküpper, German Aerospace (DLR) | Software challenges in high-performance computational fluid dynamics for industrial aircraft design |
Torsten Hoefler (ETH Zürich): Scientific Benchmarking of Parallel Computing Systems
Measuring and reporting performance of parallel computers constitutes the basis for scientific advancement of high-performance computing (HPC). Most scientific reports show performance improvements of new techniques and are thus obliged to ensure reproducibility or at least interpretability. Our investigation of a stratified sample of 120 papers across three top conferences in the field shows that the state of the practice is not sufficient. For example, it is often unclear if reported improvements are in the noise or observed by chance. In addition to distilling best practices from existing work, we propose statistically sound analysis and reporting techniques and simple guidelines for experimental design in parallel computing. We aim to improve the standards of reporting research results and initiate a discussion in the HPC field. A wide adoption of this minimal set of rules will lead to better reproducibility and interpretability of performance results and improve the scientific culture around HPC.
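To make the idea of statistically sound reporting concrete, here is a minimal Python sketch of one common technique: a nonparametric (order-statistic) confidence interval for the median of repeated runtime measurements, which does not assume normally distributed timings. The function name `median_confidence_interval` and the sample runtimes are hypothetical illustrations and are not taken from the talk.

```python
import math
import statistics


def median_confidence_interval(samples, confidence=0.95):
    """Return (low, high), a distribution-free confidence interval for the median.

    Uses order statistics: the number of samples below the true median follows
    a Binomial(n, 0.5) distribution, so suitable ranks of the sorted data bound
    the median with at least the requested confidence.
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    xs = sorted(samples)
    n = len(xs)
    alpha = 1.0 - confidence

    # Find the largest 1-based rank l with P(Bin(n, 0.5) <= l - 1) <= alpha / 2.
    cdf = 0.0
    lower_rank = 0  # 0 means no valid rank exists for this sample size
    for k in range(n + 1):
        cdf += math.comb(n, k) * 0.5 ** n
        if cdf > alpha / 2:
            break
        lower_rank = k + 1

    if lower_rank == 0:
        raise ValueError("too few samples for the requested confidence level")

    upper_rank = n - lower_rank + 1  # symmetric counterpart of lower_rank
    return xs[lower_rank - 1], xs[upper_rank - 1]


if __name__ == "__main__":
    # Hypothetical runtimes in seconds from ten repetitions of one benchmark.
    runtimes = [10.2, 10.4, 10.1, 10.9, 10.3, 10.2, 12.7, 10.3, 10.5, 10.2]
    low, high = median_confidence_interval(runtimes)
    print(f"median = {statistics.median(runtimes)} s, 95% CI = [{low}, {high}] s")
```

Reporting the median together with such an interval, rather than a single best or average time, shows whether a claimed improvement lies outside the measurement noise. The outlier of 12.7 s in the sample data barely affects the result, which is one reason rank-based summaries are often preferred for runtime measurements.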
Jesus Labarta (Barcelona Supercomputing Center): Performance POP up
The POP project aims at providing performance analysis services to a wide community of HPC developers and users. It also aims at helping code developers improve their codes, with a particular interest in leveraging new techniques such as task based models. The talk will present the project and its current status. We will then focus on the performance analysis methodology used in POP. I will present analyses performed within the project and elsewhere showing how the methodology helps better understand and improve how our systems behave.
Jesus Labarta has been a full professor of Computer Architecture at the Technical University of Catalonia (UPC) since 1990. Since 2005 he has been responsible for the Computer Science Research Department within the Barcelona Supercomputing Center (BSC). His major directions of current work relate to performance analysis tools, programming models and resource management. His team distributes the Open Source BSC tools (Paraver and Dimemas) and performs research on increasing the intelligence embedded in the performance analysis tools. He is involved in the development of the OmpSs programming model and its different implementations for SMP, GPUs and cluster platforms.
Christian Bischof (University of Darmstadt): The Hessian Competence Center for High-Performance Computing
The scientists who develop simulation codes are typically experts in their scientific domain, but not experts in computer science. As a result, existing code, as well as newly programmed applications, often offer opportunities for improvement in serial or parallel performance. As parallel computing architectures are ubiquitous today, the performance of these codes on parallel platforms is important both from a perspective of scientific competitiveness and usage efficiency of parallel computing systems, from multicore compute servers to HPC compute systems.
To enable scientists to develop efficient parallel programs, thus ensuring responsible usage of expensive HPC facilities, the Hessian competence center for high-performance computing (www.hpc-hessen.de) was founded as a distributed organization encompassing the five Hessian universities at Darmstadt, Frankfurt, Gießen, Marburg, and Kassel. Under its umbrella, educational offerings, training and consulting for parallel programming and performance engineering are provided. In this talk, we give an overview of the activities of the center and its reception in the scientific community, as well as some highlights of its impact on the scientific software infrastructure.
Born in 1960 in Aschaffenburg, Christian Bischof began studies in Mathematics and Computer Science at the University of Würzburg. He moved to Cornell University, Ithaca, NY, supported by a Fulbright scholarship, where he received a Ph.D. degree in Computer Science in 1988. After various scientific activities at the Mathematics and Computer Science Division of Argonne National Laboratory, Illinois, he moved to RWTH Aachen University in 1998, and took over the leadership of the computing center and the chair of the Institute for Scientific Computing. Since July 2011, he has been responsible, in a similar setup, for the university computing center (HRZ) and the Institute for Scientific Computing of TU Darmstadt.
Jan Eitzinger (Erlangen Regional Computing Center): Components for practical performance engineering in a computing center environment: The ProPE project
Large HPC systems are expensive, and so is their operation, which makes their efficient use a crucial goal. However, those systems are complex with regard to hardware architectures, network topologies, tool chains and software environments. Particularly in academic computing centers there is a vast variety of applications with very different hardware demands. Furthermore, small- to medium-sized HPC sites tend to have very limited resources for user support and application performance tuning. For them, it is not feasible to manually ensure an efficient use of the systems. The DFG ProPE project is an effort to address critical components for an integrated nationwide Performance Engineering (PE) infrastructure. This involves a process that describes how to systematically handle and, if necessary, delegate a PE project within a network of HPC centers, but it also covers tools for job-specific application performance monitoring that assist the staff in detecting pathological jobs or jobs which expose a significant optimization potential. This talk will give a short overview of the ProPE project with a special focus on process and tools aspects.
Jan Eitzinger (formerly Treibig) holds a PhD in Computer Science from the University of Erlangen. He is now a postdoctoral researcher in the HPC Services group at Erlangen Regional Computing Center (RRZE). His current research revolves around architecture-specific and low-level optimization for current processor architectures, performance modeling on processor and system levels, and programming tools. He is the developer of LIKWID, a collection of lightweight performance tools. In his daily work he is involved in all aspects of user support in High Performance Computing: training, code parallelization, profiling and optimization, and the evaluation of novel computer architectures.
Alexander Reinefeld (Zuse Institut Berlin): Performance engineering on Cray XC40 with Xeon Phi KNL
With the arrival of the latest generation Intel Xeon Phi processor “Knights Landing” (KNL), a powerful 3 Tflops many-core CPU found its way into modern supercomputers. Unfortunately, the vast computing power can only be exploited by applications that are able to fully utilize the AVX-512 vector units and the newly introduced high-bandwidth memory. Performance engineering becomes even trickier when trying to port legacy applications with hybrid MPI + OpenMP codes onto supercomputers with KNL processors. In our talk, we present strategies that have been successfully used to port and optimize existing production codes of the North German Supercomputing Alliance (HLRN) onto our Cray XC40 with KNL processors.
Alexander Reinefeld heads the parallel and distributed computing department at Zuse Institute Berlin (ZIB) and holds a professorship at Humboldt University Berlin. He received a PhD in Computer Science in 1987 from the University of Hamburg, spent two years at the University of Alberta (Canada) and was managing director at the Paderborn Center for Parallel Computing. He co-founded the North German Supercomputing Alliance HLRN, the Global Grid Forum, and the German E-Science initiative D-Grid. His research interests include scalable and fault-tolerant algorithms, distributed data management and innovative computer architecture.
Harald Köstler (Chair for System Simulation, FAU Erlangen-Nürnberg): Software Engineering meets Performance Engineering
Recent advances in computer hardware technology make more and more realistic simulations possible; however, this progress has a price. On the one hand, the growing complexity of the physical and mathematical models requires the development of new and efficient numerical methods. On the other hand, the trend towards heterogeneous and highly parallel architectures increases the programming effort necessary to implement, develop, and maintain these models. These issues can be addressed by providing domain-specific languages that enable the users to formulate their problems on different levels of abstraction. From these formulations, efficient implementations can be generated and optimized automatically since the application domain is known a priori. We will show how performance models can be integrated in a domain-specific compiler and then used to predict the runtime of certain configurations and thus reduce the effort during the optimization process.
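As a rough illustration of how a performance model can rank candidate configurations before any of them is run, the following Python sketch uses a simple roofline-style bound: the predicted runtime is the larger of the time needed for the arithmetic and the time needed for the memory traffic. This is a generic textbook model chosen for illustration, not the model used in the talk, and all names (`Machine`, `KernelConfig`, `predict_runtime`, `pick_fastest`) and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    peak_flops: float     # peak floating-point rate in flop/s (assumed value)
    mem_bandwidth: float  # sustained memory bandwidth in byte/s (assumed value)


@dataclass(frozen=True)
class KernelConfig:
    name: str
    flops: float        # total floating-point operations of this variant
    bytes_moved: float  # total main-memory traffic of this variant in bytes


def predict_runtime(machine, cfg):
    """Roofline-style lower bound on runtime in seconds."""
    compute_time = cfg.flops / machine.peak_flops
    memory_time = cfg.bytes_moved / machine.mem_bandwidth
    # The slower of the two resources limits the kernel.
    return max(compute_time, memory_time)


def pick_fastest(machine, configs):
    """Return the configuration with the smallest predicted runtime."""
    return min(configs, key=lambda cfg: predict_runtime(machine, cfg))


if __name__ == "__main__":
    machine = Machine(peak_flops=1e12, mem_bandwidth=1e11)
    candidates = [
        KernelConfig("recompute", flops=2e12, bytes_moved=1e11),
        KernelConfig("precomputed-table", flops=1e12, bytes_moved=3e11),
    ]
    for cfg in candidates:
        print(cfg.name, predict_runtime(machine, cfg), "s")
    print("predicted best:", pick_fastest(machine, candidates).name)
```

A code generator could evaluate such a model for every variant it is able to emit and benchmark only the most promising ones, which is the sense in which a model reduces optimization effort. Real models are usually more detailed, for example by accounting for cache levels and vectorization.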
Harald Köstler got his Ph.D. in computer science in 2008 on variational models and parallel multigrid methods in medical image processing. In 2014 he finished his habilitation on “Efficient Numerical Algorithms and Software Engineering for High Performance Computing.” Currently, he works at the Chair for System Simulation at the University of Erlangen-Nuremberg in Germany. His research interests include variational methods in imaging, software engineering concepts especially using code generation for simulation software on HPC clusters, multigrid methods, and programming techniques for parallel hardware, especially GPUs.
Dirk Pleiter (Jülich Supercomputing Centre): Performance engineering for Lattice Quantum Chromodynamics simulations
Simulations of the theory of strong interactions, namely Quantum Chromodynamics (QCD), on a lattice require petascale systems today and exascale systems in the future. For typical workloads the performance of a linear solver dominates the overall application performance. In this talk we will report on the expertise that has been built up within this community on optimizing the performance of these solvers. Starting from an analysis of the application performance signatures we will show how good hardware utilization could be obtained on a broad range of architectures.
Prof. Dr. Dirk Pleiter is a research group leader at the Jülich Supercomputing Centre (JSC) and professor of theoretical physics at the University of Regensburg. At JSC he is leading the work on application-oriented technology development. Currently he is principal investigator of the POWER Acceleration and Design Center, a center that is jointly run by IBM, JSC and NVIDIA. He has played a leading role in several projects for developing massively-parallel special purpose computers, including several generations of QPACE.
Robert Henschel (Indiana University): From job submission support to advanced performance tuning of parallel applications, a case study from a university with an open access policy to high performance computing
In 1997, Indiana University embarked on an initiative to increase IU’s local availability of supercomputers and increase its use of local and nationally-funded supercomputer resources. This was considered by many at the time to be laughable. Today, the Indiana University Pervasive Technology Institute (IU-PTI) is well regarded as a local service provider and a national leader in HPC. HPC use and resources serve to attract and retain top-quality faculty members, and support cutting edge research by faculty, staff researchers, and graduate students. Well over 10% of the faculty, staff, and student researchers at IU make some use of IU supercomputers, and more than half of the grant money that comes to Indiana University from the federal government goes to Principal Investigator / Co-PI groups in which at least one person makes some use of IU advanced computing systems. Over more than 20 years of growth and evolution in our high performance computing systems, we have developed approaches to supporting high quality performance engineering for our advanced HPC users, and creating easy-to-use interfaces for less sophisticated users. The key to growth over time in the use of effective performance engineering techniques has been to tolerate the use of inefficient codes as a starting point and to help researchers improve their codes. Sometimes such efforts have had national and international impact, such as IU-PTI performance engineers reworking substantial sections of the Trinity RNA sequencing code, which reduced execution time to one eighth of the time required before optimization. Sometimes performance gains seem quite modest from an engineering standpoint, but increased ease of use through high quality user interfaces results in significant improvements in user productivity.
Perhaps most significantly, over the course of 20 years, a concerted effort to support use of HPC systems with excellent HPC performance engineering support has transformed the way the academic community of an entire university thinks about HPC and uses HPC in their own activities.
Robert Henschel received his M.Sc. from Technische Universität Dresden, Germany. He joined Indiana University in 2008, first as the manager of the Scientific Applications Group and since 2016 as the director for Science Community Tools. In this role he is responsible for leading teams that provide advanced scientific applications to researchers at Indiana University and the IU School of Medicine. As the chair of the High Performance Group (HPG) at Standard Performance Evaluation Corporation (SPEC) he works on developing benchmarks for HPC systems.
Cyrus Proctor (Texas Advanced Computing Center): Pragmatic Performance: A Survey of Optimization Support at the Texas Advanced Computing Center
Researchers from dozens of fields of science employ HPC to gain insight into problems of a scale intractable by other means. As the number, complexity, and specialization of hardware solutions grow, so too does the general knowledge gap to sustain satisfactory performance. At the Texas Advanced Computing Center, HPC consultants engage with a diverse user base and work to provide the most efficient compute cycles using (almost) any means necessary. This talk outlines practical support approaches taken when serving a community of thousands of researchers who coexist in a fragile, shared computing environment plagued with a myriad of potential performance pitfalls. The foundations for practical performance engineering begin with tapping into a spectrum of staff expertise, calling upon collections of in-house and commercial tools, plus providing numerous training and support outreach opportunities for users at all levels of sophistication. Our approach empowers users by providing the most common strategies and techniques to identify and rectify low-hanging performance issues while also being able to cater to capability-class researchers who seek the highest levels of performance.
Cyrus joined the Performance & Architecture Group at the Texas Advanced Computing Center (TACC) as a Research Associate in 2014. Currently, he focuses on code optimization and parallelization techniques, software package design best practices, multi-factor authentication infrastructure, and system performance analysis and monitoring frameworks. Prior to his contributions at TACC, Cyrus worked as a research assistant professor of nuclear engineering at the University of South Carolina and as a postdoctoral fellow at his alma mater, North Carolina State University. His research interests have centered on the verification and validation of large-scale radiation transport computational models. His experience brings together a unique nuclear focus with a blend of engineering, computer science, physics, and applied mathematics. Developing codes for complex physical systems, and running them efficiently at scale, has been an integral part of his education and career for nearly a decade.
Jens Jägersküpper (German Aerospace Center): Software challenges in high-performance computational fluid dynamics for industrial aircraft design
After a brief introduction to what “Flucs”, DLR’s Flexible Unstructured CFD Software, is supposed to simulate, the conflicts that arise when doing performance engineering for a code that is primarily designed for testability, maintainability, and extensibility are discussed. With respect to performance engineering, the case of a code that is neither compute- nor bandwidth-bound, but latency-bound is highlighted.
Dr. Jens Jägersküpper is a staff scientist at the Center for Computer Applications in AeroSpace Science and Engineering (C²A²S²E) of the German Aerospace Center (DLR), which he joined in 2008. As a core developer of the next-generation Computational Fluid Dynamics (CFD) software “FLUCS”, he currently concentrates on High-Performance Computing (HPC) aspects of simulation software. He studied computer science at Dortmund Technical University, Germany, where he also received his PhD (Dr. rer. nat.) in 2006.