Georg Hager's Blog

Random thoughts on High Performance Computing

ISC15 Workshop “Performance Modeling: Methods and Applications”

at ISC High Performance, July 12-16, 2015, Frankfurt, Germany

SPPEXA DFG Priority Programme 1648

Understanding the performance characteristics of computer programs has been the subject of intense research over several decades. The growing heterogeneity, parallelism, and general complexity of computer architectures have made this venture more and more challenging. Numerous approaches provide insights into different aspects of performance; they range from resource-based analytical modeling of single loops to complex statistical models or machine learning concepts that deal with whole groups of full applications in production environments. In recent years, the energy consumption aspects of computing have received attention due to rising infrastructure costs and operational boundary conditions, adding even more complexity to the model-building process.

This workshop aims to give an overview of the state of the art in performance modeling of computer systems, focusing on methods and blueprints instead of bleeding-edge research. The invited speakers approach the subject from several very different angles.

Link to ISC15 Agenda Planner – Performance Modeling Workshop

 

Time Speaker Title & link to slides
9:00-10:00 Keynote: Bill Gropp
University of Illinois
Engineering for Performance in High Performance Computing
10:00-10:30 Martin Schulz
Lawrence Livermore National Laboratory
Modeling Performance Under a Power Bound: A Short Tour of the Near Future
10:30-11:00 Jeffrey S. Vetter
Oak Ridge National Laboratory & Georgia Institute of Technology
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with Holistic Performance Prediction
11:00-11:30 Coffee
11:30-12:00 Laura C. Carrington
San Diego Supercomputer Center
Putting a Dent into the Memory Wall: Combined Power-Performance Modeling for Memory Systems
12:00-12:30 Nathan R. Tallent
Pacific Northwest National Laboratory
Palm: Easing the Burden of Analytical Performance Modeling
12:30-13:00 Felix Wolf
Technical University of Darmstadt
Mass-Producing Insightful Performance Models of Parallel Applications
13:00-14:00 Lunch
14:00-14:30 Dimitrios S. Nikolopoulos
Queen’s University Belfast
Server Resource Provisioning for Real-Time Analytics Using Iso-Metrics
14:30-15:00 Alexander Grebhahn
University of Passau
Performance-Influence Models
15:00-15:30 Robert W. Numrich
City University of New York
Computational time, energy, power and action
15:30-16:00 Rich Vuduc
Georgia Institute of Technology
Actively analyzing performance
16:00-16:30 Coffee
16:30-17:00 Brian Van Straalen
Lawrence Berkeley National Laboratory
The Empirical Roofline Toolkit
17:00-17:30 Georg Hager
Erlangen Regional Computing Center
Performance engineering via analytical models
17:30-18:00 Speakers & attendees Survey, open discussion, closing remarks

 

Detailed speaker info and abstracts:

  • Keynote speaker: Bill Gropp
    Director of the Parallel Computing Institute
    University of Illinois Urbana-Champaign
    Title: Engineering for Performance in High Performance Computing
    Abstract: Achieving good performance on any system requires balancing many competing factors.  More than just minimizing communication (or floating point or memory motion), for high end systems the goal is to achieve the lowest cost solution.  And while cost is typically considered in terms of time to solution, other metrics, including total energy consumed, are likely to be important in the future.
    Making effective use of the next generations of extreme scale systems requires rethinking the algorithms, the programming models, and the development process.  This talk will discuss these challenges and argue that performance modeling, combined with a more dynamic and adaptive style of programming, will be necessary for extreme scale systems.
    Speaker Bio: William Gropp received his B.S. in Mathematics from Case Western Reserve University in 1977, a MS in Physics from the University of Washington in 1978, and a Ph.D. in Computer Science from Stanford in 1982. In 2013, he was named the Thomas M. Siebel Chair in Computer Science at the University of Illinois at Urbana-Champaign. His research interests are in parallel computing, software for scientific computing, and numerical methods for partial differential equations. He has played a major role in the development of the MPI message-passing standard. He is co-author of the most widely used implementation of MPI, MPICH, and was involved in the MPI Forum as a chapter author for MPI-1, MPI-2, and MPI-3. He is also one of the designers of the PETSc parallel numerical library, and has developed efficient and scalable parallel algorithms for the solution of linear and nonlinear equations. Gropp is a Fellow of ACM, IEEE, and SIAM, and a member of the National Academy of Engineering. He received the Sidney Fernbach Award from the IEEE Computer Society in 2008, the IEEE TCSC Award for Excellence in Scalable Computing in 2010, and the SIAM-SC Career Award in 2014.
  • Nathan R. Tallent
    Pacific Northwest National Laboratory (PNNL)
    Title: Palm: Easing the Burden of Analytical Performance Modeling
    Abstract: Application models are hard to generate. Furthermore, models are frequently expressed in forms that are hard to distribute and validate. We created Palm (Performance and Architecture Lab Modeling tool) as a framework for tackling these problems. The modeler begins with Palm’s source code modeling annotation language. Not only does the modeling language divide the modeling task into subproblems, it formally links an application’s source code with its model. This link is important because it enables a model to capture behavior with respect to the application’s structure. Palm defines rules for generating models using the static and dynamic relationship of annotations. The model that Palm generates is an executable program whose constituent parts correspond to the modeled application.
    Speaker Bio: Nathan Tallent is an HPC computer scientist in the Advanced Computing, Mathematics, and Data Division at Pacific Northwest National Laboratory. His research is at the intersection of tools, application modeling, performance analysis, and parallelism. He currently works on techniques for performance measurement and analysis, model generation and representation, and dynamic program analysis.
  • Dimitrios S. Nikolopoulos
    Professor and Director of Research
    Chair in High Performance and Distributed Computing
    School of Electronics, Electrical Engineering and Computer Science
    Queen’s University Belfast, Northern Ireland
    Title: Server Resource Provisioning for Real-Time Analytics Using Iso-Metrics
    Abstract: This talk explores system metrics and assessment methods that tackle the diversity in available HPC architectures for the emerging real-time analytical workloads. We discuss architectural resource provisioning based on quality-of-service metrics and explore in more depth how energy-aware resource provisioning materialises in a diversity of architectures, ranging from microserver SoCs to high-end accelerators.
    Speaker Bio: Dimitrios S. Nikolopoulos is Professor in the School of EEECS at Queen’s University Belfast. He holds the Chair in High Performance and Distributed Computing and directs the HPDC Research Cluster. His research explores scalable computing systems for data-driven applications and new computing paradigms at the limits of performance, power and reliability. Dimitrios’s many accolades include the NSF CAREER Award, DOE CAREER Award, IBM Faculty Award, and Best Paper Awards from leading IEEE and ACM HPC conferences, including SC, PPoPP, and IPDPS. His research has been supported with over £23 million (£8.3 million as PI) of highly competitive research funding from NSF, DOE, EPSRC, RAEng, EU and the private sector. He regularly teaches modules in computer organisation, parallel computing, and systems programming. Dimitrios is a Fellow of the British Computer Society, Senior Member of the IEEE and Senior Member of the ACM. He earned a PhD (2000) in Computer Engineering and Informatics from the University of Patras.
  • Brian Van Straalen
    Computational Research Division
    Lawrence Berkeley National Laboratory
    Title: The Empirical Roofline Toolkit (Brian Van Straalen, Leonid Oliker, Sam Williams, Terry Ligocki, Wyatt Spear)
    Abstract: As the complexity of high performance computing and its associated software infrastructure continues to grow, it has become increasingly urgent to develop accurate and usable performance modeling capabilities that identify fundamental bottlenecks and achievable optimized code performance. To that end, we are developing the Empirical Roofline Tool (ERT), whose goal is to build on the successful Roofline Model [Williams:2009] and deliver an automatic engine that facilitates the extraction of key architectural characterizations. Recent work has explored the development of these micro-benchmarks to measure key performance limits within the context of sustained memory bandwidth and attainable computational rate, across a broad variety of modern platforms, including multicore, manycore, and accelerated technologies. Past efforts to model performance using vendor specifications have been tedious and error-prone. Additionally, we are extending the original serial-driven roofline model to quantify the impact of increasingly-dominating parallel overheads, such as kernel launch time and synchronization costs. Our goals include publicly releasing the Empirical Roofline Tool coupled with an integrated visualizer, as well as enabling database support to track the performance characteristics of evolving architectures and software stack implementations. Overall, we believe ERT has the potential to effectively guide important coding tradeoff design decisions to maximize programmer and performance efficiency.
    Speaker Bio: Brian Van Straalen received his BASc Mechanical Engineering in 1993 and MMath in Applied Mathematics in 1995 from University of Waterloo.  He has been working in the area of scientific computing since he was an undergraduate. He worked with Advanced Scientific Computing Ltd. developing CFD codes written largely in Fortran 77 running on VAX and UNIX workstations.  He then worked as part of the thermal modeling group with Bell Northern Research.  His Master’s thesis work was in the area of a posteriori error estimation for Navier-Stokes equations, which is an area that is still relevant to Department of Energy scientific computing.   He worked for Beam Technologies developing the PDESolve package: a combined symbolic manipulation package and finite element solver, running in parallel on some of the earliest NSF and DOE MPP parallel computers.  He came to LBNL in 1998 to work with Phil Colella and start up the Chombo Project, now in its 13th year of development. He is currently working on his PhD in the Computer Science department at UC Berkeley.
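In its basic form, the Roofline model that ERT characterizes machines for reduces to one line of arithmetic: attainable performance is the minimum of the compute ceiling and the memory ceiling scaled by arithmetic intensity. A minimal sketch, with illustrative machine numbers rather than ERT measurements:

```python
def roofline(peak_gflops, bandwidth_gbs, intensity_flops_per_byte):
    """Attainable performance (GFlop/s) for a kernel with the given
    arithmetic intensity: the minimum of the compute and memory ceilings."""
    return min(peak_gflops, bandwidth_gbs * intensity_flops_per_byte)

# Hypothetical machine: 500 GFlop/s peak, 50 GB/s sustained bandwidth.
PEAK, BW = 500.0, 50.0

# A STREAM-triad-like kernel (a[i] = b[i] + s*c[i]) does 2 flops per
# 24 bytes moved (read b and c, write a): intensity ~0.083, memory bound.
triad = roofline(PEAK, BW, 2.0 / 24.0)

# A well-blocked dense matrix multiply has high intensity: compute bound.
dgemm = roofline(PEAK, BW, 30.0)

print(f"triad: {triad:.2f} GFlop/s, dgemm: {dgemm:.2f} GFlop/s")
```

ERT's contribution is measuring the two ceilings empirically instead of trusting vendor specifications.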
  • Felix Wolf
    Parallel Programming Group
    Technische Universität Darmstadt, Germany
    Title: Mass-Producing Insightful Performance Models of Parallel Applications
    Abstract: Many parallel applications suffer from latent performance limitations that may prevent them from scaling to larger machine sizes. Often, such scalability bugs manifest themselves only when an attempt to scale the code is actually being made – a point where remediation can be difficult. However, creating performance models that would allow such issues to be pinpointed earlier is so laborious that application developers attempt it at most for a few selected kernels, running the risk of missing harmful bottlenecks. By automatically generating empirical performance models for each function in the program, we make this powerful methodology easier to use and expand its coverage. This presentation gives an overview of the method, illustrates its usage in a range of examples, and assesses its potential.
    Speaker Bio: Felix Wolf is a full professor at the Department of Computer Science of Technische Universität Darmstadt and head of the university’s Laboratory for Parallel Programming. He specializes in software and tools for parallel computers. After receiving his Ph.D. degree from RWTH Aachen University in 2003, he worked more than two years as a postdoc at the Innovative Computing Laboratory of the University of Tennessee. In 2005, he was appointed research group leader at the Jülich Supercomputing Center. During this period, he launched the Scalasca project, a widely used open-source performance analysis tool, together with Dr. Bernd Mohr. Moreover, he initiated the Virtual Institute – High Productivity Supercomputing, an international initiative of academic HPC programming-tool builders aimed at the enhancement, integration, and deployment of their products. From 2009 until recently, Prof. Wolf was head of the Laboratory for Parallel Programming at the German Research School for Simulation Sciences.
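A hedged sketch of the underlying idea: fit each code region's measured runtimes against a small space of candidate scaling terms and keep the best fit. The real method uses a richer model normal form and cross-validation; the candidate sets and function below are illustrative only.

```python
import math

def fit_scaling_model(measurements, exponents=(0, 0.5, 1, 1.5, 2),
                      log_powers=(0, 1, 2)):
    """Fit t(p) ~= c1 + c2 * p^i * log2(p)^j over candidate (i, j) pairs
    and return (residual, i, j, c1, c2) for the best-fitting term."""
    best = None
    ys = [t for _, t in measurements]
    n = len(measurements)
    for i in exponents:
        for j in log_powers:
            xs = [p**i * math.log2(p)**j for p, _ in measurements]
            # 2x2 normal equations for the least-squares fit of [c1, c2]
            sx, sxx = sum(xs), sum(x * x for x in xs)
            sy, sxy = sum(ys), sum(x * y for x, y in zip(xs, ys))
            det = n * sxx - sx * sx
            if abs(det) < 1e-12:       # degenerate basis (e.g. i = j = 0)
                continue
            c2 = (n * sxy - sx * sy) / det
            c1 = (sy - c2 * sx) / n
            resid = sum((c1 + c2 * x - y)**2 for x, y in zip(xs, ys))
            if best is None or resid < best[0]:
                best = (resid, i, j, c1, c2)
    return best

# Synthetic measurements following t(p) = 3 + 0.01 * p * log2(p).
data = [(p, 3 + 0.01 * p * math.log2(p)) for p in (2, 4, 8, 16, 32, 64)]
resid, i, j, c1, c2 = fit_scaling_model(data)
print(f"best term: p^{i} * log2(p)^{j}, c1={c1:.3f}, c2={c2:.4f}")
```

Run per function over a whole code base, this is what turns performance modeling from hand craftsmanship into mass production.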
  • Georg Hager
    Erlangen Regional Computing Center
    University of Erlangen-Nuremberg, Germany
    Title: Performance engineering via analytical models
    Abstract:  We introduce the “Execution-Cache-Memory” (ECM) performance model as a refinement of the Roofline model to describe single-core and single-socket performance behavior. Based on a thorough code and data transfer analysis, the ECM model can provide valuable insights into performance bottlenecks and optimization opportunities. Nevertheless, it is still simple enough to be done with “pencil and paper.” We show early concepts for automating the construction of the ECM model and combine it with a simple multicore power model to get a fresh view on the energy aspects of computation.
    Speaker Bio: Georg Hager holds a Ph.D. in Computational Physics from the University of Greifswald. He is a senior researcher in the HPC Services group at Erlangen Regional Computing Center (RRZE) at the University of Erlangen-Nuremberg. Recent research includes architecture-specific optimization strategies for current microprocessors, performance engineering of scientific codes on chip and system levels, and special topics in shared memory and hybrid programming. His daily work encompasses all aspects of user support in High Performance Computing like tutorials and training, code parallelization, profiling and optimization, and the assessment of novel computer architectures and tools. His textbook “Introduction to High Performance Computing for Scientists and Engineers” is recommended or required reading in many HPC-related lectures and courses worldwide. In his teaching activities he puts a strong focus on performance modeling techniques that lead to a better understanding of the interaction of program code with the hardware.
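As a rough illustration of the "pencil and paper" arithmetic the ECM model involves, here is a naive non-overlapping variant with hypothetical cycle counts; real inputs come from code analysis and documented machine properties, and actual ECM formulations treat overlap in a more nuanced way.

```python
# All times are core cycles per cache line (8 double-precision iterations)
# of a streaming kernel. The numbers are invented for illustration.

def ecm_cycles(t_ol, t_nol, transfer_cycles):
    """Naive non-overlapping ECM prediction: in-core work that can overlap
    with data transfers (t_ol) runs concurrently with the critical path of
    non-overlapping execution (t_nol) plus all inter-level transfers."""
    return max(t_ol, t_nol + sum(transfer_cycles))

# Hypothetical triad-like kernel:
#   t_ol  = 4 cycles (arithmetic, overlaps with transfers)
#   t_nol = 8 cycles (loads/stores from L1)
#   L1<->L2, L2<->L3, L3<->memory transfers: 6, 8, 18 cycles
in_l1 = ecm_cycles(4, 8, [])            # working set fits in L1
in_mem = ecm_cycles(4, 8, [6, 8, 18])   # data streams from main memory

print(in_l1, in_mem)  # 8 cycles vs. 40 cycles per cache line
```

The per-level breakdown is what distinguishes the ECM model from the Roofline model, which sees only a single bandwidth ceiling at a time.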
  • Robert Numrich
    City University of New York
    Title: Computational time, energy, power and action
    Abstract:  We derive a formula for the relationship between electrical energy measured in joules and computational energy measured in floating-point operations. The formula is based on an analogy between electrical systems and computational systems. The analogy extends a previous model based on a Hamiltonian system where energy is conserved. That model did not account for dissipation of heat energy to the external system, perhaps the most important factor in determining the energy efficiency of large-scale systems. Accordingly, we change our focus from a Hamiltonian system to a Lagrangian system where energy is not conserved. The trajectories obtained from the equations of motion satisfy a principle of least action modified to account for the loss of heat energy. From these trajectories, we derive a formula for the relationship between joules and flops. This formula depends on parameters of the electrical system: voltage, resistance, capacitance and inductance, and on parameters of the computational system: computational intensity analogous to voltage, a dashpot parameter analogous to resistance, a spring constant analogous to capacitance, and a computational mass analogous to inductance. For a particular piece of hardware, the formula indicates that the most important factor a programmer must consider to reduce energy consumption for a particular algorithm is to increase the computational intensity, that is, to increase the number of floating-point operations performed per byte of data moved. This result has been the bedrock principle of performance optimization from the very beginning of algorithm design. Our formula quantifies the result with an explicit formula relating hardware electric parameters to software algorithmic parameters.
    Speaker Bio: Robert W. Numrich is Senior Scientist at the High-Performance Computing Center located at the College of Staten Island, City University of New York. His research interests include parallel programming models, parallel numerical algorithms, and performance modeling and analysis. He was previously Senior Research Associate at the Minnesota Supercomputing Institute (MSI) located at the University of Minnesota in Minneapolis. Before MSI, he was Principal Scientist at Cray Research as a member of the Cray2/3 team, the Cray T90 team, the Cray T3D/T3E team, and the Cray X1 team. He was the principal inventor of the one-sided communication model, which became known as the Shmem Library, and of the coarray parallel programming model, which is now part of the official Fortran 2008 language. He is the foremost advocate for the application of dimensional analysis to problems in computer performance analysis.
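The abstract's bottom line, that raising computational intensity lowers energy for fixed work, can be illustrated with a crude linear energy model. The per-operation energies below are assumptions for illustration, not constants from Numrich's Lagrangian derivation.

```python
# Assumed per-operation energies: moving a byte from memory costs far more
# than executing a floating-point operation (illustrative values only).
E_FLOP = 20e-12   # joules per floating-point operation (assumed)
E_BYTE = 150e-12  # joules per byte moved from memory (assumed)

def energy_joules(flops, intensity):
    """Total energy for `flops` operations at `intensity` flops/byte."""
    bytes_moved = flops / intensity
    return flops * E_FLOP + bytes_moved * E_BYTE

work = 1e12  # one Tflop of fixed work
low = energy_joules(work, 0.1)    # memory-bound kernel
high = energy_joules(work, 10.0)  # cache-blocked, high-intensity kernel
print(f"intensity 0.1: {low:.1f} J, intensity 10: {high:.1f} J")
```

The same flop count costs two orders of magnitude less energy when data movement is amortized, which is the classical locality-optimization principle restated in joules.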
  • Laura C. Carrington
    Performance Modeling and Characterization (PMaC) Laboratory
    San Diego Supercomputer Center at University of California, San Diego
    Title: Putting a Dent into the Memory Wall: Combined Power-Performance Modeling for Memory Systems
    Abstract: To deliver the energy efficiency and raw compute throughput necessary to realize Exascale systems, projected designs call for massive numbers of (simple) cores per processor and strict power budgets allocated to each compute node. The unfortunate consequences of such designs are that the memory bandwidth per core will be significantly reduced and some level of power capping has to be enforced on the memory subsystem, thereby reducing its performance. These Exascale system design implications can significantly degrade the performance of memory-intensive HPC workloads. To identify the code regions that are most impacted and to guide the formulation of mitigating solutions, system designers and application developers alike would benefit immensely from a systematic framework that allowed them to identify the types of computations that are sensitive to reduced memory performance and to precisely identify those regions in their code that exhibit this sensitivity. This work introduces such a framework, which utilizes fine-grained application and hardware characterizations to build machine-learning based models that accurately predict performance sensitivity of HPC applications to the reduced memory subsystem performance. We evaluate our framework on several large-scale HPC applications, observing that the performance sensitivity models show an average absolute mean error of less than 5%.
    Speaker Bio: Dr. Laura (Nett) Carrington is an expert in High Performance Computing and director of the Performance, Modeling, and Characterization (PMaC) Lab at the San Diego Supercomputer Center. Her research has focused on HPC benchmarking, workload analysis, application performance modeling, analysis of accelerators (e.g., FPGAs, GPUs, and Xeon Phis) for scientific workloads, tools in performance analysis (e.g., processor and network simulators), and energy-efficient computing.
  • Jeffrey S. Vetter
    Oak Ridge National Laboratory and Georgia Institute of Technology
    Title: Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with Holistic Performance Prediction
    Abstract: Concerns about energy-efficiency and reliability have forced our community to reexamine the full spectrum of architectures, software, and algorithms that constitute our ecosystem. While architectures and programming models have remained relatively stable for almost two decades, new architectural features, such as heterogeneous processing, nonvolatile memory, and optical interconnection networks, will demand that software systems and applications be redesigned so that they expose massive amounts of hierarchical parallelism, carefully orchestrate data movement, and balance concerns over performance, power, resiliency, and productivity. In what DOE has termed “co-design,” teams of architects, software designers, and application scientists are working collectively to realize an integrated solution to these challenges. A key capability of this activity is accurate modeling of performance, power, and resiliency. We have developed the Aspen performance modeling language that allows fast exploration of the holistic design space. Aspen is a domain specific language for structured analytical modeling of applications and architectures. Aspen specifies a formal grammar to describe an abstract machine model and describe an application’s behaviors, including available parallelism, operation counts, data structures, and control flow. Aspen’s DSL constrains models to fit the formal language specification, which enforces similar concepts across models and allows for correctness checks. Aspen is designed to enable rapid exploration of new algorithms and architectures. Because of the succinctness, expressiveness, and composability of Aspen, it can be used to model many properties of a system including performance, power, and resiliency. Aspen has been used to model traditional HPC applications, and recently extended to model scientific workflows for HPC systems and scientific instruments, like ORNL’s Spallation Neutron Source.
Models can be written manually or automatically generated from other structured representations, such as application source code or execution DAGs. These Aspen models can then be used for a variety of purposes including predicting performance of future applications, evaluating system architectures, informing runtime scheduling decisions, and identifying system anomalies. Aspen is joint work with Jeremy Meredith (ORNL).
    Speaker Bio: Jeffrey Vetter, Ph.D., holds a joint appointment between Oak Ridge National Laboratory (ORNL) and the Georgia Institute of Technology (GT). At ORNL, Vetter is a Distinguished R&D Staff Member, and the founding group leader of the Future Technologies Group in the Computer Science and Mathematics Division. At GT, Vetter is a Joint Professor in the Computational Science and Engineering School, the Principal Investigator for the NSF-funded Keeneland Project that brings large scale GPU resources to NSF users through XSEDE, and the Director of the NVIDIA CUDA Center of Excellence. His papers have won awards at the International Parallel and Distributed Processing Symposium and EuroPar; he was awarded the ACM Gordon Bell Prize in 2010. His recent books “Contemporary High Performance Computing (Vols. 1 and 2)” survey the international landscape of HPC. See his website for more information: http://ft.ornl.gov/~vetter/.
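Aspen itself is a DSL with a formal grammar; as a hedged Python analogue of the concept (not Aspen syntax), an application model can declare per-kernel resource demands that an abstract machine model converts into a predicted runtime. All names and numbers below are invented for illustration.

```python
machine = {            # hypothetical node: rates in ops or bytes per second
    "flops": 1e12,
    "mem_bytes": 1e11,
    "net_bytes": 1e10,
}

def kernel_time(demands, machine):
    """Predicted time = slowest resource demand (a simple bound analysis)."""
    return max(amount / machine[res] for res, amount in demands.items())

# Application model: a stencil sweep followed by a halo exchange for n cells.
def app_time(n):
    stencil = {"flops": 8 * n, "mem_bytes": 16 * n}
    halo = {"net_bytes": 8 * n ** (2 / 3)}  # surface-to-volume exchange
    return kernel_time(stencil, machine) + kernel_time(halo, machine)

print(f"{app_time(1e9):.4f} s")
```

Separating the application description from the machine description is what lets one model answer "what if" questions about future architectures by swapping out the machine dictionary.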
  • Rich Vuduc
    Associate professor
    School of Computational Science and Engineering (CSE)
    Georgia Institute of Technology
    Title: Bridges between macroscopic and microscopic models for co-design
    Abstract: This talk starts by discussing three modeling ideas for co-design: setting high-level (or “macroscopic”) architectural parameters via mathematical optimization, estimating lower bounds on communication from program traces, and actively perturbing programs to understand microarchitectural (or “microscopic”) bottlenecks. The specific techniques I’ll discuss are, at present, separate threads of study. I will try to suggest – as well as solicit – ideas for bridging these techniques to understand whether they can be combined to answer practical questions about how to tune algorithms, software, and architectures for better performance-, power-, and energy-efficiency.
    Speaker Bio: Rich Vuduc received his Ph.D. in Computer Science from the University of California, Berkeley, in January 2004, in the BeBOP group, under Profs. James Demmel and Katherine Yelick. He was a post-doctoral researcher in the Center for Applied Scientific Computing at the Lawrence Livermore National Laboratory, where he worked with Dr. Dan Quinlan on the ROSE project. His lab at Georgia Tech is developing automated tools and techniques to tune, to analyze, and to debug software for parallel machines, including emerging high-end multi/manycore architectures and accelerators. They focus on applying these methods to CSE applications, which include computer-based simulation of natural and engineered systems and data analysis.
  • Martin Schulz
    Lawrence Livermore National Laboratory
    Title: Modeling Performance Under a Power Bound: A Short Tour of the Near Future
    Abstract: With the US Department of Energy limiting potential exaflop designs to less than 20 MW, we have entered an era where power availability will constrain performance much more than hardware cost. If we measure utilization in terms of this scarce resource, future machine designs will draw at their power bound from the time they are brought online until the time they are decommissioned. To accomplish this, designs will be hardware-overprovisioned: there will be more hardware than can be run simultaneously at peak power, and power will be moved around within the system so as to optimize either execution time (at the job level) or job throughput (at the job-scheduler level).
    Performance optimization is difficult enough on homogeneous machines with ample power. The new reality of power-limited supercomputing makes the task far more difficult. Manufacturing variations result in processors that draw varying amounts of power for the same performance (and that exhibit varying performance under identical power bounds). Optimal algorithms and compiler optimizations at one power level may be suboptimal at another. Schedulers will now be assigning time, nodes, and power bounds to jobs, and optimal throughput will likely require that these bounds be dynamic. Job runtime systems will need to allocate power to nodes within a job, identifying and favoring nodes that are on the critical path. Node operating systems will have to choose the best execution configuration consistent with their bound and provide power/performance curves as feedback.
    This talk will provide an overview of the state of the art and open problems in this area.
    Speaker Bio: Martin is a Computer Scientist at the Center for Applied Scientific Computing (CASC) at Lawrence Livermore National Laboratory (LLNL). He earned his Doctorate in Computer Science in 2001 from the Technische Universität München (Munich, Germany). He also holds a Master of Science in Computer Science from the University of Illinois at Urbana Champaign. After completing his graduate studies and a postdoctoral appointment in Munich, he worked for two years as a Research Associate at Cornell University, before joining LLNL in 2004.
    Martin is a member of LLNL’s Scalability Team, which focuses on research towards a scalable software stack for next generation systems, as well as LLNL’s ASC CSSE ADEPT (Application Development Environment and Performance Team) and he works closely with colleagues in CASC’s Computer Science Group (CSG) and in the Development Environment Group (DEG).
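A toy version of the power-allocation problem the talk describes might look as follows. The power/performance curve and efficiency values are invented for illustration, and real schedulers must additionally handle dynamic bounds and critical paths.

```python
def perf(node_efficiency, watts):
    """Synthetic concave power/performance curve (diminishing returns)."""
    return node_efficiency * watts ** 0.5

def allocate(efficiencies, total_watts, min_watts=20, step=1):
    """Greedily hand out power in small increments to whichever node gains
    the most performance from one more watt, after giving each a floor."""
    alloc = [min_watts] * len(efficiencies)
    budget = total_watts - sum(alloc)
    for _ in range(int(budget // step)):
        gains = [perf(e, a + step) - perf(e, a)
                 for e, a in zip(efficiencies, alloc)]
        alloc[gains.index(max(gains))] += step
    return alloc

# Two "good" parts and one part that, due to manufacturing variation,
# needs more power for the same performance.
alloc = allocate([1.2, 1.2, 0.8], total_watts=150)
print(alloc, sum(alloc))
```

The greedy loop equalizes marginal performance per watt, so the less efficient processor ends up with a smaller share of the fixed system-wide power budget.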
  • Alexander Grebhahn
    Chair of Software Product Lines
    University of Passau, Germany
    Title: Performance-Influence Models
    Abstract: Configurable systems allow for performance optimization through tailoring to a specific hardware platform and use-case. However, selecting the optimal configuration is challenging due to the inherent variability of a configurable system. To address this challenge, we developed a machine-learning approach to derive performance-influence models, which describe all relevant influences of configuration options and their interactions on the performance of all possible system variants. To derive a performance-influence model, we use an iterative approach that relies on the measurements of a small number of configurations that are selected using structured sampling heuristics and experimental designs. In a series of experiments, we demonstrated the feasibility of our approach in terms of the prediction accuracy of the performance-influence models and the effort needed to derive them.
    Speaker Bio: Alexander Grebhahn is a Ph.D. Student at the Chair of Software Product Lines, University of Passau. He received his Master’s Degree in 2012 from the University of Magdeburg for his work on forensic-secure deletion in database systems. In his research, he focuses on predicting the performance of configurable systems and their system variants. He especially concentrates on multigrid solvers and the integration of domain knowledge into performance prediction. He is currently one of the main developers of the tool SPL Conqueror.
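A hedged toy of the core idea, reduced to two binary configuration options whose main effects and interaction are recovered from measurements. The actual approach samples a small subset of a huge configuration space and learns the model iteratively; this only shows what a performance-influence model expresses.

```python
# Model runtime of a configurable system with binary options A and B as
#   t(A, B) = beta0 + betaA*A + betaB*B + betaAB*A*B
# and recover the influences from measurements of all four variants.

def influence_model(measure):
    """measure(a, b) -> runtime; returns (beta0, betaA, betaB, betaAB)."""
    t00 = measure(0, 0)
    t10 = measure(1, 0)
    t01 = measure(0, 1)
    t11 = measure(1, 1)
    beta0 = t00                       # base runtime, all options off
    betaA = t10 - t00                 # main effect of enabling A
    betaB = t01 - t00                 # main effect of enabling B
    betaAB = t11 - t10 - t01 + t00    # interaction beyond the main effects
    return beta0, betaA, betaB, betaAB

# Synthetic system: option A costs 5 s, B saves 2 s, and enabling both
# triggers an interaction that costs another 3 s.
runtime = lambda a, b: 10 + 5 * a - 2 * b + 3 * a * b
print(influence_model(runtime))  # (10, 5, -2, 3)
```

With dozens of options and numeric parameters, exhaustive measurement becomes impossible, which is exactly where the structured sampling heuristics from the abstract come in.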