Process-Oriented Performance Engineering

Overview

Real-Time Signal Processing in Optical Coherence Tomography (OCT)
Optimization of a Glacier Ice Simulation Code

Towards Software-Based Real-Time Singles and Coincidence Processing in Positron Emission Tomography (PET)
Speeding up machine-learning on GPU accelerated Cluster nodes
Accurate Simulation of the Crystallization Process in Injection Molded Plastic Components
Optimization of a granular gas solver
Optimization of soft-matter simulation package on top of LB3D
Fixing load imbalance in flow simulations using Flow3D
Fixing mis-configuration in job scripts
Inefficient resource usage by oversubscribing single nodes

Real-Time Signal Processing in Optical Coherence Tomography (OCT)

Short description
Optical coherence tomography (OCT) systems are developed at the Fraunhofer Institute for Production Technology (IPT) for medical and industrial applications. OCT is an imaging technique that relies on a signal processing chain. This chain used in Fraunhofer’s OCT systems covers, e.g., the acquisition of data from a line scan camera, image-related adjustments, a Fourier transformation, and the reduction of signal noise. Since these OCT systems are required to provide high spatial and temporal resolution in production setups, e.g., video frame rate (25 frames/s), the parallelization and tuning of the signal processing chain is crucial.

To achieve video frame rate, we have developed a CUDA version of Fraunhofer’s OCT signal processing chain that targets consumer GPUs as used in customers’ production environments. We reformulated numerous components of the signal processing chain as matrix transformations to enable the use of well-tuned libraries such as cuBLAS and cuFFT. Furthermore, we focused on hiding data transfer times by overlapping data transfers with computation. To this end, we wrap independent parts of the image, so-called B-scans, in CUDA streams and process them asynchronously.

For the performance validation of our CUDA implementation, we establish a performance model that predicts kernel runtimes and data transfer runtimes in such an asynchronous setup. Our performance model relies on existing approaches and interprets and combines them for our real-world code.
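The effect of overlapping transfers and kernels can be illustrated with a very simple pipeline estimate. The following is a minimal sketch, not the performance model used in the project: it assumes that host-to-device copies, kernels, and device-to-host copies of different B-scans overlap perfectly, and all stage times are hypothetical.

```python
def sync_time(n_chunks, t_h2d, t_kernel, t_d2h):
    """Total time when every chunk is copied, processed, and copied back in sequence."""
    return n_chunks * (t_h2d + t_kernel + t_d2h)


def async_time(n_chunks, t_h2d, t_kernel, t_d2h):
    """Idealized pipelined time: the slowest stage bounds the steady state."""
    if n_chunks == 0:
        return 0.0
    bottleneck = max(t_h2d, t_kernel, t_d2h)
    # Fill and drain the pipeline once, then pay one bottleneck stage per extra chunk.
    return (t_h2d + t_kernel + t_d2h) + (n_chunks - 1) * bottleneck


# Hypothetical stage times in milliseconds for 8 B-scans.
n, h2d, kernel, d2h = 8, 1.0, 2.0, 0.5
print(sync_time(n, h2d, kernel, d2h))   # 28.0
print(async_time(n, h2d, kernel, d2h))  # 17.5
```

In this idealized picture the asynchronous version approaches the runtime of the slowest stage alone as the number of B-scans grows.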

Results
The starting point of our performance evaluation is the original serial code version (MSVS), built with the Microsoft Visual C++ compiler (v14.0). Thanks to the new algorithmic matrix formulations, we also created parallel CPU versions of the code with little additional effort: with the Microsoft compiler, we used OpenBLAS (MSVS_BLAS) and, with the Intel compiler (v17.0), we used MKL. Our new CUDA implementation was evaluated on two NVIDIA Pascal consumer GPUs (with similar results). To show the benefit of our optimizations with respect to asynchronous and overlapping operations, we present performance results of the CUDA_SYNC and CUDA_ASYNC code versions.

Overall, our CUDA version (CUDA_ASYNC) achieves ~50 frames/s even for the largest tested data set and, hence, fulfills our requirement for video frame rate. This corresponds to a speedup of 8x to 16x over the serial MSVS version and a speedup of 4x to 5x over the fastest parallel CPU version. Our corresponding GPU performance model is in line with our runtime measurements (deviation below 15 %) and predicts that triple-sized images can still be processed at video frame rate with our CUDA_ASYNC version.

Optimization of a Glacier Ice Simulation Code

Background – Our job-specific performance monitoring flagged a customer’s simulations because of very low resource utilization. The customer uses the open-source finite element software Elmer/Ice for these simulations.

Analysis – Together with the customer, we defined a relevant test case, and a performance analyst at RRZE analyzed the application performance. Acquiring a runtime profile for Elmer/Ice turned out to be non-trivial because the code uses a sophisticated on-demand library loading mechanism that confuses standard runtime profilers. A contact person at Intel who specializes in Elmer advised us on how to enable runtime profiling in the code itself. Two hotspots were identified that together consumed the majority of the runtime. A code review revealed that one specific statement was very expensive.

Optimization – Together with the customer, we replaced this statement with an alternative implementation, which resulted in an overall speedup of x3.4 for our benchmark case. During the static code review with the customer, we also identified several simulation steps that only need to be executed at the end of the simulation.

Summary – This saving of algorithmic work, together with the first optimization, added up to a total speedup of x9.92. The effort spent on our side was 1.5 days, of which roughly half a day went into getting the code to compile and run.
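Because speedups of successive optimizations multiply, the two figures quoted for this case also give the contribution of the second step. A minimal sketch of that arithmetic, using only the numbers stated in this section (the variable names are illustrative):

```python
first_speedup = 3.4    # alternative implementation of the expensive statement
total_speedup = 9.92   # both optimizations combined

# Speedups of successive optimizations multiply, so the second factor is the ratio.
second_speedup = total_speedup / first_speedup
print(round(second_speedup, 2))  # 2.92

# Fraction of the original runtime that remains after both steps, in percent.
remaining_fraction = 1.0 / total_speedup
print(round(100 * remaining_fraction, 1))  # 10.1
```

Under this reading, deferring work to the end of the simulation contributed roughly another factor of 2.9 on top of the first optimization.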

Speedup of x9.92 with an effort of 1.5 days. Improvement by saving work.

Towards Software-Based Real-Time Singles and Coincidence Processing in Positron Emission Tomography (PET)

Short Description
For the deployment of novel PET scanners developed in the context of the HYPMED project [1], the Institute for Experimental Molecular Imaging (ExMI) aims at utilizing software-based approaches for its singles and coincidence processing chain [2,3]. In contrast to current hardware-based implementations, this allows for easier prototyping and for the retrieval of raw scanner data, which would otherwise be lost. However, achieving real-time processing throughput becomes a challenge with the increasing data streams of the scanners and the growing complexity of state-of-the-art positioning algorithms [4].

The application framework developed at the ExMI utilizes the Message Passing Interface (MPI) to allow for distributed processing of incoming scanner data. While throughputs of above 500 MB/s can already be reached with the current implementation, scalability to larger detector systems is desirable. As the application data input is limited by the physical connection between acquisition device and control PC, a throughput rate of 1-2 GB/s is targeted.

For performance analysis, hardware performance counter data collected during program execution gave insight into the dynamic application behaviour. The collected metrics revealed long stalls due to data translation lookaside buffer (TLB) misses, which pointed to poor locality of memory accesses. Each data TLB entry maps a 4 KB page, and consecutive memory accesses reached beyond that range. By sorting a critical data structure before processing its entries, scalability and throughput rates could be improved significantly.
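Why sorting helps can be seen by counting how often consecutive accesses change the 4 KB page. The following is a toy sketch under stated assumptions: the byte offsets are randomly generated stand-ins, not data from the PET framework, and `page_switches` is a hypothetical helper.

```python
import random

PAGE_SIZE = 4096  # bytes covered by one data TLB entry with 4 KB pages


def page_switches(addresses):
    """Count how often consecutive accesses land on a different 4 KB page."""
    pages = [addr // PAGE_SIZE for addr in addresses]
    return sum(1 for prev, cur in zip(pages, pages[1:]) if prev != cur)


random.seed(0)
# Hypothetical byte offsets touched while processing 10,000 entries.
offsets = [random.randrange(0, 64 * 1024 * 1024) for _ in range(10_000)]

unsorted_switches = page_switches(offsets)
# Python's sorted() is stable: entries with equal keys keep their original order.
sorted_switches = page_switches(sorted(offsets))

print(unsorted_switches, sorted_switches)
```

After sorting, every page is visited in one contiguous run, so the number of page changes drops to the number of distinct pages minus one.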

The performance counter data additionally showed low vectorization ratios in the kernel function. Although a memory layout of a structure of arrays was used inside the kernel loop, memory accesses proved to be irregular. Utilizing an array of structures instead led to scalar code execution but had a positive effect on execution time.
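One way to see why an array of structures can win for irregular access is to count the cache lines touched when all fields of a single record are read. This is an illustrative sketch only: the field count, field size, and record count are assumptions, not values from the application.

```python
CACHE_LINE = 64        # bytes, typical for x86 CPUs
FIELD_SIZE = 8         # bytes per field, e.g. one double
N_FIELDS = 4           # hypothetical number of fields per record
N_RECORDS = 1_000_000  # hypothetical number of records


def lines_touched_aos(index):
    """Cache lines touched when all fields of one record are read (array of structures)."""
    start = index * N_FIELDS * FIELD_SIZE
    end = start + N_FIELDS * FIELD_SIZE - 1
    return len({start // CACHE_LINE, end // CACHE_LINE})


def lines_touched_soa(index):
    """Cache lines touched for the same record when each field lives in its own array."""
    array_bytes = N_RECORDS * FIELD_SIZE
    return len({(f * array_bytes + index * FIELD_SIZE) // CACHE_LINE
                for f in range(N_FIELDS)})


print(lines_touched_aos(12345))  # 1
print(lines_touched_soa(12345))  # 4
```

With regular, streaming access the structure-of-arrays layout amortizes each cache line over many records; with irregular access that benefit disappears and the array-of-structures layout needs fewer cache lines per record.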

Results
The starting point of our performance evaluation was the original MPI-parallel source code, compiled as a release build with the Intel 18.0 compiler and Intel MPI 2017.4. The stable-sort-timeline version contains the optimizations aimed at improving data TLB locality, while the vector-of-structs version uses an array of structures to improve data locality within the kernel, as described above.

The graphs below show the median throughput rate of 20 application runs on a single Intel Xeon Broadwell EP node with 24 cores, one node of the CLAIX2016 partition of the RWTH Compute Cluster. The stable-sort-timeline version speeds up execution by up to 97% compared to the initial release version. Additionally, parallel efficiency for 48 processes is increased from 41% to 58%. Further runtime improvements by the vector-of-structs version add up to an overall doubling of the throughput rate over the initial release version resulting in a peak throughput rate of 1,250 MB/s. Although this outcome satisfies the set goal of reaching a throughput rate above 1 GB/s, future collaboration efforts may focus on the analysis of other critical processing steps for achieving further performance improvements.
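For reference, the metrics used above can be written down explicitly. This is a sketch of the usual definitions, assuming efficiency is computed from throughput relative to a single process; the throughput values are hypothetical and only the process count is taken from the text.

```python
def parallel_efficiency(throughput_p, throughput_1, processes):
    """Measured throughput relative to ideal linear scaling from one process."""
    return throughput_p / (processes * throughput_1)


def percent_to_factor(percent_faster):
    """Convert 'X% faster' into a multiplicative speedup factor."""
    return 1.0 + percent_faster / 100.0


# Hypothetical throughputs in MB/s; only the process count matches the text.
print(parallel_efficiency(throughput_p=960.0, throughput_1=40.0, processes=48))  # 0.5
print(round(percent_to_factor(97), 2))  # 1.97
```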

Speeding up machine-learning on GPU accelerated Cluster nodes

Background – A user contacted us via the help desk because he suspected that his Python-based machine-learning calculations on TinyGPU were slowed down by scattered access to a very large HDF5 input file. In the ticket, he already indicated that putting the file on an SSD had helped a lot on other systems.

Analysis – Most TinyGPU nodes have local SSDs installed. Unfortunately our documentation was a bit scattered on this topic. We improved the documentation to prevent similar problems in the future.

Optimization – Putting the input file on an SSD speeds up execution by a factor of 13. The user further optimized the read access by using the read_direct routine of the Python library h5py, which avoids additional data copies; this gave another factor of 4.
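The idea behind reading directly into a preallocated buffer can be shown without any third-party dependency. The sketch below swaps in plain file I/O from the Python standard library (`readinto`) as a stand-in for h5py; the temporary file and its contents are made up for the example.

```python
import os
import tempfile

# Write a small hypothetical input file so the sketch is self-contained.
payload = bytes(range(256)) * 16
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

# Variant 1: read() allocates a new bytes object, which is then copied into the buffer.
buffer_a = bytearray(len(payload))
with open(path, "rb") as f:
    buffer_a[:] = f.read()

# Variant 2: readinto() fills the preallocated buffer directly, without the temporary.
buffer_b = bytearray(len(payload))
with open(path, "rb") as f:
    n_read = f.readinto(buffer_b)

os.remove(path)
print(n_read == len(payload), buffer_a == buffer_b)  # True True
```

With h5py, the analogous call is `Dataset.read_direct(dest)`, which fills an existing NumPy array instead of creating a new one on every read.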

Summary – The largest speedup is achieved by putting the data on an SSD. Optimizing the read access brings another factor of 4.

Reduce file IO overhead by choosing the right hardware together with optimized data access.

Accurate Simulation of the Crystallization Process in Injection Molded Plastic Components

Short Description
Injection molding is very important in plastics processing since it enables cheap and easily adaptable large-scale production of plastic components. As the microstructure of semicrystalline thermoplastics has a huge impact on the mechanical properties, the Institute for Plastics Processing at RWTH Aachen University (IKV) has developed the simulation software SphäroSim, which can predict the microstructure of semicrystalline thermoplastics during solidification. Due to the long runtime of the simulation, high-resolution simulations are restricted to small areas. In 2014, SphäroSim was parallelized to simulate big data sets on an HPC system [1], but subsequent algorithmic changes required further performance optimization. In the context of this project, we decreased the runtime and reduced the memory consumption. The simulation now runs on a single compute node without running out of memory for the data set currently under investigation. Since the program was developed under Windows, we also ported it to Linux to use the capacities of the Linux cluster of RWTH Aachen University.

Our performance analysis with Intel VTune Amplifier revealed functions that did not exploit the parallel hardware. We moved some instructions out of a big loop and removed some unnecessary mutexes that prevented parallelization. We also identified long idle times while writing intermediate data to disk and reduced them by introducing asynchronous parallel I/O with OpenMP tasks. Furthermore, we updated the Eigen library used by the code and added a compiler flag for aggressive inlining. By analyzing the memory consumption of the application, we identified a very large data structure that allocated a few million memory pages but used only about a quarter of each page. Shrinking this structure to the size actually used freed enough memory to activate a precalculation step for frequently used data, which had been implemented previously but could not be used for lack of memory. Using the precalculation also reduced the runtime.
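The pattern of writing intermediate data in the background while the computation continues can be sketched as follows. This is not the SphäroSim implementation: a Python thread pool stands in for OpenMP tasks, and `compute_step` and `write_result` are hypothetical placeholders.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def compute_step(step):
    """Stand-in for one simulation step; returns the intermediate data to store."""
    return [step * i for i in range(1000)]


def write_result(directory, step, data):
    """Write one intermediate result to disk; runs in a background thread."""
    path = os.path.join(directory, f"step_{step}.txt")
    with open(path, "w") as f:
        f.write(",".join(map(str, data)))
    return path


with tempfile.TemporaryDirectory() as out_dir:
    with ThreadPoolExecutor(max_workers=2) as pool:
        pending = []
        for step in range(4):
            data = compute_step(step)
            # Hand the write to the pool and continue computing immediately.
            pending.append(pool.submit(write_result, out_dir, step, data))
        # Collect results so that any I/O error is raised here.
        written = [future.result() for future in pending]
    n_files = len(os.listdir(out_dir))

print(n_files)  # 4
```

The main loop never waits for a write to finish; it only synchronizes once at the end, which is the same effect the asynchronous I/O tasks are described as having on idle time.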

Results
As the baseline for the performance evaluation, we use the code version Init, which is parallelized with OpenMP and ported to Linux. It is compiled with the Intel 18.0 compiler and takes about 20 hours with 48 threads to complete on a CLAIX-2018 compute node. The following figure shows the runtimes after the optimization steps described above. All measurements were taken on a single node of the CLAIX-2018 compute cluster of RWTH Aachen University. These nodes contain 48 Intel Skylake cores operating at a frequency of 2.1 GHz and 192 GB of main memory.

Removing the mutexes and moving some instructions out of the loop (version rmSerMov) reduces the runtime by 30% (compared to Init). Introducing asynchronous parallel I/O (version ParIO) saves another 10% (compared to Init). The update of the Eigen library and aggressive inlining (version Eig) reduce the runtime by an additional 5%. Enabling the precalculation (version Precalc) improves the runtime by another 37%. This version is 77% faster than the initial version, and simulations now take only roughly 5 hours to complete, so simulation runs can finish overnight.

Reducing the size of the huge structure decreases the memory consumption of SphäroSim from more than 250 GB (calculated value) to 80 GB (measured value). Hence, large runs no longer crash for lack of memory. Overall, we reduced the memory consumption by 70% and the runtime by 77%, which makes it possible to simulate larger data sets.
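The quoted reductions can be related to the absolute figures with a one-line formula. A small sketch using the rounded numbers from this section (`percent_reduction` is an illustrative helper; the inputs are approximate, since the text states "more than 250 GB" and "roughly 5 hours"):

```python
def percent_reduction(before, after):
    """Relative reduction from 'before' to 'after', in percent."""
    return 100.0 * (before - after) / before


print(round(percent_reduction(250, 80)))  # 68  (memory in GB, lower bound for 'before')
print(round(percent_reduction(20, 5)))    # 75  (runtime in hours, rounded values)

# Runtime that corresponds to a 77% reduction of the 20-hour baseline.
print(round(20 * (1 - 0.77), 1))          # 4.6
```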

Optimization of a granular gas solver

Background – In the scope of a KONWIHR project, an MPI-parallel granular gas solver was optimized to improve its performance and parallel efficiency.

Analysis – The single-core and multi-core performance of the code was analyzed using a simple test case. A runtime profile was generated to identify the most time-intensive parts of the simulation. Additionally, the LIKWID performance tools were used to measure hardware performance metrics like memory bandwidth and FLOP/s. In the multi-core case, the MPI communication pattern was analyzed using the Intel Trace Analyzer and Collector (ITAC) to determine the ratio of communication to computation for different numbers of processes.

Optimization – The runtime profile pointed to several possible optimizations of the code. Some functions were not inlined automatically by the compiler, so inlining had to be forced with a specific compiler flag. The computational cost of some calculations was reduced by avoiding divisions and by reusing already computed quantities. Additionally, some unnecessary allocation and deallocation of memory was identified and removed. With these optimizations, the code ran 14.5 times faster than the original version.
The analysis of the MPI communication behavior with ITAC revealed a communication share of 30% with 4 processes, which increased further with the number of processes. Further investigation of the code showed unnecessary data transfers. By sending only relevant data between processes, the parallel efficiency and performance were increased. On 216 cores, a simple test case ran 80% faster, with an increase in parallel efficiency of 17%, compared to the original code.
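Avoiding divisions and reusing already computed quantities typically looks like the following. This is a generic sketch, not code from the solver: the quantities (masses, momenta, velocities, kinetic energies) are hypothetical examples chosen to show the pattern.

```python
import math


def kinetic_terms_naive(masses, momenta):
    """Recompute the division for every derived quantity."""
    velocities = [p / m for p, m in zip(momenta, masses)]
    energies = [0.5 * p * p / m for p, m in zip(momenta, masses)]
    return velocities, energies


def kinetic_terms_reuse(masses, momenta):
    """One division per particle; the reciprocal and the velocity are reused."""
    velocities, energies = [], []
    for p, m in zip(momenta, masses):
        inv_m = 1.0 / m          # single division
        v = p * inv_m            # multiplication instead of a second division
        velocities.append(v)
        energies.append(0.5 * p * v)  # reuse v instead of recomputing p / m
    return velocities, energies


masses = [1.0, 2.0, 4.0]
momenta = [3.0, 8.0, 2.0]
v_a, e_a = kinetic_terms_naive(masses, momenta)
v_b, e_b = kinetic_terms_reuse(masses, momenta)
print(all(math.isclose(x, y) for x, y in zip(v_a + e_a, v_b + e_b)))  # True
```

Both variants agree up to floating-point rounding; the second performs half as many divisions, which matters in compiled code because divisions are considerably more expensive than multiplications.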

Summary – Basic code analysis and optimizations decreased the runtime of the code by a factor of 14.5 on a single core. Additionally, more efficient communication between the MPI processes further decreased the communication overhead and the total runtime of the simulation.

Reduce runtime by factor of 14.5 by saving work.

Optimization of soft-matter simulation package on top of LB3D

Background – A customer contacted us for help with optimizing their LB3D-based software package for simulating soft matter systems at a mesoscopic scale. Prior to the request, the software package had been rewritten in an object-oriented style, with a redesign of the computationally intensive routines. To analyze the outcome, the customer wanted to integrate LIKWID’s MarkerAPI to measure specific code regions. The simulation commonly runs on Tier-1 systems like Hazel Hen at HLRS.

Analysis – The runtime profile showed a few hot functions in which most of the execution time was spent. For the analysis, we added MarkerAPI calls around the hot functions and ran several measurements (FLOP rate, memory bandwidth, and vectorization ratio). The vectorization ratio was rather low even though the compiler was given the proper flags for vectorization. Although Fortran provides convenient declarations for creating a new array, the customers used ‘malloc()’ calls.

Optimization – The main data structure holds its arrays as pointers (allocated by C wrappers), and the GNU Fortran compiler could not determine whether the allocated arrays are contiguous in memory, so it did not apply AVX2 vectorization. After adding the ‘contiguous’ keyword to the array declarations, the compiler successfully vectorized the hot functions.

Summary – In a one-day meeting with the customers, we gave a hands-on session on LIKWID measurements and how to read the results. Moreover, we analyzed code regions in the customers’ software package and found vectorization problems caused by a missing keyword. With the ‘contiguous’ keyword, the performance increased by 20%. After the one-day meeting, the group continued working on their code, resulting in a three-fold improvement in performance.

A more than three-fold improvement in performance.

Fixing load imbalance in flow simulations using Flow3D

Background – Flow3D is a CFD package for flow simulations. The simulation domain is distributed across several compute nodes.

Analysis – In the cluster-wide job monitoring, some Flow3D jobs showed load imbalance between compute nodes. The imbalance was caused by the structure of the simulation domain, which left some processes with a higher workload than others.
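A common way to quantify such an imbalance from monitoring data is the ratio of the maximum to the mean load. The following is a minimal sketch with made-up per-node values; `load_imbalance` is a hypothetical helper and not part of the job monitoring system.

```python
def load_imbalance(loads):
    """Ratio of the busiest node's load to the mean load; 1.0 means perfectly balanced."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean


# Hypothetical per-node CPU busy times in seconds for one job.
balanced = [100, 100, 100, 100]
skewed = [160, 80, 80, 80]

print(load_imbalance(balanced))  # 1.0
print(load_imbalance(skewed))    # 1.6
```

Since all nodes have to wait for the busiest one, a value of 1.6 means the job takes 60% longer than it would with the same total work spread evenly.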

Optimization – The distribution of the domain was improved, and the user was advised to use another type of compute node or fewer compute nodes. The performance did not drop significantly with the different node selection, but the resources were used more efficiently. This results in a cost reduction of 14000€ per year for an investment of only 1.5 hours of work.

Summary – By balancing the workload between compute nodes and using other or fewer compute nodes, the resources could be used more efficiently when executing flow simulations with the Flow3D CFD package.

Cost reduction by 14000€ per year by investing only 1.5 hours of work

Fixing mis-configuration in job scripts

Background – A common mistake when submitting jobs on HPC clusters is to reuse old job scripts for different experiments without adjusting them to the current job requirements.

Analysis – The job monitoring revealed jobs that were requesting five compute nodes (each having 20 physical CPU cores) but the application ran only on a single CPU core.
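The scale of the waste follows directly from the numbers above. A short sketch of the arithmetic (variable names are illustrative):

```python
nodes_requested = 5
cores_per_node = 20
cores_used = 1

cores_allocated = nodes_requested * cores_per_node
utilization = cores_used / cores_allocated

print(cores_allocated)       # 100
print(f"{utilization:.0%}")  # 1%
```

Only one of the 100 allocated cores did any work, so 99% of the reserved core hours were idle.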

Optimization – After the job script was corrected, the job requested only the resources it actually used; the performance stayed the same on a single core while the unused nodes were freed. By additionally moving to a different type of compute node with higher single-core performance but fewer CPU cores, the performance was increased and the available resources were used more efficiently.

Summary – By fixing mis-configurations in job scripts and moving to the optimal compute node type for the job, the performance was increased while the resource usage was reduced at the same time.

Performance was increased while reducing the resource usage at the same time

Inefficient resource usage by oversubscribing single nodes

Background – The jobs of a user showed inefficient resource usage in the job monitoring. Some of the compute nodes ran more processes than they had physical CPU cores, while others executed almost no CPU instructions and used almost no memory.

Analysis – The imbalances were caused by poor parameter choices in the job configuration.

Optimization – After fixing the job scripts, the workload was equally distributed among all compute nodes which caused a performance increase of roughly 15%. The saved core hours in the user’s contingent were invested by the user in additional computations.
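Detecting oversubscription and computing an even placement are both simple checks. The sketch below is illustrative only: node names, process counts, and the core count per node are assumptions, and both helper functions are hypothetical rather than part of any batch system.

```python
def find_oversubscribed(processes_per_node, cores_per_node):
    """Return the nodes that run more processes than they have physical cores."""
    return [node for node, n in processes_per_node.items() if n > cores_per_node]


def distribute_evenly(total_processes, nodes):
    """Spread processes over nodes so that counts differ by at most one."""
    base, extra = divmod(total_processes, len(nodes))
    return {node: base + (1 if i < extra else 0) for i, node in enumerate(nodes)}


# Hypothetical placement observed in the job monitoring.
placement = {"node01": 40, "node02": 38, "node03": 1, "node04": 1}

print(find_oversubscribed(placement, cores_per_node=20))
# ['node01', 'node02']
print(distribute_evenly(sum(placement.values()), list(placement)))
# {'node01': 20, 'node02': 20, 'node03': 20, 'node04': 20}
```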

Summary – Load imbalance among compute nodes was caused by bad parameter selection in the job scripts. Fixed job scripts distributing work equally result in a performance gain of 15%.

Fixed job scripts distributing work equally result in a performance gain of 15%

ProPE project documentation and activities
