2018
- : Lattice Boltzmann Benchmark Kernels as a Testbed for Performance Analysis, (2018), Computers & Fluids, Special Issue DSFD2017. preprint arXiv:1711.11468. doi:10.1016/j.compfluid.2018.03.030.
2017
- : A two-scale approach for efficient on-the-fly operator assembly in massively parallel high performance multigrid codes. Applied Numerical Mathematics (2017) preprint arXiv:1608.06473. doi:10.1016/j.apnum.2017.07.006.
2015
- : An analysis of energy-optimized lattice-Boltzmann CFD simulations from the chip to the highly parallel level. Concurrency and Computation: Practice & Experience (2016) preprint arXiv:1304.7664 doi:10.1002/cpe.3489.
2014
- : Defmacro for C: Lightweight, Ad Hoc Code Generation. Accepted for European Lisp Symposium at IRCAM, May 5-6, 2014, Paris, France.
- : FETOL: A divide-and-conquer based approach for resilient HPC. INFOCOMP 2013: The Third International Conference on Advanced Communications and Computation, Nov. 17-21, 2013, Lisbon, Portugal.
2013
- : A Survey of Checkpoint/Restart Techniques on Distributed Memory Systems. Parallel Processing Letters 23 (04) (2013). doi:10.1142/S0129626413400112.
- : PGAS implementation of SpMVM and LBM using GPI. Proceedings of the 7th International Conference on PGAS Programming Models PGAS2013, 3./4. October 2013, Edinburgh, Scotland, UK.
- : MPC and Coarray Fortran: Alternatives to Classic MPI Implementations on the Examples of Scalable Lattice Boltzmann Flow Solvers. High Performance Computing in Science and Engineering ‘12, pages 367-372 (2013). doi:10.1007/978-3-642-33374-3_27.
- : An Evaluation of Different IO Techniques for Checkpoint/Restart. Workshop on Large-Scale Parallel Processing 2013 (LSPP13) at IPDPS 2013.
2012
- : Asynchronous checkpointing by dedicated checkpoint threads. Recent Advances in the Message Passing Interface, Volume 7490 of Lecture Notes in Computer Science, pp. 289-290 (2012). doi:10.1007/978-3-642-33518-1_36.
- : Comparison of Different Propagation Steps for Lattice Boltzmann Methods. Computers and Mathematics with Applications (2012) doi:10.1016/j.camwa.2012.05.002 preprint arXiv:1111.0922.
- : Domain decomposition and locality optimization for large-scale lattice Boltzmann simulations. Computers & Fluids (2012) doi:10.1016/j.compfluid.2012.02.007 preprint arXiv:1111.1129.
2010
- : Leveraging shared caches for parallel temporal blocking of stencil codes on multicore processors and clusters. Parallel Processing Letters 20 (4), 359-376 (2010). doi:10.1142/S0129626410000296 preprint arXiv:1006.3148.
- : Multicore-aware parallel temporal blocking of stencil codes for shared and distributed memory, Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), IPDPS 2010, pp. 1-7, 19-23 April 2010, doi:10.1109/IPDPSW.2010.5470813
2009
- : Efficient temporal blocking for stencil computations by multicore-aware wavefront parallelization. Proceedings of 2009 33rd Annual IEEE International Computer Software and Applications Conference (COMPSAC 2009, Seattle, USA, July 20-24, 2009). IEEE Computer Society : IPSJ/IEEE SAINT Conference, (2009), pp. 579-586. doi:10.1109/COMPSAC.2009.82.
- , Hardware-effiziente, hochparallele Implementierungen von Lattice-Boltzmann-Verfahren für komplexe Geometrien, Technische Fakultät der Friedrich-Alexander-Universität Erlangen-Nürnberg, Dissertation, September 2016.
- , Potentials of temporal blocking for stencil-based computations on multi-core systems, Georg Simon Ohm University of Applied Sciences Nuremberg, Master’s Thesis, March 2009, supervisors: Prof. Dr. Eck and Dr. Georg Hager. Poster, presented at SC09 (USA, Portland, OR).
- , Ein Maple-Paket zur Bestimmung von Nullstellen, Diploma Thesis, Georg Simon Ohm University of Applied Sciences Nuremberg, September 2007, supervisors: Prof. Dr. Wermuth and Prof. Dr. Delfs.
2018
- : Annual course on Advanced Parallel Programming of High Performance Systems. LRZ, Garching, Germany, April 3-6, 2018.
- : Annual course on Parallel Programming of High Performance Systems. RRZE, Erlangen, Germany, March 6-10, 2018.
2017
- : Performance analysis of sparse triangular solve on current hardware architectures. GAMM Workshop on Applied and Numerical Linear Algebra, Cologne, Germany, September 7-8, 2017.
- : Lattice Boltzmann Benchmark Kernel as a Testbed for Performance Analysis. DSFD’17, Erlangen, Germany, July 10-14, 2017.
- : Annual course on Advanced Parallel Programming of High Performance Systems. LRZ, Garching, Germany, April 3-6, 2017.
- : Annual course on Parallel Programming of High Performance Systems. LRZ, Garching, Germany, March 6-10, 2017.
2016
- : Annual course on Advanced Parallel Programming of High Performance Systems. LRZ, Garching, Germany, April 4-7, 2016.
- : Annual course on Parallel Programming of High Performance Systems. RRZE, Erlangen, Germany, March 7-11, 2016.
2015
- : Performance Modeling and Analysis of Stencil operations in Earth Mantle Convection Simulations. ParCo 2015, Symposium on Parallel solvers for very large PDE based systems in the Earth- and atmospheric sciences, Edinburgh, Scotland, September 1-4, 2015.
- : Extreme Scale-Out SuperMUC Phase 2, lessons learned. ParCo 2015, Edinburgh, Scotland, September 1-4, 2015.
- : Locality and Performance Optimized Adjacency List Generation for Lattice Boltzmann Based Simulations. ParCFD 2015, Montreal, Canada, May 17-21, 2015.
- : Annual course on Advanced Parallel Programming of High Performance Systems. LRZ, Garching, Germany, April 7-10, 2015.
- : Bestimmung eines optimalen Betriebspunkts am Beispiel eines Lattice-Boltzmann-Lösers auf SuperMUC. ZKI AK Supercomputing, CAU, Kiel, Germany, March 16-17, 2015.
- : Annual course on Parallel Programming of High Performance Systems. LRZ, Garching, Germany, March 9-13, 2015.
2014
- : Single Node Performance and Energy Modeling. Invited Talk, Lehrstuhl für Rechnertechnik und Rechnerorganisation / Parallelrechnerarchitektur (LRR), TUM, Garching, Germany, June 10th, 2014.
- : Modeling and Analyzing Performance for Highly Optimized Propagation Steps of the Lattice Boltzmann Method on Sparse Lattices. ParCFD 2014, Trondheim, Norway, May 20-22, 2014.
- : Annual course on Advanced Parallel Programming of High Performance Systems. LRZ, Garching, Germany, March 31-April 03, 2014.
- : Annual course on Parallel Programming of High Performance Systems. RRZE, Erlangen, Germany, March 10-14, 2014.
2013
- : Annual course on Advanced Parallel Programming of High Performance Systems. LRZ, Garching, Germany, March 18-21, 2013.
- : Annual course on Parallel Programming of High Performance Systems. LRZ, Garching, Germany, March 4-8, 2013.
2012
- : MPC and Coarray Fortran: alternatives to classic MPI implementations on the examples of scalable lattice Boltzmann flow solvers. Poster, 15th Results and Review Workshop of the HLRS, Stuttgart, Germany, 10-11. October 2012.
- : LIKWID Tutorial: Lightweight performance tools, 6th International Parallel Tools Workshop, HLRS, Stuttgart, Germany, 26. September 2012.
2011
- : Domain decomposition and locality optimization for large-scale lattice Boltzmann simulations, ParCFD special session on LBM, Barcelona, Spain, May 2011.
2010
- : Partitioning for lattice Boltzmann solver, LBM Day, Bochum, Germany, 30. November, 2010.
- : Efficient temporal blocking for stencil computations by multicore-aware wavefront parallelization. 33rd Annual IEEE International Computer Software and Applications Conference (COMPSAC 2009), Best Paper Award, Seattle (WA, USA), 20-24. July 2009.
- : Multicore-aware parallel temporal blocking of stencil codes for shared and distributed memory, LSPP10, the Workshop on Large-Scale Parallel Processing at IPDPS 2010, Atlanta, Georgia, USA, 23. April, 2010.
- : A Pipelined, Multicore-aware Approach to Parallel Temporal Blocking of Stencil Codes for Shared and Distributed Memory, Facing the Multicore-Challenge, Heidelberg, Germany, 19. March, 2010.
2009
- : Enabling temporal blocking for stencil computations by multicore-aware wavefront parallelization. CSE Seminar, UC Berkeley and Lawrence Berkeley National Laboratory, Berkeley, CA, USA, 15. May 2009.
2015
- : Modeling and analyzing performance for highly optimized propagation steps of the lattice Boltzmann method on sparse lattices, (2015), Version 2, submitted to ISC’16, arXiv:1410.0412.
- : Short Note on Costs of Floating Point Operations on current x86-64 Architectures: Denormals, Overflow, Underflow, and Division by Zero, (2015) arXiv:1506.03997.
2014
- : Modeling and analyzing performance for highly optimized propagation steps of the lattice Boltzmann method on sparse lattices, (2014), Version 1, arXiv:1410.0412v1.
- : Performance-Optimierung des Lattice-Boltzmann-Lösers im Verbundprojekt OptiLBM, (2014) Quartl No. 70.
2013
- : Asynchronous MPI for the Masses, (2013) arXiv:1302.4280.
2010
- : Optimizing ccNUMA locality for task-parallel execution under OpenMP and TBB on multicore-based systems, (2010) arXiv:1101.0093v1.
2009
- : A Proof of Concept for Optimizing Task Parallelism by Locality Queues, (2009) arXiv:0902.1884.