2018
- Lattice Boltzmann Benchmark Kernels as a Testbed for Performance Analysis. Computers & Fluids (2018), Special Issue DSFD2017. preprint arXiv:1711.11468. doi:10.1016/j.compfluid.2018.03.030. :
2017
- A two-scale approach for efficient on-the-fly operator assembly in massively parallel high performance multigrid codes. Applied Numerical Mathematics (2017) preprint arXiv:1608.06473. doi:10.1016/j.apnum.2017.07.006. :
2015
- An analysis of energy-optimized lattice-Boltzmann CFD simulations from the chip to the highly parallel level. Concurrency and Computation: Practice & Experience (2016) preprint arXiv:1304.7664 doi:10.1002/cpe.3489. :
2014
- Defmacro for C: Lightweight, Ad Hoc Code Generation. Accepted for the European Lisp Symposium at IRCAM, May 5-6, 2014, Paris, France. :
- FETOL: A divide-and-conquer based approach for resilient HPC. INFOCOMP 2013: The Third International Conference on Advanced Communications and Computation, Nov. 17-21, 2013, Lisbon, Portugal. :
2013
- A Survey of Checkpoint/Restart Techniques on Distributed Memory Systems. Parallel Processing Letters 23 (04) (2013). doi:10.1142/S0129626413400112. :
- PGAS implementation of SpMVM and LBM using GPI. Proceedings of the 7th International Conference on PGAS Programming Models PGAS2013, October 3-4, 2013, Edinburgh, Scotland, UK. :
- MPC and Coarray Fortran: Alternatives to Classic MPI Implementations on the Examples of Scalable Lattice Boltzmann Flow Solvers. High Performance Computing in Science and Engineering ’12, pages 367-372 (2013). doi:10.1007/978-3-642-33374-3_27. :
- An Evaluation of Different IO Techniques for Checkpoint/Restart. Workshop on Large-Scale Parallel Processing 2013 (LSPP13) at IPDPS 2013. :
2012
- Asynchronous checkpointing by dedicated checkpoint threads. Recent Advances in the Message Passing Interface, Volume 7490 of Lecture Notes in Computer Science, pp. 289-290 (2012). doi:10.1007/978-3-642-33518-1_36. :
- Comparison of Different Propagation Steps for Lattice Boltzmann Methods. Computers and Mathematics with Applications (2012) doi:10.1016/j.camwa.2012.05.002 preprint arXiv:1111.0922. :
- Domain decomposition and locality optimization for large-scale lattice Boltzmann simulations. Computers & Fluids (2012) doi:10.1016/j.compfluid.2012.02.007 preprint arXiv:1111.1129. :
2010
- Leveraging shared caches for parallel temporal blocking of stencil codes on multicore processors and clusters. Parallel Processing Letters 20 (4), 359-376 (2010). doi:10.1142/S0129626410000296 preprint arXiv:1006.3148. :
- Multicore-aware parallel temporal blocking of stencil codes for shared and distributed memory, Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), IPDPS 2010, pp. 1-7, April 19-23, 2010, doi:10.1109/IPDPSW.2010.5470813 :
2009
- Efficient temporal blocking for stencil computations by multicore-aware wavefront parallelization. Proceedings of the 2009 33rd Annual IEEE International Computer Software and Applications Conference (COMPSAC 2009, Seattle, USA, July 20-24, 2009). IEEE Computer Society: IPSJ/IEEE SAINT Conference (2009), pp. 579-586. doi:10.1109/COMPSAC.2009.82. :
- Hardware-effiziente, hochparallele Implementierungen von Lattice-Boltzmann-Verfahren für komplexe Geometrien [Hardware-efficient, highly parallel implementations of lattice Boltzmann methods for complex geometries], Technische Fakultät der Friedrich-Alexander-Universität Erlangen-Nürnberg, Dissertation, September 2016. ,
- Potentials of temporal blocking for stencil-based computations on multi-core systems, Georg Simon Ohm University of Applied Sciences Nuremberg, Master’s Thesis, March 2009, supervisors: Prof. Dr. Eck and Dr. Georg Hager. Poster, presented at SC09 (Portland, OR, USA). ,
- Ein Maple-Paket zur Bestimmung von Nullstellen [A Maple package for determining roots], Diploma Thesis, Georg Simon Ohm University of Applied Sciences Nuremberg, September 2007, supervisors: Prof. Dr. Wermuth and Prof. Dr. Delfs. ,
2018
- Annual course on Advanced Parallel Programming of High Performance Systems. LRZ, Garching, Germany, April 3-6, 2018. :
- Annual course on Parallel Programming of High Performance Systems. RRZE, Erlangen, Germany, March 6-10, 2018. :
2017
- Performance analysis of sparse triangular solve on current hardware architectures. GAMM Workshop on Applied and Numerical Linear Algebra, Cologne, Germany, September 7-8, 2017. :
- Lattice Boltzmann Benchmark Kernel as a Testbed for Performance Analysis. DSFD’17, Erlangen, Germany, July 10-14, 2017. :
- Annual course on Advanced Parallel Programming of High Performance Systems. LRZ, Garching, Germany, April 3-6, 2017. :
- Annual course on Parallel Programming of High Performance Systems. LRZ, Garching, Germany, March 6-10, 2017. :
2016
- Annual course on Advanced Parallel Programming of High Performance Systems. LRZ, Garching, Germany, April 4-7, 2016. :
- Annual course on Parallel Programming of High Performance Systems. RRZE, Erlangen, Germany, March 7-11, 2016. :
2015
- Performance Modeling and Analysis of Stencil operations in Earth Mantle Convection Simulations. ParCo 2015, Symposium on Parallel solvers for very large PDE based systems in the Earth- and atmospheric sciences, Edinburgh, Scotland, September 1-4, 2015. :
- Extreme Scale-Out SuperMUC Phase 2, lessons learned. ParCo 2015, Edinburgh, Scotland, September 1-4, 2015. :
- Locality and Performance Optimized Adjacency List Generation for Lattice Boltzmann Based Simulations. ParCFD 2015, Montreal, Canada, May 17-21, 2015. :
- Annual course on Advanced Parallel Programming of High Performance Systems. LRZ, Garching, Germany, April 7-10, 2015. :
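- Bestimmung eines optimalen Betriebspunkts am Beispiel eines Lattice-Boltzmann-Lösers auf SuperMUC [Determining an optimal operating point, using a lattice Boltzmann solver on SuperMUC as an example]. ZKI AK Supercomputing, CAU, Kiel, Germany, March 16-17, 2015. :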
- Annual course on Parallel Programming of High Performance Systems. LRZ, Garching, Germany, March 9-13, 2015. :
2014
- Single Node Performance and Energy Modeling. Invited Talk, Lehrstuhl für Rechnertechnik und Rechnerorganisation / Parallelrechnerarchitektur (LRR), TUM, Garching, Germany, June 10, 2014. :
- Modeling and Analyzing Performance for Highly Optimized Propagation Steps of the Lattice Boltzmann Method on Sparse Lattices. ParCFD 2014, Trondheim, Norway, May 20-22, 2014. :
- Annual course on Advanced Parallel Programming of High Performance Systems. LRZ, Garching, Germany, March 31-April 3, 2014. :
- Annual course on Parallel Programming of High Performance Systems. RRZE, Erlangen, Germany, March 10-14, 2014. :
2013
- Annual course on Advanced Parallel Programming of High Performance Systems. LRZ, Garching, Germany, March 18-21, 2013. :
- Annual course on Parallel Programming of High Performance Systems. LRZ, Garching, Germany, March 4-8, 2013. :
2012
- MPC and Coarray Fortran: alternatives to classic MPI implementations on the examples of scalable lattice Boltzmann flow solvers. Poster, 15th Results and Review Workshop of the HLRS, Stuttgart, Germany, October 10-11, 2012. :
- LIKWID Tutorial: Lightweight performance tools, 6th International Parallel Tools Workshop, HLRS, Stuttgart, Germany, September 26, 2012. :
2011
- Domain decomposition and locality optimization for large-scale lattice Boltzmann simulations, ParCFD special session on LBM, Barcelona, Spain, May 2011. :
2010
- Partitioning for lattice Boltzmann solver, LBM Day, Bochum, Germany, November 30, 2010. :
- Multicore-aware parallel temporal blocking of stencil codes for shared and distributed memory, LSPP10, the Workshop on Large-Scale Parallel Processing at IPDPS 2010, Atlanta, Georgia, USA, April 23, 2010. :
- A Pipelined, Multicore-aware Approach to Parallel Temporal Blocking of Stencil Codes for Shared and Distributed Memory, Facing the Multicore-Challenge, Heidelberg, Germany, March 19, 2010. :
2009
- Efficient temporal blocking for stencil computations by multicore-aware wavefront parallelization. 33rd Annual IEEE International Computer Software and Applications Conference (COMPSAC 2009), Best Paper Award, Seattle (WA, USA), July 20-24, 2009. :
- Enabling temporal blocking for stencil computations by multicore-aware wavefront parallelization. CSE Seminar, UC Berkeley and Lawrence Berkeley National Laboratory, Berkeley, CA, USA, May 15, 2009. :
2015
- Modeling and analyzing performance for highly optimized propagation steps of the lattice Boltzmann method on sparse lattices, (2015), Version 2, submitted to ISC’16, arXiv:1410.0412. :
- Short Note on Costs of Floating Point Operations on current x86-64 Architectures: Denormals, Overflow, Underflow, and Division by Zero, (2015) arXiv:1506.03997. :
2014
- Modeling and analyzing performance for highly optimized propagation steps of the lattice Boltzmann method on sparse lattices, (2014), Version 1, arXiv:1410.0412v1. :
- Performance-Optimierung des Lattice-Boltzmann-Lösers im Verbundprojekt OptiLBM [Performance optimization of the lattice Boltzmann solver in the joint project OptiLBM], (2014) Quartl No. 70. :
2013
- Asynchronous MPI for the Masses, (2013) arXiv:1302.4280. :
2010
- Optimizing ccNUMA locality for task-parallel execution under OpenMP and TBB on multicore-based systems, (2010) arXiv:1101.0093v1. :
2009
- A Proof of Concept for Optimizing Task Parallelism by Locality Queues, (2009) arXiv:0902.1884. :