About me
I am a PhD student and research assistant at TU Dortmund. My advisors are Prof. Turek (Chair of Applied Mathematics and Numerics) and Prof. Müller (Chair of Computer Graphics). I have recently submitted my thesis and am busy preparing for the defence (which happens to take the form of a full rigorosum and not a disputation).
Research interests
Broadly speaking, my research interests lie in the field of scientific computing; more specifically, I pursue research into parallelisation of numerical techniques for continuum mechanics that scale well in all aspects. Problems from this field are characterised by an inherent trade-off between various aspects of efficiency, and in our group, the task of balancing them has been termed hardware-oriented numerics.
Hardware-oriented numerics
Numerical efficiency affects both the discretisation and the solution. My research is focused on finite element discretisations and multilevel/multigrid methods. The favourable theoretical properties and serial convergence rates should be preserved in practice and in parallel: For an ideal parallelisation, the convergence rates should be independent of the partitioning of a given domain into subdomains.
Parallel efficiency refers to how well a given algorithm can be parallelised: I am interested in all three levels of parallelism, namely the coarse-grained parallelism between the nodes in a compute cluster, the medium-grained parallelism between the CPU cores in each node, and the fine-grained parallelism within many-core processors. Each level of parallelism has its own associated communication model, from message passing between distributed memories over (explicit or implicit) locks and mutexes on shared data in global off-chip memory, down to blocks of concurrent threads synchronising via small, on-chip memories. Two important aspects in this context are weak and strong scalability of an algorithm. The former means that when doubling the problem size and the number of compute resources, the time to solution should not change, and the latter means that doubling the amount of resources but leaving the problem size unchanged should lead to a halving of the solution time, i.e., the parallel speedup is linear in the amount of added compute resources. Different levels of parallelism exhibit different communication characteristics, and for both perfect strong and perfect weak scalability, communication must be fully overlapped with independent computations. One may argue that weak scalability can be identified, in terms of convergence rates rather than time to solution, with parallel (numerical) efficiency: Doubling the number of subdomains and the number of compute nodes but refining each subdomain by the same amount should not change convergence rates.
Finally, good hardware efficiency requires highly optimised implementations, with the goal of extracting a significant fraction of the peak performance of modern computer architectures: Hardware efficiency affects both the single-processor performance and the communication characteristics of a given architecture on all levels of parallelism and all levels of the memory hierarchy (memory wall). In practice this means that high performance computing techniques, efficient data structures, data layouts and blocking techniques adapted to the memory hierarchy of each architecture must be developed and applied: Different strategies are required for different computer systems and different levels of parallelism. As an example, current data-parallel, throughput-oriented architectures offer the potential to accelerate numerical computations by an order of magnitude compared to von Neumann CPUs and instruction-level parallelism.
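The two notions of scalability described above can be made concrete with a small sketch. The function names and all timings below are hypothetical illustration values, not measurements from any of my codes.

```python
def strong_speedup(t_base, t_parallel):
    """Speedup for a fixed problem size: T(1) / T(p)."""
    return t_base / t_parallel


def strong_efficiency(t_base, t_parallel, p):
    """Fraction of the ideal linear speedup; 1.0 means perfect strong scaling."""
    return strong_speedup(t_base, t_parallel) / p


def weak_efficiency(t_base, t_scaled):
    """Problem size grows with p; 1.0 means the time to solution is unchanged."""
    return t_base / t_scaled


# Hypothetical wall-clock times in seconds, keyed by the number of resources p.
strong_timings = {1: 64.0, 2: 33.0, 4: 18.0, 8: 11.0}  # fixed problem size
weak_timings = {1: 64.0, 2: 65.0, 4: 68.0, 8: 74.0}    # problem size grows with p

for p in sorted(strong_timings):
    s_eff = strong_efficiency(strong_timings[1], strong_timings[p], p)
    w_eff = weak_efficiency(weak_timings[1], weak_timings[p])
    print(f"p={p}: strong efficiency {s_eff:.2f}, weak efficiency {w_eff:.2f}")
```

In both cases an efficiency of 1.0 corresponds to the ideal behaviour; values below 1.0 typically indicate communication that is not overlapped with computation.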
GPU computing
GPGPU is one of my favourite research topics. I started out back in the days when GPGPU was obscure and hacky, and when GPGPU meant persuading the hardware to do fine-grained parallel computations through graphics APIs such as OpenGL. The field has matured a lot since then, in particular since the arrival of the NVIDIA CUDA and AMD STREAM architectures. My primary research focus has been and continues to be on inherently sparse finite element computations (and on large-scale hardware-software integration, see below), and I have since broadened my interests towards Lattice Boltzmann CFD and many other things. I have been honoured to present a couple of tutorials on the topic over the years and I have written sample code, and my publication list reflects my enthusiasm over the past couple of years. I have also had the honour to serve as a reviewer for various GPGPU-related publications, for journals and in program committees.
Large-scale heterogeneous computing
This project is a collaboration with my fellow PhD students here in Dortmund, Robert Strzodka from the Max Planck Institute, and Patrick McCormick and Jamaludin Mohd-Yusof from Los Alamos.
We are investigating how accelerators such as GPUs can be used efficiently in heterogeneous cluster environments for the parallel solution of large-scale PDE problems. In the FEAST project we try to combine the advantages and alleviate the disadvantages of domain decomposition and parallel multigrid methods. The core idea is to hide anisotropies locally and to exploit regularity globally to achieve good parallel scalability, optimal global convergence rates and a highly efficient solution.
While this approach works very well on commodity CPU clusters, it turns out to be hard to efficiently include co-processor hardware in such schemes without exposing the changes in the underlying solver infrastructure to the application programmer. For instance, these co-processors have very limited local storage, and we need to model the video memory on GPUs as caches with manual or automatic prefetching. These slides summarise our approach to accelerating a complex application from computational solid mechanics (CSM) and fluid dynamics (CFD) without changing a single line of application code. Please refer to my publication list for more recent results.
The Partnership for Advanced Computing in Europe (PRACE) recognised our efforts in 2008, and we were awarded the first PRACE award at ISC'08.
Mixed precision
This is a project I pursue in collaboration with Robert Strzodka.
Most of these architectures deliver significantly higher performance in single than in double precision (GPUs have only recently started to provide double precision arithmetic). As can easily be shown, the relation between computational precision and result accuracy is highly non-monotonic. My work has focused on emulation techniques and especially mixed precision schemes, and we have devised schemes that allow up to 99% of the computations to be performed in low precision without sacrificing the high accuracy of the results. In contrast to previous research (dating back to the 1960s), we have concentrated on the performance aspects of such schemes, especially for solvers of multigrid type. Such mixed precision schemes are also very hardware-efficient. For a detailed overview of this field of research as well as a discussion of our approaches and some results on GPUs and FPGAs, please refer to our survey paper; a number of updated results can be found in my publication list.
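The basic idea behind such schemes can be illustrated with a generic mixed precision iterative refinement (defect correction) loop. This is a minimal sketch under several assumptions and not one of the schemes from the survey paper: single precision is emulated by rounding through Python's struct module, a few Jacobi sweeps stand in for a multigrid solver, and the small tridiagonal test matrix and all function names are made up for illustration.

```python
import struct


def to_single(x):
    """Round a double precision float to the nearest IEEE single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def jacobi_single(A, r, sweeps):
    """Approximately solve A c = r with Jacobi sweeps in emulated single precision."""
    n = len(r)
    A32 = [[to_single(a) for a in row] for row in A]
    r32 = [to_single(v) for v in r]
    c = [0.0] * n
    for _ in range(sweeps):
        new = []
        for i in range(n):
            s = r32[i]
            for j in range(n):
                if j != i:
                    # Every intermediate result is rounded to single precision.
                    s = to_single(s - to_single(A32[i][j] * c[j]))
            new.append(to_single(s / A32[i][i]))
        c = new
    return c


def residual(A, x, b):
    """Residual b - A x, computed in double precision."""
    n = len(x)
    return [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]


def mixed_precision_solve(A, b, tol=1e-12, max_outer=50, sweeps=20):
    """Outer loop in double precision, inner correction solve in single precision."""
    x = [0.0] * len(b)
    for outer in range(max_outer):
        r = residual(A, x, b)               # high precision defect
        if max(abs(v) for v in r) < tol:
            return x, outer
        c = jacobi_single(A, r, sweeps)     # low precision correction
        x = [xi + ci for xi, ci in zip(x, c)]  # high precision update
    return x, max_outer


# Hypothetical test problem: a diagonally dominant tridiagonal matrix.
n = 8
A = [[4.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0) for j in range(n)]
     for i in range(n)]
b = [1.0] * n

x, outer_iterations = mixed_precision_solve(A, b)
print("outer iterations:", outer_iterations)
print("max residual:", max(abs(v) for v in residual(A, x, b)))
```

Only the residual computation and the solution update run in double precision; the bulk of the arithmetic happens in the inner solver, which is why the low precision part dominates the operation count.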
Other research interests
- Software aspects of HPC in general, with a special focus on applications suffering from the memory wall problem
- Multi- and Manycore programming techniques, NUMA
- SpMV
- Hardware evolution towards heterogeneity
- Large scale FEM solvers with applications in CFD and CSM
- Hierarchical 3D hexahedral mesh generation for FEM multigrid solvers
- Hexahedral mesh adaptation to complex CAD geometries
- GPU-accelerated techniques in physically-based modelling: Interactive rendering of ODE- and PDE-based effects such as water waves, cloth simulation, particle systems, fire and smoke etc. In short: Effects to be found in virtual environments and computer games two generations from now. Refer to Jens Krüger's PhD thesis for an excellent overview.
- Parallelisation of numerical algorithms on distributed and shared-memory architectures
- Cache-oblivious programming techniques for FEM components and hardware-oriented numerics in general.
- Infrastructure for large-scale simulation software: benchmarking, regression tests etc. built on top of existing codes with scripting languages and cluster queuing systems.
Non-scientific life
Some of my non-scientific life is dedicated to a group called VzUO. We collect, repair and upgrade second-hand computers and provide them to schools and social initiatives in eastern Europe, in particular Hungary, Romania, Ukraine, Serbia, Croatia, the Czech Republic, Slovakia and others, to facilitate education in IT-related topics.