Dominik Göddeke -- Home

About me

I am a PhD student and research assistant at TU Dortmund. My advisors are Prof. Turek (Chair of Applied Mathematics and Numerics) and Prof. Müller (Chair of Computer Graphics). I have just recently submitted my thesis and am busy preparing myself for the defense (which happens to take the form of a full rigorosum and not a disputation).

Research interests

Broadly speaking, my research interests lie in the field of scientific computing; more specifically, I pursue research into parallelisation of numerical techniques for continuum mechanics that scale well in all aspects. Problems from this field are characterised by an inherent trade-off between various aspects of efficiency, and in our group, this perspective has been termed hardware-oriented numerics.

Hardware-oriented numerics

Numerical efficiency affects both the discretisation and the solution. My research is focused on finite element discretisations and multilevel/multigrid methods. The favourable theoretical properties and serial convergence rates should be preserved in practice and in parallel: For an ideal parallelisation, the convergence rates should be independent of the partitioning of a given domain into subdomains.

Parallel efficiency refers to how well a given algorithm can be parallelised: I am interested in all three levels of parallelism, the coarse-grained parallelism between the nodes in a compute cluster, the medium-grained parallelism between the CPU cores in each node, and the fine-grained parallelism within many-core processors. Each level of parallelism has its own associated communication model, ranging from message passing between distributed memories, through (explicit or implicit) locks and mutexes on shared data in global off-chip memory, down to blocks of concurrent threads synchronising via small, on-chip memories. Two important aspects in this context are weak and strong scalability of an algorithm. The former means that when doubling both the problem size and the number of compute resources, the time to solution should not change; the latter means that doubling the amount of resources while leaving the problem size unchanged should halve the solution time, i.e., the parallel speedup is linear in the amount of added compute resources. Different levels of parallelism exhibit different communication characteristics, and for perfect strong and weak scalability alike, communication must be fully overlapped with independent computations. One may argue that weak scalability can be identified, in terms of convergence rates rather than time to solution, with parallel (numerical) efficiency: Doubling the number of subdomains and the number of compute nodes while refining each subdomain by the same amount should not change convergence rates.
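The two notions of scalability described above can be turned into simple efficiency measures. The following is a minimal sketch in Python; the function names and the timings are hypothetical and chosen only for illustration, and they are not measurements from any of the projects described on this page.

```python
def strong_scaling_efficiency(t_base, t_parallel, resource_factor):
    """Fixed problem size, resource_factor times the resources.

    Ideal strong scaling gives t_parallel == t_base / resource_factor,
    i.e. a speedup equal to resource_factor and an efficiency of 1.0.
    """
    speedup = t_base / t_parallel
    return speedup / resource_factor


def weak_scaling_efficiency(t_base, t_scaled):
    """Problem size and resources are increased by the same factor.

    Ideal weak scaling keeps the time to solution constant,
    i.e. t_scaled == t_base and an efficiency of 1.0.
    """
    return t_base / t_scaled


# Hypothetical timings in seconds (illustration only).
# Strong scaling: same problem, twice the resources, half the time.
print(strong_scaling_efficiency(100.0, 50.0, 2))
# Weak scaling: twice the problem on twice the resources takes 125 s.
print(weak_scaling_efficiency(100.0, 125.0))
```

An efficiency below 1.0 in either measure indicates overhead, for instance communication that is not overlapped with independent computations.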

Finally, good hardware efficiency requires highly optimised implementations, with the goal of extracting a significant fraction of the peak performance of modern computer architectures: Hardware efficiency affects both the single-processor performance and the communication characteristics of a given architecture on all levels of parallelism and all levels of the memory hierarchy (memory wall). In practice this means that high performance computing techniques, efficient data structures, data layouts and blocking techniques adapted to the memory hierarchy of each architecture must be developed and applied: Different strategies are required for different computer systems and different levels of parallelism. As an example, current data-parallel, throughput-oriented architectures offer the potential to accelerate numerical computations by an order of magnitude compared with von Neumann CPUs that rely on instruction-level parallelism.

GPU computing

GPGPU is one of my favourite research topics. I started out back in the days when GPGPU was obscure and hacky, and when GPGPU meant persuading the hardware to do fine-grained parallel computations through graphics APIs such as OpenGL. The field has matured a lot since then, in particular since the arrival of the NVIDIA CUDA and AMD STREAM architectures. My primary research focus has been and continues to be on inherently sparse finite element computations (and on large-scale hardware-software integration, see below), and I have since broadened my interest towards Lattice Boltzmann CFD and many other things. I have been honoured to present a couple of tutorials on the topic over the years and I have written sample code, and my publication list reflects my enthusiasm over the past couple of years. I have also had the honour to serve as a reviewer for various GPGPU-related publications, for journals and in program committees.

Large-scale heterogeneous computing

This project is a collaboration with my fellow PhD students here in Dortmund, Robert Strzodka from the Max Planck Institute and Patrick McCormick and Jamaludin Mohd-Yusof from Los Alamos.

We are investigating how accelerators such as GPUs can be used efficiently in heterogeneous cluster environments for the parallel solution of large-scale PDE problems. In the FEAST project we try to combine the advantages and alleviate the disadvantages of domain decomposition and parallel multigrid methods. The core idea is to hide anisotropies locally and to exploit regularity globally to achieve good parallel scalability, optimal global convergence rates and a highly efficient solution.

While this approach works very well on commodity CPU clusters, it turns out to be hard to efficiently include co-processor hardware in such schemes without exposing the changes in the underlying solver infrastructure to the application programmer. For instance, these co-processors have very limited local storage, and we need to model the video memory on GPUs as caches with manual or automatic prefetching. These slides summarise our approach to accelerating a complex application from computational solid mechanics (CSM) and fluid dynamics (CFD) without changing a single line of application code; please refer to my publication list for more recent results.

The Partnership for Advanced Computing in Europe (PRACE) recognised our efforts in 2008, and we were awarded the first PRACE award at ISC'08.

Mixed precision

This is a project I pursue in collaboration with Robert Strzodka.

Most accelerator architectures deliver significantly higher performance in single than in double precision (GPUs have in fact only recently started to provide double precision arithmetic). As can easily be shown, the relation between computational precision and result accuracy is highly non-monotonic. My work has focused on emulation techniques and especially mixed precision schemes, and we have devised schemes that allow up to 99% of the computations to be performed in low precision without sacrificing the high accuracy of the results. In contrast to previous research (dating back to the 1960s), we have concentrated on the performance aspects of such schemes, especially for solvers of multigrid type. Such mixed precision schemes are also very hardware-efficient. For a detailed overview of this field of research as well as a discussion of our approaches and some results on GPUs and FPGAs, please refer to our survey paper; a number of updated results can be found in my publication list.
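To illustrate the general idea of a mixed precision scheme, here is a minimal sketch of classical iterative refinement in Python. It is not the solver used in our work: a plain Gauss-Seidel iteration stands in for the multigrid solver, single precision is emulated by rounding through the standard `struct` module, and the small tridiagonal test system as well as all function names are assumptions made for this example. The bulk of the arithmetic runs in the low precision inner solver, while only the residual and the solution update are computed in double precision.

```python
import struct


def to_single(x):
    """Round a Python float (double) to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def solve_low_precision(diag, off, rhs, sweeps=40):
    """Gauss-Seidel sweeps for a tridiagonal system with constant diagonal
    `diag` and constant off-diagonals `off`; every intermediate result is
    rounded to (emulated) single precision."""
    n = len(rhs)
    d, o = to_single(diag), to_single(off)
    r = [to_single(v) for v in rhs]
    x = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            s = r[i]
            if i > 0:
                s = to_single(s - to_single(o * x[i - 1]))
            if i < n - 1:
                s = to_single(s - to_single(o * x[i + 1]))
            x[i] = to_single(s / d)
    return x


def residual(diag, off, x, rhs):
    """Residual b - A x, evaluated in double precision."""
    n = len(rhs)
    res = []
    for i in range(n):
        ax = diag * x[i]
        if i > 0:
            ax += off * x[i - 1]
        if i < n - 1:
            ax += off * x[i + 1]
        res.append(rhs[i] - ax)
    return res


def mixed_precision_solve(diag, off, rhs, tol=1e-12, max_outer=50):
    """Outer correction loop in double, inner solves in single precision.
    Returns the solution and the number of low precision solves used."""
    x = [0.0] * len(rhs)
    for outer in range(max_outer):
        r = residual(diag, off, x, rhs)          # double precision
        if max(abs(v) for v in r) <= tol:
            return x, outer
        c = solve_low_precision(diag, off, r)    # single precision
        x = [xi + ci for xi, ci in zip(x, c)]    # double precision update
    return x, max_outer


# Hypothetical test system: tridiag(-1, 4, -1) with right-hand side of ones.
rhs = [1.0] * 8
x, outer_steps = mixed_precision_solve(4.0, -1.0, rhs)
print(outer_steps, max(abs(v) for v in residual(4.0, -1.0, x, rhs)))
```

The design choice to note is that the inner solver never sees the current iterate, only the residual; this is what allows the low precision corrections to accumulate into a result of double precision accuracy.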

Other research interests

Non-scientific life

Some of my non-scientific time is dedicated to a group called VzUO. We collect, repair and upgrade second-hand computers and provide them to schools and social initiatives in eastern Europe, in particular in Hungary, Romania, Ukraine, Serbia, Croatia, the Czech Republic, Slovakia and others, to facilitate education in IT-related topics.