17th July 2014

Day 2 of the conference had the following talks. Prof. Dr. Bernd Brügmann gave a short introduction. He pointed out that Jena ranks 10th in physics in Germany and has about 100,000 inhabitants and 20,000 students.

- Dr. Karl Rupp, Vienna, Lessons Learned in Developing the Linear Algebra Library ViennaCL. Notes: C++ operator overloading normally creates temporaries and special trickery is necessary to avoid them, ViennaCL is not callable from Fortran because of C++/operator overloading, eigen.tuxfamily.org, Karl Rupp's slides, with CUDA 5 and 6, OpenCL and CUDA are more or less on par
- Prof. Dr. Rainer Heintzmann, Jena, CudaMat - a toolbox for Cuda computations. Notes: Information on CudaMat, 300 GB fly-head, Delft Image Processing Library, wrote his own CUDA memory allocator that takes its storage from a heap, does not work with Octave
- Prof. Dipl.-Ing. Dr. Gundolf Haase, Graz, Interpolation with Radial Basis Functions on GPGPUs using CUDA. Notes: AVL Graz, car industry, simulation software, OpenACC disappointing, significant speedup with GPU/CUDA, rule of thumb: start with OpenMP, then MIC, then OpenACC
- Lars Kühne, Jena, A Concurrent Algorithm for Computing the Flow Complex.
- Axel Hübl, Helmholtz-Zentrum Dresden-Rossendorf, Scaling Plasma Simulations to more than 18,000 GPUs.
- Carsten Eye Frigaard, www.lab4241.com, Running GADGET2 on GPUs: Optimizing Tree-search Algorithms by Detailed Profiling of GPU Code. Notes: `gpuprofgui`, C-source level counters, PTX level counters, SASS level counters, BARRA, UNISIM
- M. Sc. Moritz Kreutzer, Erlangen, Building blocks for sparse linear algebra on heterogeneous hardware. Notes: excellent talk, 45% comes from accelerators in the Top50 supercomputers, vulnerability to hardware faults, fusing kernels, checkpoints, ESSEX programme/project, JDS, CRS, SSE, AVX, Sliced ELLPACK, computation done in permuted fashion
- Dipl.-Phys. Marcus Noack, Oslo, Parallel and simultaneous computation of eikonal and transport equations by taking full advantage of GPU computer architecture. Notes: Oil, seismic, just a single CUDA kernel, used OpenMP
- Dr. Manfred Liebmann, Graz, Optimal Control of the Schrödinger Equation on Many-Core Architectures. Notes: Crank-Nicolson much worse, Intel compiler not better than gcc/g++, 10+ PDEs per iteration, good initial approximation necessary, GPU twice as fast, unitarity not a problem
- Dr. Johannes Langguth, Oslo, Scalable Finite Volume Computations in Heterogeneous Systems.
- Dipl.-Inf. Ralf Seidler, Jena, Implementing the Radon Transform using Advanced Techniques on GPGPUs. Notes: GTX750 consumer card, no problem with single precision
- Prof. Dr. Gerhard Zumbusch, Jena, A parallel functional language for high performance finite difference stencil codes. Notes: very interesting, excellent presentation, large gap between GFlop/s and memory speed, you have to fuse operations, Runge-Kutta discretization, you are measuring memory speed, not computational speed
- Mohammed Sourouri, Oslo, An Optimized Intra-Node Communication Scheme Using Multiple CUDA Streams and OpenMP Threads.
- Carsten Eckert, Helmholtz-Zentrum Dresden-Rossendorf, An adaptive, load-balanced MPI/GPU-Code for calculating the gain in High Power Laser media. Notes: Arch Linux, 64 GPUs, all communication via MPI, 1 point = 1 kernel, Tesla K20m, Computational Radiation Physics, Monte-Carlo integration
- Dr. Erik Rodner, Jena, Computational Challenges for Visual Recognition with Deep Learning Architectures.
- Dipl.-Phys. Richard Pausch, Dresden, Scalable, interactive 3D in-situ visualization of large-scale Simulations.
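
Karl Rupp's remark about temporaries refers to the fact that an expression like `d = a + b + c` on overloaded vector types normally allocates an intermediate vector for every `+`. A common form of the "special trickery" is expression templates, a technique Eigen is well known for. Below is a minimal, hypothetical C++ sketch of the idea; the types `VecExpr`, `Vec`, `Sum` and the helper `Operand` are made up for illustration and are not taken from ViennaCL or Eigen. `operator+` only builds a lightweight expression object, and the whole sum is evaluated in a single loop when it is assigned to a `Vec`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Base class for all vector expressions (CRTP): forwards to the derived type.
template <typename E>
struct VecExpr {
    double operator[](std::size_t i) const { return static_cast<const E&>(*this)[i]; }
    std::size_t size() const { return static_cast<const E&>(*this).size(); }
};

// A concrete vector that owns its data.
class Vec : public VecExpr<Vec> {
public:
    explicit Vec(std::size_t n, double value = 0.0) : data_(n, value) {}

    // Evaluating an expression: one loop over all elements, no temporary vectors.
    template <typename E>
    Vec(const VecExpr<E>& expr) : data_(expr.size()) {
        for (std::size_t i = 0; i < data_.size(); ++i) data_[i] = expr[i];
    }

    double operator[](std::size_t i) const { return data_[i]; }
    double& operator[](std::size_t i) { return data_[i]; }
    std::size_t size() const { return data_.size(); }

private:
    std::vector<double> data_;
};

// How an operand is stored inside an expression node: sub-expressions by value
// (they are tiny), concrete vectors by reference (no copy of the data).
template <typename E> struct Operand { using type = const E; };
template <> struct Operand<Vec> { using type = const Vec&; };

// Expression node for element-wise addition; nothing is computed until indexed.
template <typename L, typename R>
class Sum : public VecExpr<Sum<L, R>> {
public:
    Sum(const L& l, const R& r) : l_(l), r_(r) { assert(l.size() == r.size()); }
    double operator[](std::size_t i) const { return l_[i] + r_[i]; }
    std::size_t size() const { return l_.size(); }

private:
    typename Operand<L>::type l_;
    typename Operand<R>::type r_;
};

// operator+ only builds an expression node; no arithmetic happens here.
template <typename L, typename R>
Sum<L, R> operator+(const VecExpr<L>& l, const VecExpr<R>& r) {
    return Sum<L, R>(static_cast<const L&>(l), static_cast<const R&>(r));
}
```

For example, `Vec d = a + b + c;` with `a`, `b`, `c` of length 3 filled with 1, 2 and 4 yields a `d` whose entries are all 7, computed in one pass without an intermediate vector. Because leaf vectors are held by reference, an expression object must not outlive the vectors it refers to. A GPU library would hand such a fused expression to one device kernel instead of a CPU loop.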
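
Moritz Kreutzer's talk compared several sparse matrix storage formats (JDS, CRS, Sliced ELLPACK). As a point of reference, here is a minimal C++ sketch of sparse matrix-vector multiplication in the CRS (compressed row storage) format; the struct and function names are hypothetical. As far as I understand them, JDS and Sliced ELLPACK variants additionally sort rows by length and pad groups of rows to a common length, which would explain the remark about computing in permuted fashion; that layout is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Compressed Row Storage (CRS, also called CSR).
struct CrsMatrix {
    std::size_t rows;
    std::vector<std::size_t> row_ptr;  // size rows + 1; row r occupies [row_ptr[r], row_ptr[r + 1])
    std::vector<std::size_t> col_idx;  // column index of each stored nonzero
    std::vector<double> val;           // value of each stored nonzero
};

// Computes y = A * x, one row at a time.
std::vector<double> spmv(const CrsMatrix& A, const std::vector<double>& x) {
    assert(A.row_ptr.size() == A.rows + 1);
    std::vector<double> y(A.rows, 0.0);
    for (std::size_t r = 0; r < A.rows; ++r) {
        double sum = 0.0;
        for (std::size_t k = A.row_ptr[r]; k < A.row_ptr[r + 1]; ++k)
            sum += A.val[k] * x[A.col_idx[k]];  // indirect access into x
        y[r] = sum;
    }
    return y;
}
```

For the 3x3 matrix with rows (4, 0, 1), (0, 2, 0), (3, 0, 5), `row_ptr` is {0, 2, 3, 5}, `col_idx` is {0, 2, 1, 0, 2}, `val` is {4, 1, 2, 3, 5}, and multiplying by x = (1, 2, 3) gives (7, 4, 18). The varying row lengths and the indirect access into `x` are what make this format awkward for SIMD units and GPUs.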
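
Gerhard Zumbusch's point that such codes measure memory speed rather than computational speed, and the kernel fusion mentioned in Moritz Kreutzer's talk, can be illustrated with a small C++ sketch. Each loop below stands in for a separate GPU kernel (an assumption for illustration; the function names are hypothetical). The unfused version writes an intermediate vector to memory and reads it back; the fused version keeps the intermediate value in a register, cutting memory traffic from 5n to 3n doubles for the same 3n floating-point operations.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused: two loops, each standing in for a separate GPU kernel.
// Traffic of the loops (ignoring caches and vector allocation):
// read x, y, write t = 3n doubles, then read t, write z = 2n doubles, 5n in total.
std::vector<double> scaled_axpy_unfused(double a, double b,
                                        const std::vector<double>& x,
                                        const std::vector<double>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    std::vector<double> t(n), z(n);
    for (std::size_t i = 0; i < n; ++i) t[i] = a * x[i] + y[i];  // "kernel" 1
    for (std::size_t i = 0; i < n; ++i) z[i] = b * t[i];         // "kernel" 2
    return z;
}

// Fused: one loop; the intermediate a * x[i] + y[i] stays in a register.
// Traffic of the loop: read x, y, write z = 3n doubles, same 3n flops as above.
std::vector<double> scaled_axpy_fused(double a, double b,
                                      const std::vector<double>& x,
                                      const std::vector<double>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    std::vector<double> z(n);
    for (std::size_t i = 0; i < n; ++i) z[i] = b * (a * x[i] + y[i]);
    return z;
}
```

Both functions compute z = b(ax + y); with a = 2, b = 3, x = (1, 2, 3), y = (4, 5, 6) each returns (18, 27, 36). Since the arithmetic is identical, once the vectors no longer fit in cache the fused loop should be faster roughly in proportion to the reduced memory traffic.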

**Categories:** C / C++, CUDA, mathematics, NVidia
**Tags:**
**Author:** Elmar Klausmeier