References

This library is built on the unique Julia infrastructure for transpiling code to GPU backends, and years spent developing the JuliaGPU ecosystem that make it a joy to use. In particular, credit should go to the following people and work:

  • The Julia language design, which made code manipulation and generation a first class citizen: Bezanson J, Edelman A, Karpinski S, Shah VB. Julia: A fresh approach to numerical computing. SIAM review. 2017.

  • The GPU compiler infrastructure built on top of Julia's unique compilation model: Besard T, Foket C, De Sutter B. Effective extensible programming: unleashing Julia on GPUs. IEEE Transactions on Parallel and Distributed Systems. 2018.

  • The KernelAbstractions.jl library with its unique backend-agnostic compilation: Churavy V, Aluthge D, Wilcox LC, Schloss J, Byrne S, Waruszewski M, Samaroo J, Ramadhan A, Meredith SS, Bolewski J, Smirnov A. JuliaGPU/KernelAbstractions. jl: v0.8.3.

  • For distributed applications, the MPI.jl library which makes integrating GPU codes with multi-node communication so easy: Byrne S, Wilcox LC, Churavy V. MPI. jl: Julia bindings for the Message Passing Interface. InProceedings of the JuliaCon Conferences 2021.

If you use AcceleratedKernels.jl in publications, please cite the works above.

While the algorithms themselves were implemented anew, multiple existing libraries and resources were useful; in no particular order:

  • Kokkos: https://github.com/kokkos/kokkos

  • RAJA: https://github.com/LLNL/RAJA

  • Thrust / CUDA C++ Core Libraries: https://github.com/nvidia/cccl

  • ThrustRTC: https://github.com/fynv/ThrustRTC

  • Optimizing parallel reduction in CUDA: https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf

  • Parallel prefix sum (scan) with CUDA: https://developer.download.nvidia.com/compute/cuda/2_2/sdk/website/projects/scan/doc/scan.pdf

  • Parallel prefix sum (scan) with CUDA: https://github.com/mattdean1/cuda

  • rocThrust: https://github.com/ROCm/rocThrust

  • FidelityFX: https://github.com/GPUOpen-Effects/FidelityFX

  • Intel oneAPI DPC++ library: https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-library.html

  • Metal performance shaders: https://developer.apple.com/documentation/metalperformanceshaders


Much of this work was possible because of the fantastic HPC resources at the University of Birmingham and the Birmingham Environment for Academic Research, which gave us free on-demand access to thousands of CPUs and GPUs that we experimented on, and the support teams we nagged. In particular, thank you to Kit Windows-Yule and Andrew Morris on the BlueBEAR and Baskerville T2 supercomputers' leadership, and Simon Branford, Simon Hartley, James Allsopp and James Carpenter for computing support.