References

This library is built on the unique Julia infrastructure for transpiling code to GPU backends, and years spent developing the JuliaGPU ecosystem that make it a joy to use. In particular, credit should go to the following people and work:

The Julia language design, which made code manipulation and generation a first class citizen: Bezanson J, Edelman A, Karpinski S, Shah VB. Julia: A fresh approach to numerical computing. SIAM review. 2017.
The GPU compiler infrastructure built on top of Julia's unique compilation model: Besard T, Foket C, De Sutter B. Effective extensible programming: unleashing Julia on GPUs. IEEE Transactions on Parallel and Distributed Systems. 2018.
The KernelAbstractions.jl library with its unique backend-agnostic compilation: Churavy V, Aluthge D, Wilcox LC, Schloss J, Byrne S, Waruszewski M, Samaroo J, Ramadhan A, Meredith SS, Bolewski J, Smirnov A. JuliaGPU/KernelAbstractions. jl: v0.8.3.
For distributed applications, the MPI.jl library which makes integrating GPU codes with multi-node communication so easy: Byrne S, Wilcox LC, Churavy V. MPI. jl: Julia bindings for the Message Passing Interface. InProceedings of the JuliaCon Conferences 2021.

If you use AcceleratedKernels.jl in publications, please cite the works above.

While the algorithms themselves were implemented anew, multiple existing libraries and resources were useful; in no particular order:

Kokkos: https://github.com/kokkos/kokkos
RAJA: https://github.com/LLNL/RAJA
Thrust / CUDA C++ Core Libraries: https://github.com/nvidia/cccl
ThrustRTC: https://github.com/fynv/ThrustRTC
Optimizing parallel reduction in CUDA: https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
Parallel prefix sum (scan) with CUDA: https://developer.download.nvidia.com/compute/cuda/2_2/sdk/website/projects/scan/doc/scan.pdf
Parallel prefix sum (scan) with CUDA: https://github.com/mattdean1/cuda
rocThrust: https://github.com/ROCm/rocThrust
FidelityFX: https://github.com/GPUOpen-Effects/FidelityFX
Intel oneAPI DPC++ library: https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-library.html
Metal performance shaders: https://developer.apple.com/documentation/metalperformanceshaders

Much of this work was possible because of the fantastic HPC resources at the University of Birmingham and the Birmingham Environment for Academic Research, which gave us free on-demand access to thousands of CPUs and GPUs that we experimented on, and the support teams we nagged. In particular, thank you to Kit Windows-Yule and Andrew Morris on the BlueBEAR and Baskerville T2 supercomputers' leadership, and Simon Branford, Simon Hartley, James Allsopp and James Carpenter for computing support.