Reduce · AcceleratedKernels.jl

Reductions

Reduce all elements of an iterable with a custom binary operator; this can be used to compute minima, sums, counts, etc.

Other names: Kokkos::parallel_reduce, fold, aggregate.
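To make the semantics concrete, here is a minimal CPU-only sketch using Base Julia's own `reduce` (not AcceleratedKernels) on a small hand-written vector. The values and variable names are made up for illustration; the point is only that a sum, a minimum and a count all have the same shape: a binary operator plus a starting value.

```julia
# Plain Base Julia on the CPU; no GPU or AcceleratedKernels needed.
v = [7, 3, 9, 1, 4]

# Sum: the operator is +, starting from 0.
total = reduce(+, v; init=0)

# Minimum: the operator keeps the smaller argument, starting from the
# largest representable value.
smallest = reduce((x, y) -> x < y ? x : y, v; init=typemax(Int))

# Count of elements greater than 3: map each element to 0 or 1, then sum.
n_large = reduce(+, map(x -> Int(x > 3), v); init=0)

println((total, smallest, n_large))
```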

Function signature:

reduce(op, src::AbstractGPUVector; init,
       block_size::Int=256, temp::Union{Nothing, AbstractGPUVector}=nothing,
       switch_below::Int=0)

Example computing a sum:

import AcceleratedKernels as AK
using CUDA

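# 100_000 random values in 1:1000, stored on the GPU as Int16
v = CuArray{Int16}(rand(1:1000, 100_000))
# Sum using an explicit binary operator, starting from init
AK.reduce((x, y) -> x + y, v; init=0)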

Towards the end of a reduction very few elements are left to process, so it is sometimes faster to transfer the last few elements to the CPU and finish there (a reduction needs a device-to-host transfer anyway for the final result). switch_below may be worth using for this (benchmark!). Here we compute a minimum with the reduction operator defined in a Julia do block:

AK.reduce(v; init=typemax(eltype(v)), switch_below=100) do x, y
    x < y ? x : y
end
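The idea of finishing on the CPU can be sketched in plain Julia. The function below is a hypothetical CPU-only illustration, not the AcceleratedKernels implementation: `blockwise_reduce` is a made-up name, and it only borrows the keyword names `block_size` and `switch_below` from the signature above. It repeatedly reduces blocks of `block_size` elements into partial results, and once `switch_below` or fewer partial results remain it finishes with a single serial `reduce`. Because `init` is applied once per block, this sketch assumes `init` is a neutral element for `op` (0 for a sum, `typemax` for a minimum).

```julia
# Hypothetical helper, not part of AcceleratedKernels: imitates a block-wise
# reduction that hands the last few partial results to a serial reduce.
function blockwise_reduce(op, src::AbstractVector; init,
                          block_size::Int=256, switch_below::Int=0)
    @assert block_size > 1 "each pass must shrink the vector"
    partials = collect(src)

    # Each pass shrinks the vector by roughly a factor of block_size.
    while length(partials) > max(switch_below, 1)
        partials = [reduce(op, block; init=init)
                    for block in Iterators.partition(partials, block_size)]
    end

    # Final step: on a GPU this is where the few remaining elements would be
    # copied to the host and reduced there.
    return reduce(op, partials; init=init)
end

# Same do-block operator as in the example above, run on a CPU vector.
v = Int16.(rand(1:1000, 100_000))
m = blockwise_reduce(v; init=typemax(eltype(v)), switch_below=100) do x, y
    x < y ? x : y
end
@assert m == minimum(v)
```

With 100_000 elements and the default block size of 256, the first pass leaves 391 partial results and the second leaves 2, which is below the threshold of 100, so the loop stops and the serial step finishes the job. Whether such a switch pays off on real hardware depends on the device and data size, hence the advice to benchmark.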

Yes, the lambda within the do block can be executed equally well on both the CPU and the GPU; no code changes or duplication are required.