Reductions
Apply a custom binary operator reduction on all elements in an iterable; can be used to compute minima, sums, counts, etc.
Other names: Kokkos::parallel_reduce, fold, aggregate.
Function signature:
reduce(op, src::AbstractGPUVector; init,
       block_size::Int=256, temp::Union{Nothing, AbstractGPUVector}=nothing,
       switch_below::Int=0)
Example computing a sum:
import AcceleratedKernels as AK
using CUDA
v = CuArray{Int16}(rand(1:1000, 100_000))
AK.reduce((x, y) -> x + y, v; init=0)
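For intuition about how the operator and init work together, the same pattern can be tried on the CPU with Julia's Base reduce, which also takes the operator first and init as a keyword. The following is a minimal sketch using only Base (no GPU required); the data and variable names are illustrative. In a parallel reduction it is safest to choose init as the neutral element of the operator, since it may be combined with partial results more than once.

```julia
# CPU-only illustration with Base.reduce; data and names are illustrative.
v = Int64[4, 7, 1, 9, 3]

# Sum: the neutral element of + is zero.
total = reduce((x, y) -> x + y, v; init=zero(eltype(v)))

# Minimum: the neutral element is the largest representable value.
smallest = reduce((x, y) -> x < y ? x : y, v; init=typemax(eltype(v)))

# Maximum: the neutral element is the smallest representable value.
largest = reduce((x, y) -> x > y ? x : y, v; init=typemin(eltype(v)))

println((total, smallest, largest))
```

Only the operator and its neutral element change between the three reductions; the call shape stays the same.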
Towards the end of a reduction only a few elements remain to be processed, and it is sometimes faster to transfer them to the CPU and finish there (a reduction needs a device-to-host transfer anyway to return the final result). The switch_below keyword controls this switch and may be worth using (benchmark!). Here it is used when computing a minimum, with the reduction operator defined in a Julia do block:
AK.reduce(v; init=typemax(eltype(v)), switch_below=100) do x, y
    x < y ? x : y
end
Yes, the lambda within the do block runs equally well on the CPU and the GPU; no code changes or duplication are required.
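The idea behind switch_below can be sketched on the CPU alone. The code below is a hypothetical illustration, not the AcceleratedKernels implementation: tree_reduce is a made-up name, the pairwise passes stand in for the parallel GPU phase, and the exact threshold semantics (here, stop once at most switch_below elements remain) are an assumption.

```julia
# Hypothetical CPU-only sketch of the idea behind `switch_below`; this is NOT
# the AcceleratedKernels implementation. `tree_reduce` is an illustrative name.
function tree_reduce(op, src::AbstractVector; init, switch_below::Int=0)
    buf = collect(src)                   # working copy; each pass halves it
    # "Parallel" phase: combine neighbouring pairs until few elements remain.
    while length(buf) > max(switch_below, 1)
        n = length(buf)
        half = cld(n, 2)                 # number of partial results this pass
        next = similar(buf, half)
        for i in 1:half                  # iterations are independent of each other
            j = 2i
            next[i] = j <= n ? op(buf[j - 1], buf[j]) : buf[j - 1]
        end
        buf = next
    end
    # "Serial" phase: finish the remaining elements in one sequential loop.
    acc = init
    for x in buf
        acc = op(acc, x)
    end
    return acc
end

v = rand(1:1000, 100_000)
op = (x, y) -> x < y ? x : y             # same lambda is used in both phases
r0 = tree_reduce(op, v; init=typemax(eltype(v)))
r100 = tree_reduce(op, v; init=typemax(eltype(v)), switch_below=100)
println(r0 == r100 == minimum(v))
```

Both calls give the same minimum; only the point at which the work moves from pairwise passes to a single sequential loop differs, which is why the threshold is a tuning knob rather than a correctness concern.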