General Looping

    foreachindex(
        f, itr, backend::Backend=get_backend(itr);

        # CPU settings

        # GPU settings
    )

Parallelised for loop over the indices of an iterable.

It allows you to run normal Julia code on a GPU over multiple arrays - e.g. CuArray, ROCArray, MtlArray, oneArray - with one GPU thread per index.

On CPUs, at most max_tasks threads are launched, or fewer so that each thread processes at least min_elems indices; if only a single task is needed, f is inlined and no thread is launched. Tune min_elems to your function: the more expensive each call is, the fewer elements are needed to amortise the cost of launching a thread (a few μs). The scheduler can be :polyester, to use the cheap threads of Polyester.jl, or :threads, to use normal Julia threads; either can be faster depending on the function, but in general :threads is more composable.
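The CPU task-count rule described above can be sketched in plain Julia as follows. This is only an illustration under stated assumptions, not the library's implementation: the name sketch_foreachindex is hypothetical, the defaults shown are guesses, it uses only Base threads (no scheduler option), and min_elems is assumed to be at least 1.

```julia
# Hypothetical helper illustrating the max_tasks / min_elems rule.
# Not the library's code; names and defaults are assumptions.
function sketch_foreachindex(f, itr; max_tasks=Threads.nthreads(), min_elems=1)
    n = length(itr)

    # At most max_tasks tasks, but few enough that each one gets
    # roughly min_elems indices or more; never fewer than one task.
    num_tasks = max(1, min(max_tasks, n ÷ min_elems))

    if num_tasks == 1
        # A single task is enough: call f directly, launch nothing.
        for i in eachindex(itr)
            f(i)
        end
        return nothing
    end

    # Split the indices into contiguous chunks, one spawned task per chunk.
    chunk_len = cld(n, num_tasks)
    tasks = map(Iterators.partition(eachindex(itr), chunk_len)) do part
        Threads.@spawn for i in part
            f(i)
        end
    end
    foreach(wait, tasks)
    return nothing
end

x = collect(1:100)
y = similar(x)

# With 100 elements and min_elems=10, no more than 4 tasks are used here.
sketch_foreachindex(x; max_tasks=4, min_elems=10) do i
    @inbounds y[i] = 2 * x[i] + 1
end
```

Raising min_elems past the length of the input forces the single-task path, which is a simple way to compare the threaded and inline variants for a given loop body.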


Normally you would write a for loop like this:

x = Array(1:100)
y = similar(x)
for i in eachindex(x)
    @inbounds y[i] = 2 * x[i] + 1
end

Using this function you can have the same for loop body over a GPU array:

using CUDA
const x = CuArray(1:100)
const y = similar(x)
foreachindex(x) do i
    @inbounds y[i] = 2 * x[i] + 1
end

Note that the above code is pure arithmetic, which you can write directly (and on some platforms it may be faster) as:

using CUDA
x = CuArray(1:100)
y = 2 .* x .+ 1

Important note: to use this function on a GPU, the objects referenced inside the loop body must have known types, i.e. they must be arguments or locals of an enclosing function, or const global objects (though you should avoid global objects anyway). For example:

using oneAPI

x = oneArray(1:100)

# CRASHES - typical error message: "Reason: unsupported dynamic function invocation"
# foreachindex(x) do i
#     x[i] = i
# end

function somecopy!(v)
    # Because it is inside a function, the type of `v` will be known
    foreachindex(v) do i
        v[i] = i
    end
end

somecopy!(x)    # This works
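
To see what "known types" means without a GPU, the following CPU-only sketch asks Julia's type inference what a loop body returns when it reads a non-const global versus a const global. The names x_plain, x_const, body_plain and body_const are hypothetical and exist only for this illustration; Base.return_types is an introspection helper, and its exact output can vary between Julia versions.

```julia
# Run at top level (global scope), since `const` declares a global.
x_plain = collect(1:100)        # non-const global: its type may change later
const x_const = collect(1:100)  # const global: its type is fixed

# Two loop bodies that differ only in which global they read.
body_plain(i) = x_plain[i]
body_const(i) = x_const[i]

# Inference cannot assume anything about a non-const global,
# so the result type of the first body is not concrete.
inferred_plain = only(Base.return_types(body_plain, (Int,)))

# With a const global the element type is known to the compiler.
inferred_const = only(Base.return_types(body_const, (Int,)))

println("non-const global: ", inferred_plain)
println("const global:     ", inferred_const)
```

On the CPU an uninferred body merely falls back to slower dynamic dispatch, whereas GPU compilation cannot fall back in this way, which is consistent with the "unsupported dynamic function invocation" error quoted above.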