Roadmap · AcceleratedKernels.jl

Roadmap / Future Plans

Help is very welcome for any of the below:

Automated optimisation / tuning of parameters such as `block_size` for a given input; this can be made algorithm-agnostic.
- Maybe something like `AK.@tune reduce(f, src, init=init, block_size=$block_size) block_size=(64, 128, 256, 512, 1024)`. Help from macro wizards is welcome!
- Or make it more general, for example:
```julia
AK.@tune begin
    reduce(f, src, init=init,
           block_size=$block_size,
           switch_below=$switch_below)
    block_size=(64, 128, 256, 512, 1024)
    switch_below=(1, 10, 100, 1000, 10000)
end
```
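Until such a macro exists, the same grid search can be sketched as a plain function. The `tune` helper and the `chunked_sum` workload below are hypothetical illustrations, not part of AcceleratedKernels.jl; they only assume that the algorithm being tuned accepts its parameters as keyword arguments.

```julia
# Hypothetical helper, not part of AcceleratedKernels.jl: times `f` for every
# combination of the candidate keyword values and returns the fastest one.
function tune(f; repeats::Int=3, candidates...)
    names = keys(candidates)                  # e.g. (:block_size, :switch_below)
    best_params, best_time = nothing, Inf
    for combo in Iterators.product(values(candidates)...)
        params = NamedTuple{names}(combo)
        f(; params...)                        # warm-up run, keeps compilation out of the timing
        t = minimum(@elapsed(f(; params...)) for _ in 1:repeats)
        if t < best_time
            best_params, best_time = params, t
        end
    end
    return best_params, best_time
end

# Stand-in workload (an assumption for this example): a chunked sum whose
# chunk length plays the role of `block_size`.
function chunked_sum(src; block_size, switch_below)
    length(src) < switch_below && return sum(src)   # small inputs take the direct path
    return sum(sum(view(src, i:min(i + block_size - 1, length(src))))
               for i in 1:block_size:length(src))
end

src = rand(Float32, 100_000)
workload(; kwargs...) = chunked_sum(src; kwargs...)

params, seconds = tune(workload;
                       block_size=(64, 128, 256, 512, 1024),
                       switch_below=(1, 10, 100, 1000, 10000))
```

A macro version would mainly add syntax: it would collect the `$`-interpolated names and generate the same loop over `Iterators.product`.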
Add performant multithreaded Julia implementations for all algorithms; e.g. `foreachindex` has one, `any` does not.
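As a starting point for discussion, a CPU-threaded `any` could look roughly like the sketch below. The name `any_threaded` and the `ntasks` keyword are assumptions made for this example, not the package's API; it uses only `Base.Threads`.

```julia
using Base.Threads: @spawn, Atomic

# Hypothetical sketch, not the AcceleratedKernels.jl API: split `v` into
# chunks, scan each chunk in its own task, and stop early once any task
# has found a match.
function any_threaded(pred, v::AbstractVector; ntasks::Int=Threads.nthreads())
    isempty(v) && return false
    found = Atomic{Bool}(false)
    chunk_len = cld(length(v), ntasks)
    tasks = map(Iterators.partition(eachindex(v), chunk_len)) do idxs
        @spawn for i in idxs
            found[] && break            # another task already found a match
            if pred(v[i])
                found[] = true
                break
            end
        end
    end
    foreach(wait, tasks)
    return found[]
end

has_even = any_threaded(iseven, [1, 3, 5, 6])
```

Checking the shared flag on every iteration keeps the early exit simple; a tuned version might check it only every few hundred elements to reduce contention.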
Is there any way to expose the warp size from the backends? It would be useful in reductions.
Define default `init` values for often-used reductions? Or just expose higher-level functions like `sum`, `minimum`, etc.?
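Both options can be sketched together: a small table of neutral elements, plus thin wrappers that use it. The names `neutral_init`, `my_sum`, `my_minimum` and `my_maximum` are hypothetical, and `Base.reduce` stands in here for a reduction that requires an explicit `init`.

```julia
# Hypothetical helper, not part of AcceleratedKernels.jl: the neutral element
# of a few common reduction operators for element type `T`.
neutral_init(::typeof(+), ::Type{T}) where {T} = zero(T)
neutral_init(::typeof(*), ::Type{T}) where {T} = one(T)
neutral_init(::typeof(min), ::Type{T}) where {T} = typemax(T)
neutral_init(::typeof(max), ::Type{T}) where {T} = typemin(T)

# Higher-level wrappers that fill in `init` automatically.
my_sum(src) = reduce(+, src; init=neutral_init(+, eltype(src)))
my_minimum(src) = reduce(min, src; init=neutral_init(min, eltype(src)))
my_maximum(src) = reduce(max, src; init=neutral_init(max, eltype(src)))

total = my_sum(Float32[1, 2, 3])
smallest = my_minimum([3, 1, 2])
```

One design consequence to weigh: with a neutral `init`, an empty input returns that neutral value (for example `typemax(T)` from `my_minimum`), whereas `Base.minimum` throws on empty collections.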
Add a performance regression runner.
Other ideas? Post an issue, or open a discussion on the Julia Discourse.