Benchmarks

Some arithmetic-heavy benchmarks are given below; see this repository for the code. Our paper, with a full analysis, will be linked here once it is published.

Arithmetic benchmark

See prototype/sort_benchmark.jl for a small-scale sorting benchmark and prototype/thrust_sort for the Nvidia Thrust wrapper. The results below are from a system with Linux 6.6.30-2-MANJARO, an Intel Core i9-10885H CPU, an Nvidia Quadro RTX 4000 with Max-Q Design GPU, Thrust 1.17.1-1, and Julia Version 1.10.4.

Sorting benchmark

For a first implementation, the AcceleratedKernels.jl sorter is within the same order of magnitude as Nvidia's official Thrust sorter (3.48x slower), and an order of magnitude faster (10.19x) than the Julia Base CPU radix sort (which is already one of the fastest).
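For readers who want to see how a speed ratio like the ones quoted above is typically derived, here is a minimal CPU-only sketch. It is not the benchmark script from this repository: `median_sort_time` is a hypothetical helper, the array size and repeat count are arbitrary, and it compares two Julia Base algorithms only. Timing a GPU sorter would additionally need device synchronisation around the timed call.

```julia
using Statistics: median

# Hypothetical helper (not part of AcceleratedKernels.jl): median wall-clock
# time in seconds of `sortfn!` applied to fresh copies of `data`.
function median_sort_time(sortfn!, data; repeats=5)
    sortfn!(copy(data))            # warm-up run, so compilation is not timed
    times = Float64[]
    for _ in 1:repeats
        buf = copy(data)           # copy outside the timed region
        t = @elapsed sortfn!(buf)
        push!(times, t)
    end
    return median(times)
end

data = rand(Int32, 1_000_000)      # arbitrary size, chosen for illustration

# Base default sort versus Base QuickSort, both on the CPU.
t_base  = median_sort_time(sort!, data)
t_quick = median_sort_time(v -> sort!(v; alg=QuickSort), data)

println("QuickSort / default time ratio: ", round(t_quick / t_base; digits=2), "x")
```

Taking the median over several runs on fresh copies avoids timing an already-sorted array and reduces the effect of one-off stalls.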

The sorting algorithms can also be combined with MPISort.jl for multi-device sorting - indeed, you can cooperatively sort using both your CPU and GPU! Or use 200 GPUs on the 52 nodes of Baskerville HPC to sort 538-855 GB of data per second (comparable with the highest figure reported in the literature, 900 GB/s on 262,144 CPU cores):

Sorting throughput

Hardware stats for nerds are available here. The full analysis will be linked here once our paper is published.