They mention promising results on Apple Silicon GPUs and even cite the contributions from Vello, but I don't see a Metal implementation in there, and the benchmark only shows results from an RTX 2080. Is it safe to assume that they're referring to the WGPU version when talking about M-series chips?
https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-co...
I would love to be enlightened about any real-world applications of radix sort that I may have missed, though, since it's a cool algorithm. Hence my question above.
LLMs are made from dense matrices, aren't they?
[1] https://developer.nvidia.com/blog/mastering-llm-techniques-i...