Essentially, we solved the problem of writing our stack in a bulk-oriented way that Nvidia kernels can optimize. Think Apache Arrow, pure vectorized dataframe pipelines, etc. However, cuDF is 'eager', with per-step CPU/GPU control-plane coordination even when the data plane lives on the GPU. Polars in theory moves to lazy scheduling, which could allow deforestation optimizations that fuse work into bigger bulk GPU-side macro steps, but in practice it doesn't get there. Nvidia's efforts to cut Python asyncio costs for multi-tenant etc. flows didn't pan out either. So enabling moving more of this to the GPU is super interesting.
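A toy sketch of why lazy helps here, assuming nothing about the cuDF/Polars internals: with an eager API each step materializes an intermediate (and, on GPU, round-trips the control plane), while a lazy plan can fuse ("deforest") the steps into one pass before executing. The `Plan` type and its methods below are made up for illustration, not any real library's API.

```rust
// Eager: each step materializes a result before the next one runs.
fn eager(v: Vec<i64>) -> Vec<i64> {
    let a: Vec<i64> = v.into_iter().map(|x| x + 1).collect(); // step 1 materializes
    a.into_iter().map(|x| x * 2).collect() // step 2 materializes
}

// Lazy: record the plan first, then run all steps fused in one traversal,
// i.e. one "macro step" instead of one launch per subexpression.
struct Plan {
    ops: Vec<fn(i64) -> i64>,
}

impl Plan {
    fn new() -> Self {
        Plan { ops: Vec::new() }
    }
    fn map(mut self, f: fn(i64) -> i64) -> Self {
        self.ops.push(f);
        self
    }
    // One fused pass: no intermediate vectors between the steps.
    fn collect(self, v: Vec<i64>) -> Vec<i64> {
        v.into_iter()
            .map(|x| self.ops.iter().fold(x, |acc, f| f(acc)))
            .collect()
    }
}

fn main() {
    let lazy = Plan::new().map(|x| x + 1).map(|x| x * 2).collect(vec![1, 2, 3]);
    assert_eq!(lazy, eager(vec![1, 2, 3]));
    assert_eq!(lazy, vec![4, 6, 8]);
    println!("ok");
}
```

The same shape is why deferring execution matters on GPU: the planner sees the whole pipeline, so the per-step CPU<>GPU coordination can collapse into a single bulk launch.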
Will be watching!
Re: heterogeneous workloads: I'm told by a friend in HPC that the old advice about avoiding divergent branches within warps is no longer much of an issue – is that true?
GPU-wide memory is not quite as scarce on datacenter cards or systems with unified memory. One could also have local executors with local futures that are `!Send` and place in a faster address space.
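To make the `!Send` point concrete, here is a minimal sketch of a local executor using only the standard library: a no-op waker plus a poll loop that happily drives a future holding an `Rc` (which makes the future `!Send`). The names `noop_waker` and `block_on_local` are illustrative, not from any project mentioned in the thread.

```rust
use std::future::Future;
use std::pin::pin;
use std::rc::Rc;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// A no-op waker: good enough for an executor that just polls in a loop.
fn noop_waker() -> Waker {
    fn clone(p: *const ()) -> RawWaker {
        RawWaker::new(p, &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

// A local executor: accepts futures that are not Send, since everything
// stays on one executor (here one thread; on a GPU, one block/SM and
// its faster local address space).
fn block_on_local<F: Future>(fut: F) -> F::Output {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => return out,
            Poll::Pending => {} // a real executor would park or yield here
        }
    }
}

fn main() {
    // Rc is !Send, so this future is !Send; a work-stealing executor
    // would reject it at compile time, but a local one is fine with it.
    let shared = Rc::new(41);
    let answer = block_on_local(async move { *shared + 1 });
    assert_eq!(answer, 42);
    println!("{answer}");
}
```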
Training pipelines are full of data-preparation steps that are first written for the CPU and then moved to the GPU, always weighing what to keep on the CPU and what to put on the GPU, when it's worth creating a tensor, or whether to tile instead. I guess your company is betting on solving problems like this (and async/await is needed for serving inference requests directly on the GPU, for example).
My question is a little bit different: how do you want to handle the SIMD question? Should a Rust function run on the warp as a machine with 32-wide arrays as its data types, or should we always "hope" for autovectorization to work (especially with Rust's iterator helpers)?
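The two styles in question can be sketched on the CPU; the function names and the fixed 32-lane chunk are illustrative only. The first writes the warp width into the types; the second leans on the compiler, which tends to autovectorize simple element-wise zips like this well, while gathers, branches, and data-dependent loops often defeat it.

```rust
// Warp-width explicit style: operate on fixed 32-wide chunks,
// mirroring one warp's lanes.
const WARP: usize = 32;

fn saxpy_warp_style(a: f32, x: &[f32; WARP], y: &mut [f32; WARP]) {
    for lane in 0..WARP {
        y[lane] = a * x[lane] + y[lane];
    }
}

// Iterator style: rely on the compiler to autovectorize a plain
// element-wise zip. No lane width appears in the signature.
fn saxpy_iter_style(a: f32, x: &[f32], y: &mut [f32]) {
    for (yi, xi) in y.iter_mut().zip(x) {
        *yi = a * *xi + *yi;
    }
}

fn main() {
    let x = [1.0f32; WARP];
    let mut y = [2.0f32; WARP];
    saxpy_warp_style(3.0, &x, &mut y);
    assert!(y.iter().all(|&v| v == 5.0));

    let xs = vec![1.0f32; 100];
    let mut ys = vec![2.0f32; 100];
    saxpy_iter_style(3.0, &xs, &mut ys);
    assert!(ys.iter().all(|&v| v == 5.0));
    println!("ok");
}
```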
The anticipated benefits are similar to the benefits of async/await on CPU: better ergonomics for the developer writing concurrent code, better utilization of shared/limited resources, fewer concurrency bugs.
GPUs are still not practically-Turing-complete in the sense that there are strict restrictions on loops/goto/IO/waiting (there are a bunch of band-aids to make it pretend it's not a functional programming model).
So I am not sure retrofitting a Ferrari to cosplay an Amazon delivery van is useful other than for tech showcase?
Good tech showcase though :)
I understand that with newer GPUs you have clever partitioning/pipelining such that block A takes branch A while block B takes branch B, with syncs/barriers, essentially relying on some smart 'oracle' to schedule these in a way that still fits the SIMT model.
It still doesn't feel Turing complete to me. Is there an Nvidia doc you can refer me to?
> In SIMT, all threads in the warp are executing the same kernel code, but each thread may follow different branches through the code. That is, though all threads of the program execute the same code, threads do not need to follow the same execution path.
This doesn't say anything about dependencies of multiple warps.
I am just saying it's not as flexible/cost-free as it would be on a 'normal' von Neumann-style CPU.
I would love to see Rust-based code that obviates the need to write CUDA kernels (including compiling to different architectures). It feels icky to use/introduce things like async/await in the context of a GPU programming model which is very different from a traditional Rust programming model.
You still have to worry about different architectures and the streaming nature at the end of the day.
I am very interested in this topic, so I am curious to learn how the latest GPUs help manage this divergence problem.
Here with the async/await approach, it seems like there needs to be manual bookkeeping at runtime to know what has finished and what has not, and _then_ a decision about which warp to put this new computation in. Do you anticipate a measurable performance difference?
https://devblogs.microsoft.com/dotnet/bing-on-dotnet-8-the-i...
You mention futures are cooperative and GPUs lack interrupts, but GPU warps already have a hardware scheduler that preempts at the instruction level. Are you intentionally working above that layer, or do you see a path to a future executor that hooks into warp scheduling more directly to get preemptive-like behavior?
In years prior I wouldn't have even bothered, but it's 2026 and AMD's drivers actually come with a recent version of torch that 'just works' on windows. Anything is possible :)
(Beyond that, "executing the same code" on multiple instances of a single coroutine ought to be sometimes possible on an opportunistic basis.)
I assume tokio-like, i.e. work-stealing?
I hope they can minimize the bookkeeping costs, because I don't see it gaining traction in AI if it hurts big-kernel performance.
Is the goal with this project (generally, not specifically async) to have an equivalent to e.g. CUDA, but in Rust? Or is there another intended use-case that I'm missing?
I am, bluntly, sick of Async taking over Rust ecosystems. Embedded and web/HTTP have already fallen. I'm optimistic this won't take hold in GPU; we'll see. Async splits the ecosystem. I see it as the biggest threat to Rust staying a useful tool.
I use Rust on the GPU for the following: 3D graphics via WGPU, cuFFT via FFI, custom kernels via Cudarc, and ML via Burn and Candle. Thankfully these are all Async-free.
> Async splits the ecosystem. I see it as the biggest threat to Rust staying a useful tool.
Someone somewhere convinced you there is an async coloring problem. That person was wrong: async is an inherent property of some operations. Adding it as a type-level construct gives visibility to those inherent behaviors, and with that, more freedom in how you compose them.
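One way to see the "visibility" point in code: an `async fn`'s signature tells the caller it may suspend, and the returned future is an ordinary value the caller composes before choosing how to run it. The function names are invented for illustration, and the tiny poll-loop driver stands in for a real runtime.

```rust
use std::future::Future;
use std::pin::pin;
use std::task::{Context, Poll, Waker};

// The signature makes the latency explicit: callers see a future and
// decide how to compose and run it, rather than blocking implicitly.
async fn fetch_len(payload: &str) -> usize {
    payload.len() // stand-in for an actual awaited I/O call
}

async fn total(a: &str, b: &str) -> usize {
    // Both futures are plain values until awaited, so ordering and
    // concurrency are the caller's choice, visible in the types.
    fetch_len(a).await + fetch_len(b).await
}

// Minimal driver so the example runs without an async runtime.
fn run<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let mut cx = Context::from_waker(Waker::noop());
    loop {
        if let Poll::Ready(v) = fut.as_mut().poll(&mut cx) {
            return v;
        }
    }
}

fn main() {
    assert_eq!(run(total("ab", "cde")), 5);
    println!("ok");
}
```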
flip the colouring problem on its head
Our code looks like pure pandas (fancier SQL) wrapped as HTTP service (arrow instead of json), so the expressivity is more of a step backwards. We already did the work of turning awkward irregular code into relational pipelines that GPUs love.
Our problems are:
- Multi-tenancy. Our users get to time-share GPUs, so when many GPU tasks, big and small, come in, we want them co-scheduled across the many GPUs and their many cores. GPUs are already more cost-effective per watt than CPUs, but we think we can get 2x+ here, which is significant.
- Constant overheads. One job can be deep, with many operations, so round-tripping each control-plane step (think each SQL subexpression) between CPU and GPU is silly and adds up. Small jobs are dominated by embarrassing overheads that preclude certain use cases. We are thinking of adding CPU hot paths to avoid this, but we'd rather just fix the GPU path.