I'm especially curious how build times would compare. Most Rust CUDA crates obviously rely on calling CMake or nvcc, which can make compilation painfully slow. Coincidentally, just last week I was profiling build times and found that tools like sccache can dramatically reduce rebuild times by caching artifacts, but you still end up paying for expensive custom nvcc invocations (e.g. Candle by Hugging Face runs a custom nvcc command in its kernel compilation): https://arpadvoros.com/posts/2026/05/05/speeding-up-rust-whi...
> Most Rust CUDA crates obv rely on calling CMake or nvcc, which can make compilation painfully slow.
I anecdotally haven't hit this; see the `cuda_setup` crate I made to handle the build scripts. It's a simple `build.rs` that only recompiles if the file changes, and the compile time is tiny compared to the Rust CPU-side code.
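For context, a minimal sketch of that kind of `build.rs` (illustrative only, not the actual `cuda_setup` code; the paths and nvcc flags below are assumptions):

```rust
// build.rs — sketch of the pattern described above: re-run only when the
// kernel source changes, then shell out to nvcc to produce PTX.
use std::{env, path::PathBuf, process::Command};

fn main() {
    // Cargo skips this script entirely unless kernels.cu changed.
    println!("cargo:rerun-if-changed=src/kernels.cu");

    let out_dir = PathBuf::from(env::var("OUT_DIR").unwrap());
    let status = Command::new("nvcc")
        .args(["-ptx", "src/kernels.cu", "-o"])
        .arg(out_dir.join("kernels.ptx"))
        .status()
        .expect("failed to run nvcc; is the CUDA toolkit installed?");
    assert!(status.success(), "nvcc returned a non-zero exit code");
}
```

So the nvcc cost is only paid on kernel edits, not on every host-side rebuild.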
That would be amazing, but IMO they're probably not complementary.
I am curious what distinguishes cuda-oxide, beyond it being totally under NV control.
Briefly looking at the repo, it looks like the main workflow uses rustc-codegen-cuda to convert Rust -> MIR -> pliron IR -> LLVM IR -> PTX, which is embedded in the host binary; cuda-core then loads the embedded PTX onto the GPU at runtime.
But if you aren't writing CUDA kernels directly and just want to call existing kernels or access other CUDA driver APIs, then cudarc is the lighter-weight option? Or just use one of the sub-crates in this repo, like cuda-core, for those APIs.
1. Use-after-free: Drop semantics vs. manual cudaFree.
2. Kernel args are checked via `cuda_launch!`, whereas in C++ the void* args are just an array of pointers, with only the count validated.
3. Aliased mutable writes: in C++ you can have more than one thread writing out[i] with the same i and it will compile, but DisjointSlice<T> with ThreadIndex doesn't have any public constructor (see: https://github.com/NVlabs/cuda-oxide/blob/2a03dfd9d5f3ecba52...) and is only reachable through the `index_1d`, `index_2d`, and `index_2d_runtime` APIs (a rough sketch of this pattern follows the list).
4. I'm pretty sure you can cudaMemcpy a std::string, or literally any other type, and "corrupt" its state, making it unusable. Here it ONLY accepts DisjointSlice<T>, scalars, and closures (https://nvlabs.github.io/cuda-oxide/gpu-programming/memory-a...)
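To make point 3 concrete, here is a minimal CPU-side sketch of the idea, not cuda-oxide's actual implementation (the type and method names below are hypothetical stand-ins): if the slice wrapper has no public constructor and the only way to reach an element is through a per-thread index, user code can't fabricate two overlapping mutable views.

```rust
// Hypothetical illustration of the "one thread, one element" pattern.

/// A slice wrapper whose only way to get `&mut T` is through a per-thread index.
pub struct DisjointSliceDemo<'a, T> {
    data: &'a mut [T],
}

/// Stand-in for a per-thread index; in a real kernel this would come from
/// blockIdx/threadIdx, so each thread holds a distinct value.
pub struct ThreadIndexDemo(usize);

impl<'a, T> DisjointSliceDemo<'a, T> {
    // No public constructor in the real design; keeping `new` private means
    // user code can never conjure up overlapping views of the same buffer.
    fn new(data: &'a mut [T]) -> Self {
        Self { data }
    }

    /// Each thread can only reach the single element named by its own index.
    pub fn index_1d(&mut self, idx: &ThreadIndexDemo) -> &mut T {
        &mut self.data[idx.0]
    }
}

fn main() {
    let mut out = vec![0u32; 4];
    let mut slice = DisjointSliceDemo::new(&mut out);
    // Pretend we are "thread 2": the only element we can touch is out[2].
    let tid = ThreadIndexDemo(2);
    *slice.index_1d(&tid) = 42;
    println!("{out:?}"); // [0, 0, 42, 0]
}
```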
But most of the nitty-gritty is in these sections:
* https://nvlabs.github.io/cuda-oxide/gpu-safety/the-safety-mo...
* https://nvlabs.github.io/cuda-oxide/gpu-programming/memory-a...
Edit: that being said, it's not like this catches everything; it just looks to give much stronger guardrails against UB than raw .cu files.
From my perspective as someone who writes applications in Rust and sometimes wants to use GPU compute in those applications: I don't care. If we can leverage the memory or ownership model in a low-friction way, that's fine. If it makes for a high-friction experience, I would prefer not to do it that way.
The baseline, IMO, is how cudarc currently does it. I don't think there is much memory management involved; it's just imperative syntax wrapping FFI, plus a few lines in the build script to invoke nvcc when the kernels change.
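For reference, roughly what that imperative, FFI-wrapping style looks like, written from memory against an older cudarc release (~0.11). Newer versions have reorganized these APIs, so treat the exact names and signatures as assumptions:

```rust
// Rough sketch of cudarc-style usage: compile a kernel with NVRTC, copy data
// over, launch, copy back. Error handling is elided via unwrap/expect.
use cudarc::driver::{CudaDevice, LaunchAsync, LaunchConfig};
use cudarc::nvrtc::compile_ptx;

fn main() {
    let dev = CudaDevice::new(0).expect("no CUDA device");

    let ptx = compile_ptx(
        r#"extern "C" __global__ void scale(float *out, const float *inp, float k, size_t n) {
            size_t i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) { out[i] = inp[i] * k; }
        }"#,
    )
    .expect("nvrtc compile failed");
    dev.load_ptx(ptx, "module", &["scale"]).unwrap();
    let kernel = dev.get_func("module", "scale").unwrap();

    let inp = dev.htod_copy(vec![1.0f32, 2.0, 3.0, 4.0]).unwrap();
    let mut out = dev.alloc_zeros::<f32>(4).unwrap();

    let cfg = LaunchConfig::for_num_elems(4);
    // Launch is `unsafe`: argument types/count are not checked against the kernel.
    unsafe { kernel.launch(cfg, (&mut out, &inp, 2.0f32, 4usize)) }.unwrap();

    let host_out: Vec<f32> = dev.dtoh_sync_copy(&out).unwrap();
    println!("{host_out:?}"); // [2.0, 4.0, 6.0, 8.0]
}
```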
(Disclaimer: I like Slang a lot.)
Also, shading languages are more user-friendly, given their features.
Finally, NVIDIA already has Slang in production, and those folks aren't going to rewrite shader pipelines in Rust.
There's library code in Rust that manages GPU memory, schedules pipelines, and uses Slang reflection to ensure memory layouts between Rust and shaders match.
Oh, and it supports Metal/Vulkan/DX12.
Official CUDA port and they couldn't even bother with the introductory paragraph.
Okay, I'll try to ignore it and read the docs. Hey a custom IR, this sounds interesti-
> MLIR’s implementation, however, is C++ with a side of TableGen, a build system that requires you to compile all of LLVM, and debugging sessions that make you question your career choices.
I can't take this industry seriously anymore.
This is exactly the on-brand dog-fooding I would expect from an AI hyper.
> A GPU kernel runs thousands of threads that all see the same memory at the same time. On a CPU, Rust prevents data races through ownership and borrowing – one mutable reference, no aliases, enforced at compile time. On a GPU, you have 2048 threads per SM, all launched from the same function, all pointing at the same output buffer. The borrow checker was not designed for this.
> cuda-oxide solves the problem in layers. The common case – one thread writes one element – is safe by construction, no unsafe required. The uncommon cases – shared memory, warp shuffles, hardware intrinsics – require unsafe with documented contracts. And the frontier cases – TMA, tensor cores, cluster-level communication – are fully manual, matching the complexity of the hardware they control.
That's... not really Rusty. In Rust, we create new safe abstractions when the existing ones don't quite map to the problem at hand. See, for example, what's done in Rust for Linux.
If it's not safe... what's the point of Rust?
(It's okay to offer unsafe APIs for people who need to squeeze out the last bit of performance, but that shouldn't be the baseline.)
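For what it's worth, the idiom being asked for here is the standard one. A minimal, generic sketch (not cuda-oxide code; names are made up) of wrapping an unsafe fast path behind a safe API that re-establishes the invariant:

```rust
// Generic "safe abstraction over unsafe" idiom: the unchecked write is
// `unsafe`, and the public API checks the invariant before calling it.
pub struct Buffer {
    data: Vec<f32>,
}

impl Buffer {
    pub fn new(len: usize) -> Self {
        Self { data: vec![0.0; len] }
    }

    /// # Safety
    /// `i` must be in bounds; the caller is responsible for upholding this.
    pub unsafe fn write_unchecked(&mut self, i: usize, v: f32) {
        // SAFETY: caller guarantees `i < self.data.len()`.
        unsafe { *self.data.get_unchecked_mut(i) = v; }
    }

    /// Safe wrapper: checks the invariant once, then uses the unsafe fast path.
    pub fn write(&mut self, i: usize, v: f32) -> bool {
        if i < self.data.len() {
            // SAFETY: bounds were just checked.
            unsafe { self.write_unchecked(i, v) };
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut buf = Buffer::new(4);
    assert!(buf.write(2, 1.5));
    assert!(!buf.write(9, 1.5)); // out of bounds is rejected, not UB
}
```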
I compare this with userspace libs for APIs like io_uring and Vulkan. Designing safe APIs for that kind of thing is hard (there have even been some unsound attempts).
Does anyone have more details on NVIDIA's use of SPARK/Ada?
All I can find is what's listed below:
https://www.adacore.com/case-studies/nvidia-adoption-of-spar...
All software can come in three editions: Stainless drivers that were never rusty, Oxidized drivers that applied Rust to existing code, and Iron editions, where someone converted the Rust back to C using the new phosphoric tool...
Diversity can be our strength.
Making Iron C/C++ code can be called acid washing if it was rusted.
> Then GPL fans can
Checks out
https://rust-lang.github.io/rust-project-goals/2024h2/Rust-f...
Note that it also discusses `std::offload`, which might be of interest.
Edit: oh, automatic differentiation?
Continuing from this discussion [0], this makes it only a Rust or CUDA problem, rather than a Python, CUDA, and PyTorch one, if there's a bug in one of them.
Yet at the end of the day, it still uses Nvidia's closed-source CUDA compiler, nvcc, which they will never open source. At least Mojo promises to open-source its own compiler, which compiles to different accelerators with multiple backend support.
Unlike this... but it uses Rust.
Also, Linux support for CUDA on laptops, especially with dual-GPU setups, isn't particularly great.
Most workstation-class laptops are Windows-based.
Could maybe be forked with some dynamic smarts; HIP is basically 1:1 with CUDA: https://github.com/amd/amd-lab-notes/blob/release/hipify%2Fs...
Otherwise it isn't 1:1 with CUDA, and I'm not counting everything else in the CUDA ecosystem.
Fortran: https://github.com/ROCm/hipfort
Python (not sure about JIT): https://rocm.docs.amd.com/projects/hip-python/en/latest/
Those people probably did not buy an Nvidia GPU for themselves. It should be common knowledge that the "Open" Nvidia drivers still run gigantic firmware blobs to dispatch complex workloads. And Nouveau is close to useless for GPGPU compute.
1) Higher-level code is easier for LLMs to review and iterate on. The clearer the intent is from the code, the easier it is for humans and LLMs to work with.
2) LLMs sometimes get stuck or fail to solve a problem. It's preferable to have artifacts that humans can grok without the massive extra effort of parsing assembly code.
3) Assembly code varies massively across targets. We want a provable, deterministic transformation from the intent (specified in a higher-level language) to the target assembly language. LLMs can't reliably output artifacts for many different platforms that behave the same.
4) Hopefully, we are still reviewing the code output by LLMs to some extent.
The counter-argument, and one that matches my experience, is that working at a lower level is actually beneficial for LLMs, since they can see the whole picture and don't have to guess at abstractions.
A very big practical reason is also that assembler code would eat context like no other.
1.5) Having a compiler in the loop that enforces things like type constraints (and, in the case of Rust in particular, memory-safety guarantees as a result) is really useful for both humans and LLMs.
It's just a matter of different workflows for different users and applications.
Programming languages are tools for thinking. It's not clear that assembly code has the right abstractions to encourage the kind of thinking that programming large systems requires. After all, human intelligence found assembly insufficient and went on to invent better languages for thinking, why should artificial intelligence, trained on human intelligence, be any different? Maybe AI in the future will have its own languages for thinking, but assembly is likely not that.