Ask HN: How to learn CUDA to professional level
163 points
10 hours ago
| 25 comments
Hi all, I was wondering what books/courses/projects one might do to learn CUDA programming.

(To be frank, the main reason is that a lot of companies I'd wish to work for require CUDA experience -- this shouldn't change your answers, hopefully; just wanted to provide some context.)

indianmouse
5 hours ago
[-]
As a very early CUDA programmer who participated in the CUDA contest NVidia ran in 2008, with what I believe was one of the only entries submitted from India (I'm not claiming that for certain), and who received a consolation/participation prize of a Black Edition card, I can vouch for the method I followed.

- Look up the CUDA Programming Guide from NVidia

- CUDA programming books from NVidia, via the developer.nvidia.com/cuda-books-archive link

- Start creating small programs based on the existing implementations (A strong C implementation knowledge is required. So, brush up if needed.)

- Install the required toolchains and compilers; I am assuming you have the necessary hardware to play around

- Find GitHub projects that use CUDA. Read the code; nowadays you can use an LLM to explain the code in whatever way you need

- Start creating small, yet parallel, programs of your own (see the sketch below)

And in about a month or two, you should have enough to start writing CUDA programs.
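
For the "start creating small programs" step, here is a minimal sketch of what a first kernel can look like (assuming the CUDA Toolkit is installed; compile with nvcc; the names are just for illustration):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Each thread handles one element: c[i] = a[i] + b[i].
    __global__ void vecAdd(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        float *a, *b, *c;
        // Unified memory keeps a first example short; explicit
        // cudaMalloc/cudaMemcpy is the next thing to learn.
        cudaMallocManaged(&a, n * sizeof(float));
        cudaMallocManaged(&b, n * sizeof(float));
        cudaMallocManaged(&c, n * sizeof(float));
        for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        vecAdd<<<blocks, threads>>>(a, b, c, n);
        cudaDeviceSynchronize();

        printf("c[0] = %f\n", c[0]);  // expect 3.0
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }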

I'm not aware of the skill / experience level you have, but whatever it might be, there are many more sources and resources available now than there were in 2007/08.

Create a 6-8 week study plan and you should be flying soon!

Hope it helps.

Feel free to comment and I will share whatever I can to guide you.

reply
edge17
58 minutes ago
[-]
What environment do you use? Is it still the case that Windows is the main development environment for cuda?
reply
hiq
5 hours ago
[-]
> I am assuming you have the necessary hardware to play around

Can you expand on that? Is it enough to have an Nvidia graphics card that's like 5 years old, or do you need something more specific?

reply
rahimnathwani
5 hours ago
[-]
I'm not a CUDA programmer, but AIUI:

- you will want to install the latest version of CUDA Toolkit (12.9.1)

- each version of CUDA Toolkit requires the card driver to be above a certain version (e.g. toolkit depends on driver version 576 or above)

- older cards often have recent drivers, e.g. the current version of CUDA Toolkit will work with a GTX 1080, as it has a recent (576.x) driver (a version-check sketch follows below)
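
As a quick sanity check of the driver/runtime pairing described above, a small program like this (a sketch using the standard CUDA runtime API) prints both versions:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int driverVer = 0, runtimeVer = 0;
        cudaDriverGetVersion(&driverVer);    // highest CUDA version the installed driver supports
        cudaRuntimeGetVersion(&runtimeVer);  // CUDA runtime version this program was built against
        // Versions are encoded as 1000*major + 10*minor, e.g. 12090 -> 12.9.
        printf("driver supports CUDA %d.%d, runtime is %d.%d\n",
               driverVer / 1000, (driverVer % 1000) / 10,
               runtimeVer / 1000, (runtimeVer % 1000) / 10);
        return 0;
    }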

reply
slt2021
3 hours ago
[-]
each nVidia GPU has a certain Compute Capability (https://developer.nvidia.com/cuda-gpus).

Depending on the model and age of your GPU, it will have a certain capability that will be the hard ceiling for what you can program using CUDA
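
If you are unsure what your card supports, you can also query the compute capability at runtime; a minimal sketch using the standard runtime API:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int d = 0; d < count; ++d) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);
            // prop.major/prop.minor is the compute capability, e.g. 8.6 for an RTX 3080.
            printf("device %d: %s, compute capability %d.%d\n",
                   d, prop.name, prop.major, prop.minor);
        }
        return 0;
    }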

reply
dpe82
2 hours ago
[-]
When you're just getting started and learning, that won't matter though. Any Nvidia card from the last 10 years should be fine.
reply
throwaway81523
9 hours ago
[-]
I looked at the CUDA code for Leela Chess Zero and found it pretty understandable, though that was back when Leela used a DCNN instead of transformers. DCNNs are fairly simple and are explained in the fast.ai videos that I watched a few years ago, so navigating the Leela code wasn't too difficult. Transformers are more complicated and I want to bone up on them, but I haven't managed to spend any time understanding them.

CUDA itself is just a minor departure from C++, so the language itself is no big deal if you've used C++ before. But, if you're trying to get hired programming CUDA, what that really means is they want you implementing AI stuff (unless it's game dev). AI programming is a much wider and deeper subject than CUDA itself, so be ready to spend a bunch of time studying and hacking to come up to speed in that. But if you do, you will be in high demand. As mentioned, the fast.ai videos are a great introduction.

In the case of games, that means 3D graphics which these days is another rabbit hole. I knew a bit about this back in the day, but it is fantastically more sophisticated now and I don't have any idea where to even start.

reply
robotnikman
5 minutes ago
[-]
>But if you do, you will be in high demand

So I'm guessing that finding a job as a CUDA programmer is nowhere near as big of a headache as other software engineering jobs right now? I'm thinking learning CUDA and more about AI might be a good pivot from my current position as a Java middleware developer.

reply
upmind
9 hours ago
[-]
This is a great idea! This is the code, right? https://github.com/leela-zero/leela-zero

I have two beginner (and probably very dumb) questions. First, why do they have heavy C++/CUDA usage rather than using only PyTorch/TensorFlow? Are those too slow for training Leela? Second, why is there TensorFlow code?

reply
henrikf
5 hours ago
[-]
That's Leela Zero (it plays Go instead of chess). It was good for its time (~2018), but it's quite outdated now. It also uses OpenCL instead of CUDA. I wrote a lot of that code, including the Winograd convolution routines.

Leela Chess Zero (https://github.com/LeelaChessZero/lc0) has a much more optimized CUDA backend targeting modern GPU architectures, and it's written by people much more knowledgeable than me. That would be a much better source to learn from.

reply
throwaway81523
8 hours ago
[-]
As I remember, the CUDA code was about 3x faster than the TensorFlow code. The TensorFlow stuff is there for non-Nvidia GPUs. This was in the era of the GTX 1080 or 2080. No idea about now.
reply
upmind
8 hours ago
[-]
Ah I see, thanks a lot!
reply
imjonse
9 hours ago
[-]
These should keep you busy for months:

- https://www.gpumode.com/ (resources and Discord community)

- Book: Programming Massively Parallel Processors

- Nvidia's CUDA docs are very comprehensive too

- https://github.com/srush/GPU-Puzzles

reply
mdaniel
2 hours ago
[-]
reply
amelius
7 hours ago
[-]
This follows a "winner takes all" scenario. I see the differences between the submissions are not so large, often smaller than 1%. Kind of pointless to work on this, if you ask me.
reply
lokimedes
9 hours ago
[-]
There are a couple of “concerns” you may separate to make this a bit more tractable:

1. Learning CUDA - the framework, libraries and high-layer wrappers. This is something that changes with times and trends.

2. Learning high-performance computing approaches. While a GPU and the Nvlink interfaces are Nvidia specific, working in a massively-parallel distributed computing environment is a general branch of knowledge that is translatable across HPC architectures.

3. Application specifics. If your thing is Transformers, you may just as well start from Torch, Tensorflow, etc. and rely on the current high-level abstractions, to inspire your learning down to the fundamentals.

I’m no longer active in any of the above, so I can’t be more specific, but if you want to master CUDA, I would say that learning how massively parallel programming works is the foundation that may translate into transferable skills.

reply
rramadass
8 hours ago
[-]
This is the right approach. Without (2), trying to learn (1) will just lead to "confusion worse confounded". I also suggest a book recommendation here - https://news.ycombinator.com/item?id=44216478
reply
jonas21
4 hours ago
[-]
I think it depends on your learning style. For me, learning something with a concrete implementation and code that you can play around with is a lot easier than trying to study the abstract general concepts first. Once you have some experience with the code, you start asking why things are done a certain way, and that naturally leads to the more general concepts.
reply
lokimedes
7 hours ago
[-]
This one was my go-to for HPC, but it may be a bit dated by now: https://www.amazon.com/Introduction-Performance-Computing-Sc...
reply
rramadass
6 hours ago
[-]
That's a good book too (I have it), but it is more general than the Ridgway Scott book, which uses examples from numerical computation domains. Here is an overview of the chapters; the example domains start from chapter 10 onwards - https://www.jstor.org/stable/j.ctv1ddcxfs

These sorts of books are only "dated" when it comes to specific languages/frameworks/libraries. The methods/techniques are evergreen and often conceptually better explained in these older books.

For recent, up-to-date works on HPC, the free multi-volume The Art of High Performance Computing by Victor Eijkhout can't be beat - https://news.ycombinator.com/item?id=38815334

reply
elashri
9 hours ago
[-]
I will share my personal experience learning CUDA, which might be helpful.

Disclaimer: I don't claim that this is actually a systematic way to learn it, and it is more oriented toward academic work.

I got assigned to a project that required learning CUDA as part of my PhD. There was no one in my research group who had any experience with CUDA. I started with the standard NVIDIA courses (Getting Started with Accelerated Computing with CUDA C/C++; there is a Python version too).

This gave me a good introduction to the concepts and basic ideas, but I think after that I did most of my learning by trial and error. I tried a couple of online tutorials for specific things, and some books, but there was always a deprecated function here or there, or an API change that made things obsolete. Or things simply changed for your GPU, and now you have to be careful, because you might be using a GPU version that is not compatible with what I develop for in production, and you need things to work for both.

I think learning CUDA, for me, was an endeavor of pain, of going through "compute-sanitizer" and Nsight, because you will find that most of your time goes into debugging why things are running slower than you expect.

Take things slowly. Take a simple project that you know how to do without CUDA, then port it to CUDA, benchmark it against the CPU, and try to optimize different aspects of it (a timing sketch follows below).

One piece of advice that can be helpful: do not think about optimization at the beginning. Start with correct, then optimize. A working slow kernel beats a fast kernel that corrupts memory.
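
To illustrate the "port it and benchmark it" advice, here is a rough sketch of timing the same SAXPY loop on the CPU with std::chrono and on the GPU with CUDA events (the kernel and sizes are just examples; note that with managed memory the first launch also pays for page migration):

    #include <chrono>
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    void saxpyCpu(int n, float a, const float* x, float* y) {
        for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 24;
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        // CPU timing with std::chrono.
        auto t0 = std::chrono::steady_clock::now();
        saxpyCpu(n, 2.0f, x, y);
        auto t1 = std::chrono::steady_clock::now();
        printf("CPU: %.3f ms\n",
               std::chrono::duration<double, std::milli>(t1 - t0).count());

        // GPU timing with CUDA events (measures elapsed time on the device).
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("GPU: %.3f ms\n", ms);

        cudaEventDestroy(start); cudaEventDestroy(stop);
        cudaFree(x); cudaFree(y);
        return 0;
    }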

reply
korbip
7 hours ago
[-]
I can share a similar PhD story (the result being visible here: https://github.com/NX-AI/flashrnn). Back then I didn't find any tutorials that covered anything beyond the basics (which are still important). Once you have understood the principal working mode and architecture of a GPU, I would recommend the following workflow:

1. First create an environment so that you can actually test your kernels against baselines written in a higher-level language.

2. If you don't have an urgent project already, try to improve/re-implement existing problems (MatMul being the first example). Don't get caught up in wanting to implement all size cases. Take an example just to learn a certain functionality, rather than solving the whole problem, if it's just about learning.

3. Write the functionality you want to have in increasing complexity. Write loops first, then parallelize these loops over the grid (see the sketch below). Use global memory first, then put things into shared memory and registers. Use plain matrix multiplication first, then use mma (TensorCore) primitives to speed things up.

4. Iterate over the CUDA C Programming Guide. It covers all (most) of the functionality that you want to learn - but can't just be read and memorized. When you apply it, you learn it.

5. It might depend on your use-case, but also consider using higher-level abstractions like CUTLASS or ThunderKittens. Also, if your environment is jax/torch, use Triton first before going to CUDA level.
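
A minimal sketch of step 3's "write loops first, then parallelize these loops over the grid" (names are illustrative; the grid-stride form lets one launch configuration cover any n):

    #include <cuda_runtime.h>

    // CPU version: a plain loop over the data.
    void scaleCpu(const float* in, float* out, int n, float alpha) {
        for (int i = 0; i < n; ++i) out[i] = alpha * in[i];
    }

    // GPU version: the same loop, parallelized over the grid.
    __global__ void scaleKernel(const float* in, float* out, int n, float alpha) {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += blockDim.x * gridDim.x) {      // grid-stride step
            out[i] = alpha * in[i];
        }
    }

    int main() {
        const int n = 1 << 20;
        float *in, *out;
        cudaMallocManaged(&in, n * sizeof(float));
        cudaMallocManaged(&out, n * sizeof(float));
        for (int i = 0; i < n; ++i) in[i] = 1.0f;

        // A fixed number of blocks is fine: the grid-stride loop covers the rest.
        scaleKernel<<<128, 256>>>(in, out, n, 3.0f);
        cudaDeviceSynchronize();

        cudaFree(in); cudaFree(out);
        return 0;
    }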

Overall, it will involve some pain for sure. And mastering it, including PTX etc., will take a lot of time.

reply
kevmo314
8 hours ago
[-]
> I think learning CUDA, for me, was an endeavor of pain, of going through "compute-sanitizer" and Nsight, because you will find that most of your time goes into debugging why things are running slower than you expect.

This is so true it hurts.

reply
sputknick
6 hours ago
[-]
I used this to teach high school students. Probably not sufficient to get what you want, but it should get you off the ground and you can run from there. https://youtu.be/86FAWCzIe_4?si=buqdqREWASNPbMQy
reply
alecco
2 hours ago
[-]
Ignore everybody else. Start with CUDA Thrust. Study their examples carefully. See how other projects use Thrust. After a year or two, go deeper into CUB.

Do not implement algorithms by hand. On recent architectures it is extremely hard to reach decent occupancy and the like. Thrust and CUB solve 80% of the cases with reasonable trade-offs, and they do most of the work for you.

https://developer.nvidia.com/thrust
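
A hedged sketch of the kind of thing Thrust gives you out of the box (a sort plus a reduction, no hand-written kernels; the data here is made up):

    #include <cstdio>
    #include <cstdlib>
    #include <thrust/device_vector.h>
    #include <thrust/host_vector.h>
    #include <thrust/sort.h>
    #include <thrust/reduce.h>
    #include <thrust/functional.h>

    int main() {
        // Fill a host vector, then copy it to the device.
        thrust::host_vector<int> h(1 << 20);
        for (size_t i = 0; i < h.size(); ++i) h[i] = rand() % 1000;
        thrust::device_vector<int> d = h;

        // Sort and reduce on the GPU without writing a single kernel.
        thrust::sort(d.begin(), d.end());
        int sum = thrust::reduce(d.begin(), d.end(), 0, thrust::plus<int>());
        int mx  = d.back();   // largest element after the sort

        printf("sum = %d, max = %d\n", sum, mx);
        return 0;
    }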

reply
bee_rider
4 minutes ago
[-]
It looks quite nice just from skimming the link.

But, I don’t understand the comparison to TBB. Do they have a version of TBB that runs on the GPU natively? If the TBB implementation is on the CPU… that’s just comparing two different pieces of hardware. Which would be confusing, bordering on dishonest.

reply
mekpro
5 hours ago
[-]
To professionals in the field, I have a question: what jobs, positions, and companies are in need of CUDA engineers? My current understanding is that while many companies use CUDA's by-products (like PyTorch), direct CUDA development seems less prevalent. I'm therefore seeking to identify more companies and roles that heavily rely on CUDA.
reply
kloop
5 hours ago
[-]
My team uses it for geospatial data. We rasterize slippy map tiles and then do a raster summary on the gpu.

It's a weird case, but the pixels can be processed independently for most of it, so it works pretty well. Then the rows can be summarized in parallel and rolled up at the end. The copy onto the gpu is our current bottleneck however.

reply
ForgotIdAgain
9 hours ago
[-]
I have not tried it yet, but it seems nice: https://leetgpu.com/
reply
fifilura
3 hours ago
[-]
I am not a CUDA programmer, but looking at this, I think I can see the parallels to Spark and SQL:

https://gfxcourses.stanford.edu/cs149/fall24/lecture/datapar...

So my tip would be: start getting used to programming without using for loops.

reply
SoftTalker
5 hours ago
[-]
It's 2025. Get with the times, ask Claude to do it, and then ask it to explain it to you as if you're an engineer who needs to convince a hiring manager that you understand it.
reply
rakel_rakel
2 hours ago
[-]
It might work in 2025; 2026 will demand more.
reply
gdubs
3 hours ago
[-]
I like to learn through projects, and as a graphics guy I love the GPU Gems series. Things like:

https://developer.nvidia.com/gpugems/gpugems3/part-v-physics...

As an Apple platforms developer I actually worked through those books to figure out how to convert the CUDA stuff to Metal, which helped the material click even more.

Part of why I did it was – and this was some years back – I wanted to sharpen my thinking around parallel approaches to problem solving, given how central those algorithms and ways of thinking are to things like ML and not just game development, etc.

reply
tkuraku
6 hours ago
[-]
I think you just pick a problem you want to solve with GPU programming and go for it, learning what you need along the way. Nvidia blog posts are great for that, such as https://devblogs.nvidia.com/cuda-pro-tip-write-flexible-kern...
reply
math_dandy
5 hours ago
[-]
Are there any GPU emulators you can use to run simple CUDA programs on commodity laptops, just to get comfortable with the mechanics, the toolchain, etc.?
reply
corysama
4 hours ago
[-]
https://leetgpu.com/ emulates running simple CUDA programs in a web page with zero setup. It’s a good way to get your toes wet.
reply
gkbrk
5 hours ago
[-]
Commodity laptops can just use regular non-emulated CUDA if they have an Nvidia GPU. It's not just for datacenter GPUs, a ton of regular consumer GPUs are also supported.
reply
bee_rider
2 minutes ago
[-]
A commodity laptop doesn’t have a discrete GPU these days; iGPUs are good enough for basic tasks.
reply
matt3210
2 hours ago
[-]
Just make cool stuff. Find people to code review. I learn way more during code reviews than anything else.
reply
canyp
3 hours ago
[-]
My 2 cents: "learning CUDA" is not the interesting bit. Rather, you want to learn two things: 1) GPU hardware architecture, and 2) parallelizing algorithms. For CUDA specifically, there is the CUDA Programming Guide from Nvidia, which will teach you the basics of the language. But what these jobs typically require is that you know how to parallelize an algorithm and squeeze the most out of the hardware.
reply
weinzierl
8 hours ago
[-]
Nvidia itself has a paid course series. It is a bit older, but I believe it is still relevant. I have bought it but not started it yet. I intend to do so during the summer holidays.
reply
majke
7 hours ago
[-]
I had a bit of limited exposure to CUDA. It was before the AI boom, during Covid.

I found it easy to start. Then there was a pretty nice learning curve to get to warps, SMs, and basic concepts. Then I was able to dig deeper into the integer opcodes, which was super cool. I was able to optimize the compute part pretty well, without many roadblocks.

However, getting memory loads perfect, and then getting closer to the hardware (warp groups, divergence, the L2 cache split thing, scheduling), was pretty hard.

I'd say CUDA is pretty nice/fun to start with, and it's possible for a novice programmer to get quite far. However, getting deeper and achieving a real advantage over the CPU is hard.

Additionally, there is a problem with Nvidia segmenting the market - some opcodes are present only in _old_ GPUs (a CUDA arch is _not_ forwards compatible). Some opcodes are reserved for "AI" chips (like the H100). So getting code that is fast on both an H100 and an RTX 5090 is super hard. Add to that the fact that each card has a different SM count, memory capacity, and bandwidth... and you end up with an impossible compatibility matrix.

TLDR: The beginnings are nice and fun. You can get quite far on the compute-optimization part. But getting compatibility across different chips, and getting memory access right, is hard. When you start, choose a specific problem, a specific chip, and a specific instruction set.
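
One concrete way that segmentation shows up in code is compile-time branching on compute capability; a sketch (the tile sizes and split points here are made up purely for illustration):

    #include <cstdio>
    #include <cuda_runtime.h>

    // __CUDA_ARCH__ is defined only while compiling device code, set to the
    // target architecture's value, so device code can branch per GPU generation
    // (the binary is built once per target, e.g. via nvcc -gencode).
    __global__ void reportTile() {
    #if __CUDA_ARCH__ >= 900        // Hopper-class parts (e.g. H100)
        const int tile = 128;
    #elif __CUDA_ARCH__ >= 800      // Ampere-class parts (e.g. A100, RTX 30xx)
        const int tile = 64;
    #else                           // older architectures
        const int tile = 32;
    #endif
        if (threadIdx.x == 0) printf("tile size for this arch: %d\n", tile);
    }

    int main() {
        reportTile<<<1, 32>>>();
        cudaDeviceSynchronize();
        return 0;
    }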

reply
epirogov
8 hours ago
[-]
I bought a P106-90 for $20 and started porting my data apps to parallel processing with it.
reply
rramadass
8 hours ago
[-]
CUDA GPGPU programming was invented to solve certain classes of parallel problems. So studying these problems will give you greater insight into CUDA based parallel programming. I suggest reading the following old book along with your CUDA resources.

Scientific Parallel Computing by L. Ridgway Scott et al. - https://press.princeton.edu/books/hardcover/9780691119359/sc...

reply
Onavo
9 hours ago
[-]
Assuming you are asking this because of the deep learning/ChatGPT hype, the first question you should ask yourself is, do you really need to? The skills needed for CUDA are completely unrelated to building machine learning models. It's like learning to make a TLS library so you can get a full stack web development job. The skills are completely orthogonal. CUDA belongs to the domain of game developers, graphics people, high performance computing and computer engineers (hardware). From the point of view of machine learning development and research, it's nothing more than an implementation detail.

Make sure you are very clear on what you want. Most HR departments cast a wide net (it's like how every junior role requires "3-5 years of experience" when in reality they don't really care). Similarly when hiring, most companies pray for the unicorn developer who can understand the entire stack from the GPU to the end user product domain when the day to day is mostly in Python.

reply
brudgers
1 hour ago
[-]
For better or worse, direct professional experience in a professional setting is the only way to learn anything to a professional level.

That doesn't mean one-eyed-king knowledge is never enough to solve that chicken-and-egg. You only have to be good enough to get the job.

But if you haven't done it on the job, you don't have work experience and you are either lying to others or lying to yourself...and any sophisticated organization won't fall for it...

...except of course, knowingly. And the best way to get someone to knowingly ignore obvious dunning-kruger and/or horseshit is to know that someone personally or professionally.

Which is to say that the best way to get a good job is to have a good relationship with someone who can hire you for a good job (nepotism trumps technical ability, always). And the best way to find a good job is to know a lot of people who want to work with you.

To put it another way, looking for a job is the only way to find a job, and looking for a job is also much, much harder than everything that avoids looking for a job (like studying CUDA) while pretending to be preparation...because, again, studying CUDA won't ever give you professional experience.

Don't get me wrong, there's nothing wrong with learning CUDA all on your own. But it is not professional experience and it is not looking for a job doing CUDA.

Finally, if you want to learn CUDA just learn it for its own sake without worrying about a job. Learning things for their own sake is the nature of learning once you get out of school.

Good luck.

reply
sremani
4 hours ago
[-]
The book - PMPP - Programming Massively Parallel Processors

The YouTube channel - CUDA_MODE - is based on PMPP. I could not find the channel, but here is the playlist: https://www.youtube.com/watch?v=LuhJEEJQgUM&list=PLVEjdmwEDk...

Once done, you will be on a solid foundation.

reply
dist-epoch
9 hours ago
[-]
As they typically say: Just Do It (tm).

Start writing some CUDA code to sort an array or find the maximum element (a sketch of the latter follows below).
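
For the "find the maximum element" suggestion, a sketch of a classic first attempt (a block-level reduction in shared memory, combined across blocks with atomicMax; the sizes and data are arbitrary):

    #include <cstdio>
    #include <climits>
    #include <cuda_runtime.h>

    // Each block reduces its chunk to one value in shared memory,
    // then one atomicMax per block combines the partial results.
    __global__ void maxKernel(const int* data, int n, int* result) {
        __shared__ int smem[256];
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        smem[tid] = (i < n) ? data[i] : INT_MIN;
        __syncthreads();

        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride) smem[tid] = max(smem[tid], smem[tid + stride]);
            __syncthreads();
        }
        if (tid == 0) atomicMax(result, smem[0]);
    }

    int main() {
        const int n = 1 << 20;
        int *data, *result;
        cudaMallocManaged(&data, n * sizeof(int));
        cudaMallocManaged(&result, sizeof(int));
        for (int i = 0; i < n; ++i) data[i] = i % 1000;
        data[12345] = 999999;            // plant a known maximum
        *result = INT_MIN;

        maxKernel<<<(n + 255) / 256, 256>>>(data, n, result);
        cudaDeviceSynchronize();
        printf("max = %d\n", *result);   // expect 999999

        cudaFree(data); cudaFree(result);
        return 0;
    }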

reply
the__alchemist
7 hours ago
[-]
I concur with this. Then supplement with resources as required. Ideally, find some tasks in your programs that are parallelizable (learning what these are is important too!) and switch them to CUDA. If you don't have any, make a toy case, e.g. an n-body simulation.
reply
amelius
9 hours ago
[-]
I'd rather learn to use a library that works on any brand of GPU.

If that is not an option, I'll wait!

reply
latchkey
5 hours ago
[-]
Then learn PyTorch.

The hardware between brands is fundamentally different. There isn't a standard like x86 for CPUs.

So, while you may use something like HIPIFY to translate your code between APIs, at least with GPU programming, it makes sense to learn how they differ from each other or just pick one of them and work with it knowing that the others will just be some variation of the same idea.

reply
horsellama
3 hours ago
[-]
The jobs requiring CUDA experience are, most of the time, there because Torch is not good enough.
reply
moralestapia
29 minutes ago
[-]
K, bud.

Perhaps you haven't noticed, but you're in a thread that asked about CUDA, explicitly.

reply
pjmlp
9 hours ago
[-]
If only Khronos and the competition cared about the developer experience....
reply
the__alchemist
7 hours ago
[-]
This is a continual point of frustration! Vulkan compute is... suboptimal. I use CUDA because it feels like the only practical option. I want Vulkan or something else to compete seriously, but until that happens, I will use CUDA.
reply
pjmlp
5 hours ago
[-]
It took until Vulkanised 2025 to acknowledge that Vulkan had become the same mess as OpenGL, and to put a plan into action to try to correct this.

Had it not been for Apple's initial OpenCL contribution (regardless of how it went from there), AMD's Mantle as the starting point for Vulkan, and NVidia's Vulkan-Hpp and Slang, the ecosystem of Khronos standards would be much worse.

Also, Vulkan tooling isn't as bad as OpenGL's, because LunarG exists and someone pays them for the whole Vulkan SDK.

The attitude of "we publish paper standards" and expecting the community to step in with implementations and tooling hardly matches the productivity of the tooling around private APIs.

Also, all GPU vendors, including Intel and AMD, would rather push their own compute APIs, even if they are based on top of Khronos ones.

reply
corysama
4 hours ago
[-]
Is https://github.com/KomputeProject/kompute + https://shader-slang.org/ getting there?

Runs on anything + auto-differentiation.

reply
Cloudef
6 hours ago
[-]
Both Zig and Rust are aiming to compile to GPUs natively. What CUDA and HIP provide is a heterogeneous computing runtime, i.e. hiding the boilerplate of executing code on the CPU and GPU seamlessly.
reply
uecker
4 hours ago
[-]
GCC / clang also have support for offloading.
reply
izharkhan
8 hours ago
[-]
How does one do hacking?
reply