How to optimize a CUDA matmul kernel for cuBLAS-like performance (2022)
103 points
1 month ago
| 3 comments
| siboehm.com
aaa370
1 month ago
[-]
Another point to consider here is that this project of writing a cuBLAS-level GEMM kernel becomes much more challenging if you are doing it with fp16, and are thus competing with the cuBLAS kernels that use tensor cores. The (theoretical) arithmetic throughput of tensor cores is ~8x higher than fp32 math on the Turing arch; I don't know off the top of my head, but I think this ratio is the same or greater for Ampere/Hopper tensor cores.
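
For concreteness, here is roughly what the tensor-core path looks like from CUDA C++ via the wmma API: one warp computing a 16x16x16 fp16 tile product with fp32 accumulation. This is a minimal sketch for illustration (the pointers and leading dimensions are placeholders), not code from cuBLAS or the article.

  #include <cuda_fp16.h>
  #include <mma.h>
  using namespace nvcuda;

  // Minimal sketch: one warp multiplies a single 16x16x16 fp16 tile on the
  // tensor cores, accumulating in fp32. lda/ldb/ldc are leading dimensions.
  __global__ void wmma_tile(const half* A, const half* B, float* C,
                            int lda, int ldb, int ldc) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(a, A, lda);
    wmma::load_matrix_sync(b, B, ldb);
    wmma::mma_sync(acc, a, b, acc);                        // tensor-core MMA
    wmma::store_matrix_sync(C, acc, ldc, wmma::mem_row_major);
  }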

This makes the project proportionally harder in my opinion because you need to be that much more efficient with moving data through the memory hierarchy. With tensor cores, to get anywhere close to cuBLAS, you need to start with something like the most efficient kernel in Simon's article, and then do stuff like shared memory swizzling, async global memory copies, double buffering, and writing a really efficient kernel epilogue to accumulate the product into the C matrix.
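
A rough sketch of the double-buffering idea (a toy example of mine, not the article's kernel and not CUTLASS): while the block does math on the shared-memory tile for step k, cp.async copies are already filling the other buffer with the tile for step k+1. BM/BK and the indexing are invented for illustration; a real kernel combines this with swizzling, register tiling, and a proper epilogue.

  #include <cuda_pipeline.h>

  // Toy double-buffered main loop (cp.async is an Ampere+ feature): overlap
  // the asynchronous load of tile k+1 with the math on tile k. The A panel is
  // assumed to be BM rows by (k_tiles * BK) columns; the "math" is a dummy sum.
  template <int BM, int BK>
  __global__ void double_buffered(const float* __restrict__ A,
                                  float* __restrict__ out,
                                  int lda, int k_tiles) {
    __shared__ float smemA[2][BM * BK];

    auto load_tile = [&](int buf, int kt) {
      for (int i = threadIdx.x; i < BM * BK; i += blockDim.x) {
        int row = i / BK, col = i % BK;
        __pipeline_memcpy_async(&smemA[buf][i],
                                &A[(blockIdx.x * BM + row) * lda + kt * BK + col],
                                sizeof(float));
      }
      __pipeline_commit();
    };

    load_tile(0, 0);                                    // prefetch first tile
    float acc = 0.0f;
    for (int kt = 0; kt < k_tiles; ++kt) {
      int cur = kt & 1;
      if (kt + 1 < k_tiles) load_tile(cur ^ 1, kt + 1); // start next load early
      __pipeline_wait_prior(kt + 1 < k_tiles ? 1 : 0);  // wait only for tile kt
      __syncthreads();
      for (int i = 0; i < BK; ++i)                      // stand-in for real MMAs
        acc += smemA[cur][(threadIdx.x % BM) * BK + i];
      __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = acc;
  }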

I came across this article a while ago and it inspired me to take a stab at this^. As of now I have gotten to ~80% of the cuBLAS tensor core performance, where the kernel is mostly compute bound, and I am close to giving up on the last ~20%: I think I may need to write the inner loop in SASS to make sure the instruction mix between shared memory loads, mma instructions, and synchronizations is perfectly balanced so that none of the hardware pipelines get overloaded (see link below), and I have enough compassion for myself not to spend my free time doing stuff like that :). There are also certain things implemented in CUTLASS that seem important (look up serpentine traversal), but NVIDIA engineers won't talk about the hardware details required to understand why this helps.

An article on this is forthcoming.

https://github.com/NervanaSystems/maxas/wiki/SGEMM

reply
gregjm
1 month ago
[-]
Re: serpentine traversal, this has to do with the .reuse suffix applied to register operands as mentioned in your link. We don't really have control over it because it's happening inside of ptxas during SASS generation, but when CUTLASS does serpentine traversal they're suggesting an order of MMA instruction issues that would result in at least one operand being reused from one instruction to the next. Clever stuff.
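
To make that concrete, here is a hedged sketch of the issue order (illustrative tile counts and fragment shapes, not CUTLASS's actual code): within a column the B fragment is unchanged, and reversing the row order on alternate columns means the last issue of one column and the first issue of the next share the same A fragment, so consecutive mma instructions always have an operand that ptxas gets the chance to tag with .reuse.

  #include <cuda_fp16.h>
  #include <mma.h>
  using namespace nvcuda;

  constexpr int M_TILES = 4, N_TILES = 2;   // illustrative register-tile grid

  // Serpentine issue order over the M x N grid of per-warp MMA tiles.
  __device__ void mma_serpentine(
      wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> (&a)[M_TILES],
      wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> (&b)[N_TILES],
      wmma::fragment<wmma::accumulator, 16, 16, 16, float> (&acc)[M_TILES][N_TILES]) {
    for (int n = 0; n < N_TILES; ++n) {
      for (int i = 0; i < M_TILES; ++i) {
        int m = (n & 1) ? (M_TILES - 1 - i) : i;        // flip direction each column
        wmma::mma_sync(acc[m][n], a[m], b[n], acc[m][n]);
      }
    }
  }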

I’d be so happy if SASS were documented and ptxas were open source, sometimes I spend entire days going through whitepapers and various sources of online documentation to get more hardware details…

reply
flakiness
1 month ago
[-]
Note that this is from 2022.

My guess is that people nowadays are gradually moving away from raw CUDA programming and moving towards things like Triton etc, and you won't be focusing on pure GEMM since you tend to do some fusion.

The Triton tutorial claims their performance is on par with cuBLAS.

https://triton-lang.org/main/getting-started/tutorials/03-ma...

reply
almostgotcaught
1 month ago
[-]
> My guess is that people nowadays are gradually moving away from raw CUDA programming and moving towards things like Triton etc

Your guess is wrong. Besides the fact that there's much more to life than matmul (for which Triton is just OK), the other obvious fact is that Triton has exactly one frontend (Python), and there's much more to life than that frontend.

I find that in basically every thread about low-level work there's someone making some weird comment about how Triton or Mojo or XYZ supplants CUDA or assembly or whatever. I can't understand how this comes about, because absolutely no one working in these areas thinks XYZ is going to supplant anything. So it's invariably outsiders making these claims, and I cannot fathom why any outsider would be motivated to make claims from the outside.

reply
flakiness
1 month ago
[-]
Great comment! My wrong comment at least served as a McDonald's theory: https://jonbell.medium.com/mcdonalds-theory-9216e1c9da7d

As an outsider I find CUDA so intimidating that the promise of Triton etc. is very appealing, and I wanted to get sold.

reply
the_svd_doctor
1 month ago
[-]
FWIW Triton can actually be used pretty easily from C++ directly via MLIR. XLA does that. Maybe not exactly a frontend, but not that far off.
reply
almostgotcaught
1 month ago
[-]
> FWIW Triton

I have PRs in Triton - I'm well familiar with the fact that Triton is an MLIR project.

> C++ directly via MLIR

That's like saying LLVM IR is usable through C++ ... or hell, that's like saying NVPTX is usable through C++. It's not just not a frontend, it's the exact opposite: it's emitting IR using IR builders.

reply
machinekob
1 month ago
[-]
But Triton is just an abstraction over CUDA for Python (like CuPy, Numba, etc.). If you need low-level access you will still use CUDA; if you want high-level, you can use Triton or Numba; even higher, you'll just use wrappers like PyTorch/JAX.
reply
imjonse
1 month ago
[-]
It is still an excellent article if you care about how the GPU works. Triton doesn't magically hide all the hardware details.
reply
bee_rider
1 month ago
[-]
I'm sure they are using cool super-modern stuff, but I wonder if Eigen can somehow be made to spit out CUDA code?
reply
lights0123
1 month ago
[-]
Since 2016, you can use Eigen in CUDA C++ code and it just works: https://eigen.tuxfamily.org/dox/TopicCUDA.html
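
For anyone curious what that looks like in practice, a minimal sketch (a toy kernel of mine, compiled with nvcc) using fixed-size Eigen types mapped over device memory:

  #include <Eigen/Dense>

  // Toy kernel: each thread scales one 3-vector using Eigen expressions.
  // Fixed-size Eigen types work in device code when the file is built with nvcc.
  __global__ void scale_vec3(const float* in, float* out, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
      Eigen::Map<const Eigen::Vector3f> v(in + 3 * i);
      Eigen::Map<Eigen::Vector3f> r(out + 3 * i);
      r = s * v;                                 // evaluated per-thread on the GPU
    }
  }
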
reply
bee_rider
1 month ago
[-]
Oh wow, that is cool as heck.
reply
dang
1 month ago
[-]
Year added above. Thanks!
reply
joe_the_user
1 month ago
[-]
In other discussions here, people asserted that a CUDA replacement was unworkable because you couldn't replace Nvidia's cuBLAS implementation. I'm not qualified to say whether this would provide enough info for constructing an adequate replacement, but I'd be interested in people's opinions.
reply
hedgehog
1 month ago
[-]
Yes and no. Read his notes about his goals and what he didn't implement, and note that his job when he wrote that was on a team that ports models to different hardware. That's a smaller job than writing a shippable framework, and Anthropic has a team of people doing it. You can read the cuDNN docs to get an idea of some of the other stuff you'd need; there's a lot there, but generally that's the point. So in one sense, yes: if you have a strong performance engineering team and a specific workload already, you can make use of a variety of hardware. In another sense, no: a small team isn't likely to be able to write a direct alternative for the CUDA ecosystem that reaches devex parity. Lots of people have a pretty clear idea of what it takes to do this, but none of the major vendors seem to be doing the work.

Knowing that reaching broad devex parity is very expensive I think the real win is figuring out what specific problem you have and building community and robust software support around that.

reply
imtringued
1 month ago
[-]
The problem with AMD isn't that they are only hitting 90% of the hardware capabilities vs Nvidia hitting 98%.

It's the fact that AMD doesn't prioritize the reliability of its hardware and software stack. If I run llama.cpp on Vulkan I get a reasonable speedup, but if I raise the batch size to 512, the GPU starts making strange noises and shuts the PC down midway. Very cool. 98% of zero is still zero.

reply
ladberg
1 month ago
[-]
cuBLAS is not really Nvidia's moat: every competitor has a workable BLAS implementation and some are even drop-in replacements for cuBLAS.

In fact cuBLAS and CUDA are kinda orthogonal in that you're either calling a pre-built cuBLAS kernel or writing your own CUDA kernel but not really combining the two.
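
For reference, the "pre-built kernel" path is literally a single library call. A hedged sketch (error handling omitted, handle created inline for brevity):

  #include <cublas_v2.h>

  // Single-precision C = A*B for n x n matrices via cuBLAS. cuBLAS assumes
  // column-major storage; passing B before A is the usual trick to get a
  // row-major product without explicit transposes.
  void sgemm_cublas(const float* dA, const float* dB, float* dC, int n) {
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                 &alpha, dB, CUDA_R_32F, n, dA, CUDA_R_32F, n,
                 &beta, dC, CUDA_R_32F, n,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
    cublasDestroy(handle);
  }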

I'd say CUDA shines more because of stability, documentation, community support + examples, and ability to use modern C++ features in GPU code.

reply
alfalfasprout
1 month ago
[-]
> In other discussion here, people asserted that a CUDA replacement was unworkable because you couldn't replace Nvidia's CuBLAS implementation

Targeting nvidia GPUs? Or in general? For whom?

Building a performant BLAS library is hard but certainly not impossible. The tricks discussed in this post are hardly anything new either. Now, making a BLAS competitive with Nvidia's on its own GPUs is bound to be tough. But it's not technically infeasible (after all, you can drop down to PTX if needed).
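
As a trivial illustration of the "drop down to PTX" escape hatch (a toy example, no performance claim intended): inline asm in a device function.

  // Issue a single fused multiply-add via inline PTX instead of relying on the
  // compiler. Real uses are instructions the C++ level doesn't expose directly
  // (or doesn't schedule the way you want), e.g. cp.async or mma variants.
  __device__ float fma_via_ptx(float a, float b, float c) {
    float d;
    asm volatile("fma.rn.f32 %0, %1, %2, %3;" : "=f"(d) : "f"(a), "f"(b), "f"(c));
    return d;
  }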

reply
deredede
1 month ago
[-]
When I was optimizing GPU kernels a few years back, Nvidia's own kernels were getting those last few percent of performance by making use of hardware-specific features (I remember operand caches being one) that are not available through PTX.
reply
ap4
1 month ago
[-]
Beating CuBLAS is easy. I wrote a 30-line kernel that multiplies two 4096² matrices 14% faster than CuBLAS on a 4090. The question is how to earn money on that.
reply
ladberg
1 month ago
[-]
If this were true (and I highly doubt it) it's obvious how to make money from it: collect a 7 figure paycheck from Nvidia, AMD, or any FAANG.
reply
ap4
1 month ago
[-]
I swapped one of the kernels in the code from the article for my kernel, and left only the multiplication of matrices of size 4096².

On average over 20 runs:

cuBLAS (./sgemm 0) achieves 50.9 TFLOPS.

My kernel achieves 61.8 TFLOPS, so it's actually a +21% speedup in this benchmark.

How do I collect my paycheck?

reply
aaa370
1 month ago
[-]
I gotta see it to believe it ;)
reply
ap4
1 month ago
[-]
For all doubters, I open-sourced it: https://github.com/arekpaterek/Faster_SGEMM_CUDA
reply
ap4
1 month ago
[-]
Believe it or not.

On a 4090 gpu, average of 20 runs of SGEMM_CUDA:

  size    tflops_cublas  tflops_my  diff
  4096²   50.8-50.9      61.8       +21%
  8192²   56.3-56.4      67.1       +19%
  16384²  53.6           66.7       +24%
I guess the right thing to do now would be to hire a B2B salesman and figure out which company needs it.
reply
JuanJohnJames
1 month ago
[-]
Post the code and your CV.
reply
coffeeaddict1
1 month ago
[-]
I'm a little sceptical of your claims. Care to share the kernel you wrote?
reply
david-gpu
1 month ago
[-]
I would bet actual money that they are not doing an apples-to-apples comparison.

I have seen how those high-performance libraries are made and I'm still in awe at the quality and quantity of the staffing involved. Those were the smartest and most knowledgeable engineers I met in my career.

reply
ap4
1 month ago
[-]
It is apples-to-apples. The code from the article runs a chosen kernel 50 times in a loop. In the case of CuBLAS it runs the function cublasGemmEx() 50 times.
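
Roughly, the harness being described does something like the following (a sketch of the pattern, not the repository's exact code): time 50 back-to-back launches with CUDA events and convert to TFLOPS.

  #include <cublas_v2.h>
  #include <cuda_runtime.h>

  // Time 50 back-to-back n x n SGEMMs through cublasGemmEx and report TFLOPS,
  // counting 2*n^3 flops per multiply.
  double bench_cublas_tflops(cublasHandle_t h, int n,
                             const float* dA, const float* dB, float* dC) {
    const float alpha = 1.0f, beta = 0.0f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int r = 0; r < 50; ++r)
      cublasGemmEx(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha,
                   dB, CUDA_R_32F, n, dA, CUDA_R_32F, n, &beta,
                   dC, CUDA_R_32F, n, CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    return 50 * 2.0 * n * n * n / (ms * 1e-3) / 1e12;
  }
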
reply
david-gpu
1 month ago
[-]
Does your implementation support the complete feature set of cublasGemmEx(), e.g. varying matrix sizes? Does your implementation beat cuBLAS on a wide range of cases, or only on cherry-picked examples? Does the code from the article follow the best coding practices for cuBLAS in the first place?

Generalizing from a micro benchmark is typically hubris.

reply
ap4
1 month ago
[-]
reply
GaggiX
1 month ago
[-]
As the article reports: "However, for smaller matrices, we're doing poorly in comparison to Nvidia's library! This happens because cuBLAS contains not one single implementation of SGEMM, but hundreds of them." It would take considerable effort to replace cuBLAS.
reply
KeplerBoy
1 month ago
[-]
Of course it can be done, it's just a lot of effort. You need parametric kernels for which you can find optimal configurations across all hardware and input-size combinations (rough sketch below).
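
A hedged sketch of what "parametric kernels" means in practice (toy shared-memory GEMM, invented dispatch thresholds; a real autotuner would benchmark each instantiation per GPU and shape):

  // Tile size as a template parameter; assumes N is a multiple of BS.
  template <int BS>
  __global__ void sgemm_tiled(int N, const float* A, const float* B, float* C) {
    __shared__ float As[BS][BS], Bs[BS][BS];
    int row = blockIdx.y * BS + threadIdx.y;
    int col = blockIdx.x * BS + threadIdx.x;
    float acc = 0.0f;
    for (int k0 = 0; k0 < N; k0 += BS) {
      As[threadIdx.y][threadIdx.x] = A[row * N + k0 + threadIdx.x];
      Bs[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * N + col];
      __syncthreads();
      for (int k = 0; k < BS; ++k)
        acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
      __syncthreads();
    }
    C[row * N + col] = acc;
  }

  // The host dispatches to an instantiation; in a real system these thresholds
  // come from autotuning per GPU and problem size, not from a hard-coded guess.
  void launch_sgemm(int N, const float* A, const float* B, float* C) {
    if (N <= 2048) sgemm_tiled<16><<<dim3(N / 16, N / 16), dim3(16, 16)>>>(N, A, B, C);
    else           sgemm_tiled<32><<<dim3(N / 32, N / 32), dim3(32, 32)>>>(N, A, B, C);
  }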

Then there are also numerics: being fast is not enough if your implementation accumulates a lot of rounding error along the way. Floating-point arithmetic can and will mess up your results in unexpected ways. -funsafe-math-optimizations famously is neither fun nor safe.

Maybe tooling will catch up and make it easier. Think tinygrad with beam search, Triton, or Halide.

reply