This makes the project proportionally harder in my opinion, because you need to be that much more efficient at moving data through the memory hierarchy. With tensor cores, to get anywhere close to cuBLAS, you need to start with something like the most efficient kernel in Simon's article, and then do things like shared memory swizzling, async global memory copies, double buffering, and writing a really efficient kernel epilogue to accumulate the product into the C matrix.
I came across this article a while ago and it inspired me to take a stab at this^. As of now I have gotten to ~80% of cuBLAS tensor core performance, where the kernel is mostly compute bound, and I am close to giving up on the last ~20%: I think I may need to write the inner loop in SASS to make sure the instruction mix between shared memory loads, mma instructions, and synchronizations is perfectly balanced so that none of the hardware pipelines get overloaded (see link below), and I have enough compassion for myself not to spend my free time doing stuff like that :). There are also certain things implemented in CUTLASS that seem important (look up serpentine traversal), but NVIDIA engineers won't talk about the hardware details required to understand why this helps.
Article on this is forthcoming
I’d be so happy if SASS were documented and ptxas were open source; sometimes I spend entire days going through whitepapers and various sources of online documentation to get more hardware details…
My guess is that people nowadays are gradually moving away from raw CUDA programming towards things like Triton, and that pure GEMM matters less anyway, since in practice you tend to fuse other operations into the kernel.
The Triton tutorial claims their performance is on par with cuBLAS.
https://triton-lang.org/main/getting-started/tutorials/03-ma...
Your guess is wrong. Besides the fact that there's much more to life than matmul (for which triton is just ok), the other obvious fact is that triton has exactly 1 frontend (python) and there's much more to life than that frontend.
I find that basically in every thread about low-level work there's someone making some weird comment about how triton or mojo or XYZ supplants CUDA or assembly or whatever. I can't understand how this comes about because absolutely no one working in these areas thinks XYZ is going to supplant anything. So it's invariably outsiders making these claims and I cannot fathom why any outsider would be motivated to make claims from the outside.
As an outsider I find CUDA so intimidating that the promise of Triton etc. is very appealing, and I wanted to be sold.
i have PRs in Triton - i'm well familiar with the fact that triton is an MLIR project.
> C++ straight using MLIR
that's like saying LLVM IR is usable through C++... or hell, that's like saying NVPTX is usable through C++. it's not just not a frontend, it's the exact opposite: it's emitting IR using IR builders.
Knowing that reaching broad devex parity is very expensive, I think the real win is figuring out what specific problem you have and building community and robust software support around that.
It's the fact that AMD doesn't prioritize the reliability of its hardware and software stack. If I run llama.cpp on Vulkan I get a reasonable speedup, but if I raise the batch size to 512, the GPU starts making strange noises and shuts the PC down midway. Very cool. 98% of zero is still zero.
In fact cuBLAS and CUDA are kinda orthogonal in that you're either calling a pre-built cuBLAS kernel or writing your own CUDA kernel but not really combining the two.
I'd say CUDA shines more because of stability, documentation, community support + examples, and ability to use modern C++ features in GPU code.
Targeting nvidia GPUs? Or in general? For whom?
Building a performant BLAS library is hard but certainly not impossible. The tricks discussed in this post are hardly anything new either. Now, making a BLAS competitive with Nvidia's on its own GPUs is bound to be tough. But not technically unfeasible (after all, you can drop down to PTX if needed).
On average over 20 runs:
cuBLAS (./sgemm 0) has 50.9 TFLOPS.
My kernel has 61.8 TFLOPS, so it's actually +21% speedup in this benchmark.
How do I collect my paycheck?
On a 4090 gpu, average of 20 runs of SGEMM_CUDA:
size    tflops_cublas  tflops_my  diff
4096²   50.8-50.9      61.8       +21%
8192²   56.3-56.4      67.1       +19%
16384²  53.6           66.7       +24%
I guess the right thing to do now would be to hire a B2B salesman and figure out which company needs it. I have seen how those high-performance libraries are made and I'm still in awe at the quality and quantity of the staffing involved. Those were the smartest and most knowledgeable engineers I met in my career.
Generalizing from a micro benchmark is typically hubris.
Then there are also numerics: being fast is not enough if your implementation accumulates a lot of rounding error along the way. Floating-point arithmetic can and will mess up your results in unexpected ways. -funsafe-math-optimizations famously is neither fun nor safe.
Maybe tooling will catch up and make it easier. Think tinygrad with beam search, Triton, or Halide.