TinyTinyTPU: 2×2 systolic-array TPU-style matrix-multiply unit deployed on FPGA
75 points
4 hours ago
| 7 comments
| github.com
| HN
mrinterweb
2 hours ago
[-]
I've been wondering when we will see general-purpose consumer FPGAs, and eventually ASICs, for inference. This reminds me of bitcoin mining. Bitcoin mining started with GPUs. I think I remember a brief FPGA period that transitioned to ASICs. My limited understanding of Google's tensor processing unit chips is that they are effectively a transformer ASIC. That's likely a wild oversimplification of Google's TPU, but Gemini is proof that GPUs are not needed for inference.

I suspect GPU inference will come to an end soon, as it will likely be wildly inefficient compared to purpose-built transformer chips. All those Nvidia GPU-based servers may become obsolete should transformer ASICs become mainstream. GPU bitcoin mining is just an absolute waste of money (cost of electricity) now. I believe the same will be true for GPU-based inference soon. The hundreds of billions of dollars being invested in GPU-based inference seem like an extremely risky bet that transformer ASICs won't happen, although Google has already widely deployed its own TPUs.

reply
fooblaster
2 hours ago
[-]
FPGAs will never rival GPUs or TPUs for inference. The main reason is that GPUs aren't really GPUs anymore. 50% of the die area or more is for fixed-function matrix multiplication units and associated dedicated storage. This just isn't general purpose anymore. FPGAs cannot rival this with their configurable DSP slices. They would need dedicated systolic blocks, which they aren't getting. The closest thing is the Versal ML tiles, and those are entire processors, not FPGA blocks. Those have failed by being impossible to program.
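For readers unfamiliar with the term, here is a minimal sketch of what a systolic block computes, assuming an output-stationary dataflow chosen for simplicity. It is a software model, not the TinyTinyTPU RTL or any vendor's block, and `systolic_matmul` is a hypothetical name.

```python
def systolic_matmul(a, b):
    """Cycle-by-cycle model of an output-stationary n x n systolic array.

    Assumes a and b are square n x n lists of numbers.
    """
    n = len(a)
    acc = [[0] * n for _ in range(n)]  # one accumulator per processing element
    # Inputs are skewed: row i of A enters i cycles late and column j of B
    # enters j cycles late, so PE (i, j) sees operand pair k = t - i - j
    # at cycle t.
    for t in range(3 * n - 2):
        for i in range(n):
            for j in range(n):
                k = t - i - j
                if 0 <= k < n:
                    acc[i][j] += a[i][k] * b[k][j]
    return acc


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(systolic_matmul(a, b))  # [[19, 22], [43, 50]]
```

Every PE does one multiply-accumulate per cycle with operands handed over from its neighbours, which is why a hardened block beats the same structure stitched together from generic DSP slices and routing.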
reply
fpgaminer
1 hour ago
[-]
> FPGAs will never rival GPUs or TPUs for inference. The main reason is that GPUs aren't really GPUs anymore.

Yeah. Even for Bitcoin mining GPUs dominated FPGAs. I created the Bitcoin mining FPGA project(s), and they were only interesting for two reasons: 1) they were far more power efficient, which in the case of mining changes the equation significantly. 2) GPUs at the time had poor binary math support, which hampered their performance; whereas an FPGA is just one giant binary math machine.

reply
beeflet
1 hour ago
[-]
I have wondered if it is possible to make a mining algorithm FPGA-hard in the same way that RandomX is CPU-hard and memory-hard. Relative to CPUs, the "programming time" cost is high.

Nice username btw.

reply
Lerc
1 hour ago
[-]
I think it'll get to a point with quantisation where the GPUs that run these models will be more FPGA-like than graphics renderers. If you quantize far enough, things begin to look more like gates than floating-point units. At that level an FPGA wouldn't run your model, it would be your model.
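To make the "gates rather than floating-point units" point concrete, here is a small sketch of a 1-bit quantized dot product, assuming weights and activations are restricted to +1/-1 and packed one per bit. The helper names `pack` and `binary_dot` are made up for illustration.

```python
def pack(values):
    """Pack a list of +1/-1 values into an int; bit i is set when values[i] == +1."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n using XNOR and popcount."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask   # XNOR: 1 where the signs match
    matches = bin(agree).count("1")     # popcount
    # Each match contributes +1 and each mismatch -1.
    return 2 * matches - n


a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
print(binary_dot(pack(a), pack(b), len(a)))  # 0
```

The multiply has turned into an XNOR gate and the accumulate into a population count, both of which map directly onto FPGA lookup tables.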
reply
ithkuil
2 hours ago
[-]
Turns out that a lot of interesting computation can be expressed as a matrix multiplication.
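One illustration, as a rough sketch: a 1D convolution (strictly a "valid" cross-correlation, which is what ML frameworks call convolution) rewritten as a matrix-vector product. The function name is hypothetical.

```python
def conv1d_as_matmul(signal, kernel):
    """Compute a valid 1D cross-correlation as a matrix-vector product."""
    n, k = len(signal), len(kernel)
    out_len = n - k + 1
    # Each row of the matrix is the kernel shifted one step to the right.
    matrix = [
        [kernel[c - r] if 0 <= c - r < k else 0 for c in range(n)]
        for r in range(out_len)
    ]
    return [sum(w * x for w, x in zip(row, signal)) for row in matrix]


print(conv1d_as_matmul([1, 2, 3, 4], [1, 0, -1]))  # [-2, -2]
```

Once the operation is in this form, a matrix-multiply unit can run it without any convolution-specific hardware.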
reply
fooblaster
2 hours ago
[-]
Yeah, I wouldn't have guessed it would be helping me write SystemVerilog.
reply
alanma
1 hour ago
[-]
yup, GBs are so much tensor core nowadays :)
reply
bee_rider
1 hour ago
[-]
There are also CPU extensions like AVX512-VNNI and AVX512-BF16. Maybe the idea of communicating out to a card that holds your model will eventually go away. Inference is not too memory bandwidth hungry, right?
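For reference, here is a sketch of the arithmetic the VNNI dot-product instruction (VPDPBUSD) performs in one 32-bit lane: four unsigned 8-bit values times four signed 8-bit values, summed into a 32-bit accumulator. This is a Python model for illustration, not an intrinsic binding, and `vnni_lane` is a made-up name.

```python
def vnni_lane(acc, a_u8, b_s8):
    """Model one 32-bit lane: acc += sum of four u8 * s8 products."""
    assert len(a_u8) == len(b_s8) == 4
    assert all(0 <= x <= 255 for x in a_u8)
    assert all(-128 <= w <= 127 for w in b_s8)
    total = acc + sum(x * w for x, w in zip(a_u8, b_s8))
    # Wrap to signed 32 bits, as the non-saturating form of the instruction does.
    total &= 0xFFFFFFFF
    return total - (1 << 32) if total >= (1 << 31) else total


print(vnni_lane(10, [1, 2, 3, 4], [5, -6, 7, -8]))  # -8
```

A 512-bit register holds sixteen such lanes, so one instruction covers 64 int8 multiply-accumulates.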
reply
Narew
2 hours ago
[-]
There were in the past. Google had the Coral TPU and Intel the Neural Compute Stick (NCS). The NCS is from 2018, so it's really outdated now. It was mainly oriented toward edge computing, so the FLOPS were not comparable to a desktop computer.
reply
moffkalast
1 hour ago
[-]
Even for edge computing, neither was really capable of keeping up with even the slowest Jetson's GPU, for not much less power draw.
reply
tucnak
2 hours ago
[-]
It all comes down to memory and fabric bandwidth. For example, the state-of-the-art developer-friendly (PCIe 5.0) FPGA platform is the Alveo V80, which rocks four 200G NICs. Basically, Alveo currently occupies this niche where it's the only platform on the market to allow programmable in-network compute. However, what's available in terms of bandwidth lags behind even pathetic platforms like BlueField. Those in the know are aware of the challenges in actually saturating it for inference in practical designs. I think Xilinx is super well-positioned here, but without some solid hard IP it's still a far cry from purpose-built silicon.
reply
mrinterweb
2 hours ago
[-]
As far as I understand, all the purpose-built inference silicon out there is being kept in-house rather than sold to competitors: Google's TPU, Amazon's Inferentia (horrible name), Microsoft's Maia, Meta's MTIA. It seems that custom inference silicon is a huge part of the AI game. I doubt GPU-based inference will be relevant/competitive soon.
reply
nightshift1
1 hour ago
[-]
According to this SemiAnalysis article, the Google/Broadcom TPUs are being sold to others like Anthropic.

https://newsletter.semianalysis.com/p/tpuv7-google-takes-a-s...

reply
nomel
1 hour ago
[-]
> It seems that custom inference silicon is a huge part of the AI game.

Is there any public info about % inference on custom vs GPU, for these companies?

reply
mrinterweb
1 hour ago
[-]
Gemini is likely the most widely used gen AI model in the world considering search, Android integration, and countless other integrations into the Google ecosystem. Gemini runs on their custom TPU chips. So I would say a large portion of inference is already using ASICs. https://cloud.google.com/tpu
reply
almostgotcaught
1 hour ago
[-]
> soon

When people say things like this I always wonder if they really think they're smarter than all of the people at Nvidia lolol

reply
mrinterweb
1 hour ago
[-]
Soon was wrong. I should have said it is already happening. Google Gemini already uses their own TPU chips. Nvidia just dropped $20B to buy the IP for Groq's LPU (custom silicon for inference). $20B says Nvidia sees the writing on the wall for GPU-based inference. https://www.tomshardware.com/tech-industry/semiconductors/nv...
reply
alanma
1 hour ago
[-]
Thanks again for the repost and all the support!! Been a blast and super cool to see the interest, if you want to follow along for more of our writeups, our blog can be found here: https://chewingonchips.substack.com/
reply
hinkley
4 hours ago
[-]
I think I could trust AI more if we used it to do heuristics for expensive deterministic processes. Sort of a cross between Bloom filters and speculative execution. Determine the odds that expensive operation 1 will indicate that expensive operation 2 needs to happen, and then start expensive operation 2 while we determine if it’s actually needed. If it’s right 95% of the time, which is the sort of range AI can aspire to, that’s skipping the high-latency task chaining 19 times out of 20, which would be pretty good.
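A back-of-envelope sketch of that saving, assuming both operations take 100 time units (a made-up figure) and looking only at requests that really do need operation 2. The function names are hypothetical, and wasted work from false positives is not modeled.

```python
def request_latency(t1, t2, predicted_needed):
    """Latency of one request that really does need operation 2."""
    if predicted_needed:
        # Operation 2 was started speculatively, so the two overlap.
        return max(t1, t2)
    # Missed prediction: operation 2 starts only after operation 1 finishes.
    return t1 + t2


def total_latency(t1, t2, predictions):
    return sum(request_latency(t1, t2, p) for p in predictions)


# 20 requests; the predictor catches 19 of them (95%).
predictions = [True] * 19 + [False]
speculative = total_latency(100, 100, predictions)
chained = total_latency(100, 100, [False] * 20)
print(speculative, chained)  # 2100 4000
```

Under these assumptions the speculative path takes a little over half the total time of always chaining.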
reply
hnuser123456
3 hours ago
[-]
There are Bayesian neural networks that could apparently track probability rather than just e.g. randomly selecting one output from the top-k based on probability, but I'm still learning up on them myself. Sounds like they're not normally combined with language models.
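For context, the top-k sampling mentioned above can be sketched like this. It is a minimal illustration with made-up probabilities, not any particular model's decoder, and it says nothing about how Bayesian networks work.

```python
import random


def top_k_sample(probs, k, rng):
    """Pick one index from the k most likely, weighted by probability."""
    # Indices of the k largest probabilities.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in top]
    # random.choices accepts unnormalized weights, so no rescaling is needed.
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)
probs = [0.5, 0.1, 0.3, 0.1]
samples = [top_k_sample(probs, 2, rng) for _ in range(10)]
print(samples)  # only indices 0 and 2 can appear
```

The sampler returns a single token index and throws away the rest of the distribution, which is the contrast being drawn with methods that carry uncertainty forward.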
reply
rjsw
4 hours ago
[-]
There have been comments that some leading AI researchers were switching away from working on language models to do stuff with "real world data".
reply
p1esk
26 minutes ago
[-]
What do you mean?
reply
babl-yc
2 hours ago
[-]
This is cool. I'm observing a trend of "build a tiny version from the ground-up to understand it" a la Karpathy's micrograd/minGPT. Seems like one of the best ways to learn.
reply
alanma
1 hour ago
[-]
thanks for the kind words of support! definitely taught us a thing or two, hope you enjoyed the ride along

- Alan and Abiral

reply
aunty_helen
4 hours ago
[-]
I think it’s only a matter of time before we see ASIC vendors making TPU devices. Same thing happened with BTC. There was enough money there to spawn an industry. Nvidia’s 70% margins are too hard to ignore. And if playing on the open market seems too rough, there’s always acquisition potential like what happened to Groq.
reply
NitpickLawyer
3 hours ago
[-]
Aren't high end accelerators already closer to ASICs than to og GPUs, tho?
reply
tonetegeatinst
1 hour ago
[-]
Yes, but not as much as you think.

A lot of silicon on a GPU is dedicated to upscaling and matrix multiply.

Ultimately a GPU's main use is multimedia- and graphics-focused.

See all the miners that used to do GPU-based mining... or the other niche markets where eventually the cost of a custom ASIC becomes too attractive to ignore, even if you as a consumer have to handle a few years of growing pains.

reply
alanma
1 hour ago
[-]
hard to argue today's GPUs are really graphics focused anymore in the training / inference race :O

really excited about Rubin CPX / Feynman generations, let's see what the LPU does to the inference stack

reply
ph4evers
3 hours ago
[-]
Such a cool project! Next one is to run jaxprs via the driver?
reply
alanma
1 hour ago
[-]
Definitely thinking about that! Would be very cool to run the JAX / Pallas stack, noted on our end :)

- Alan and Abiral

reply
fooblaster
4 hours ago
[-]
Great! How do you program it?
reply
alanma
1 hour ago
[-]
There are a couple of core commands in our ISA, detailed on our GitHub; you map your problem to matrix ops. Our tpu_compiler and tpu_driver are the core of programming your own. Here's a brief excerpt:

import torch
import torch.nn as nn

from tpu_compiler import TPUCompiler, TPURuntime

class Custom(nn.Module):
    def __init__(self):
        super().__init__()
        # two 2x2 linear layers, sized to fit the 2x2 array
        self.layer1 = nn.Linear(2, 2, bias=False)
        self.layer2 = nn.Linear(2, 2, bias=False)

    def forward(self, x):
        x = self.layer1(x)
        x = torch.relu(x)
        x = self.layer2(x)
        return x

# train_model and your_data are placeholders for your own training code
model = train_model(your_data)

# compile to the tiny tiny TPU format
compiler = TPUCompiler()
compiled = compiler.compile(model)

# run and enjoy :)
# tpu and input_data are not defined in this excerpt
runtime = TPURuntime(tpu)
result = runtime.inference(compiled, input_data)

Will update soon with some better documentation, but hopefully this will get you started!

- Alan and Abiral
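For anyone wanting to sanity-check hardware results, here is a plain-Python reference of the arithmetic that the `Custom` model above performs: two bias-free 2×2 linear layers with a ReLU in between. This is only a sketch; the weights are made-up values and `reference_forward` is a hypothetical helper, not part of tpu_compiler or tpu_driver.

```python
def matvec(w, x):
    """y[i] = sum_j w[i][j] * x[j], the math of a bias-free linear layer."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def relu(v):
    return [max(0, a) for a in v]


def reference_forward(w1, w2, x):
    """layer1 -> ReLU -> layer2, matching the forward() in the excerpt."""
    return matvec(w2, relu(matvec(w1, x)))


# Hypothetical weights and input, chosen so the result is easy to check by hand.
w1 = [[1, -1], [2, 0]]
w2 = [[1, 1], [0, -1]]
print(reference_forward(w1, w2, [3, 1]))  # [8, -6]
```

Comparing this output against what the runtime returns for the same weights is a quick way to tell a compiler or driver issue from a modelling one.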

reply