I suspect GPU inference will come to an end soon, as it will likely be wildly inefficient compared to purpose-built transformer chips. All those Nvidia GPU-based servers may become obsolete if transformer ASICs become mainstream. GPU Bitcoin mining is now an absolute waste of money (the electricity costs more than what you mine), and I believe the same will soon be true of GPU-based inference. The hundreds of billions of dollars being invested in GPU-based inference looks like an extremely risky bet that transformer ASICs won't happen, even though Google has already widely deployed its own TPUs.
Yeah. Even for Bitcoin mining, GPUs dominated FPGAs. I created the Bitcoin mining FPGA project(s), and they were only interesting for two reasons: 1) they were far more power-efficient, which in the case of mining changes the equation significantly; and 2) GPUs at the time had poor binary math support, which hampered their performance, whereas an FPGA is just one giant binary math machine.
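To make "binary math" concrete: the heart of Bitcoin mining is SHA-256, whose inner loop is nothing but 32-bit rotates, shifts, XORs, ANDs and adds. Here's a rough Python sketch of a few of those primitives (not from any particular miner's code, just the standard SHA-256 definitions), which is exactly the kind of logic an FPGA or ASIC implements as plain wires and lookup tables:

# Core 32-bit bit operations from SHA-256's compression function.
# Everything is masked to 32 bits.
MASK = 0xFFFFFFFF

def rotr(x, n):
    # rotate a 32-bit word right by n bits
    return ((x >> n) | (x << (32 - n))) & MASK

def big_sigma0(a):
    return rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)

def big_sigma1(e):
    return rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)

def ch(e, f, g):
    # "choose": select bits from f or g based on e
    return ((e & f) ^ (~e & g)) & MASK

def maj(a, b, c):
    # bitwise majority of the three inputs
    return (a & b) ^ (a & c) ^ (b & c)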
Nice username btw.
https://newsletter.semianalysis.com/p/tpuv7-google-takes-a-s...
Is there any public info on what percentage of inference runs on custom silicon vs. GPUs for these companies?
When people say things like this, I always wonder if they really think they're smarter than all of the people at Nvidia lolol
- Alan and Abiral
A lot of silicon on a GPU is dedicated to upscaling and matrix multiply.
Ultimately a GPU's main use is multimedia and graphics.
See all the miners that used to do GPU-based mining... or the other niche markets where eventually the economics of a custom ASIC become too attractive to ignore, even if you as a consumer have to handle a few years of growing pains.
Really excited about the Rubin CPX / Feynman generations; let's see what the LPU does to the inference stack.
- Alan and Abiral
import torch
import torch.nn as nn

from tpu_compiler import TPUCompiler, TPURuntime

# define a small PyTorch model
class Custom(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(2, 2, bias=False)
        self.layer2 = nn.Linear(2, 2, bias=False)

    def forward(self, x):
        x = self.layer1(x)
        x = torch.relu(x)
        x = self.layer2(x)
        return x

# train however you like (train_model and your_data are placeholders for your own code)
model = train_model(your_data)

# compile to the tiny tiny TPU format
compiler = TPUCompiler()
compiled = compiler.compile(model)

# run and enjoy :) (tpu is your device handle, input_data your input tensor)
runtime = TPURuntime(tpu)
result = runtime.inference(compiled, input_data)
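As a quick sanity check (a sketch only; it assumes the runtime returns something array-like, which the snippet above doesn't actually specify), you can compare the TPU result against the plain PyTorch forward pass:

# Hypothetical sanity check: the conversion below assumes `result` is
# array-like; adjust if the runtime returns its own result object.
expected = model(input_data)
print(torch.allclose(torch.as_tensor(result, dtype=expected.dtype), expected, atol=1e-2))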
We'll update soon with better documentation, but hopefully this will get you started!
- Alan and Abiral