Performance per dollar is getting faster and cheaper
184 points | 8 hours ago | 12 comments | wafer.ai | HN
minraws
6 hours ago
Can you folks add performance per watt as a metric to these comparisons? I honestly want to understand where AMD fits in the stack in terms of actual performance per dollar. I have had talks with companies wanting to build data centers outside the US, and they find it hard to source anything Nvidia in sufficient capacity and scale.

The question is whether AMD is competitive on performance per watt and roughly reliable in terms of software support, which is what most folks outside the US prioritize above all else, since outside of China and the US electricity tends to be at a relative premium.

Maybe if they make smaller data centers viable at the right price, AMD could be part of the stack outside the US, wherever Nvidia is more limited in supply. Though I have genuinely no idea what sourcing an AMD GPU looks like.

I have never seen a company use AMD outside of wafer and a couple others mostly in US.

Genuinely intriguing, or maybe not really: it could be that this stuff is common knowledge and I am just stuck in my Nvidia bubble here.

reply
kingstnap
4 hours ago
A DGX B200 costs around $0.5M and uses around 14 kW.

If you plan to run it straight for 8 years at 100% max usage, that's around 1 GWh.

A gigawatt-hour is a lot of energy, but it's not that much compared to the price of the actual machine. In Germany, for example, with its expensive energy, that's about €100k worth, which spread over 8 years is pretty minor compared to the upfront half million.

The real issue with high power consumption is not really the cost of energy but the limited power supply you can get for a datacenter. A more efficient setup is highly desirable because it means you can fit more into the limited power hookup.
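
A quick sanity check of the arithmetic above, as a minimal sketch. The 14 kW draw, the 8-year horizon, and the ~€100k total come from the comment; the per-kWh rate is only what those figures imply, not a quoted tariff, and the function name is made up for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # ignores leap days


def lifetime_energy_kwh(power_kw: float, years: float, utilization: float = 1.0) -> float:
    """Energy drawn over the whole period, in kWh."""
    return power_kw * HOURS_PER_YEAR * years * utilization


# Figures taken from the comment: 14 kW, 8 years, 100% utilization.
energy_kwh = lifetime_energy_kwh(power_kw=14, years=8)
energy_gwh = energy_kwh / 1_000_000

# Rate implied by the ~EUR 100k estimate (an inference, not a real tariff).
implied_rate_eur_per_kwh = 100_000 / energy_kwh

print(f"{energy_gwh:.2f} GWh, implied rate {implied_rate_eur_per_kwh:.3f} EUR/kWh")
```

Cooling and other facility overhead are not included here, so the real energy bill would presumably be higher than this figure.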

reply
dannyw
1 hour ago
It’s more than power supply. Cooling and ventilation become a MUCH bigger deal at rack scale, and that costs electricity too.
reply
Twirrim
6 hours ago
> I have never seen a company use AMD outside of wafer and a couple others mostly in US.

There are a few using them, and even more starting to experiment with them. AMD has long been a source of disappointment around this side of things, so I'm hesitant to feel optimistic we'll finally get some competition. The market really needs viable competition to Nvidia, especially on performance/watt.

reply
7thpower
32 minutes ago
Typically any company that can’t get Nvidia to fill their orders will have at least some AMD.
reply
craftkiller
6 hours ago
reply
Schiendelman
5 hours ago
It's not clear when this will be. AMD has slipped these dates, likely to 2027.
reply
latchkey
4 hours ago
> I have never seen a company use AMD outside of wafer and a couple others mostly in US.

Just because you haven't seen it doesn't mean it doesn't exist.

We've serviced over 700 customers on our MI300X.

reply
technoabsurdist
5 hours ago
AMD MI355X uses 1,400W per GPU and NVIDIA B200 uses 1,200W. So AMD uses about 16% more power.
reply
vlovich123
5 hours ago
That’s not how you measure performance per watt, but generally it’s 20-60% worse at tok/s/watt, not 16%. It does have ~50% more memory (~100 GB more), which complicates the comparison.
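
To make the distinction concrete, here is a rough sketch of how the power ratio and a tok/s/watt gap relate. The 1,400 W and 1,200 W figures are the ones quoted in the thread; treating them as actual draw under load, and reading "20-60% worse" as a fractional deficit in tok/s/watt, are both assumptions. The function names are hypothetical.

```python
MI355X_WATTS = 1400  # per-GPU figure quoted in the thread
B200_WATTS = 1200    # per-GPU figure quoted in the thread

# AMD draws roughly 16.7% more power per GPU.
power_ratio = MI355X_WATTS / B200_WATTS


def relative_perf_per_watt(throughput_ratio: float) -> float:
    """AMD tok/s/W divided by NVIDIA tok/s/W, given the AMD/NVIDIA throughput ratio."""
    return throughput_ratio / power_ratio


def implied_throughput_ratio(perf_per_watt_deficit: float) -> float:
    """AMD/NVIDIA throughput ratio implied by a fractional tok/s/W deficit (0.2 = 20% worse)."""
    return (1 - perf_per_watt_deficit) * power_ratio


print(f"equal throughput -> {relative_perf_per_watt(1.0):.3f}x perf/W")
for deficit in (0.2, 0.6):
    ratio = implied_throughput_ratio(deficit)
    print(f"{deficit:.0%} worse perf/W -> {ratio:.2f}x throughput")
```

Under these assumptions, the power difference alone accounts for roughly a 14% tok/s/watt gap at equal throughput, so a 20-60% gap would imply lower throughput as well.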
reply
hassaanr
2 hours ago
While cool, quantization to FP4 is practically never lossless in actual use. A lot of providers are advertising high TPS on Kimi and GLM, but the models are functionally lobotomized and no longer close to frontier quality. Would love to see this not be true.
reply
google234123
1 hour ago
First thing I noticed as well
reply
tw1984
1 hour ago
From memory, it is around 96-98% of the accuracy.
reply
lgessler
44 minutes ago
Accuracy isn't a meaningful metric here without reference to a specific task.
reply
nxtfari
3 hours ago
I think we should make it illegal to not specify the quantization in the headline for these types of posts.
reply
ahmadyan
2 hours ago
It's MXFP4.
reply
p1esk
4 hours ago
There was noticeable accuracy degradation when they switched from FP8 to MXFP4.
reply
greyb
1 hour ago
Wafer discontinued their own "Wafer Pass" flagship coding plan [1] within weeks of launch and had to issue prorated refunds. Now they're bragging about squeezing costs down even further via quantization, even though their implementation is clearly lacking.

[1] https://www.ycombinator.com/launches/Q9i-wafer-pass-flat-rat...

reply
throwdbaaway
4 hours ago
And somehow they claimed that it is "lossless".
reply
Schiendelman
5 hours ago
I'm not surprised to see competition with Blackwell. Rubin is 5x faster than Blackwell at inference - Blackwell is the last generation Nvidia didn't optimize specifically for inference.

If I'm missing something, please let me know!

reply
boroboro4
2 hours ago
It's very unclear what's special in Rubin that makes it optimized for inference. I can see the disaggregated bit (having separate prefill and decoding nodes), but what else?
reply
villgax
1 hour ago
A lot more SMs and Tensor Cores for NVFP4, by the looks of it.
reply
nullc
4 hours ago
How do you get 5x faster at inference when inference is memory-bandwidth limited? Getting 5x the memory bandwidth of an H100 seems physically difficult.
reply
Schiendelman
4 hours ago
Rubin has 22TB/s of memory bandwidth vs Blackwell's 8TB/s. NVLink 6 doubles interconnect speed. Plus they're moving to 3nm from ~4nm.
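
As a rough sketch of the bandwidth argument: if single-stream decode is purely memory-bandwidth bound, the speedup is capped by the bandwidth ratio. The 22 TB/s and 8 TB/s figures are the ones quoted above; the simplified model (each token streams the active weights once) and the 20 GB weight size are assumptions chosen only for illustration.

```python
RUBIN_TB_S = 22      # memory bandwidth quoted in the thread
BLACKWELL_TB_S = 8   # memory bandwidth quoted in the thread
ACTIVE_WEIGHTS_GB = 20  # hypothetical bytes read per token; cancels out in the ratio


def decode_tokens_per_s(bandwidth_tb_s: float, weights_read_gb: float) -> float:
    """Upper bound on single-stream decode rate if each token streams the active weights once."""
    return bandwidth_tb_s * 1000 / weights_read_gb


rubin_rate = decode_tokens_per_s(RUBIN_TB_S, ACTIVE_WEIGHTS_GB)
blackwell_rate = decode_tokens_per_s(BLACKWELL_TB_S, ACTIVE_WEIGHTS_GB)
bandwidth_bound_speedup = rubin_rate / blackwell_rate

print(f"bandwidth-bound speedup: {bandwidth_bound_speedup:.2f}x")
```

Under this simplified model the bandwidth increase alone gives 2.75x, so the rest of a 5x claim would have to come from something other than raw memory bandwidth.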

(Previously this comment said Rubin did native NVFP4, but Blackwell does too! Rubin just also trains with native NVFP4, which Blackwell does not.)

reply
boredatoms
3 hours ago
Moving to lower bits is not a slam dunk; the model itself might degrade too much.
reply
Schiendelman
3 hours ago
Of course, but for most workflows it's fine.
reply
zackangelo
3 hours ago
Blackwell supports NVFP4 natively.
reply
Schiendelman
3 hours ago
You're right - Rubin is better at NVFP4 training, not inference, thank you for catching me!
reply
boroboro4
2 hours ago
What does it mean that it's better at NVFP4 training? What's different between training and inference that would make this true?
reply
Schiendelman
1 hour ago
We're getting to the limit of my understanding, but I believe most Blackwell users still usually run FP8 passes through the transformer engine - they'll just store weights at NVFP4. Nvidia has model-specific stabilization recipes for NVFP4 end to end, but they're taking fixes all the time.

Nvidia says Rubin should have fewer stability problems training with FP4 because of hardware changes - "adaptive compression". There will still be outlier instability inherently, but something they're designing in reduces the cost of managing it.

But yeah, grain of salt - we haven't seen this in practice.

reply
fc417fc802
1 hour ago
I'm also puzzled by that statement. The issue with training is (as I understand it) one of precision and the associated numerical stability. You need enough bits in order for backprop to function correctly.

Of course, there are techniques such as quantization-aware training, but I don't understand why a datatype would work for inference but not for that.

You can also abandon backprop entirely, but that comes with a whole host of tradeoffs. And again, why would it work for inference but not for whatever alternative training regime you selected?

reply
AussieWog93
6 hours ago
The 2600 tok/s is an "aggregate", not the actual throughput.
reply
technoabsurdist
6 hours ago
Yes, it is 213 tok/s single-stream (so per user).
reply
3836293648
5 hours ago
So per subagent*.
reply
alienbaby
3 hours ago
*per stream, I guess is more accurate than either?
reply
alienbaby
4 hours ago
I'm interested in whether anyone knows how much legwork the assumed 60% cache hit rate, plus running a quantised model, is doing, especially compared to what the headline half implies is a full-fat GLM5.2.
reply
killingtime74
3 hours ago
No word on what this actually means for a consumer. What's the price? Is it lower than NVIDIA serving?
reply
mixtureoftakes
52 minutes ago
They seem to be serving it at 3x the price while also struggling to maintain uptime on OpenRouter, while the Vercel router advertises even higher speeds but has no clear uptime stats.

I guess you really do have to try it, at least for some time, to actually know.

reply
oDot
6 hours ago
Do these providers have 80+% gross margins or is something eating into them? Maybe utilization?
reply
technoabsurdist
6 hours ago
Hi, I work at Wafer. No, the margins are lower, averaging about 40%. Utilization is one of the highest-order bits in determining margins here, yes.
reply
beffjezos
2 hours ago
This is very interesting and yet not at the same time. This looks to be optimized for single-stream LLM traffic which is not viable to serve in a production setting. It's only interesting to hobbyists that want to run the model locally.

It's genuinely neat that AI can find the right optimization pathways in an AMD inference server to unlock this, but by the same token (pun intended) this is a classic case of benchmark hacking that doesn't stand up to real-world application.

reply
wmf
2 hours ago
You got it backwards; it's ~200 on single stream so the 2,600 is achieved with ~13 streams.
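
The stream count follows from simple division. A small sketch, assuming per-stream speed stays constant as concurrency grows (in practice it usually drops under batching, so the real number of concurrent streams could be higher); the function name is made up for illustration.

```python
AGGREGATE_TOK_S = 2600      # node-level figure from the post
SINGLE_STREAM_TOK_S = 213   # per-stream figure given elsewhere in the thread


def implied_streams(aggregate_tok_s: float, per_stream_tok_s: float) -> float:
    """Concurrent streams needed to reach the aggregate rate at a fixed per-stream rate."""
    return aggregate_tok_s / per_stream_tok_s


print(implied_streams(AGGREGATE_TOK_S, 200))                  # rounded per-stream rate
print(implied_streams(AGGREGATE_TOK_S, SINGLE_STREAM_TOK_S))  # quoted per-stream rate
```

With the rounded 200 tok/s this gives 13 streams; with the quoted 213 tok/s it is a little over 12.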
reply
beffjezos
1 hour ago
Yeah that makes sense. I'm more familiar with seeing tok/s/user + TTFT rather than the total node throughput.
reply
technoabsurdist
2 hours ago
Hi, yes, it’s not optimized for single stream; it’s optimized for total node throughput.
reply
beffjezos
1 hour ago
Oh, that's much better then. A good metric to share is the tokens per second per user for the node rather than the total throughput of the node. It disambiguates what's being optimized for much better than your blog post currently does.
reply
technoabsurdist
4 minutes ago
Sounds good, feedback taken. Thanks, beffjezos.
reply
villgax
1 hour ago
They fail to mention non-speculative numbers and whether the baseline was NVFP4 as well. So much for erosion against an older gen.
reply
yieldcrv
6 hours ago
Agentic coding drivers for different architectures is a massive unlock for the world

So much compute is underutilized, waiting for a savant or company to prioritize an architecture, and now all the other engineers can tackle this at any time if they get inspired with the right prompts.

reply
technoabsurdist
5 hours ago
This is exactly our thesis at Wafer :) Thank you for the support.
reply
yieldcrv
1 hour ago
well done
reply
yogthos
5 hours ago
Personally, I can't wait till something like this starts getting to consumer level. https://www.anuragk.com/blog/posts/Taalas.html
reply
yieldcrv
5 hours ago
That’s pretty fascinating. Apple has some innocuous LLMs and transformers baked into its devices that leverage its neural chipset.

So I could see something like this, where the neural chipset has an LLM baked into it that can't be so easily updated until you get a new device.

reply