FilterHN

Matrix Multiplications on GPUs Run Faster When Given "Predictable" Data

81 points

by tosh

4 days ago

| 7 comments

| thonking.ai

▲

dan_sbl

46 minutes ago

> For example, when the GPU is fully idle, nvidia-smi tells me that it’s only pulling 88W of power.

I haven't used a non-laptop GPU in some time, but that is a crazy amount of "idle" power consumption. Is this normal for cards like this?

▲

wildzzz

52 seconds ago

If my GPU is sitting idle, and I mean idle with nothing loaded into its memory, it's sitting at about 18W. If I load in a model that uses nearly all of the memory but that model is idle, it's at 36W. If that model is actively thinking, it's like 118W. I think this is likely due to the GPU being aware that there is real data loaded into memory and turning up the DRAM refresh rate, whereas when nothing is loaded, the dynamic power is as low as possible.

▲

Aurornis

38 minutes ago

Server cards are not optimized for idle power usage. They’re expected to be fully utilized.

For server gear it’s more common to have less dynamic power and voltage switching because it produces more predictable performance and latency.

▲

cmovq

6 minutes ago

For GeForce cards you can get similar behavior by setting “Prefer maximum performance” which disables some of the low power states.

▲

jayd16

1 hour ago

I can't tell from the blog: is this actually verified, or is it theory and then numbers showing plausibility?

I could certainly come up with alternative theories about memory compression and prefetching if we were talking about texture reads.

▲

amelius

43 minutes ago

Sounds like a side channel attack waiting to happen.

▲

unglaublich

10 minutes ago

So I guess we'll all be applying a random rotation to our matrices now to obscure their contents, like TurboQuant does. https://arkaung.github.io/interactive-turboquant/#rotation

▲

nzach

1 hour ago

I went in expecting to find 'branch prediction'[0] as the answer, but apparently things are even more complex nowadays.

[0] - https://stackoverflow.com/questions/11227809/why-is-conditio...

▲

gruez

14 minutes ago

>I went in expecting to find 'branch prediction'[0]

GPUs do branch prediction? I thought they didn't bother and instead tried to minimize wasted effort by running large numbers of concurrent threads?

▲

jayd16

1 minute ago

They do texture prefetching, which is sorta similar.

▲

kangalioo

1 hour ago

To be fair, the culprit in the article is _less complex_ than branch prediction: "with random data, bits are flipped often, and bit flips in transistors inherently draw power" is less mental gymnastics than "with random data, the CPU fails to predict the future, causing redundant speculative execution".
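A minimal sketch of the bit-flip intuition, in plain Python. It only counts how many bits change between consecutive 16-bit words, as a rough proxy for switching activity; it does not measure power, and the helper name `mean_bit_flips`, the 16-bit width, and the sample data are all assumptions made for illustration rather than anything taken from the article.

```python
import random

def mean_bit_flips(values, width=16):
    """Average number of bits that differ between consecutive values."""
    mask = (1 << width) - 1
    flips = [bin((a ^ b) & mask).count("1") for a, b in zip(values, values[1:])]
    return sum(flips) / len(flips)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000

# Hypothetical inputs: uniformly random 16-bit words vs. one repeated word.
random_words = [rng.getrandbits(16) for _ in range(n)]
constant_words = [0x3C00] * n  # 0x3C00 is the fp16 bit pattern for 1.0

print(mean_bit_flips(random_words))    # close to 8 of 16 bits per step
print(mean_bit_flips(constant_words))  # 0.0
```

With independent random bits, each bit differs from its predecessor about half the time, so roughly 8 of 16 bits toggle per step, while a repeated value toggles none.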

▲

bitwize

40 minutes ago

It wouldn't surprise me to see some ML algorithm in silico somewhere to select faster matmul paths on favorable data. Yo dawg, I heard you like AI, so we put some AI in your AI so you can infer while you're inferring.

▲

Nevermark

10 minutes ago

Here is one: an adjustment to weight updates that makes it more likely for weights to stay uniformly distributed.

~257.5 teraflops for normal distribution, versus ~268 teraflops uniform, reported on the first graph.

I would have liked to see a straight graph of performance vs. clock speed, for each type of data. Pick your data statistics, then pick the peak performance clock speed accordingly.

And for actual runs, from a pre-run sampled curve.
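For scale, a quick arithmetic check on the two throughput figures quoted in this comment. The numbers are the approximate values read off the graph as stated above; the variable names are just for illustration.

```python
# Approximate throughput figures quoted in the comment (teraflops).
normal_tflops = 257.5
uniform_tflops = 268.0

# Relative speedup of uniform over normally distributed inputs.
speedup = (uniform_tflops - normal_tflops) / normal_tflops
print(f"{speedup:.1%}")  # 4.1%
```

So the gap under discussion is on the order of 4%.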

▲

falcor84

22 minutes ago

And there's at least one more level of inception at the data center level, where they use AI to optimize power usage (particularly by predictively controlling cooling, and adaptively rescheduling tasks).

▲

gdevenyi

3 hours ago

People have been noticing the effects of this in local LLM inference. Power limiting seems to improve overall performance!

▲

Aurornis

1 hour ago

This is not observable from LLM inference, where you would not encounter uniform matrices.

Power limiting does not improve performance but it does improve efficiency. You might be able to get 90% of the performance for only 70% of the power usage, for example. It does not make the card go faster though.
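To put a number on that example: a small sketch of the performance-per-watt arithmetic, using the illustrative 90% and 70% figures above. These are the hypothetical values from the comment, not measurements, and `relative_efficiency` is a made-up helper name.

```python
def relative_efficiency(perf_fraction, power_fraction):
    """Performance per watt relative to the unlimited card (1.0 = unchanged)."""
    return perf_fraction / power_fraction

# Illustrative figures from the comment: 90% of the performance at 70% of the power.
ratio = relative_efficiency(0.90, 0.70)
print(f"{ratio:.2f}x performance per watt")  # 1.29x performance per watt
```

The card does about 29% more work per watt, but absolute throughput is still 10% lower.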

▲

Lerc

20 minutes ago

When thermal throttling occurs you can perform faster by running slower.

This is precisely because of the efficiency. The lower efficiency at the higher speed triggers throttling, and with it much lower performance, sooner.

▲

gchamonlive

2 hours ago

In general, constraints require optimizations and rearchitectures. I'd also expect the RAM shortage, for instance, to have a big impact on the software industry as a whole, especially in games. They will need to make do with what people have: a PS5/Pro or a PC of similar power.

▲

aNoob7000

2 hours ago

I actually think it is a good thing to introduce constraints to AI and the overall tech industry. Hopefully everyone will have to look at improving performance without having to add RAM or increase CPU/GPU performance.

▲

gchamonlive

22 minutes ago

As long as these constraints apply to everyone, rather than "for thee but not for me", and don't become an instrument for big tech to keep consumers dependent on their infra.

▲

evanjrowley

1 hour ago

Designing for predictable execution flow is one of the advantages of Tenstorrent hardware.

https://clehaxze.tw/gemlog/2025/04-21-programming-tensotrren...

https://clehaxze.tw/gemlog/2026/01-22-the-real-tenstorrent-t...

https://arxiv.org/html/2604.03279