Matrix Multiplications on GPUs Run Faster When Given "Predictable" Data
81 points
by tosh
4 days ago
| 7 comments
| thonking.ai
| HN
dan_sbl
46 minutes ago
> For example, when the GPU is fully idle, nvidia-smi tells me that it’s only pulling 88W of power.

I haven't used a non-laptop GPU in some time, but that is a crazy amount of "idle" power consumption. Is this normal for cards like this?

wildzzz
52 seconds ago
If my GPU is sitting idle, and I mean idle with nothing loaded into its memory, it's sitting at about 18W. If I load in a model that uses nearly all of the memory but that model is idle, it's at 36W. If that model is actively thinking, it's like 118W. I think this is likely due to the GPU being aware that there is real data loaded into memory and turning up the DRAM refresh rate, whereas when nothing is loaded, the dynamic power is as low as possible.
Aurornis
38 minutes ago
Server cards are not optimized for idle power usage. They’re expected to be fully utilized.

For server gear it’s more common to have less dynamic power and voltage switching because it produces more predictable performance and latency.

cmovq
6 minutes ago
For GeForce cards you can get similar behavior by setting “Prefer maximum performance”, which disables some of the low power states.
jayd16
1 hour ago
I can't tell from the blog: is this actually verified, or is it theory followed by numbers showing plausibility?

I could certainly come up with alternative theories about memory compression and prefetching if we were talking about texture reads.

amelius
43 minutes ago
Sounds like a side channel attack waiting to happen.
unglaublich
10 minutes ago
So I guess we'll all be applying a random rotation to our matrices now to obscure their contents, like TurboQuant does. https://arkaung.github.io/interactive-turboquant/#rotation
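A minimal sketch of the rotation idea, assuming a plain 2x2 rotation matrix. This is not TurboQuant's actual procedure, and the matrices `A`, `B`, and `R` are made up for illustration. It only shows that rotating one operand and counter-rotating the other changes the individual entries while leaving the product intact:

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

rng = random.Random(0)
theta = rng.uniform(0.0, 2.0 * math.pi)  # random rotation angle (illustrative)
R = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]  # orthogonal: R times its transpose is I

A = [[1.0, 2.0], [3.0, 4.0]]  # hypothetical operands
B = [[5.0, 6.0], [7.0, 8.0]]

plain = matmul(A, B)
# (A R)(R^T B) equals A B up to floating-point rounding.
rotated = matmul(matmul(A, R), matmul(transpose(R), B))

max_error = max(abs(p - q) for pr, qr in zip(plain, rotated)
                for p, q in zip(pr, qr))
print(max_error)
```

Whether rotated operands draw more or less power on real hardware is not something this sketch can say.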
nzach
1 hour ago
I went in expecting to find 'branch prediction'[0] as the answer, but apparently things are even more complex nowadays.

[0] - https://stackoverflow.com/questions/11227809/why-is-conditio...

gruez
14 minutes ago
>I went in expecting to find 'branch prediction'[0]

GPUs do branch prediction? I thought they didn't bother and instead tried to minimize wasted effort by running large numbers of concurrent threads?

jayd16
1 minute ago
They do texture prefetching, which is sorta similar.
kangalioo
1 hour ago
To be fair, the culprit in the article is _less complex_ than branch prediction: "with random data, bits are flipped often, and bit flips in transistors inherently draw power" is less mental gymnastics than "with random data, the CPU fails to predict the future, causing redundant speculative execution".
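A toy illustration of the bit-flip argument, assuming that the number of bits changing between consecutive 16-bit words is a fair stand-in for switching activity. That is a simplification, not a model of any real GPU datapath, and the word width and the constant pattern are arbitrary choices:

```python
import random

def mean_bit_flips(words):
    """Average number of bits that differ between consecutive words."""
    flips = [bin(a ^ b).count("1") for a, b in zip(words, words[1:])]
    return sum(flips) / len(flips)

rng = random.Random(0)
random_words = [rng.getrandbits(16) for _ in range(10_000)]
constant_words = [0x3C00] * 10_000  # one 16-bit pattern repeated

random_flips = mean_bit_flips(random_words)      # about half of the 16 bits
constant_flips = mean_bit_flips(constant_words)  # no bits ever change
print(random_flips, constant_flips)
```

Uniformly random words flip roughly half their bits on every step, while a repeated value flips none, which is the contrast the quoted explanation relies on.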
bitwize
40 minutes ago
It wouldn't surprise me to see some ML algorithm in silico somewhere to select faster matmul paths on favorable data. Yo dawg, I heard you like AI, so we put some AI in your AI so you can infer while you're inferring.
Nevermark
10 minutes ago
Here is one: an adjustment to weight updates that makes it more likely for weights to stay uniformly distributed.

~257.5 teraflops for normal distribution, versus ~268 teraflops uniform, reported on the first graph.

I would have liked to see a straight graph of performance vs. clock speed, for each type of data. Pick your data statistics, then pick the peak performance clock speed accordingly.

And for actual runs, pick it from a curve sampled before the run.
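For scale, the gap between the two throughput figures quoted above works out as follows. This is plain arithmetic on the numbers in this comment, with no new measurements:

```python
normal_tflops = 257.5   # normal distribution, as quoted above
uniform_tflops = 268.0  # uniform distribution, as quoted above

# Relative throughput advantage of uniform over normally distributed inputs.
speedup_percent = (uniform_tflops / normal_tflops - 1.0) * 100.0
print(f"{speedup_percent:.2f}%")
```

That is a difference of roughly 4%.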

falcor84
22 minutes ago
And there's at least one more level of inception at the data center level, where they use AI to optimize power usage (particularly by predictively controlling cooling, and adaptively rescheduling tasks).
gdevenyi
3 hours ago
People have been noticing the effects of this in local LLM inference. Power limiting seems to improve overall performance!
Aurornis
1 hour ago
This is not observable from LLM inference, where you would not encounter uniform matrices.

Power limiting does not improve performance but it does improve efficiency. You might be able to get 90% of the performance for only 70% of the power usage, for example. It does not make the card go faster though.
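To make the efficiency point concrete, here is the arithmetic for the hypothetical 90% / 70% figures above. They are the example numbers from this comment, not measurements:

```python
relative_performance = 0.90  # fraction of full-power throughput (example figure)
relative_power = 0.70        # fraction of full-power draw (example figure)

# Work done per watt, relative to running without a power limit.
perf_per_watt_gain = relative_performance / relative_power
print(f"{perf_per_watt_gain:.2f}x")
```

Under those assumed numbers the card does about 1.29 times as much work per watt while still finishing each job more slowly.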

Lerc
20 minutes ago
When thermal throttling occurs you can perform faster by running slower.

This is precisely because of the efficiency. The lower efficiency at the higher speed triggers much lower performance sooner.

gchamonlive
2 hours ago
In general, constraints require optimizations and rearchitectures. I'd also expect the RAM shortage, for instance, to have a big impact on the software industry as a whole, especially in games. They will need to make do with what people have: a PS5/Pro or something similar in PC power.
aNoob7000
2 hours ago
I actually think it is a good thing to introduce constraints to AI and the overall tech industry. Hopefully everyone will have to look at improving performance without having to add RAM or increase CPU/GPU performance.
gchamonlive
22 minutes ago
As long as these constraints apply to everyone, rather than "for thee and not for me", and don't become an instrument for big tech to keep consumers dependent on their infra.
evanjrowley
1 hour ago