Local AI is driving the biggest change in laptops in decades
136 points | 21 hours ago | 23 comments | spectrum.ieee.org
jwr
1 hour ago
[-]
The author seems unaware of how well recent Apple laptops run LLMs. This is puzzling and puts into question the validity of anything in this article.
reply
azuanrb
43 minutes ago
[-]
You still need ridiculously high-spec hardware, and at Apple’s prices, that isn’t cheap. Even if you can afford it (most won't), the local models you can run are still limited and they still underperform. It’s much cheaper to pay for a cloud solution and get significantly better results. In my opinion, the article is right. We need a better way to run LLMs locally.
reply
almosthere
27 minutes ago
[-]
$749 for an M4 Air at Amazon right now
reply
tossandthrow
11 minutes ago
[-]
Try running anything interesting on those 8GB of RAM.

You need 96GB or 128GB to do non-trivial things. That is not yet 749 USD.

reply
badc0ffee
2 minutes ago
[-]
Fair enough, but they start at 16GB nowadays.
reply
whazor
47 minutes ago
[-]
But economically, it is still much better to buy a lower spec't laptop and to pay a monthly subscription for AI.

However, I agree with the article that people will run big LLMs on their laptop N years down the line. Especially if hardware outgrows best-in-class LLM model requirements. If a phone could run a 512GB LLM model fast, you would want it.

reply
ignoramous
40 minutes ago
[-]
> economically, it is still much better to buy a lower spec't laptop and to pay a monthly subscription for AI

Uber is economical, too; but folks prefer to own cars, sometimes multiple.

And just as there's a market for all kinds of vanity cars, fast sports cars, expensive supercars... I imagine PCs & laptops will have such a market, too: in probably less than a decade, maybe a £20k laptop running a 671b+ LLM locally will be the norm among pros.

reply
joshred
15 minutes ago
[-]
Paying $30-$70/day to commute is economical?
reply
subjectsigma
8 minutes ago
[-]
> Uber is economical, too

One time I took an Uber to work because my car broke down and was in the shop, and the Uber driver (somewhat pointedly) commented that I must be really rich to commute via Uber because Ubers are so expensive.

reply
fancyfredbot
52 minutes ago
[-]
I think the author is aware of Apple silicon. The article mentions the fact Apple has unified memory and that this is advantageous for running LLMs.
reply
dangus
50 minutes ago
[-]
Then idk why they say that most laptops are bad at running LLMs, Apple has a huge marketshare in the laptop market and even their cheapest laptops are capable in that realm. And their PC competitors are more likely to be generously specced out in terms of included memory.

> However, for the average laptop that’s over a year old, the number of useful AI models you can run locally on your PC is close to zero.

This straight up isn’t true.

reply
literalAardvark
22 minutes ago
[-]
Apple has a 10-18% market share for laptops. That's significant but it certainly isn't "most".

Most laptops can run at best a 7-14b model, even if you buy one with a high spec graphics chip. These are not useful models unless you're writing spam.

Most desktops have a decent amount of system memory but that can't be used for running LLMs at a useful speed, especially since the stuff you could run in 32-64GB RAM would need lots of interaction and hand holding.

And that's for the easy part, inference. Training is much more expensive.
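A back-of-the-envelope way to see where a ceiling like 7-14b comes from: the weights alone take roughly (parameter count × bits per weight ÷ 8) bytes. The sketch below is only an illustration; the function names are hypothetical, and the 20% allowance for context and activations is a made-up placeholder, not a measured figure.

```python
def weight_footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits(params_billions: float, bits_per_weight: float, memory_gb: float,
         overhead_fraction: float = 0.2) -> bool:
    """True if the weights plus an assumed overhead allowance fit in memory_gb.

    overhead_fraction is a placeholder for KV cache, activations and runtime
    buffers; the real figure depends on context length and the runtime used.
    """
    needed_gb = weight_footprint_gb(params_billions, bits_per_weight) * (1 + overhead_fraction)
    return needed_gb <= memory_gb


if __name__ == "__main__":
    for params in (7, 14, 70):
        for bits in (16, 8, 4):
            size = weight_footprint_gb(params, bits)
            print(f"{params}b at {bits}-bit: about {size:.1f} GB of weights, "
                  f"fits in 16 GB: {fits(params, bits, 16)}")
```

Under these assumptions a 14b model needs about 28 GB for weights at 16 bits and about 7 GB at 4 bits, which is why 16GB machines top out around that size.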

reply
fancyfredbot
22 minutes ago
[-]
Most laptops have 16GB of RAM or less. A little more than a year ago I think the base model Mac laptop had 8GB of RAM which really isn't fantastic for running LLMs.
reply
andai
30 minutes ago
[-]
So I'm hearing a lot of people running LLMs on Apple hardware. But is there actually anything useful you can run? Does it run at a usable speed? And is it worth the cost? Because the last time I checked the answer to all three questions appeared to be no.

Though maybe it depends on what you're doing? (Although if you're doing something simple like embeddings, then you don't need the Apple hardware in the first place.)

reply
layer8
43 minutes ago
[-]
By “PC”, they mean non-Apple devices.

Also, macOS only has around 10% desktop market share globally.

reply
cmxch
2 minutes ago
[-]
Only if you want to take all the proprietary baggage and telemetry that comes with Apple platforms by default.

A Lenovo T15g with a 16gb 3080 mobile doesn’t do too badly and will run more than just Windows.

reply
dangus
50 minutes ago
[-]
Yeah, any Mac system specced with a decent amount of RAM since the M1 will run LLMs locally very well. And that’s exactly how the built-in Apple Intelligence service works: when enabled, it downloads a smallish local model. Since all Macs since the M1 have very fast memory available to the integrated GPU, they’re very good at AI.

The article kinda sucks at explaining how NPUs aren’t really even needed, they just have potential to make things more efficient in the future rather than depending on the power consumption involved with running your GPU.

reply
seunosewa
3 hours ago
[-]
"How many TOPS do you need to run state-of-the-art models with hundreds of millions of parameters? No one knows exactly."

What's he talking about? It's trivial to calculate that.

reply
RobotToaster
58 minutes ago
[-]
Isn't the ability to run it more dependent on (V)RAM? With TOPS just dictating the speed at which it runs?
reply
NitpickLawyer
10 minutes ago
[-]
A good rule of thumb is that PP (Prompt Processing) is compute bound while TG (Token Generation) is (V)RAM speed bound.
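A rough sketch of that rule of thumb, assuming a dense model, the common approximation of about 2 operations per parameter per token, and that every weight is read once per generated token. The function names and the example hardware numbers are hypothetical, and advertised TOPS are rarely fully usable in practice.

```python
def prefill_seconds(params: float, prompt_tokens: int, ops_per_second: float) -> float:
    """Compute-bound estimate for prompt processing (PP).

    Assumes roughly 2 operations per parameter per prompt token.
    """
    return 2.0 * params * prompt_tokens / ops_per_second


def decode_tokens_per_second(model_bytes: float, bytes_per_second: float) -> float:
    """Bandwidth-bound upper limit for token generation (TG).

    Assumes each generated token requires reading all weights once.
    """
    return bytes_per_second / model_bytes


if __name__ == "__main__":
    params = 8e9                    # hypothetical 8b dense model
    model_bytes = params * 4 / 8    # 4-bit weights
    print("prefill, 1000 tokens at 40 TOPS:",
          prefill_seconds(params, 1000, 40e12), "s")
    print("decode at 100 GB/s:",
          decode_tokens_per_second(model_bytes, 100e9), "tokens/s")
```

Doubling TOPS in this model halves the prefill time but leaves the decode rate unchanged, which is the point of the PP/TG split.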
reply
zozbot234
18 minutes ago
[-]
Strictly speaking, you don't need that much VRAM or even plain old RAM - just enough to store your context and model activations. It's just that as you run with less and less (V)RAM you'll start to bottleneck on things like SSD transfer bandwidth and your inference speed goes down to a crawl. But even that may or may not be an issue depending on your exact requirements: perhaps you don't need your answer instantly and can wait while it gets computed in the background. Or maybe you're running with the latest PCIe 5 storage which overall gives you comparable bandwidth to something like DDR3/DDR4 memory.
reply
swyx
29 minutes ago
[-]
> state-of-the-art models

> hundreds of millions of parameters

lol

lmao, even

reply
mattas
3 hours ago
[-]
See: "3D TVs are driving the biggest change in TVs in decades"
reply
eleventyseven
3 hours ago
[-]
A lazy, easy cheap shot. But do you deny that these aspects from the article are coming? Or that they'll still be here in 5 years?

- Addition of more—and faster—memory.

- Consolidation of memory.

- Combination of chips on the same silicon.

All of these are also happening for non AI reasons. The move to SoC that really started with the M1 wasn't because of AI, but unified memory being the default is something we will see in 5 years. Unlike 3D TV.

reply
technion
1 hour ago
[-]
We just had a series of articles and sysadmin outcry that major vendors were bringing 8gb laptops back to standard models because of the ram prices. In the short term, we're seeing a reduction.
reply
estimator7292
2 hours ago
[-]
More memory is absolutely not coming in the near future. Nobody can afford it.
reply
blibble
2 hours ago
[-]
> Addition of more—and faster—memory.

probably not after scam altman bought up half the world's supply for his shit company

reply
MisterTea
1 hour ago
[-]
> The move to SoC that really started with the M1

No it did not. Numerous SoCs came before it, and the move was inevitable in this space.

reply
ToucanLoucan
2 hours ago
[-]
In order:

- People wanting more memory is not a novel feature. I am excited to find out how many people immediately want to disable the AI nonsense to free up memory for things they actually want to do.

- Same answer.

- I think the drive towards SoCs has been happening already. Apple's M-series utterly demolishes every PC chip apart from the absolute bleeding edge available, includes dedicated memory and processors for ML tasks, and it's mature technology. Been there for years. To the extent PC makers are chasing this, I would say it's far more in response to that than anything to do with AI.

reply
j45
2 hours ago
[-]
This article is just saying more laptops will have power-efficient GPUs in them. A bit better than 3D TVs.

They might not use Apple silicon often. Other options are encouraging.

reply
Morromist
18 hours ago
[-]
I was in the market for a laptop this month. Many new laptops now advertise AI features like this "HP OmniBook 5 Next Gen AI PC" which advertises:

"SNAPDRAGON X PLUS PROCESSOR - Achieve more everyday with responsive performance for seamless multitasking with AI tools that enhance productivity and connectivity while providing long battery life"

I don't want this garbage on my laptop, especially when it's running off its battery! Running AI on your laptop is like playing Starcraft Remastered on the Xbox or Factorio on your Steam Deck. I hear you can play DOOM on a pregnancy test too. Sure, you can, but it's just going to be a tedious, inferior experience.

Really, this is just a fine example of how overhyped AI is right now.

reply
Legend2440
18 hours ago
[-]
Laptop manufacturers are too desperate to cash in on the AI craze. There's nothing special about an 'AI PC'. It's just a regular PC with Windows Copilot... which is a standard Windows feature anyway.

>I don't want this garbage on my laptop, especially when it's running off its battery!

The one bit of good news is it's not going to impact your battery life because it doesn't do any on-device processing. It's just calling an LLM in the cloud.

reply
14113
2 hours ago
[-]
That's not quite correct. Snapdragon chips that are advertised as being good for "AI" also come with the Hexagon DSP, which is now used for (or targeted at) AI applications. It's essentially a separate vector processor with large vector sizes.
reply
zamadatix
17 hours ago
[-]
> It's just a regular PC with Windows Copilot... which is a standard Windows feature anyway.

"AI PC" branded devices get "Copilot+" and additional crap that comes with that due to the NPU. Despite desktops having GPUs with up to 50x more TOPS than the requirement, they don't get all that for some reason https://www.thurrott.com/mobile/copilot-pc/323616/microsoft-...

reply
robocat
2 hours ago
[-]
Is Microsoft trying to help NPU chip makers?

When is Wintel going to finally happen?

Microsoft has roughly $102 billion in cash (+ short-term investments). Intel’s market value is approximately $176 billion.

I've never really understood why Microsoft helped Intel's bottom line over decades.

With Azure, Microsoft has even more reason to buy Intel.

reply
marcus_holmes
17 hours ago
[-]
Doesn't this lead to a lot of tension between the hardware makers and Microsoft?

MS wants everyone to run Copilot on their shiny new data centre, so they can collect the data on the way.

Laptop manufacturers are making laptops that can run an LLM locally, but there's no point in that unless there's a local LLM to run (and Windows won't have that because Copilot). Are they going to be pre-installing Llama on new laptops?

Are we going to see a new power user / normal user split? Where power users buy laptops with LLMs installed, that can run them, and normal folks buy something that can call Copilot?

Any ideas?

reply
zdragnar
17 hours ago
[-]
It isn't just copilot that these laptops come with; manufacturers are already putting their own AI chat apps as well.

For example, the LG gram I recently got came with just such an app named Chat, though the "ai button" on the keyboard (really just right alt or control, I forget which) defaults to copilot.

If there's any tension at all, it's just who gets to be the default app for the "ai button" on the keyboard that I assume almost nobody actually uses.

reply
marcus_holmes
16 hours ago
[-]
Interesting. Yeah, that'll be the argument
reply
autoexec
17 hours ago
[-]
> MS wants everyone to run Copilot on their shiny new data centre, so they can collect the data on the way.

MS doesn't care where your data is, they're happy to go digging through your C drive to collect/mine whatever they want, assuming you can avoid all the dark patterns they use to push you to save everything on OneDrive anyway and they'll record all your interactions with any other AI using Recall

reply
marcus_holmes
17 hours ago
[-]
I had assumed that they needed the usage to justify the investment in the data centre, but you could be right and they don't care.
reply
eterm
3 hours ago
[-]
It's just marketing. The laptop makers will market it as if your laptop power makes a difference knowing full well that it's offloaded to the cloud.

For a slightly more charitable perspective, agentic AI means that there is still a bunch of stuff happening on the local machine, it's just not the inference itself.

reply
eleventyseven
2 hours ago
[-]
There's nothing special about what Intel calls an AI PC; it has lowered the bar so vendors can market it. Ollama can run a 4b model plenty fine on Tiger Lake with 8GB of classic RAM.

But unified memory IS truly what makes an AI ready PC. The Apple Silicon proves that. People are willing to pay the premium, and I suspect unified memory will still be around and bringing us benefits even if no one cares about LLMs in 5 years.

reply
autoexec
17 hours ago
[-]
Even collecting and sending all that data to the cloud is going to drain battery life. I'd really rather my devices only do what I ask them to than have AI running in the background all the time trying to be helpful or just silently collecting data.
reply
Legend2440
17 hours ago
[-]
Copilot is just ChatGPT as an app.

If you don't use it, it will have no impact on your device. And it's not sending your data to the cloud except for anything you paste into it.

reply
dijit
1 hour ago
[-]
So, the new AI features like recall don’t exist?

Windows is going more and more into AI and embedding it into the core of the OS as much as it can. It’s not “an app”; even if that were true now, it wouldn't be true for very long. The strategy is well communicated.

reply
sandworm101
17 hours ago
[-]
>> I'd really rather my devices only do what I ask them to

Linux hears your cry. You have a choice. Make it.

reply
benbristow
1 hour ago
[-]
Unfortunately still loads of hurdles for most people.

AAA Games with anti-cheat that don't support Linux.

Video editing (DaVinci Resolve exists but is a pain to get up and running on many distros, KDenLive/OpenShot don't really cut it for most)

Adobe Suite (Photoshop/Lightroom specifically, and Premiere for Video Editing) - would like to see Affinity support Linux but hasn't happened so far. GIMP and DarkTable aren't really substitutions unless you pour a lot of time into them.

Tried moving to Linux on my laptop this past month, made it a month before a reinstall of Windows 11. Had issues with the WiFi chip (managed to fix it but had to edit config files deep in the system, not ideal); on Fedora with LUKS encryption, the keyboard wouldn't work to input the encryption key after a kernel update; and there's no Windows Hello-like support (face ID). Had the most success with EndeavourOS but running Arch is a chore for most.

It's getting there, best it's ever been, but there are still hurdles.

reply
cultofmetatron
35 minutes ago
[-]
> AAA Games with anti-cheat that don't support Linux.

I really don't understand people that want to play games so badly that they are willing to install a literal rootkit on their devices. I can understand if you're a pro gamer but it feels stupid to do it otherwise.

reply
benbristow
33 minutes ago
[-]
Most of the time they're not really informed that they are. I know Valorant does (Riot Games), one I've avoided in the past because of it.

But a lot of the time it's peer-pressure for wanting to play with friends who couldn't care less.

reply
grayhatter
57 minutes ago
[-]
According to my friends, Arc Raiders works well on Linux. So it's very much just a small selection of AAA games that are excluded so they can run anti-cheat that probably doesn't even work. Can you name a AAA game you want to play that Proton says is incompatible?

Gimp isn't a solution, sure but it works for what I need. Darktable does way more than I've ever wanted, so I can forgive it for the one time it crashed. Inkscape and blender both exceed my needs as well.

And Adobe is so user hostile, that I feel I need to call you a mean name to prove how I feel.... dummy!

Yes, I already feel bad, and I'm sorry. But trolling aside, listing applications that treat users like shit, aren't reasons to stay on the platform that also treats you like shit.

I get it, sometimes, being treated like shit is worth it because it's easier now that you're used to being disrespected. But an aversion to the effort it'd take for you to climb the learning curve of something different, isn't valid reason to help the disrespectful trash companies making the world worse, recruit more people for them to treat like trash.

Just because you use it, doesn't make it worth recommending.

reply
benbristow
54 minutes ago
[-]
I don't really PC game anymore, use my Xbox or a few older games my laptop's iGPU can handle, not at the moment anyway. Battlefield 6 is a big one recently that if I had a gaming PC set-up I'd probably want to play.

I know Adobe are... c-words, but their software is industry standard for a reason.

reply
grayhatter
41 minutes ago
[-]
> Battlefield 6 is a big one recently that if I had a gaming PC set-up I'd probably want to play.

We definitely play very different games, I wouldn't touch it if you paid me. So I'm sure we both have a bit of sample bias in our expected rates of linux compatibility. Especially since EA is another company like Adobe. Also, the internet seems to think they have a cheating problem. I wonder how bad it really is, and if it's worth the cost of the anti-cheat.

They're industry standard because they were first. Not necessarily because they were better. They do have a feature set that's near impossible to beat, not even I can pretend like they don't. I'm just saying, respect and fairness is more important to me, than content aware fill ever will be.

Also, doesn't the Adobe suite work on Linux?

reply
benbristow
37 minutes ago
[-]
I think older versions do, like CS6 through WINE.

Photoshop CC 2024 apparently works somewhat, but no GPU support and the removal tool doesn't work apparently.

https://appdb.winehq.org/objectManager.php?sClass=version&iI...

Basically, no.

reply
sixothree
2 hours ago
[-]
Part of me is starting to think Valve is going to be the best thing to happen to Linux (in this regard) since Ubuntu.
reply
bitwize
18 hours ago
[-]
AI PCs also have NPUs which I guess provide accelerated matmuls, albeit less accelerated than a good discrete GPU.
reply
neves
2 hours ago
[-]
I have a Snapdragon laptop and it is the best I've ever had. But the NPU is really almost useless.

This is a nice companion to the article: https://www.pcworld.com/article/2965927/the-great-npu-failur...

reply
dijit
1 hour ago
[-]
Agreed, I have the ARM based T14s for work.

The thing is nowhere near the performance of a MacBook, but it's silent and the battery lasts ages, which is a far cry from the same laptop with an Intel CPU, which is what many are running.

Company removes a lot of the AI bloat though.

reply
layer8
32 minutes ago
[-]
It’s true that the AI marketing is largely nonsense, but the NPUs also don’t hurt, and you don’t have to make use of them.
reply
dpedu
1 hour ago
[-]
> Running AI on your laptop is like playing Starcraft Remastered on the Xbox

A great analogy because there is Starcraft for a console - Nintendo 64 - and it is quite awkward. Split-screen multiplayer included.

reply
pluralmonad
1 hour ago
[-]
Factorio runs really well on the deck though...

But yeah, fresh install of OS is a must for any new computer.

reply
tracerbulletx
2 hours ago
[-]
This mostly just shows you how far behind the M1 (which came out 5 years ago) all the non Apple laptops are.
reply
blazingbanana
2 hours ago
[-]
Was never really into Apple hardware (mainly the price), however I recently got an M1 Mac Mini and an iPhone for app development, and the inference speed for, as you say, a 5-year-old chip is actually crazy.

If they made the M series fully open for Linux (I know Asahi is working away) I probably would never buy another non-M series processor again.

reply
dpedu
1 hour ago
[-]
I got an M1 Mac Mini somewhat recently as well, to replace my ~2012 Mac Mini that I use as a media center PC. And frankly, it's overkill. Used ones can be had for $200-$300 USD, lower side with cosmetic damage. An absolute steal, IMO.
reply
jeffbee
1 hour ago
[-]
You can still get an M1 Macbook Air at retail for $599 ($300 for refurbs), which is a Chromebook price for a laptop that is better in pretty much every respect than any Chromebook.
reply
tengbretson
1 hour ago
[-]
Outside of Apple laptops (and arguably the Ryzen AI MAX 390), an "AI ready" laptop is simply marketing speak for "is capable of making HTTP requests."
reply
gamblor956
9 minutes ago
[-]
The "AI laptop" boom is already fading. It turns out that LLMs, local or otherwise, just aren't very useful.

Like Big Data, LLMs are useful in a small niche of areas, like poorly summarizing meeting notes, or grammar check at a middle-school level.

On LLMs for coding tasks: I asked a programmer why they loved Claude and he showed me the output. Twenty years ago, that kind of code would have gotten someone PIP'd. Today it's considered better than most junior programmers...which is a sign of how far programming standards have fallen, and explains why most programs and apps are such buggy pieces of sh$t these days.

reply
TrackerFF
2 hours ago
[-]
With the wild RAM prices, which btw are probably going to last through 2026, I expect 8 GB of RAM to be the new standard going forward.

32 GB ram will be for enthusiasts with deep pockets, and professionals. Anything over that, exclusively professionals.

The conspiracy theorist inside me is telling me that big AI companies like OpenAI would rather see that people are using their puny laptops as terminals / shells only, to reach sky-based models, than to let them have beefy laptops and local models.

reply
meisel
1 hour ago
[-]
I think only a small percentage of users care enough about running LLMs locally to pay for extra hardware for it, put up with slower and lower-quality responses, etc. It’ll never be as good as non-local offerings, and is more hassle.
reply
juancn
3 hours ago
[-]
The price of RAM is going to throw a wrench at that
reply
aappleby
18 hours ago
[-]
I predict we will see compute-in-flash before we see cheap laptops with 128+ gigs of ram.
reply
14113
2 hours ago
[-]
There was a company that did compute-in-dram, which was recently acquired by Qualcomm: https://www.emergentmind.com/topics/upmem-pim-system
reply
zamadatix
17 hours ago
[-]
I can't tell if this is optimism for compute-in-flash or pessimism with how RAM has been going lately!
reply
p1esk
17 hours ago
[-]
We’ve had “compute in flash” for a few years now: https://mythic.ai/product/
reply
wkat4242
18 hours ago
[-]
Yeah, especially given what is happening in the memory market
reply
noosphr
18 hours ago
[-]
Feast and famine.

In three years we will be swimming in more ram than we know what to do with.

reply
fallat
18 hours ago
[-]
Kind of feel that's already the case today... 4GB I find is still plenty for even business workloads.
reply
autoexec
17 hours ago
[-]
Video games have driven the need for hardware more than office work. Sadly games are already being scaled back and more time is being spent on optimization instead of content since consumers can't be expected to have the kind of RAM available they normally would and everyone will be forced to make do with whatever RAM they have for a long time.
reply
znpy
14 hours ago
[-]
That might not be the case. The kind of memory that will flood the second-hand market might not be the kind of memory we can stuff in laptops or even desktop systems.
reply
aitchnyu
17 hours ago
[-]
Memristors are (IME) missing from the news. They promised to act as both persistent storage and fast RAM.
reply
ACCount37
3 hours ago
[-]
If only memristors weren't vaporware that has "shown promise" for 3 decades now and went nowhere.
reply
znpy
14 hours ago
[-]
You could get 128gb ram laptops from the time ddr4 came around: workstation class laptops with 4 ram slots would happily take 128gb of memory.

The fact that nowadays there are few to no laptops with 4 RAM slots is entirely artificial.

reply
mhitza
1 hour ago
[-]
I was musing this summer about whether I should get a refurbed Thinkpad P16 with 96GB of RAM to run VMs purely in memory. Now that 96GB of RAM costs as much as a second P16.
reply
znpy
1 hour ago
[-]
I feel you, so much. I was thinking of getting a second 64GB node for my homelab and I thought I’d save that money… now the RAM alone costs as much as the node, and I’m crying.

Lesson learned: you should always listen to that voice inside your head that says: “but I need it…” lol

reply
pluralmonad
1 hour ago
[-]
I rebuilt a workstation after a failed motherboard a year ago. I was not very excited about being forced to replace it on a day's notice and cheaped out on the RAM (only got 32GB). This is like the third or fourth time I've taught myself the lesson to not pinch pennies when buying equipment/infrastructure assets. It's the second time the lesson was about RAM, so clearly I'm a slow learner.
reply
112233
10 hours ago
[-]
By "we" do you mean consumers? No, "we" will get neither. This is an unexpected, irresistible opportunity to create a new class, by controlling the technology that people are required and are desiring to use (large genAI) with a comprehensive moat: financial, legislative and technological. Why make affordable devices that enable at least partial autonomy? Of course the focus will be on better remote operation (networking, on-device secure computation, advancing a narrative that equates local computation with extremism and sociopathy).
reply
spullara
16 hours ago
[-]
I'm running GPT-OSS 120B on a MacBook Pro M3 Max w/128 GB. It is pretty good, not great, but better than nothing when the wifi on the plane basically doesn't work.
reply
socketcluster
18 hours ago
[-]
I feel like there's no point to get a graphics card nowadays. Clearly, graphics cards are optimized for graphics; they just happened to be good for AI but based on the increased significance of AI, I'd be surprised if we don't get more specialized chips and specialized machines just for LLMs. One for LLMs, a different one for stable diffusion.

With graphics processing, you need a lot of bandwidth to get stuff in and out of the graphics card for rendering on a high-resolution screen, lots of pixels, lots of refreshes, lots of bandwidth... With LLMs, a relatively small amount of text goes in and a relatively small amount of text comes out over a reasonably long amount of time. The amount of internal processing is huge relative to the size of input and output. I think NVIDIA and a few other companies already started going down that route.

But probably graphics cards will still be useful for stable diffusion; especially AI-generated videos as the inputs and output bandwidth is much higher.

reply
ACCount37
3 hours ago
[-]
Nah, that's just plain wrong.

First, GPGPU is powerful and flexible. You can make an "AI-specific accelerator", but it wouldn't be much simpler or much more power-efficient - while being a lot less flexible. And since you need to run traditional graphics and AI workloads both in consumer hardware? It makes sense to run both on the same hardware.

And bandwidth? GPUs are notorious for not being bandwidth starved. 4K@60FPS seems like a lot of data to push in or out, but it's nothing compared to how fast modern PCIe 5.0 x16 goes. AI accelerators are more of the same.
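The 4K@60 versus PCIe 5.0 x16 comparison can be checked with simple arithmetic. This is a sketch assuming uncompressed 24-bit pixels and the raw PCIe 5.0 link rate (32 GT/s per lane with 128b/130b encoding), per direction and ignoring protocol overhead; the function names are hypothetical.

```python
def display_bytes_per_second(width: int, height: int, bytes_per_pixel: int, fps: int) -> int:
    """Raw bandwidth of an uncompressed video stream."""
    return width * height * bytes_per_pixel * fps


def pcie_bytes_per_second(gigatransfers_per_second: float, lanes: int,
                          encoding_efficiency: float = 128 / 130) -> float:
    """Raw per-direction link bandwidth, before protocol overhead."""
    return gigatransfers_per_second * 1e9 * lanes * encoding_efficiency / 8


if __name__ == "__main__":
    frames = display_bytes_per_second(3840, 2160, 3, 60)
    link = pcie_bytes_per_second(32, 16)
    print(f"4K@60, 24-bit: {frames / 1e9:.2f} GB/s")
    print(f"PCIe 5.0 x16:  {link / 1e9:.1f} GB/s")
    print(f"link is about {link / frames:.0f}x the display stream")
```

Under these assumptions the display stream is about 1.5 GB/s against roughly 63 GB/s for the link. Note that this only covers the host-to-GPU link, not the GPU's own memory bandwidth.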

reply
djsjajah
1 hour ago
[-]
GPUs might not be bandwidth starved most of the time, but they absolutely are when generating text from an LLM. It’s the whole reason why low-precision floating point numbers are being pushed by NVIDIA.
reply
Legend2440
18 hours ago
[-]
LLMs are enormously bandwidth hungry. You have to shuffle your 800GB neural network in and out of memory for every token, which can take more time/energy than actually doing the matrix multiplies. GPUs are almost not high bandwidth enough.
reply
socketcluster
12 hours ago
[-]
But even so, for a single user, the output rate for a very fast LLM would be like 100 tokens per second. With graphics, we're talking like 2 million pixels, 60 times a second; 120 million pixels per second for a standard high res screen. Big difference between 100 tokens vs 120 million pixels.

24 bit pixels gives 16 million possible colors... For tokens, it's probably enough to represent every word of the entire vocabulary of every major national language on earth combined.

> You have to shuffle your 800GB neural network in and out of memory

Do you really though? That seems more like a constraint imposed by graphics cards. A specialized AI chip would just keep the weights and all parameters in memory/hardware right where they are and update them in-situ. It seems a lot more efficient.

I think that it's because graphics cards have such high bandwidth that people decided to use this approach but it seems suboptimal.

But if we want to be optimal; then ideally, only the inputs and outputs would need to move in and out of the chip. This shuffling should be seen as an inefficiency; a tradeoff to get a certain kind of flexibility in the software stack... But you waste a huge amount of CPU cycles moving data between RAM, CPU cache and Graphics card memory.

reply
djsjajah
51 minutes ago
[-]
> Do you really though?

Yes.

It stays in the HBM, but it needs to get shuffled to the place where the computation can actually be done. It’s a lot like a normal CPU. The CPU can’t do anything with data in system memory; it has to be loaded into a CPU register. For every token that is generated, a dense LLM has to read every parameter in the model.

reply
visarga
1 hour ago
[-]
If we did that it would be much more expensive. Keeping all weights in SRAM is what Groq does, for example.
reply
Zambyte
17 hours ago
[-]
This doesn't seem right. Where is it shuffling to and from? My drives aren't fast enough to load the model every token that fast, and I don't have enough system memory to unload models to.
reply
Legend2440
17 hours ago
[-]
From VRAM to the tensor cores and back. On a modern GPU you can have 1-2 TB moving around inside the GPU every second.

This is why they use high bandwidth memory for VRAM.

reply
Zambyte
5 hours ago
[-]
This makes sense now, thanks!
reply
zamadatix
17 hours ago
[-]
If you're using a MoE model like DeepSeek V3 the full model is 671 GB but only 37 GB are active per token, so it's more like running a 37 GB model from the memory bandwidth perspective. If you do a quant of that it could e.g. be more like 18 GB.
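The arithmetic behind those figures, as a small sketch: it assumes roughly 671B total and 37B active parameters stored at 8 bits per weight (which is what makes "671 GB" and "37 GB" line up), and treats quantization as a simple change in bits per weight. The class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MoEModel:
    total_params_billions: float
    active_params_billions: float  # parameters actually used per token

    def stored_gb(self, bits_per_weight: float) -> float:
        """Memory needed to hold the full model, in GB."""
        return self.total_params_billions * bits_per_weight / 8

    def read_per_token_gb(self, bits_per_weight: float) -> float:
        """Weights that must be read for each generated token, in GB."""
        return self.active_params_billions * bits_per_weight / 8


if __name__ == "__main__":
    model = MoEModel(total_params_billions=671, active_params_billions=37)
    for bits in (8, 4):
        print(f"{bits}-bit: store {model.stored_gb(bits):.1f} GB, "
              f"read {model.read_per_token_gb(bits):.1f} GB per token")
```

The full model still has to fit somewhere, so MoE helps with bandwidth per token rather than with total capacity.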
reply
smallerize
17 hours ago
[-]
You're probably not using an 800GB model.
reply
p1esk
17 hours ago
[-]
It is right. The shuffling is from CPU memory to GPU memory, and from GPU memory to GPU. If you don’t have enough memory you can’t run the model.
reply
Zambyte
5 hours ago
[-]
How can I observe it being loaded into CPU memory? When I run a 20gb model with ollama, htop reports 3gb of total RAM usage.
reply
zamadatix
4 hours ago
[-]
Think of it like loading a moving truck where:

- The house is the disk

- You are the RAM

- The truck is the VRAM

There won't be a single time you can observe yourself carrying the weight of everything being moved out of the house because that's not what's happening. Instead you can observe yourself taking many tiny loads until everything is finally moved, at which point you yourself should not be loaded as a result of carrying things from the house anymore (but you may be loaded for whatever else you're doing).

Viewing active memory bandwidth can be more complicated than it'd seem to set up, so the easier way is to just view your VRAM usage as you load in the model freshly into the card. The "nvtop" utility can do this for most any GPU on Linux, as well as other stats you might care about as you watch LLMs run.
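
If you'd rather script it than watch nvtop, here is a minimal sketch that polls VRAM usage while the model loads. It swaps nvtop for `nvidia-smi`, so it assumes an NVIDIA card with `nvidia-smi` on the PATH; the helper names are hypothetical and the parser assumes plain integer MiB values in the output.

```python
import subprocess
import time

# nvidia-smi query that prints one "used, total" line (in MiB) per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text: str) -> list[tuple[int, int]]:
    """Parse lines like '1234, 24576' into (used_mib, total_mib) pairs, one per GPU."""
    readings = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue  # skip blank lines
        used, total = (int(field.strip()) for field in line.split(","))
        readings.append((used, total))
    return readings


def watch_vram(interval_s: float = 1.0, samples: int = 30) -> None:
    """Print VRAM usage every interval_s seconds; start it, then load the model."""
    for _ in range(samples):
        output = subprocess.run(
            QUERY, capture_output=True, text=True, check=True
        ).stdout
        for index, (used, total) in enumerate(parse_memory_csv(output)):
            print(f"GPU {index}: {used} / {total} MiB")
        time.sleep(interval_s)
```

Call `watch_vram()` in one terminal and start the model in another; the "used" number should climb in steps as the weights are copied over, while system RAM stays comparatively flat.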

reply
p1esk
4 hours ago
[-]
Depends on map_location arg in torch.load: might be loaded straight to GPU memory
reply
zamadatix
17 hours ago
[-]
> Clearly, graphics cards are optimized for graphics; they just happened to be good for AI

I feel like the reverse has been true since after the Pascal era.

reply
autoexec
17 hours ago
[-]
I don't doubt that there will be specialized chips that make AI easier, but they'll be more expensive than the graphics cards sold to consumers, which means that a lot of companies will just go with graphics cards, either because the extra speed of specialized chips won't be worth the cost, or because they'll be flat out too expensive and priced for the small number of massive spenders who'll shell out insane amounts of money for any/every advantage (whatever they think that means) they can get over everyone else.
reply
seanmcdirmid
18 hours ago
[-]
I’ve been running LLMs on my laptop (M3 Max 64GB) for a year now and I think they are ready, especially with how good mid sized models are getting. I’m pretty sure unified memory and energy efficient GPUs will be more than just a thing on Apple laptops in the next few years.
reply
noman-land
14 minutes ago
[-]
You doing code completion and agentic stuff successfully with local models? Got any tips? I've been out of the game for [checks watch] a few months and am behind on the latest. Is Cline the move?
reply
allovertheworld
18 hours ago
[-]
Only because of Apple's unified memory architecture. The groundwork is there, we just need memory to be cheaper so we can fit 512+GB now ;)
reply
seanmcdirmid
17 hours ago
[-]
Memory prices will rise short term and generally fall long term. Even with the current supply hiccup, the answer is to just build out more capacity (which will happen if there is healthy competition). I mean, I expect the other mobile chip providers to adopt unified architecture, beefy on-chip GPU cores, and lots of bandwidth to connect them to memory (at the Max or Ultra level, at least). I think AMD is already doing UM at least?
reply
spwa4
7 hours ago
[-]
> Memory prices will rise short term and generally fall long term, even with the current supply hiccup the answer is to just build out more capacity (which will happen if there is healthy competition)

Don't worry! Sam Altman is on it. Making sure there never is healthy competition that is.

https://www.mooreslawisdead.com/post/sam-altman-s-dirty-dram...

reply
seanmcdirmid
1 hour ago
[-]
We’ve been through multiple DRAM scarcity/surplus cycles in the last couple of decades. Why do we think it will be different now?
reply
re-thc
46 minutes ago
[-]
> Why do we think it will be different now?

Margins. AI usage can pay a lot more. Even if they sell less, they can still be more profitable.

In the past there wasn’t a high margin usage. Servers didn’t charge such a high premium.

reply
zozbot234
4 minutes ago
[-]
High margins are exactly what should create a strong incentive to build more capacity. But that dynamic has been tamped down so far because we're all scared of a possible AI bubble that might pop at any moment.
reply
bfrog
18 hours ago
[-]
I suppose it depends on the model. For code it was useless. As a lossy copy of an interactive Wikipedia it could be ok; not good or great, just ok.

Maybe for creative suggestions and editing it’d be ok.

reply
fwipsy
18 hours ago
[-]
Seems like wishful thinking.

> How many TOPS do you need to run state-of-the-art models with hundreds of millions of parameters? No one knows exactly.

Why not extrapolate from open-source AIs which are available? The most powerful open-source AI (which I know of) is Kimi K2, which is >600gb. Running this at acceptable speed requires 600+gb of GPU/NPU memory. Even $2000-3000 AI-focused PCs like the DGX Spark or Strix Halo typically top out at 128gb. Frontier models will only run on something that costs many times a typical consumer PC, and that is only going to get worse with RAM pricing.

In 2010 the typical consumer PC had 2-4gb of RAM. Now the typical PC has 12-16gb. This suggests RAM size doubling perhaps every 5 years at best. If that's the case, we're 25-30 years away from the typical PC having enough RAM to run Kimi K2.
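
The arithmetic behind that estimate can be sketched as follows. This is only the extrapolation as stated, not a forecast: it assumes a constant 5-year doubling period, a 12-16gb starting point, and a 600gb target, all taken from the comment itself. The function name is made up for illustration.

```python
import math


def years_until(current_gb: float, target_gb: float, doubling_years: float) -> float:
    """Years for RAM to grow from current_gb to target_gb at a fixed doubling period."""
    if current_gb <= 0 or target_gb <= 0 or doubling_years <= 0:
        raise ValueError("all arguments must be positive")
    doublings = math.log2(target_gb / current_gb)
    return max(0.0, doublings * doubling_years)


# Assumptions from the comment: typical PC has 12-16 GB today, target is ~600 GB,
# and RAM doubles every 5 years.
for current_gb in (12, 16):
    years = years_until(current_gb, 600, 5)
    print(f"from {current_gb} GB: about {years:.0f} years")
```

Both starting points need a bit more than five doublings, which is where the 25-30 year range comes from.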

But the typical user will never need that much RAM for basic web browsing, etc. The typical computer RAM size is not going to keep growing indefinitely.

What about cheaper models? It may be possible to run a "good enough" model on consumer hardware eventually. But I suspect that for at least 10-15 years, typical consumers (HN readers may not be typical!) will prefer capability, cheapness, and especially reliability (not making mistakes) over being able to run the model locally. (Yes AI datacenters are being subsidized by investors; but they will remain cheaper, even if that ends, due to economies of scale.)

The economics dictate that AI PCs are going to remain a niche product, similar to gaming PCs. Useful AI capability is just too expensive to add to every PC by default. It's like saying flying is so important, everyone should own an airplane. For at least a decade, likely two, it's just not cost-effective.

reply
sipjca
17 hours ago
[-]
> It may be possible to run a "good enough" model on consumer hardware eventually

10-15 years?!!!! What is the definition of good enough? Qwen3 8B or 30B-A3B are quite capable models which run on a lot of hardware even today. SOTA is not just getting bigger, it's also getting more intelligent and running more efficiently. There have been massive gains in intelligence at the smaller model sizes. It is just highly task dependent. Arguably some of these models are "good enough" already, and the level of intelligence and instruction following is much better than even 1 year ago. Sure, not Opus 4.5 level, but much could still be done without that level of intelligence.

reply
fwipsy
6 hours ago
[-]
"Good enough" has to mean users won't be frequently frustrated if they transition to it from a frontier model.

> it is highly task dependent... much could be done without that level of intelligence

This is an enthusiast's glass-half-full perspective, but casual end users are gonna have a glass-half-empty perspective. Qwen3-8B is impressive, but how many people use it as a daily driver? Most casual users will toss it as soon as it screws up once or twice.

The phrase you quoted in particular was imprecise (sorry) but my argument as a whole still stands. Replace "consumer hardware" with "typical PCs" - think $500 bestseller laptops from Walmart. AI PCs will remain niche luxury products, like gaming PCs. But gaming PCs benefit from being part of gaming culture and because cloud gaming adds input latency. Neither of these affects AI much.

reply
epicureanideal
17 hours ago
[-]
You may be correct, but I wonder if we'll see Mac Mini sized external AI boxes that do have the 1TB of RAM and other hardware for running local models.

Maybe 100% of computer users wouldn't have one, but maybe 10-20% of power users would, including programmers who want to keep their personal code out of the training set, and so on.

I would not be surprised though if some consumer application made it desirable for each individual, or each family, to have local AI compute.

It's interesting to note that everyone owns their own computer, even though a personal computer sits idle half the day, and many personal computers hardly ever run at 80% of their CPU capacity. So the inefficiency of owning a personal AI server may not be as much of a barrier as it would seem.

reply
saltcured
2 hours ago
[-]
But will it ever lead to a Mac Mini-priced external AI box? Or will this always be a premium "pro" tier that seems to rival used car prices?
reply
seanmcdirmid
17 hours ago
[-]
> but I wonder if we'll see Mac Mini sized external AI boxes that do have the 1TB of RAM

Isn't that the Mac Studio already? Ok, it seems to max at 512 GB.

reply
marcus_holmes
17 hours ago
[-]
> In 2010 the typical consumer PC had 2-4gb of RAM. Now the typical PC has 12-16gb. This suggests RAM size doubling perhaps every 5 years at best. If that's the case, we're 25-30 years away from the typical PC having enough RAM to run Kimi K2.

Part of the reason that RAM isn't growing faster is that there's no need for that much RAM at the moment. Technically you can put multiple TB of RAM in your machine, but no-one does that because it's a complete waste of money [0]. Unless you're working in a specialist field, 16GB of RAM is enough, and adding more doesn't make anything noticeably faster.

But give it a decent use-case, like running an LLM locally, and you'd find demand for lots more RAM, and that would drive supply, and new technology developments, and in ten years it'll be normal to have 128TB of RAM in a baseline laptop.

Of course, that does require that there is a decent use-case for running an LLM locally, and your point that that is not necessarily true is well-made. I guess we'll find out.

[0] apart from a friend of mine working on crypto who had a desktop Linux box with 4TB of RAM in it.

reply
wkat4242
18 hours ago
[-]
This article is so dumb. It totally ignores the memory price explosion that will make large fast memory laptops unfeasible for years and states stuff like this:

> How many TOPS do you need to run state-of-the-art models with hundreds of millions of parameters? No one knows exactly. It’s not possible to run these models on today’s consumer hardware, so real-world tests just can’t be done.

We know exactly the performance needed for a given responsiveness. TOPS is just a measurement independent of the type of hardware it runs on.

The fewer TOPS, the slower the model runs, so the user experience suffers. Memory bandwidth and latency play a huge role too. And context: increase the context and the LLM becomes much slower.

We don't need to wait for consumer hardware to know how much is needed. We can calculate that for given situations.
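
As a minimal sketch of that kind of calculation, assuming the common rule of thumb of about 2 operations (one multiply, one add) per active parameter per token, and ignoring the extra attention cost over long contexts and all memory-bandwidth limits. The 8B parameter count and the speed targets are hypothetical inputs chosen for illustration, and the function name is made up.

```python
def required_tops(active_params: float, tokens_per_second: float) -> float:
    """Compute needed, in TOPS, assuming ~2 operations per active parameter per token."""
    if active_params <= 0 or tokens_per_second <= 0:
        raise ValueError("arguments must be positive")
    ops_per_second = 2.0 * active_params * tokens_per_second
    return ops_per_second / 1e12


PARAMS = 8e9  # hypothetical 8B-parameter dense model

# Generating text at a comfortable reading speed (assumed 20 tokens/s).
print(f"decode at 20 tok/s: {required_tops(PARAMS, 20):.2f} TOPS")

# Ingesting a 4000-token prompt in 2 seconds means processing 2000 tokens/s.
print(f"prefill 4000 tokens in 2 s: {required_tops(PARAMS, 4000 / 2):.0f} TOPS")
```

Under these assumptions generation needs well under 1 TOPS of compute while fast prompt ingestion needs tens of TOPS, which is one way to see why a longer context makes things feel slower.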

It also pretends small models are not useful at all.

I think the massive cloud investments will put pressure away from local AI unfortunately. That trend makes local memory expensive and all those cloud billions have to be made back so all the vendors are pushing for their cloud subscriptions. I'm sure some functions will be local but the brunt of it will be cloud, sadly.

reply
dcreater
50 seconds ago
[-]
Horrible article. Low effort, low knowledge. Had no idea the bar was so low for an IEEE publication
reply
layer8
23 minutes ago
[-]
The article is from mid-November (and probably was written even earlier), when the RAM price explosion wasn’t as striking yet.
reply
vegabook
18 hours ago
[-]
also, state of the art models have hundreds of _billions_ of parameters.
reply
omneity
18 hours ago
[-]
It tells you about their ambitions..
reply
esses
18 hours ago
[-]
I spent a good 30 seconds trying to figure out what DDS was an acronym for in this context.
reply
zkmon
2 hours ago
[-]
You don't understand the needs of a common laptop user. Define the use cases that require reaching for the laptop instead of the phone that is nearby. Those use cases don't need an LLM for a common laptop user.
reply
superkuh
2 hours ago
[-]
The problem with this is that NPUs have terrible, terrible support in the various software ecosystems because they are unique to their particular SoC or whatever. No consistency even within particular companies.
reply
tehjoker
2 hours ago
[-]
I mean, having a more powerful laptop is great, but at the same time, these guys are calling for a >10x increase in RAM and a far more powerful NPU. How will this affect pricing? How will it affect power management? It made it seem like most of the laptop will be dedicated to gen AI services, which I'm still not entirely convinced are quite THAT useful. I still want a cheap laptop that lasts all day and I also want to be able to tap that device's full power for heavy compute jobs!
reply
j45
17 hours ago
[-]
This must be referring mostly to windows, or non-Apple laptops
reply
gguncth
17 hours ago
[-]
I have no desire to run an LLM on my laptop when I can run one on a computer the size of six football fields.
reply
theshrike79
1 hour ago
[-]
The point is that when you run it on your own hardware you can feed the model your health data, bank statements and private journals and can be 5000% sure they’re not going anywhere
reply
dboreham
58 minutes ago
[-]
Regular people don't understand nor care about any of that. They'll happily take the Faustian bargain.
reply
sandworm101
17 hours ago
[-]
I've been playing around with my own home-built AI server for a couple months now. It is so much better than using a cloud provider. It is the difference between drag racing in your own car, and renting one from a dealership. You are going to learn far more doing things yourself. Your tools will be much more consistent and you will walk away with a far greater understanding of every process.

A basic last-generation PC with something like a 3060ti (12GB) is more than enough to get started. My current rig pulls less than 500w with two cards (3060+5060). And, given the current temperature outside, the rig helps heat my home. So I am not contributing to global warming, water consumption, or any other datacenter-related environmental evil.

reply
HelloUsername
47 minutes ago
[-]
> I am not contributing to global warming

lol

reply