The dependency we have on Anthropic and OpenAI for coding, for instance, is insane. Most accept it because either they don't care, or they just hope Chinese labs will never stop releasing open weights. The business model of open weights is very new, involves some power play between countries and labs, and moves an absurd amount of money without any concrete oversight from most people.
It's a very dangerous gamble. Today incredible value is available to nearly everyone. But it may stop without any warning, for reasons outside our control.
What stops you from running the best open-weight LLMs currently available on consumer-grade hardware for the rest of time? They're good enough for 95% of use cases, and they don't have a use-by date. From what I can see, the "danger" is not having the next tier that comes out, but the impact of that is very low.
For quite a lot of use cases, the current systems arguably do get worse over time if not continually updated. The knowledge cutoff date will start to hurt more and more as the weights age in a hypothetical scenario where you are stuck with them forever.
Coding, one of the most popular use cases today, would not be great if the model only understood, say, Java as of a version from years ago.
This LLM trained only and entirely on pre-1930s texts was able to code Python programs when given only a short example:
Pockets are too deep; it will only change once everyone is out of money.
Side note though, it’s the speed that bothers me more than the reasoning. Qwen 3.5 is awesome, but my Claude subscription can tear through similar workloads an order of magnitude faster than my local LLM can when using Haiku. That’ll matter a lot to some people.
The huge difference from classic open source is that you can't just train an LLM with free time and motivation. You need lots of data and a lot of compute.
I sure want to be wrong on that; I definitely like the open-weight version of the future more.
In the same way, you can imagine the Chinese government pushing the release of DeepSeek etc. to make sure no one thinks the US has "won" and to keep everyone aware that a foreign model might leapfrog in the near future.
At some point, though, if OpenAI/Anthropic/Google plateau or go bust, the open-source sponsorship becomes less likely, as making it open source was a weapon, not a principle.
Effectively they are saying "yeah, don't crowd our data centers with small queries, go ahead and send your frontier questions to our frontier models. Oh, btw, those US models? You can run something about as good for free from us if you want, hah." It's a power and marketing move. It's also insanely smart to keep it up to remain sustainable as a brand, especially given how small their investments into this are.
Look at Anthropic's growing pains. DeepSeek has other hosts spreading their brand for free while they grow. Brilliant, honestly. In my opinion it makes Anthropic and OpenAI look clueless on a lot of levels.
China is playing a different game here. To them this is commoditizing their complement and building goodwill. The Chinese economy doesn't teeter on the brink of collapse to deliver frontier-grade LLMs. Nope, Alibaba just made Qwen because it needs it. It needs efficient models. Similarly, China manufactures and automates so much more than the US ever could. LLMs to them are a topping, not the whole meal like they are in the US.
The compute required to run these models is still very far out of reach for the average consumer, if not the keen enthusiast, so they still sell inference while also earning consumer goodwill for providing open weights.
https://try.works/#why-chinese-ai-labs-went-open-and-will-re...
China? I'm getting ready to watch the URKL (universal robot knockout league) go on. The USA is dicking around with failed robot dogs.
The USA has been a failed country, coasting on massive inertia. An article I can't find anymore surveyed tech avenues and showed the USA excelling in 8 of 64 areas; China was excelling in 56 of 64.
Dodging politics: the power structures in US industry need serious revamping.
Not everything good in our society needs to have a "business model". People still work on it. It's FINE.
Donations. Have you donated lately?
Wikipedia is cheap compared to creating and training models.
I don’t think donations will suffice at all.
As an example, we had millions of web developers download and install Firebug before browsers shipped their own dev tools. Donations over the course of multiple years would have paid my salary for a month if I were not a volunteer.
But from the “it’s fine” point of view, models will be baked into your OS.
Then later, models will be embedded into hardware. Likely only the OS makers' models.
So, the business model of open models is the same as closed models: Sell inference. Open source is marketing for that inference.
https://try.works/#why-chinese-ai-labs-went-open-and-will-re...
Frontier US labs could still have an advantage for a long time, but many use cases would start gravitating towards Chinese models if they 10x the data centers and provide similar quality inference for a third of the cost.
This is what I don't understand either; advertising the knowledge and the more advanced models is also the only explanation that comes to my mind.
For the past month I've been using Gemma 4 locally on an MBP M2 for many search queries (Wikipedia-style questions), and it is really good, fast enough (30-40 t/s), and feels nice because it keeps these queries private. But I don't understand why Google does this, and so I think "we" need to find a better solution where the entire pipeline is open and the compute somehow crowdfunded. Because there will be a time when these local models get more closed, like Android is closing down. One restriction they might enforce in the future could be crippling the models for "sensitive" topics like cybersecurity or health. Or the government could even feel the need to force them to do so.
It builds goodwill, and it shows research prowess.
For China it's different. They need to show Americans, who don't trust them at all because of propaganda, that they have no tricks up their sleeve. It also doesn't hurt when Chinese companies drop, for free, models people can run at home that are about as good as Sonnet. Serious mic drop.
Running AI models on local hardware was exploratory at first, and if it's so easy today it's thanks to open source. It's a little bit coincidental that we have this today, and that mainstream hardware has this capability. The fact that a phone can run very small models is exploratory, or some kind of marketing opportunity at best.
Why would hardware companies ship cards with more AI capabilities (like more VRAM) in the foreseeable future? On what grounds will the marketing for on-device AI keep generating interest? For something this important, it's very uncertain. But above all, it should not depend on such brittle justifications.
Showing goodwill in distribution and research prowess today is positive communication, but it can be exactly the opposite if/when an attack using those small models reaches a high-value target.
For China the cultural difference is so huge, it's difficult to say. I would think they first and foremost need to show everyone inside and outside of China that they match American models. Second, I would say that where Americans prefer a few very powerful companies from the get-go, because those can leverage a lot of capital rapidly to industrialize, China will prefer leveraging a lot of smaller companies exploring many things simultaneously (so doing a lot of research), THEN creating legislation to let only the best (or a few) survive. In the end it's the same result (monopoly or oligopoly), but China may have a stronger core (research) and America may have stronger productive capital, which may prove obsolete... In the long run it's a gamble on either side, again.
I disagree on the second point. I think most Americans don't prefer less competition; that's a bit antithetical to the free market.
I doubt the Chinese government cares as much about controlling a few companies as you think they do.
China has a few things going for it beyond research. They are mission-driven, they actually have needs for this technology, and meeting those needs will advance their entire economy, as they are the world's largest manufacturer. They are also huge exporters and have buckets of customer support across various languages.
China also has considerably stronger infrastructure for electricity, etc. Even with an Nvidia embargo they are doing more than just showing up.
I don't think it's a matter of who "wins". There is no winning. I think China stands to gain far more from LLMs than the US does, and they have proven they don't need the US to do it, even with the US trying to sabotage their every move into the space. The game is already more or less over in my mind.
If anything I see LLMs as having a huge market in China, and now the US can't even sell it to them.
All I care about is: if I have to use this technology, let me run it locally to avoid the surveillance-capitalism aspect. That seems to be the real reason the US has propped up its economy in anticipation of this technology. Yet long term it benefits neither the US nor me.
I don't think local will necessarily be open-weight. And then it's not that different from personal computing: you're giving up the big lucrative corporate mainframe, thin-client model for "sell copies to a ton of individuals."
So it'd be someone else (an Apple, or the next equivalent of 1976 Apple) who'd start eating into that. There are a few on-device things today, but not for much heavy lifting. At first it's a toy; it could maybe become more realized on a still-toy-like basis, like a fully local Alexa; in the future it grows until it eats 80-90% of the OpenAI/Anthropic use cases.
Incumbents would always rather you pay a subscription or per-use forever, but if the market looks big enough, someone will try to disrupt it.
Transmitting text is basically free and instantaneous. The rent (i.e., a GPU in a data center) vs. buy calculus is going to favor rent until buying is a trivial expense, like the $50-100 range.
Even then, an LLM that just works is easier than dealing with your own.
Much like the current Twitter model, being able to put your thumb on the scale of "truth". Bake a stronger bias towards their preferred narrative directly into the model. Could be as "benign" as training it to prefer Azure over AWS. Could be much worse.
How many crowdfunded projects do you know that have raised even one percent of that? Who’s going to be in charge of collecting that scale of money? Perhaps some sort of company formed for the benefit of humanity, which will promise to be a non-profit? Some sort of “Open” AI?
Oh, wait.
I can't say that you are lying, and you are not exactly exaggerating either. It is true that a new SOTA model -- from literal scratch -- would be expensive.
But, and it is not a small but, is the starting point really zero?
Sometimes there are things where the public good is best served with public expenditure.
Not every country is in a crypto-libertarian race to hoard power and wealth.
Read through a 1970s-era issue of Popular Electronics or Byte, and then spend some time surfing /r/LocalLlama. You'll get a sense of real-time deja vu, like you're watching history unfold again.
It's here, right now. I'm running quantized Qwen and Gemma on a decent but three-year-old gaming rig (think RTX 3080 12GB and 32GB RAM). Yes, it's slow, and it has a small context window. But it can (given a proper harness) run through my trip photos and categorize them. It can OCR receipts and summarize spending. It can answer simple questions, analyze code and even write code when little context is required. I could probably get a half-decent autocomplete out of it if I bothered with VS Code integration. "128 GB VRAM on a MacBook Pro or a Strix Halo" is already a minimum viable setup for agentic coding, I think.
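To make the receipt case concrete, here's a minimal sketch of what one step of such a harness can look like, assuming a local llama.cpp `llama-server` (or any other OpenAI-compatible endpoint) on localhost:8080; the port, model name and prompt are my choices, not anything standard:

```python
# Minimal sketch: classify one OCR'd receipt via a local llama.cpp
# server (any OpenAI-compatible endpoint works). Stdlib only.
import json
import urllib.request

API = "http://localhost:8080/v1/chat/completions"  # assumed llama-server port

def classify_receipt(receipt_text: str) -> str:
    payload = {
        "model": "local",  # llama-server serves whatever model you loaded
        "temperature": 0,
        "messages": [
            {"role": "system", "content":
                'Reply with JSON only: {"category": "...", "total": "..."}'},
            {"role": "user", "content": receipt_text},
        ],
    }
    req = urllib.request.Request(API, data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

print(classify_receipt("WHOLE FOODS 2026-01-03 ... TOTAL $42.17"))
```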
> And then we'll have the equilibrium we already have with the "classic cloud": you either self-host or pay for flexibility and speed.
Currently, it works exactly the other way. The cloud versions are orders of magnitude cheaper than self-hosting, because sharing can utilize servers much more efficiently. A company can spend half a million bucks on a rig running GLM 5.1 and get data security, flexibility and lack of censorship, but oh, it's so expensive compared to Anthropic per-seat plans.
Damned if they do, damned if they don't.
This comment is quite dishonest about the nature of the discussion.
Also, why doesn't their task manager show that it's actually the one downloading? Why does it go out of its way to hide this activity?
Since I have conky on my desktop I caught this immediately and took the action I preferred with my own computer, which was to _immediately_ disable it.
https://developer.chrome.com/blog/new-in-chrome-148#prompt-a...
https://www.google.com/chrome/ai-innovations/
They have absolutely not been shy about any of this.
Please show me where in either of those documents it explains it's going to download a 4GB model.
It's a totally separate tab that opens. It's got nothing to do with what you use as your homepage.
Not to mention that the LLM I choose to run requires a monster machine and is infinitely more capable than whatever Google chose to put in their browser?
I mean, none of this affects me because I don't use chrome, obviously, but you don't see the difference? Bewildering.
Is there a solution for this? I'm currently just making users download onnx models if they want a feature, but it's not smooth UX
A self-hosted inference solution that offers good tenant-isolation guarantees (ideally zero trust) and is easy enough to deploy and maintain (think Plex for AI) would be my choice for privacy. To be honest, I have done zero research about this and have zero idea how feasible it is; maybe it already exists and there are some Discord servers I should join?
Edit: I don't need to mention it here, but what's incredible is that open models are in the ballpark of the best commercial models, so supposedly the hardest part by far is already solved.
>that open models are in the ballpark of the best commercial models
This is basically true for certain tasks. As an example, chat interfaces are not well poised to take advantage of higher model intelligence than what the best open source models already provide. But coding harnesses still benefit from greater model intelligence and even more so, the reinforcement learning that tightly interlinks the provider's coding harness (claude-code, codex) with the model's tool calling interfaces is another reason for discrepancy in effectiveness even when controlled for model intelligence. The opencode founder (open source coding harness that supports different model providers) was recently complaining about the challenges making the harness work well with different providers: https://x.com/thdxr/status/2053290393727324313
They need to be able to do a small task well and they need to be able to run reasonably on consumer-class devices. Even better if they can run on mobile phones.
In my experiments with local LLMs I noticed that while increasing the size of the model is nice the real thing that turns a barely useless model into something useful is the ability to use tools. Giving my models the ability to search the web and fetch web pages did way more to solve hallucinations than getting a bigger model. And it doesn't have a training cutoff. Sure, the bigger model is probably better at using tools but I often find the smaller models to be good enough.
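For illustration, here's a toy version of that tool loop; the FETCH protocol, endpoint and model name are made up for the sketch (real harnesses use proper tool-calling APIs), but it shows how even a small model plus live pages can beat a bigger model recalling from stale weights:

```python
# Toy tool loop: the model requests a page with "FETCH: <url>", we feed
# the text back, and the final answer is grounded in live data instead
# of stale weights. Assumes an OpenAI-compatible server on localhost.
import json
import re
import urllib.request

API = "http://localhost:8080/v1/chat/completions"
SYSTEM = ("Answer the user's question. If you need a web page first, "
          "reply with exactly: FETCH: <url>")

def chat(messages):
    req = urllib.request.Request(
        API, data=json.dumps({"model": "local", "messages": messages}).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as r:
        return json.load(r)["choices"][0]["message"]["content"]

def answer(question: str, max_steps: int = 3) -> str:
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = chat(messages)
        m = re.match(r"FETCH:\s*(\S+)", reply.strip())
        if not m:
            return reply  # no tool call -> final answer
        page = urllib.request.urlopen(m.group(1)).read(4000)
        messages += [{"role": "assistant", "content": reply},
                     {"role": "user", "content": "Page text:\n" +
                      page.decode("utf-8", "ignore")}]
    return reply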
TFA is focused on whether big models are necessary for what users want. There's some evidence they may never actually be reliable enough unless a) mechanistic interpretability matures far enough or b) our multi-agent systems all become multi-model.
For (a), advancement in MI might fix problems with big models, but would also mean we can maybe get unified representations, and just slice and dice the useful stuff out of huge models, getting only what we need without the junk. Ability to isolate problems won't really come without bringing the ability to isolate functional subsystems. Only want logic? Only vision? Just cut it out of the big monster and enjoy reduced costs and surface area for problems.
For (b), just look at stuff like the evil vector, or the category of hallucinations specific to tool use. Without a complete solution for helpful/honest/harmless alignment, it seems likely that creativity and rigor (and many other things) are fundamentally at odds. If you start to need many models for everything anyway, why do we need the huge expensive do-everything ones? So specialization also becomes a pressure to shrink everything towards minimal reliable experts.
As OP says, it shines in constrained environments where the model is transforming user-owned data. Definitely less useful for anything more open-ended.
Yup, that's the plan. No local model, no webpage; more, better and cheaper adtech extortion/surveillance for vendors while everyone else pays for the juice and hardware degradation.
Maybe it would do better with the new Gemma 4 models, which the Chrome devs have been hinting at moving to. And why the API doesn't let you introspect / pick the model, I'm still not sure.
I think the Quixotic accelerationists of AI are more or less a vocal minority of the people who make software, and the choice of online APIs over local systems is largely a choice made for users rather than developers' laziness.
You can do more, and do it better, with proprietary AI today than with local models. There is no getting around that. Even if local AIs get better, being on the cutting edge of LLM performance is often a very worthy investment.
Most people won’t settle for a product if it’s not the very best and incredibly convenient. That’s a high bar, and local AI often doesn’t meet those standards.
HN’s insistence on treating all users like they are open-source, privacy-first, self-hosted Linux fanatics is painfully corny.
I may personally be of modest intelligence, but to acquire the intelligence that I do have, I did not need to train on every book ever written, every Wikipedia article ever written, every blog post ever written, every reference manual ever written, every line of code ever written, and so on. In fact, I didn't train on even 1% of those materials, or even 0.00000000001% of those. The texts themselves were demonstrably not a prerequisite for intelligence.
At minimum, given that it only took me about 20 years of casual observation of my surroundings to approximate intelligence, this is proof positive that the only "dataset" you need is a bunch of sensors and the world around you.
And yes, of course, the human brain does not start from zero; it had a few million years of evolution to produce a fertile plot for intelligence to take root. But that fundamental architecture is fairly generic, and does not at all seem predicated on any sort of specific training set. You could feasibly evolve it artificially.
A universal translator with image and voice recognition and a decent breadth of encyclopedic knowledge, in only a small fraction of an English Wikipedia dump (6GB of 20+GB), is not "huge".
It is probably closer to the theoretical limit than anyone could have expected.
In the future, when regular home computers have the capabilities of modern servers, we'll be able to train the entire LLM at home.
The problem is that it's much easier to use the SOTA models (especially if they are subsidized) than to spend time tweaking the knobs on a local one.
I just realized this with coding agents: yeah, you probably shouldn't always use the latest version at xhigh, but you end up doing it because you get the job done in less time, with less "effort", and at basically the same price.
I guess we'll see a real effort toward local AI only when the major vendors start billing based on actual token usage.
That's not a problem, that's a feature; I have something like 8 tabs open to different free-tier providers. ChatGPT, Claude and Gemini are the SOTA ones.
I have no problem maxing one out, then moving to the next. I can do this all day, having them implement specific functions (or classes) in my code. The thing is, because I actually know how to write and design software, I don't need to run an agent in a loop to produce everything in a day; I can use the web chatbots with copy/paste to literally generate thousands of lines of code per hour while still having a strong mental model of the code, so I can go in and change whatever I need to.[1]
---------------------
[1] Just did that this morning on a Python project: because I designed what I needed, each generation was me prompting for a single function. So when I needed to add something this morning I didn't even bother asking a chatbot to do it; I just went directly to the correct place and did it.
You can't do that if you generate the entire thing from specs.
The feature of using all these SOTAs to exhaustion on the free tiers is burning their VC money!
The more I use for free, the more of their money I burn, the closer we'll get to actual 3rd-party and independent setups (local or otherwise).
I have a sneaking suspicion this is kinda like the situation with Linux in the 90s, where it kinda worked but it reeeeeally wasn't ready for the home user, but you had a lot of people who would insist to your face everything was fine, mostly for ideological reasons.
Different usage patterns - you want to issue a single spec then walk away and come back later (when it has consumed $10k worth of API tokens inside your $200/m subscription) to a finished product.
Many people issue a spec for a single function, a single class or similar. When you break it down like that, the advantage of SOTA models shrinks.
What do you mean "trust it"? It sounds like you want to vibe-code (never look at the output), and maybe for that you need SOTA, but like I said in a different comment, I can easily generate 1000s of lines of code per hour just prompting the chatbots.
I don't, because I actually review everything, but I can, and some of those chatbots are actually SOTA anyway.
With subpar models I must be more careful about providing instructions and check the work step by step, because the path the model chose is wrong, or isn't what I asked for, or the agent gets stuck in a loop somewhere.
I'm currently running both Sonnet 4.6 and Qwen 3.6-27b on the same codebase (via OpenCode, the parameters were carefully tuned to have a good quality/context size ratio), and on this project, they both struggle with complex non-trivial tasks, and both work flawlessly otherwise. Sonnet 4.6 understands the intent better if my task is ambiguously formulated, but otherwise the gap is pretty small for coding under a harness.
I’ve begun to suspect that most people are probably running different hardware. Sure, if you run the latest deep flash on your brand-new M5 with 128GB, maybe you get acceptable performance?
But honestly, how many people have an extra $9000 laying around these days?
Right now, running with acceptable performance is kind of a luxury. I wish the people who always say - “This is great!” - would realize that not everyone has their hardware.
A smaller, cheaper local model can deliver most of the value for coding, while we still use some services for code review and security compliance.
Once the VC money runs out and they start to charge the real price, the C-level will have to impose budgets or limits. The current pissing contest over who can expend the most tokens is both ridiculous and shortsighted.
On the other hand… the v4 flash model is actual magic compared to what was available 2 years ago. If the rate of improvement stays as is, we’ll get similar performance in a ~120B model in a year, which is viable (if expensive) on everyman hardware. Possibly you’ll be able to run its equivalent on a ~$1200 laptop by 2028, which for me-in-2020 would sound straight out of a scifi movie. A good harness that lets the model fetch data from other sources, like a local Wikipedia copy from Kiwix, could do a lot for factual knowledge too; there’s only so much you can encode in the model itself, but even a cheapish (at pre-current prices) 2TB drive can hold an immense amount of LLM-accessible data.
Big caveat: I don’t see local models for programming or generally demanding agentic tasks being worth it anytime soon. You likely want bleeding edge models for it, and speed is far more important. Chat at 20tok/s is fine; working on even a small codebase at 20tok/s, especially on a noticeably weaker model, is just a waste of time. Maybe it’s a PEBKAC but I have no idea how people make any meaningful use out of qwen 3.6.
This is the wrong way of putting it. Local inference with SOTA models is all about slowing down compute for the sake of fitting on bespoke repurposed hardware. You don't need to go fast if you have the whole machine to yourself 24/7. Cloud AI vendors can't match that kind of economics.
https://news.ycombinator.com/item?id=48050751
A specialist handrolls a cut-down framework to power a 1 or 2 bit quantised version of a cut-down sort-of-frontier model.
It can be yours if you have 128GB or 256GB of RAM.
```
harbor pull unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL

# Open WebUI -> llama.cpp + SearXNG for Web RAG + OpenTerminal as sandbox
harbor up searxng webui llamacpp openterminal
```
That's it, it's already better than Claude's or ChatGPT's app.
* What is the answer to local AI for native apps on Windows?
* What is the answer to local AI for Linux?
This is a big opportunity for Linux, given the high quality of open-weight models. I hope some answer emerges before designs fracture and we get a dozen mutually incompatible answers.
run an ai api endpoint on a unix domain socket
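A minimal sketch of what that could look like in Python, with a made-up one-request JSON protocol (a real setup would put llama.cpp or similar behind the socket); the point is that file permissions on the socket become your access control:

```python
# Sketch: an LLM-ish HTTP API bound to a Unix domain socket, so access
# control is just file permissions. The JSON protocol here is made up;
# a real deployment would proxy to llama.cpp or similar behind this.
import json
import os
import socket
import socketserver
from http.server import BaseHTTPRequestHandler, HTTPServer

SOCKET_PATH = "/tmp/llm.sock"  # chmod this to control who may query

class UnixHTTPServer(HTTPServer):
    address_family = socket.AF_UNIX
    def server_bind(self):
        # Skip HTTPServer's host/port bookkeeping; AF_UNIX has neither.
        socketserver.TCPServer.server_bind(self)
        self.server_name, self.server_port = SOCKET_PATH, 0

class Handler(BaseHTTPRequestHandler):
    def address_string(self):
        return "local"  # AF_UNIX peers have no host:port to log
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        prompt = json.loads(body).get("prompt", "")
        reply = json.dumps({"completion": "echo: " + prompt}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

if os.path.exists(SOCKET_PATH):
    os.unlink(SOCKET_PATH)
UnixHTTPServer(SOCKET_PATH, Handler).serve_forever()
```

You can poke it with `curl --unix-socket /tmp/llm.sock http://localhost/ -d '{"prompt":"hi"}'`.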
We are at least 5 years away from that. And DRAM needs a substantial breakthrough in cost reduction.
The promised mega-data center deals are meant to boost valuations today, not serve tons of customers three years from now.
Seriously. I have never ever seen so many people so willingly drink the marketing kool-aid from companies selling their product before. It's scarier to me than any threats of AI actually disrupting society (because it is so far from being capable of doing that).
Basically small and medium models that are crazy well trained for their sizes.
Then we have a lot of speculative decoding stuff, like MTP and others, coming to speed up responses, and finally better quantisation to use less memory.
Local LLMs are the future, and the larger labs know that the open models will eat their lunch once people realise that the gap is only a few months. If we were good with LLMs a couple months ago, we're good with the open models now.
That's irrelevant to my decision to use local or not.
I didn't read "and how were those models trained" as "Are we there yet?"
I have to assume current architectures aren't optimal though, the idea that we stumbled into the one and only optimal solution seems almost impossible.
If you project out that hardware just a couple of years, and the trained models out a couple of years, you end up in a place where it makes so much more sense to run them locally, for all sorts of latency, privacy, efficacy, and domain-specific reasons.
Not all that different from the old terminal-and-mainframe to PC shift.
Finally - hardware has seemingly gotten out ahead of software that most folks use - watching YouTube, listening to music, playing a game or two. There was a time when playing an mp3 or watching a 4k video really taxed all but the nicest systems. Hardware fixed that problem, like it very well could this one.
Definitely not the high end local LLMs. The small ones, yes, absolutely.
> If you project out that hardware just a couple of years
One of the biggest bottlenecks for LLMs is memory capacity and bandwidth. With the current memory crunch, it's unlikely we'll see lots of advancement in the average memory available, or its bandwidth, on regular (not super high-end) devices in the coming years.
Alternatively, it's possible we get dedicated SLMs for, e.g., phone-specific use cases that are optimised and run well.
A useful framing over “local vs cloud AI” can be split along two axes: does the task touch private data, and does it need frontier intelligence? You can use frontier models for developing the software (doesn’t touch data), but open-source models running locally for ops: maintenance, debugging and monitoring (touches data). If you need to fall back to frontier intelligence at some point for a particularly hard to resolve problem, you can still rely on local models for pre-transforming and filtering input in a way that's privacy-preserving or satisfies some constraint before it’s sent off to the cloud for processing. OpenAI's privacy filter is a good example of a model that can be used to mask PII and secrets and that can run locally: https://openai.com/index/introducing-openai-privacy-filter/, before sending any data externally for processing.
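As a minimal sketch of that "mask locally, then send" pattern (regexes standing in for the local model pass; the linked privacy filter is the real version of the local step):

```python
# Sketch of "pre-transform locally, then call the cloud". The regexes
# stand in for a local model or rule set; only the masked text ever
# leaves the machine.
import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "API_KEY": re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"),
}

def local_redact(text: str) -> str:
    for label, pat in PATTERNS.items():
        text = pat.sub("[" + label + "]", text)
    return text

def ask_frontier(prompt: str) -> str:
    safe = local_redact(prompt)
    # return cloud_client.complete(safe)  # hypothetical cloud call
    return safe

print(ask_frontier("Mail jane@example.com, key sk-abc123def456ghi789jklm"))
# -> Mail [EMAIL], key [API_KEY]
```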
Another framing for local vs frontier closed which the article mentions is whether the task saturates model capability. With certain tasks like PDF processing or voice or summarization, adding more intelligence isn't necessarily useful. Arguably we've approached that point for chat interfaces already with frontier open-source models. But for coding and ops through well structured tool use inside a coding capable harness, we're still a ways away.
Tangentially, a contrarian take here is that AI can actually enable more privacy preserving software if you’re so inclined. You can just build personalized software and it lowers the barrier to entry and the effort required to self host. SaaS complexity often comes from scaling and supporting features for all types of customers, and if you're building software for personal use, you don't need all that additional complexity. Additionally, foundational and infra software that is harder to vibecode with AI is often already open source.
Isn’t this true of any application that accesses anything not running on your computer? This is just describing what it means to add an API call to your app. Nothing to do with AI (?)
Not saying it’s _wrong_ either – maybe it doesn’t use a backend of its own (the client downloads content directly from some predefined set of sites), maybe there is functionality to adjust how the summaries work that benefit from doing it on device, etc. Just doesn’t convince me that ”local AI should be the norm”.
Well there’s your problem, control needs to go the other way. If you want your app to be AI-enabled, you need to make it easy for AI to control your app. Have you used OpenClaw? It’s awesome!
> “But Local Models Aren’t As Smart”
> Correct.
> But also so what?
> Most app features don’t need a model that can write Shakespeare, explain quantum mechanics, and pass the bar exam. They need a model that can do one of these reliably: summarize, classify, extract, rewrite, or normalize.
> And for those tasks, local models can be truly excellent.
I have tried quite a bunch of local models, and the reality is that it's not just a matter of "it's a small model that should be hostable easily". It's also a matter of what your acceptable prefill TTFT and decode t/s are.
All the local models I used, on a _consumer grade_ server (32GB DDR5, AMD Ryzen), have been mostly unusable interactively (no decent use as a coding agent possible), and even for things like classification, context size is immediately an issue.
I say that with six months' experience running various local models for classifying and summarizing my RSS feeds. Just offline-summarizing and tagging the HN articles published on the front page barely keeps the queue sustainable and not growing continuously.
2) It's probably not the time/place to trouble-shoot your "consumer grade server" LLM experience, but if you're running on CPU (you don't mention a GPU) then yeah, your inference speed will be slow.
3) Counterpoint: my consumer-grade Macbook Pro (M1 Max, 64GB) runs Qwen3.6-35B-A3B fast enough to be very usable for regular interactive coding support. (And it would fly with smaller models performing simpler tasks.)
Used to take me maybe 10-20 minutes per sheet.
Then I got codex to whip up a script that sends each sheet to a fairly low parameter locally running LLM and I have the yaml in a couple seconds.
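The script is basically a loop around one request per sheet; a minimal sketch assuming an Ollama-style local endpoint (the URL, model name and prompt are placeholders, not what codex actually wrote for me):

```python
# Sketch: dump each sheet to CSV text, ask a small local model for
# YAML, write it out. Seconds per sheet instead of 10-20 minutes.
# Endpoint and model name are assumptions (Ollama-style API).
import json
import urllib.request

def sheet_to_yaml(csv_text: str) -> str:
    prompt = ("Convert this CSV sheet to YAML: keys from the header row, "
              "one list item per data row. Output only YAML.\n\n" + csv_text)
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": "small-local-model", "prompt": prompt,
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as r:
        return json.load(r)["response"]
```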
My dream is to bootstrap myself to local productivity with providers… I know I’ll never get there because hedonic treadmill etc, but I do feel there’s lots more juice to squeeze. I just need to invest more time into AI engineering…
If you are simply measuring Watt Cost per Token, you are missing the mark drastically. You have to measure quality output per Watt.
It sounds reasonably difficult to benchmark this, maybe I'm wrong though.
Work? I don't want it local at all. I want it all cloud agent.
proceeds to brutalise the reader with an 88-point headline font.
who can afford a house?
If we could even get something like GPT 5.5 running locally that would be quite useful.
1. Local models are likely to be more power-expensive to run (per-"unit-of-intelligence") than remote models, due to datacenter economies of scale. People do not like to engage with this point, but if you have environmental concerns about AI, this is a pretty important one.
2. Using dumb models for simple tasks seems like a good idea, but it ends up being pretty clear pretty quick that you just want the smartest model you can afford for absolutely every task.
And you can't take comfort in knowing that you, personally, will remain in control of your own computing. The majority will let the range and direction of their thoughts and output be determined by the will of the tech giant whose AI they adopt. And that will shape society.
NVidia segments the market by limiting the amount of memory on GPUs. It currently tops out at 32GB (on a 5090), but that card has excellent memory bandwidth (~1.8TB/s). If you want more than that you need to buy an RTX Pro (eg the RTX 6000 Pro w/ 96GB for ~$10K) or you get into high-end solutions like the H100, H200, etc. that have significantly more memory and even higher bandwidth on HBM (eg 3.2TB/s+).
NVidia has released the DGX Spark w/ 128GB of memory for ~$4k. The problem is the memory bandwidth. It's only 273GB/s, which is less than the M5 Pro (307GB/s) but more than the M5. You can buy a 16" Macbook Pro with an M5 Max and 128GB of memory for $6k and it has a bandwidth of 614GB/s. So the DGX Spark is a joke, really.
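The reason bandwidth dominates these comparisons: decode speed is roughly memory bandwidth divided by the bytes the model reads per token. A back-of-the-envelope (it ignores KV-cache traffic and MoE sparsity, so treat the outputs as ballpark upper bounds):

```python
# Rule of thumb: decode t/s ~= memory bandwidth / bytes read per token.
# For a dense model at 4-bit that's roughly params * 0.5 bytes. Ignores
# KV-cache traffic and MoE sparsity, so these are ballpark upper bounds.
def max_tokens_per_sec(bandwidth_gbps: float, params_b: float,
                       bytes_per_param: float = 0.5) -> float:
    return bandwidth_gbps / (params_b * bytes_per_param)

for name, bw in [("DGX Spark", 273), ("M5 Max", 614), ("RTX 5090", 1800)]:
    print(name + ": ~%.0f t/s, dense 70B @ 4-bit" % max_tokens_per_sec(bw, 70))
```

That works out to roughly 8, 18 and 51 t/s respectively, which is why the Spark's 273GB/s undercuts its 128GB of capacity.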
In case it wasn't clear, Apple is interesting in this space because it has a shared memory architecture so the GPU can use all the memory.
Many, myself included, expect there to be no refresh of the 5000-series consumer GPUs this year, which would otherwise happen based on product cycles. So no 5080 Super, for example. And I wouldn't expect a 6090 before 2028, realistically.
One thing Apple hasn't done yet is release the M5 Mac Studios, which are widely expected in Q3 this year. They are interesting because, for example, the M3 Ultra has a memory bandwidth of 819GB/s and previously had a max spec of 512GB but that got discontinued (and the 256GB version also got discontinued more recently).
So many expect an M5 Max Mac Studio with 1TB/s+ bandwidth and specs up to 256GB or 512GB, probably for ~$10k later this year.
You really have to use this hardware almost 24x7 for it to be economical, because otherwise H100 compute hours are probably cheaper.
But what happens to the trillions in AI DC investment when the next generation of GPUs comes out? It's going to halve in value. That's over $1 trillion in capex that will effectively disappear overnight.
I think Apple is the dark horse here because they have no interest in NVidia's pseudo-monopoly. I'm just waiting for them to realize it.
Now CUDA is an issue here still but I think as time goes on it's going to be less of an issue. Memory is still a huge constraint both in terms of price and just general supply because NVidia can justify paying way more for it than you can, probably.
It's still sad to see that 128GB (2x64GB) DDR5 kits are almost $2k now and were $400 a year ago. Expect that to continue until this bubble pops (which IMHO it will) and we're likely in a global recession.
So the other issue is models. OpenAI and Anthropic are built on proprietary models; their entire valuation depends on this moat. I don't think it will last, so both companies are doomed, because open-source models are going to be sufficiently good.
We can already do some reasonably cool stuff on local hardware that isn't that expensive and even more so once you get to $5-10k hardware. That's going to be so much better in 2 years that I'm hesitant to spend any amount of money now.
Plus the code for running these things is getting better. Just in the last month there have been huge speed ups in local LLMs with MTP.
This is what I'm really waiting for. It will enable models comparable to current SOTA at the enthusiast price range.
Not at all sure about that. They have really good compute, and DeepSeek V4 (with antirez's 2-bit expert layer quant) may be able to leverage that compute via parallel inference - the jury is still out on that. Now if you had said Strix Halo/Strix Point or perhaps the Intel close equivalents, that would've been a slightly stronger case.
Welcome back to 2014. Let us now continue yelling at the cloud.
I have to conclude that people would like to have powerful local AI but it should at the same time only be a tiny model. In which case it wouldn't be powerful.
Local models need to be resident in expensive RAM, the kind that has fat pipes to compute. And if you have a local app, how do you take a dependency on whatever random model is installed? Does it support your tool calling complexity? Does it have multimodal input? Does it support system messages in the middle of the conversation or not? Is it dumb enough to need reminders all the time?
Spend enough time building against local models and you'll see they're jagged in performance. You need to tune context size, trade off system message complexity with progressive disclosure. You simply can't rely on intelligence. A bunch of work goes into the harness.
Meanwhile, third party inference is getting the benefits of scale. You only need to rent a timeslice of memory and compute. It's consistent and everybody gets the same experience. And yes, it needs paying for, but the economics are just better.
Reading the tea leaves here, it will probably be common for OSes to have built-in models that can be accessed via API. Apple already does this.
Local models are absolutely going to be the future for things like simple automation and classification tasks that run occasionally and don't need to rely on internet access.
But for all of the serious stuff where you are doing knowledge work, the models will simply continue to be too big, and too slow to run locally.
The article says:
> Use cloud models only when they’re genuinely necessary.
But at least for me, they're genuinely necessary for 99+% of my LLM usage.
At the end of the day, the constraint here really is efficiency and cost.
Privacy can be ensured with the legal system, the same way that businesses that compete with Google still have no problem storing their data in Google Workspace and Google Cloud. The contractual guarantees of privacy are ironclad, and Google would lose its entire cloud business overnight as its customers fled if it ever violated those contractual agreements (on top of whatever penalties they allow for).
Why not ship your own model? In the age of Electron apps, 10GB+ apps are not unheard of.
It seems easier to have industry specs that define a common interface for local models.
I also assume the OS can, or would need to, be involved in providing the models. That may not be a good thing depending on your views of OS vendors, but sharing a single local model does seem more like an OS concern.
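Today the closest thing to that common interface is the OpenAI-compatible API most local servers expose; a minimal sketch of client-side discovery (the ports are those servers' usual defaults, but treat the list as an assumption):

```python
# Sketch: probe the usual local-server ports and use whichever answers.
# An OS-level spec would replace this guesswork with one blessed endpoint.
import json
import urllib.request

CANDIDATES = [
    "http://localhost:11434/v1",  # Ollama's OpenAI-compatible API
    "http://localhost:8080/v1",   # llama.cpp llama-server default
    "http://localhost:1234/v1",   # LM Studio default
]

def find_local_llm():
    for base in CANDIDATES:
        try:
            with urllib.request.urlopen(base + "/models", timeout=0.5) as r:
                json.load(r)  # it speaks the API; good enough for a sketch
            return base
        except (OSError, ValueError):
            continue
    return None

base = find_local_llm()
print("local endpoint: " + base if base else "no local model found")
```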