I consider Gemma 4 31B (dense, no MoE) the new baseline for local models. It's obviously worse than the frontier models, but it feels less like a science experiment than any previous local model I’ve run, including GPT OSS 120B and Nemotron Super 120B.
On my M5 Max with 128 GB of RAM and the full 256K context window, I see RAM use spike to about 70 GB, with something like 14 GB of system overhead. A 64 GB Panther Lake machine with the full Arc B390, or a 48 GB Snapdragon X2 Elite machine, could probably run it with a 128K to 256K context window. Maybe you can squeeze it into 32 GB (27.5 GB usable) with a 32K context window?
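If you want to sanity-check a different machine, the back-of-envelope math is just weights plus KV cache. Here's a rough Python sketch; every architecture number in it (layers, KV heads, head dim, bits per weight) is my guess rather than a published spec, so treat the output as order-of-magnitude only:

    # Rough memory budget for a dense ~31B model. All architecture numbers
    # below are guesses for illustration, not published specs.
    GIB = 1024 ** 3

    params          = 31e9        # total parameters
    bits_per_weight = 6.5         # roughly Q6_K-class quantization (assumed)
    n_layers        = 48          # guessed
    n_kv_heads      = 8           # guessed (GQA)
    head_dim        = 128         # guessed
    kv_bytes        = 1           # q8 KV cache -> ~1 byte per element
    ctx             = 256 * 1024  # context length in tokens

    weights_gib = params * bits_per_weight / 8 / GIB
    # K and V, per layer, per KV head, per head dim, per position
    kv_gib = 2 * n_layers * n_kv_heads * head_dim * ctx * kv_bytes / GIB

    print(f"weights ~{weights_gib:.1f} GiB, KV cache ~{kv_gib:.1f} GiB, "
          f"total ~{weights_gib + kv_gib:.1f} GiB before runtime overhead")

Actual usage will be higher once you add the runtime's compute buffers and system overhead, but it gets you in the right neighborhood.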
Even last year, seeing this kinda performance on a mainstream-ish/plus configuration would have seemed like a pipe dream.
https://thot-experiment.github.io/gradient-gemma4-31b/
This is a relatively complex piece of tooling built entirely by Gemma 4 inside OpenCode where I manually intervened maybe only 4 times over the course of a few hours.
running Q6_K_XL, 128k context @ q8: ~800 tok/s read, 16 tok/s write
eagerly awaiting turboquant and MTP in llama.cpp; should take me to 256k and 25-30 tok/s if the rumors are true
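If anyone wants to point their own tooling at the same kind of setup, the usual pattern is llama-server plus any OpenAI-compatible client. A minimal sketch below; the model filename and launch flags are assumptions based on current llama.cpp conventions, so check llama-server --help against your build:

    # Assumes llama-server is already running, started with something like:
    #   llama-server -m gemma-4-31b-Q6_K_XL.gguf -c 131072 --flash-attn \
    #     --cache-type-k q8_0 --cache-type-v q8_0 -ngl 99
    # (model filename and flags are assumptions; adjust for your build)
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
    resp = client.chat.completions.create(
        model="local",  # llama-server serves whatever model it loaded
        messages=[{"role": "user", "content": "Summarize the build steps in this README."}],
    )
    print(resp.choices[0].message.content)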
And we're progressing on so many different frontiers in parallel: agent harness, agent model, hardware, etc.
Edit: For comparison with the other poster, same setup as above, but with Gemma 4 31B Instruct 8bit MLX (not sure if exactly the same model): time to first token 4.62s, 7.20 tokens/s; with a different prompt, 1.17s and 7.24 tokens/s.
Same laptop, and my contrived test was having it fix 50 or so lint errors in a small vibe-coded C++ repo. I wanted it to be able to handle a bunch of small tasks without getting stuck too often.
GPT OSS 20B was usable but slow, and actually frequently made mistakes like adding or duplicating statements unnecessarily, listing things as fixed without editing the code, and so on.
Qwen 3.5 9B with Opencode was much faster and actually able to work through a majority of the lint warnings without getting stuck, even through compaction, and it fixed every warning with a correct edit.
I tried 4bit MLX quants of Qwen 3.5 9B but it eventually would crash due to insufficient memory. I switched to GGUF, which I run with llama.cpp, and it runs without crashing.
It is absolutely not comparable to frontier models. It’s way slower, gets basic info wrong, and really can’t handle non-trivial tasks in one go. I asked it for an architecture summary of the project and it claimed use of a library that isn’t present anywhere in the repo. So YMMV, but it’s still nice to have, and hopefully the local LLM story can get much better on modest hardware over time.
This is not said often enough.
Yes, local LLMs are great! But reading most HN posts on the subject, you'd think they're within reach of Opus 4.7.
There is a very small, very vocal, very passionate crowd that dramatically overstates the capabilities of local LLMs on HN.
That all being said, I've spent hundreds (maybe thousands?) of hours on this stuff over the past few years, so I don't see a lot of the rough edges. I really believe the capability is there: Gemma 4 31B is a useful agent for all sorts of stuff, and anything you can reasonably expect an LLM to oneshot, Qwen 3.6 35B MoE will handle at like 90 tok/s, absolutely fantastic for tasks that don't require a huge amount of precision.
EDIT: Here's another sample for ya. I went to the store to buy mixers and while I was out Gemma 4 31b got pretty far along with reverse engineering the bluetooth protocol of a desk thermometer I have. I forgot to turn on the web search tool, so it just went at it, writing more and more specific diagnostic logging/probing tools over the course of like 8 turns. It connected to the thermometer, scanned the characteristics and had made a dump of the bluetooth notification data. When I got back it was theorizing about how the data might be encoded in the bluetooth characteristics and it got into an infinite loop. (local models aren't perfect and i never said they were) I turned on the websearch tool and told it to "pick up the project where it left off", it read the directory, did a couple googles and had a working script to print temperature, humidity and battery state in like 3 turns. Reading back through its chain of thought I'm pretty sure it would have been able to get it eventually without googling.
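For the curious, the working script is basically the standard bleak pattern: dump services and characteristics, subscribe to notifications, unpack bytes. A minimal sketch below; the address, characteristic UUID, and byte layout are made-up placeholders, not my device's actual values:

    # Minimal BLE notification reader using bleak. The MAC, UUID, and payload
    # layout are placeholders -- every thermometer encodes this differently.
    import asyncio
    import struct
    from bleak import BleakClient

    ADDRESS = "AA:BB:CC:DD:EE:FF"                         # placeholder address
    NOTIFY_CHAR = "0000ffe1-0000-1000-8000-00805f9b34fb"  # placeholder UUID

    def handle(_char, data: bytearray):
        # Guessed layout: int16 temp*100, uint16 humidity*100, uint8 battery %
        temp, hum, batt = struct.unpack("<hHB", data[:5])
        print(f"{temp / 100:.2f} C  {hum / 100:.2f} %RH  battery {batt}%")

    async def main():
        async with BleakClient(ADDRESS) as client:
            for service in client.services:               # dump everything first
                for char in service.characteristics:
                    print(service.uuid, char.uuid, char.properties)
            await client.start_notify(NOTIFY_CHAR, handle)
            await asyncio.sleep(30)                       # listen for a bit
            await client.stop_notify(NOTIFY_CHAR)

    asyncio.run(main())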
idk, I thought I was a cool and smart engineer type for being able to do stuff like this, if my GPUs being able to do this more or less unsupervised isn't impressive I guess fuck me lol.
I have seen way too many people who are overly optimistic about local LLMs.
Having spent a decent amount of time playing with them on consumer Nvidia GPUs, I understand well that they're not going to be widely usable any time soon. Unfortunately, not many people share that view.
Relatively speaking local models might always be behind the curve compared to frontier ones. You can tell by the hardware needed to run each. But in absolute terms they're already past the performance threshold everyone praised in the past.
Right now in a lab somewhere there's a model that's probably better than anything else. There's a ChatGPT 5.6, an Opus 4.8. Knowing that, do you suddenly feel a wave of disappointment at the current frontier models?
A local model is as good as a frontier model at responding in a single chat thread with you that requires basic tool calling.
A local model is as good as a frontier model at writing a joke.
A local model is as good as a frontier model at responding to an email.
Not sure what needs to be said often enough; nobody who plays around with a local model setup is completely ignorant of frontier models and their capabilities?!
I find them useful in basic research and learning and question asking tasks. Although at the same time, a Wikipedia page read or a few Google searches likely could accomplish the same and has been able to for decades.
You can have good local LLM performance through agents, but you need fast inference. Generally, 2x 3090s or, at minimum, 2x 3080s (you need two to speed up prefill processing for building the KV cache). You just, ironically, need to be good at prompt engineering, which has a real-world analogue in being able to manage low-skilled people in completing tasks.
Edit: TIL it is MoE and only has 3.6B active, explains a lot.
Case in point: the JPMorgan London Whale incident, a $6 billion loss caused in part by an Excel error...
But they are interesting and fun to play with! I do a LOT of work on local agent harnesses etc., mostly for fun.
My current project is a zero install agent: https://gemma-agent-explainer.nicklothian.com/ - Python, SQL and React all run completely in browser. Gemma E4B is recommended for the best experience!
This is under heavy development and needs Chrome for both HTML5 Filesystem API support and LiteRT (although most Chromium-based browsers can be made to work with it).
It's different to most agents because it is zero install: the model runs in the browser using LiteRT/LiteLLM (which gives better performance than Transformers.js), and Filesystem API gives it optional sandbox access to a directory to read from.
It is self documenting - you can ask questions like "How is the system prompt used" in the live help pane and it has access to its own source code.
There's quite a lot there: press "Tour" to see it all.
Will be open source next week.
Actually....
I write and publish my own benchmark for this stuff. It's an agentic SQL benchmark that isn't in the training data yet, and I've found it can separate frontier models from close followers (the only models to get 100% are Opus 4.6 and GPT 5.5).
The best small model I've found is a fine-tune of Opus-3.5 9B which scores 18/25: https://sql-benchmark.nicklothian.com/?highlight=Jackrong_Qw...
Haiku 4.5 scores 20/25, and Haiku is certainly better than Sonnet 3.6. GPT 3.5 scores 13/25.
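The scoring itself is nothing exotic: the model (agentically) writes SQL, I run it against a throwaway database, and compare result sets. Conceptually it's the sketch below, where the schema, question, and expected rows are stand-ins rather than real benchmark cases:

    # Sketch of scoring one case: run the model's SQL against a throwaway
    # SQLite DB and compare result sets. Schema/answer/expected are stand-ins.
    import sqlite3

    def score_case(setup_sql: str, model_sql: str, expected: set) -> bool:
        db = sqlite3.connect(":memory:")
        db.executescript(setup_sql)
        try:
            rows = set(db.execute(model_sql).fetchall())
        except sqlite3.Error:
            return False              # invalid SQL counts as a miss
        return rows == expected

    setup = """
    CREATE TABLE orders (id INTEGER, customer TEXT, total REAL);
    INSERT INTO orders VALUES (1, 'alice', 10.0), (2, 'bob', 25.0), (3, 'alice', 5.0);
    """
    # Pretend the model answered "total spend per customer" with:
    answer = "SELECT customer, SUM(total) FROM orders GROUP BY customer;"
    print(score_case(setup, answer, {("alice", 15.0), ("bob", 25.0)}))  # True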
As much fun as it is to run these things locally, don’t forget that your time is not free. I am slowly migrating my use cases to OpenRouter and run the largest Qwen model for < $2-3/day with serious use for personal projects.
I have a brand new M5 MacBook Pro - top end with all the specs and I've tried local models and they're barely functional.
1) control 2) privacy 3) transparent cost model
Cloud has tremendous value for speed, plug-and-play, and performance. You need to decide how those compete with the benefits of local, both today and, say, a year from now.
Some reference code if you want to throw your agent at it. https://github.com/rapatel0/rq-models
I use it occasionally for very easy tasks: fixing typos or updating metadata in blog posts. So yeah, it improves productivity. But coding-wise it's far from Codex, Claude et al.
Agree, but only for small projects. SOTA from a year ago still wins on larger projects.
Vibe coders out here thinking all software development is solved because they made an (ugly and unoriginal) dashboard for their SaaS clone and a single-column landing page with 3x3 feature cards that's identical to every other vibe coder's "startup".
Makes me feel we are nowhere near the optimum yet.
Examples: https://dasroot.net/posts/2026/05/gemma-4-speed-hacks-mtp-df...
How long do people realistically expect a laptop to stay competitive with SOTA local models? Especially in a space where model sizes, context windows, and inference requirements keep moving every year.
And even if the hardware lasts, the local experience usually doesn’t. A heavily quantized local model running at tolerable speeds on consumer hardware is still nowhere near frontier hosted models in reasoning, coding, multimodal capability, tool use, or reliability.
The economics just don’t make sense to me unless you specifically need offline inference, privacy guarantees, or low latency for a niche workflow. Otherwise you’re tying up $10k upfront to run an approximation of what you can already access through a subscription that continuously improves over time.
You could literally put the difference into index funds and probably cover the subscription indefinitely from the returns alone, even accounting for gradual price increases.
In the UK, it's currently an extra £800 to get a 128 GB vs the 64 GB equivalent. So that's more like 3 years of Claude - I think? - assuming current prices stay the same.
Or: you might just feel like £800 isn't an unjustifiable amount of money (one way or another), and tick the box, on the basis that it might just work out. As the saying goes, in for 459,900 pennies, in for £5,399...
I don't think that's true. Plenty of people can run basic workflows at 8GB on the MacBook Neo and most others are fine at 16 GB.
I just hate paying money for cloud subscriptions, and work has given me a decent laptop
This sort of thing is key to knowing what's going on and not having your brain fully atrophy.
I wouldn't bother with less than 32GB of VRAM. With 16GB you can already run something usable, but 32GB gives you much more power. 9B and 14B models are only interesting if you want to tune models yourself. The sweet spot now seems to be around 27B-35B.
If you are spending $800/month on tokens, you are likely to notice the degradation when going from near-frontier models to local ones. The models I can run locally are consistently worse than Claude Sonnet 4.6 (again, for the work I give them), although Qwen 3.6 does feel almost like magic for its size because it can do a lot. The really big open-weight models should be better, but they want 200+ GB of RAM, which needs a correspondingly expensive CPU.
For instance, if you are an independent inventor trying to write a patent while keeping your patent lawyer expenses to a minimum, you want to write as much of the first draft(s) of the patent as possible yourself. (You’ll save billable hours with your patent lawyer, and you’ll end up with a better patent because you’ll communicate your innovations more clearly to your lawyer.)
However, and this is the big thing, you absolutely do not want to be asking a SOTA LLM for help with the language in your patent application. This is because describing your invention to a web-based LLM could be considered a public “disclosure” of your invention, which (after a one-year grace period goes by) could put your invention in the public domain, basically… and thereby prevent you (or anyone else) from ever being able to patent the invention. Plus, you know, a random unscrupulous employee at the SOTA company could be reviewing logs, notice your great idea, and file a patent on it before you do. Remember, the United States patent office went to “first inventor to file” in 2013.
Oh and don’t take legal advice from random people on the internet by the way.
Imagine you're a contractor. You have a client who knows nothing about software development that wants you to write some software for them. They give you some code they generated with an LLM to get you started. Would you use the code or start over?