i have an m4 studio with a lot of unified memory and i'm still nowhere near running a 120b model. i'm at like 30b
apple or nvidia's going to have to sell 1.5 tb ram machines before benchmark performance is comparable
Plus when you use claude or openai these days, it's performing google searches etc. that my local model isn't doing.
I agree with other comments that there are productive uses for them. Just not on the scale of o4-mini/o3/claude 4 sonnet/opus.
So imo larger open-weight models from big US labs are a big deal! Glad to see it. Gemma models, for example, are great for their size. They're just quite small.
I'm running a 400B parameter model at FP8 and it still took a lot of post-training to get even somewhat comparable performance.
-
I think a lot of people implicitly bake in some grace because the models are open weights, and that's not unreasonable because of the flexibility... but in terms of raw performance it's not even close.
GPT-3.5 has better world knowledge than some 70B models, and even a few larger ones.
Without constantly refreshing the underlying LLM and the expert system layer, these models would be outdated in months. Language and underlying reality would shift from under their representations and they would rot quick.
That's my reasoning for considering this a bubble. There has been zero indication that the R&D can be frozen. They are stuck burning increasing amounts of cash for as long as they want these models to be relevant and useful.
"the hacker news dream" - a house, 2 kids, and a desktop supercomputer that can run a 700B model.
I'm not saying those machines can't be useful or fun, but it's not in the range of the 'fantasy' thing you're responding to.
I'm on a 128GB M4 Max, and running models locally is a curiosity at best given the relative performance.
I'm more than a bit overwhelmed with what I've gotten on my plate and have completely missed the boat on e.g. understanding what MLX is. Really curious for a thought dump if you have some opinionated experience/thoughts here (e.g. it never crossed my mind until now that you might get better results on the NPU than the GPU).
I should try Kimi K2 too.
You get the picture. Sure, even last year's local LLM will do well in capable hands in that scenario.
Now try pushing over 100,000 tokens in a single call, every call, in an automated process. I'm talking about the type of workflow where you push over a million tokens in a few minutes, over several steps.
That's where the moat, no, the chasm, between local setups and a public API lies.
No one who does serious work "chats" with an LLM. They trigger workflows where "agents" chew on a complex problem for several minutes.
That's where local models fold.
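To put rough numbers on why it folds, here's a back-of-the-envelope sketch in Python. Every figure below is an assumption for illustration, not a benchmark:

    # Why million-token automated workflows favor hosted APIs.
    # All numbers are assumptions, not measurements.
    tokens_per_job = 1_000_000        # roughly what the workflow above pushes
    hosted_minutes = 3                # "a few minutes" on a public API

    local_prefill_tok_s = 500         # assumed prompt-processing speed on a strong local box
    local_decode_tok_s = 30           # assumed generation speed
    prompt_fraction = 0.9             # most of those tokens are input, not output

    local_seconds = (tokens_per_job * prompt_fraction / local_prefill_tok_s
                     + tokens_per_job * (1 - prompt_fraction) / local_decode_tok_s)
    print(f"hosted: ~{hosted_minutes} min, local: ~{local_seconds / 60:.0f} min")
    # hosted: ~3 min, local: ~86 min with these assumptions

The exact numbers don't matter much; the point is that the gap is measured in orders of magnitude once the workflow is automated.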
Many small models are supposedly good for controlled tasks, but given a detailed prompt, I can't get any of them to follow simple instructions. They usually just regurgitate the examples in the system prompt. Useless.
You should also remember that there's no free lunch. If you see models below a certain size fail consistently, don't expect a model that is even smaller to somehow magically succeed, no matter how much pixie dust the developer advertises.
If you asked "What's the best bicycle?", most enthusiasts would say one you've tried, one that works for your use case, etc.
Benchmarks should only be for pruning which models you try at the absolute highest level, because at the end of the day it's way too easy to hack them without breaking any rules (post-train on the public benchmark data, generate a ton of synthetic examples, train on those, repeat).
Even if it does poorly in all areas (like Llama 4 [0]), there is still a lot the community and industry can learn from even an uncompetitive model.
[0] Llama 4 technically has a massive 10M token context as a differentiator, however in my experience, it is not reliably usable beyond 100k.
Another reason people are 'hyped' for open models is that access to them cannot be taken away or price-gouged at the whim of the provider, and that their use cannot be restricted in arbitrary ways, although I'm sure that on the latter part they will have a go at it through regulation.
Grab 'em while you can.
Not their proprietary model, but maybe other open source models, or closed source models of their competitors. That way they can first ensure they are the only player on both sides, and then can kneecap their open source models just enough to drive the revenue to their proprietary one.
I have Ollama installed (only a small proportion of their clients would have a large enough GPU for this) and have downloaded DeepSeek and played with it, but I still pay for an OpenAI subscription because I want the speed of a hosted model, to say nothing of luxuries like Codex's diffs/pull request support, agents on new models, deep research, etc. I use them all at least weekly.
Are you using it everyday for programming? If so, how much more or less does it cost you per month? More or less than $100?
Ah, this definitely makes sense! I do this myself and then paste back only the relevant part of the log so as to limit this. I suspect I am being more conservative than others.
Are you using a proxy to connect Claude code to Kimi?
And how much do you estimate it would cost in a month of daily usage?
They are fully trying to be a consumer product, developer services be damned. But they can't just get rid of the API: it's a good incremental source of revenue, and thanks to the Microsoft deal, that revenue would otherwise all end up in Azure. Maintaining their own API is basically a way to get a slice of it.
But if they open sourced everything, it might sour the relationship with Microsoft further, since Microsoft would lose Azure revenue and might be willing to part ways. It would also ensure that they compete on consumer product quality, not (directly) model quality. At this point, they could basically put any decent model in their app and maintain the user base; they don't actually need to develop their own.
Pretty much give away a Sonnet-level coding model and have it work with GPT-5 for harder tasks / planning.
https://llm-stats.com/models/compare/claude-3-7-sonnet-20250...
I wish they released a nano model for local hackers instead
The size in bytes of this 120B model is about 65 GB according to the screenshot, and elsewhere it's said to be trained in FP4, which matches: 120B parameters × 4 bits is roughly 60 GB, plus some overhead.
That makes this model small enough to run locally on some laptops without reading from SSD.
The Apple M2 Max 96GB from January 2023, which is two generations old now, has enough GPU-capable RAM to handle it, albeit slowly. Any PC with 96 GB of RAM can run it on the CPU, probably more slowly. Even a PC with less than 64 GB of RAM can run it but it will be much slower due to having to read from the SSD constantly.
If it's an MoE with roughly 20B active parameters per token, it will read about one fifth of the data per token, making it about 5x faster than a 120B FP4 dense model would be, but it still needs all the weights readily available from one token to the next.
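For a rough sense of what "slowly" means here: decode speed on these machines is mostly bounded by how fast the active weights can be streamed from memory. A back-of-the-envelope sketch, where the bandwidth figure and parameter counts are assumptions (roughly matching an M2 Max's ~400 GB/s unified memory):

    # Upper-bound decode speed for a memory-bandwidth-bound model:
    # each generated token has to stream the active weights once.
    def upper_bound_tok_per_s(active_params_billions, bits_per_weight, mem_bw_gb_per_s):
        gb_read_per_token = active_params_billions * bits_per_weight / 8
        return mem_bw_gb_per_s / gb_read_per_token

    print(upper_bound_tok_per_s(120, 4, 400))  # dense 120B at FP4: ~6.7 tok/s ceiling
    print(upper_bound_tok_per_s(20, 4, 400))   # ~20B-active MoE:   ~40 tok/s ceiling

Real-world numbers land well below these ceilings once KV-cache reads and prompt processing are included, but the ratio between the two is the ~5x described above.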
Alternatively, someone can distill and/or quantize the model themselves to make a smaller model. These things can be done locally, even on a CPU if necessary, provided you don't mind how long it takes to produce the smaller model. Or on a cloud machine rented long enough to make the smaller model, which you can then run locally.
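A toy illustration of what block-wise 4-bit quantization does (round-to-nearest, symmetric; real quantizers such as llama.cpp's use fancier schemes, this just shows why the file shrinks roughly 4x versus FP16):

    import numpy as np

    def quantize_block(w):
        """Quantize one block of float weights to int4 values plus one scale."""
        scale = np.abs(w).max() / 7.0                       # int4 range is -8..7
        q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
        return q, scale

    def dequantize_block(q, scale):
        return q.astype(np.float32) * scale

    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.02, size=64).astype(np.float32)  # one 64-weight block
    q, s = quantize_block(w)
    print("max abs error:", np.abs(w - dequantize_block(q, s)).max())

Distillation is a different (and much more expensive) process that trains a smaller model on the larger one's outputs, but the storage math works the same way once you pick the student's size and precision.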
126G /llmzoo/models/Qwen3-235B-InstructQ4
126G /llmzoo/models/Qwen3-235B-ThinkingQ4
189G /llmzoo/models/Qwen3-235B-InstructQ6
219G /llmzoo/models/glm-4.5-air
240G /llmzoo/models/Ernie
257G /llmzoo/models/Qwen3-Coder-480B
276G /llmzoo/models/DeepSeek-R1-0528-UD-Q3_K_XL.b.gguf
276G /llmzoo/models/DeepSeek-TNG
276G /llmzoo/models/DeepSeek-V3-0324-UD-Q3_K_XL.gguf
422G /llmzoo/models/KimiK2
I’m running a gaming rig and could swap one in right now without having to change anything compared to my 5090, so no $5000 Threadripper or a $1000 HEDT motherboard with a ton of RAM slots, just a 1000 watt PSU and a dream.
Would be interesting to know how it performs in terms of quality and token/sec.
It doesn't mean you can grab your work laptop from 5 years ago and run it there.
I will be running the 120B on my 2x4090-48GB, though.
As far as dense models go, it's larger than many, but Mistral has released multiple 120B dense models, not to mention Llama3 405B.