GLM 5.1 is an excellent model, but even at Q4 you're looking at ~400GB. Kimi K2.5 is really good too, and at Q4 quantization you're looking at almost 600GB.
This model? You can run it at Q4 with 70GB of VRAM. This is approaching consumer-level territory (you can get a Mac Studio with 128GB of RAM for ~3500 USD).
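For a rough sense of where a figure like 70GB comes from, here is a minimal back-of-the-envelope sketch. It assumes a 128B-parameter dense model (the size mentioned elsewhere in this thread) and treats "Q4" as roughly 4.5 bits per weight once scales and metadata are counted. The 4.5 figure and the `weights_gb` helper are illustrative assumptions, real quant files will differ, and KV cache and runtime overhead are not included.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes).

    params_billions * 1e9 weights * bits_per_weight / 8 bytes, divided by 1e9.
    Ignores KV cache, activations and runtime overhead.
    """
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    # Bits-per-weight values are assumptions for illustration only.
    for label, bpw in [("16-bit", 16.0), ("8-bit", 8.0), ("~Q4", 4.5), ("~3.5 bpw", 3.5)]:
        print(f"{label:>9}: {weights_gb(128, bpw):6.1f} GB")
```

Under these assumptions the ~Q4 row lands close to the sizes quoted in the thread, which leaves limited headroom for context on a 128GB machine.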
For the Claude-pilled people, I don't know if you only run Opus but when I was on the Pro plan Sonnet was already extremely capable. This beats the latest Sonnet while running locally, without anyone charging you extra for having HERMES.md in your repo, or locking you out of your account on a whim.
Mistral has never been competitive at the frontier, but maybe that is not what we need from them. Having Pareto models that get you 80% of the frontier at 20% of the cost/size sounds really good to me.
The one thing I would want everyone curious about local LLMs to know is that being able to run a model and being able to run a model fast are two very different thresholds. You can get these models to run on a 128GB Mac, but we need to first tell if Q4 retains enough quality (models have different sensitivities to quantization) and how fast it runs.
For running async work and background tasks the prompt processing and token generation speeds matter less, but a lot of Mac Studio buyers have discovered the hard way that it's not going to be as responsive as working with a model hosted in the cloud on proper hardware.
For most people without hard requirements for on-site processing, the best use case for this model would be going through one of the OpenRouter hosted providers for it and paying by token.
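The difference between "runs" and "runs fast" can be sketched with a simple latency model: one turn costs roughly the prompt length divided by prompt-processing speed, plus the output length divided by generation speed. Every number below is a hypothetical placeholder, not a measurement of any Mac or hosted provider, and `turn_seconds` is an illustrative helper.

```python
def turn_seconds(prompt_tokens: int, output_tokens: int,
                 prefill_tok_s: float, decode_tok_s: float) -> float:
    """Rough wall-clock time for one turn: prefill time plus decode time."""
    return prompt_tokens / prefill_tok_s + output_tokens / decode_tok_s


if __name__ == "__main__":
    # Hypothetical agentic-coding turn: large context in, short answer out.
    prompt, output = 20_000, 500

    # Hypothetical speeds in tokens per second; substitute your own measurements.
    local = turn_seconds(prompt, output, prefill_tok_s=200, decode_tok_s=10)
    hosted = turn_seconds(prompt, output, prefill_tok_s=5_000, decode_tok_s=80)

    print(f"local:  {local:.1f} s per turn")
    print(f"hosted: {hosted:.1f} s per turn")
```

With long prompts the prefill term dominates, which is why prompt processing speed matters so much for interactive use and much less for background jobs.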
> This beats the latest Sonnet while running locally
Almost every open weight model launch this year has come with claims that it matches or exceeds Sonnet. I've been trying a lot of them and I have yet to see it in practice, even when the benchmarks show a clear lead.
This has been my experience as well. I've been testing an agent built with Strands Agents which receives a load balancer latency alert and is expected to query logs with AWS Athena (Trino) then drill down with Datadog spans/traces to find the root cause. Admittedly, "devops" domain knowledge is important here
My notes so far:
"us.anthropic.claude-sonnet-4-6" # working, good results
"us.anthropic.claude-sonnet-4-20250514-v1:0" # has problems following the prompt instructions
"us.anthropic.claude-sonnet-4-5-20250929-v1:0" # working, good results
"us.anthropic.claude-opus-4-5-20251101-v1:0"
"us.anthropic.claude-opus-4-6-v1" # best results, slower, more expensive
"amazon.nova-pro-v1:0" # completely fails
"openai.gpt-oss-120b-1:0" # tool calling broken
"zai.glm-5" # seems to work pretty well, a little slow, more expensive than Sonnet
"minimax.minimax-m2.5" # didn't diagnose correctly
"zai.glm-4.7" # good results but high tool call count, more expensive than Sonnet
"mistral.mistral-large-3-675b-instruct" # misdiagnosed--somehow claimed a Prometheus scrape issue was involved
"moonshotai.kimi-k2.5" # identified the right endpoints but interpreted trace data/root cause incorrectly
"moonshot.kimi-k2-thinking" # identified endpoint, 1 correct root cause, 1 missing index hallucination
Using models on AWS Bedrock. I let Claude Code w/ Opus 4.7 iterate over the agent prompt but didn't try to optimize per model. Really the only thing that came close to Sonnet 4.5 was GLM-5. The real kicker is, Sonnet is also the cheapest since it supports prompt caching
The Kimi ones were close to working but didn't quite make the mark
Very valid. This is an active area of research, and there are a lot of options to try out already today.
- People have successfully used TurboQuant to quantize model weights (TQ3_4S), not just the context KV, to achieve smaller sizes than Q4 (~3.5 bpw) with much better PPL and faster decoding.
- Importance-weighted quantization (e.g. IQ4) also provides way better PPL, KLD, etc. at the same size as a Q4.
- DFlash (block diffusion for speculative decoding) needs a good drafting model compatible with the big model, but can provide an uplift up to 5x in decoding (although usually in the 2-2.5x range)
- Forcing a model's thinking to obey a simple grammar has been shown to improve results with drastically lower thinking output (faster effective result generation) although that has been more impactful on smaller models.
We should be skeptical, but it's definitely trending in the right direction and I wouldn't be surprised if we are indeed able to run it at acceptable speeds.
> Almost every open weight model launch this year has come with claims that it matches or exceeds Sonnet. I've been trying a lot of them and I have yet to see it in practice, even when the benchmarks show a clear lead.
This hasn't been my experience. After Anthropic started their shenanigans I've switched to exclusively using open-weights models via OpenRouter and OpenCode and I can't really tell a difference (for better or for worse).
> - People have successfully used TurboQuant to quantize model weights (TQ3_4S), not just the context KV, to achieve smaller sizes than Q4 (~3.5 bpw) with much better PPL and faster decoding.
> - Importance-weighted quantization (e.g. IQ4) also provides way better PPL, KLD, etc. at the same size as a Q4.
Where can I find more info on this? I’d like to convert models to ONNX this way.
The most difficult environment for small models is in the browser. Would be great to push the SOTA in that environment.
Cloud hardware can run the original model. Quantization will reduce quality. The quality drop to Q4 is not trivial.
Cloud hardware is also massively faster in time to first token and token generation speed.
> there's nothing wrong per se about targeting slower inference speeds in a local single-user context.
If that's what the user wants and expects then it's fine
Most people working interactively with an LLM would suffer from slower turns.
New models are often being released in quantized format to begin with. This is true of both Kimi and the new DeepSeek V4 series. There is no "original model", the model is generated using Quantization Aware Training (QAT).
The original model is the model used for the benchmarks
People will say "You can run it locally!" then show the benchmarks of the original model, but what they really mean is that you can run a heavily quantized adaptation of the model which has different performance characteristics.
As for other models, we quantize them because we are generally constrained by the model's total footprint in bytes, and running a larger model that's been quantized to fit in the same footprint as a smaller one improves performance compared to a smaller original, generally up to Q4 or so, with even tighter quantizations (up to Q2) being usable for some uses such as general Q&A chat.
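To make the quality cost concrete, here is a minimal sketch of block-wise symmetric 4-bit quantization and the round-trip error it introduces. This is a simplified absmax scheme written for illustration, not the actual Q4 or IQ4 formats used by any particular runtime, and the block size of 32 and the random test values are assumptions.

```python
import random


def quantize_block(values, bits=4):
    """Symmetric absmax quantization of one block to signed integers."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4-bit
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0   # one float scale per block
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize_block(q, scale):
    """Map the stored integers back to approximate float weights."""
    return [qi * scale for qi in q]


if __name__ == "__main__":
    random.seed(0)
    block = [random.gauss(0.0, 0.02) for _ in range(32)]  # stand-in "weights"
    q, scale = quantize_block(block)
    restored = dequantize_block(q, scale)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"scale={scale:.6f}  worst abs error={worst:.6f}")
```

Each weight can be off by up to half a quantization step, and a single outlier in a block stretches the scale for every other value in it. As far as this sketch goes, that is the basic motivation for schemes that weight the error by importance.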
Also what's with @sasha-id talking to himself? Looks weird as all get out.
Anthropic apparently won't take responsibility for issues their own systems handling billing cause. You think they'll take responsibility in your system when a bug in their models can be demonstrated as the cause?
I think with every org, especially the big ones, trying to dodge responsibility (setting the intent of "customer support" to be annoying them enough for them to buzz off), the only recourse people have is to give them enough bad press where they wake up and do the refund, it's less than a rounding error for them.
I think Anthropic is hardly unique in that position and being able to chat with a human with any sort of power to actually make things right is becoming more and more rare. If any human eyes saw that, the correct thing to do would probably be passing the message up the chain like "Hey, this will have really bad optics if we don't do the right thing. Can you take like 5 minutes and hit the refund button while I draft up a nice message about it?"
I really wish it carried any weight. It just doesn't. If someone at the organization just says "never admit fault, always attack", it's very likely they'll get away with it.
Sad to see all the non-Chinese open source models being at least one generation behind.
Before February I was able to use Opus on High exclusively on my Max plan no problem. Now I've shifted to just using Sonnet on high and yeah, it's pretty capable. I love that, Claude Pilled. ;)
Not really.
- The benchmarks are based on F8_E4M3 and you’re not running that on any Mac.
- Sonnet has a 1M token context window. This is 256k but again you’re probably not even getting that locally.
- Sonnet is fast over the wire. This is going to be much slower.
i too am desperate to just sever ties with these big providers, my fingers are crossed we get there within the constraints of local hardware even if that means me spending 3-5k i just want off this wild ride.
That's the edge of Apple Silicon for AI. When they scale up the chip they add more memory controllers which adds more channels and more bandwidth.
But yeah in the end it's still going to be only a handful of people that can run it.
What I meant is that I think researching and developing smaller, more powerful models is more interesting than chasing the next 3T parameter model while burning through VC money and squeezing your customer base more and more aggressively.
[1]: There is no other common benchmark in the blog.
I'm rooting for Mistral, I want them to release good models. This just isn't one. It's a little sad since they once were so prominent for open-source.
Who knows — if they have the compute to train this, they have the compute to train an MoE that's 3-4T total params with 128B active. Maybe they'll make a comeback (although using Llama 2 attention is... not promising). I hope they do.
Not sure it will beat Sonnet at Q4.
>This is approaching consumer level territory (you can get a Mac Studio with 128GB of RAM for ~3500 USD).
For $3500 I can get 7-8 years of GLM using coding plans, have a faster model and much better code quality.
Very valid. Importance-weighted quantization and TurboQuant on model weights can reduce loss a lot compared to "traditional" Q4 so one can be hopeful.
> For $3500 I can get 7-8 years of GLM using coding plans, have a faster model and much better code quality
But you will own no computer, and that's also assuming prices stay what they are. Anyway my point was not whether or not it makes financial sense for everyone. A lot of people are very happy not owning their movies, software, games, cars or house. I'm just happy there is a future where the people can own and locally run the tech that was trained on their stolen data.
mind sharing where's the go to place to pay for open models?
In the US with our broken system of capitalism, it’s the only way we can tether these companies to reality. Left to their own devices, I’m not convinced they would actually compete with each other on price.
But nobody likes to talk about how “moat” building is fundamentally anti-competitive, even in name.
Funny that self proclaimed capitalists hate the system in practice. Commodity pricing is what truly terrifies them.
https://chatgpt.com/share/69f239e8-7414-83a8-8fdd-6308906e5f...
Tldr: qwen3.6-27b, a 4.7x smaller model, has similar performance.
UPD. NVM, Mistral Medium 3.5 is dense. So yes, it is worse in every way.
The different results on some benchmarks vibes as if this is truly an independently trained model, not just exfiltrated frontier logs, which I think is also really important - having different weight architectures inside a particular model seems like a benefit on its own when viewed from a global systems architecture perspective.
Note: I have more uncommitted speed improvements in my tree that I'll push soon, the current tree could be a little bit slower but not much, still super usable.
I don't understand one thing about Mistral, of which I'm a fan, being in Europe: they opened the open weights MoE show with Mixtral. Why are they now releasing dense models of significant sizes? This way you don't compete in any credible space, neither local inference nor remote inference, since the model is far from SOTA and not cheap to serve. So why are they training such big dense models? Dense models have a place in the few tens of billions of parameters, as Qwen 3.6 27B shows, but if you go 5 times that, it is no longer a fit, unless you are crushing anything requiring the same VRAM on capabilities, which is not the case.
It's cool that they added comparisons to their own Mistral Small 4 119B A7B, which kind of shows that! They could have also included comparisons to something like Qwen Coder Next 80B A3B (or maybe the newer Qwen 3.6 35B A3B, or the 27B dense one), maybe DeepSeek V4 Flash 284B A13B, or the older GPT-OSS 120B A5B to illustrate that difference and where their model sits even better, it would probably give a more positive picture than just comparing themselves against a bunch of bigger models!
Come to think of it, alongside throwing some money at DeepSeek not just Anthropic, I probably should get a Mistral subscription as well sometime, to see how they perform on various tasks - cause they seem pretty cost effective and it's nice to support at least some EU orgs: https://mistral.ai/pricing
Sometimes when a new release comes around from any provider I just want to test it a bit on the web, without paying or using an agent harness.
Why are they like this ;_;
Edit: Christ on a bike it's bad at drawing SVGs https://chat.mistral.ai/chat/23214adb-5530-4af9-bb47-90f5219...
On the bike would be an improvement. Geez.
I know SVGs may not be the best benchmark, but that matches my experience of trying to run a (previous) Mistral model in Mistral Vibe, asking it to help me configure an MCP server in Vibe. It confidently explained that MCP is the MineCraft Protocol and then began a search of my computer looking for Minecraft binaries.
(I think the hair was unintentional, but it is impossible to be sure.)
Is it not? It's html and javascript. And not even attempting to draw details that other models do.
When I try other html / js prompts it also lags behind Chinese models from over half a year ago. I mean worse than GLM 4.7.
So I was waiting for this release and it's... 5x more expensive than the latest Mistral Large. So now I'm worried they'll pull the plug on the cheap Large when their releases roll over to that one.
Anyhow, competition is fierce. I'll have some model I can use in the future, even if it's not dirt cheap like current Mistral Large is.
There are none. Mistral Small 4 is pareto-competitive in its pricing bracket at $0.15/$0.60, at worst it's second to Gemma 4 26B A4B. The above countries have never had a model that is even close to being so.
This particular Mistral Medium looks to be uncompetitive at that pricing. I'm surprised it's so expensive given its size. Wonder if we'll see other providers offer it for cheaper.
but that doesn't mean Mistral has never produced anything useful.
EXAONE from LG AI Research https://huggingface.co/LGAI-EXAONE
They had one of the best small models a few months ago and they released a new model just last week.
There's also HyperCLOVA X (haven't tested it, but maybe it is also good) https://huggingface.co/naver-hyperclovax
> India
India has the Sarvam model series, which admittedly are not SotA, but they have pretty good voice capabilities https://huggingface.co/sarvamai
The UAE (not part of the list above) also has a few noteworthy models: https://huggingface.co/tiiuae
> (haven't tested it, but maybe it is also good)
I have. It is not.
More than anything the availability speaks for itself. If it was indeed pareto competitive, all dozens of model providers would be doing their best to offer it for serverless inference. They don't. There's maybe one that does. Do you think a lot of companies wouldn't prefer a Korean model over a Chinese one? In this case, the market speaks. Go talk to people who run business based on putting billions or trillions of tokens through open weights models. And how much time they put into optimization of model selection to save money and latency. And ask why none of them are using EXAONE models. It's not because we're not aware of their existence. There's also reason to believe they've been benchmaxxing more than Chinese models, btw. Have you done the vibecheck?
I wish they were strong, I hope that in the future, they are. More diversity is better. So far they have not yet been a serious option at any point.
Yes, it might be a problem that the UK allows companies like this to be bought up by foreign countries.
Unless they moved to the US for funding while keeping a back office in the UK.
It’s strange to expect anything significant to come out of Europe when VCs there are either very risk averse and/or don’t have enough cash to begin with. It’s not like government or EU funding can replace that, since it's almost always wasted or misdirected.
It’s not like VCs are only allowed to invest in companies in their own country.
Europe invests nowhere near as much as the US does in tech, so we need to either figure out how to be at least as efficient as the Chinese models are, and hopefully more (at least in terms of training), or there's little point in trying.
I suspect this is one of the reasons why Mistral's models are somewhat struggling; i.e. US style training costs, but nowhere near as much cash as OpenAI/Anthropic have.
There are multiple European Google alternatives as well for example, but being 80% as good just doesn't cut it. Chinese models win because they are 95-98% as good as the SotA US ones but at a fraction of the cost.
A few months ago China was being criticized left and right on how somehow it was not able to compete, and once DeepSeek showed up then all the hatred shifted onto how China was actually competing but exploring unfair competitive advantages.
Funny how that works.
Also, aren't the likes of OpenAI burning through over $2 of investment for each $1 of revenue?
China and rest of the world has sane leadership that aren't mentally retarted.
Chinese AI companies are just trying to make money. They are also publicly contributing to forward the field. We all get to decide, but claiming deepseek is involved in genocide is beyond a stretch. Claiming anthropic and chatgpt are... Actually not so much given the president was threatening it and enabling it with an ally...
They were perhaps right.
But yes, perhaps it would have been better for all of us if they hadn't.
- Mythos wasn't released widely.
- But Anthropic shared info on it and said it was dangerous.
- Anthropic is a company.
- Companies like money.
- Therefore Mythos is marketing hype.
- Remember GPT-2? That also wasn't released. They said it was dangerous.
- But, GPT-3, GPT-4, GPT-5, etc. were released.
- Therefore GPT-2 being dangerous was marketing hype.
I've seen the idea that GPT-2 not being released was marketing hype at least 6 times since Mythos was shared.
It's Not Even Wrong, in the Pauli sense: they weren't selling anything! They weren't raising funding! What were they marketing!?
And there's a lot more elided from history, ex. they didn't have an API yet.
GPT-3 was released, a year or two later, and did have an API. But, no one used it, it wasn't good enough yet. And they did treat it as dangerous, it was wildly over-the-top manually monitored for anything resembling not-intended-use. I got permanently suspended for using the word "twink"
That's not what I am saying.
It's not that GPT-2 not being released was marketing hype, it's that OpenAI themselves claiming it's too dangerous to release specifically, implying it's close to AGI, (or something like that), was marketing hype.
I don't think you do.
I only mention reading it because that would clear it up, and you seem interested, and your parenthetical indicates A) you're aware you're claiming something a bit silly and B) you don't know what was actually said.
Pre-agent, there wasn't always an obvious difference between models. Various models had their charms. Nowadays, I don't want to entertain anything less than the frontier models. The difference in capability is enormous and choosing anything less has a real cost in terms of productivity.
I've been a big fan of the smaller labs like Mistral and especially Cohere but it's been a while since I've been excited by a release by either company.
That said, I'm using mistral voxtral realtime daily – it's great.
A lot of us have been agentic coding since almost 2 years ago, mid-2024. I have. The productivity gap of "best vs 2nd vs 3rd best model" was biggest back then and has slowly been shrinking ever since.
It's just apples to oranges.
There is not a clear, across the board, winner on non-agentic tasks between Gemini, ChatGPT, and Claude - the simple chatbot interface.
But Claude Code is substantially better than Codex which itself is notably better than Gemini-cli.
In this vein, it should not be surprising that Claude Code is way better than non-frontier models for agentic coding... It's substantially better than other frontier models at specialized agentic tasks.
From my perspective, Claude Code is decidedly not better than Codex. They’re slightly different and work better together. I would have no issues dropping CC entirely and using codex 100%.
If you’re working off of “defaults”, in other words no custom prompting, Claude Code does perform a lot better out of the box. I think this matters, but if you’re a professional software developer, I’d make the case that you should be owning your tools and moving beyond the baked in prompts.
This is a very naive and misguided opinion. In most tasks, including complex coding tasks, you can hardly tell the difference between a frontier model and something like GPT4.1. You need to really focus on areas such as context window, tool calling and specific aspects of reasoning steps to start noticing differences. To make matters worse, frontier models are taking a brute force approach to results which ends up making them far more expensive to run, both in terms of what shows up on your invoice and how much longer you have to wait to get any semblance of output.
And I won't even go into the topic of local models.
This is like saying "the current models and the old models are the same if you ignore every important advance they've made"
Their model listing API returns this:
{
  "id": "mistral-medium-2508",
  "object": "model",
  "created": 1777479384,
  "owned_by": "mistralai",
  "capabilities": {
    "completion_chat": true,
    "function_calling": true,
    "reasoning": false,
    "completion_fim": false,
    "fine_tuning": true,
    "vision": true,
    "ocr": false,
    "classification": false,
    "moderation": false,
    "audio": false,
    "audio_transcription": false,
    "audio_transcription_realtime": false,
    "audio_speech": false
  },
  "name": "mistral-medium-2508",
  "description": "Update on Mistral Medium 3 with improved capabilities.",
  "max_context_length": 131072,
  "aliases": [
    "mistral-medium-latest",
    "mistral-medium",
    "mistral-vibe-cli-with-tools"
  ],
  "deprecation": null,
  "deprecation_replacement_model": null,
  "default_model_temperature": 0.3,
  "type": "base"
},
So that has the alias "mistral-medium-latest", but the official ID is "mistral-medium-2508" which suggests it's the model they released in August 2025.
But... that 1777479384 timestamp decodes to Wednesday, April 29, 2026 at 04:16:24 PM UTC
So is that the new Mistral Medium?
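The timestamp and alias reading above can be checked with a few lines of Python. This is a minimal sketch over a trimmed copy of the listing entry quoted in this comment; the helper names `created_utc` and `answers_to` are made up for illustration, and "mistral-medium-3.5" is simply the model name used in the curl request in this thread.

```python
from datetime import datetime, timezone

# Trimmed copy of the listing entry quoted above (only the fields used here).
entry = {
    "id": "mistral-medium-2508",
    "created": 1777479384,
    "aliases": [
        "mistral-medium-latest",
        "mistral-medium",
        "mistral-vibe-cli-with-tools",
    ],
}


def created_utc(model: dict) -> datetime:
    """Decode the Unix `created` field as a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(model["created"], tz=timezone.utc)


def answers_to(model: dict, name: str) -> bool:
    """True if `name` is the entry's id or one of its listed aliases."""
    return name == model["id"] or name in (model.get("aliases") or [])


if __name__ == "__main__":
    print(created_utc(entry).isoformat())               # 2026-04-29T16:16:24+00:00
    print(answers_to(entry, "mistral-medium-latest"))   # True
    print(answers_to(entry, "mistral-medium-3.5"))      # False
```

The decoded value agrees with the date given above. Nothing in the listing explains why an ID ending in 2508 carries an April 2026 creation time.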
curl https://api.mistral.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(llm keys get mistral)" \
  -d '{
    "model": "mistral-medium-3.5",
    "messages": [
      {"role": "user", "content": "Generate an SVG of a pelican riding a bicycle"}
    ]
  }'
Which did work: https://gist.github.com/simonw/f3158919b18d2c47863b0a5dc257a... - it's pretty disappointing.
Weird that it doesn't show up in the model list:
curl https://api.mistral.ai/v1/models \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(llm keys get mistral)" | jq
https://chat.mistral.ai/chat/897fbe7d-b1ae-4109-9b29-f3ccc4f...
But what stands out to me is that it's barely able to draw a "recognizable" pelican at all. The Devstral 2 model even looks slightly better, though maybe I'm splitting hairs: https://simonwillison.net/2025/Dec/9/
Gemini fast could do that in under 5 seconds.
I've gotten a lot of use out of Mistral models, and I imagine this model is pretty good at other things, but it really feels like a 128B parameter dense model should be at least a little better than this.
That said, when I stop spending money on Gemini Ultra, I will give Mistral Vibe another 1-month test.
I like the entire business model and vibe of Mistral so much more than OpenAI/Anthropic/Google but I also have stuff to get done. I am curious if Mistral Vibe for $15/month is a stable business model (i.e., can they make a profit).
One thing in particular I was disappointed in was its bad explanations when asking about French grammar. It made multiple mistakes and the other models got it right, even Qwen 3.6 27b!
Anyway, I'm hoping they catch up some more.
The only benefit of leading is mindshare. OpenAI is doubling down on that, by investing in communication companies. That's their pathetic attempt at a "moat".
That is what has happened until now though
The advantage of a dense model like this Mistral one is that it is as smart as a much larger MoE model, so it can fit on fewer GPUs. The tradeoff is that it is much slower, since it has to read 100% of its weights for every token; MoE models typically only read about a tenth (though sparsity levels vary).
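The speed tradeoff described above follows from a simple bound: when decoding is limited by memory bandwidth, tokens per second cannot exceed bandwidth divided by the bytes of weights read per token. Below is a minimal sketch of that bound. The 800 GB/s bandwidth and the 72 GB weight size are assumptions chosen for illustration, the "one tenth" active fraction is the rough figure from the comment above, and KV cache reads and compute are ignored.

```python
def decode_tokens_per_sec(bandwidth_gb_s: float, weights_gb: float,
                          active_fraction: float = 1.0) -> float:
    """Upper bound on decode speed when memory bandwidth is the bottleneck.

    active_fraction is the share of the weights read per token:
    1.0 for a dense model, smaller for a sparse MoE.
    """
    return bandwidth_gb_s / (weights_gb * active_fraction)


if __name__ == "__main__":
    bandwidth = 800.0   # GB/s, assumed for illustration
    weights = 72.0      # GB, e.g. a 128B model at ~4.5 bits per weight (assumed)

    dense = decode_tokens_per_sec(bandwidth, weights)
    moe = decode_tokens_per_sec(bandwidth, weights, active_fraction=0.1)

    print(f"dense, same footprint: <= {dense:.1f} tok/s")
    print(f"MoE reading ~1/10:     <= {moe:.1f} tok/s")
```

For the same memory footprint, the bound scales inversely with the active fraction, so a tenth of the weights per token means roughly ten times the ceiling on generation speed.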
Doesn't look too promising. Is there any reason to consider Mistral other than it's not US?
And on top of it a range of providers like Fireworks and so on that offer it for Chinese models. This seems such an obvious thing for Mistral to offer.
Funny detail: Google AI (the one they use in search) can't spell évidemment correctly.
I have been using DeepSeek and GLM models with OpenCode and Codex and Claude side by side.
I have not found the Chinese models lacking. I enjoy coding and like to maintain full control of my codebase and deeply care about the GoF patterns. So I am very stringent in terms of what I want the LLM to code and how to code.
So from my perspective, they are all about the same.
https://huggingface.co/mistralai/Mistral-Medium-3.5-128B
They more or less claim this exceeds Claude Sonnet 3.5 on most things, but is worse than Sonnet 3.6, and exceeds all other open models.
Oh and they have a cloud service that will code your apps "in the cloud". But, yeah, at this point, so does my cat.
And, yes, unsloth is on it: https://huggingface.co/unsloth/Mistral-Medium-3.5-128B-GGUF (but 4bit quant is 75G)
There is no way it exceeds “all other” open models - but it does exceed all of Mistral’s past models.
You can see it getting blown past by GLM 5.1 and Kimi in this.
Still excited to give it a try
Difficult to say, this information is not really public. That said, those investors include EU agencies and European multinational companies and governments. It’s not as flashy as the ridiculous sums OpenAI is getting but it should be enough to keep them going for a while.
They also have a different business model. They are selling their expertise to fine tune and adapt their models to on-premises computers (which they can help you build) to handle confidential data and information. I would not be surprised that the revenue they get from normal people is negligible in comparison.