Not sure I really understand the comparisons though. They emphasize the cost savings relative to Haiku, but Haiku kinda sucks at this task, and Leanstral is worse? If you're optimizing for correctness, why would "yeah it sucks but it's 10 times cheaper" be relevant? Or am I misunderstanding something?
On the promising side, Opus doesn't look great at this benchmark either — maybe we can get better than Opus results by scaling this up. I guess that's the takeaway here.
> Instead of taking a stab in the dark, Leanstral rolled up its sleeves. It successfully built test code to recreate the failing environment and diagnosed the underlying issue with definitional equality. The model correctly identified that because def creates a rigid definition requiring explicit unfolding, it was actively blocking the rw tactic from seeing the underlying structure it needed to match.
Many comments here point out that Mistral's models are not keeping up with other frontier models - this has been my personal experience as well. However, we need more diversity of model alignment techniques and companies training them - so any company taking this seriously is valuable.
This model is specifically trained on this task and significantly[1] underperforms opus.
Opus costs about 6x more.
Which seems... totally worth it based on the task at hand.
[1]: based on the total spread of tested models
That would also help to reduce our dependency on American Hyperscalers, which is much needed given how untrustworthy the US is right now. (And also hostile towards Europe as their new security strategy lays out)
Still, the more interesting comparison would be against something such as Codex.
> I'm interested to see what it is in the age of LLMs or similar future tools. I suspect a future phase change might be towards disregarding how easy it is for humans to work with the code and instead focus on provability, testing, perhaps combined with token efficiency.
> Maybe Lean combined with Rust shrunk down to something that is very compiler friendly. Imagine if you could specify what you need in high level language and instead of getting back "vibe code", you get back proven correct code, because that's the only kind of code that will successfully compile.
Could definitely be interesting for having another model run over the codebase when looking for improvements
It does actually significantly boost performance. There was an article on here about it recently, I'll see if I can find it.
Edit: https://news.ycombinator.com/item?id=44630724
They found the more different the models were (the less overlap in correctly solved problems), the more it boosted the score.
I've also used Devstral Small to make a simple raytracer[5][6] (it was made using the "classic" chat by copy/pasting code, not any agentic approach and i did fix bits of it in the process) and a quick-and-dirty "games database" in Python+Flask+Sqlite for my own use (mainly a game backlog DB :-P).
I also use it to make various small snippets, have it generate some boilerplate stuff (e.g. i have an enum in C and want to write a function that prints names for each enum value or have it match a string i read from a json file with the appropriate enum value), "translate" between languages (i had it recently convert some matrix code that i had written in Pascal into C), etc.
[0] https://i.imgur.com/f4OrNI5.png
[1] https://i.imgur.com/Zac3P4t.png
[2] https://i.imgur.com/jPYYKCd.png
[3] https://i.imgur.com/WZGfCdq.png
[4] https://i.imgur.com/ytYkyQW.png
[5] https://i.imgur.com/FevOm0o.png (screenshot)
[6] https://app.filen.io/#/d/e05ae468-6741-453c-a18d-e83dcc3de92... (C code)
I was surprised: even tho it was the cheapest option (against other small models from Anthropic) it performed the best in my benchmarks.
I actively use gemini-3.1-pro-preview, claude-4.6-opus-high, and gpt-5.3-codex as well. I prefer them all for different reasons, however I usually _start_ with mistral if it's an option.