Leanstral: Open-source agent for trustworthy coding and formal proof engineering
272 points
3 hours ago
| 18 comments
| mistral.ai
| HN
andai
3 hours ago
[-]
Trustworthy vibe coding. Much better than the other kind!

Not sure I really understand the comparisons though. They emphasize the cost savings relative to Haiku, but Haiku kinda sucks at this task, and Leanstral is worse? If you're optimizing for correctness, why would "yeah it sucks but it's 10 times cheaper" be relevant? Or am I misunderstanding something?

On the promising side, Opus doesn't look great at this benchmark either — maybe we can get better than Opus results by scaling this up. I guess that's the takeaway here.

reply
flowerbreeze
2 hours ago
[-]
They haven't made the chart very clear, but it seems it has configurable passes: at 2 passes it's better than Haiku and Sonnet, and at 16 passes it starts closing in on Opus (although it's not quite there), while consistently being less expensive than Sonnet.
reply
andai
2 hours ago
[-]
Oh my bad. I'm not sure how that works in practice. Do you just keep running it until the tests pass? I guess with formal verification you can run it as many times as you need, right?
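Something like this, I imagine (a sketch; the `generate` and `check` callbacks are hypothetical stand-ins for the model call and the proof checker):

```python
def sample_until_verified(generate, check, max_passes=16):
    """Keep sampling candidates until the checker accepts one.
    Because a sound formal checker never accepts a wrong proof, extra
    passes can only raise the success rate, never the error rate."""
    for attempt in range(1, max_passes + 1):
        candidate = generate()
        if check(candidate):
            return attempt, candidate
    return None, None  # budget exhausted without a verified candidate
```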
reply
DrewADesign
3 hours ago
[-]
It’s really not hard — just explicitly ask for trustworthy outputs only in your prompt, and Bob’s your uncle.
reply
miacycle
1 hour ago
[-]
Assuming that what you're dealing with is assertable. I guess what I mean to say is that it's sometimes difficult to articulate what is correct and what isn't, depending upon the situation in which the software executes.
reply
DrewADesign
37 minutes ago
[-]
And Bob’s your uncle.
reply
lsb
2 hours ago
[-]
The real world success they report reminds me of Simon Willison’s Red Green TDD: https://simonwillison.net/guides/agentic-engineering-pattern...

> Instead of taking a stab in the dark, Leanstral rolled up its sleeves. It successfully built test code to recreate the failing environment and diagnosed the underlying issue with definitional equality. The model correctly identified that because def creates a rigid definition requiring explicit unfolding, it was actively blocking the rw tactic from seeing the underlying structure it needed to match.
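For anyone who hasn't hit this in Lean: a minimal illustration of the failure mode described (hypothetical names, assuming the `omega` tactic is available), where `rw` can't see through a rigid `def` until it's unfolded:

```lean
def double (n : Nat) : Nat := 2 * n

theorem two_mul' (n : Nat) : 2 * n = n + n := by omega

example (n : Nat) : double n = n + n := by
  -- `rw [two_mul']` would fail here: the goal shows `double n`, and `rw`
  -- will not unfold the rigid `def` to find the `2 * n` pattern.
  unfold double
  rw [two_mul']
```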

reply
skanga
1 hour ago
[-]
TDD == Prompt Engineering, for Agentic coding tasks.
reply
thoughtfulchris
1 hour ago
[-]
There have been a lot of conversations recently about how model alignment is relative and diversity of alignment is important - see the recent podcast episode between Jack Clark (co-founder of Anthropic) and Ezra Klein.

Many comments here point out that Mistral's models are not keeping up with other frontier models - this has been my personal experience as well. However, we need more diversity of model alignment techniques and companies training them - so any company taking this seriously is valuable.

reply
jasonjmcghee
2 hours ago
[-]
Curious if anyone else had the same reaction as me

This model is specifically trained on this task and significantly[1] underperforms opus.

Opus costs about 6x more.

Which seems... totally worth it based on the task at hand.

[1]: based on the total spread of tested models

reply
beernet
2 hours ago
[-]
Agreed. The idea is nice and honorable. At the same time, if AI has been proving one thing, it's that quality usually reigns over control and trust (except for some sensitive sectors and applications). Of course it's less capital-intensive, so it makes sense for a comparably small EU startup to focus on that niche. Likely won't move the top-line needle much, though, for the reasons stated.
reply
segmondy
27 minutes ago
[-]
Ha, keep putting your prompts and workflows into cloud models. They are not okay with being a platform, they intend to cannibalize all businesses. Quality doesn't always reign over control and trust. Your data and original ideas are your edge and moat.
reply
miohtama
2 hours ago
[-]
The alignment tax eats directly into model quality, by double-digit percentages.
reply
hermanzegerman
1 hour ago
[-]
The EU could help them a lot if it started enforcing its laws, so that no US company can process European data, given that the Americans are unwilling to budge on the CLOUD Act.

That would also help reduce our dependency on American hyperscalers, which is much needed given how untrustworthy the US is right now. (And also hostile towards Europe, as their new security strategy lays out.)

reply
bcye
46 minutes ago
[-]
Unfortunately, this would be a rather nuclear option, due to the continent's insane reliance on technology that breaks its unenforced laws.
reply
DarkNova6
2 hours ago
[-]
I'm never sure how much faith one can put into such benchmarks but in any case the optics seem to shift once you have pass@2 and pass@3.

Still, the more interesting comparison would be against something such as Codex.

reply
esperent
1 hour ago
[-]
I absolutely called this a couple of weeks ago, nice to be vindicated!

> I'm interested to see what it is in the age of LLMs or similar future tools. I suspect a future phase change might be towards disregarding how easy it is for humans to work with the code and instead focus on provability, testing, perhaps combined with token efficiency.

> Maybe Lean combined with Rust shrunk down to something that is very compiler friendly. Imagine if you could specify what you need in high level language and instead of getting back "vibe code", you get back proven correct code, because that's the only kind of code that will successfully compile.

https://news.ycombinator.com/item?id=47192116

reply
piyh
26 minutes ago
[-]
Automated theorem provers running on a $5k piece of hardware is a cool version of the future
reply
JoshTriplett
1 hour ago
[-]
Pleasant surprise: someone saying "open source" and actually meaning Open Source. It looks like the weights are Apache-2.0 licensed.
reply
jasonjmcghee
36 minutes ago
[-]
Based on community definitions I've seen, this is considered "open weights". If you can't reproduce the model, it's not "open source"
reply
flakiness
2 hours ago
[-]
reply
jasonjmcghee
21 minutes ago
[-]
What about pass@2 for haiku and sonnet?
reply
Havoc
2 hours ago
[-]
What are these "passes" they reference here? Haven't seen that before in LLM evals

Could definitely be interesting for having another model run over the codebase when looking for improvements

reply
rockinghigh
2 hours ago
[-]
It's the number of attempts at answering the question.
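For reference, pass@k is usually reported with the unbiased estimator from the Codex paper (a sketch; `n` generations per task, `c` of them correct):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples, drawn without replacement from n generations of which c are
    correct, solves the task. Computed as 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k incorrect generations: some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

So e.g. with 16 generations of which 4 are correct, pass@2 comes out to 0.45.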
reply
patall
2 hours ago
[-]
Maybe a naive question: given that they see better performance with more passes, but the effect hits a limit after a few, would performance increase if they used different models per pass, i.e. Leanstral, Kimi, Qwen, and Leanstral again instead of 4x Leanstral?
reply
andai
2 hours ago
[-]
This is called an "LLM alloy"; you can even do it agentically, where you simply swap the model on each LLM invocation.

It does actually significantly boost performance. There was an article on here about it recently, I'll see if I can find it.

Edit: https://news.ycombinator.com/item?id=44630724

They found the more different the models were (the less overlap in correctly solved problems), the more it boosted the score.
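The per-invocation swap can be as simple as a round-robin over models (a sketch; the model names and the `call_model` interface are hypothetical):

```python
import itertools

# Hypothetical rotation -- swap in whatever models/providers you use.
MODELS = ["leanstral", "kimi-k2", "qwen3-coder"]

def alloy_run(task, call_model, max_passes=6):
    """Rotate through a fixed list of models, one per pass ("LLM alloy"),
    returning the first model/attempt that a verifier accepts."""
    rotation = itertools.cycle(MODELS)
    for attempt in range(1, max_passes + 1):
        model = next(rotation)
        result = call_model(model, task)
        if result["verified"]:
            return model, attempt
    return None, None
```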

reply
patall
2 hours ago
[-]
That sounds quite interesting. Makes me wonder if sooner or later they will have to train multiple independent models that cover those different niches. Thanks for the link.
reply
cyanydeez
2 hours ago
[-]
One would think, with LoRAs being so successful in Stable Diffusion, that more people would focus on constructing framework-specific LoRAs; but the economics of all this probably preclude going niche in any direction, so everyone just keeps building the do-all models.
reply
elAhmo
1 hour ago
[-]
I don’t know a single person using Mistral models.
reply
consumer451
1 hour ago
[-]
Isn't their latest speech to text model SOTA? When I tested it on jargon, it was amazing.

https://news.ycombinator.com/item?id=46886735

reply
badsectoracula
25 minutes ago
[-]
Pretty much all of my LLM usage has been using Mistral's open source models running on my PC. I do not do full agentic coding, as when i tried it with Devstral Small 2 it was a bit too slow (though if i could get 2-3 times the speed of my PC from a second computer it'd be a different story, and AFAIK that is doable if i was willing to spend $2-3k on it). However i've used Mistral's models for spelling and grammar checks[0], translations[1][2], summaries[3] and trying to figure out if common email SPAM avoidance tricks are pointless in the LLM age :-P [4]. FWIW the tool you can see in the shots is a Tcl/Tk script calling a llama.cpp-based command-line utility i threw together some time ago when experimenting with llama.cpp.

I've also used Devstral Small to make a simple raytracer[5][6] (it was made using the "classic" chat by copy/pasting code, not any agentic approach and i did fix bits of it in the process) and a quick-and-dirty "games database" in Python+Flask+Sqlite for my own use (mainly a game backlog DB :-P).

I also use it to make various small snippets, have it generate some boilerplate stuff (e.g. i have an enum in C and want to write a function that prints names for each enum value or have it match a string i read from a json file with the appropriate enum value), "translate" between languages (i had it recently convert some matrix code that i had written in Pascal into C), etc.

[0] https://i.imgur.com/f4OrNI5.png

[1] https://i.imgur.com/Zac3P4t.png

[2] https://i.imgur.com/jPYYKCd.png

[3] https://i.imgur.com/WZGfCdq.png

[4] https://i.imgur.com/ytYkyQW.png

[5] https://i.imgur.com/FevOm0o.png (screenshot)

[6] https://app.filen.io/#/d/e05ae468-6741-453c-a18d-e83dcc3de92... (C code)

[7] https://i.imgur.com/BzK8JtT.png

reply
Fnoord
11 minutes ago
[-]
I use them solely.
reply
pelagicAustral
1 hour ago
[-]
Me neither, they're not ready for prime time imo. I have a yearly sub and the product is just orders of magnitude behind Anthropic's offering. I use Code for real world stuff and I am happy with the result; Mistral is just not something I can trust right now.
reply
Adrig
1 hour ago
[-]
I used Ministral for data cleaning.

I was surprised: even tho it was the cheapest option (against other small models from Anthropic) it performed the best in my benchmarks.

reply
Bombthecat
53 minutes ago
[-]
Mistral is super smart with smaller contexts and answering questions about them.
reply
miacycle
1 hour ago
[-]
The TDD foundation! We might need one of those. :)
reply
htrp
1 hour ago
[-]
is the haiku comparison because they've distilled from the model?
reply
lefrenchy
2 hours ago
[-]
Does Mistral come close to Opus 4.6 with any of their models?
reply
chucky_z
2 hours ago
[-]
I use mistral-medium-3.1 for a lot of random daily tasks, along with the vibe cli. In my personal opinion, mistral is my preferred 'model vendor' by far at this point. They're extremely consistent between releases, while each one just feels better. I also have a strong personal preference for the output.

I actively use gemini-3.1-pro-preview, claude-4.6-opus-high, and gpt-5.3-codex as well. I prefer them all for different reasons, however I usually _start_ with mistral if it's an option.

reply
sa-code
2 hours ago
[-]
Why not Large 3? It's larger and cheaper
reply
DarkNova6
2 hours ago
[-]
Not at the moment, but a release of Mistral 4 seems close, which will likely bridge the gap.
reply
re-thc
2 hours ago
[-]
Mistral Small 4 is already announced.
reply
androiddrew
1 hour ago
[-]
MoE, but in the 120B range. Man, I wish it was an 80B. I have 2 GPUs with 62 GiB of usable VRAM. A 4-bit 80B gives me some context window, but 120B puts me into system RAM.
reply
tjwebbnorfolk
2 hours ago
[-]
Mistral hasn't been in the running for SOTA for quite a while now.
reply
hnipps
1 hour ago
[-]
Here we go.
reply
kittikitti
2 hours ago
[-]
This is great, congratulations to the Mistral team! I'm looking forward to the code arena benchmark results. Thanks for sharing.
reply
blurbleblurble
3 hours ago
[-]
Truly exciting
reply