Lean proved this program correct; then I found a bug
110 points
2 hours ago
| 16 comments
| kirancodes.me
| HN
sebstefan
58 seconds ago
[-]
There should be more energy spent towards making agents good at formal verification

It's the most sensible way to get out of the woods we're putting ourselves in with vibe coded projects that no one's read the code of

If we are heading in the direction of "code is an artifact for the machine", at least make it proved. Humans can review and edit a spec and leave the implementation to a machine.

reply
porcoda
2 hours ago
[-]
I’ve had similar experiences with code I’ve proven correct, although my issues were of the more common variety than the overflow issue - subtle spec bugs. (I think the post mentions the denial of service issue as related to this: a spec gap)

If you have a spec that isn’t correct, you can certainly write code that conforms to that spec and write proofs to support it. It just means you have verified a program that does something other than what you intended. This is one of the harder parts of verification: clearly expressing your intention as a human. As programs get more complex these get harder to write, which means it isn’t uncommon to have lean or rocq proofs for everything only to later find “nope, it has a bug that ultimately traces back to a subtle specification defect.” Once you’ve gone through this a few times you quickly realize that tools like lean and rocq are tricky to use effectively.

I kinda worry that the “proof assistants will fix ai correctness” will lead to a false sense of assurance if the specs that capture human intention don’t get scrutinized closely. Otherwise we’ll likely have lots of proofs for code that isn’t the code the humans actually intended due to spec flaws.

reply
jmalicki
1 hour ago
[-]
I have experience with similar things!

But that's not saying the proofs are an issue - usually the spec you can reasonably prove in lean or another prover, say TLA+ or Z3 depending on your kind of program - has to be overly simplified and have a lot of assumptions.

However, that is powerful.

It doesn't mean your program doesn't have bugs.

It means this big scary complicated algorithm you think works but are skeptical doesn't have bugs - so when you encounter one, you know the bug is elsewhere, and you start really looking at the boundaries of what could be misspecified, if the assumptions given to the prover are actually true, etc.

It eliminates the big scary thing everyone will think is the cause of the bug as the actual cause.

This has been insanely valuable to me lately. It is also something I never really was able to do before the help of AI - vibe coding proofs about my programs is IMO one of the killer apps of AI, since there aren't a ton of great resources yet about how to do it well since it is rarely done.

reply
somat
1 hour ago
[-]
Whenever I read an article about formal verification systems there is always that nagging thought in the back of my head. Why can you trust your formal verification system to be bug free but you can't trust the program. should not the chance of bugs be about equal in both of them?

You have a program that does something and you write another program to prove it. What assurance do you have that one program has fewer bugs then the other? Why can one program have bugs but the other can't? How do you prove that you are proving the right thing? It all sort of ties into Heisenberg's uncertainty theorem. A system cannot be fully described from within that system.

Don't get me wrong, I think these are great systems doing great work. But I always feel there is something missing in the narrative.

I think a more practical view is that a program is already a sort of proof. there is a something to be solved and the program provides a mechanism to prove it. but this proof may be and probably is incorrect, as bugs are fixed it gets more and more correct. A powerful but time consuming tool to try and force correctness is to build the machine twice using different mechanisms. Then mismatched output indicates something is wrong with one of them. and your job as an engineer is to figure out which one. This is what formal verification brings to the table. The second mechanism.

reply
majormajor
33 minutes ago
[-]
> Whenever I read an article about formal verification systems there is always that nagging thought in the back of my head. Why can you trust your formal verification system to be bug free but you can't trust the program. should not the chance of bugs be about equal in both of them?

A bug in the formal verification tool could be potentially noticed by any user of that formal verification tool. (And indirectly by any of their users noticing a bug about which they say "huh, I thought the tool told me that was impossible.")

A bug in your program can only be potentially noticed by you and your users.

There are also entirely categories of bugs that may not be relevant. For instance, if I'm trying to prove correctness of a distributed concurrent system and I use a model+verifier that verifies things in a sequential, non-concurrent way, then I don't have to worry about the prover having all the same sort of race conditions as my actual code.

But yeah, if you try to write your own prover to prove your own software, you could screw up either. But that's not what is being discussed here.

reply
zarzavat
23 minutes ago
[-]
Formal verification is just an extra step up from the static types that you might have in a language such as Rust.

Common static types prove many of the important properties of a program. If I declare a variable of type String then the type checker ensures that it is indeed a String. That's a proof. Formal verification takes this further and proves other properties such as the string is never empty.

Common static types are very effective. Many users of Rust or Haskell will claim that if a program compiles then it usually works correctly when they go to run it.

However there is a non-linear relationship between probability of program correctness and the amount of types required to achieve it. Being almost certain requires vastly more types than just being confident.

That's the real issue with formal verification, being 75% sure and having less code is better than being 99% sure in most situations, though if I were programming a radiotherapy machine I might think differently.

reply
CJefferson
45 minutes ago
[-]
The chances of significant bugs in lean which lead to false answers to real problems are extremely small (this bug still just caused a crash, but is still bad). Many, many people try very hard to break Lean, and think about how proofs work, and fail. Is it foolproof? No. It might have flaws, it might be logic itself is inconsistent.

I often think of the ‘news level’ of a bug. A bug in most code wouldn’t be news. A bug which caused lean to claim a real proof someone cared about was true, when it wasn’t, would in the proof community the biggest news in a decade.

reply
rdevilla
45 minutes ago
[-]
> It all sort of ties into Heisenberg's uncertainty theorem. A system cannot be fully described from within that system.

Surely you are talking about Godel incompleteness, not Heisenberg's uncertainty principle; in which case they're actually not the same system - the verification/proof language is more like a metalanguage taking the implementation language as its object.

(Godel's observation for mathematics was just that for formal number systems of sufficient power, you can embed that metalanguage into the formal number system itself.)

reply
ngruhn
32 minutes ago
[-]
I think formal verification brings a bit more to the table. The logical properties are not just a second implementation. They can be radically simpler. I think quantifiers are doing a lot of work here (forall/exists). They are not usable directly in regular code. For example, you can specify that a shortest path algorithm must satisfy:

    forall paths P from A to B:
        len(shortest(A,B)) <= len(P)
That's much simpler than any actual shortest path algorithm.
reply
raincole
42 minutes ago
[-]
> should not the chance of bugs be about equal in both of them?

Why?

Are you saying that all the programs ever written have the exact same chance of bugs? A hello world is as buggy as a vibe-coded Chromium clone?

If you accept the premise that different programs have different chances to have bugs, then I'd say:

1. Simpler programs are likely less buggy.

2. Programs used by more people are likely less buggy.

3. Programs maintained by experts who care about correctness are likely less buggy.

4. Programs where the stakes are higher are likely less buggy.

All things considered, I think it's fair to say Lean is likely less buggy then a random program written by me at weekend.

> Heisenberg's uncertainty theorem

It has nothing to do with the uncertainty principle. If you think otherwise, it means your understanding of uncertainty principle comes from sci-fi :)

reply
brookst
1 hour ago
[-]
Been thinking about this a lot recently.

I think we need a way to verify the specs. A combo of formal logic and adversarial thinking (probably from LLMs) that will produce an exhaustive list of everything the program will do, and everything it won’t do, and everything that is underspecified.

Still not quite sure what it looks like, but if you stipulate that program generation will be provable, it pushes the correctness challenge up to the spec (and once we solve that, it’ll be pushed up to the requirements…)

reply
frumplestlatz
1 hour ago
[-]
What’s important is to prove useful, high-level properties derived from the specs. The specs of program behavior are just the price of admission.
reply
brookst
19 minutes ago
[-]
I agree. It’s kind of like secure boot, in reverse: the high level stuff has to be complete and correct enough that the next level down has a chance to be complete and correct.
reply
ctmnt
1 hour ago
[-]
This article’s framing and title are odd. The author, in fact, found no bugs or errors in the proven code. She says so at the end of the article:

> The two bugs that were found both sat outside the boundary of what the proofs cover. The denial-of-service was a missing specification. The heap overflow was a deeper issue in the trusted computing base, the C++ runtime that the entire proof edifice assumes is correct.

Still an interesting and useful result to find a bug in the Lean runtime, but I’d argue that doesn’t justify the title. Or the claim that “the entire proof edifice” is somehow shaky.

It’s important to note that this is the Lean runtime that has a bug, not the Lean kernel, which is the part that actually does the verification (aka proving). [1] So it’s not even immediately clear what this bug would really apply to, since obviously no one’s running any compiled Lean code in any kind of production hot path.

[1] https://lean-lang.org/doc/reference/latest/Elaboration-and-C...

reply
gopiandcode
1 hour ago
[-]
Repeating myself, when we speak of bugs in a verified software system, I think it's fair to consider the entire binary a fair target.

If a buffer overflow causes the system to be exploited and all your bitcoins to be stolen, I don't think the fact that the bug being in the language runtime is going to be much consolation. Especially if the software you were running was advertised as formally verified as free of bugs.

Second, there was a bug in the code. Maybe not a functional correctness bug, but I, along with many and most end users, would consider a crashing program buggy. Maybe we just have different tastes or different standards on what we consider an acceptable level of software quality.

W.r.t people running Lean in production, you'd be surprised...

reply
raincole
1 hour ago
[-]
I found the list of articles on this site amusing:

https://kirancodes.me/posts/log-who-watches-the-watchers.htm...

You can see the clickbaitiness increases over time.

reply
minimaltom
1 hour ago
[-]
Looks like a normal distribution about the chaos mean to me. I appreciate its not everyones cup of tea, but I like this style of writing.
reply
youknownothing
1 hour ago
[-]
I'll probably get a lot of hate mail for this but here goes nothing... Despite what many people like to claim, you cannot prove that a program has no bugs. That means proving the absence of bugs, and you cannot prove a negative. The best thing you can do is fail to find a bug, but that doesn't mean it isn't there.

Before everyone starts blabbing about formal verification, etc., consider this: how do you know that you didn't make a mistake in your formal verification? IOW, how do you know your formal verification is bug-free? Answer: you don't. Or if you try to formally verify your formal verification then you're just translating the problem to a new layer. It's just a chain of proofs that is always ultimately based on an unproven one, which invalidates the whole chain.

You can get asymptotically close to zero-bug proof, but you can never get there 100% of the way.

reply
thatguysaguy
57 minutes ago
[-]
What is up with people saying you cannot prove a negative? Of course you can! (At least in formal settings)

For example it's extremely easy to prove there is no square with diagonals of different lengths. I'm the hard end, Andrew Wiles proved Fermat's Last Theorem which expresses a negative.

That's just a nit though, you're right about the infinite regress problem.

reply
pxc
50 minutes ago
[-]
It's pretty easy to prove lots of negatives outside of mathematics, too. It's easy to prove there's no elephant in my living room at the moment.
reply
Paracompact
23 minutes ago
[-]
Formal methods practitioner here.

> That means proving the absence of bugs, and you cannot prove a negative. The best thing you can do is fail to find a bug, but that doesn't mean it isn't there.

You can conclusively (up my next point) prove a specific bug or class of bugs aren't there. But "entirely free of (all) bugs" is indeed a big misnomer for what formal methods does.

> how do you know your formal verification is bug-free? Answer: you don't. Or if you try to formally verify your formal verification then you're just translating the problem to a new layer. It's just a chain of proofs that is always ultimately based on an unproven one, which invalidates the whole chain.

It's another misnomer of formal methods to say that any result is established conclusively, without any caveats whatsoever. But then again neither is mathematics, or any other intellectual discipline. What formal methods does is reduce the surface area where mistakes could reasonably be expected to reside. Trusting the Rocq kernel, or a highly scrutinized model of computation and language semantics, is much easier than trusting the totality of random unannotated code residing in the foggiest depths of your average C compiler, for instance.

reply
zitterbewegung
1 hour ago
[-]
The better look at this is not to prove that you have no bugs is that to prove that a program conforms to a specification that Lean can verify.

Even with the addition of two numbers the execution of a program can be wrong if the CPU has a fault or if the runtime of the program has a bug.

I think you just need to look at why formal verification exists.

reply
lmm
52 minutes ago
[-]
> That means proving the absence of bugs, and you cannot prove a negative.

You can prove that the program implements a specification correctly. That doesn't require proving a negative, but it does prove the absence of bugs. (I think you know this argument is weak, since your next paragraph is a complete non-sequitur)

> Or if you try to formally verify your formal verification then you're just translating the problem to a new layer. It's just a chain of proofs that is always ultimately based on an unproven one, which invalidates the whole chain.

All proofs ultimately rest on axioms. That's normal and not really an argument unless you want to doubt everything, in which case what's the point in ever saying anything?

reply
_alternator_
54 minutes ago
[-]
This comment conflates a number of ideas that obscures its own epistemological point. For example, yes, you can prove a negative (for instance, you can prove there is no largest integer).

The question of who validates the validator is real, but the abstraction of "formal verification" does serve a purpose, like any good mathematical or programming abstraction. Whole classes of bugs are removed; what's left to verify is usually much shorter.

reply
kvuj
52 minutes ago
[-]
This feels like a thought exercise rather than an argument.

By your logic, it's impossible to prove that a car is driving at 60mph. There could be an error in the speedometer which makes it impossible to verify that said car is going at the speed. You can get asymptomatically close to being sure that you're driving at 60 mph but you can never be 100% sure.

This is useless and serves no purpose.

reply
Gigachad
48 minutes ago
[-]
That's not correct. You can prove a car is driving 60mph as soon as you measure it doing that. "proving a negative" is for statements like "There are no purple zebra". You can never prove this because there is always the possibility the purple zebra is somewhere you haven't looked yet. As soon as you find one the statement becomes falsified, but until then it always remains unresolved even if almost certainly true.

Linking back to the parent statement, it's hard to prove a program has no bugs when there is always the possibility the bug just hasn't been found yet. On the flip side it's easy to prove something does have bugs as soon as you find one.

reply
michaelmrose
38 minutes ago
[-]
You can prove there is no purple zebra on earth by actually surveying the population of zebras which is finite.
reply
aaronblohowiak
59 minutes ago
[-]
I disagree but the constraints required to get there are probably untenable for most practical applications, so in practice I agree.
reply
grg0
1 hour ago
[-]
Clickbait title, the proved part of the program had no bugs?

As an aside, why can't people just write factually? This isn't a news site gamed for ad revenue. It's also less effort. I felt this post was mostly an insulting waste of time. I come to HN to read interesting stuff.

reply
minimaltom
59 minutes ago
[-]
Is it fair when it comes to formally-verified software to only consider bugs that violate a proof, and ignore everything else?

Formally-verified software is usually advertised "look ma no bugs!" Not "look ma no bugs*" *As long as this complicated chain of lemmas appropriately represents the correctness of everything we care about.

In boating theres often debate of right of way rules in certain situations, and some people are quick to point out that giant tanker ships should be giving way to tiny sailboats and get all worked up about it*. The best answer I've heard: they're dead right! that is to say as right as they are dead (if they didnt yield) lol. In the same vein, I think someone who assumed that a formally-verified software was perfect and got hacked or whatever is going to be a bit wiggly about the whole thing.

* = Technically the rules prioritize the tankers if they are "restricted in ability to maneuver" but everyone loves to argue about that.

reply
grg0
5 minutes ago
[-]
Nobody with experience in the field advertises formally-verified software like that, and it is understood that the spec may as well be wrong. It is also understood that the non-verified parts may have bugs (surprise). There is no news here.
reply
nickvec
1 hour ago
[-]
Yeah, extremely misleading title even if it is technically true semantically. The phrasing gives the impression that a bug was found in `lean-zip` as part of the proof boundary when it was part of the unverified archive-handling code.
reply
minimaltom
53 minutes ago
[-]
The archive-handling code was in lean-zip, it just seems the verifiers forgot to write proofs for it (still a bug).

Thats not the main finding of the article however. The main bug found was actually in the lean runtime, affecting all proofs using scalar arrays where the size of the array is not bounded.

reply
grg0
1 hour ago
[-]
And it took them several thousand words to explain what you just said in a sentence.
reply
ozten
2 minutes ago
[-]
HN to be renamed Flex Social. Damn.
reply
davesque
1 hour ago
[-]
Am I reading the article wrong? It appears that the author did not test the claims of the proof. Wouldn't a "bug" in this case mean she found an input that did not survive a round trip through the compression algorithm?

Update: Actually, I guess this may have been her point: "The two bugs that were found both sat outside the boundary of what the proofs cover." So then I guess the title might be a bit click baity.

reply
gopiandcode
1 hour ago
[-]
Hi! Author here. When we speak of bugs in a verified software system, I think it's fair to consider the entire binary a fair target.

If a buffer overflow causes the system to be exploited and all your bitcoins to be stolen, I don't think the fact that the bug being in the language runtime is going to be much consolation. Especially if the software you were running was advertised as formally verified as free of bugs.

Secondly, I did find a bug in the algorithm. in Archive.lean, in the parsing of the compressed archive headers. That was the crashing input.

reply
davesque
44 minutes ago
[-]
Thanks for responding!

> When we speak of bugs in a verified software system, I think it's fair to consider the entire binary a fair target.

Yeah, I would actually agree. We wouldn't want to advertise that a system is formally verified in some way if that creates a false sense of security. I was just pointing out that, by my reading, the title appears to suggest that the core mechanism of the Lean proof is somehow flawed. When I read the title, I immediately thought, "Oooh. Looks like someone demonstrated a flaw in the proof. Neat." But that's not what is shown in the article. Just feels a bit misleading is all.

reply
quantummagic
1 hour ago
[-]
> I think it's fair to consider the entire binary a fair target.

Yes, it's still very much a bug. But it has nothing to do with your program being formally verified or not. Formal verification can do nothing about any unverified code you rely on. You would really need a formal verification of every piece of hardware, the operating system, the runtime, and your application code. Short of that, nobody should expect formal verification to ensure there are no bugs.

reply
appplication
58 minutes ago
[-]
I read it as that’s also the point. Adding formal verification is not a strict defense against bugs. It is in a way similar to having 100% test coverage and finding bugs in your untested edge cases.

I don’t think the author is attempting to decry formal verification, but I think it a good message in the article everyone should keep in mind that safety is a larger, whole system process and bugs live in the cracks and interfaces.

reply
lmm
1 hour ago
[-]
Not verifying the parser seems like a pretty big oversight. Parsing binary formats is notoriously dangerous!
reply
mindcrime
1 hour ago
[-]
"Beware of bugs in the above code; I have only proved it correct, not tried it." -- Donald Knuth
reply
vatsachak
40 minutes ago
[-]
You guys are missing the forest for the trees.

They used an AI agent sending ideas to a fuzzer and discovered a heap buffer overflow in Lean. This is big.

reply
dchftcs
1 hour ago
[-]
This is analogous to the fundamental problem of better automation in programming - eventually, the complexity and correctness of of the spec takes over, and if we don't manage that well, creating the spec is not that much less work than the programming part. If your program was for the wrong thing, a proof of it is also wrong.
reply
butvacuum
1 hour ago
[-]
everybody also ignores that even hello world isn't deterministic anymore. It just doesn't matter to execution if something broke unless it kicks back an error.

although, this is the best example of how quickly a trivality can knock so called "correct" programs over.

reply
jongjong
1 hour ago
[-]
Fully agree. I started hitting this bottleneck when I combined a low-code backend I built with Claude Code to generate web applications.

I can build applications rapidly but the requirements and UX are the bottleneck. So much so that I often like to sit on a concept for multiple days to give myself the time to fully absorb the goal and refine the requirements. Then once I know what to build, it snaps together in like 4 hours.

There are a lot of ambiguities which need to be resolved ahead of time. Software engineering becomes a kind of detailed business strategy role.

reply
spullara
1 hour ago
[-]
claude making a statement that sounds impressive but it is actually the first codebase it has ever analyzed.

"This is genuinely one of the most memory-safe codebases I've analyzed."

reply
akoboldfrying
23 minutes ago
[-]
Nice work. Amusing that Lean's own standard library has a buffer overflow bug, which "leaked out" due to being exempted from the verification.

Regarding the DoS in the lean-zip application itself: I think this is a really neat example of the difficult problem of spec completeness, which is a subcase of the general problem (mentioned by porcoda in a sibling comment) of being sure that the spec is checking the right things. For a compression program, the natural, and I would also say satisfyingly beautiful thing to prove is that decomp(comp(x)) = x for all possible inputs x. It's tempting to think that at that point, "It's proven!" But of course the real world can call decomp() on something that has never been through comp() at all, and this simple, beautiful spec is completely silent on what will happen then.

reply
readthenotes1
1 hour ago
[-]
I didn't like the clickbait title. I would have preferred something along the lines of

"Lean proves other program correct but not itself"

reply
jongjong
1 hour ago
[-]
Formal proofs can only ever work to validate against requirements... But a major issue with a lot of modern software is that the requirements themselves can be incorrect and some requirements can logically conflict with other requirements... So long as we have non-technical people formulating requirements, we can never have provably correct software.

I've come across issues in the past which weren't actually bugs. For example, the software was behaving exactly as intended but it looked like a bug to a user who didn't understand the nuances and complexities of what was happening.

I also cannot count how many times a non-technical person asked me to implement conflicting functionality or functionality which would have made the UX incredibly confusing for the user.

reply
ernsheong
1 hour ago
[-]
Alan Turing already proved with the Halting Problem that reasoning about program correctness is not possible. But we still try.

Wikipedia: [1] Turing proved no algorithm exists that always correctly decides whether, for a given arbitrary program and input, the program halts when run with that input. The essence of Turing's proof is that any such algorithm can be made to produce contradictory output and therefore cannot be correct.

[1] https://en.wikipedia.org/wiki/Halting_problem

reply
muraiki
1 hour ago
[-]
When you write a recursive function, Lean’s kernel requires a termination proof, unless the function is a partial or marked as unsafe. In those cases, they can’t be used in proofs. https://lean-lang.org/doc/reference/latest/Definitions/Recur...
reply
DonaldPShimoda
1 hour ago
[-]
> Alan Turing already proved with the Halting Problem that reasoning about program correctness is not possible.

This is so reductive a framing as to be essentially useless [0]. I think maybe you want to learn more about program correctness, formal verification, and programming language semantics before you make such statements in the future.

[0] See, e.g., type-checking.

reply
ashton314
1 hour ago
[-]
> reasoning about program correctness is not possible

Not possible for all problems. We cannot decide correctness (ie adherence to a specification) for all programs, but we can definitely recognize a good chunk of cases (both positive and negative) that are useful.

The Halting Problem itself is recognizable. The surprising result of Turing’s work was that we can’t decide it.

reply
lmm
49 minutes ago
[-]
And Alonzo Church proved in 1940 that you can avoid this problem by using a typed language in which all programs halt. But sadly some programmers still resist this.
reply
AnimalMuppet
42 minutes ago
[-]
Some correct programs are supposed to run forever.

When is an OS supposed to halt? When you shut it down, or when you power down the hardware, and no other times. So if you don't do either of those things, then the OS is supposed to run forever. Does that, by itself, mean that the program is incorrect, or that the language is inadequate? No, it means that the definition is worthless (or at least worthless for programs like OSes).

reply
gopiandcode
20 minutes ago
[-]
you can still verify arbitrarily long running programs - there are instances of such software, such as sel4 (https://sel4.systems/) and certikos (https://flint.cs.yale.edu/certikos/), you simply model them as finite programs that run on an infinite stream of events.
reply
Rusky
18 minutes ago
[-]
This is not actually a problem for total languages, which simply model these kinds of processes using corecursion/coinduction/codata.
reply
mkl
1 hour ago
[-]
We care only about a very small and narrow subset of possible programs, not any arbitrary one. It's possible to solve this kind of problem in large enough classes of program to be useful.
reply
raincole
1 hour ago
[-]
> a given arbitrary program and input

Keyword emphasis mine.

reply
grg0
1 hour ago
[-]
That one little word that changes everything in how you interpret the statement.
reply