If we disregard programming and just look at formalizing math (Christian Szegedy has been doing it for a long time now), the length of proofs being formalized is growing exponentially, and there's a good chance that in 2026 close to 100% of big/important human-written proofs will be translated to and verified by Lean.
Just as an example for programming / modelling cache lines and cycle counts: we have quite good models for lots of architectures (even quite good reverse-engineered models for NVIDIA GPUs in some papers). The problem is that calculating exact numbers for cache reads / writes is boring, with lots of constants involved, and whenever we change the model a little bit the calculations have to be redone.
It's a lot of boring constraints to solve, and the main bottleneck for me when I was trying to do it by hand was that I couldn't just trust the output of LLMs.
Not that relevant in context as the code in question is used to conclude a formal proof, not the other way around. But hey, it is a common quote when talking about proving software and someone has to do it...
Context: https://staff.fnwi.uva.nl/p.vanemdeboas/knuthnote.pdf
Type errors, especially once you have designed your types to be correct by construction, are extremely, extremely useful for LLMs. Once you have the foundation correct, they just have to wiggle through that narrow gap until they figure out something that fits.
But from what I understood and read so far, I am not convinced of OP's "formal verification". A simple litmus test is to take one of your recent day-job tasks and try to describe a formal specification of it. Is it even doable? Reasonable? Is it even there? For me the most useful kind of verification is the verification of the lower-level tools, i.e. data structures, language, compilers, etc.
For example, the type signature of Vec::operator[usize] in Rust returns T. This cannot be true because it cannot guarantee to return a T given ANY usize. To me, panic is the laziest and worst way to write a specification. It means that every single line of Rust code is now able to enter this termination state.
I wanted to throw a shoe at him. A static type check doesn't stand in for "a" unit test; static typing stands in for an unbounded number of unit tests.
Put another way, this common misconception by users of languages like Javascript and Python that unit testing is just as good as type checking (plus more flexible) is a confusion between the "exists" and "for all" logical operators.
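A minimal Python sketch of the "exists" versus "for all" distinction described above. The function `double` and its inputs are hypothetical, chosen only for illustration: each assertion vouches for one input somebody thought of, while an annotation checked by a static tool such as mypy or pyright constrains every call site.

```python
def double(x: int) -> int:
    """Hypothetical example function: intended for integers only."""
    return x * 2


# "Exists"-style evidence: each unit test checks one hand-picked input.
assert double(2) == 4
assert double(0) == 0
assert double(-3) == -6

# An input no test covered. Python does not enforce annotations at runtime,
# so this runs and silently repeats the string instead of doubling a number.
surprise = double("2")  # a static checker flags this call as a type error
print(repr(surprise))
```

All three tests pass, yet the last call still misbehaves; only a check that ranges over every caller rules it out.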
I have tried both and I have no idea what you're talking about.
> Making yourself think about “for all x” rather than a concrete x forces your brain to consider deeply the properties of x being used.
The entire point of dynamic typing is that you can think about interfaces rather than concrete types, which entails deep consideration of the properties of the object (semantics of the provided interface).
This is not an original argument. Rich Hickey made a similar argument in his "Simple Made Easy" talk in 2011, though his focus was on the fact that every bug that exists in a software system has passed unnoticed through both a type checker and a test suite. And even before that, similar ideas of test suites being a suitable replacement for a type checker had percolated through the Python and Ruby communities, too.
I distinctly remember that the "tests make static type checks unnecessary" view was in fact so prevalent in the JavaScript community that TypeScript had a really hard time getting adoption in its first 3-4 years, and only the introduction of VSCode in 2015 and the subsequent growth of its market share over Atom and Sublime Text got more people exposed to TypeScript and the benefits of a type checker. Overall it took almost 10 years for TypeScript to become the "default" language for web projects.
Besides, it's not like types don't matter in dynamically typed languages. The (competent) programmer still needs to keep types in their head while programming. "Can this function work with a float, or must I pass an int?" "This function expects an iterable, but what happens if I pass a string?" Etc.
I started my career with JavaScript and Python, but over the years I've come to the conclusion that a language that hides types from programmers and does implicit conversion magic in the background does not deliver a better DX. It might make the language more approachable initially, and the idea of faster prototyping might be appealing, but it very quickly leads to maintenance problems and bugs. Before type hinting tools for Python became popular, I worked on many projects where `TypeError` was the #1 exception in Sentry by a large margin.
Gradual and optional typing is better than nothing, but IME if the language doesn't require it, most programmers are lazy and will do the bare minimum to properly add type declarations. Especially with things like TypeScript, which makes many declarations difficult to read, write, and understand.
I think that type inference is a solid middle ground. Types are still statically declared, but the compiler is smart enough to not bother the developer when the type is obvious.
My experience is radically different. `ValueError` is far more common in my un-annotated Python, and the most common cause of `TypeError` anyway is the wrong order or number of arguments after a refactoring.
How is your experience different?
You have conflated "a static type check" with "static typing". Unit tests stand in, in the same way, for an unbounded number of states of real-world input. They're simply being subjected to a trial verification system rather than a proof system. It turns out that writing proofs is not very many people's idea of a good time, even in the programming world. And the concept of "type" that's normally grokked is anemic anyway.
> Put another way...
Rhetoric like this is unconvincing and frankly insulting. You pass off your taste and opinion as fact, while failing to understand opposed arguments.
You should have!
Start using properties and it is in the thousands.
Most code should be typed. Python is great for prototypes, but once the prototype gels, you need types.
As a trivial example, if I create a type alias from “string” to “foobarId,” I now (assuming a compliant language) can prevent code that consumes foobarIds from accidentally consuming a string.
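As a rough sketch of that idea in Python: a plain alias (`FoobarId = str`) would not be enough, so this uses `typing.NewType` as a stand-in for the distinct ID type. The names `FoobarId` and `load_foobar` are hypothetical, and the rejection happens only in a static checker such as mypy or pyright, not at runtime.

```python
from typing import NewType

# Hypothetical ID type; at runtime NewType is an identity function over str.
FoobarId = NewType("FoobarId", str)


def load_foobar(foobar_id: FoobarId) -> str:
    """Hypothetical consumer that should only ever receive a FoobarId."""
    return f"foobar:{foobar_id}"


raw = "42"
result = load_foobar(FoobarId(raw))  # explicit conversion: accepted
print(result)

# load_foobar(raw)  # rejected by a static checker, although it would run
```

The explicit `FoobarId(...)` call marks the one place where an arbitrary string is promoted to an ID, which is where validation would naturally go.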
Finally set up neovim with pyright; use types on every single fucking thing, and now I love Python[1].
Can't wait to see TC39 become a reality (learned about it just this past week on HN, actually). Maybe I'll enjoy Javascript too.
--------------------
[1] Within reason; the packaging experience is still extremely poor!
When you're at a level of theory where terms like "type constructor" are natural, it's unreasonable to expect any of it to be applicable to Python. This is why the Haskell people speak of dynamically-typed languages in the Python mold as "untyped" regardless of their attitude towards implicit casts.
And I love it, and have been using it for decades, and write beautiful things where the annotations hardly ever seem worth the effort — perhaps for documentation, but not for a static checker. Then I look at other, newer Pythonistas trying to figure out how to write complex generic type expressions (and sacrificing backwards compatibility as they keep up with the churn of Python figuring out how to offer useful annotation syntax) and deal with covariance vs contravariance etc. and I just smile.
If the type was a type, you'd not be able to use it as a value.
Erlang and Clojure were the early ones, TypeScript followed, and now Python, Ruby, and even Perl have ways to specify types and type check your programs.
You can run a third party linter on those comments, but you must hope that they're correct. There are usually some checks for that, but they're only reliable in trivial cases.
This is not static typing any more than "you can use emscripten to transpile JavaScript to C" means that JavaScript is a low level language with native assembly support. It's a huge step forward from "no system at all" and I'm thrilled it exists, but it's hardly the same thing.
install uv and then
uvx mypy
should do the rest.
"Optional typing" is not the same as "Static typing".
Great, my program will crash, because I forgot to opt-in to typing :-/
And? Void pointers are not the default type :-/
With Python I have to do extra work to get type errors.
With C I have to do extra work to hide the type errors.
I am battling to understand the point you are making.
C is statically typed, but weakly typed - you need to throw away types to do a bunch of run-of-the-mill things. Python is dynamically typed, but strongly typed, where it will just fail if types don't resolve.
C# and C++ are both statically typed and strongly typed, although C# more than C++ in practice.
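A small sketch of the "dynamically but strongly typed" half of that distinction, using a hypothetical un-annotated `add` function. Nothing is checked before the call runs, but when the operand types do not fit, Python raises instead of silently coercing.

```python
def add(a, b):
    """Un-annotated on purpose: nothing is checked until the call runs."""
    return a + b


print(add(1, 2))      # both ints: arithmetic
print(add("1", "2"))  # both strings: concatenation

try:
    add("1", 1)  # mixed types: Python refuses to coerce
    raised = False
except TypeError:
    raised = True

print(raised)
```

The failure is real but late: it only shows up on the code path that actually executes the mixed call.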
What's wrong with C#'s:
System.Collections.Generic.SortedList<DoBDateTime, PersonRecord>
?
That is the literal converse of the claim in the response to that comment arguing that the comment stated that all unit tests can be replaced with type checks. Those are not at all the same claim.
To make it even more clear the comment said: I saw a talk that said Type Check -> Unit Test. I said that is silly.
Response said: Unit Test -> Type Check is not reasonable. So clearly your claim that Type Check -> Unit Test is silly is wrong.
They stand in for the banal unit tests.
Now, I wouldn't necessarily use Clojure on a huge multi-organization codebase (maybe it's fine, this is outside of my experience with it), but it can be the right tool for some jobs.
Like any random JS/PHP app is probably a huge pile of loops and if statements. To track what happens to the data, you need to run the whole program in your head. “And now it adds that property to the object in the outer scope, and now that object gets sorted, now it hits the database… ok…”. Whereas in Clojure most functions are either a single atomic transformation of a set of data, or a batch of side effects. You still have to run it through your head, but you can do it more piece-by-piece instead of having to understand a 1,000-line method with class state being auto-loaded and mutated all over the place. Also you have a REPL to try stuff out as you go.
Don't get me wrong, I LOVE static types. Statically typed Clojure would be the best fckin language ever. But there is definitely a wide gulf between a dynamic language like JS, and one like Clojure!
Nothing really forces you to write imperative code in a large fraction of cases, and typically the state-change operations can be quite localized within the code. And of course JavaScript and Python both also have REPLs.
Though these days fresh TypeScript codebases are usually pretty decent. I love TypeScript and it’s really nice to work with a well-typed, modern project with proper schema validation and such. Def miss that in Clojure.
Also I wouldn’t really compare JS's or Python’s REPL to Clojure’s. Python’s is useful, but I pretty much live inside the Clojure REPL.
Well, if you also like Common Lisp, there's Coalton, which is Common Lisp with a Haskell-like type system: https://coalton-lang.github.io/
Regarding your point on Rust, the vast majority of software has nowhere near the amount of static guarantees provided by Rust. If you need more, use static memory allocation, that's what people do for safety critical systems. By the way, it seems that Rust aborts on OOM errors, not panics: https://github.com/rust-lang/rust/issues/43596
Lean or TLA+ are to Rust/Java/Haskell's type systems what algebraic topology and non-linear PDEs are to "one potato, two potatoes". The level of "correctness" achievable with such simple type systems is so negligible in comparison to the things you can express and prove in rich formal mathematics languages that they barely leave an impression (they do make some grunt work easier, but if we're talking about a world where a machine can do the more complicated things, a little more grunt work doesn't matter).
This is why the "existing programs don't have specs!" hand-wringing is entirely premature. Just about every code base today has error modes the authors think won't happen.
All you have to do is start proving they won't happen. And if you do this, you will begin a long journey that ends up with a formal spec for, at least, a good part of your program.
Proving the panics are dead code is a Socratic method, between you and the proof assistant / type checker, for figuring out what your program is and what you want it to be :).
“It affects point number 1 because AI-assisted programming is a very natural fit fot specification-driven development.”
made me smile. Reading something hand made that hadn’t been through the filters and presses of modern internet writing.
This really resonates. We can write code a lot faster than we can safely deploy it at the moment.
We always could. That has been true since the days we programmed computers by plugging jumper wires into a panel.
That's news to me, and I'm an ancient greybeard in development.
If you have a team of 1x f/time developer and 1x f/time tester, the tester would be spending about half their day doing nothing.
Right now, a single developer with Claude code can very easily overwhelm even a couple of testers with new code to test.
That's because the developer would be spending 2/3 of their day fixing the problems the tester already found.
And the time spent writing new code has always been a rounding error from 0.
Y'all have dedicated testers!? In 14 years of development, across FAANG and startup, this has never been true for me. The closest I've come is a brief period when a group of ~7 teams were able to call on the services of two testers. As you can imagine, with that ratio, the testers were not spending much time doing nothing.
In the FAANG and startup world that I worked in, there was no QA department, so I assume that FAANGs and startups don't have a dedicated and autonomous/independent QA department.
That's not the point I was making, though. The point is that we could never emit code faster than we could deploy it. Deployment (including QA) was always 2x to 4x as fast. Sometimes as much as 10x as fast.
=================
EDIT: Of course, I've been working for about twice the number of years as you, and back in those days it was pretty common for large companies to have dedicated QA. Even Microsoft had those :-)
This has always been the case?
But you don't really need complete formal verification to get these benefits. TDD gets you a lot of them as well. Perhaps your verification is less certain, but it's much easier to get high automated test coverage than it is to get a formally verifiable codebase.
I think AI assisted coding is going to cause a resurgence of interest in XP (https://en.wikipedia.org/wiki/Extreme_programming) since AI is a great fit for two big parts of XP. AI makes it easy to write well-tested code. The "pairing" method of writing code is also a great model for interacting with an AI assistant (much better than the vibe-coding model).
But if you only fill out one side of the ledger, so to speak, an LLM will happily invent something that ensures that it is balanced, even where your side of the entry is completely wrong. So while this type of development is an improvement over blindly trusting an arbitrary prompt without any checks and balances, it doesn't really get us to truly verifying the code to the same degree we were able to achieve before. This remains an unsolved problem.
Also, tests and code are independent, while in double-entry you always affect both sides. Audits exist for a reason.
I don’t quite agree with that reasoning, however, because a test that fails to test the property it should test for is a very different kind of error than having an error in the implementation of that property. You don’t have to make the “same” error on both sides for an error to remain unnoticed. Compared to bookkeeping, a single random error in either the tests or the implementation is more likely to remain unnoticed.
Yeah, but it's very different from tests vs. code though, right? Every entry has at least two sides and you do them together; they are not independent like tests and code.
You can easily make a mistake if you write a wrong entry and it will still balance. Balanced books =/= accurate books is my point. And there is no difference between "code" and "tests" in double entry, it's all just "code".
So it seems like the person who made the metaphor doesn't really know how double-entry works or took maybe one accounting class.
The point of the current thread is that the use of AI coding agents threatens to disrupt that. For example, they could observe a true positive test failure and opt to modify the test to ensure a pass instead.
Naturally. Hence "high confidence" and not "full confidence". But let's not travel too far into the weeds here. Getting us back on track, what about the concept of "high confidence" is not understandable?
- Lean will optimize Peano arithmetic with binary bignums under the hood
- Property based checking and proof search already exist on a continuum, because counterexamples are a valid (dis)proof technique. This should surprise no writer of tactics.
- the lack of formal specs for existing software should become less a problem for greenfield software after these techniques go mainstream. People will be incentivized to actually figure out what they want, and successfully doing so vastly improves project management.
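The continuum between property checking and proof mentioned in the list above can be sketched in a few lines. This is a stdlib-only illustration, not a real property-based testing library: the helper `find_counterexample` is hypothetical, and it enumerates a small fixed domain rather than generating and shrinking random inputs.

```python
from typing import Callable, Iterable, Optional


def find_counterexample(
    prop: Callable[[int], bool], domain: Iterable[int]
) -> Optional[int]:
    """Return the first value in `domain` that falsifies `prop`, else None."""
    for x in domain:
        if not prop(x):
            return x
    return None


small_ints = range(-100, 101)

# False claim: "x * x > x for every integer". One counterexample refutes it.
refuted = find_counterexample(lambda x: x * x > x, small_ints)

# True claim on this domain: no counterexample is found, which is evidence
# over the checked range only, not a proof for all integers.
unrefuted = find_counterexample(lambda x: abs(x) >= 0, small_ints)

print(refuted, unrefuted)
```

A single falsifying value is a complete disproof, whereas a clean run is only as strong as the domain it covered; that asymmetry is what a proof closes.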
Finally, and most importantly, people thinking that there is a "big specification" and then a "big implementation" are totally missing the mark. Remember tools like Lean are just More Types. When we program with types, do we have a single big type and a single untyped term, paired together? Absolutely not.
As always, the key to productive software development is more and more libraries. Fancier types will allow writing more interesting libraries that tackle the "reusable core" of many tasks.
For example, do you want to write a "polymorphic web app" that can be instantiated with an arbitrary SQL schema? Ideas like that become describable.
You had me until this statement. The idea that "more and more libraries" is going to solve the (rather large) quality problems we have in the software industry is .. misguided.
see:
Most people use shit libraries in shit languages. NPM slopfests have no bearing on what I'm talking about.
Less is more, including other people’s libraries.
That's the AGI I want to see.
AI will make formal verification go mainstream
All the problems mentioned in the article are serious. They're also easier than the problem of getting an AI to automatically prove at least hundreds of correctness properties on programs that are hundreds of thousands, if not millions, of lines long. Bringing higher mathematics into the discussion is also unhelpful. Proofs of interesting mathematical theorems require ingenuity and creativity that isn't needed in proving software correct, but they also require orders of magnitude fewer lemmas and inference steps. We're talking 100-1000 lines of proof per line of program code.
I don't know when AI will be able to do all that, but I see no reason to believe that a computer that can do that wouldn't also be able to reconcile the formal statements of correctness properties with informal requirements, and even match the requirements themselves to market needs.
> This makes formal verification a prime target for AI-assisted programming. Given that we have a formal specification, we can just let the machine wander around for hours, days, even weeks.
Is this sentiment completely discounting that there can be many possible ways to write a program that satisfies certain requirements, all with correct outputs? Won’t many of these be terrible in terms of performance, time complexity, etc.? I know that in the most trivial case, AI doesn’t jump straight to O(n^3) solutions or anything, but there’s also no guarantee it won’t have bugs that degrade performance as long as they don’t interfere with technical correctness.
Also, are we also pretending that having Claude spin for “even weeks” is free?
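To make the performance concern above concrete, here is a small sketch with two hypothetical functions that satisfy the same functional specification ("return True iff some value appears more than once") at very different cost. A specification that says nothing about running time cannot tell them apart.

```python
from typing import Sequence


def has_duplicate_quadratic(xs: Sequence[int]) -> bool:
    """Compare every pair: O(n^2) comparisons."""
    return any(
        xs[i] == xs[j] for i in range(len(xs)) for j in range(i + 1, len(xs))
    )


def has_duplicate_linear(xs: Sequence[int]) -> bool:
    """Track seen values in a set: O(n) expected time."""
    seen = set()
    for x in xs:
        if x in seen:
            return True
        seen.add(x)
    return False


cases = [[], [1], [1, 2, 3], [1, 2, 1], [5, 5]]
agree = all(has_duplicate_quadratic(c) == has_duplicate_linear(c) for c in cases)
print(agree)
```

Both versions would pass the same correctness proof, so any cost bound has to be stated as its own requirement.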
Verifying realtime software goes even further and enforces an upper bound on the maximum number of ticks it takes to complete the algorithm in all cases.
To me, this reads as an insurmountably high hurdle for the application domain. We're talking about trying to verify systems which are produced very quickly by AIs. If the verification step is glacially slow (which, by any measure, a million cycles to add two integers is), I don't see how this could be considered a tractable solution.
Some software needs formal verification, but all software needs testing.
On another subject...
> Tests are great at finding bugs ... but they cannot prove the absence of bugs.
I wish more people understood this.
Why can't we just prove theorems about the standard two's complement integers, instead of Nat?
"Imagine market infrastructure where agents must prove, before executing, that their actions satisfy regulatory constraints, risk limits, fairness properties, and eventually machine-checkable proofs of Pareto efficiency of market mechanisms. This is a big, hairy, ambitious goal. Not “we reviewed the code” but “the system verified the proof.” The agent that cannot demonstrate compliance cannot act."
https://sdiehl.github.io/zero-to-qed/20_artificial_intellige...
How I learned to deploy faster.
...but we will be modelling those 5-10kLOC modules across multiple services doing critical business logic or distributed transactions. This was unthinkable a couple of months ago and today is a read-only-Friday experiment away (try it with a frontier model and you'll be surprised).
Thanks for the article. Perhaps you could write a follow-up article or tutorial on your favored approach, Verification-Guided Development? This is new to most people, including myself, and you only briefly touch on it after spending most of the article on what you don't like.
Good luck with your degree!
P.S. Some links in your Research page are placeholders or broken.
1: https://blog.regehr.org/archives/482 there were many issues here, not just with CompCert
This nonsense again. No. No it isn’t.
I’m sure the people selling it wish it was, but that doesn’t make it true.
The fact that we're reading about it here today and have read about it in the past weeks is one piece of evidence. Another is that we hadn't been reading about it in the months before November. Opus 4.5 and GPT 5.2 have crossed a usefulness frontier.
Anecdotally, I've been having some success (guiding LLMs) writing Alloy models in the past month and ensuring conformance with code. From an ROI perspective, making these would have been an unjustifiable fairy tale just this summer. The landscape has changed qualitatively.
The hobby project to day job methodology pipeline is real.
It’s the same design as giving LLMs the current time, since they can’t tell time themselves, either.