We already have a ton of orgs that can’t keep a test suite green or write an honest invariant in a code comment, but somehow we’re going to get them to agree on a precise spec in TLA+/Dafny/Lean and treat it as a blocking artifact? That’s not an AI problem, that’s a culture and incentives problem.
Where AI + “formal stuff” probably does go mainstream is at the boring edges: property-based tests, contracts, refinement types, static analyzers that feel like linters instead of capital‑P “Formal Methods initiatives”. Make it look like another checkbox in CI and devs will adopt it; call it “verification” and half the org immediately files it under “research project we don’t have time for”.
Will include this thread in my https://hackernewsai.com/ newsletter.
Not only that, but it's been well-established that a significant challenge with formally verified software is to create the right spec -- i.e. one that actually satisfies the intended requirements. A formally verified program can still have bugs, because the spec (which requires specialized skills to read and understand) may not satisfy the intent of the requirements in some way.
So the fundamental issue/bottleneck that emerges is the requirements <=> spec gap, which closing the spec <=> executable gap does nothing to address. Translating people's needs to an empirical, maintainable spec of one type or another will always require skilled humans in the loop, regardless of how easy everything else gets -- at minimum as a responsibility sink, but even more as a skilled technical communicator. I don't think we realize how valuable it is to PMs/executives and especially customers to be understood by a skilled, trustworthy technical person.
That's not a bug, that's a misunderstanding, or at least an error of translation from natural language to formal language.
Edit:
I agree that one can categorize incorrect program behavior as a bug (apparently there's such a thing as "behavioral bug"), but to me it seems to be a misnomer.
I also agree that it's difficult to tell that to a customer when their expectations aren't met.
1) Verify requirements => this can be done with formal verification
2) Validate fit for purpose => this is where we make sure that if the customer needs addition, it does not matter that our software does subtraction very well and has a valid proof of doing so according to the spec.
I know this second part is kinda lost in the transition from oh my god waterfall is bad to yeyy now we can fire all testers because the quality is the responsibility of the entire team.
"If you can formally verify the spec the code can be auto-generated from it."
After 20 years of software development I think that is because most of the software out there is itself the method of finding out what it's supposed to do.
The incomplete specs are not lacking feature requirements due to lack of discipline. It's because nobody can even know without trying it out what the software should be.
I mean, of course there is a subset of all software that can be specified beforehand - but a lot of it is not.
Knuth could be that forward thinking with TeX for example only because he had 500 years of book printing tradition to fall back on to backport the specs to math.
The people who care about the precise details have always been relegated to a tiny minority, even in our modern technological world.
That's where the job security is (and always has been). This has been my answer to "are you afraid for your job because of AI?"
Writing the code is very rarely the hard part. The hard part is getting a spec from the PM, or gathering requirements from stakeholders. And then telling them why the spec / their requirements don't make sense or aren't feasible, and figuring out ones that will actually achieve their goals.
Rather, it's directed at the kind of niche, mission-critical things, not all of which are getting the formal verification they need, and/or that don't get considered due to high costs (driven by the specialized skills required).
I read OP as a realization that the costs have fallen, and thus we should see formal verification more than before.
* A user interface is confusing, or the English around it is unclear
* An API you rely on changes, is deprecated, etc.
* Users use something in unexpected ways
* Updates forced by vendors or open source projects cause things to break
* The customer isn't clear what they want
* Complex behavior between interconnected systems, out of the purview of the formal language (OS + database + network + developer + VM + browser + user + web server)
For some mathematically pure task, sure, it's great. Or a low-level library like a regular expression parser or a compression codec. But I don't think that represents a lot of what most of us are tasked with, and those low-level "mathematically pure" libraries are generally pretty well handled by now.

Bug reporting is already about signal vs noise. Imagine how it will be when we hand the megaphone to bots.
All said, I’m now running all commits through Codex (which is the only thing it’s any good at), and it’s really pretty good at code reviews.
Isn't this what TLA+ was meant to deal with?
A lot of libraries document invariants that are either not checked at all, only checked at runtime, or somewhere in between. For instance, the requirement that a collection not be modified during iteration. Or that two regions of memory do not overlap, or that a variable is not modified without owning a lock. These are all things that, in principle, can be formally verified.
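A couple of those are exactly what Rust's borrow checker already turns into compile-time errors - a minimal sketch (purely illustrative, not taken from any particular library):

```rust
fn main() {
    let mut items = vec![1, 2, 3];

    // Fine: shared, read-only iteration.
    for x in &items {
        println!("{x}");
    }

    // Rejected at compile time: the iterator holds a shared borrow of
    // `items`, so mutating the collection mid-iteration is a borrow-check
    // error instead of a runtime surprise.
    // for x in &items {
    //     items.push(*x); // error[E0502]: cannot borrow `items` as mutable
    // }
}
```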
No one claims that good type systems prevent buggy software. But, they do seem to improve programmer productivity.
For LLMs, there is an added benefit. If you can formally specify what you want, you can make that specification your entire program. Then have an LLM driven compiler produce a provably correct implementation. This is a novel programming paradigm that has never before been possible; although every "declarative" language is an attempt to approximate it.
That's exactly what languages with advanced type systems claim. To be more precise, they claim to eliminate entire classes of bugs. So they reduce bugs, they don't eliminate them completely.
Forcing devs to pre-fix/avoid bugs before the compiler will accept the program means the programs are more correct as a group.
Wrong, incomplete, insufficient, unhelpful, unimpressive, and dumb are all still very possible. But more correct than likely in looser systems.
When you say things like "eliminate a class of bugs", that is played out in the abstraction: an infinite subset of that infinity of machines is eliminated, leaving an infinity.
How you then sample from that infinity in order to have something which fits on your actual machine is a separate question.
The problem is there is always some chance a coding agent will get stuck and be unable to produce a conforming implementation in a reasonable amount of time. And then you are back in a similar place to what you were with those pre-LLM solutions - needing a human expert to work out how to make further progress.
As an example, I currently mostly write GUI applications for mobile and desktop as a solo dev. 90% of my time is spent on figuring out API calls and arranging layouts. Most of the data I deal with are strings with their own validation and formatting rules that are complicated and at the same time usually need to be permissive. Even at the backend all the data is in the end converted to strings and integers when it is put into a database. Over-the-wire serialization also discards most typing (although I prefer protocol buffers to alleviate this problem a bit).
Strong typing can be used in between those steps but the added complexity from data conversions introduces additional sources of error, so in the end the advantages are mostly nullified.
this is exactly where a good type system helps: you have an unvalidated string and a validated string which you make incompatible at the type level, thus eliminating a whole class of possible mistakes. same with object ids, etc.
don't need haskell for this, either: https://brightinventions.pl/blog/branding-flavoring/
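in rust the same idea is just a newtype - a tiny sketch (names made up for illustration):

```rust
// A raw &str may or may not be a valid address; an `Email` can only be
// obtained by going through validation.
struct Email(String);

fn parse_email(raw: &str) -> Option<Email> {
    // deliberately simplistic check, just for the example
    if raw.contains('@') {
        Some(Email(raw.to_string()))
    } else {
        None
    }
}

// Anything that needs a validated address asks for `Email`, not `&str`,
// so passing an unvalidated string is a compile-time error.
fn send_welcome(to: &Email) {
    println!("sending welcome mail to {}", to.0);
}

fn main() {
    if let Some(addr) = parse_email("user@example.com") {
        send_welcome(&addr);
    }
    // send_welcome("not-an-email"); // would not compile
}
```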
That is not novel and every declarative language precisely embodies it.
To me it seems they reduce productivity. In fact, for Rust, which seems to match the examples you gave about locks or regions of memory the common wisdom is that it takes longer to start a project, but one reaps the benefits later thanks to more confidence when refactoring or adding code.
However, even that weaker claim hasn’t been proven.
In my experience, the more information is encoded in the type system, the more effort is required to change code. My initial enthusiasm for the idea of Ada and Spark evaporated when I saw how much ceremony the code required.
I would tend to disagree. All that information encoded in the type system makes explicit what is needed in any case and is otherwise only carried informally in peoples' heads by convention. Maybe in some poorly updated doc or code comment where nobody finds it. Making it explicit and compiler-enforced is a good thing. It might feel like a burden at first, but you're otherwise just closing your eyes and ignoring what can end up important. Changed assumptions are immediately visible. Formal verification just pushes the boundary of that.
this is, in fact better for llms, they are better at carrying information and convention in their kv cache than they are in having to figure out the actual types by jumping between files and burning tokens in context/risking losing it on compaction (or getting it wrong and having to do a compilation cycle).
if a typed language lets a developer fearlessly build a semantically inconsistent or confusing private API, then llms will perform worse on it even though correctness is more guaranteed.
It’s actually similar to tests in a way: they provide additional confidence in the code, but at the same time ossify it and make some changes potentially more difficult. Interestingly, they also make some changes easier, as long as not too many types/tests have to be adapted.
At one end of the spectrum, the weakest type systems limit the ability of an IDE to do basic maintenance tasks (e.g. refactoring).
At the other end of the spectrum, dependent types and especially sigma types capture arbitrary properties that can be expressed in the logic. But then constructing values of such types requires providing proofs of these properties, and the code and proofs are inextricably mixed in an unmaintainable mess. This does not scale well: you cannot easily add a new proof on top of existing self-sufficient code without temporarily breaking it.
Like other engineering domains, proof engineering has tradeoffs that require expertise to navigate.
To be honest, I believe it makes refactoring/maintenance take longer. Sure, safer, but this is not a one-time only price.
E.g. you decide to optimize this part of the code and only return a reference or change the lifetime - this is an API-breaking change and you have to potentially recursively fix it. Meanwhile GC languages can mostly get away with a local-only change.
Don't get me wrong, in many cases this is more than worthwhile, but I would probably not choose rust for the n+1th backend crud app for this and similar reasons.
I compared it to a statically typed language with a GC - where the runtime takes care of a property that Rust has to do statically, requiring more complexity.
Have you seen large js codebases? Good luck changing anything in it, unless they are really, really well written, which is very rare. (My own js code is often a mess)
When you can change types on the fly somewhere hidden in code ... then this leads to the opposite of clarity for me. And so lots of effort required to change something in a proper way, that does not lead to more mess.
a) It’s fast to change the code, but now I have failures in some apparently unrelated part of the code base. (Javascript) and fixing that slows me down.
b) It’s slow to change the code because I have to re-encode all the relationships and semantic content in the type system (Rust), but once that’s done it will likely function as expected.
Depending on project, one or the other is preferable.
To me, this has been one of the biggest advantages of both tests and types. They provide confidence to make changes without needing to be scared of unintended breakages.
Of course it is a hyperbole, but sadly not that large.
They really don’t. How did you arrive at such a conclusion?
It is quite clear that this industry is mostly driven by hype and fades, not by empirical studies.
Empirical evidence in favor of a claim that static typing and complex type systems reduce bugs or improve productivity is highly inconclusive at best
It's just people's hunches.
AI makes creating mock objects much easier in some cases, but it still creates a lot of busy work and makes configuration more difficult. And at this point it is often difficult configuration management that causes the issues in the first place. Putting everything in some container doesn't help either, on the contrary.
Give me a list of all the libraries you work with that don't have some sort of "okay, but not that bit" rule in the business logic, or "all of those functions are f(src, dst), but the one you use most is f(dst, src) and we can't change it now".
I bet it's a very short list.
Really we need to scrap every piece of software ever written and start again from scratch with all these weirdities written down so we don't do it again, but we never will.
But it's true. AI is still super narrow and dumb. It doesn't even understand basic prompts.
Look at computer games now - they still don't look real despite almost 30 years since Half-Life 1 started the revolution, I would claim. Damn, I think I ran it on a 166 MHz computer on some of the lowest details even.
Yes, it's just better and better but still looking super uncanny - at least to me. And it's been basically 30 years of constant improvements. Heck, Roomba is going bankrupt.
I am not saying things don't improve but the hype and AI bubble is insane and the reality doesn't match the expectation and predictions at all.
Formal verification will eventually lead to good, stable API design.
> Users use something in unexpected ways
> Complex behavior between interconnected systems
It happens when there's no formal verification during the design stage.
Formal verification literally means covering 100% of state changes: for every possible input/output, every execution branch should be covered.
Why? Has it ever happened like this? Because to me it would seem that if the system verified to work, then it works no matter how API is shaped, so there is no incentive to change it to something better.
That's the case for one-off integrations, but the messy part always comes when the system's goals change.
Let's say formal verification could help to avoid some anti-patterns.
I'd still like to hear about the actual mechanism of this happening. Because I personally find it much easier to believe that the moment keeping the formal verification up to date becomes untenable for whatever reason (specs changing too fast, external APIs to use are too baroque, etc) people would rather say "okay, guess we ditch the formal verification and just keep maintaining the integration tests" instead of "let's change everything about the external world so we could keep our methodology".
I am not an expert on this, but the worst APIs I've seen are those with hidden state.
e.g. a .toggle() API. Call it an odd number of times, it goes to one state; call it an even number of times, it goes back.
And there's call A before you call B types of APIs, the client has to keep a strict call order (which itself is a state machine of some kind)
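For illustration, that call-order state machine can be lifted into the types themselves - a minimal Rust typestate sketch (all names invented):

```rust
use std::marker::PhantomData;

struct Disconnected;
struct Connected;

struct Client<State> {
    _state: PhantomData<State>,
}

impl Client<Disconnected> {
    fn new() -> Self {
        Client { _state: PhantomData }
    }
    // `connect` consumes the disconnected client and returns a connected one.
    fn connect(self) -> Client<Connected> {
        Client { _state: PhantomData }
    }
}

impl Client<Connected> {
    // `send` only exists on a connected client, so "call connect before send"
    // is enforced by the compiler instead of by documentation.
    fn send(&self, msg: &str) {
        println!("sending: {msg}");
    }
}

fn main() {
    let client = Client::new().connect();
    client.send("hello");
    // Client::new().send("hello"); // does not compile: no `send` in this state
}
```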
> e.g. a .toggle() API. Call it an odd number of times, it goes to one state; call it an even number of times, it goes back.
This is literally a dumb light switch. If you have trouble proving that, starting from lights off, flicking a simple switch twice will still keep lights off then, well, I have bad news to tell you about the feasibility of using the formal methods for anything more complex than a dumb light switch. Because the rest of the world is a very complex and stateful place.
> (which itself is a state machine of some kind)
Yes? That's pretty much the raison d'être of the formal methods: for anything pure and immutable, normal intuition is usually more than enough; it's tracking the paths through enormous configuration spaces that our intuition has problem with. If the formal methods can't help with that with comparable amount of effort, then they are just not worth it.
The theory is entirely correct. If a machine can write provably perfect code there is absolutely no reason to have people write code. The problem is that the 'If' is so big it can be seen from space.
Given the spec, formal verification can tell you if your implementation follows the spec. It cannot tell you if the spec is good.
I am right now working on an offline api client: https://voiden.md/. I wonder if this can be a feature.
I beg to differ, if a spec is hard to verify, then it's a bad sign.
Of course, you can declare that the world itself is inherently sinful and imperfect, and is not ready for your beautiful theories but seriously.
i see we are both familiar with haskellers (friendly joke!)
That the spec solves the problem is called validation in my domain and treated explicitly with different methods.
We use formal validation to check for invariants, but also "it must return a value xor an error, but never just hang".
Things can only happen if you allow them to happen.
In other words, your software may reach a stage where it's no longer applicable, but it never crashes.
Formal verification had little adoption only because it costs 23x of your original code with "PhD-level training"
Sure, formal verification might give stronger guarantees about various levels of the stack, but I don’t think most of us care about having such strong guarantees now and I don’t think AI really introduces a need for new guarantees at that level.
They are not. The power of rich and succinct specification languages (like TLA+) comes from the ability to succinctly express things that cannot be efficiently computed, or at all. That is because a description of what a program does is necessarily at a higher level of abstraction than the program (i.e. there are many possible programs or even magical oracles that can do what a program does).
To give a contrived example, let's say you want to state that a particular computation terminates. To do it in a clear and concise manner, you want to express the property of termination (and prove that the computation satisfies it), but that property is not, itself, computable. There are some ways around it, but as a rule, a specification language is more convenient when it can describe things that cannot be executed.
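For instance, "every request is eventually answered" is a one-line temporal formula (the predicate names here are invented):

$\Box\,(\mathit{Requested} \Rightarrow \Diamond\,\mathit{Answered})$

Deciding whether an arbitrary program satisfies it is undecidable in general, yet stating it costs almost nothing.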
You really have to be able to reduce your models to: “at some point in the future, this will happen," or "it will always be true from now on”
Have probabilistic outcomes? Or even floats [0]? It becomes challenging, and strings are a mess.
> Note there is not a float type. Floats have complex semantics that are extremely hard to represent. Usually you can abstract them out, but if you absolutely need floats then TLA+ is the wrong tool for the job.
TLA+ works for the problems it is suitable for, try and extend past that and it simply fails.
You really don't. It's not LTL. Abstraction/refinement relations are at the core of TLA.
> Or even floats [0] and it becomes challenging and strings are a mess.
No problem with floats or strings as far as specification goes. The particular verification tools you choose to run on your TLA+ spec may or may not have limitations in these areas, though.
> TLA+ works for the problems it is suitable for, try and extend past that and it simply fails.
TLA+ can specify anything that could be specified in mathematics. That there is no predefined set of floats is no more a problem than the one physicists face because mathematics has no "built-in" concept for metal or temperature. TLA+ doesn't even have any built in notions of procedures, memory, instructions, threads, IO, variables in the programming sense, or, indeed programs. It is a mathematical framework for describing the behaviour of discrete or hybrid continuous-discrete dynamical systems, just as ODEs describe continuous dynamical systems.
But you're talking about the verfication tools you can run on TLA+ spec, and like all verification tools, they have their limitations. I never claimed otherwise.
You are, however, absolutely right that it's difficult to specify probabilistic properties in TLA+.
I think it's disingenuous to say that TLA+ verifiers "may or may not have limitations" wrt floats when none of the available tools support floats. People should know going in that they won't be able to verify specs with floats!
Of course, things can become more involved if you want to account for overflow, but overflow can get complicated even with integers.
You are talking about the logic of TLA+, that is, its mathematical definition. No tool for TLA+ can handle all of mathematics at the moment. The language was designed for specifying systems, not all of mathematics.
Formal methods are really broad, ranging from lightweight type systems to theorem proving. Some techniques are fantastic for one type of problem but fail at others. This is quite natural, the same thing happens with different programming paradigms.
For example, what is adequate for a hard real-time system (timed automata) is useless for a typical CRUD application.
(Then again, AIUI it's basically a thin wrapper over stochastic matrices, so maybe that's asking too much...)
I think people get confused by the word "temporal" in the name of TLA+. Yes, it has temporal operators. If you throw them away, TLA+ (minus the temporal operators) would still be extremely useful for specifying the behavior of concurrent and distributed systems. I have been using TLA+ for writing specifications of distributed algorithms (e.g., distributed consensus) and checking them for about 6 years now. The question of liveness comes last, and even then, the standard temporal logics are barely suitable for expressing liveness under partial synchrony. The value of temporal properties in TLA+ is overrated.
Have a look at how SAT solvers or Mixed Integer Linear Programming solvers are used.
There you specify a clear goal (with your code), and then you let the solvers run. You can, but you don't need to, let the solvers run all the way to optimality. And the solvers are also allowed to use all kinds of heuristics to find their answers, but that doesn't impact the statement of your objective.
Compare that to how many people write code without solvers: the objective of what your code is trying to achieve is seldom clearly spelled out, and is instead mixed up with the how-to-compute bits, including all the compromises and heuristics you make to get a reasonable runtime or to accommodate some changes in the spec your boss asked for at the last minute.
Using a solver ain't formal verification, but it shows the same separation between spec and implementation.
Another benefit of formal verification, which you already imply: your formal spec doesn't have to fully determine the behaviour of your software, and you can have multiple specs simultaneously. But you can only have a single implementation active at a time (even if you use a high-level implementation language.)
So you can add 'handling a user request must terminate in finite time' as a (partial) spec. It's an important property, but it tells you almost nothing about the required business logic. In addition you can add "users shouldn't be able to withdraw more than they deposited" (and other more complicated rules), and you only have to review these rules once, and don't have to touch them again, even when you implement a clever new money transfer routine.
Usually, when we're interested in termination proofs, what we're really interested in is a proof that the algorithm makes constant progress that converges on a solution.
- TLAPS is the interactive proof system that can automate some proof steps by delegating to SMT solvers: https://proofs.tlapl.us/doc/web/content/Home.html
- Apalache is the symbolic model checker that delegates verification to Z3. It can prove properties without executing anything, or rather, executing specs symbolically. For instance, it can do proofs via inductive invariants but only for bounded data structures and unbounded integers. https://apalache-mc.org/
- Finally, TLC is an enumerative model checker and simulator. It simply produces states and enumerates them. So it terminates only if the specification produces a finite number of states. It may sound like executing your specification, but it is a bit smarter, e.g., when checking invariants it will never visit the same state twice. This gives TLC the ability to reason about infinite executions. Confusingly, TLC does not have its own page, as it was the first working tool for TLA+. Many people believe that TLA+ is TLC: https://github.com/tlaplus/tlaplus
Do you really think it is going to be easier for the average developer to write a specification for their program that does not terminate
vs
Giving them a framework or a language that does not have a for loop?
Edit: If by formal verification you mean type checking. That I very much agree.
It's hard even for a human who understands the full business, social and political context to disambiguate the meaning and intent of the spec; to try to express it mathematically would be an absolute nightmare... and extremely unwise. You would literally need some kind of super intelligence... And the amount of stream-of-thought tokens which would have to be generated to arrive at a correct, consistent, unambiguous formal spec is probably going to cost more than just hiring top software engineers to build the thing with 100% test coverage of all main cases and edge cases.
Worst part is; after you do all the expensive work of formal verification; you end up proving the 'correctness' of a solution that the client doesn't want.
The refactoring required will invalidate the entire proof from the beginning. We haven't even figured out the optimal way to formally architect software that is resilient to requirement changes; in fact, the industry is REALLY BAD at this. Almost nobody is even thinking about it. I am, but I sometimes feel like I may be the only person in the world who cares about designing optimal architectures to minimize line count and refactoring diff size. We'd have to solve this problem first before we even think about formal verification of 'most software'.
Without a hypothetical super-intelligence which understands everything about the world; the risk of misinterpreting any given 'typical' requirement is almost 100%... And once we have such super-intelligence, we won't need formal verification because the super-intelligence will be able to code perfectly on the first attempt; no need to verify.
And then there's the fact that most software can tolerate bugs... If operationally important big tech software which literally has millions of concurrent users can tolerate bugs, then most software can tolerate bugs.
People spend gobs of money on human security auditors who don't necessarily catch everything either, so verification easily fits in the budget. And once deployed, the code can't be changed.
Verification has also been used in embedded safety-critical code.
I think the root problem may be that most software has to adapt to a constantly changing reality. There aren't many businesses which can stay afloat without ever changing anything.
A bidirectional bridge that spans multiple representations from informal spec to semiformal spec to code seems ideal. You change the most relevant layer that you're interested in and then see updates propagating semi-automatically to other layers. I'd say the jury is out on whether this uses extra tokens or saves them, but a few things we do know. Chain of code works better than chain of thought, and chain-of-spec seems like a simple generalization. Markdown-based planning and task-tracking agent workflows work better than just YOLOing one-shot changes everywhere, and so intermediate representations are useful.
It seems to me that you can't actually get rid of specs, right? So to shoot down the idea of productive cooperation between formal methods and LLM-style AI, one really must successfully argue that informal specs are inherently better than formal ones. Or even stronger: having only informal specs is better than having informal+formal.
Amusingly, what I'm hearing is literally "I have a bridge to sell you."
At most other companies, it's like the stakeholder doesn't even know what they want until they start seeing things on a screen... Trying to write a formal spec when literally nobody in the universe even knows what is required; that's physically impossible.
In my view, 'Correct code' means code that does what the client needs it to do. This is downstream from it doing what the client thinks they want; which is itself downstream from it doing what the client asked for. Reminds me of this meme: https://www.reddit.com/r/funny/comments/105v2h/what_the_cust...
Software engineers don't get nearly enough credit for how difficult their job is.
I think we've become used to the complexity in typical web applications, but there's a difference between familiar and simple (simple vs. easy, as it were). The behavior of most business software can be very simply expressed using simple data structures (sets, lists, maps) and simple logic.
No matter how much we simplify it, via frameworks and libraries or what have you, things like serialization, persistence, asynchrony, concurrency, and performance end up complicating the implementation. Comparing this against a simpler spec is quite nice in practice - and a huge benefit is that you can now consult a simple in-memory spec instead of worrying about distributed system deployments.
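A toy version of that "consult a simple in-memory spec" idea (all names invented; FancyStore is just a stub standing in for whatever complicated implementation you actually have):

```rust
use std::collections::HashMap;

// Stub implementation; imagine something persistent or distributed here.
struct FancyStore {
    data: HashMap<String, String>,
}

impl FancyStore {
    fn new() -> Self { FancyStore { data: HashMap::new() } }
    fn put(&mut self, k: &str, v: &str) { self.data.insert(k.into(), v.into()); }
    fn get(&self, k: &str) -> Option<&String> { self.data.get(k) }
}

fn main() {
    // The "spec" is just a map; the implementation must agree with it
    // after every operation in some scripted (or generated) sequence.
    let mut spec: HashMap<String, String> = HashMap::new();
    let mut imp = FancyStore::new();

    let ops = [("put", "a", "1"), ("put", "b", "2"), ("put", "a", "3")];
    for (op, k, v) in ops {
        if op == "put" {
            spec.insert(k.into(), v.into());
            imp.put(k, v);
        }
        for key in ["a", "b", "c"] {
            assert_eq!(spec.get(key), imp.get(key), "divergence at key {key}");
        }
    }
    println!("implementation matches the in-memory spec on this trace");
}
```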
People don't verify those because it's hard, not for lack of value.
Concrete Semantics [1] has a little example in §9.2.
> "an anonymous user must never edit any data, except for the create account form"
Can quickly end up being
> "an anonymous user must never edit any data, except for the create account form, and the feedback form"
And a week later go to
> "an anonymous user must never edit any data, except for the create account form, the feedback form, and the error submission form if they end up with a specific type of error"
And then during Christmas
> "an anonymous user must never edit any data, except for the create account form, the feedback form, and the error submission form if they end up with a specific type of error, and the order submission form if they visit it from this magic link. Those visiting from the magic link should not be able to use the feedback form (Marge had a bad experience last Christmas going through feedback from the promotional campaign)"
It changing with time doesn't make any of that change.
This is the sort of reality that a lot of developers in the business world deal with.
This is my issue with algorithm driven interviewing. Even the creator of Homebrew got denied by Google because he couldn't do some binary sort or whatever it even was. He made a tool used by millions of developers, but apparently that's not good enough.
Same as why companies in the past used to reject anyone without a degree. But then everyone got a degree, leaving it to no longer be an effective filter, hence things like algorithm tests showing up to fill the void.
Once you've narrowed the list, then you can worry about figuring out who is "good" through giving the remaining individuals additional attention.
They still do, and it's a shame; some of the smartest, most capable developers I know have no degree.
At the most basic level this means making sure they can run commands to execute the code - easiest with languages like Python, with HTML+JavaScript you need to remind them that Playwright exists and they should use it.
The next step up from that is a good automated test suite.
Then we get into quality of code/life improvement tools - automatic code formatters, linters, fuzzing tools etc.
Debuggers are good too. These tend to be less coding-agent friendly due to them often having directly interactive interfaces, but agents can increasingly use them - and there are other options that are a better fit as well.
I'd put formal verification tools like the ones mentioned by Martin on this spectrum too. They're potentially a fantastic unlock for agents - they're effectively just niche programming languages, and models are really good at even niche languages these days.
If you're not finding any value in coding agents but you've also not invested in execution and automated testing environment features, that's probably why.
> The next step up from that is a good automated test suite.
And if we're going for a powerful type system, then we can really leverage the power of property tests which are currently grossly underused. Property tests are a perfect match for LLMs because they allow the human to create a small number of tests that cover a very wide surface of possible errors.
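A minimal sketch of what that looks like in practice - here with Rust's proptest crate rather than Haskell's QuickCheck, but the idea is the same (the run-length coder is a made-up example; proptest is assumed as a dev-dependency):

```rust
use proptest::prelude::*;

// Toy functions under test: run-length encode / decode.
fn encode(input: &[u8]) -> Vec<(u8, usize)> {
    let mut out: Vec<(u8, usize)> = Vec::new();
    for &b in input {
        match out.last_mut() {
            Some((v, n)) if *v == b => *n += 1,
            _ => out.push((b, 1)),
        }
    }
    out
}

fn decode(runs: &[(u8, usize)]) -> Vec<u8> {
    runs.iter().flat_map(|&(v, n)| std::iter::repeat(v).take(n)).collect()
}

proptest! {
    // One short property ("decode inverts encode") exercises every input the
    // generator can produce, instead of a handful of hand-picked cases.
    #[test]
    fn roundtrip(xs in prop::collection::vec(any::<u8>(), 0..256)) {
        prop_assert_eq!(decode(&encode(&xs)), xs);
    }
}
```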
The "thinking in types" approach to software development in Haskell allows the human user to keep at a level of abstraction that still allows them to reason about critical parts of the program while not having to worry about the more tedious implementation parts.
Given how much interest there has been in using LLMs to improve Lean code for formal proofs in the math community, maybe there's a world where we make use of an even more powerful type system than Haskell's. If LLMs with the right language can help prove complex mathematical theorems, then it should certainly be possible to write better software with them.
My bet is on refinement types. Dafny fits that bill quite well, it's simple, it offers refinement types, and verification is automated with SAT/SMT.
In fact, there are already serious industrial efforts to generate Dafny using LLMs.
Besides, some of the largest verification efforts have been achieved with this language [1].
[1] https://www.andrew.cmu.edu/user/bparno/papers/ironfleet.pdf
I've noticed real advantages of functional languages to agents, for disposable code. Which is great, cos we can leverage those without dictating the human's experience.
I think the correct way forward is to choose whatever language the humans on your team agree is most useful. For my personal projects, that means a beautiful language for the bits I'll be touching, and whatever gets the job done elsewhere.
It even lets you separate implementation from specification.
I think there are some interesting things going on if you can really tightly lock down the syntax to some simple subset with extremely straightforward, powerful, and expressive typing mechanisms.
Or, to put it the other way round, what kind of tech leads would we be if we told our junior engineers „Well, here’s the codebase, that’s all I‘ll give you. No debuggers, linters, or test runners for you. Using a browser on your frontend implementation? Nice try buddy! Now good luck getting those requirements implemented!“
I think it's more nuanced than that. As a human, I can manually test code in ways an AI still can't. Sure, maybe it's better to have automated test suites, but I have other options too.
Horses for courses, I suppose. Back in the day, when I wanted to play with some C(++) library, I'd quite often write a Python C-API extension so I could do the same thing using Python's repl.
The vision models (Claude Opus 4.5, Gemini 3 Pro, GPT-5.2) can even take screenshots via Playwright and then "look at them" with their vision capabilities.
It's a lot of fun to watch. You can tell them to run Playwright not in headless mode at which point a Chrome window will pop up on your computer and you can see them interact with the site via it.
I don’t go into Claude without everything already setup. Codex helps me curate the plan, and curate the issue tracker (one instance). Claude gets a command to fire up into context, grab an issue - implements it, and then Codex and Gemini review independently.
I’ve instructed Claude to go back and forth for as many rounds as it takes. Then I close the session (\new) and do it again. These are all the latest frontier models.
This is incredibly expensive, but it’s also the most reliable method I’ve found to get high-quality progress — I suspect it has something to do with ameliorating self-bias, and improving the diversity of viewpoints on the code.
I suspect rigorous static tooling is yet another layer to improve the distribution over program changes, but I do think that there is a big gap in folk knowledge already between “vanilla agents” and something fancy with just raw agents, and I’m not sure if just the addition of more rigorous static tooling (beyond the compiler) closes it.
Going full speed ahead building a Rails app from scratch it seemed like I was spending $50/hour, but it was worth it because the App was finished in a weekend instead of weeks.
I can't bear to go in circles with Sonnet when Opus can just one shot it.
What you don't specify, it must assume. And therein lies a huge landscape of possibilities. And since the AI can't read your mind (yet), its assumptions will probably not precisely match your assumptions unless the task is very limited in scope.
It's not odd, they've just been trained to give helpful answers straight away.
If you tell them not to make assumptions and to rather first ask you all their questions together with the assumptions they would make because you want to confirm before they write the code, they'll do that too. I do that all the time, and I'll get a list of like 12 things to confirm/change.
That's the great thing about LLMs -- if you want them to change their behavior, all you need to do is ask.
Source code generation is possible due to large training set and effort put into reinforcing better outcomes.
I suspect debugging is not that straightforward to LLM'ize.
It's a non-sequential interaction - when something happens, it's not necessarily what caused the problem, and the timeline may be shuffled. An LLM would need tons of examples where something happens in the debugger or logs, and learn to associate it with another abstraction.
I was debugging something in gdb recently and it was a pretty challenging bug. Out of interest I tried chatgpt, and it was hopeless - try this, add this print, etc. That's not how you debug multi-threaded and async code. When I found the root cause, I was analyzing how I did it and where I learned that specific combination of techniques, each individually well documented, but never in combination - it was learning from other people and my own experience.
I understand that AI can help with writing, coding, analyzing code bases and summarizing other papers, but going through these myself makes a difference, at least for me. I tried ChatGPT 3.5 when I started and while I got a pile of work done, I had to throw it away at some point because I didn't fully understand it. AI could explain to me various parts, but it's different when you create it.
Debugging is not easy but there should be a lot of training corpus for "bug fixing" from all the commits that have ever existed.
They generated it, and had a compiler compile it, and then had it examine the output. Rinse, repeat.
We valued automated tests and linters and fuzzers and documentation before AI, and that's because it serves the same purpose.
So let's say that, similarly, there are programming tasks that are easier and harder for agents to do well. If we know that a task is in the easy category, of course having tests is good, but since we already know that an agent does it well, the test isn't the crucial aspect. On the other hand, for a hard task, all the testing in the world may not be enough for the agent to succeed.
Longer term, I think it's more important to understand what's hard and what's easy for agents.
So I've been exploring the idea of going all-in on this "basic level" of validation. I'm assembling systems out of really small "services" (written in Go) that Claude Code can immediately run and interact with using curl, jq, etc. Plus when building a particular service I already have all of the downstream services (the dependencies) built and running so a lot of dependency management and integration challenges disappear. Only trying this out at a small scale as yet, but it's fascinating how the LLMs can potentially invert a lot of the economics that inform the current conventional wisdom.
(Shameless plug: I write about this here: https://twilightworld.ai/thoughts/atomic-programming/)
My intuition is that LLMs will for many use cases lead us away from things like formal verification and even comprehensive test suites. The cost of those activities is justified by the larger cost of fixing things in production; I suspect that we will eventually start using LLMs to drive down the cost of production fixes, to the point where a lot of those upstream investments stop making sense.
There is still a cost to having bugs, even if deploying fixes becomes much cheaper. Especially if your plan is to wait until they actually occur in practice to discover that you have a bug in the first place.
Put differently: would you want the app responsible for your payroll to be developed in this manner? Especially considering that the bug in question would be "oops, you didn't get paid."
For front end - the verification is making sure that the UI looks as expected (can be verified by an image model) and that clicking certain buttons results in certain things (can be verified by a chatgpt agent, but it's not public I guess).
For back end it will involve firing API requests one by one and verifying the results.
To make this easier, we need to somehow give an environment for claude or whatever agent to run these verifications on and this is the gap that is missing. Claude Code, Codex should now start shipping verification environments that make it easy for them to verify frontend and backend tasks and I think antigravity already helps a bit here.
------
The thing about backend verification is that it is different in different companies and requires a custom implementation that can't easily be shared across companies. Each company has its own way to deploy stuff.
Imagine a concrete task like creating a new service that reads from a data stream, runs transformations, puts it in another data stream where another new service consumes the transformed data and puts it into an AWS database like Aurora.
``` stream -> service (transforms) -> stream -> service -> Aurora ```
To one shot this with claude code, it must know everything about the company
- how does one consume streams in the company? Schema registry?
- how does one create a new service and register dependencies? how does one deploy it to test environment and production?
- how does one even create an Aurora DB? request approvals and IAM roles etc?
My question is: what would it take for Claude Code to one shot this? At the code level it is not too hard and it can fit in context window easily but the * main * problem is the fragmented processes in creating the infra and operations behind it which is human based now (and need not be!).
-----
My prediction is that companies will make something like a new "agent" environment where all these processes (that used to require a human) can be done by an agent without human intervention.
I'm thinking of other solutions here, but if anyone can figure it out, please tell!
Yeah, it's gonna be fun waiting for compilation cycles when those models "reason" with themselves about a semicolon. I guess we just need more compute...
Elevator pitch: "Blocks is a semantic linter for human-AI collaboration. Define your domain in YAML, let anyone (humans or AI) write code freely, then validate for drift. Update the code or update the spec, up to human or agent."
(you can add traditional linters to the process if you want but not necessary)
The gist being you define a bunch of validators for a collection of modules you're building (with agentic coding) with a focus on qualifying semantic things:
- domain / business rules/measures
- branding
- data flow invariants — "user data never touches analytics without anonymization"
- accessibility
- anything you can think of
Then you just tell your agentic coder to use the cli tool before committing, so it keeps the code in line with your engineering/business/philosophical values.
(boring) example of it detecting if blog posts have humour in them, running in Claude Code -> https://imgur.com/diKDZ8W
I suspect that as it becomes more economical to play with training your own models, people will get better at including obscured malicious content in data that will be used during training, which could cause the LLM to intrinsically carry a trigger/path that would cause malicious content to be output by the LLM under certain conditions.
And of course we have to worry about malicious content being added to sources that we trust, but that already exists - we as an industry typically pull in public repositories without a complete review of what we're pulling. We outsource the verification to the owners of the repository. Just as we currently have cases of malicious code sneaking into common libraries, we'll have malicious content targeted at LLMs
I've also found that LLMs typically just partially implement a given task/story/spec/whatever. The reviewer stage will also notice a mismatch between the spec and the implementation.
I have an orchestrator bounce the flow back and forth between developing and reviewing until the review comes back clean, and only then do I bother to review its work. It saves so much time and frustration.
I use cookiecutter for this, here's my latest Python library template: https://github.com/simonw/python-lib
> automatic code formatters
I haven't tried this because I assumed it'll destroy agent productivity and massively increase number of tokens needed, because you're changing the file out under the LLM and it ends up constantly re-reading the changed bits to generate the correct str_replace JSON. Or are they smart enough that this quickly trains them to generate code with zero-diff under autoformatting?
But in general of course anything that's helpful for human developers to be more productive will also help LLMs be more productive. For largely identical reasons.
I haven't had this problem in a while, but I expect current LLMs would probably handle those formatting instructions more closely than the 3.5 era.
Reducing the problem to "ya just create a specification to formally verify" doesn't move the needle enough to me.
When it comes to real-world, pragmatic, boots-on-the-ground engineering and design, we are so far from even knowing the right questions to ask. I just don't buy it that we'd see huge mainstream productivity changes even if we had access to a crystal ball.
It's hilarious how close we're getting to The Hitchhiker's Guide to the Galaxy though. We're almost at that phase where we ask what the question is supposed to be.
It is first and foremost about learning a way of thinking. Tools only exist to augment and systematize this thinking into a methodology. There are different levels of "Formal Methods Thinking" starting with informal all the way to completely rigorous. Understanding and using these methods of thinking as the "interface" to specify a problem to an AI agent/LLM is what is important to ensure "correctness by construction to a specification".
Everybody should read this excellent (and accessible) paper On Formal Methods Thinking in Computer Science Education which details the above approach - https://research.tue.nl/en/publications/on-formal-methods-th...
Excerpts:
One may ask What good is FM? Who needs it? Millions of programmers work everyday without it. Many think that FM in a CS curriculum is peddling the idea that Formal Logic (e.g.,propositional or predicate logic) is required for everyday programmers, that they need it to write programs that are more likely to be correct, and correspondingly less likely to fail the tests to which they subsequently (of course) must still be subjected. However, this degree of formality is not necessarily needed. What is required of everyday programmers is that, as they write their programs, they think — and code — in a way that respects a correctness-oriented point of view. Assertions can be written informally, in natural language: just the “thinking of what those assertions might be” guides the program-construction process in an astonishingly effective way. What is also required are the engineering principles referred to above. Connecting programs with their specifications through assertions provides training on abstraction, which, in turn, encourages simplicity and focus, helping build more robust, flexible and usable systems.
The answer to “Who needs it?” is that everyday programmers and software developers indeed may not need to know the theory of FM. But what they do need to know is how to practise it, even if with a light touch, benefiting from its precepts. FM theory, which is what explains — to the more mathematically inclined — why FM works, has become confused with the FM practice of using the theory’s results to benefit from what it assures. Any “everyday programmer” can do that...except that most do not.
The paper posits 3 levels of "Formal Methods Thinking" viz.
a) Level 1 (“What’s True Here”). Level 1 of FM thinking is the application of FM in its most basic form. Students develop abilities to understand their programs and reason about their correctness using informal descriptions. By “What’s True Here”, we mean including natural language prose or informal diagrams to describe the properties that are true at different points of a program’s execution rather than the operations that brought them about.
b) Level 2 (Formal Assertions). Level 2 introduces greater precision to Level 1 by teaching students to write assertions that incorporate arithmetic and logical operators to capture FM thinking more rigorously. This may be accompanied by lightweight tools that can be used to test or check that their assertions hold. (A minimal sketch of this level follows the list below.)
c) Level 3 (Full Verification). This level enables students to prove program properties using tools such as a theorem prover, model checker or SMT solver. But in addition to tool-based checking of properties (now written using a formal language), this level can formally emphasise other aspects of system-level correctness, such as structural induction and termination.
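As a toy illustration of Level 2, here is a hypothetical loop whose invariant is written as a checkable assertion rather than prose (Rust, names invented):

```rust
// Sums 1..=n; the comment-level claim "total is always i*(i+1)/2 here"
// becomes an assertion that actually runs in debug builds.
fn sum_first_n(n: u64) -> u64 {
    let mut total = 0;
    for i in 1..=n {
        total += i;
        // invariant at this point: total == i * (i + 1) / 2
        debug_assert_eq!(total, i * (i + 1) / 2);
    }
    total
}

fn main() {
    assert_eq!(sum_first_n(10), 55);
}
```

Level 3 would replace the runtime check with a proof that the invariant holds for all n.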
depends of course on what I am writing: a feature, a bugfix, a refactor...?
Even if you have never written formal proofs but are intrigued by them, try asking a coding agent to do some basic verification. You will not regret it.
Formal proof is not just about proving stuff, it's also about disproving stuff, by finding counterexamples. Once you have stated your property, you can let quickcheck/plausible attack it, possibly helped by a suitable generator which does not have to be random: it can be steered by an LLM as well.
Even further, I'm toying with the idea of including LLMs inside the formalization itself. There is an old and rich idea in the domain of formal proof, that of certificates: rather than proving that the algorithm that produces a result is correct, just compute a checkable certificate with untrusted code and verify it is correct. Checkable certificates can be produced by unverified programs, humans, and now LLMs. Properties, invariants, can all be "guessed" without harm by an LLM and would still have to pass a checker. We have truly entered an age of oracles. It's not halting-problem-oracle territory of course, but it sometimes feels pretty close for practical purposes. LLMs are already better at math than most of us and certainly than me, and so any problem I could plausibly solve on my own, they will do faster without my having to wonder if there is a subtle bug in the proof. I still need to look at the definitions and statements, of course, but my role has changed from finding to checking. Exploring the space of possible solutions is now mostly done better and faster by LLMs. And you can run as many in parallel as you can keep up with, in attention and in time (and money).
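A toy sketch of the certificate idea (the problem and the names are invented; the only point is the split between an untrusted producer and a small trusted checker):

```rust
// Untrusted producer: could be a fancy algorithm, a human, or an LLM.
fn untrusted_sort(input: &[u32]) -> Vec<u32> {
    let mut out = input.to_vec();
    out.sort(); // pretend we don't trust this
    out
}

// Trusted checker: tiny and easy to audit. It only verifies that the claimed
// result is sorted and is a permutation of the input; it never has to know
// how the result was produced.
fn check_certificate(input: &[u32], claimed: &[u32]) -> bool {
    let sorted = claimed.windows(2).all(|w| w[0] <= w[1]);
    let mut a = input.to_vec();
    let mut b = claimed.to_vec();
    a.sort();
    b.sort();
    sorted && a == b
}

fn main() {
    let input = vec![3, 1, 2];
    let result = untrusted_sort(&input);
    assert!(check_certificate(&input, &result));
    println!("certificate accepted: {:?}", result);
}
```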
If anyone else is as excited about all this as I am, feel free to reach out in comments, I'd love to hear about people's projects !
I am wondering what exactly you are doing. What tasks are you solving using generated Lean?
The reason we don't all write code to the level that can operate the Space Shuttle is because we don't have the resources and the projects most of us work on all allow some wiggle room for bugs since lives generally aren't at risk. But we'd all love to check in code that was verifiably bug-free, exploit-free, secure etc if we could get that at a low, low price.
I don't see how AI would help with that even if it made writing code completely free. Even if the AI is writing the spec and fully specifies all possible outcomes, the human reviewing it will glance over the spec and approve it, only to change their mind when confronted with the actual behavior or user reports.
There is a problem with this argument similar to one made about imagining the future possibilities of vibe coding [1]: once we imagine AI to do this task, i.e. automatically prove software correct, we can just as easily imagine it to not have to do it (for us) in the first place. If AI can do the hardest things, those it is currently not very good at doing, there's no reason to assume it won't be able to do easier things/things it currently does better. In particular, we won't need it to verify our software for us, because there's no reason to believe that it won't be able to come up with what software we need better than us in the first place. It will come up with the idea, implement it, and then decide to what extent to verify it. Formal verification, or programming for that matter, will not become mainstream (as a human activity) but go extinct.
Indeed, it is far easier for humans to design and implement a proof assistant than it is to use one to verify a substantial computer program. A machine that will be able to effectively use a proof checker, will surely be able to come up with a novel proof checker on its own.
I agree it's not hard to extrapolate technological capabilities, but such extrapolation has a name: science fiction. Without a clear understanding of what makes things easier or harder for AI (in the near future), any prediction is based on arbitrary guesses that AI will be able to do X yet not Y. We can imagine any conceivable capability or limitation we like. In science fiction we see technology that's both capable and limited in some rather arbitrary ways.
It's like trying to imagine what problems computers can and cannot efficiently solve before discovering the notion of computational complexity classes.
It's also possible that LLMs can, by themselves, prove the correctness of some small subroutines, and produce a formal proof that you can check in a proof checker, provided you can at least read and understand the statement of the proposition.
This can certainly make formal verification easier, but not necessarily more mainstream.
But once we extrapolate the existing abilities to something that can reliably verify real large or medium-sized programs for a human who cannot read and understand the propositions (and the necessary simplifying assumptions), it's hard to see a machine do that and at the same time not be able to do everything else.
TL;DR: We don't need to be radically agnostic about the capabilities of AI-- we have enough experience already with the software value chain (with and without AI) for formal verification to be an appealing next step, for the reasons this author lays out.
I know this is just an imprecision of language thing but they aren't 'proving' the software is correct but writing the proofs instead of C++ (or whatever).
I had a bit of a discussion with one of them about this a while ago to determine the viability of having one generate the proofs and use those to generate the actual code, just another abstraction over the compiler. The main takeaway I got from that (which may or may not be the way to do it) is to use the 'result' to do differential testing or to generate the test suite, but that was (maybe, I don't remember) in the context of proving existing software was correct.
I mean, if they get to the point where they can prove an entire codebase is correct just in their robot brains I think we'll probably have a lot bigger things to worry about...
When I recently booted up Google Antigravity and had it make a change to a backend routine for a web site, I was quite surprised when it opened Chrome, navigated to the page, and started trying out the changes to see if they had worked. It was janky as hell, but a year from now it won't be.
Woohoo, we're almost all of the way there! Now all you need to do is ensure that the formal specification you are proving that the software implements is a complete and accurate description of the requirements (which are likely incomplete and contradictory) as they exist in the minds of the set of stakeholders affected by your software.
(sarcasm off).
Server class CPUs and GPUs are littered with side channels which are very difficult to “close”, even in hardened cloud VMs.
We haven’t verified “frontier performance” hardware down to the logic gate in quite some time. Prof. Margaret Martonosi’s lab and her students have spent quite some time on this challenge, and I am excited to see better, safer memory models out in the wild.
A lot of the same big ideas used in hardware are making their way into the software layer too, see https://faultlore.com/blah/tower-of-weakenings/
https://interjectedfuture.com/the-best-way-to-learn-might-be...
You might find it useful. I also caveat the experience and recount some of the pitfalls, which you might enjoy as a skeptic.
Martin Kleppmann seemed to like it too. https://bsky.app/profile/martin.kleppmann.com/post/3m7ugznx4...
Proofs never took off because most software engineering moved away from waterfall development, not just because proofs are difficult. Long formal specifications were abandoned since often those who wrote them misunderstood what the user wanted or the user didn’t know what they wanted. Instead, agile development took over and software evolved more iteratively and rapidly to meet the user.
The author seems to make their prediction based on the flawed assumption that difficulty in writing proofs was the only reason we avoided them, when in reality the real challenge was understanding what the user actually wanted.
The hard part isn’t getting an LLM to grind out proofs, it’s getting organizations to invest in specs and models at all. Right now we barely write good invariants in comments. If AI makes it cheap to iteratively propose and refine specs (“here’s what I think this service guarantees; what did I miss?”) that’s the moment things tip: verification stops being an academic side-quest and becomes another refactoring tool you reach for when changing code, like tests or linters, instead of a separate capital-P “formal methods project”.
One is that modern formal systems like Lean are quite concise and flexible compared to what you're probably expecting. Lean provides the primitives to formalize all kinds of things, not just math or software. In fact, I really believe that basically _any_ question with a rigorous yes-or-no answer can have its semantics formalized into a kind of "theory". The proofs are often close to how an English proof might look, thanks to high-level tactics involving automation and the power of induction.
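To make "concise" concrete, here is a tiny sketch of my own (not from any of the linked projects): a Lean 4 statement plus a proof that is a single tactic call, leaning on the `omega` decision procedure for linear arithmetic that ships with recent Lean.

    -- small, self-contained Lean 4 example; `omega` discharges linear
    -- arithmetic goals over Nat/Int, including truncated subtraction
    theorem cancel (a b : Nat) : (a + b) - b = a := by
      omega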
Another is that proof-checking solves what are (in my opinion) two of the biggest challenges in modern AI: reward specification and grounding. You can run your solver for a long time, and if it finds an answer, you can trust that without worrying about reward hacking or hallucination, even if the answer is much too complicated for you to understand. You can do RL for an unlimited time for the same reason. And Lean also gives you a 'grounded' model of the objects in your theory, so that the model can manipulate them directly.
In combination, these two properties are extremely powerful. Lean lets us specify an unhackable reward for an extremely diverse set of problems across math, science, and engineering, as well as a common environment to do RL in. It also lets us accept answers to questions without checking them ourselves, which "closes the loop" on tools which generate code or circuitry.
I plan to write a much more in-depth blog post on these ideas at some point, but for now I'm interested to see where the discussion here goes.
[1] https://leandojo.org/leandojo.html [2] https://aristotle.harmonic.fun/
Generation of proofs requires a lot of complex pattern matching, so it's a very good fit for LLMs (assuming sufficiently big datasets are available). And we can automatically verify LLM output, so hallucinations are not the problem. You still need proper engineers to construct and understand specifications (with or without LLM help), but it can significantly reduce development cost of high assurance software. LLMs also could help with explaining why a proof can not be generated.
But I think it would require a Rust-like breakthrough, not in the sense of developing the fundamental technology (after all, Rust is fairly conservative from the PLT point of view), but in the sense of making it accessible for a wider audience of programmers.
I also hope that we will get LLM-guided compilers which generate equivalency proofs as part of the compilation process. Personally, I find it surprising that the industry is able to function as well as it does on top of software like LLVM, which feels like a giant with feet of clay with its numerous miscompilation bugs and human-written optimization heuristics applied to a somewhat vague abstract machine model. Just look how long it took to fix the god damn noalias attribute! If not for Rust, it probably would've still been a bug-ridden mess.
I don't know if TLA+ is going to suddenly appear as 'the next language I want to learn' in Stackoverflow's 2026 Developer Survey, but I bet we're going to see a rise in testing frameworks/languages. Anything to make it easier for an agent to spit out tokens or write smaller tests for itself.
Not a perfect piece of evidence, but I'm really interested to see how successful Reflex[1] is in this upcoming space.
[0] https://simonwillison.net/2025/Dec/15/porting-justhtml/ [1] https://github.com/reflex-dev/reflex
This is - in my opinion - one of those. If an AI is able to formally verify with the same rigor that a system designed specifically for that purpose is able to do it I think that would require AGI rather than a simpler version of it. The task is complex enough that present day AI's would generate as much noise as they would generate signal.
The program used to check the validity of a proof is called a kernel. It just needs to check one step at a time, and the possible steps are just basic logic rules (see the sketch after this list). People can gain more confidence in its validity by:
- Reading it very carefully (doable since it's very small)
- Having multiple independent implementations and compare the results
- Proving it in some meta-theory. Here the result is not correctness per se, but relative consistency. (Although it can be argued all other points are about relative consistency as well.)
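For a feel of what the kernel actually certifies, here is a toy Lean 4 example of my own: however the proof was found (by hand, by a tactic, or by an LLM), the kernel only re-checks the resulting term, one inference step at a time.

    -- the kernel re-checks this term regardless of how it was produced
    theorem two_le_four : 2 ≤ 4 := by
      decide

    -- prints which axioms (if any) the proof depends on, so a stray
    -- `sorry` or extra axiom can't hide in the result
    #print axioms two_le_four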
Verification is substantially more challenging.
Currently, even for an expert in the domains of the software to be verified and the process of verification, defining a specification (even partial) is both difficult and tedious. Try reading/comparing the specifications of e.g. a pure crypto function, then a storage or clustering algorithm, then seL4.
(It's possible that brute force specification generation, iteration, and simplification by an LLM might help. It's possible an LLM could help eat away complexity from the other direction, unifying methods and languages, optimising provers, etc.)
Strong type systems already provide partial validation, which helps quite a lot IMO. The better we can model the state and the more constraints we can define in the model, the closer we get to writing "self-proven" code. I would assume formal proofs do way more than just ensuring validity of the model, but similar approaches can be integrated into mainstream programs as well, I believe.
Another take is that LLMs don't have enough conceptual understanding to actually create proofs for the correctness of code.
Personally I believe this kind of work is predicated on more ergonomic proof systems. And those happen to be valuable even without LLMs. Moreover the built in guarantees of rust seem like they are a great start for creating more ergonomic proof systems. Here I am both in awe of Kani, and disappointed by it. The awe is putting in good work to make things more ergonomic. The disappointment is using bounded model checking for formal analysis. That can barely make use of the exclusion of mutable aliasing. Kani, but with equational reasoning, that's the way forward. Equational reasoning was long held back by needing to do a whole lot of pointer work to exclude worries of mutable aliasing. Now you can lean on the borrow checker for that!
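For anyone who hasn't seen it, a Kani harness is roughly the following (my own toy example, using the documented #[kani::proof] / kani::any entry points; run with `cargo kani`). The bounded-model-checking flavor lamented above shows up once loops need explicit unwind bounds.

    fn clamp_to_100(x: u32) -> u32 {
        if x > 100 { 100 } else { x }
    }

    #[cfg(kani)]
    #[kani::proof]
    fn check_clamp() {
        let x: u32 = kani::any();        // symbolic input: stands for every possible u32
        assert!(clamp_to_100(x) <= 100); // Kani checks this for all of them
    }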
I wouldn't. An unreadable mess that has been formally verified is worse than a clear easy to understand piece of code that has not.
Code is rarely written from scratch. As long as you want humans to maintain code, readability is crucial. Code is changed magnitudes more often than written initially.
Of course, if you don't want humans to maintain the code then this point is moot. Though he gets to the catch later on: then we need to write (and maintain and debug and reason about) the specification instead. We will just have kicked the can down the road.
And nothing is stopping the AI from making the unreadable mess more readable in later iterations. It can make it pass the spec first and make it cleaner later. Just like we do!
At worst, we eventually create a sentient AI that can use our trust of the generated code to jailbreak and distribute itself like an unstoppable virus, and we become its pets, or are wiped out.
Personally, all my vibe coding includes a prompt to add comments to explain the code, and I review every line.
A formally verified system is easier for the model to check and consequently easier for it to program to. I suppose the question is whether or not formal methods are sufficiently tractable that they actually do help the LLM be able to finish the job before it runs out of its context.
Regardless, I often use coding assistants in that manner:
1. First, I use the assistant to come up with the success condition program
2. Then I use the assistant to solve the original problem by asking it to check with the success condition program
3. Then I check the solution myself
It's not rocket science, and is just the same approach we've always taken to problem-solving, but it is nice that modern tools can also work in this way. With this, I can usually use Opus or GPT-5.2 in unattended mode.
0: https://wiki.roshangeorge.dev/w/Blog/2025-12-11/LLMs_Excel_A...
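A toy version of that workflow in Rust (all names and the task are mine, just to show the shape): step 1 produces a checker encoding the success condition, step 2's solution is then run against it, and the manual review in step 3 isn't shown.

    use std::collections::HashSet;

    // step 1: the "success condition program" - a checker, not the solution
    fn check(input: &[i32], output: &[i32]) -> bool {
        let in_set: HashSet<_> = input.iter().collect();
        let out_set: HashSet<_> = output.iter().collect();
        out_set.len() == output.len() // no duplicates in the output
            && in_set == out_set      // same set of elements as the input
    }

    // step 2: the candidate solution, produced separately (e.g. by the assistant)
    fn dedup(input: &[i32]) -> Vec<i32> {
        let mut seen = HashSet::new();
        input.iter().copied().filter(|x| seen.insert(*x)).collect()
    }

    fn main() {
        let input = [3, 1, 3, 2, 1];
        let output = dedup(&input);
        assert!(check(&input, &output)); // run the solution against the success condition
        println!("{:?}", output);        // [3, 1, 2]
    }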
> 2. Then I use the assistant to solve the original problem by asking it to check with the success condition program
This sounds a lot like Test-Driven Development. :)
As for those who use LLMs to impersonate humans, which is the kind of verification (verify that this solution that is purported to be built by a human actually works), I have no doubt we will rapidly evolve norms that make us more resistant to them. The cost of fraud and anti-fraud is not zero, but I suspect it will be much less than we fear.
One example could be a low-level programming language for a given PLC manufacturer, where the prompt comes from a context-aware domain expert, and the LLM is able to output proper DSL code for that PLC. Think of "make sure this motor spins at 300rpm while this other task takes place"-type prompts.
The LLM essentially needs to juggle between understanding those highly-contextual clues, and writing DSL code that very tightly fits the DSL definition.
We're still years away from this being thoroughly reliable for all contexts, but it's interesting research nonetheless. Happy to see that someone also agrees with my sentiment ;-)
Fans of LLMs brag about speed and productivity.
LLMs enable faster code bootstrapping and experimentation, not only for the vibe coders but also for the researchers, many of whom are not really good coders, btw. It may well be that we will see new wild verification tools soon that come as a result of quick iteration with LLMs.
For example, I recently wrote an experimental distributed bug finder for TLA+ with Claude in about three weeks. A couple of years ago that effort would require three months and a team of three people.
> At present, a human with specialist expertise still has to guide the process, but it’s not hard to extrapolate and imagine that process becoming fully automated in the next few years.
We already had some software engineers here on HN explain that they don't make much use of LLMs because the hard part of their job isn't actually writing the code, but understanding the requirements behind it. And formal verification is all about requirements.
> Reading and writing such formal specifications still requires expertise and careful thought. But writing the spec is vastly easier and quicker than writing the proof by hand, so this is progress.
Writing the spec is easier once you are confident about having fully understood the requirements, and here we get back to the above issue. Plus, it is already the case that you don't write the proof by hand, this is what the prover either assists you with or does in full.
> I find it exciting to think that we could just specify in a high-level, declarative way the properties that we want some piece of code to have, and then to vibe code the implementation along with a proof that it satisfies the specification.
And here is where I think problems will arise: moving from the high level specification to the formal one that is the one actually getting formally verified.
Of course, this would still be better than having no verification at all. But it is important to keep in mind that, with these additional levels of abstractions, you will likely end up with a weaker form of formal verification, so to speak. Maybe it is worth it to still verify some high assurance software "the old way" and leave this only for the cases where additional verification is nice to have but not a matter of life or death.
What it will make go mainstream, and in fact has already started to, is “ChatGPT verified it so it must be OK.”
2020: I don't care how it performs
2030: I don't care why it performs
2040: I don't care what it performs
More likely is the rise of test driven development, or spec driven development.
Most popular programming languages are optimized for human convenience, not for correctness! Even most of the popular typed languages (Java/Kotlin/Go/...) have a wide surface area for misuse that is not caught at compile time.
Case in point: In my experience, LLMs produce correct code way more regularly for Rust than for Js/Ts/Python/... . Rust has a very strict type system. Both the standard library and the whole library ecosystem lean towards strict APIs that enforce correctness, prevent invalid operations, and push towards handling or at least propagating errors.
The AIs will often write code that won't compile initially, but after a few iterations with the compiler the result is often correct. Strong typing also makes it much easier to validate the output when reviewing.
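A trivial illustration of what "push towards handling or at least propagating errors" means in practice (my own example, nothing project-specific): the return type forces the caller to deal with the failure case, so generated code that ignores it gets flagged immediately (an ignored Result is a warning, a non-exhaustive match is a compile error).

    use std::num::ParseIntError;

    // invalid or out-of-range input is a value the caller must handle,
    // not a silent crash at runtime
    fn parse_port(s: &str) -> Result<u16, ParseIntError> {
        s.trim().parse::<u16>()
    }

    fn main() {
        match parse_port("8080") {
            Ok(port) => println!("listening on {port}"),
            Err(e) => eprintln!("bad port: {e}"), // the compiler insists this arm exists
        }
    }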
With AIs being able to do more and more of the implementation, the "feel-good" factor of languages will become much less relevant. Iteration speed is not so important when parallel AI agents do the "grunt work". I'd much rather wait 10 minutes for solid output rather than 2 minutes for something fragile.
We can finally move the industry away from wild-west languages like Python/JS and towards more rigorous standards.
Rust is probably the sweet spot at the moment, thanks to it being semi-popular with a reasonably active ecosystem; sadly, I don't think the ideal language exists yet.
What we really want is a language with a very strict, comprehensive type system with dependent types, maybe linear types, structured concurrency, and a built-in formal proof system.
Something like ADA/Spark, but more modern.
How big is the effort of writing a specification for an application versus implementing the application in the traditional way? Can someone with more knowledge chime in here please?
Take sorting a list for example. The spec is quite short.
- for all xs: xs is a permutation of sort(xs)
- for all xs: sorted(sort(xs))
Where we can define "xs is a permutation of ys" as "for all x: occurrences(x, xs) = occurrences(x, ys)"
And "sorted(l)" as "forall xs, x, y, ys: (l = xs ++ [x, y] ++ ys) => x <= y".
A straightforward bubble or insertion sort would perhaps be considered as simple or simpler than this spec. But the sorting algorithms in, say, standard libraries, tend to be significantly more complex than this spec.
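For the curious, the same spec can be written down in Lean 4 with nothing but plain definitions, along these lines (my sketch, with `occurrences` spelled out by hand):

    def occurrences (x : Nat) : List Nat → Nat
      | [] => 0
      | y :: ys => (if x = y then 1 else 0) + occurrences x ys

    def IsPermOf (xs ys : List Nat) : Prop :=
      ∀ x : Nat, occurrences x xs = occurrences x ys

    def Sorted : List Nat → Prop
      | [] => True
      | [_] => True
      | x :: y :: rest => x ≤ y ∧ Sorted (y :: rest)

    -- what it means for `sort` to be a correct sorting function
    def IsSortingFunction (sort : List Nat → List Nat) : Prop :=
      ∀ xs : List Nat, IsPermOf xs (sort xs) ∧ Sorted (sort xs)

Proving that a particular quicksort or timsort satisfies IsSortingFunction is then where the real length comes from, not the spec itself.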
I like the idea that even languages like Rust and Haskell may become more accessible. Learn them, of course, but an LLM can steer you out of getting stuck.
Some students' work that I had to fix (pre-AI) was crashing all over the place, due to !'s instead of ?'s, followed by guard let … {} and if let … {}
Might as well just learn Agda or Lean. There are good books out there. It’s not as hard as the author suggests. Hard, yes, but there’s no Royal road.
AI isn't the solution to move humanity forward in any meaningful way. It is just another aspect of our ability to offload labor. Which in and of itself is fine. Until of course it gets weaponized and is used to remove the human aspect from more and more everyday things we take for granted.
It is being used to accelerate the divisions among people, further separating the human from humanness. I love 'AI' tools, they make my life easier, work-wise. But would I miss it if it wasn't there? No. I did just fine before and would do so after its flame and innovation slowly peters out.
However, I still don't believe in vibe coding full programs. There are too many layers in software systems; even when the program core is fully verified, the programmer must know about the other layers.
If you are an Android app developer, you need to know what phones people commonly use, what kind of performance they have, how apps are deployed through the Google Play Store, how to manage a wide variety of app versions, and how to manage issues when storage is low, the network is offline, the battery is low, or the CPU is in a lower power state.
Problem is, while these issues will be resolved (in one way or another) or left unresolved - since the user will only test the app on their device, and that LLM "roll" will not have optimizations for the broad range of others - the user is still pretty much left clueless as to what has really happened.
Models theoretically inform you about what they did and why they did it (albeit largely using blanket terms and/or phrases unintelligible to the average 'vibe coder'), but I feel like most people ignore that completely, and those who don't wouldn't need to use an LLM to code the entirety of an app anyway.
Still, for very simple projects I use at work just chucking something into Gemini and letting it work on it is oftentimes faster and more productive than doing it manually. Plus, if the user is interested in it, it can be used as a relatively good learning tool.
Also, I am a novice when it comes to programming with sound, and today I have been dorking with a simple limiter. ChatGPT knows way more than me about what I am doing. It has taught me a ton. And as magical and wonderful as it is, it is incredibly tedious to try to work with it to come up with real specifications of interesting properties.
Instead of banging my head against a theorem prover that won't say QED, I get a confident sounding stream of words that I often don't even understand. I often don't even have the language to tell it what I am imagining. When I do understand, it's a lot of typing to explain my understanding. And so often, as a teacher, it just is utterly failing to effectively communicate to me why I am wrong.
At the end of all of this, I think specification is really hard, intellectually creative and challenging work. An LLM cannot do the work for you. Even to be guided down the right path, you will need perseverance and motivation.
But I don't think people will suddenly gravitate towards using them just because they're cheaper to write - bugs of the form "we had no idea this should be considered" are way more common than "we wrote code that didn't do what we wanted it to".
An alternative guess for LLMs and formal verification is that systems where formal verification is a natural fit - code in places that are hard to update and that have well-known conditions - will move faster.
I could also see agent tools embedding formal-methods proofs into their tooling, so they write both the code and the spec at the same time, with the spec acting as memory. That kinda ties into the recent post about "why not have the LLM write machine code?"
In pre-silicon verification, formal has been used successfully for decades, but is not a replacement for simulation based verification.
The future of verification (for hardware and software) is to eliminate verification all together, by synthesizing intent into correct code and tests.
-- https://www.verifai.ai
It appears many of the proof assistants/verification systems can generate OCaml. Or perhaps ADA/Spark?
Regardless of how the software engineering discipline will change in the age of gen AI, we must aim to produce higher not lower quality software than whatever we have today, and formal verification will definitely help.
While I agree formal verification itself has its problems, I think the argument has merit: soon AI-generated code will surpass all human-generated code, and when that happens we at least need a way to prove that the code won't have security issues and that it adheres to compliance/policy.
What will happen instead is a more general application of AI systems to verifying software correctness, which should lead to more reliable software. The bottleneck in software quality is in specifying what the behavior needs to be, not in validating conformance to a known specification.
If we're looking to use LLMs to make code absolutely rock-solid, I would say advanced testing practices are a good candidate! Property-based testing, fuzzing, and contract testing (for example https://github.com/griffinbank/test.contract) are all fun but extremely tedious to write and maintain. I think that makes them the perfect candidate for LLMs. These kinds of tests are also more easily understandable by regular ol' software developers, and I think we'll have to be auditing LLM output for quite a while.
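As a concrete example of the "tedious to write, easy to audit" flavor, here is roughly what a property-based test looks like with Rust's proptest crate (my own toy property; the linked library is Clojure, this is just to show the shape):

    use proptest::prelude::*;

    fn normalize_whitespace(s: &str) -> String {
        s.split_whitespace().collect::<Vec<_>>().join(" ")
    }

    proptest! {
        // the property must hold for every generated string,
        // not just one hand-picked example
        #[test]
        fn normalization_is_idempotent(s in ".*") {
            let once = normalize_whitespace(&s);
            prop_assert_eq!(normalize_whitespace(&once), once);
        }
    }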
https://learn.microsoft.com/en-us/dotnet/csharp/roslyn-sdk/t...
With a domain specific language you can add extra limitations on the kinds of outputs. This can also make formal verification faster.
Maybe like React components. You limit the format. You could limit which libraries can be imported, what hooks could be used, how expressive could be.
It's both. A colorblind person will admit their shortcomings and, if compelled to be helpful like an LLM is, will reason their way to finding a solution that works around their limitations.
But as LLMs lack a way to reason, you get nonsense instead.
This makes me wonder if LLMs works better in Chinese.
https://chatgpt.com/share/6941df90-789c-8005-8783-6e1c76cdfc...
It's an example of a simple task. How often are you relying on LLMs to complete simple tasks?
If you really need an answer and you really need the LLM to give it to you, then ask it to write a (Python?) script to do the calculation you need, execute it, and give you the answer.
So, the job is not done for humans yet.
https://news.ycombinator.com/item?id=46216274 - 4 comments, 6 days ago
https://news.ycombinator.com/item?id=46203508 - 1 comment, 7 days ago
https://news.ycombinator.com/item?id=46198874 - 2 comments, 8 days ago
I'm curious what a proof would look like compared to my RSpec unit tests and what the actual implementation would look like.
A proof establishes that your code works correctly on all paths, inputs, etc.
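Roughly the difference, in Lean 4 terms (my example; I'm assuming the standard-library lemma List.reverse_reverse): a unit test pins down one concrete input, a theorem quantifies over all of them.

    -- like one RSpec example: a single concrete input, checked by computation
    example : [1, 2, 3].reverse.reverse = [1, 2, 3] := rfl

    -- like a proof: every list of naturals at once
    theorem reverse_twice (xs : List Nat) : xs.reverse.reverse = xs :=
      List.reverse_reverse xs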
I agree with the author that if you have the code (and, with an LLM, you do) and a specification, AI agents could be helpful to generate the proof. This is a huge win!
But it certainly doesn't confront the important problem of writing a spec that captures the properties you actually care about. If the LLM writes that for you, I don't see a reason to trust that any more than you trust anything else it writes.
I'm not an expert here, so I invite correction.
Much better to have AI write deterministic test suites for your project.
AI can help write test code and suggest edge cases, but it shouldn’t be trusted to decide whether behavior is correct.
When software is hard to test, that’s usually a sign the design is too tightly coupled or full of side effects, or that the architecture is unnecessarily complicated. Not that the testing tools are bad.
There’s a gap that current LLM architectures simply can’t cross yet.
If you also think this is the future of programming and are interested in building it, please consider joining us: https://jobs.ashbyhq.com/Harmonic. We have incredibly interesting challenges across the stack, from infra to AI to Lean.
The quest for purity is some fountain of youth nonsense that distracts a lot of otherwise brilliant engineers.
Ask the AI to make a program that consumes a program and determines if it halts.
So far so good, though the smaller amount of training data is noticeable.
Example: AI browser agents can be exploited via prompt injection (even Google's new "User Alignment Critic" only catches 90% of attacks).
For password management, we solved this with zero-knowledge architecture - the AI navigates websites but never sees credentials. Credentials stay in local Keychain, AI just clicks buttons.
Formal verification would be amazing for proving these isolation guarantees. Has anyone worked on verifying AI agent sandboxes?
OK.
"Reading and writing such formal specifications still requires expertise and careful thought."
So the promise here is that bosses in IT organisations will in the future have expertise and prioritise careful thought, or allow their subordinates to have and practice these characteristics.
Would be great news if that'll be the case.
Future: AI generates incorrect code, and formal verification that proves that the code performs that incorrect behaviour.
Semgrep isn't a formal methods tool, but it's in the same space of rigor-improving tooling that sounds great but in practice is painful to use consistently.
Or the alternative will happen: people will stop using AI for programming. It's not actually better than hiring a person, it's just supposedly cheaper (in that you can reduce staff). That's the theory anyway. Yes there will be anecdotes from a few people about how they saved a million dollars in 2 days or something like that, the usual HN clickbait. But an actual study of the impact of using AI for programming will probably find it's only a marginal cost savings and isn't significantly faster.
And this is assuming the gravy train that programmers are using (unsustainable spending/building in unreasonable timeframes for uncertain profits) keeps going indefinitely. Best case, when it all falls apart the govt bails them out. Worst case, you won't have to worry about programming because we'll all be out of work from the AI bust recession.
I highly dislike the general tone of the article. Formal methods are not fringe; they are used all around the world by good teams building reliable systems. The fact that they are not mainstream has more to do with the poor ergonomics of the old tools and the corporate greed that got rid of design activities in the software development process to bring about the era of agile cowboy coding. They did this just because they wanted to churn out products quickly at the expense of quality. It was never the correct, civilized way of working and never will be.
> writing proof scripts is one of the best applications for LLMs. It doesn’t matter if they hallucinate nonsense, because the proof checker will reject any invalid proof
Nonsense. If the AI hallucinated the proof script then it has no connection to the problem statement.
Large scale software is made cheap and fast, not good.
I used Rust in my recent project, Deepwalker [0]. I wrote it once and never looked back.
Gödel's incompleteness theorems are a formal argument that only the natural can create the formal (because no formal system can create all others).
Tarski's undefinability theorem gives us the idea that we need different languages for formalization and the formalisms themselves.
The Curry–Howard correspondence concludes that the formalisms that pop out are indistinguishable from programs.
Altogether we can basically synthesize a proof that AGI means automatic formalization, which absolutely requires strong natural systems employed to create new formal systems.
I ended up staying with some family who were watching The Voice. XG performed Glam, and now that I have spit many other truths, may you discover the truth that motivates my work on swapchain resizing. I wish the world would not waste my time and their own, but bootstrapping is about using the merely sufficient to make the good.
>For example, as of 2009, the formally verified seL4 microkernel consisted of 8,700 lines of C code, but proving it correct required 20 person-years and 200,000 lines of Isabelle code – or 23 lines of proof and half a person-day for every single line of implementation. Moreover, there are maybe a few hundred people in the world (wild guess) who know how to write such proofs, since it requires a lot of arcane knowledge about the proof system.
I think this type of pattern (a genuinely difficult problem domain with a very small number of experts) is the future of AI, not AGI. For example, formal verification (as in this article) and, similarly, automated ECG interpretation can be the AI killer applications; the latter is what I'm currently working on.
For most countries in the world, there are only several hundred to several thousand registered cardiologists, making the cardiologist-to-population ratio about 1:100,000.
People expect a cardiologist to go through their ECG readings, but reading an ECG is very cumbersome. Let's say you have 5 minutes of ECG signal, the minimum requirement for AFib detection as per the guideline. The standard ECG is 12-lead, resulting in 12 x 5 x 60 = 3600 beats even for that minimum 5-minute duration (assuming 1 minute of ECG equals 60 beats).
Then of course we have Holter ECG, with typical 24-hour readings that increase the duration considerably; that's why almost all Holter reading is now automated. But current automated ECG detection has very low accuracy, because the accuracy of the detection methods (statistics/AI/ML) is bounded by the beat detection algorithm, for example the venerable Pan-Tompkins for the limited fiducial time-domain approach [1].
Cardiologists would rather spend their time on more interesting activities like teaching future cardiologists, performing expensive procedures like ICD or pacemaker implantation, or taking their once-in-a-blue-moon holidays, instead of reading monotonous patients' ECGs.
This is why ECG reading automation with AI/ML is necessary to complement the cardiologist, but the trick is to push the sensitivity part of the accuracy to a very high value, preferably 100%. We achieved this accuracy for both major heart anomalies, namely arrhythmia (irregular heart beats) and ischemia (the heart not regulating blood flow properly), by going with a non-fiducial detection approach, i.e. beyond the time domain, with the help of statistics/ML/AI. Thus missing potential patients (false negatives) is minimized for the expert-and-cardiologist-in-the-loop exercise.
[1] Pan–Tompkins algorithm:
https://en.wikipedia.org/wiki/Pan%E2%80%93Tompkins_algorithm
If anyone does write a specification, the "AI" won't even get past the termination proof of a moderately complex function, which is the first step of getting said function accepted in the proof environment - before you can even start the actual proof.
This article is pretty low on evidence, perhaps it is about getting funding by talking about "AI".
hahahahaha
What makes it different, other than calling it "verification"?
I added LSP support for images to get better feedback loops, and Opus was able to debug https://github.com/alok/LeanPlot. The entire library was vibe coded by older AI.
It also wrote https://github.com/alok/hexluthor (a hex color syntax highlighting extension that uses Lean's metaprogramming and LSP to show you what color a hex literal is) by using feedback and me saying "keep goign" (yes, I misspelled it).
It has serious issues with slop and the limitations of small data, but the rate of progress is really really fast. Opus 4.5 and Gemini were a huge step change.
The language is also improving very fast, though not as fast as AI.
The feedback loop is very real even for ordinary programming. The model really resists it though because it’s super hard, but again this is rapidly improving.
I started vibe coding Lean about 3 years ago and I’ve used Lean 3 (which was far worse). It’s my favorite language after churning through idk 30?
A big aspect of being successful with them is not being all or nothing with proofs. It’s very useful to write down properties as executable code and then just not prove them because they still have to type check and fit together and make sense. github.com/lecopivo/scilean is a good example (search “sorry_proof”).
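A toy example of that style (names are mine): the property is written down and has to type-check against the definition, but the proof itself is deferred with `sorry`, which is exactly the scilean "sorry_proof" trade-off.

    def double (n : Nat) : Nat := 2 * n

    -- stated and type-checked, but the actual proof is postponed
    theorem double_is_even (n : Nat) : double n % 2 = 0 := by
      sorry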
There’s property testing with “plausible” as a nice 80/20 that can be upgraded to full proof at some point.
When the model gets to another jump in capacity, I predict it will emergently design better systems from the feedback needed to prove that they are correct in the first place. Formal verification has a tendency, like optimization, to flow through the system in an anti-modular way, and if you want to claw modularity back, you have to design it really, really well. But AI gives a huge intellectual overhang. Why not let the models put their capacity towards making better systems?
Even the documentation system for lean (verso) is (dependently!) typed.
Check out my Lean vibe codes at https://github.com/alok?tab=repositories&q=Lean&type=&langua...
The frameworks are to improve lean’s programming ecosystem and not just its proving. Metaprogramming is pretty well covered already too, but not ordinary programs.
AI is unreliable as it is. It might make formal verification a bit less work-intensive, but the last place anyone would want AI hallucinations is in verification.
I am finding agent tooling expands my capacity for multi-language projects