This, but also for code. I just don't trust new code, especially generated code; I need time to sit with it. I can't make the "if it passes all the tests" crowd understand and I don't even want to. There are things you think of to worry about and test for as you spend time with a system. If I'm going to ship it and support it, it will take as long as it will take.
> I need time to sit with it
Everyone knows doing the work yourself is faster than reviewing somebody else's if you don't trust them. I'd argue that if AI ever gets to the point where you fully trust it, all white-collar jobs are gone.
We have control points (prompts + context) and we ask LLMs to draw a 3D surface which passes through those points satisfying some given constraints. Subsequent chats are like edit operations.
If the code passes the tests, and also works at the functionality level - what difference does it make whether you've read the code or not?
You could come up with pathological cases, like: it passed the tests by deleting them, or the code it wrote is extremely messy.
But we know that LLMs are way smarter than this. There's a very, very low chance of this happening, and even if it does, a quick glance at the code can fix it.
The code may seem to work functionally on day 1. Will it continue to seem to work on day 30? Most often it doesn't.
And in my experience, the chances of LLMs fucking up are hardly very very low. Maybe it's a skill issue on my part, but it's also the case that the spec is sometimes discovered as the app is being built. I'm sure this is not the case if you're essentially summoning up code that exists in the test set, even if the LLM has to port it from another language, and they can be useful in parts here and there. But turning the controls over to the infinite monkey machine has not worked out for me so far.
Why doesn’t outsourcing work if this is all that is needed?
I have a hypothesis, though.
The quality of the output, when you don't own the long-term outcome or maintenance, is very poor.
That isn't the case with AI in the same sense it is with human contractors.
The most immediate example I can think of is the beans LLM workflow tracker. It's insane that it's measured in the hundreds of thousands of LoC, and getting that thing set up in a repo is a mess. I had to use GitHub Copilot to investigate the repo just to find the latest method. This wouldn't fly at my employer, but a lot of projects are going to be a lot less scrupulous.
You can see the effects in popular consumer-facing apps too: Anthropic has drunk way too much of its own Kool-Aid and now I get 10-50% failure rates on messages in their iOS app depending on the day. Some of their devs have publicly said that Claude writes 100% of their code, and it's starting to show. Intermittent network failures and retries have been a solved problem for decades, ffs!
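To be clear about the "solved problem" part: it's just a retry loop with exponential backoff and jitter. A minimal sketch, assuming a plain HTTP GET (the URL is a placeholder, not a real endpoint):

```python
import random
import time
import urllib.error
import urllib.request

def fetch_with_retries(url: str, max_attempts: int = 5, base_delay: float = 0.5) -> bytes:
    """Retry transient network failures with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError):
            if attempt == max_attempts:
                raise  # out of retries, let the caller see the real error
            # 0.5s, 1s, 2s, ... plus jitter so clients don't retry in lockstep
            time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.25))
    raise RuntimeError("unreachable")

if __name__ == "__main__":
    # placeholder URL; swap in whatever API you're actually calling
    print(len(fetch_with_retries("https://example.com/api/messages")))
```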
On the verification loop: I think there’s so much potential here. AI is pretty good at autonomously working on tasks that have a well defined and easy to process verification hook.
A lot of software tasks are “migrate X to Y” and this is a perfect job for AI.
The workflow is generally straightforward - map the old thing to the new thing and verify that the new thing works the same way. Most of this can be automated using AI.
Wanna migrate a codebase from C to Rust? I definitely think it should be possible autonomously if the codebase is small enough. You do have to ask the AI to intelligently come up with extensive ways to verify that the two versions work the same. Maybe a UI check, sample input/output checks on the API, and functionality checks.
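For what it's worth, the "verify they work the same" part doesn't need anything fancy. Here's a rough sketch of a differential harness, assuming both the C original and the Rust port build to CLI binaries and you keep a directory of sample inputs (all paths and names here are made up):

```python
import subprocess
from pathlib import Path

# hypothetical paths: the old C binary and the new Rust binary
OLD_BIN = "./build/tool_c"
NEW_BIN = "./target/release/tool_rs"

def run(binary: str, input_file: Path) -> tuple[int, bytes, bytes]:
    """Run one binary on one sample input and capture exit code, stdout, stderr."""
    proc = subprocess.run([binary, str(input_file)], capture_output=True, timeout=30)
    return proc.returncode, proc.stdout, proc.stderr

def main() -> None:
    mismatches = 0
    for sample in sorted(Path("samples").glob("*.in")):  # hypothetical input corpus
        if run(OLD_BIN, sample) != run(NEW_BIN, sample):
            mismatches += 1
            print(f"MISMATCH on {sample}")
    print(f"{mismatches} mismatching inputs")
    raise SystemExit(1 if mismatches else 0)

if __name__ == "__main__":
    main()
```

The exit code makes it a clean verification hook: the agent keeps iterating on the Rust side until the harness returns 0.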
It's scary how good it's become with Opus 4.5. I've been experimenting with giving it access to Ghidra and a debugger [1] for reverse engineering and it's just been plowing through crackmes (from sites like crackmes.one where new ones are released constantly). I haven't bothered trying to have it crack any software but I wouldn't be surprised if it was effective at that too.
I'm also working through reverse engineering several file formats by just having it write CLI scripts to export them to JSON and then recreate the input file byte for byte with an import command, using either CLI hex editors or custom diff scripts (vibe-coded by the agent).
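That feedback loop is basically a round-trip check. Something like this, assuming the agent's export/import scripts take input and output paths (the script names are made up):

```python
import subprocess
import sys
from pathlib import Path

def round_trip_ok(original: Path) -> bool:
    """Export a file to JSON, re-import it, and require a byte-for-byte match."""
    exported = original.with_suffix(".json")
    rebuilt = original.with_suffix(".rebuilt")

    # hypothetical vibe-coded scripts the agent wrote for this file format
    subprocess.run([sys.executable, "export.py", str(original), str(exported)], check=True)
    subprocess.run([sys.executable, "import.py", str(exported), str(rebuilt)], check=True)

    a, b = original.read_bytes(), rebuilt.read_bytes()
    if a == b:
        return True
    # report the first differing offset so the agent has something concrete to chase
    offset = next((i for i, (x, y) in enumerate(zip(a, b)) if x != y), min(len(a), len(b)))
    print(f"{original}: first diff at byte {offset} (sizes {len(a)} vs {len(b)})")
    return False

if __name__ == "__main__":
    results = [round_trip_ok(Path(p)) for p in sys.argv[1:]]
    sys.exit(0 if all(results) else 1)
```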
I still get routinely frustrated trying to use it for anything complicated but whole classes of software development problems have been reduced to vibe coding that feedback loop and then blowing through Claude Max rate limits.
[1] Shameless plug: https://github.com/akiselev/ghidra-cli https://github.com/akiselev/debugger-cli
* The interface is near-identical across bots
* I can switch bots whenever I like. No integration points or vendor lock-in.
* It's the same risk as any big-tech website.
* I really don't need more tooling in my life.
Any coding agent should slot easily into whatever IDE or workflow you need.
The agents are not fully fungible though. Each has its own characteristics.
Lots of the people who become successful are the ones who get this prediction right.
This is better because I use my own testing as a forcing function to learn and understand what the AI has done. Only after that primary testing might I tell it to do some checking for itself.
Real quote:
> "Hence their value stems from the discipline and the thinking the writer is forced to impose upon himself as he identifies and deals with trouble spots in his presentation."
I mean, seriously?