It’s really eye-opening to work with these tools on a codebase you know deeply, because these problems are everywhere.
However, if I opened an unfamiliar project in another language and wanted to add a little feature with no intention of maintaining it, I’d happily accept the changes and loop until it worked well enough for my temporary needs.
The scary middle is when you’re dealing with coworkers who don’t care about anything other than closing tickets and collecting credit. With enough of a token budget you can now wrap loops around an LLM and have it try things until the program appears to work, then ask it to do a code review and submit the PR without ever having understood what it was doing. There are a lot of workplaces where there isn’t a good mechanism to push back on this, and the tech debt just keeps growing.
I was working on creating a next-n-actions predictor for one of our use cases and not paying much attention for a PoC. I was fairly happy with the progress for a few days, before actually reading the eval code and seeing that we leaked the final state in every eval.
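A cheap guard against that kind of leak is an assertion inside the eval harness itself. This is only a minimal sketch: it assumes each eval example is a dict with hypothetical `context` and `target` lists of action strings, which is not necessarily how the PoC above stored its data.

```python
from typing import Sequence


def find_leaks(examples: Sequence[dict], n: int) -> list:
    """Return indices of eval examples whose context already reveals the answer.

    Assumed (hypothetical) schema: example["context"] is the list of actions
    the model sees, example["target"] is the list of next actions to predict.
    """
    leaked = []
    for i, ex in enumerate(examples):
        context, target = ex["context"], ex["target"][:n]
        # Leak: the actions to predict already sit at the tail of the context,
        # i.e. the model was shown the final state.
        if target and context[-len(target):] == target:
            leaked.append(i)
    return leaked


examples = [
    {"context": ["open", "edit"], "target": ["save", "close"]},                   # clean
    {"context": ["open", "edit", "save", "close"], "target": ["save", "close"]},  # leaked
]
leaks = find_leaks(examples, n=2)
print(leaks)
```

Running a check like this before trusting any metric would not catch every form of leakage, but it is the sort of thing that takes minutes to write and would have flagged the problem on day one.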
It's nice to let Claude run loose on porting from framework to framework (port my code from TRL to NemoRL to Tinker to VeRL), but looking at what it does in the intermediate steps makes me want to claw my eyes out. And getting it to adhere to our domain model (e.g. we have an SFTConfig and a .to_trl(), or a Row and a .to_harmony()) is impossible.
I don't want to start a fight or anything but IME Codex has a bit more of a spine. If you point out something weird, it sometimes gives a good reason for it. Whereas Claude will always say "whoopsie you're right as always sir" even when it's me who missed something.
But your comment just made me wonder whether this tendency for LLMs to resort to flattery when found out is a built-in strategy to distract the user from the error-prone fragility of much of the output. It's perhaps a stretch to think these canned responses were put in strategically, but the result is that the user's attention may be deflected to contemplating their own superior knowledge and insight and basking in the glory of all that, while forgetting to appreciate that 'Hey, chatLLM is just making all this stuff up and doesn't know which way is up or down!'
Not sure if there are sycophancy benchmarks for coding agents
I have a Rails background, so maybe KISS is more engrained in my philosophy than whatever training material was used on AI. At least it isn't heavily pushing design patterns...
It sounds like you've not conditioned your Claude to stop being a sycophant yet?
If the "big ball of spaghetti" theory holds, where software companies who can't manage the debt stumble over themselves as they continue to add to the big ball of spaghetti code, I guess we'll see a row of companies declaring "software bankruptcy" or something in some/many months, depending on how well these workspaces learn to care slightly more and get better at pushing back against slop.
People call coding agents bad because they don't know the asinine meaningless conventions at their particular company while they themselves write awful abstractions and brittle tightly coupled systems, but hey, at least they know how to write a for loop how their particular company likes.
An average enterprise developer would never add bloat like that up-front, unless the ability to change the order was a requirement.
Obviously a stable order can be easily derived from the ID or a creation time (if available).
Setting a position however requires extra steps to ensure the integrity of the sequence.
I see things like that all the time, and it's always stuff that grows the code base and adds unnecessary complexity.
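To make the difference concrete, here is a rough sketch with a hypothetical `Item` record (the field names are assumptions, not taken from any real schema). Deriving the order needs no extra state, while an explicit `position` needs renumbering on every move to keep the sequence consistent.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Item:
    id: int
    created_at: datetime
    position: int = 0  # hypothetical column, only useful if users can reorder


def derived_order(items):
    # Stable order for free: sort on data the row already has.
    return sorted(items, key=lambda it: (it.created_at, it.id))


def move(items, item_id, new_index):
    # The extra steps: every move renumbers the whole sequence so that
    # positions stay unique and gap-free.
    ordered = sorted(items, key=lambda it: it.position)
    moved = next(it for it in ordered if it.id == item_id)
    ordered.remove(moved)
    ordered.insert(new_index, moved)
    for pos, it in enumerate(ordered):
        it.position = pos


items = [
    Item(3, datetime(2024, 1, 3), position=0),
    Item(1, datetime(2024, 1, 1), position=1),
    Item(2, datetime(2024, 1, 2), position=2),
]
print([it.id for it in derived_order(items)])
move(items, item_id=2, new_index=0)
print([it.id for it in sorted(items, key=lambda it: it.position)])
```

In a real database the renumbering would also have to be transactional, which is exactly the kind of complexity that is not worth paying for until reordering is an actual requirement.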
And how long does it take a coding agent to output a thousand lines of code versus a human? The worst human at any company was rate limited by themselves. Those 'average enterprise' programmers aren't going away, they're the ones now spending tens of thousands on coding agents and filling your codebase with even more garbage without bothering to review an iota of it.
In the past, a team of five mid devs and one good one would be fine, because that good one would ride herd on the mid ones. But now those mid ones are slamming out robot code that they're incapable of meaningfully reviewing (because it's better than they are already), and they're just overwhelming the good dev's capacity.
The solution, of course, is to fire them all -- they're worthless now -- but this is not going to happen quickly, and it's probably for the best that it doesn't.
Why is this worse than splitting it across 1k files?
Having worked 20 years in this field and managed a few projects, no, I wouldn't make a dozen mistakes, because I would refuse to take on work I can't responsibly do.
Invasive and risky work IS the thing I want to be working on because it's the place where I can be most valuable, but part of my value comes from asking the right people the right questions. If I'm working on something invasive and risky, I'm going to work directly with the people who wrote it, and only when THEY think I understand it well enough am I venturing in alone.
Absent access to the people who wrote the code, I'm going to start by writing tests around the code and spend a lot of time checking my initial assumptions upon reading the code, because I know that I don't know what I don't know.
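For anyone unfamiliar with what "writing tests around the code" looks like, it is usually a characterization test: pin down what the code does today before touching it. A minimal sketch, where `legacy_price` is a made-up stand-in for the inherited code:

```python
def legacy_price(quantity: int, unit_cents: int) -> int:
    # Hypothetical stand-in for inherited code nobody fully understands.
    total = quantity * unit_cents
    if quantity >= 10:
        total -= total // 10  # undocumented bulk discount
    return total


# Inputs chosen around the places where my assumptions are weakest,
# paired with the outputs observed from the current implementation.
PINNED_BEHAVIOUR = {
    (1, 250): 250,
    (9, 100): 900,
    (10, 100): 900,   # surprise: 10 units cost the same as 9
    (11, 99): 981,
}


def test_characterize_legacy_price():
    for (quantity, unit_cents), expected in PINNED_BEHAVIOUR.items():
        assert legacy_price(quantity, unit_cents) == expected


test_characterize_legacy_price()
```

The test does not claim the behaviour is right, only that it is what currently happens, so any change I make that shifts it has to be a deliberate one.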
Yeah, if I foolishly just started making changes, I'd make mistakes, but that's missing the point: a good senior engineer knows not to do that.
That's the failure point of AI: it's arrogant. It will provide you statements without any idea if they're true and make changes without any idea if they're correct. It will never tell you "I don't know how to do that" or even "I am not sure if this is correct". It just does the work with infinite confidence even when that confidence is not justified and often it will be just as hard to figure out if the AI's work is correct as it would be to do the work yourself.
I agree with your take, but AI is exactly as arrogant as the human driving it.
I'm not making an argument in favor of people using LLMs for this, but people were doing this before we had LLMs; it was just usually a bit slower. I can't even say it usually doesn't work out long term because I worked with a lot of guys who did this and took a ton of Adderall while working practically around the clock. Every incentive structure in the organizations rewarded it along with social credibility from more junior engineers. (The last cowboy I worked with who pulled this shit ended up becoming the most senior engineer in the company, a multi-millionaire and worshipped like a god by 90% of the mostly fresh grads we were hiring).
The problem is that when these people inevitably burn out and leave, they leave a massive vacuum in their stead. Not from the load they were carrying, but from the load they were creating.
I think the larger the organization I've been at, the more they reward the people making huge commits on nights and weekends. Worse, they could get away with TBRing their shit and merging it without review.
LLMs are often all of the bad habits and organizational problems that we already carried, just being speedrun. There are some places doing it right, but they already were.
Could you be more specific what "right" is?
> I can't even say it usually doesn't work out long term because I worked with a lot of guys who did this and took a ton of Adderall while working practically around the clock. Every incentive structure in the organizations rewarded it along with social credibility from more junior engineers. (The last cowboy I worked with who pulled this shit ended up becoming the most senior engineer in the company, a multi-millionaire and worshipped like a god by 90% of the mostly fresh grads we were hiring).
I'm having a tough time believing this; it sounds like you're trying to backwards-rationalize that more productive engineers were "on drugs" and that they delivered but "did it wrong".
(There are workplaces where that's the norm, I know -- it tends to be a thing with smaller teams with codebases that everyone understands fully, and much less a thing with larger teams where different people have areas of the code they understand more than others.)
With AI code, though, it's _your code_ and you can't just give it an LGTM; you actually need to dig at it until you fully understand it, fully agree with it, and could justify it to a hostile reviewer. It's a different level of rigor.
Not all engineers apply that rigor, though, which becomes a problem.
If it’s not good it’s not good.
The problem is this. Human cognitive resources are finite, so we inevitably become shallow outside our own expertise. There is no programmer who can do everything well. And as systems grow in scale, they become more modularized and fragmented, making it impossible to understand the whole system. So what should we do about this? That's always the question.
In the end, do I choose not to use AI, finish the project with uneven code outside my domain, and deliver it? Or do I use AI and deliver a program that is uniform and consistent, but not in my own style? I still don't know. I haven't found the answer yet.
In the end, an exceptionally skilled programmer might be able to keep their core domain intact, but I think the vast majority would find that very difficult. So it might be possible once you cross a certain threshold, but considering the sheer amount of code required to deliver a single modern program, it's hard to know which parts to focus on. However, my perspective might be different because I'm coming from the point of view of delivering a working program, not from the perspective of open source development.
Pinky promise that's enough to get good output.
Pinky promise we won't invent yet another body of work the whole industry must adopt to get good output.
Pinky promise the AI tool will properly read all your work
And then of course we are told you must never trust its output !? You must review all code it produces line by line and grok it fully !
And now we have: keep challenging it, keep rejecting it, keep interrogating it... That's just fancy words for spend more money (tokens)
LLMs are perfect for quick prototypes, speed runs, learning, etc., but if the code really matters it's still not clear-cut. I think the definition of what "really matters" is very project-dependent, of course. As an extreme example, you would want to understand every line of the code for the control system that runs an MRI machine or a jet engine, since bugs might mean life or death. Depositing money into the wrong account might not kill anyone but could lead to severe economic losses. But then again, even problems in far less consequential software may be drastically sub-economic (i.e. saving $1000 on the implementation might cost $10000 if customers aren't happy and fail to renew). Pick your scenario, I guess.
The problem is, this isn't going to change regardless of how well a new model scores on a benchmark. It seems AGI is actually needed.
Good ol' software architecture tricks can also help you slot "vibe coded" components into a larger system safely.
What I'm hoping to build ultimately is something that works more like a pair-programming partner than existing harnesses do. I want the user to be an engaged part of the development process all the way through, I don't want the agent disappearing to work on its own. I even want to make it possible for users to swap into the driver role and have the LLM automatically assume the role of navigator when that happens.
There's more info in the readme (actually the readme is all that exists so far, I wanted to get the idea straight in my head first):
https://gitlab.com/philbooth/opair
Even if nobody else uses it, I hope it will be a useful tool for myself and help me find a way to work with LLMs that doesn't harm my mental models, which is what I feel current harnesses do.
I try to make sure the architecture docs of the code base are refreshed regularly based on recent changes, so it's easier for humans and AI agents to make sense of the code.
I also regularly stop all other development and just focus on auditing the code base with these AIs to make sure it is secure, robust, clean, well structured, and well tested -- some refactoring is needed most of the time, and it's well worth it.
With this approach, nowadays I often merge code from AI without completely understanding what it's doing, but it seems the code has been working so far. :)
I do sometimes have to steer the discussions between the AIs in the right direction if they deviate too far from the real problem, either because they miss some context or because my original description of the problem was misleading.
To do that formally, I have a mechanism built into the review loop where if a comment on a GitHub issue or PR is signed as "-- Human Reviewer", then all AI agents have to treat the comment as the highest priority item to address.
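The loop's actual implementation isn't described here, so the following is only a guess at the simplest form such a rule could take: comments carrying the signature get moved to the front of the queue the agents work through. The function name and the sample comments are made up for illustration.

```python
HUMAN_SIGNATURE = "-- Human Reviewer"  # the signature convention described above


def prioritize(comments):
    """Put human-signed comments first, keeping the original order otherwise."""
    def is_agent_comment(body: str) -> bool:
        return not body.rstrip().endswith(HUMAN_SIGNATURE)

    # sorted() is stable and False sorts before True, so human-signed
    # comments come first and everything else keeps its relative order.
    return sorted(comments, key=is_agent_comment)


comments = [
    "nit: rename this variable",
    "Use the existing parser instead of adding a new one.\n-- Human Reviewer",
    "LGTM",
]
queue = prioritize(comments)
print(queue[0])
```

A plain-text signature is trivially easy for an agent to imitate, so a stricter version would presumably check the comment author's account rather than the text alone.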
Each implementation is also reviewed by me before merging to master. I complete PRs only when I'm satisfied with the implementation, my feedback is addressed, and I fully understand what is going on. Agents are the replacement for typing and productivity multipliers.
I have a big-picture view of the product; each plan implements only a part of it, scoped to avoid merging unreviewed slop. Probably slower, but the result is much better.
(For as long as that's true, "software developer" is still a job. It's not clear for how long it will be true.)
Meanwhile, those codebases often require a ton of boilerplate and drudgery to get anything done.
In these spaces it's very easy to read and comprehend AI generated output and review it fairly quickly. So the time savings from dealing with all that boilerplate and conforming with all that existing infrastructure are potentially substantial.
However, if you’re highly familiar with a domain then LLMs are much less useful.
Adequate often means done and cheap
It really, REALLY depends what you're working on. If you're throwing together an internal tool or simple dashboard, it doesn't really matter what the code looks like. But if you're writing software that other programs will depend on, bad design choices ripple out and affect another generation of software. Imagine slop in the Linux kernel, in Google Chrome, or in your compiler or runtime. It's not acceptable.
I know a lot of people spend their careers writing end user software and web UIs. AI is increasingly a good choice for this sort of code. But that's not all of us. And it's not all of the software being written.
Stakeholder needs: What people want to get done with the product
Management needs: How to manage the spending of resources (time, money,…) to create the product
Engineering needs: What is the product
You have to balance the three. Sometimes it’s simple and easy to get right. Sometimes it’s complex enough, you’re never truly sure until the product is out in the wild.
Software is malleable and we can easily do iterations, which is not possible with hardware. But today, we have a skew towards engineering, where the whole focus is to create a solution, whatever that is. No understanding of the problem, no proper allocation of resources, just do something. Even if it is plastering over the crack for the eleventh time.
Being able to step back and say "this was a failure and we need to discard the day's work and start over" is still hard with LLMs.
But with the agent, you know that the change will be relatively quick and easy, so the bar to tell it to shift approaches is much, much lower.
What I found myself doing is operating in two modes:
1. For projects that require my attention, I plan and instruct the LLM; when needed, I draft some code and ask the agent to make it better or finish the mundane part (I write code and leave gaps with comments asking the agent to finish).
2. Full auto mode, where I use spec-driven development and TDD. I only ask for changes based on an existing PRD, which the agent also has to update. Here I do not look at the code at all.
Seems to be working just fine.
TLDR: Keeping your codebase human readable and reason-about-able is not just helping humans to stay relevant. It will save costs for LLMs to maintain it.
Now we are getting to the point where we are speed-running the deskilling of engineers into comprehension debt, and they themselves are rapidly losing confidence in reviewing code they did not write.
I think this blog post [0] is the best example of what could go entirely wrong and even worse when you do not know the technology.
If you cannot explain a change even when "the CI is green" or "all tests passing", I will immediately reject it.
Maybe great for vibe coding prototypes, but it all changes when that code is deployed onto mission critical systems. Just ask Amazon with Kiro. [1]
[0] https://sketch.dev/blog/our-first-outage-from-llm-written-co...
[1] https://www.reuters.com/business/retail-consumer/amazons-clo...
How do you verify that it works?
json='{ "left":2, "right":2 }';
result="$(
perl -e '($_)=<>; / "left":(\d+), "right":(\d+)/; print $1 + $2, "\n";' <<< "$json";
)";
printf '%s\n' "$result";
Yet, it is literally the same as:
printf '%s\n' "$(( 2 + 2 ))";
However, if AI provides a solution, as the person using AI, one should conduct research before making a decision. This is not in conflict with or hindered by the use of the ideas provided by AI.
The obvious counterargument is "well, just ask the AI for those answers," but the AI lacks the context and experience that you have. Sometimes, genuinely, the user really is just "holding it wrong," but none of the current AI models would ever admit that, and you'd spend hours trying to solve an unsolvable problem.
For example, I use a vibecoded internal tool written in Go. I don’t even know how to write Go. Haven’t read a single line of the code. I just wanted to move from bash scripts to using cloud SDKs for performance reasons.
But the internal tool is a convenience tool, and you can do everything it does using alternative methods. So if it breaks, there is no real negative impact besides the personal convenience of anyone using it. There’s some documentation on how to do everything manually if needed.
Here’s another example: you’re making a static website. No JavaScript, no interactivity. Truly, what could go wrong? And while I do understand HTML a lot better than Go, it wouldn’t really matter if I didn’t.
What is this supposed to mean? How is a “cloud sdk” more performant than a shell script?
Linking a huge file that consumes clients’ bandwidth for no reason? Embedding PII in the HTML source? And if you're setting up your own server, misconfiguring it?…
Agents respond really well to feedback! They have no ego and they’ll happily improve code if told where and how. But you need to provide the tools that provide that feedback without your involvement - otherwise you can’t scale.
All the linting and autoformatting you can put in is a good start. Next, create custom scripts that check for every single dumb AI-ism you can think of, tell the agent about them, tell it to use them to check its work, and put them in hooks so the harness refuses to let the agent stop until all your linters show no errors.
Then, keep iterating basically forever. Any dumb AI-ism you see, make a linter for it, give it to the agent, and enforce it using the harness.
I’ve spent months doing this. When I review a PR - which was built by the agent with TDD so it definitely works - I’m no longer asking if it did dumb stuff or confirming it conformed to the architecture or duplicated code or missed opportunities for reuse. That’s all linted for. I don’t worry about duplication or outdated docstrings/comments because the self review caught all that. I mostly read it to look for opportunities to make the feature even better & more useful.
If this makes no sense or you disagree it’s possible, my contact details are on my profile and I’ll be happy to give a demo.
Incidentally I also don't understand the drive to scale up. Show me a successful tech company and I'll show you a company that won, not by delivering code the fastest, but by delivering the right product with the right features at the right time.
Hell, Anthropic itself is the perfect example: they're doing well because unlike their competitors they realized the real revenues come from enterprise not consumer. They're winning by identifying the right market and giving them the right product.
Then have a look at https://github.com/cadamsdotcom/CodeLeash/blob/main/scripts/... (which was test-driven alongside https://github.com/cadamsdotcom/CodeLeash/blob/main/tests/un...)
The script can exit 2 to block the agent, and whatever it prints to stderr is shown to the agent. That’s a pretty darn flexible way to enforce whatever you like.
Despite this being in the codebase I still have no idea what python’s ast stuff is or does - I just let the agent rip, ensured it did TDD and reviewed it all to make sure the tests & code looked reasonable. I didn’t write this code and don’t want to. But I’ve watched it catch hundreds of dumb AI-isms, and watched the agent go “okay” and fix them ;) it’s been paying for itself over and over for months :)
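For readers who haven't seen one of these checks, here is a rough sketch of the shape, not the CodeLeash script itself. It uses Python's standard `ast` module to flag one stand-in AI-ism (bare `except:` clauses, chosen purely for illustration) and follows the exit-2-plus-stderr convention described above.

```python
import ast
import sys


def find_bare_excepts(source: str, filename: str = "<string>") -> list:
    """Return one message per bare `except:` clause found in the source."""
    tree = ast.parse(source, filename=filename)
    return [
        f"{filename}:{node.lineno}: bare 'except:' hides every error; "
        "catch a specific exception instead"
        for node in ast.walk(tree)
        if isinstance(node, ast.ExceptHandler) and node.type is None
    ]


def main(paths) -> int:
    problems = []
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            problems += find_bare_excepts(fh.read(), path)
    for line in problems:
        print(line, file=sys.stderr)  # stderr is the text the agent gets shown
    return 2 if problems else 0      # 2 = block the agent, 0 = let it continue


# Quick demonstration on an in-memory snippet instead of real files.
sample = "try:\n    risky()\nexcept:\n    pass\n"
messages = find_bare_excepts(sample, "demo.py")
print(messages)
```

As an actual hook, the file would end with `sys.exit(main(sys.argv[1:]))` so the harness sees the exit code; how the hook gets registered depends on the harness and is not shown here.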
"TDD" isn't some magic trick. The tests codify the expected behavior. But if you don't review them for correctness, if you let the LLM build them blindly, then you have no idea what those tests assert and can make no claims about whether the code then does what you expect.
That's fine. That's your choice.
But you have to acknowledge you've chosen to accept that you personally cannot vouch for the quality or correctness of that code.
I fully expect this to be the direction the industry goes, where increasingly complex systems exist that no human actually understands or can reason about.
I think it's bad for the industry. Very bad.
But I'm not making those decisions, so... it is what it is, I guess.
I design everything with plan mode and review every line. Nothing happens to my codebase that I don’t decide should happen. With my way of working, tech debt doesn’t exist because I never have to create it.
You’ve made a bunch of assumptions you’re not conscious of. And now you’re blaming me for that.
Open your mind, you never know what you might (un)learn.
The thesis of the post is (paraphrasing): "if an AI wrote it, and I don't immediately grok it or if the code quality is low, I throw it away, even if on the surface it seems to work, because simply 'working' isn't enough to say a piece of code is acceptable."
I'd add as a corollary "and therefore I would never want to be accountable for that code."
If you're reviewing every line then it sounds like you have no argument with the writer and I don't understand what your point is.
Your very first paragraph says:
> If you reject AI code that works then your mindset is still too hands on. Put another way - you still have some loops to work on taking yourself out of.
But if you do indeed "review every line" then you seem pretty damn in the loop yourself and I don't understand what you think taking oneself out of the loop is.