I've dipped into agentic work now and again, but never been very impressed with the output (well, that there is any functioning output is insanely impressive, but it isn't code I want to be on the hook for).
I hear a lot of people saying the same, but also a bunch of people I respect saying they barely write code anymore. It feels a little tricky to square the two sometimes.
Anyway, really looking forward to trying some of these patterns as the book develops to see if that makes a difference. Understanding how other people really use these tools is a big gap for me.
I think this is the main point where many people’s work differs. Most of my work I know roughly what needs changing and how things are structured but I jump between codebases often enough that I can’t always remember the exact classes/functions where changes are needed. But I can vaguely gesture at those specific changes that need to be made and have the AI find the places that need changing and then I can review the result.
I rarely get the luxury of working in a single codebase for a long enough period of time to get so familiar with it that I can jump to particular functions without much thought. That means AI is usually a better starting point than me fumbling around trying to find what I think exists but I don’t know where it is.
And I have the AI deal with "knowing how to do it" as well. Often it's slower to have it do enough research to know how to do it, but my time is more expensive than Claude's time, and so as long as I'm not sitting around waiting it's a net win.
I had Gemini running a benchmark: everything ran smoothly for an hour. But on verification, it turned out it had hallucinated the model used for judging, invalidating the whole run.
Another task used Opus and I manually specified the model to use. It still used the wrong model.
This type of hallucination has happened to me at least 4-5 times in the past fortnight using opus 4.6 and gemini-3.1-pro. GLM-5 does not seem to hallucinate so much.
So if you are not actively monitoring your agent and making the corrections, you need something else that is.
Also, instead of just prompting, I have the AI first write a quick summary of exactly what it will do: a plan including class names, branch names, file locations, specific tests, etc. Reviewing that before I hit go is helpful, since the outline is smaller and quicker to correct than the code.
That takes more wall clock time per agent, but gets better results, so fewer redo steps.
I'm thinking about how to solve the problem and how to express it in the programming language such that it is easy to maintain. Getting someone/something else to do that doesn't help me.
But different strokes for different folks, I suppose.
I’m not sure it’s really true in practice yet, but that would certainly be the claim.
Because, after they're done/have finished executing, I guess you still have to "check" their output, integrate their results into the bigger project they're (supposedly) part of etc, and for me the context-switching required to do all that is mentally taxing. But maybe this only happens because my brain is not young enough, that's why I'm asking.
I think getting agents to do larger tasks was always very hit or miss, up until about the end of last year.
In the past couple of months I have found them to have gotten a lot better (and I'm not the only one).
My experience with what coding assistants are good for shifted from:
smart autocomplete -> targeted changes/additions -> full engineering
To answer your question, I've tried both Claude Code and Antigravity in the last 2 weeks, and I'm still finding that they struggle. AG with Gemini regularly gets stuck on simple issues and loops until I run out of requests, and Claude still regularly goes off on wild tangents without actually solving the problem.
I think it can (and is) shifting very rapidly. Everyone is different, and I’m sure models are better at different types of work (or styles of working), but it doesn’t take much to make it too frustrating to use. Which also means it doesn’t take much to make it super useful.
Opus 4.6 has been out for less than a month. If it were a big shift, surely we'd see a massive difference over 4.5, which was November. I think this proves the point: you're not seeing seismic shifts every 3 months, and you're not even clear about which model was the fix.
> I think it can (and is) shifting very rapidly.
Shifting, maybe. But shuffling deck chairs every 3 months.
Especially good for navigating code you're unfamiliar with. If you already know the code well, you'll usually find it faster to debug and code by yourself.
Opus 4.6 with claude code vscode extension
No. The parent comment said I needed a new model, which I've tried. Being told "just try something else as well" kind of proves the point.
Perfect example. You mean the C compiler that literally failed to compile a hello world [0] (which was given in its README)?
> What do you consider simple issues?
Hallucinating APIs for well documented libraries/interfaces, ignoring explicit instructions for how to do things, and making very simple logic errors in 30-100 line scripts.
As an example, I asked Claude code to help me with a Roblox game last weekend, and specifically asked it to "create a shop GUI for <X> which scales with the UI, and opens when you press E next to the character". It proceeded to create a GUI with absolute sizings, get stuck on an API hallucination for handling input, and also, when I got it unstuck, it didn't actually work.
[0] https://github.com/anthropics/claudes-c-compiler/issues/1
But the most important thing is that they were reverse engineering gcc by using it as an oracle. And it had gcc and thousands of other C compilers in its training set.
So if you are a large corporation looking to copy GPL code so that you can use it without worrying about the license, and the project you want to copy is a text transformer with a rigorously defined set of inputs and outputs, have at it.
> smart autocomplete -> targeted changes/additions -> full engineering
Define "full engineering". Because if you say "full engineering" I would expect the agent to get some expected product output details as input and produce all by itself the right implementation for the context (i.e. company) it lives in.
Pretty recently (a couple weeks ago). I give agentic workflows a go every couple of weeks or so.
I should say, I don't find them abysmal, but I tend to work in codebases whose structure and patterns I understand really well. The use cases I've tried so far do sort of work, just not (yet, at least) faster than I'm able to write the code myself.
In my experience, this heavily depends on the task, and there's a massive chasm between tasks where it's a good and bad fit. I can definitely imagine people working only on one side of this chasm and being perplexed by the other side.
1) Having review loops between agents (spawning separate "reviewer" agents) and clear tests/eval criteria improved results quite a bit for me.
2) Reviewing manually and giving instructions for improvements is necessary to have code I can own.
I’ve yet to see these things do well on anything but trivial boilerplate.
The benefit is I can keep some things ticking over while I’m in meetings, to be honest.
In one of my experiments I had the simple goal of "making Linux binaries smaller to download using better compression" [1]. Compression is perfect for this. Easily validated (binary -> compress -> decompress -> binary) so each iteration should make a dent otherwise the attempt is thrown out.
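That round-trip gate is easy to make concrete. A minimal sketch in Python, using zlib as a stand-in for whatever compressor the loop is actually iterating on (the function and variable names are made up for illustration):

```python
# Minimal sketch of the validation loop: an attempt only counts if
# compress -> decompress reproduces the original bytes exactly, and it
# only "wins" if the compressed size actually shrank.
import zlib

def validate_attempt(original: bytes, level: int, best_size: int) -> tuple[bool, int]:
    """Return (accepted, best_size). Reject attempts that fail the round trip."""
    compressed = zlib.compress(original, level)
    if zlib.decompress(compressed) != original:
        return (False, best_size)      # broken attempt: throw it out
    if len(compressed) >= best_size:
        return (False, best_size)      # no improvement: throw it out
    return (True, len(compressed))

# Each loop iteration must beat the previous best or be discarded.
data = b"ELF binary bytes..." * 1000
best = len(data)
for level in (1, 6, 9):
    ok, best = validate_attempt(data, level, best)
```

Each iteration either beats the current best size with a verified round trip or is discarded, which is exactly what keeps the loop honest.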
Lessons I learned from my attempts:
- Do not micro-manage. The AI is probably good at coming up with ideas and doesn't need much input from you
- The test harness is everything; if you don't have a way of validating the work, the loop will go astray
- Let the iterations experiment. Let the AI explore ideas and break things in its experiments. An iteration might take longer, but those experiments are valuable for the next one
- Keep some .md files as scratch pad in between sessions so each iteration in the loop can learn from previous experiments and attempts
This is the most important piece to using AI coding agents. They are truly magical machines that can make easy work of a large number of development, general purpose computing, and data collection tasks, but without deterministic and executable checks and tests, you can't guarantee anything from one iteration of the loop to the next.
Good news - agents are good at open-endedly adding new tests and finding bugs. Do that. Also do unit tests and Playwright. Testing everything via web driving seemed insane pre-agents, but now it's more than doable.
The tricky part in our case is that "behaves correctly" has two layers - functional (did it navigate correctly?) and behavioral (does it look human to detection systems?). Agents are fine with the first layer but have no intuition for the second. Injecting behavioral validation into the loop was the thing that actually made it useful.
The .md scratch pad between sessions is underrated. We ended up formalizing it into a short decisions log - not a summary of what happened, just the non-obvious choices and why. The difference between "we tried X" and "we tried X, it failed because Y, so we use Z instead" is huge for the next session.
the interesting engineering problem is that the two feedback loops run on different timescales - functional feedback is immediate (did the click work?) but behavioral feedback is lagged and probabilistic (the session might get flagged 10 requests from now based on something that happened 5 requests ago). teaching an agent to reason about that second loop is the unsolved part.
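One hypothetical way to make that lagged loop legible to an agent is to keep a rolling window of recent actions and, when a flag finally arrives, spread fractional blame over the window with recency weighting. This is a sketch of the idea, not anything from the original comment; all names are invented:

```python
# Hypothetical sketch of handling lagged, probabilistic feedback: keep a
# rolling window of recent actions, and when a detection flag arrives,
# spread fractional "blame" across the window (recent actions weighted
# higher), since the true cause may be several requests in the past.
from collections import deque

class LaggedBlame:
    def __init__(self, window: int = 10):
        self.recent: deque[str] = deque(maxlen=window)  # most recent last
        self.blame: dict[str, float] = {}

    def record(self, action: str) -> None:
        self.recent.append(action)

    def flag(self) -> None:
        """A detection event arrived now; attribute it to recent history."""
        n = len(self.recent)
        total = n * (n + 1) / 2
        for i, action in enumerate(self.recent, start=1):
            # linear recency weighting: the latest action gets weight n/total
            self.blame[action] = self.blame.get(action, 0.0) + i / total

tracker = LaggedBlame(window=5)
for a in ["goto", "click", "scroll", "type", "submit"]:
    tracker.record(a)
tracker.flag()
```

Accumulated blame scores can then feed back into which behaviours the agent avoids, even though no single request ever gets a clean pass/fail signal.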
- Through the last two decades of the 20th century, Moore’s Law held and ensured that more transistors could be packed into next year’s chips that could run at faster and faster clock speeds. Software floated on a rising tide of hardware performance so writing fast code wasn’t always worth the effort.
- Power consumption doesn’t vary with transistor density but varies with the cube of clock frequency, so by the early 2000s Intel hit a wall and couldn’t push the clock above ~4GHz with normal heat dissipation methods. Multi-core processors were the only way to keep the performance increasing year after year.
- Up to this point the CPU could squeeze out performance increases by parallelizing sequential code through clever scheduling tricks (and compilers could provide an assist by unrolling loops) but with multiple cores software developers could no longer pretend that concurrent programming was only something that academics and HPC clusters cared about.
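The frequency argument in the second bullet is the standard first-order CMOS dynamic-power model, sketched here from general background rather than from the original comment:

```latex
P_{\mathrm{dyn}} \approx \alpha C V^{2} f
% Supply voltage must rise roughly in proportion to frequency to keep
% transistors switching fast enough, so V \propto f, giving:
P_{\mathrm{dyn}} \propto f^{3}
```

which is why pushing clocks much past ~4GHz ran into a heat-dissipation wall while adding more cores at fixed frequency did not.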
CS curricula are mostly still stuck in the early 2000s, or at least it feels that way. We teach big-O and use it to show that mergesort or quicksort will beat the pants off of bubble sort, but topics like Amdahl’s Law are buried in an upper-level elective when in fact it is much more directly relevant to the performance of real code, on real present-day workloads, than a typical big-O analysis.
In any case, I used all this as justification for teaching bitonic sort to 2nd and 3rd year undergrads.
My point here is that Simon’s assertion that “code is cheap” feels a lot like the kind of paradigm shift that comes from realizing that in a world with easily accessible massively parallel compute hardware, the things that matter for writing performant software have completely shifted: minimizing branching and data dependencies produces code that looks profoundly different than what most developers are used to. e.g. running 5 linear passes over a column might actually be faster than a single merged pass if those 5 passes touch different memory and the merged pass has to wait to shuffle all that data in and out of the cache because it doesn’t fit.
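The five-passes point is about shape, not syntax. A toy Python version (illustrative only: which variant is actually faster depends on data size, cache behaviour, and language; the point is that the two shapes compute the same thing, so the choice is purely a performance one):

```python
# The same column aggregated by separate linear passes versus one merged
# pass. Both produce identical results; on real hardware the winner
# depends on whether the intermediate state fits in cache.
col = list(range(1, 10_001))

# Independent passes (each touches the column once):
total   = sum(col)
minimum = min(col)
maximum = max(col)
squares = sum(x * x for x in col)

# One merged pass computing all four at once:
m_total = m_squares = 0
m_min = m_max = col[0]
for x in col:
    m_total += x
    m_squares += x * x
    if x < m_min: m_min = x
    if x > m_max: m_max = x

assert (total, squares, minimum, maximum) == (m_total, m_squares, m_min, m_max)
```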
What all this means for the software development process I can’t say, but the payoff will be tremendous (10-100x, just like with properly parallelized code) for those who can see the new paradigm first and exploit it.
Like an engineer overseeing the construction of a bridge, the job is not to lay bricks. It is to ensure the structure does not collapse.
The marginal cost of code is collapsing. That single fact changes everything.
Quite a heavy-lifting word here. You understand why people flagged that post, right? It's painfully non-human. I'm all for utilizing LLMs, but I highly suggest you read Simon's posts. He's obviously a heavy AI user, but even his blog posts aren't that inorganic, and that's why he became the new HN blog babe.
[0]: I personally believe Simon writes with his own voice, but who knows?
There's no actual way to determine if any words are from a silicon token generator or meat-based generator. It's not AI, it's human! Emdash. You're absolutely right!
system failure.
I would not equate software engineering to "proper" engineering insofar as being uttered in the same sentence as mechanical, chemical, or electrical engineering.
The cost of code is collapsing because web development is not broadly rigorous, robust software was never a priority, and everyone knows it. The people complaining that AI isn't good enough yet don't grasp that neither are many who are in the profession currently.
I think the externalities are being ignored. Taking the time and money to train engineers is expensive; having all your users' data stolen is a slap on the wrist.
So replacing those bad workers with AI is fine. Unless you remove the incentives to be fast instead of good, then yeah, AI can be good enough for some cases.
The claim here is profound: comprehension of the codebase at the function level is no longer necessary
It's not profound. It's not profound when I read the exact same awed blog post about how "agentic" is the future and you don't even need to know code anymore. It wasn't profound the first time, and it's even dumber that people keep repeating it - maybe they take all the time they saved not writing, and use it to not read.
Engineering is the practical application of science and mathematics to solve problems. It sounds like you're maybe describing construction management instead. I'm not denying that there's value here, but what you're espousing seems divorced from reality. Good luck vibecoding a nontrivial actuarial model, then having it to pass the laundry list of reviews and having large firms actually pick it up.
https://www.slater.dev/2025/09/its-time-to-license-software-...
As my projects were growing in complexity and scope, I found myself worrying that we were building things that would subtly break other parts of the application. Because of the limited context windows, it was clear that after a certain size, Claude kind of stops understanding how the work you're doing interacts with the rest of the system. Tests help protect against that.
Red/green TDD specifically ensures that the current work is quite focused on the thing that you're actually trying to accomplish, in that you can observe a concrete change in behaviour as a result of the change, with the added benefit of growing the test suite over time.
It's also easier than ever to create comprehensive integration test suites - my most valuable tests are tests that test entire user facing workflows with only UI elements, using a real backend.
I’ve always been partial to integration tests too. Hand coding made integration tests feel bad; you’re almost doubling the code output in some cases - especially if you end up needing to mock a bunch of servers. Nowadays that’s cheap, which is super helpful.
The only problem is... they still take much longer to _run_ than unit tests, and they do tend to be more flaky (although Claude is helpful in fixing flaky tests too). I'm grateful for the extra safety, but it makes deployments that much slower. I've not really found a solution to that part beyond parallelising.
agents role (Orchestrator, QA etc.), agents communication, thinking patterns, iteration patterns, feature folders, time-aware changelog tracking, prompt enforcing, real time steering.
We might really need a public Wiki for that (C2 [1] style)
[1]: https://wiki.c2.com/
"deeply understand this codebase, clearly noting async/sync nature, entry points and external integration. Once understood prepare for follow up questions from me in a rapid fire pattern, your goal is to keep responses concise and always cite code snippets to ensure responses are factual and not hallucinated. With every response ask me if this particular piece of knowledge should be persisted into codebase.md"
Both the concise and structured nature of the responses (code snippets) help me gain knowledge of the entire codebase, as I progressively ask more complex questions about it.
Feels like a lot of words to say what amounts to: make the agent do the steps we know work well for building software.
- tell the agent to write a plan, review the plan, tell the agent to implement the plan
- allow the agent to “self discover” the test harness (eg. “Validate this c compiler against gcc”)
- queue a bunch of tasks with // todo … and yolo “fix all the todo tasks”
- validate against a known output ("translate this to Rust and ensure it emits byte-for-byte identical output as you go")
- pick a suitable language for the task (“go is best for this task because I tried several languages and it did the best for this domain in go”)
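The oracle pattern in the list above reduces to: run the trusted implementation and the candidate on the same inputs and demand byte-for-byte identical output. A toy Python sketch, with plain functions standing in for the real programs (in the C-compiler case the oracle would be gcc invoked via subprocess):

```python
# Oracle-style validation: the reference is trusted, the candidate is the
# port under test, and the check is byte-for-byte equality on many inputs.

def reference(data: bytes) -> bytes:   # trusted implementation
    return bytes(reversed(data))

def candidate(data: bytes) -> bytes:   # port under test
    return data[::-1]

def oracle_check(inputs: list[bytes]) -> bool:
    return all(reference(i) == candidate(i) for i in inputs)

cases = [b"", b"a", b"hello", bytes(range(256))]
ok = oracle_check(cases)
```

The agent can iterate freely, because any divergence from the oracle is caught mechanically rather than by human review.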
Other things that I feel are useful:
- Very strict typing/static analysis
- Denying tool usage with a hook telling the agent why+what they should do (instead of simple denial, or dangerously accepting everything)
- Using different models for code review
I am still not sold on agentic coding. We’ll probably get there within the next couple of years.
The thing I keep coming back to is that it's all code. Almost all white collar professions have at least some key outputs in code. Whether you are a store manager filling out reports or a marketing firm or a teacher, there is so much code.
This means you can give Claude Code a branded document template and have it fill it out, include images, etc., and upload the result to our cloud hosting.
With this same guidance and taste, I'm doing close to the work of 5 people.
Setup: Claude code with full API access to all my digital spaces + tmux running 3-5 tasks in parallel
0: https://wiki.roshangeorge.dev/w/Blog/2025-12-01/Grounding_Yo...
Which is oddly close to how investment advice is given. If these techniques work so well, why give them up for free?
Colleagues don’t usually like to review AI generated code. If they use AI to review code, then that misses the point of doing the review. If they do the review manually (the old way) it becomes a bottleneck (we are faster at producing code now than we are at reviewing it)
I'm hoping to add more on that topic as I discover other patterns that are useful there.
A broken test doesn't make the agentic coding tool go "ooooh, I made a bad assumption" any more than a type error or linter does.
All a broken test does is prompt me to prompt back "fix tests".
I have no clue which one broke or why or what was missed, and it doesn't matter. Actual regressions are different and not dependent on these tests, and I follow along from type errors and LLM observability.
Take a guitar, for example. You don't industrialize the manufacture of guitars by speeding up the same practices that artisans used to build them. You don't create machines that resemble individual artisans in their previous roles (like everyone seems to be trying to do with AI and software). You become Leo Fender, and you design a new kind of guitar that is made to be manufactured at another order of magnitude of scale. You need to be Leo Fender, though (not a talented guitarist, but definitely a technical master).
To me, it sounds too early to describe patterns, since we haven't met the Ford/Fender/etc equivalent of this yet. I do appreciate the attempt though.
Dismissing everything AI as slop strikes me as an attitude that is not going to age well. You’ll miss the boat when it does come (and I believe it already has).
Is the boat:
1) unmissable since the tools get better all the time and are intelligent
or
2) nearly-impossible to board since the tools will replace most of the developers
or
3) a boat of small productivity improvements?
?
Eventually I do think it will be 2.
I think you’ve got to make hay while the sun shines. Nobody knows how this is all going to play out, I just want to make sure I’m at the forefront of it.
And the progress is slowing down in such a way that knowledge learned today will not be outdated anymore?
Should investors be worried, since AGI is not coming anymore?
We didn't ask if type-based autocomplete was "intelligent" before we started using that.
Treat coding agents as tools and figure out what they can and cannot do and how best to use them.
I think the relative comfort we've enjoyed as software engineers is going to disappear eventually. I just want to be the last to go.
My whole career, I've remained valuable by staying at the forefront of what is possible and connecting that to users' needs. Nothing has changed about my approach from that perspective.
I'm not an investor so I have no idea how they should think.