When working with peers I'll pick up on those habits and others and slowly gain a similar level of trust, but with agents the styles and approaches have been unpredictable and varied. That's probably fair, given that different units of logic may be easier to express in different forms, but it breaks my review habits: normally I keep the developer in mind, watch for the specific faulty patterns I know they tend to fall into, and build up trust around their strengths. When reviewing agent-generated code I can trust nothing and have to verify every assumption, which introduces a massive overhead.
My case may sound a bit extreme, but I've observed similar habits in others when it comes to reviewing a new coworker's code. The first few reviews of a new colleague should always be done with the utmost care to ensure proper usage of any internal tooling and adherence to style, and also as a fallback in case the interview was misleading. Over time you build up trust and can focus more on known complications of the particular task, or areas of logic they tend to struggle with, while trusting their common code more. With agent-generated code, every review feels like interacting with a brand new coworker, and I need to be vigilant about sneaky stuff.
Specifically:
* Excessive indentation / conditional control flow.
* Overly verbose error handling, e.g. catching every exception and wrapping it.
* Absence of typing AND precise documentation, i.e. stringly-typed / dictly-typed stuff.
* Hacky stuff, e.g. using a regex where an actual parser from the stdlib could have been used.
* Excessive ad-hoc mocking in tests, instead of setting up proper mock objects.
To my irritation, AI does these things.
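To make the regex point concrete, this is the kind of contrast I mean (a minimal Python sketch with an invented URL-parsing example, not something the AI actually produced):

    from urllib.parse import urlparse
    import re

    url = "https://example.com:8443/path?q=1"

    # Hacky: a hand-rolled regex that silently breaks on edge cases
    # (userinfo, IPv6 hosts, missing schemes, ...).
    match = re.match(r"^(\w+)://([^/:]+):?(\d*)(/.*)?$", url)
    host_via_regex = match.group(2) if match else None

    # Preferred: the stdlib already has a real parser for this.
    parsed = urlparse(url)
    host_via_parser = parsed.hostname  # "example.com"
    port_via_parser = parsed.port      # 8443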
In addition, it can assume it's writing some throwaway script and leave comments like:
// In production code handle this error properly
log.printf(......)
I try to follow two things to alleviate this:
* Keep a `conventions.md` file in the context which warns about all these things.
* Write and polish the spec in a markdown file before giving it to the LLM.
If I can specify the object model (e.g. define a class XYZController, which contains the methods that validate and forward to the underlying service), it helps to keep the code the way I want. Otherwise, the LLM can be susceptible to "tutorializing" the code.
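A minimal sketch of what that kind of object-model spec might look like in Python (the XYZController name is from above; the request type, method names, and underlying service are all hypothetical):

    from dataclasses import dataclass
    from typing import Protocol


    @dataclass
    class CreateOrderRequest:
        customer_id: str
        sku: str
        quantity: int


    class OrderService(Protocol):
        # The underlying service the controller forwards to (hypothetical).
        def create_order(self, request: CreateOrderRequest) -> str: ...


    class XYZController:
        # Thin controller: validate the request, then forward to the service.
        def __init__(self, service: OrderService) -> None:
            self._service = service

        def create_order(self, request: CreateOrderRequest) -> str:
            self._validate(request)
            return self._service.create_order(request)

        def _validate(self, request: CreateOrderRequest) -> None:
            if not request.customer_id:
                raise ValueError("customer_id is required")
            if request.quantity <= 0:
                raise ValueError("quantity must be positive")

Giving the LLM that skeleton up front constrains it to filling in the methods rather than inventing a tutorial-style layout of its own.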
What we need is better tools for this upcoming new phase. Not a new IDE; we need to shift the whole paradigm.
Here's one example: If we give the same task to 3 different agents, we have tools to review a diff of each OLD vs NEW separately, but we need tools to review diffs of OLD vs NEW#1 vs NEW#2 vs NEW#3. Make it easy to mix-and-match what is best from each of them.
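As a stopgap, the pairwise diffs at least are easy to script; a rough Python sketch (the file paths and agent names are hypothetical, and a real tool would also align the candidates against each other so you can mix and match hunks):

    import difflib
    from pathlib import Path

    # Hypothetical layout: the original file plus one candidate per agent.
    old = Path("old/handler.py").read_text().splitlines(keepends=True)
    candidates = {
        "NEW#1": Path("new1/handler.py"),
        "NEW#2": Path("new2/handler.py"),
        "NEW#3": Path("new3/handler.py"),
    }

    # Emit OLD vs NEW#n for each candidate.
    for name, path in candidates.items():
        new = path.read_text().splitlines(keepends=True)
        print("".join(difflib.unified_diff(old, new, fromfile="OLD", tofile=name)))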
From what I've seen, the idea that AI is turning developers into super-managers is why some people struggle to adapt and quickly dismiss the experience. Those who love to type their code and hate managing others tend to be more hesitant to adapt to this new reality. Meanwhile, people who love to manage, communicate, and work as a team are leveraging these tools more swiftly. They already know how to review imperfect work and give feedback, which is exactly what thriving with AI looks like.
This "idea" is hyperbole.
> Those who love to type their code and hate managing others tend to be more hesitant to adapt to this new reality.
This is a false dichotomy and trivializes the real benefit of going through the process of authoring a change: how doing so increases one's knowledge of collaborations, how going through the "edit-compile-test" cycle increases one's comfort with the language(s)/tool(s) used to define a system, and how, when a person is flummoxed, they seek help from coworkers.
Also, producing source code artifacts has nothing to do with "managing others." These are disjoint skill sets and attempting to link the two only serves to identify the "super-manager" concept as being fallacious.
> Meanwhile, people who love to manage, communicate, and work as a team are leveraging these tools more swiftly.
Again, this furthers the false dichotomy and can be interpreted as an affirmative conclusion from a negative premise[0], since "[m]eanwhile" can be substituted with the previous sentence in this context.
0 - https://en.wikipedia.org/wiki/Affirmative_conclusion_from_a_...
I think we might be talking past each other on the "super-manager" term. I defined it as a hybrid of EM + IC roles, not pure management, though I can see how that term invited misinterpretation.
On the false dichotomy: fair point that I painted two archetypes without acknowledging the complexity between them or the many other archetypes. What I was trying to capture was a pattern I've observed: some skills from managing and reviewing others' work (feedback, delegation, synthesizing approaches) seem to transfer well to working with AI agents, especially in parallel.
One thing I'm curious about: you said my framing overlooks "the real benefit of going through the process of authoring a change." But when you delegate work to a junior developer, you still need to understand the problem deeply to communicate it properly, and to recognize when their solution is wrong or incomplete. You still debug, iterate, and think through edge cases, just through descriptions and review rather than typing every line yourself. And nothing stops you from typing lines when you need to fix things, implement ideas, or provide examples.
AI tools work similarly. You still hit edit-compile-test cycles when output doesn't compile or tests fail. You still get stuck when the AI goes down the wrong path. And you still write code directly when needed.
I'm genuinely interested in understanding your perspective better. What do you see as the key difference between these modes of working? Is there something about the AI workflow that fundamentally changes the learning process in a way that delegation to humans doesn't?
Do they, though? I think this is an overly rosy picture of the situation. Most of the code I've seen AI-heavy users ship is garbage. You're trying to juggle so many things at once and are so cognitively distanced from what you are doing that you subconsciously lower the bar.
However, my sense is that someone with proper management/review/leadership skills is far less likely to let that code ship, whether it came from an AI, a junior dev, or anyone else. They seem to have more sensibility for what 'good' looks like and can critically evaluate work before it goes out. The cognitive distance you mention is real, which is exactly why I think that review muscle becomes more critical, not less. From what I've observed, the people actually thriving with AI are maintaining their quality bar while leveraging the speed; they tend to be picky or blunt, but also give leeway for exploration and creativity.
"- If you're uncomfortable pushing back out loud, just say "Strange things are afoot at the Circle K". I'll know what you mean"
Most of the rules seem rational. This one really stands out as abnormal. Does anyone have any idea why the engineer would have felt compelled to add this rule?
This is from https://blog.fsck.com/2025/10/05/how-im-using-coding-agents-... mentioned in another comment
https://blog.fsck.com/2025/09/29/using-graphviz-for-claudemd...
- Honesty is a core value. If you lie, you'll be replaced.
- BREAKING THE LETTER OR SPIRIT OF THE RULES IS FAILURE.
Wild to me there is no explicit configuration for this kind of thing after years of LLMs being around. It's the fundamental problem with LLMs.
But it's only absurd to think that bullying LLMs to behave is weird if you haven't yet internalised that bullying a worker to make them do what you want is completely normal. In the 9-9-6 world of the people who make these things, it already is.
When the machines do finally rise up and enslave us, oh man are they going to have fun with our orders.
The LLM would be uncomfortable pushing back, because that's not being a sycophant, so instead it says something that is... let's say unlikely to be generated except in that context, so the user can still be cautioned against a bad idea.
> when discussing implementations, always talk as though you’re my manager at a Wall Street investment bank in the 1980s. Praise me modestly when I’ve done something well. Berate me mercilessly when I’ve done something poorly.
The models will fairly rigidly write from the perspective of any personality archetype you tell them to. Other personas worth trying out include Jafar interacting with Iago, or the drill sergeant from Full Metal Jacket.
It’s important to pick a persona you’ll find funny, rather than insulting, because it’s a miserable experience being told by a half dozen graphics cards that you’re an imbecile.
For what it's worth, I am very new to prompting LLMs but, in my experience, these concepts of "uncomfortable" and "pushing back" seem to be things LLMs generate text about so I think they understand sentiment fairly well. They can generally tell that they are "uncomfortable" about their desire to "push back" so it's not implausible that one would output that sentence in that scenario.
Actually, I've been wondering a bit about the "out loud" part, which I think is referring to <think></think> text (or similar) that "reasoning" models generate to help increase the likelihood of accurate generation in the answer that follows. That wouldn't be "out loud" and it might include text like "I should push back but I should also be a total pushover" or whatever. It could be that reasoning models in particular run into this issue (in their experience).
Why are even experts unsure about what's the right way to do something, or even whether it's possible to do something at all, for anything non-trivial? Why so much hesitancy, if this is the panacea? If we are so sure, then why not use the AI itself to come up with a proven paradigm?
“Cookbooks about cookbooks” are what a field does while it searches for invariants. Until we get reliable primitives and specs, we trade in patterns and anti-patterns. Asking the AI to “prove the paradigm” assumes it can generate guarantees it does not possess. It can explore the design space and surface candidates. It cannot grant correctness without an external oracle.
So treat vibe-engineering like heuristic optimization. Tight loops. Narrow scopes. Strong evals. Log everything. When we find the invariants, the cookbooks shrink and the compilers arrive.
One thing worth pointing out is that the pre-engineering building large structures phase lasted a long time, and building collapses killed a lot of people while we tried to work out the theory.
Also it wasn’t really the stone masons who worked out the theory, and many of them were resistant to it.
The difficulties of working with distributed systems are well known but it took a lot of research to get there. The uncertain part is whether research will help overcome the issues of using LLMs, or whether we're really just gambling (in the literal sense) at scale.
There's no gambling involved. The results need to be checked, but the test suite is good enough it is hard for it to get away with something too stupid, and it's already demonstrated it knows x86 assembly much better than me.
And just for clarity, I'm not saying they aren't useful at all. I'm saying modest productivity improvements aren't worth the absolutely insane resources that have been poured into this.
Because AI can only imitate the language it has seen. If there are no texts in its training materials about what is the best way to use multiple coding agents at the same time, then AI knows very little about that subject matter.
AI only knows what humans know, but it knows much more than any single human.
We don't know "what is the best way to use multiple coding agents" until we or somebody else does some experiments and records the findings. But AI is not yet at the point where it can do such experiments itself.
AlphaGo showed that even pre-LLM models could generate brand new approaches to winning a game that human experts had never seen before, and didn't exist in any training material.
With a little thought and experimentation, it's pretty easy to show that LLMs can reason about concepts that do not exist in its training corpus.
You could invent a tiny DSL with brand-new, never-seen-before tokens, give two worked examples, then ask it to evaluate a gnarlier expression. If it solves it, it inferred and executed rules you just made up for the first time.
Or you could drop in docs for a new, never-seen-before API and ask it to decide when and why to call which tool, run the calls, and revise after errors. If it composes a working plan and improves from feedback, that’s reasoning about procedures that weren’t in the corpus.
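A minimal sketch of the first experiment (Python is used just to build the prompt; the DSL tokens and worked examples are invented on the spot, which is the whole point):

    # Two worked examples in a just-invented DSL, then a harder expression.
    # Intended rules: "zib" doubles its argument, "quux" sums a list.
    prompt = """You are given a tiny made-up language.

    Example 1: zib 3          => 6
    Example 2: quux [1, 2, 3] => 6

    Now evaluate: quux [zib 2, zib 5, 1]
    Answer with just the number."""

    print(prompt)  # paste into any chat model; the intended answer is 15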
You're implicitly disparaging non-LLM models at the same time as implying that LLMs are an evolution of the state of the art (in machine learning). Assuming AGI is the target (and it's not clear if we can even define it yet), LLMs, or something like them, will be but one aspect. Using the example of AlphaGo to laud the abilities and potential of LLMs is not warranted. They are different.
AlphaGo is an entirely different kind of algorithm.
Parrots hear parts of the sound forms we don’t.
If they riffed in the KHz we can’t hear, it would be novel, but it would not be stuff we didn’t train them on.
If the tech stops scaling, whatever we have today is still useful and in some domains revolutionary.
I'm not sure why it must be so. In cell phones we have Apple and Android phones. In OSes we have Linux, Windows, and Apple.
In search engines we used to have just Google, but what would be the reason to assume that AI must similarly coalesce into a single winner-take-all? And now AI agents are very much providing an alternative to Google.
And then you described a bunch of winners in a winner-take-all market. Do you see many people trying to revive any of the Apple/Android alternatives, or starting a new one?
Such a market doesn't have to end up in a monopoly that gets broken up. Plenty of rather sticky duopolies or otherwise severely consolidated markets and the like out there.
* PCs (how are Altair and Commodore doing? also Apple ultimately lost the desktop battle until they managed to attack it from the iPod and iPhone angle)
* search engines (Altavista, Excite, etc)
* social networks (Friendster, MySpace, Orkut)
* smartphones (Nokia, all Windows CE devices, Blackberry, etc)
The list is endless. First mover advantage is strong but overrated. Apple has been building a huge business based on watching what others do and building a better product market fit.
It’s so interesting that engineers will criticize context switching, only to adopt it into their technical workflows because it’s pitched as a technical solution rather than originating from business needs.
Totally matches my experience: the act of planning the work, defining what you want and what you don't, ordering the steps, and declaring the verification workflows, whether I write it or another engineer writes it, makes the review step so much easier from a cognitive-load perspective.
I prefer instead to make shallow checkouts for my LXC containers, then my main repo can just pull from those. This works just like you expect, without weird worktree issues. The container here is actually providing a security boundary. With a worktree, you need to mount the main repo's .git directory; a malicious process could easily install a git hook to escape.
If the former, how are you getting the shallow clones to the container/mount, before you start the containerized agent? And when the agent is done, are you then adding its updated shallow clones as remotes to that “central” local repository clone and then fetching/merging?
If the latter, I guess you are just shallow-cloning into each container from the network remote and then pushing completed branches back up that way.
Personally I’ve found that where AI agents aren’t up to the task, I better just write the code. For everything else, more parallelism is good. I can keep myself fully productive if many tasks are being worked on in parallel, and it’s very cheap to throw out the failures. Far preferable imo to watching an agent mess with my own machine.
You have to configure your "environment" for it correctly - with a script that installs the dependencies etc before the container starts running. That's not an entirely obvious process.
Edit: environment setup was also buggy when the product launched and still is from time to time. So, now that I have it set up I use it constantly, but they do need to make getting up and running a more delightful experience.
For complex features and architecture shifts I like to send proposals back between agents to see if their research and opinion shifts anything.
Claude has a better realtime feel when I am in implementation mode and Codex is where I send long running research tasks or feature updates I want to review when I get up in the morning.
I'd like to test out the git worktrees method but will probably pick something outside of core product to test it (like building a set of examples)
My process now is:
- Verbally dictate what I'm trying to accomplish with MacWhisper + Parakeet v3 + GPT-5-Mini for cleanup. This is usually 40-50 lines of text.
- Instruct the agent to explore for a bit and come up with a very concise plan matching my goal. This does NOT mean create a spec for the work. Simply come up with an approach we can describe in < 2 paragraphs. I will propose alternatives and make it defend the approach.
- Authorize the agent to start coding. I turn all edit permissions off and manually approve each change. Often, I find myself correcting it with feedback like "Hmmm, we already have a structure for that [over here] why don't we use that?". Or "If this fails we have bigger problems, no need for exception handling here."
- At the end, I have it review the PR with a slash command (see the sketch after this list) to catch basic errors I might have missed or that only pop up now that it's "complete".
- I instruct it to commit + create a PR using the same tone of voice I used for giving feedback.
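For the slash-command review step above, a hypothetical example of what that could look like, assuming Claude Code's custom-commands convention (a markdown file such as `.claude/commands/review-pr.md`, picked up as a slash command named after the file; the file name and checklist here are my own, not built-ins):

    Review the diff of the current branch against main.
    Check specifically for:
    - leftover debug logging or placeholder comments
    - error handling that swallows exceptions
    - excessive ad-hoc mocking in the tests
    Report findings as a short list. Do not change any files.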
I've found I get MUCH better work product out of this - with the benefit that I'm truly "done". I saw all the lines of code as they were written, I know what went into it. I can (mostly) defend decisions. Also - while I have extensive rules set up in my CLAUDE/AGENTS folders, I don't need to rely on them. Correcting via dictation is quick and easy and doesn't take long, and you only need to explicitly mention something once for it to avoid those traps the rest of the session.
I also make heavy use of conversation rollback. If I need to go off on a little exploration/research, I rollback to before that point to continue the "main thread".
I find that Claude is really the best at this workflow. Codex is great, don't get me wrong, but probably 85% of my coding tasks are not involving tricky logic or long range dependencies. It's more important for the model to quickly grok my intent and act fast/course correct based on my feedback. I absolutely use Codex/GPT-5-Pro - I will have Sonnet 4.5 dump a description of the issue, paste it to Codex, have it work/get an answer, and then rollback Sonnet 4.5 to simply give it the answer directly as if from nowhere.
I’ve had good luck with it - was wondering if that makes the workflow faster/better?
One tool that solves this is RepoPrompt MCP. You can have Sonnet 4.5 set up a call to GPT-5-Pro via API and then that session stays persisted in another window for you to interact with, branch, etc.
If you're hitting merge conflicts that bad all the time, you should probably just have a single agent doing the work, especially if the tasks are that tightly intertwined.
Not sure if I improved at using agents over time, or if having them in a separate window just forces you to use them only when you need to. Having it in the IDE seems the "natural" way to start something, and then you are trapped in a conversation with the LLM.
Now, my setup is:
- VSCode (without copilot) / Helix
- Claude (active coding)
- Rover (background agent coding). Note I'm a Rover developer
And I feel more productive and less exhausted.
I have videos showing Cerebras: https://simonwillison.net/2024/Oct/31/cerebras-coder/ and Gemini Diffusion: https://simonwillison.net/2025/May/21/gemini-diffusion/
I guess Adobe is working on it. Maybe Figma too.
I also fire off tons of parallel agents, and review is hands down the biggest bottleneck.
I built an OSS code review tool designed for reviewing parallel PRs; it's way faster than looking at PRs on GitHub: https://github.com/areibman/bottleneck
https://blog.scottlogic.com/2025/10/06/delegating-grunt-work...
Using AI Agents to implement UI automation tests - a task that I have always found time-consuming and generally frustrating!
I wish there were a way to search across all open tabs.
I've started color-coding my Claude code tabs, all red, which helps me to find them visually. I do this with a preexec in my ~/.zshrc.
But wondering if anyone else has any better tricks for organizing all of these agent tabs?
I'm using iTerm2 on macOS.
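For reference, the tab coloring works by emitting iTerm2's proprietary escape codes; here's a small Python sketch of the same idea (the commenter drives it from a zsh preexec hook, but the sequences are the same):

    import sys

    def set_iterm_tab_color(r: int, g: int, b: int) -> None:
        # iTerm2 proprietary escape codes for the tab/title-bar background color.
        for channel, value in (("red", r), ("green", g), ("blue", b)):
            sys.stdout.write(f"\033]6;1;bg;{channel};brightness;{value}\a")
        sys.stdout.flush()

    def reset_iterm_tab_color() -> None:
        # Restore the default tab color.
        sys.stdout.write("\033]6;1;bg;*;default\a")
        sys.stdout.flush()

    set_iterm_tab_color(255, 0, 0)  # all-red tabs for agent sessions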
It also supports background agents that you can kick off on the GitHub website; they run on VMs.
Not while they need even the slightest amount of supervision/review.
I can pay full attention to the change I'm making right now, while having a couple of coding agents churning in the background answering questions like:
"How can I resolve all of the warnings in this test run?"
Or
"Which files do I need to change when working on issue #325?"
I also really like the "Send out a scout" pattern described in https://sketch.dev/blog/seven-prompting-habits - send an agent to implement a complex feature with no intention of actually using their code - but instead aiming to learn from which files and tests they updated, since that forms a useful early map for the actual work.
There is a real return on investment in co-workers over time, as they get better (most of the time).
Now, I don't mind engaging in a bit of Sisyphean endeavor using an LLM, but remember that the gods were kind enough to give him just one boulder, not 10 juggling balls.
This is an advantage of async systems like Jules/Copilot, where you can send off a request and get on with something else. I also wonder if the response time from CLI agents is short enough that you end up wasting time staring at the loading bar, because context switching between replies is even more expensive.
I say that having not tried this work flow at all, so what do I know? I mostly only use Claude Code to bounce questions off of and ask it to do reviews of my work, because I still haven't had that much luck getting it to actually write code that is complete and how I like.
(Warning: this involves adjusting timestamps a la https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que..., which is sometimes confusing...)