They've actually hit upon something that several of us have evolved to naturally.
LLMs are like unreliable interns with boundless energy. They make silly mistakes, wander into annoying structural traps, and have to be unwound if left to their own devices. It's like the genie that almost pathologically misinterprets your wishes.
So, how do you solve that? Exactly how an experienced lead or software manager does: you have systems write it down before executing, explain things back to you, and ground all of their thinking in the code and documentation, rather than making assumptions after a superficial review of the code.
With early ChatGPT, this meant function-level thinking and clearly described jobs. With Cline, it meant .clinerules files that forced writing architecture.md files and vibe-code.log histories, demanding grounding in research and code reading.
Maybe nine months ago, another engineer said two things to me, less than a day apart:
- "I don't understand why your clinerules file is so large. You have the LLM jumping through so many hoops and doing so much extra work. It's crazy."
- The next morning: "It's basically like a lottery. I can't get the LLM to generate what I want reliably. I just have to settle for whatever it comes up with and then try again."
These systems have to deal with minimal context, ambiguous guidance, and extreme isolation. Operate with a little empathy for the energetic interns, and they'll uncork levels of output worth fighting for. We're Software Managers now. For some of us, that's working out great.
For those starting out with Claude Code, it gives a structured way to get things done, bypassing the time and energy needed to “hit upon something that several of us have evolved to naturally”.
Anyone who spends some time with these tools (and doesn't black out from smashing their head against their desk) is going to find substantial benefit in planning with clarity.
It was #6 in Boris's run-down: https://news.ycombinator.com/item?id=46470017
So, yes, I'm glad that people write things out and share. But I'd prefer that they not lead with "hey folks, I have news: we should *slice* our bread!"
#6 is about using plan mode whereas the author says "The built-in plan mode sucks".
The author's post is much more than just "planning with clarity".
Unfortunately, there are a lot of people trying to content-farm with LLMs; this means that whatever style they default to is automatically suspect of being a slice of "dead internet" rather than some new human discovery.
I won't rule out the possibility that even LLMs, let alone other AI, can help with new discoveries, but they are definitely better at writing persuasively than at being inventive. That means I'm forced to use "looks like LLM" as a proxy for both "content farm" and "propaganda which may work on me", even though some percentage of this output won't even be LLM-generated, and some percentage of what is may even be both useful and novel.
I think your sentence should have been "people who use AI do so mostly to rewrite or clean up content", but even then I'd question the statistical truth behind that claim.
Personally, seeing something written by AI tells me that the person who wrote it did so just for looks and not for substance. Claiming to be a great author requires both penmanship and communication skills, and delegating either of them to a large language model inherently makes you less than that.
However, when the point is just the contents of the paragraph(s) and nothing more, then I don't care who or what wrote it. An example is the result of research, because I certainly won't care about the prose or the effort put into writing the thesis, but about the results (is this about curing cancer now and forever? If yes, no one cares if it's written with AI).
With that being said, there's still the question of whether I get anywhere close to understanding the author behind the thoughts and opinions. I believe the way someone writes hints at the way they think and act. In that sense, using LLMs to rewrite something to make it sound more professional than how you would actually talk in the same context makes it hard for me to judge someone's character, professionalism, and mannerisms. It almost feels like they're trying to mask part of themselves. Perhaps they lack confidence in their ability to sound professional and convincing?
However, I do find the standard out-of-the-box style very grating. Call it faux-chummy LinkedIn corporate workslop style.
Why don't people give the LLM a steer on style? Either based on your personal style, or at least on a writer whose style you admire. That should be easier.
> Because they think this is good writing. You can’t correct what you don’t have taste for.
I have to disagree about:
> Most software engineers think that reading books means reading NYT non-fiction bestsellers.
There's a lot of scifi and fantasy in nerd circles, too. Douglas Adams, Terry Pratchett, Vernor Vinge, Charlie Stross, Iain M Banks, Arthur C Clarke, and so on.
But simply enjoying good writing is not enough to fully get what makes writing good. Even writing is not itself enough to get such a taste: thinking of Arthur C Clarke, I've just finished 3001, and at the end Clarke gives thanks to his editors, noting his own experience as an editor meant he held a higher regard for editors than many writers seemed to. Stross has, likewise, blogged about how writing a manuscript is only the first half of writing a book, because then you need to edit the thing.
It's much more efficient and intentional for the writer to put in the time to do the condensing and organizing once, and to review and proofread it to make sure it says what they mean, than to lazily spam every human they want to read it with the raw prompt, so that every recipient has to pay their own AI to perform that task like a slot machine, producing random results never reviewed or approved by the author as their intended message.
Is that really how you want Hacker News discussions and your work email to be, walls of unorganized unfiltered text prompts nobody including yourself wants to take the time to read? Then step aside, hold my beer!
Or do you prefer I should call you on the phone and ramble on for hours in an unedited meandering stream of thought about what I intended to write?
Slop looks reasonable on the surface, and requires orders of magnitude more effort to evaluate than to produce. It’s produced once, but the process has to be repeated for every single reader.
Disregarding content that smells like AI becomes an extremely tempting early filtering mechanism to separate signal from noise - the reader’s time is valuable.
It is to me, because it indicates the author didn't care about the topic. The only thing they cared about was writing an "insightful" article about using LLMs. Hence this whole thing is basically LinkedIn resume-improvement slop.
Not worth interacting with, imo
Also, it's not insightful whatsoever. It's basically a retelling of other articles around the time Claude code was released to the public (March-August 2025)
It's not just misleading — it's lazy. And honestly? That doesn't vibe with me.
[/s obviously]
This is clearly a standard AI exposition:
LLMs are like unreliable interns with boundless energy. They make silly mistakes, wander into annoying structural traps, and have to be unwound if left to their own devices. It's like the genie that almost pathologically misinterprets your wishes.
The LLM does most of the coding, yet I wouldn't call it "vibe coding" at all.
"Tele coding" would be more appropriate.
For me what works well is to ask it to write some code upfront to verify its assumptions against actual reality, not just telling it to review the sources "in detail". It gains much more from real output from the code, and that clears up wrong assumptions. Do some smaller jobs, write up md files, then plan the big thing, then execute.
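A toy sketch of what I mean by an assumption-checking probe (every name here is hypothetical, not from any real codebase): the agent writes something like this, runs it, and the real output goes back into the planning context.

    # throwaway probe the agent writes before planning; all names are made up.
    # Run it once, paste the real output back into the session, then delete it.
    import json
    from myapp.billing import compute_invoice  # the (hypothetical) module under discussion

    sample = {"customer_id": 42, "items": [{"sku": "A1", "qty": 3, "unit_price": 9.99}]}
    result = compute_invoice(sample)

    print(type(result))                               # dict? dataclass? Decimal totals?
    print(json.dumps(result, indent=2, default=str))  # which fields actually come back?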
The resulting artefact, that's what is worth testing.
This looks exactly like what anthropic recommends as the best practice for using Claude Code. Textbook.
It also exposes a major downside of this approach: if you don't plan perfectly, you'll have to start over from scratch if anything goes wrong.
I've found a much better approach in doing a design -> plan -> execute in batches, where the plan is no more than 1,500 lines, used as a proxy for complexity.
My 30,000 LOC app has about 100,000 lines of plan behind it. Can't build something that big as a one-shot.
This is my experience too, but it's pushed me to make much smaller plans and to commit things to a feature branch far more atomically so I can revert a step to the previous commit, or bin the entire feature by going back to main. I do this far more now than I ever did when I was writing the code by hand.
This is how developers should work regardless of how the code is being developed. I think this is a small but very real way AI has actually made me a better developer (unless I stop doing it when I don't use AI... not tried that yet.)
I bet if they did a work and motion study on this approach they'd find the classic:
"Thinks they're more productive, AI has actually made them less productive"
But lots of lovely dopamine from this false progress that gets thrown away!
Yes. In fact, that's not emphatic enough: HELL YES!
More specifically, developers should experiment. They should test their hypothesis. They should try out ideas by designing a solution and creating a proof of concept, then throw that away and build a proper version based on what they learned.
If your approach to building something is to implement the first idea you have and move on then you are going to waste so much more time later refactoring things to fix architecture that paints you into corners, reimplementing things that didn't work for future use cases, fixing edge cases that you hadn't considered, and just paying off a mountain of tech debt.
I'd actually go so far as to say that if you aren't experimenting and throwing away solutions that don't quite work then you're only amassing tech debt and you're not really building anything that will last. If it does it's through luck rather than skill.
Also, this has nothing to do with AI. Developers should be working this way even if they handcraft their artisanal code carefully in vi.
Yes? I can't even count how many times I worked on something my company deemed was valuable only for it to be deprecated or thrown away soon after. Or, how many times I solved a problem but apparently misunderstood the specs slightly and had to redo it. Or how many times we've had to refactor our code because scope increased. In fact, the very existence of the concepts of refactoring and tech debt proves that devs often spend a lot of time making the "wrong" thing.
Is it a waste? No, it solved the problem as understood at the time. And we learned stuff along the way.
This is the way for me as well. Have a high-level master design and plan, but break it apart into phases that are manageable. One-shotting anything beyond a todo list and expecting decent quality is still a pipe dream.
Just because plan is elaborate doesn’t mean it makes sense.
You just revert what the AI agent changed and revise/iterate on the previous step - no need to start over. This can of course involve restricting the work to a smaller change so that the agent isn't overwhelmed by complexity.
And of course there are shortcuts in life. Any form of progress, whether it's cars, medicine, computers, or the internet, is a shortcut. It makes life easier for a lot of people.
I see similar benefits at work. I can wow management with LLM-assisted/vibe-coded apps. What previously would've taken a multi-man team weeks of planning and executing, stand-ups, jour fixes, architecture diagrams, etc. can now be done within a single week by myself. For the type of work I do, managers do not care whether I could do it better if I coded it myself. They are amazed, however, that what previously took months can now be done in hours. And I for sure will try to reap the benefits of LLMs for as long as they don't replace me, rather than being idealistic and fighting against them.
They write a short high level plan (let's say 200 words). The plan asks the agent to write a more detailed implementation plan (written by the LLM, let's say 2000-5000 words).
They read this plan and adjust as needed, even sending it to the agent for re-dos.
Once the implementation plan is done, they ask the agent to write the actual code changes.
Then they review that and ask for fixes, adjustments, etc.
This can be comparable to writing the code yourself but also leaves a detailed trail of what was done and why, which I basically NEVER see in human generated code.
That alone is worth gold, by itself.
And on top of that, if you're using an unknown platform or stack, it's basically a rocket ship. You bootstrap much faster. Of course, stay on top of the architecture, do controlled changes, learn about the platform as you go, etc.
I have a road map (AI generated, of course) for a side project I'm toying around with to experiment with LLM-driven development. I read the road map and I understand and approve it. Then, using some skills I found on skills.sh and slightly modified, my workflow is as such:
1. Brainstorm the next slice
It suggests a few items from the road map that should be worked on, with some high-level methodology for implementing them. It asks me what the scope ought to be and what invariants ought to be considered. I ask it what the tradeoffs could be, why, and what it recommends, given the product constraints. I approve a given slice of work.
NB: this is the part I learn the most from. I ask it why X process would be better than Y process given the constraints and it either corrects itself or it explains why. "Why use an outbox pattern? What other patterns could we use and why aren't they the right fit?"
2. Generate slice
After I approve what to work on next, it generates a high level overview of the slice, including files touched, saved in a MD file that is persisted. I read through the slice, ensure that it is indeed working on what I expect it to be working on, and that it's not scope creeping or undermining scope, and I approve it. It then makes a plan based off of this.
3. Generate plan
It writes a rather lengthy plan, with discrete task bullets at the top. Beneath, each step has to-dos for the LLM to follow, such as generating tests, running migrations, etc., with commit messages for each step. I glance through this for any potential red flags.
4. Execute
This part is self explanatory. It reads the plan and does its thing.
I've been extremely happy with this workflow. I'll probably write a blog post about it at some point.
Keep frying your brain with neural network powered autocomplete.
By using it first-hand or by a colleague? And useful to whom, you, or the person writing it? There are plenty of people in this thread who have actually used this "garbage process," myself included, to produce stuff we, and our colleagues, find is useful.
Have fun paying for "Think for me Saas".
2025-2026: the years everyone became the mental equivalent of obese and let their brain atrophy. There are no shortcuts in life that don't come at a huge cost. Remember how everyone forgot how to navigate without a maps app? That's going to be you with writing code/reading code/thinking about code.
If someone’s brain atrophies, that’s a user problem, not a tool problem.
I've been thinking about doing something like that myself because I'm one of those people who have tried countless apps but there's always a couple deal breakers that cause me to drop the app.
I figured trying to agentically develop a planner app with the exact feature set I need would be an interesting and fun experiment.
With some problems, AI surprised me immensely with fast, elegant, efficient solutions and problem solving. I've also experienced AI doing totally absurd things that ended up taking multiple times longer than if I'd done them manually. Sometimes in the same project.
Prompting basic notes apps is not as exciting, but I can see how people who care about that also care about it being exactly a certain way, so I think I get your excitement.
This makes no sense to my intuition of how an LLM works. It's not that I don't believe this works, but my mental model doesn't capture why asking the model to read the content "more deeply" will have any impact on whatever output the LLM generates.
Same reason that "Pretend you are an MIT professor" or "You are a leading Python expert" or similar works in prompts. It tells the model to pay attention to the part of the corpus that has those terms, weighting them more highly than all the other programming samples that it's run across.
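If you want to see it for yourself, a minimal A/B sketch with the Anthropic Python SDK (the model name and both prompts are placeholders, swap in whatever you actually use):

    # compare a bare system prompt against a persona-framed one
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    task = "Write a Python function that merges overlapping date ranges."

    for system in (
        "You are a helpful assistant.",
        "You are a senior Python engineer who writes production-grade, well-tested code.",
    ):
        reply = client.messages.create(
            model="claude-sonnet-4-20250514",  # placeholder model id
            max_tokens=1024,
            system=system,
            messages=[{"role": "user", "content": task}],
        )
        print(f"--- system: {system!r} ---")
        print(reply.content[0].text)

Run it a handful of times per variant; whether the persona version is consistently better is exactly the kind of thing worth checking empirically rather than taking on faith.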
Maybe you remember that, without reinforcement learning, the models of 2019 just completed the sentences you gave them. There were no tool calls like reading files. Tool-calling behavior is company specific and highly tuned to their harnesses. How often they call a tool is not part of the base training data.
Just a theory.
So if you send some Python code, then the first function can be one expert, the second another expert, and so on.
This pretend-you-are-a-[persona] is cargo cult prompting at this point. The persona framing is just decoration.
A brief purpose statement describing what the skill [skill.md] does is more honest and just as effective.
These tools are literally designed to make people behave like gamblers. And it's working, except the house in this case takes the money you give them and lights it on fire.
Unless someone can come up with some kind of rigorous statistics on what the effect of this kind of priming is it seems no better than claiming that sacrificing your first born will please the sun god into giving us a bountiful harvest next year.
Sure, maybe this supposed deity really is this insecure and needs a jolly good pep talk every time he wakes up. Or maybe you're just suffering from magical thinking that your incantations had any effect on the random-variable word machine.
The thing is, you could actually prove it. It's an optimization problem: you have a model, you can generate the statistics. But no one, as far as I can tell, has been terribly forthcoming with that, either because those who have tried have decided to keep their magic spells secret, or because it doesn't really work.
If it did work, well, the oldest trick in computer science is writing compilers; I suppose we will just have to write an English-to-pedantry compiler.
"Add tests to this function" for GPT-3.5-era models was much less effective than "you are a senior engineer. add tests for this function. as a good engineer, you should follow the patterns used in these other three function+test examples, using this framework and mocking lib." In today's tools, "add tests to this function" results in a bunch of initial steps to look in common places to see if that additional context already exists, and then pull it in based on what it finds. You can see it in the output the tools spit out while "thinking."
So I'm 90% sure this is already happening on some level.
https://github.com/solatis/claude-config
It’s based entirely off academic research, and a LOT of research has been done in this area.
One of the papers you may be interested in is “emotion prompting”, eg “it is super important for me that you do X” etc actually works.
“Large Language Models Understand and Can be Enhanced by Emotional Stimuli”
A common technique is to prompt in your chosen AI to write a longer prompt to get it to do what you want. It's used a lot in image generation. This is called 'prompt enhancing'.
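A rough sketch of the two-step pattern (the Anthropic SDK is used purely as an example, the model name is a placeholder, and the enhanced prompt would go to whatever image or code generator you actually use):

    import anthropic

    client = anthropic.Anthropic()
    MODEL = "claude-sonnet-4-20250514"  # placeholder model id

    rough = "a cozy cabin in the woods at night"

    # step 1: ask the model to expand a rough idea into a detailed prompt
    enhanced = client.messages.create(
        model=MODEL,
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"Rewrite this as a detailed, specific image-generation prompt: {rough}",
        }],
    ).content[0].text

    # step 2: feed the enhanced prompt into whatever generator you actually use
    print(enhanced)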
This field is full of it. Practices are promoted by those who tie their personal or commercial brand to it for increased exposure, and adopted by those who are easily influenced and don't bother verifying if they actually work.
This is why we see a new Markdown format every week, "skills", "benchmarks", and other useless ideas, practices, and measurements. Consider just how many "how I use AI" articles are created and promoted. Most of the field runs on anecdata.
It's not until someone actually takes the time to evaluate some of these memes that they find little to no practical value in them. [1]
Now? We have AGENTS.md files that look like a parent talking to a child, with all the bold, all-caps, double emphasis, just praying that's enough to be sure they run the commands you want them to be running.
[1] Outside of some core ML developers at the big model companies.
I've been practicing playing songs by ear, and after 2 weeks my brain has developed an inference model of where my fingers should go to hit any given pitch.
Do I have any idea how my brain’s model works? No! But it tickles a different part of my brain and I like it.
thats hilarious. i definitely treat claude like shit and ive noticed the falloff in results.
if there's a source for that i'd love to read about it.
See, uhhh, https://pmc.ncbi.nlm.nih.gov/articles/PMC8052213/ and maybe have a shot at running claude while playing Enya albums on loop.
/s (??)
sometimes internet arguments get messy, people die on their hills and double / triple down on internet message boards. since historic internet data composes a bit of what goes into an llm, would it make sense that bad-juju prompting sends it to some dark corners of its training model if implementations don't properly sanitize certain negative words/phrases ?
in some ways llm stuff is a very odd mirror that haphazardly regurgitates things resulting from the many shades of gray we find in human qualities.... but presents results as matter of fact. the amount of internet posts with possible code solutions and more where people egotistically die on their respective hills that have made it into these models is probably off the charts, even if the original content was a far cry from a sensible solution.
all in all llm's really do introduce quite a bit of a black box. lot of benefits, but a ton of unknowns, and one must be hypervigilant to the possible pitfalls of these things... but more importantly be self aware enough to understand the possible pitfalls that these things introduce to the person using them. they really, possibly dangerously, capitalize on everyone's innate need to want to be a valued contributor.

it's really common now to see so many people biting off more than they can chew, often times lacking the foundations that would've normally had a competent engineer pumping the brakes. i have a lot of respect/appreciation for people who might be doing a bit of claude here and there but are flat out forward about it in their readme and very plainly state to not have any high expectations because _they_ are aware of the risks involved here. i also want to commend everyone who writes their own damn readme.md.
these things are for better or for worse great at causing people to barrel forward through 'problem solving', which presents quite a bit of gray area on whether or not the problem is actually solved / how can you be sure / do you understand how the fix/solution/implementation works (in many cases, no). this is why exceptional software engineers can use this technology insanely proficiently as a supplementary worker of sorts, but others find themselves in a design/architect seat for the first time and call tons of terrible shots throughout the course of what it is they are building.

i'd at least like to call out that people who feel like they "can do everything on their own and don't need to rely on anyone" anymore seem to have lost the plot entirely. there are facets of that statement that might be true, but less collaboration, especially in organizations, is quite frankly the first step some people take towards becoming delusional. and that is always a really sad state of affairs to watch unfold. doing stuff in a vacuum is fun on your own time, but forcing others to just accept things you built in a vacuum when you're in any sort of team structure is insanely immature and honestly very destructive/risky.

i would like to think absolutely no one here is surprised that some sub-orgs at Microsoft force people to use copilot or be fired, very dangerous path they tread there as they bodyslam into place solutions that are not well understood. suddenly all the leadership decisions that many companies have made to once again bring back a before-times era of offshoring work make sense: they think that with these technologies existing, the subordinate culture of overseas workers combined with these techs will deliver solutions no one can push back on. great savings, and also no one will say no.
I'm not being sarcastic. This is absolutely incredible.
It's easy to know why they work. The magic invocation increases test-time compute (easy to verify yourself - try!). And an increase in test-time compute is demonstrated to increase answer correctness (see any benchmark).
It might surprise you to know that the only difference between GPT 5.2-low and GPT 5.2-xhigh is one of these magic invocations. But that's not supposed to be public knowledge.
Without something quantifiable, it's not much better than someone who always wears the same jersey when their favorite team plays, and swears they play better because of it.
But I get the impression from your comment that you have a fixed idea, and you're not really interested in understanding how or why it works.
If you think like a hammer, everything will look like a nail.
The system is inherently non-deterministic. Just because you can guide it a bit, doesn't mean you can predict outcomes.
The system isn't randomly non-deterministic; it is statistically probabilistic.
Next-token prediction and the attention mechanism are actually rigorous, deterministic mathematical processes. The variation in output comes from how we sample from that curve, and from the temperature used to calibrate the model. Because the underlying probabilities are mathematically calculated, the system's behavior remains highly predictable within statistical bounds.
Yes, it's a departure from the fully deterministic systems we're used to. But that's not different than the many real world systems: weather, biology, robotics, quantum mechanics. Even the computer you're reading this right now is full of probabilistic processes, abstracted away through sigmoid-like functions that push the extremes to 0s and 1s.
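A toy illustration of the temperature point, with made-up logits standing in for a real model's scores over four candidate tokens:

    import numpy as np

    logits = np.array([4.0, 3.2, 1.0, 0.1])      # scores for 4 candidate tokens

    def next_token_probs(logits, temperature):
        scaled = logits / temperature
        scaled -= scaled.max()                   # for numerical stability
        p = np.exp(scaled)
        return p / p.sum()

    for t in (0.1, 0.7, 1.5):
        print(t, np.round(next_token_probs(logits, t), 3))

    # low temperature piles nearly all the mass on the top token (near-deterministic);
    # high temperature flattens the curve, so sampling gives more varied completions.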
> Yes, it's a departure from the fully deterministic systems we're used to.
A system either produces the same output given the same input[1], or doesn't.
LLMs are nondeterministic by design. Sure, you can configure them with a zero temperature, a static seed, and so on, but they're of no use to anyone in that configuration. The nondeterminism is what gives them the illusion of "creativity", and other useful properties.
Classical computers, compilers, and programming languages are deterministic by design, even if they do contain complex logic that may affect their output in unpredictable ways. There's a world of difference.
[1]: Barring misbehavior due to malfunction, corruption or freak events of nature (cosmic rays, etc.).
Is it engineering? Maybe not. But neither is knowing how to talk to junior developers so they're productive and don't feel bad. The engineering is at other levels.
So 60% of the time, it works every time.
... This fucking industry.
You could take the exact same documents, prompts, and whatever other bullshit, run it on the exact same agent backed by the exact same model, and get different results every single time. Just like you can roll dice the exact same way on the exact same table and you'll get two totally different results. People are doing their best to constrain that behavior by layering stuff on top, but the foundational tech is flawed (or at least ill suited for this use case).
That's not to say that AI isn't helpful. It certainly is. But when you are basically begging your tools to please do what you want with magic incantations, we've lost the fucking plot somewhere.
And even a human engineer might not solve a problem the same way twice in a row, based on changes in recent inspirations or tech obsessions. What's the difference, as long as it passes review and does the job?
This is more of an implementation detail, done this way to get better results. A neural network with fixed weights (and deterministic floating-point operations) returns a probability distribution; if you sample from it with a pseudorandom generator with a fixed seed and call it recursively, it will always return the same output for the same input.
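A minimal sketch of that claim, with a fixed probability table standing in for the network and a seeded sampler:

    import numpy as np

    vocab = ["the", "cat", "sat", "mat"]
    probs = np.array([0.5, 0.2, 0.2, 0.1])    # stand-in for the model's fixed output distribution

    def generate(seed, n_tokens=8):
        rng = np.random.default_rng(seed)     # fixed seed => reproducible draws
        return [vocab[rng.choice(len(vocab), p=probs)] for _ in range(n_tokens)]

    print(generate(seed=42))
    print(generate(seed=42))                  # identical to the line above
    print(generate(seed=7))                   # different seed, different sequence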
think of the latent space inside the model like a topological map, and when you give it a prompt, you're dropping a ball at a certain point above the ground, and gravity pulls it along the surface until it settles.
caveat though, thats nice per-token, but the signal gets messed up by picking a token from a distribution, so each token you're regenerating and re-distorting the signal. leaning on language that places that ball deep in a region that you want to be makes it less likely that those distortions will kick it out of the basin or valley you may want to end up in.
if the response you get is 1000 tokens long, the initial trajectory needed to survive 1000 probabilistic filters to get there.
or maybe none of that is right lol but thinking that it is has worked for me, which has been good enough
The claw machine is also a sort-of-lie, of course. Its main appeal is that it offers the illusion of control. As a former designer and coder of online slot machines... I could totally spin off into pages on this analogy, about how that illusion gets you to keep pulling the lever... but the geographic rendition you gave is sort of priceless when you start making the comparison.
i think probably once you start seeing that the behavior falls right out of the geometry, you just start looking at stuff like that. still funny though.
- You are a Python Developer... or
- You are a Professional Python Developer... or
- You are one of the world's most renowned Python Experts, with several books written on the subject, and 15 years of experience in creating highly reliable production-quality code...
You will notice a clear improvement in the quality of the generated artifacts.
I am having the most success describing what I want as humanly as possible, describing outcomes clearly, making sure the plan is good and clearing context before implementing.
That's very different from "think deeper". I'm just curious about this case in specific :)
Of course, that doesn't mean it'll definitely be better, but if you're making an LLM chain it seems prudent to preserve whatever info you can at each step.
"Large Language Models Understand and Can be Enhanced by Emotional Stimuli": https://arxiv.org/abs/2307.11760
To the extent that LLMs mimic human behaviour, it shouldn’t be a surprise that setting clear expectations works there too.
(chirp)
—HAL, please open the shuttle bay doors.
(pause)
—HAL!
—I'm afraid I can't do that, Dave.
In image generation, it's fairly common to add "masterpiece", for example.
I don't think of the LLM as a smart assistant that knows what I want. When I tell it to write some code, how does it know I want it to write the code like a world renowned expert would, rather than a junior dev?
I mean, certainly Anthropic has tried hard to make the former the case, but the Titanic inertia from internet scale data bias is hard to overcome. You can help the model with these hints.
Anyway, luckily this is something you can empirically verify. This way, you don't have to take anyone's word. If anything, if you find I'm wrong in your experiments, please share it!
I am not sure if we know why really, but they are that way and you need to explicitly prompt around it.
Lazy thinking makes LLMs do surface analysis and then produce things that are wrong. Neurotic thinking will see them over-analyze, and then repeatedly second-guess themselves, repeatedly re-derive conclusions.
Something very similar to an anxiety loop in humans, where problems without solutions are obsessed about in circles.
My workflow is more like scaffold -> thin vertical slices -> machine-checkable semantics -> repeat.
Concrete example: I built and shipped a live ticketing system for my club (Kolibri Tickets). It’s not a toy: real payments (Stripe), email delivery, ticket verification at the door, frontend + backend, migrations, idempotency edges, etc. It’s running and taking money.
The reason this works with AI isn’t that the model “codes fast”. It’s that the workflow moves the bottleneck from “typing” to “verification”, and then engineers the verification loop:
- keep the spine runnable early (end-to-end scaffold)
- add one thin slice at a time (don't let it touch 15 files speculatively)
- force checkable artifacts (tests/fixtures/types/state-machine semantics where it matters)
- treat refactors as normal, because the harness makes them safe

If you run it open-loop (prompt -> giant diff -> read/debug), you get the "illusion of velocity" people complain about. If you run it closed-loop (scaffold + constraints + verifiers), you can actually ship faster because you're not paying the integration cost repeatedly. Plan docs are one way to create shared state and prevent drift. A runnable scaffold + verification harness is another (a small sketch of what I mean by a checkable artifact is below).
[0]: https://kiro.dev/
[1]: https://antigravity.google/
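To make "checkable artifacts" concrete, here is the flavor of thing I mean, with hypothetical names rather than anything from the actual ticketing codebase: the allowed ticket state transitions pinned down as data, plus a test the agent has to keep green.

    import pytest

    ALLOWED = {
        "created":    {"paid", "cancelled"},
        "paid":       {"checked_in", "refunded"},
        "checked_in": set(),                  # terminal
        "refunded":   set(),                  # terminal
        "cancelled":  set(),                  # terminal
    }

    def can_transition(src: str, dst: str) -> bool:
        return dst in ALLOWED.get(src, set())

    @pytest.mark.parametrize("src,dst,ok", [
        ("created", "paid", True),
        ("paid", "checked_in", True),
        ("checked_in", "refunded", False),    # no changes after check-in
        ("cancelled", "paid", False),         # can't resurrect a cancelled ticket
    ])
    def test_ticket_transitions(src, dst, ok):
        assert can_transition(src, dst) is ok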
First, the "big bang": write it all at once. You are going to end up with thousands of lines of code that were monolithically produced. I think it is much better to have it write the plan and formulate it as sensible technical steps that can be completed one at a time. Then you can work through them. I get that this is not very "vibe"-ish, but that is kind of the point. I want the AI to help me get to the same point I would otherwise be at, with produced code AND an understanding of it, just accelerating that process. I'm not really interested in just generating thousands of lines of code that nobody understands.
Second, the author keeps referring to adjusting the behaviour, but never incorporating that into long-lived guidance. To me, integral to the planning process is building an overarching knowledge base. Every time you tell it there's something wrong, you need to tell it to update the knowledge base about why, so it doesn't do it again.
Finally, no mention of tests? Just quick checks? To me, you have to end up with comprehensive tests. Maybe to the author it goes without saying, but I find it integral to build this into the planning. Certain stages will want certain types of tests. Sometimes in advance of the code (TDD style), other times built alongside it or after.
It's definitely going to be interesting to see how software methodology evolves to incorporate AI support and where it ultimately lands.
I get the PLAN.md (or equivalent) separated into "phases" or stages, then carefully prompt it (because Claude and Codex both love to "keep going") to implement only that stage and update the PLAN.md.
Tests are crucial too, and form another part of the plan really. Though my current workflow begins to build them later in the process than I would prefer...
I craft a detailed and ordered set of lecture notes in a Quarto file and then have a dedicated claude code skill for translating those notes into Slidev slides, in the style that I like.
Once that's done, much like the author, I go through the slides and make commented annotations like "this should be broken into two slides" or "this should be a side-by-side" or "use your generate clipart skill to throw an image here alongside these bullets" and "pull in the code example from ../examples/foo." It works brilliantly.
And then I do one final pass of tweaking after that's done.
But yeah, annotations are super powerful. Token distance in-context and all that jazz.
The author mentions annotations but doesn't go into detail about how to feed the annotations to Claude.
<!-- TODOCLAUDE: Split this into a two-cols-title, divide the examples between -->
or <!-- TODOCLAUDE: Use clipart skill to make an image for this slide -->
And then, when I finish annotating, I just say: "Address all the TODOCLAUDEs"

But it's not hard to build one. The key for me was describing, in great detail:
1. How I want it to read the source material (e.g., H1 means new section, H2 means at least one slide, a link to an example means I want code in the slide)
2. How to connect material to layouts (e.g., "comparison between two ideas should be a two-cols-title," "walkthrough of code should be two-cols with code on right," "learning objectives should be side-title align:left," "recall should be side-title align:right")
Then the workflow is:
1. Give all those details and have it do a first pass.
2. Give tons of feedback.
3. At the end of the session, ask it to "make a skill."
4. Manually edit the skill so that you're happy with the examples.
There's no winner for "least amount of code written regardless of productivity outcomes", except maybe Anthropic's bank account.
Yesterday I had Claude write an audit logging feature to track all changes made to entities in my app. Yeah you get this for free with many frameworks, but my company's custom setup doesn't have it.
It took maybe 5-10 minutes of wall time to come up with a good plan, and then ~20-30 min for Claude to implement, test, etc.
That would've taken me at least a day, maybe two. I had 4-5 other tasks going on in other tabs while I waited the 20-30 min for Claude to generate the feature.
After Claude generated, I needed to manually test that it worked, and it did. I then needed to review the code before making a PR. In all, maybe 30-45 minutes of my actual time to add a small feature.
All I can really say is... are you sure you're using it right? Have you _really_ invested time into learning how to use AI tools?
Fast forward to today and I tried the tools again--specifically Claude Code--about a week ago. I'm blown away. I've reproduced some tools that took me weeks at full-time roles in a single day. This is while reviewing every line of code. The output is more or less what I'd be writing as a principal engineer.
I certainly hope this is not true, because then you're not competent for that role. Claude Code writes an absolutely incredible amount of unnecessary and superfluous comments, and it makes asinine mistakes like forgetting to update logic in multiple places. It'll gladly drop the entire database when changing column formats, just as an example.
The problem is LLMs are great at simple implementation, even large amounts of simple implementation, but I've never seen it develop something more than trivial correctly. The larger problem is it's very often subtly but hugely wrong. It makes bad architecture decisions, it breaks things in pursuit of fixing or implementing other things. You can tell it has no concept of the "right" way to implement something. It very obviously lacks the "senior developer insight".
Maybe you can resolve some of these with large amounts of planning or specs, but that's the point of my original comment - at what point is it easier/faster/better to just write the code yourself? You don't get a prize for writing the least amount of code when you're just writing specs instead.
It's an okay mental energy saver for simpler things, but for me the self review in an actual production code context is much more draining than writing is.
I guess we're seeing the split of people for whom reviewing is easy and writing is difficult and vice versa.
The original article is, to me, not that novel. Not because it's a trite example, but because I've begun to experience massive gains from following the same basic premise as the article. And I can't believe there are others who aren't using it like this.
I iterate the plan until it's seemingly deterministic, then I strip the plan of implementation, and re-write it following a TDD approach. Then I read all specs, and generate all the code to red->green the tests.
If this commenter is too good for that, then it's that attitude that'll keep him stuck. I already feel like my projects backlog is achievable, this year.
More recently, I tried the same experiment, again with Claude. I used the exact same prompt. This time, the aim was exactly correct. Instead of spending my time trying to correct it, I was able to ask it to add features. I've spent more time writing this comment on HN than I spent optimizing this toy. https://claude.ai/public/artifacts/d7f1c13c-2423-4f03-9fc4-8...
My point is that AI-assisted coding has improved dramatically in the past few months. I don't know whether it can reason deeply about things, but it can certainly imitate a human who reasons deeply. I've never seen any technology improve at this rate.
What are you working on? I personally haven't seen LLMs struggle with any kind of problem in months. Legacy codebase with great complexity and performance-critical code. No issue whatsoever regardless of the size of the task.
This is 100% incorrect, but the real issue is that the people who are using these llms for non-trivial work tend to be extremely secretive about it.
For example, I view my use of LLMs to be a competitive advantage and I will hold on to this for as long as possible.
Does it write maintainable code? Does it write extensible code? Does it write secure code? Does it write performant code?
My experience has been it failing most of these. The code might "work", but it's not good for anything more than trivial, well-defined functions (that probably appeared in its training data, written by humans). LLMs have a fundamental lack of understanding of what they're doing, and it's obvious when you look at the finer points of the outcomes.
That said, I'm sure you could write detailed enough specs and provide enough examples to resolve these issues, but that's the point of my original comment - if you're just writing specs instead of code you're not gaining anything.
But the aha moment for me was that what's maintainable by AI vs. by me by hand are in different realms. So "maintainable" has to evolve from good human design patterns to good AI patterns.
Specs are worth it IMO. Not because if I can spec, I could’ve coded anyway. But because I gain all the insight and capabilities of AI, while minimizing the gotchas and edge failures.
I don't find that LLMs are any more likely than humans to remember to update all of the places it wrote redundant functions. Generally far less likely, actually. So forgive me for treating this claim with a massive grain of salt.
How do you square that with the idea that all the code still has to be reviewed by humans? Yourself, and your coworkers
So maybe it's that we won't be reviewing by hand anymore? I.e. it's LLMs all the way down. Trying to embrace that style of development lately as unnatural as it feels. We're obv not 100% there yet but Claude Opus is a significant step in that direction and they keep getting better and better.
And you don’t blame humans anyways lol. Everywhere I’ve worked has had “blameless” postmortems. You don’t remove human review unless you have reasonable alternatives like high test coverage and other automated reviews.
“It’s AI all the way down” is either nonsense on its face, or the industry is dead already.
yes, if I steer it properly.
It's very good at spotting design patterns, and implementing them. It doesn't always know where or how to implement them, but that's my job.
The specs and syntactic sugar are just nice quality of life benefits.
The compounding is much greater than my brain can do on its own.
But did you truly think about such a feature? Like the guarantees it should uphold (for example, how it should cope with entity migrations like adding a new field), or the cost of maintaining it further down the line. This looks suspiciously like a drive-by PR made on an open-source project.
> That would've taken me at least a day, maybe two.
I think those two days would have been filled with research, comparing alternatives, questions like "can we extract this feature from framework X?", discussing ownership and sharing knowledge,.. Jumping on coding was done before LLMs, but it usually hurts the long term viability of the project.
Adding code to a project can be done quite fast (hackathons, ...); ensuring quality is what slows things down in any well-functioning team.
Some things are complex.
You could've been curious and asked why it would take 1-2 days, and I would've happily told you.
I wanted to add audit logging for all endpoints we call, all places we call the DB, etc. across areas I haven't touched before. It would have taken me a while to track down all of the touchpoints.
Granted, I am not 100% certain that Claude didn't miss anything. I feel fairly confident that it is correct given that I had it research upfront, had multiple agents review, and it made the correct changes in the areas that I knew.
Also I'm realizing I didn't mention it included an API + UI for viewing events w/ pretty deltas
I think the method in TFA is overall less stressful for the dev. And you can always fix it up manually in the end; AI coding vs manual coding is not either-or.
That said, if you're on a serious team writing professional software there is still tons of value in always telling AI to plan first, unless it's a small quick task. This post just takes it a few steps further and formalizes it.
I find Cursor works much more reliably using plan mode, reviewing/revising output in markdown, then pressing build. Which isn't a ton of overhead but often leads to lots of context switching as it definitely adds more time.
I find the best way to use agents (and I don't use claude) is to hash it out like I'm about to write these changes and I make my own mental notes, and get the agent to execute on it.
Agents don't get tired, they don't start fat fingering stuff at 4pm, the quality doesn't suffer. And they can be parallelised.
Finally, this allows me to stay at a higher level and not get bogged down in "right, oh, did we do this simple thing again?", which wipes some of the context in my mind and gets tiring through the day.
Always, 100% review every line of code written by an agent though. I do not condone committing code you don't 'own'.
I'll never agree with a job that forces developers to use 'AI', I sometimes like to write everything by hand. But having this tool available is also very powerful.
This new version that I'm doing (from scratch with ChatGPT web) has a far more ambitious scope and is already at the "usable" point. Now I'm primarily solidifying things and increasing test coverage. And I've tested the key parts with IRL scenarios to validate that it's not just passing tests; the thing actually fulfills its intended function so far. Given the increased scope, I'm guessing it'd take me a few months to get to this point on my own, instead of under a week, and the quality wouldn't be where it is. Not saying I haven't had to wrangle with ChatGPT on a few bugs, but after a decent initial planning phase, my prompts now are primarily "Do it"s and "Continue"s. Would've likely already finished it if I wasn't copying things back and forth between browser and editor, and being forced to pause when I hit the message limit.
I recommend to try out Opencode with this approach, you might find it less tiring than ChatGPT web (yes it works with your ChatGPT Plus sub).
This is where our challenges are. We've built our own chatbot where you can "build" your own agent within the LibreChat framework and add a "skill" to it. I say "skill" because it's older than Claude skills but does exactly the same thing. I don't completely buy the author's:
> “deeply”, “in great details”, “intricacies”, “go through everything”
bit, but you can obviously save a lot of time by writing a piece of English which tells it what sort of environment you work in. It'll know, for example, that when I write Python I use UV, Ruff, and Pyrefly, and so on. I personally also have a "skill" setting that tells the AI not to compliment me, because I find that ridiculously annoying, and that certainly works. So who knows? Anyway, employees are going to want more. I've been doing some PoCs running open source models in isolation on a Raspberry Pi (we had spares because we use them in IoT projects), but it's hard to set up an isolation policy which can't be circumvented.
We'll have to figure it out though. For power-plant-critical projects we don't want to use AI. But for the web tool that allows a couple of employees to upload three Excel files from an external accountant and then generate some sort of report on them? Who cares who writes it, or even what sort of quality it's written with? The lifecycle of that tool will probably be that it never changes until the external accountant does, and then the tool dies. Not that it would necessarily have been written with better quality without AI... I mean... Have you seen some of the stuff we've written in the past 40 years?
This! Once I'm familiar with the codebase (which I strive to become very quickly), for most tickets I usually have a plan by the time I've read the description. I can have a couple of implementation questions, but I know where the info is located in the codebase. For things I only have a vague idea about, the whiteboard is where I go.
The nice thing with such a mental plan is that you can start with a rougher version (like a drawing sketch). Like if I'm starting a new UI screen, I can put in a placeholder text like "Hello, world", then work on navigation. Once that's done, I can start to pull data, then I add mapping functions to have a view model, ...
Each step is a verifiable milestone. Describing them is more mentally taxing than just writing the code (which is a flow state for me). Why? Because English is not fit to describe how a computer works (try describing a finite-state-machine-like navigation flow in natural language). My mental model is already aligned to code; writing the solution in natural language is asking me to be ambiguous and unclear on purpose.
As others have already noted, this workflow is exactly what the Google Antigravity agent (based off Visual Studio Code) has been created for. Antigravity even includes specialized UI for a user to annotate selected portions of an LLM-generated plan before iterating it.
One significant downside to Antigravity I have found so far is the fact that even though it will properly infer a certain technical requirement and clearly note it in the plan it generates (for example, "this business reporting column needs to use a weighted average"), it will sometimes quietly downgrade such a specialized requirement (for example, to a non-weighted average), without even creating an appropriate "WARNING:" comment in the generated code. Especially so when the relevant codebase already includes a similar, but not exactly appropriate API. My repetitive prompts to ALWAYS ask about ANY implementation ambiguities WHATSOEVER go unanswered.
From what I gather Claude Code seems to be better than other agents at always remembering to query the user about implementation ambiguities, so maybe I will give Claude Code a shot over Antigravity.
> One trick I use constantly: for well-contained features where I’ve seen a good implementation in an open source repo, I’ll share that code as a reference alongside the plan request. If I want to add sortable IDs, I paste the ID generation code from a project that does it well and say “this is how they do sortable IDs, write a plan.md explaining how we can adopt a similar approach.” Claude works dramatically better when it has a concrete reference implementation to work from rather than designing from scratch.
Licensing apparently means nothing.
Ripped off in the training data, ripped off in the prompt.
1) anything larger I work on in layers of docs. Architecture and requirements -> design -> implementation plan -> code. Partly it helps me think and nail the larger things first, and partly helps claude. Iterate on each level until I'm satisfied.
2) when doing reviews of each doc I sometimes restart the session and clear context, it often finds new issues and things to clear up before starting the next phase.
You might say a junior might do the same thing, but I'm not worried about it, at least the junior learned something while doing that. They could do it better next time. They know the code and change it from the middle where it broke. It's a net positive.
It's OSS.
Real-time work is happening at https://app.beadhub.ai/juanre/beadhub (beadhub is a public project at https://beadhub.ai so it is visible).
Particularly interesting (I think) is how the agents chat with each other, which you can see at https://app.beadhub.ai/juanre/beadhub/chat
There are several projects on GitHub that attempt to tackle context and memory limitations, but I haven’t found one that consistently works well in practice.
My current workaround is to maintain a set of Markdown files, each covering a specific subsystem or area of the application. Depending on the task, I provide only the relevant documents to Claude Code to limit the context scope. It works reasonably well, but it still feels like a manual and fragile solution. I’m interested in more robust strategies for persistent project context or structured codebase understanding.
Skills almost seem like a solution, but they still need an out-of-band process to keep them updated as the codebase evolves. For now, a structured workflow that includes aggressive updates at the end of the loop is what I use.
- Specs: these are generally static, but updatable as the project evolves. And they're broken out to an index file that gives a project overview, a high-level arch file, and files for all the main modules. Roughly ~1k lines of spec for 10k lines of code, and try to limit any particular spec file to 300 lines. I'm intimately familiar with every single line in these.
- Plans: these are the output of a planning session with an LLM. They point to the associated specs. These tend to be 100-300 lines and 3 to 5 phases.
- Working memory files: I use both a status.md (3-5 items per phase roughly 30 lines overall), which points to a latest plan, and a project_status (100-200 lines), which tracks the current state of the project and is instructed to compact past efforts to keep it lean)
- A planner skill I use w/ Gemini Pro to generate new plans. It essentially explains the specs/plans dichotomy, the role of the status files, and to review everything in the pertinent areas of code and give me a handful of high-level next set of features to address based on shortfalls in the specs or things noted in the project_status file. Based on what it presents, I select a feature or improvement to generate. Then it proceeds to generate a plan, updates a clean status.md that points to the plan, and adjusts project_status based on the state of the prior completed plan.
- An implementer skill in Codex that goes to town on a plan file. It's fairly simple, it just looks at status.md, which points to the plan, and of course the plan points to the relevant specs so it loads up context pretty efficiently.
I've tried the two main spec generation libraries, which were way overblown, and then I gave superpowers a shot... which was fine, but still too much. The above is all homegrown, and I've had much better success because it keeps the context lean and focused.
And I'm only on the $20 plans for Codex/Gemini, vs. spending $100/month on CC for the half year prior, and I move quicker with no stall-outs due to token consumption, which was regularly happening with CC by the 5th day. Codex rarely dips below 70% available context when it puts up a PR after an execution run. Roughly 4/5 PRs are without issue, which is the inverse of what I experienced with CC, even when only using planning mode.
I have found it to work very well with Claude by giving it context and guardrails. Basically I just tell it "follow the guidance docs" and it does. Couple that with intense testing and self-feedback mechanisms and you can easily keep Claude on track.
I have had the same experience with Codex and Claude as you in terms of token usage. But I haven't been happy with my Codex usage; Claude just feels like it's doing more of what I want in the way I want.
This is the part that seems most novel compared to what I've heard suggested before. And I have to admit I'm a bit skeptical. Would it not be better to modify what Claude has written directly, to make it correct, rather than adding the corrections as separate notes (and expecting future Claude to parse out which parts were past Claude and which parts were the operator, and handle the feedback graciously)?
At least, it seems like the intent is to do all of this in the same session, such that Claude has the context of the entire back-and-forth updating the plan. But that seems a bit unpleasant; I would think the file is there specifically to preserve context between sessions.
Personally, I like to have Claude update the plan file one more time after I've added my annotations, and then review it again. From my understanding, this ensures that Claude won't treat my annotations as a separate set of instructions, which risks conflicting work.
* create a feature-name.md file in a gitignored folder
* start the file by giving the business context
* describe a high-level implementation and user flows
* describe database structure changes (I find it important not to leave this open to interpretation)
* ask Claude to inspect the feature and review it for coherence; while answering its questions, I ask it to augment the feature-name.md file with the answers
* enter Claude's plan mode and provide that feature-name.md file
* at this point it's detailed enough that rarely any corrections from me are needed
This shortcuts a range of problem cases where the LLM is torn between the user's strict, potentially conflicting requirements and its own learning.
In the early days we used to get the LLM to write the prompts for us to get around this problem; now we have planning built in.
First, Claude evolves. The original post's work pattern evolved over 9 months, before Claude's recent step changes. It's likely Claude's present plan mode is better than this workaround, but if you stick to the workaround, you'd never know.
Second, the staging docs that represent some context - whether library skills or the current session's design and implementation plans - are not the model Claude works with. At best they shape it, but I've found it does ignore and forget even what's written (even when I shout with emphasis), and the overall session influences the code. (Most often this happens when a peripheral adjustment ends up populating half the context.)
Indeed, the biggest benefit from the OP might be squeezing everything into one session while omitting peripheral features and investigations at the plan stage. So the mechanism of action might be the combination of getting our own plan clear and avoiding confusing excursions. (A test for that would be to redo the session with just the final plan and implementation, to see whether the iteration process itself is shaping the model.)
Our bias is to believe that we're getting better at managing this thing, and that we can control and direct it. It's uncomfortable to realize you can only really influence it - much like giving direction to a junior, but they can still go off track. And even if you found a pattern that works, it might work for reasons you're not understanding -- and thus fail you eventually. So, yes, try some patterns, but always hang on to the newbie senses of wonder and terror that make you curious, alert, and experimental.
The practice is:
- simple
- effective
- retains control and quality
Certainly the “unsupervised agent” workflows are getting a lot of attention right now, but they require a specific set of circumstances to be effective:
- a clear validation loop (e.g. compile the kernel; here is a gcc that does so correctly)
- AI-enabled tooling (an MCP/CLI tool that will lint, test, and provide feedback immediately)
- oversight to prevent agents going off the rails (an open area of research)
- an unlimited token budget
That means that most people can't use unsupervised agents.
Not that they don't work; most people simply don't have an environment and task that are appropriate.
By comparison, anyone with cursor or claude can immediately start using this approach, or their own variant on it.
It does not require fancy tooling.
It does not require an arcane agent framework.
It works generally well across models.
This is one of those few genuine pieces of good practical advice for people getting into AI coding.
Simple. Obviously works once you start using it. No external dependencies. BYO tools to help with it, no “buy my AI startup xxx to help”. No “star my github so I can get a job at $AI corp too”.
Great stuff.
It’s the same reason adding a thinking step works.
You want to write a paper, you have it form a thesis and structure first. (In this one you might be better off asking for 20 and seeing if any of them are any good.) You want to research something, first you add gathering and filtering steps before synthesis.
Adding smarter words or telling it to be deeper does work by slightly repositioning where your query ends up in space.
Asking for the final product first right off the bat leads to repetitive verbose word salad. It just starts to loop back in on itself. Which is why temperature was a thing in the first place, and leads me to believe they’ve turned the temp down a bit to try and be more accurate. Add some randomness and variability to your prompts to compensate.
One step I added, that works great for me, is letting it write (api-level) tests after planning and before implementation. Then I’ll do a deep review and annotation of these tests and tweak them until everything is just right.
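For illustration, here is a minimal sketch of what such a pre-implementation, API-level test can look like; the endpoint, payload, and fields are hypothetical placeholders, not taken from the comment above:

```python
# Hypothetical API-level tests written before any implementation exists;
# they encode the behaviour the plan promises, so they can be reviewed
# and annotated first. BASE_URL and the /projects endpoint are assumptions.
import requests

BASE_URL = "http://localhost:8000"  # assumed local dev server

def test_create_project_returns_id_and_defaults():
    resp = requests.post(f"{BASE_URL}/projects", json={"name": "demo"})
    assert resp.status_code == 201
    body = resp.json()
    assert body["name"] == "demo"
    assert body["archived"] is False  # default the plan promises

def test_create_project_rejects_missing_name():
    resp = requests.post(f"{BASE_URL}/projects", json={})
    assert resp.status_code == 422  # validation error, per the plan
```

Once these read the way I want, the implementation phase has a concrete target to hit.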
The “easy” path of “short prompt declaring what I want” works OK for simple tasks but consistently breaks down for medium to high complexity tasks.
What I mean is, in practice, how does one even get to a high-complexity task? What does that look like? Because isn't it more common that one sees only so far ahead?
I've also noted such a huge gulf between some developers describing 'prompting things into existence' and the approach described in this article. Both types seem to report success, though my experience is that the latter seems more realistic, and much more likely to produce robust code that's likely to be maintainable for long term or project critical goals.
1. Use brainstorming to come up with the plan using the Socratic method
2. Write a high level design plan to file
3. I review the design plan
4. Write an implementation plan to file. We've already discussed this in detail, so usually it just needs skimming.
5. Use the worktree skill with subagent driven development skill
6. Agent does the work using subagents that for each task:
a. Implements the task
b. Spec reviews the completed task
c. Code reviews the completed task
7. When all tasks are complete: create a PR for me to review
8. Go back to the agent with any comments
9. If finished, delete the plan files and merge the PR
When you go to YouTube and search for stuff like "7 levels of claude code" this post would be maybe 3-4.
Oh, one more thing - quality is not consistent, so be ready for 2-3 rounds of "are you happy with the code you wrote" and defining audit skills crafted for your application domain - like for example RODO/Compliance audit etc.
I find that brainstorming + (executing plans OR subagent driven development) is way more reliable than the built-in tooling.
One addition that's worked well for me: keeping a persistent context file that the model reads at the start of each session. Instead of re-explaining the project every time, you maintain a living document of decisions, constraints, and current state. Turns each session into a continuation rather than a cold start.
The biggest productivity gain isn't in the code generation itself — it's in reducing the re-orientation overhead between sessions.
https://github.com/srid/AI/blob/master/commands/plan.md#2-pl...
It works very similar to Antigravity's plan document comment-refine cycle.
Sometimes when doing a big task I ask Claude to implement each phase separately and review the code after each step.
Experimentally, I've been using mfbt.ai [https://mfbt.ai] for roughly the same thing in a team context. It lets you collaboratively nail down the spec with AI before handing off to a coding agent via MCP.
Avoids the "everyone has a slightly different plan.md on their machine" problem. Still early days but it's been a nice fit for this kind of workflow.
https://github.com/mbcrawfo/vibefun/tree/main/.claude/archiv...
Never let Claude write code until you’ve reviewed, *fully understood* and approved a written plan.
In my experience, the beginning of chaos is the point at which you trust that Claude has understood everything correctly and claims to present the very best solution. At that point, you leave the driver's seat.
The key insight that most people miss: this isn't a new workflow invented for AI - it's how good senior engineers already work. You read the code deeply, write a design doc, get buy-in, then implement. The AI just makes the implementation phase dramatically faster.
What I've found interesting is that the people who struggle most with AI coding tools are often junior devs who never developed the habit of planning before coding. They jump straight to "build me X" and get frustrated when the output is a mess. Meanwhile, engineers with 10+ years of experience who are used to writing design docs and reviewing code pick it up almost instantly - because the hard part was always the planning, not the typing.
One addition I'd make to this workflow: version your research.md and plan.md files in git alongside your code. They become incredibly valuable documentation for future maintainers (including future-you) trying to understand why certain architectural decisions were made.
2. Have the agent review if it followed the plan and relevant skills accurately.
Here is another one, which had about 200 tokens, and Opus decided to change the model name I requested.
https://x.com/xundecidability/status/2005647216741105962?s=2...
Opus is bad at instruction following now.
On the PR review front, I give Claude the ticket number and the branch (or PR) and ask it to review for correctness, bugs and design consistency. The prompt is always roughly the same for every PR. It does a very good job there too.
Modelwise, Opus 4.6 is scary good!
And “Don’t change this function signature” should be enforced not by anticipating that your coding agent “might change this function signature so we better warn it not to”, but rather via an end-to-end test that fails if the function signature is changed (because the other code that needs it not to change now has an error). That takes the author out of the loop: they don't have to watch for the change in order to issue said correction, and can instead sip coffee while the agent observes that it caused a test failure and corrects it without intervention, probably by rolling back the function signature change and changing something else.
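A minimal sketch of that idea (not the commenter's actual code): the test calls the function the way downstream code does, so a signature change breaks the test instead of relying on a prompt warning. `parse_order` is a hypothetical stand-in; in a real project it would be imported from the codebase.

```python
# Hypothetical function under test; imagine it lives in the project code.
def parse_order(order_id: str, currency: str = "USD", strict: bool = False) -> dict:
    return {"id": order_id, "currency": currency, "strict": strict}

def test_parse_order_contract():
    # Calls the function exactly the way downstream code does; if an agent
    # changes the signature, this call raises TypeError and the test fails,
    # so the agent sees the breakage and rolls the change back on its own.
    order = parse_order("order-42", currency="EUR", strict=True)
    assert order["id"] == "order-42"
    assert order["currency"] == "EUR"
```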
Sadly my post didn't get much attention at the time.
https://thegroundtruth.media/p/my-claude-code-workflow-and-p...
I maintain two directories: "docs/proposals" (for the research md files) and "docs/plans" (for the planning md files). For complex research files, I typically break them down into multiple planning md files so claude can implement one at a time.
A small difference in my workflow is that I use subagents during implementation to avoid context from filling up quickly.
Even if the product doesn’t resonate I think I’ve stumbled on some ideas you might find useful^
I do think spec-driven development is where this all goes. Still making up my mind though.
This inspired me to finally write good old playwright tests for my website :).
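Something along these lines, assuming the Python Playwright bindings; the URL and expected title are hypothetical placeholders for the site in question:

```python
# A minimal end-to-end smoke test with Playwright's sync API.
from playwright.sync_api import sync_playwright

def test_homepage_loads_and_has_title():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("http://localhost:3000")  # assumed local dev server
        assert "My Site" in page.title()    # assumed page title
        browser.close()
```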
I wonder why you don't remove it yourself. Aren't you already editing the plan?
"Most developers type a prompt, sometimes use plan mode, fix the errors, repeat. "
Does anyone think this is as epic as, say, watching the Unix archives https://www.youtube.com/watch?v=tc4ROCJYbm0 where Brian demos how pipes work, or Dennis working on C and UNIX? Or, even before those, the older machines?
I am not at all saying that AI tools are all useless, but there is no real epicness. It is just autogenerated AI slop and blob. I don't really call this engineering (although I also do agree, that it is engineering still; I just don't like using the same word here).
> never let Claude write code until you’ve reviewed and approved a written plan.
So the junior-dev analogy is quite apt here.
I tried to read the rest of the article, but I just got angrier. I never had that feeling watching old-school legends (though perhaps some of their work may be boring), but this AI-generated code ... that's just some mythical random-guessing work. And none of that is "intelligent", even if it may appear to work, and may work to some extent too. This is a simulation of intelligence. If it works very well, why would any software engineer still be required? Supervising would only be necessary if the AI produces slop.
> ...
> never let Claude write code until you’ve reviewed and approved a written plan
I certainly always work towards an approved plan before I let it loose on changing the code. I just assumed most people did, honestly. Admittedly, sometimes there's "phases" to the implementation (because some parts can be figured out later and it's more important to get the key parts up and running first), but each phase gets a full, reviewed plan before I tell it to go.
In fact, I just finished writing a command and instruction to tell claude that, when it presents a plan for implementation, offer me another option; to write out the current (important parts of the) context and the full plan to individual (ticket specific) md files. That way, if something goes wrong with the implementation I can tell it to read those files and "start from where they left off" in the planning.
We all tend to regress to average (same thoughts/workflows)...
Have had many users already doing the exact same workflow with: https://github.com/backnotprop/plannotator
Speckit is worth trying as it automates what is being described here, and with Opus 4.6 it's been a kind of BC/AD moment for me.
Which maybe has to do with people wanting to show how they use Claude Code in the comments!
except that I put notes on the plan document in a single message like:
> plan quote
my note
> plan quote
my note
otherwise, I'm not sure how to guarantee that the AI won't confuse my notes with its own plan. One new thing for me is to review the todo list; I was always relying on the auto-generated todo list.
Do you markup and then save your comments in any way, and have you tried keeping them so you can review the rules and requirements later?
However, Opus made me rethink my entire workflow. Now, I do it like this:
* PRD (Product Requirements Document)
* main.py + requirements.txt + readme.md (I ask for minimal, functional, modular code that fits the main.py)
* Ask for a step-by-step ordered plan
* Ask to focus on one step at a time
The super powerful thing is that I don’t get stuck on missing accounts, keys, etc. Everything is ordered and runs smoothly. I go rapidly from idea to working product, and it’s incredibly easy to iterate if I figure out new features are required while testing. I also have GLM via OpenCode, but I mainly use it for "dumb" tasks.
Interestingly, for reasoning capabilities regarding standard logic inside the code, I found Gemini 3 Flash to be very good and relatively cheap. I don't use Claude Code for the actual coding, because forcing everything via chat into a main.py encourages minimal code that's easy to skim; it gives me a clearer representation of the feature space.
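As a hypothetical illustration of the "minimal, functional, modular code that fits in main.py" shape described above (the pipeline steps and names here are made up, not from the comment):

```python
# One small, reviewable function per plan step keeps the single file skimmable.
import argparse

def fetch_data(source: str) -> list[dict]:
    """Step 1 of the plan: load raw records (stubbed here)."""
    return [{"id": 1, "source": source}]

def transform(records: list[dict]) -> list[dict]:
    """Step 2: a tiny, testable transformation."""
    return [{**r, "processed": True} for r in records]

def main() -> None:
    parser = argparse.ArgumentParser(description="demo pipeline")
    parser.add_argument("--source", default="local")
    args = parser.parse_args()
    print(transform(fetch_data(args.source)))

if __name__ == "__main__":
    main()
```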
Right now when Claude Code (or any agent) executes a plan, it typically has the same broad permissions for every step. But ideally, each execution step should only have access to the specific tools and files it needs — least privilege, applied to AI workflows.
I've been experimenting with declarative permission manifests for agent tasks. Instead of giving the agent blanket access, you define upfront what each skill can read, write, and execute. Makes the planning phase more constrained but the execution phase much safer.
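A rough sketch of what I mean, as plain Python rather than anything Claude Code supports natively; the step names, paths, and commands are all hypothetical:

```python
# Hypothetical declarative per-step permission manifest with a deny-by-default
# checker: each plan step declares what it may read, write, and execute.
from dataclasses import dataclass, field
from fnmatch import fnmatch

@dataclass
class StepPermissions:
    read: list[str] = field(default_factory=list)     # glob patterns
    write: list[str] = field(default_factory=list)
    execute: list[str] = field(default_factory=list)  # allowed commands

MANIFEST = {
    "generate-migration": StepPermissions(
        read=["db/schema.sql", "src/models/*.py"],
        write=["db/migrations/*.sql"],
        execute=["sqlfluff"],
    ),
}

def allowed(step: str, action: str, target: str) -> bool:
    perms = MANIFEST.get(step)
    if perms is None:
        return False  # unknown step: deny by default
    return any(fnmatch(target, pattern) for pattern in getattr(perms, action))

# The migration step may write a migration file, but not touch the models.
assert allowed("generate-migration", "write", "db/migrations/0002_add_users.sql")
assert not allowed("generate-migration", "write", "src/models/user.py")
```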
Anyone else thinking about this from a security-first angle?
There are whole products wrapped around this common workflow already (like Augment Intent).
1. Spec
2. Plan
3. Read the plan & tell it to fix its bad ideas.
4. (NB) Critique the plan (loop) & write a detailed report
5. Update the plan
6. Review and check the plan
7. Implement plan
Detailed here:
I pretty much agree with that. I use long sessions and stopped trying to optimize the context size, the compaction happens but the plan keeps the details and it works for me.
FWIW I have had significant improvements by clearing context then implementing the plan. Seems like it stops Claude getting hung up on something.
The important thing is to have a conversation with Claude during the planning phase and don't just say "add this feature" and take what you get. Have a back and forth, ask questions about common patterns, best practices, performance implications, security requirements, project alignment, etc. This is a learning opportunity for you and Claude. When you think you're done, request a final review to analyze for gaps or areas of improvement. Claude will always find something, but starts to get into the weeds after a couple passes.
If you're greenfield and you have preferences about structure and style, you need to be explicit about that. Once the scaffolding is there, modern Claude will typically follow whatever examples it finds in the existing code base.
I'm not sure I agree with the "implement it all without stopping" approach and let auto-compact do its thing. I still see Claude get lazy when nearing compaction, though has gotten drastically better over the last year. Even so, I still think it's better to work in a tight loop on each stage of the implementation and preemptively compacting or restarting for the highest quality.
Not sure that the language is that important anymore either. Claude will explore the existing codebase on its own at unknown resolution, but if you say "read the file" it works pretty well these days.
My suggestions to enhance this workflow:
- If you use a numbered phase/stage/task approach with checkboxes, it makes it easy to stop/resume as-needed, and discuss particular sections. Each phase should be working/testable software.
- Define a clear numbered list workflow in CLAUDE.md that loops on each task (run checks, fix issues, provide summary, etc).
- Use hooks to ensure the loop is followed (a sketch of the kind of check script a hook could run follows this list).
- Update spec docs at the end of the cycle if you're keeping them. It's not uncommon for there to be some divergence during implementation and testing.
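For the hooks/loop items above, a minimal, hypothetical sketch of the per-task check script a CLAUDE.md loop or a post-edit hook could call; the specific commands (ruff, pytest) are assumptions, not part of the original comment:

```python
# Runs the project's checks in order and exits non-zero on the first failure,
# which is the signal for the agent to fix issues and re-run the loop.
import subprocess
import sys

CHECKS = [
    ["ruff", "check", "."],  # lint
    ["pytest", "-q"],        # tests
]

def main() -> int:
    for cmd in CHECKS:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"check failed: {' '.join(cmd)}", file=sys.stderr)
            return result.returncode
    print("all checks passed")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```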
Claude Code now creates persistent markdown plan files in ~/.claude/plans/ and you can open them with Ctrl-G to annotate them in your default editor.
So plan mode is not ephemeral any more.
The AI first works with you to write requirements, then it produces a design, then a task list.
This helps the AI make smaller chunks to work on; it will work on one task at a time.
I can let it run for an hour or more in this mode. Then there is lots of stuff to fix, but it is mostly correct.
Kiro also supports steering files: files that try to lock the AI in to common design decisions.
The price is that a lot of the context is used up by these files, and Kiro constantly pauses to reset the context.
This has changed in the last week, for 3 reasons:
1. Claude Opus. It’s the first model where I haven’t had to spend more time correcting things than it would’ve taken me to just do it myself. The problem is that Opus chews through tokens, which led to...
2. I upgraded my Claude plan. Previously, on the regular plan, I’d get about 20 mins before running out of tokens for the session and then needing to wait a few hours to use it again. It was fine for little scripts or toy apps but not feasible for the regular dev work I do. So I upgraded to 5x, which got me 1-2 hours per session before tokens expired. That was better but still a frustration. Wincing at the price, I upgraded again to the 20x plan, and this was the next game changer. I had plenty of spare tokens per session, and at that price it felt like they were being wasted, so I ramped up my usage.
Following a similar process to the OP, but with a plans directory (subdirectories for backlog, active, and complete plans) and skills with strict rules for planning, implementing, and completing plans, I now have 5-6 projects on the go. While I’m planning a feature on one, the others are implementing. The strict plans and controls keep them on track, and I have follow-up skills for auditing quality and performance. I still haven’t hit token limits for a session, but I’ve almost hit my token limit for the week, so I feel like I’m getting my money’s worth. In that sense, spending more has forced me to figure out how to use more.
3. The final piece of the puzzle is using opencode over Claude Code. I’m not sure why, but I just don’t gel with Claude Code. Maybe it’s all the sautéing and flibertygibbering, maybe it’s all the permission asking, maybe it’s that it doesn’t show what it’s doing as much as opencode. Whatever it is, it just doesn’t work well for me. Opencode, on the other hand, is great. It shows what it’s doing and how it’s thinking, which makes it easy for me to spot when it’s going off track and correct early.
Having a detailed plan, and correcting and iterating on the plan, is essential. Making Claude follow the plan is also essential - but there’s a line. Too fine-grained and it’s not as creative at solving problems. Too loose/high-level and it makes bad choices and goes in the wrong direction.
Is it actually making me more productive? I think it is but I’m only a week in. I’ve decided to give myself a month to see how it all works out.
I don’t intend to keep paying for the 20x plan unless I can see a path to using it to earn me at least as much back.
I burned through $10 on Claude in less than an hour. I only have $36 a day at $800 a month (800/22 working days)
It doesn’t seem controversial that the model that can solve more complex problems (that you admit the cheaper model can’t solve) costs more.
For the things I use it for, I’ve not found any other model to be worth it.
Have you tried Codex with OpenAi’s latest models?
My current Claude subscription is a sunk cost for the next month. Maybe I’ll try Codex if Claude doesn’t lead anywhere.
I can switch back and forth and use the MD file as shared context.
https://code.claude.com/docs/en/amazon-bedrock
The second fallback, if it is for a customer project, is to use their AWS account to do the development for them.
Given the rate my company charges for me - my level as an American-based staff consultant is the highest bill rate at the company - they are happy to let us use Claude Code with their AWS credentials. Besides, if we are using AWS Bedrock-hosted Anthropic models, they know none of their secrets are going to Anthropic. They already have the required legal confidentiality/compliance agreements with AWS.
https://github.com/obra/superpowers https://github.com/jlevy/tbd
The team that has developers closest to the customer usually makes the better product...or has the better product/market fit.
Then it's iteration.
Genuinely: no one really knows how humans work either.
This back and forth between the two agents, with me steering the conversation, elevates Claude Code to the next level.
I'm not this structured yet, but I often start with having it analyse and explain a piece of code, so I can correct it before we move on. I also often switch to an LLM that's separate from my IDE because it tends to get confused by sprawling context.
How is this noteworthy other than to spark a discussion on hn? I mean I get it, but a little more substance would be nice.
give it a try: https://skills.sh/doubleuuser/rlm-workflow/rlm-workflow
* I ask the LLM for its understanding of a topic or an existing feature in the code. It's not really planning; it's more like understanding the model first
* Then based on its understanding, I can decide how great or small to scope something for the LLM
* An LLM showing good understanding can deal with a big task fairly well.
* An LLM showing bad understanding still needs to be prompted to get it right
* What helps a lot is reference implementations. Either I have existing code that serves as the reference or I ask for a reference and I review.
A few folks at my work do it the OP's way, but my arguments for not doing it this way are:
* Nobody is measuring the amount of slop within the plan. We only judge the implementation at the end
* it's still non-deterministic - folks will have different experiences using the OP's methods. If Claude updates its model, it may invalidate the OP's suggestions by making them either better or worse. We don't evaluate when things get better; we only focus on things that didn't go well.
* it's very token heavy - LLM providers insist that you use many tokens to get the task done. It's in their best interest to get you to do this. For me, LLMs should be powerful enough to understand context with minimal tokens because of the investment into model training.
Both ways get the task done, and it just comes down to my preference for now.
For me, the LLM is model training + post-processing + input tokens = output tokens. I don't think this is the best way to do non-deterministic software development. We're still trying to shoehorn "old" deterministic programming into a non-deterministic LLM.
However, there is a caveat. LLMs resist ambiguity about authority. So the "PCL" or whatever you want to call it, needs to be the ONE authoritative place for everything. If you have the same stuff in 3 different files, it won't work nearly as well.
Bonus Tip: I find long prompt input with example code fragments and thoughtful descriptions work best at getting an LLM to produce good output. But there will always be holes (resource leaks, vulnerabilities, concurrency flaws, etc). So then I update my original prompt input (keep it in a separate file PROMPT.txt as a scratch pad) to add context about those things maybe asking questions along the way to figure out how to fix the holes. Then I /rewind back to the prompt and re-enter the updated prompt. This feedback loop advances the conversation without expending tokens.
In my experience, the best scenario is when the instructions and plan are human-written and detailed.
Just skip to the Ai stand-ups
1. First vibecode software to figure out what you want
2. Then throw it out and engineer it
It looks verbose but it defines the requirements based on your input, and when you approve it then it defines a design, and (again) when you approve it then it defines an implementation plan (a series of tasks.)
The key insight here - that planning and execution should be distinct phases - applies to productivity tools too. I've been using www.dozy.site which takes a similar philosophy: it has smart calendar scheduling that automatically fills your empty time slots with planned tasks. The planning happens first (you define your tasks and projects), then the execution is automated (tasks get scheduled into your calendar gaps).
The parallel is interesting: just like you don't want Claude writing code before the plan is solid, you don't want to manually schedule tasks before you've properly planned what needs to be done. The separation prevents wasted effort and context switching.
The annotation cycle you describe (plan -> review -> annotate -> refine) is exactly how I work with my task lists too. Define the work, review it, adjust priorities and dependencies, then let the system handle the scheduling.
This is my workflow as well, with the big caveat that 80% of 'work' doesn't require substantive planning; we're making relatively straightforward changes.
Edit: there is nothing fundamentally different about 'annotating offline' in an MD vs in the CLI and iterating until the plan is clear. It's a UI choice.
Spec Driven Coding with AI is very well established, so working from a plan, or spec (they can be somewhat different) is not novel.
This is conventional CC use.
I like the idea of having an actual document, because you could compare the before and after versions if you wanted to confirm things changed as intended when you gave feedback.
It comes back to you with an update for verification.
You ask it to 'write the plan' as matter of good practice.
What the author is describing is conventional usage of claude code.
https://github.com/backnotprop/plannotator Plannotator does this really effectively and natively through hooks
Really nice UI, based on the demo.