The overall approach I now have for medium-sized tasks is roughly as follows (a rough sketch in code appears after the list):
- Ask the agent to research the particular area of the codebase relevant to the task at hand, listing all relevant/important files and functions, and putting all of this in a "research.md" markdown file.
- Clear the context window
- Ask the agent to put together a project plan, informed by the previously generated markdown file. Store that project plan in a new "project.md" markdown file. Depending on complexity I'll generally do multiple revs of this.
- Clear the context window
- Ask the agent to create a step-by-step implementation plan, leveraging the previously generated research & project files, and put that in a plan.md file.
- Clear the context window
- While there are unfinished steps in plan.md:
-- While the current step needs more work
--- Ask the agent to work on the current step
--- Clear the context window
--- Ask the agent to review the changes
--- Clear the context window
-- Ask the agent to update the plan with their changes and make a commit
-- Clear the context window
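A rough sketch of that loop in code, assuming a hypothetical `run_agent` helper that starts each invocation in a fresh context window; the prompts, stubs, and checkbox convention are all illustrative:

```python
# Rough sketch of the phased workflow above. run_agent() is a hypothetical stub;
# in practice each call would shell out to your coding agent non-interactively,
# which also gives you the "clear the context window" step for free.
from pathlib import Path

def run_agent(prompt: str) -> str:
    print(f"[agent] {prompt}")          # stub: replace with a real agent invocation
    return f"(output for: {prompt})"

def phase(prompt: str, output_file: str) -> None:
    # Each phase runs in a clean context and serializes its result to markdown.
    Path(output_file).write_text(run_agent(prompt))

phase("Research the parts of the codebase relevant to the task; list key files and functions.", "research.md")
phase("Using research.md, draft a project plan.", "project.md")
phase("Using research.md and project.md, write a step-by-step implementation plan.", "plan.md")

def plan_has_unfinished_steps() -> bool:
    return "[ ]" in Path("plan.md").read_text()   # e.g. unchecked markdown checkboxes

def current_step_needs_work() -> bool:
    return False                                  # stub: ask the review agent instead

while plan_has_unfinished_steps():
    while current_step_needs_work():
        run_agent("Work on the current step in plan.md.")    # then clear context
        run_agent("Review the changes against plan.md.")     # then clear context
    run_agent("Update plan.md for the finished step and make a commit.")
```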
I also recommend having specialized sub-agents for each of those phases (research, architecture, planning, implementation, review). Less in terms of telling the agent what to do, and more as a way to add guardrails and structure to the way they synthesize/serialize back to markdown.
I actually think it works better that way: the agent doesn't have to spend as much time rereading code it has just read. I do have several "agents" like you mention, but I just use them one by one in the same chat so they share context. They all write to markdown in case I do want to start fresh if things go in the wrong direction, but that doesn't happen very often.
When you run llama.cpp on your home computer, it holds onto the key-value cache from previous runs in memory. Presumably Claude does something analogous, though on a much larger scale. Maybe Claude holds onto that key-value cache indefinitely, but my naive expectation would be that it only holds onto it for however long it expects you to keep the context going. If you walk away from your computer and resume the context the next day, I'd expect Claude to re-read your entire context all over again.
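For what it's worth, Anthropic's public API does expose explicit prompt caching, which at least hints at the mechanism; a minimal sketch with the `anthropic` Python SDK, where the model id and file name are illustrative (cached prefixes expire after a short TTL, minutes rather than days):

```python
# Sketch only: Anthropic's Messages API lets you mark a stable prefix as cacheable.
# The model id and file name below are placeholders.
import anthropic
from pathlib import Path

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

big_stable_context = Path("big_stable_context.md").read_text()

response = client.messages.create(
    model="claude-sonnet-4-20250514",    # illustrative model id
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": big_stable_context,
        # Later calls that reuse this exact prefix can hit the server-side cache
        # instead of re-reading (re-prefilling) the whole thing.
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "Summarize the open questions."}],
)

# On a follow-up call, usage.cache_read_input_tokens > 0 indicates a cache hit.
print(response.usage)
```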
At best, you're getting some performance benefit from keeping this context going, but you are subjecting yourself to context rot.
Someone familiar with running Claude or industrial-strength SOTA models might have more insight.
That said, the fact that we're all curating these random bits of "LLM whisperer" lore is... concerning. The product is at the same time amazingly good and terribly bad.
As someone who definitely doesn’t know what they’re talking about, I’m going to guess that some analogous optimizations might apply to Claude.
Something something… TPU slice cache locality… gestures vaguely
Not trolling, true question.
Also, while Claude Code is crunching floats, you can do other things (maybe direct another agent instance).
The author literally talks about managing a team of multiple agents, and LLM services' requirement to purchase "tokens" is similar to popping a token into an arcade machine.
"Hacker culture never took root in the AI gold rush because the LLM 'coders' saw themselves not as hackers and explorers, but as temporarily understaffed middle-managers"
Also hacking really doesn’t have anything to do with generating poorly structured documents that compile into some sort of visual mess that needs fixing. Hacking is the analysis and circumvention of systems. Sometimes when hacking we piece together some shitty code to accomplish a circumvention task, but rarely is the code representative of the entire hack. Llms just make steps of a hack quicker to complete. At a steep cost.
My workflow for any IDE, including Visual Studio 2022 w/ Copilot, JetBrains AI, and now Zed w/ Claude Code baked in, is to start a new convo altogether when I'm doing something different, or changing up my initial instructions. It works way better. People are used to keeping a window until the model loses its mind on apps like ChatGPT, but for code, the context window gets packed a lot sooner (remember the tools are sending some code over too), so you need to start over or it starts getting confused much sooner.
I didn't mention it in the blog post but actually experimented a bit with using Claude Code to create specialized agents such as an expert-in-Figma-and-frontend "Design Engineer", but in general found the results worse than just using Claude Code as-is. This also could be a prompting issue though and it was my first attempt at creating my own agents, so likely a lot of room to learn and improve.
Like, I'm sorry, but when I see how much work the advocates are putting into their prompts, the METR paper comes to mind... you're doing more work than coding the "old-fashioned way".
If there's adequate test coverage, and the tests emit informative failures, coding agents can be used as constraint solvers to iterate and make changes, provided you stage your prompts properly, much like staging PRs.
Claude Code is really good at this.
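A tiny illustration of what "informative failures" can look like, in Python/pytest; `apply_discount` and the messages are hypothetical, but the point is that the failure output states the violated constraint instead of a bare "assert failed":

```python
# Hypothetical example: the failure message tells the agent exactly which
# constraint it violated, so it can iterate without guessing.
import pytest
from pricing import apply_discount  # hypothetical module under test

@pytest.mark.parametrize(
    "subtotal, code, expected",
    [
        (100.00, "SAVE10", 90.00),   # 10% off
        (100.00, None, 100.00),      # no code, no discount
        (10.00, "SAVE10", 9.00),
    ],
)
def test_apply_discount(subtotal, code, expected):
    result = apply_discount(subtotal, code)
    assert result == pytest.approx(expected), (
        f"apply_discount({subtotal!r}, {code!r}) returned {result!r}, expected {expected!r}; "
        "discounts apply to the pre-tax subtotal and must never increase the total"
    )
```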
It's highly likely that if you're working with one of the commercial models tuned for code tasks, on one of the commercial platforms marketed to SWEs, instructions to the effect of "you're an expert/experienced engineer" will already be part of the context window.
What does work is to provide clues for the agent to impersonate a clueless idiot on a subject, or a bad writer. It will at least sound like it in the responses.
Those models have been heavily trained with RLHF; if anything, today's LLMs are even more likely to throw out authoritative-sounding predictions, if not in accuracy, then at least in tone.
https://github.com/0xeb/TheBigPromptLibrary/tree/main/System...
Every time I read about people using AI I come away with one question. What if they spent hours with a pen and paper and brainstormed about their idea, and then turned it into an actual plan, and then did the plan? At the very least you wouldn't waste hours of your life and instead enjoy using your own powers of thought.
OP here - I am a bit confused by this response. What are you trying to say or suggest here?
It's not like I didn't have a plan when making changes; I did, and when things went wrong, I tried to debug.
That said, if what you mean by having a plan (which again, I might not be understanding!) is to write myself a product spec and then go build the site by learning to code or using a no/low-code tool, I think that would arguably have been far less efficient and achieved a less ideal outcome.
In this case, I had Figma designs (from our product designer) that I wanted to implement, but I don't have the programming experience or knowledge of Remix as a framework to have been able to "just do it" on my own in a reasonable amount of time without pairing with Claude.
So while I had some frustrating hours of debugging, I still think overall I achieved an outcome (being able to build a site based on a detailed Figma design by pairing with Claude) that I would never have been able to achieve otherwise to that quality bar in that little amount of time.
I find my first branch more and more being `ask claude`. Having to actually think up organic solutions feels more and more annoying.
I’d rather put hours in figuring out what works and what doesn’t to get more value out of my future use.
Embrace that you aren't learning anything useful. Everything you are learning will be redundant in a year's time. Advice on how to make AI effective from 1 year ago is gibberish today. Today you've got special keywords like ultrathink, or advice on when to compact context, that will be gibberish in a year.
Use it, enjoy experimenting and seeing the potential! But no FOMO! There's a point when you need to realize it's not good enough yet, use the few useful bits, put the rest down, and get on with real work again.
Why would I have FOMO? I am literally not missing out.
> All you are doing is learning to deal with the very flaws that have to be fixed for it to be worth anything.
No it is already worth something.
> Embrace that you aren't learning anything useful
No, I am learning useful things.
> There's a point when you need to realize it's not good enough yet
No, it’s good enough already.
Interesting perspective I guess.
If it takes you hours to figure out what's working and what's not, then it isn't good enough. It should just work or it should be obvious when it won't work.
It’s just that you don’t like AI lol.
When LLMs ever reach that point I'll certainly hear about it and gladly use them. In the meantime I let the enthusiasts sort out the problems and glitches first.
> And my expectation on tools is that they help me
LLMs do this for me. You just don’t seem to get the same benefit that I do.
> and not make things more complicated than they already are.
LLMs do not do this for me. Things are already complicated. Just because they’re still complicated with LLMs does not mean LLMs are bad.
> When LLMs ever reach that point I'll certainly hear about it and gladly use them
You are hearing about it now. You’re just not listening because you don’t like LLMs.
I need substance and clear explanations of models, methodology, and concepts, with some visual support. Screenshots of the product are great, but a quick reel or two showing different examples or scenarios may be better.
I'm also skeptical many people who are already technical and already using AI tools will now want to use YOUR tool to conduct simulation based testing instead of creating their own. The deeper and more complex the simulation, the less likely your tool can adapt to specific business models and their core logic.
This is part of the irony of AI and YC startups: LOTS of people creating these interesting pieces of software with AI, when part of the huge moat that AI provides is being able to more quickly create your own software. As it evolves, the SaaS model may face serious trouble except for the most valuable (e.g. complex and/or highly scalable) solutions already available with good value.
However simulations ARE important and they can take a ton of time to develop or get right, so I would agree this could be an interesting market if people give it a chance and it's well designed to support different stacks and business logic scenarios.
> If your ICP is technical, the frontend and marketing shouldn't be overdone IMO.
Great point. The ICP is technical, so this is certainly valid.
> I need substance and clear explanations of models, methodology, and concepts, with some visual support. Screenshots of the product are great, but a quick reel or two showing different examples or scenarios may be better.
We're working hard to get to something folks can try out more easily (hopefully one day Show HN-worthy) and better documentation to go with it. We don't have it yet unfortunately, which is why the site is what it is (for now).
>I'm also skeptical many people who are already technical and already using AI tools will now want to use YOUR tool to conduct simulation based testing instead of creating their own.
Ironically, we'd first assumed simulations would be easy to generate with AI (that's part of why we attempted to do this!), but 18+ months of R&D later, it's turned out to be very challenging to do, never mind replicate.
I do think AI will continue to make building SaaS easier but I think there are certain complex products, simulations included (although we'll see), that are just too difficult to build yourself in most cases.
To some extent, as I think about this, I suppose build vs. buy has always been a question for SaaS, and it's a matter of cost versus effort (and what else you could do with that effort). E.g. do you architect your own database solution or just use Supabase?
> However simulations ARE important and they can take a ton of time to develop or get right, so I would agree this could be an interesting market if people give it a chance and it's well designed to support different stacks and business logic scenarios.
I appreciate this, and it's certainly been our experience! We're still working to get it right, but it's something I'm quite excited about.
> Still, I wouldn’t trust Claude, or any AI agent, to touch production code without close human oversight.
My experience has been similar, and it's why I prefer to keep LLMs separate from my code base. It may take longer than providing direct access, but I find it leads to fewer hidden/obscure bugs that can take hours (and result in a lot of frustration) to fix.
I'm curious how you're managing this - is it primarily by inputting code snippets or abstract context into something like Claude or ChatGPT?
I found that I was usually bad at providing sufficient context when trying to work with the LLM separately from the codebase, though I also might lack the technical background or the appropriate workflow.
I usually provide the initial context by describing the app that I'm working on (language, framework, etc) as well as the feature I want to build, and then add the files (either snippets or upload) that are relevant to build the feature (any includes or other files it will be integrating with).
This keeps the chat context focused, and the LLM still has access to the code it needs to build out the feature without having access to the full code base. If it needs more context (sometimes I'll ask the LLM if it wants access to other files), I'll provide additional code until it feels like it has enough to work with to provide a solution.
It's a little tedious, but once I have the context set up, it works well to provide solutions that are (mostly) bug free and integrate well with the rest of my code.
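If the copy-paste part gets too tedious, the assembly step is easy to script; a minimal sketch, where the preamble and file list are placeholders for your own project:

```python
# Minimal sketch: bundle a hand-picked set of files plus a short app/feature
# description into one paste-ready context block. Paths and preamble are placeholders.
from pathlib import Path

PREAMBLE = "PHP 8 / vanilla JS app; I want to add CSV export to the reports page."
FILES = [
    "src/controllers/ReportsController.php",
    "src/models/Report.php",
]

def build_context(preamble: str, files: list[str]) -> str:
    parts = [preamble]
    for name in files:
        parts.append(f"\n--- {name} ---\n{Path(name).read_text()}")
    return "\n".join(parts)

if __name__ == "__main__":
    print(build_context(PREAMBLE, FILES))  # pipe into your clipboard, e.g. `| pbcopy` on macOS
```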
I primarily work with Perplexity Pro so that I have access to and can switch between all pro level models (Claude, ChatGPT, Grok, etc) plus Google search results for the most up-to-date information.
I haven’t used Perplexity (Pro or otherwise) much at all yet but will have to try.
It indexes files in your repo, but you can control which specific files to include when prompting and keep it very limited/controlled.
However, I do applaud you for being transparent about the AI use by posting it here.
I've never liked the free-tier Claude (Sonnet/Opus) chat sessions I've attempted with code snippets. Claude non-coding chat sessions were good, but I didn't detect anything magical about the model and the code it churned out for me to decide on a Claude Max plan. Nor did Cursor (I'm also a customer), with its partial use of Claude, seem that great. Maybe the magic is mostly in CC the agent...
So, I've been using a modified CC [1] with a modified claude-code-router [2] (on my own server), which exposes an Anthropic endpoint, and a Cerebras Coder account with qwen-3-coder-480b. No doubt Claude models + CC are well greased together, but I think the folks on the Qwen team trained (distilled?) a coding model that is Sonnet-inspired, so maybe that's the reason. I don't know. But the sheer 5x-10x inference speed of Cerebras makes up for any loss in quality versus Sonnet, or from the FP8 quantization of Qwen on the Cerebras side. If starting from zero every few agentic steps is the strategy to use, doing that with Cerebras is just incredible because it's ~instantaneous.
I've tried my Cerebras Coder account with way too many coding agents, and for now CC, Cline (VS Code) and Qwen Code (a Gemini Code fork) are the ones that work best. CC beats the pack as it compresses the context just right and recovers well from Cerebras 429 errors (tpm limit), due to the speed (hitting ~1500 tps typically) clashing with Cerebras' unreasonably tight request limits. When a 429 comes through, CC just holds its breath a few seconds and then goes at it again. Great experience overall!
[1] I've decompiled CC and modified some constants for Cerebras to fix some hiccups
[2] I had to remove some invalid request JSON keys sent by CC through CCR, and add others that were missing
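If anyone wants to replicate this, a quick way to sanity-check an Anthropic-compatible proxy before pointing a coding agent at it is to hit it with the standard `anthropic` Python SDK; the URL, key, and model name below are placeholders for wherever your router is listening:

```python
# Sketch: smoke-test an Anthropic-compatible endpoint (e.g. one exposed by
# claude-code-router). base_url, api_key, and model are placeholders.
import anthropic

client = anthropic.Anthropic(
    base_url="http://localhost:3456",     # wherever the router is listening
    api_key="not-checked-by-local-proxy",
)

resp = client.messages.create(
    model="qwen-3-coder-480b",            # whatever model your router maps to
    max_tokens=64,
    messages=[{"role": "user", "content": "Reply with the single word: ok"}],
)
print(resp.content[0].text)
```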
> for now CC, Cline (VS Code) and Qwen Code (a Gemini Code fork) are the ones that work best
Thanks for sharing how you set this up, as well as which agents you've found work best.
I tried a handful before settling on CC (for now!) but there are so many new ones popping up and existing ones seem to be rapidly changing. I also had a good experience with Cline in VS Code, but not quite as good as CC.
Haven't tried Qwen Code yet (I tried the Gemini CLI but had issues with the usability; the content would frequently strobe while processing, which was a headache to look at).
Yeah, it's definitely a coding agent zoo out there. But you can actually notice the polish in CC. Codex looks promising, with a more bare-bones look; I hope they invest in it. It's OSS and built in Rust, nice.
Qwen has that same jumpy scrolling as Gemini, and too many boxes, but it works well with Cerebras.
Coding agent product managers out there: stop putting boxes around text! There's no need! In Gemini, the boxes are all of different sizes, and it's really ugly to look at. Think about copy-paste: multiline selections get all messed up with vertical box lines. Argh! Codex, which only has a delicate colored border to the left, has a ctrl-t shortcut that should be mandatory in TUIs: transcript mode, a ready-to-copy-paste printout that's fully togglable.
Another area of improvement is how fast and usable the tooling can be. File read/write and patching can really make a difference. Also using the model for different stages of tool calling in parallel, especially if they all get faster like Cerebras. And better compression algos!
The speed and usability points you make are so critical. I'm sure these will continue to improve - and hope they do so soon!
Also, it had some cutoffs with Cerebras - every once in a while it would get a reply and then nothing happens; it just stops there. I think 4xx errors weren't handled well either. The same happens with Codex with a Cerebras provider. Unfortunately there isn't a compelling reason for me to debug that, although I like that Codex is now Rust and OSS, much more fun than decompiling Claude for sure.
That said, I liked that it has sessions, undo, and Plan vs. Code modes ("build", I think it was called); although that's already a non-pattern for most coding agents, it allows me to have, say, an OpenAI API o3 or gpt-5 do some paid planning. But even that is not needed with something like Cerebras that just shoots code out of its butt like there's no tomorrow. Just rinse and repeat until it gets it right.
EDIT: just recalled another thing about opencode that messed me up: not exiting on ctrl-d blows up my mental Unix cognition.
This seems to be a bad practice LLMs have internalized; there should be some indication that there’s more content below the fold. Either a little bit of the next section peeking up, or a little down arrow control.
I vibe coded a marketing website and hit the same issue.
Here’s a decent writeup on the problem and some design options: https://uxdesign.cc/dear-web-designer-let-s-stop-breaking-th...
The scroll bar behavior is an OS-level setting. The default in macOS is to not show scroll bars unless you're actively scrolling.
If you’re using an input device that doesn’t support gestures, or you changed the setting, you’ll see them always.
> "Approved - Ship It!” and ‘Great work on this!”
This pat on the head from an algorithm gives me the creeps, and I'm really struggling to put my finger on why.
Maybe it's because it's emulated approval, yet generating real feelings of pleasure in the author?
I've come to just terminate the session if a phrase like that turns up.
Tbf to Nadia, however, the comment supposedly came from the "code reviewer" agent? So the prompt might've explicitly asked it to make this statement, and it would (hopefully) not be reusing the context of the development (and not the other way around either).
I feel like Claude specifically uses language in what seems like a more playful way. I notice this also when instead of just a loader there are things like “Baking… Percolating…” etc.
I do get the ick if it feels like the agent is trying to be too “human” but in this case I just thought it was funny language for a response to my ask specifically to play a PR reviewer role.
Claude in reviewer mode also had a funny (at least to me) comment about it being interesting that Claude put itself as a PR contributor. I think it's in the screenshot in my blog post (if it got cut off for others, let me know and I can fix it), but it's not called out in the text.
"Good boy! Good PR!"
I've heard increasingly good things about Cursor and Codex, but haven't tried them as recently. Cline (as a VS Code extension) might also be helpful here.
If you need designs, something like v0 could work well. There are a ton of alternatives (Base44, Figma Make, etc.) but I've found v0 works the best personally, although it probably takes a bit of trial and error.
For SEO support specifically, I might just try asking some of the existing AI tooling to help you optimize there, although I'm not sure how good the results would be. I briefly experimented with this and early results seemed promising, but I didn't push on it a lot.
I run Dropbox on my laptop almost entirely as insurance against my laptop breaking or getting stolen before I've committed and pushed my work to git.
If for some hypothetical reason we still were in the era of tarballs, I doubt they'd be as useful.
(also yeah, I have iCloud pretty much for the same reason)
It's been a bit since I tried Cursor and I may need to revisit that as well.
I am converting a WordPress site to a much leaner custom one, including the functionality of all plugins and migrating all the data. I've put in about 20 hours or so and I'd be shocked if I have another 20 hours to go. What I have so far looks and operates better than the original (according to the owner). It's much faster and has more features.
The original site took more than 10 people to build, and many months to get up and running. I will have it up single-handedly inside of 1 month, and it will have much faster load times and many more features. The site makes enough money to fully support 2 families in the USA very well.
My Stack: Old school LAMP. PHPstorm locally. No frameworks. Vanilla JS.
Original process: webchat-based since Sonnet 3.5 came out; I used Gemini a lot after 2.5 Pro came out, but primarily Sonnet.
- Use Claude projects for "features". Give it only the files strictly required to do the specific thing I'm working on.
- Have it read the files closely, "think hard", and make a plan.
- Then write the code.
- MINOR iteration if needed. Sometimes bounce it off of Gemini first.
- The trick was to "know when to stop" using the LLM and just get to coding.
- Copy code into PHPStorm and edit/commit as needed.
- Repeat for every feature (refresh the Claude project each time).
Evolution: Finally take the CLI plunge: Claude Code.
- Spin up a KVM: I'm not taking any chances.
- Run PHPStorm + CC in the KVM as a "contract developer".
- The "KVM developer" cannot push to main.
- Set up claude.md carefully.
- Carefully prompt it with structure, bounds, and instructions.
- Run into lots of quirks with lots of little "fixes":
-- Too verbose.
-- Does not respect "my coding style".
-- Poor adherence to claude.md instructions when over halfway through context, etc.
- Start looking into subagents. It feels like it's not really working?
- Instead: I break my site into logical "features":
-- Terminal Tab 1: "You may only work in X folder"
-- Terminal Tab 2: "You may only work in Y folder"
-- THIS WORKS WELL. I am finally in "HOLY MOLY, I am now unquestionably more productive" territory!
Codex model comes out:
- I open another tab and try it.
- I use it until I hit the "You've reached your limit. Wait 3 hours" message.
- I go back to Claude (Man is this SLOW! and Verbose!). Minor irritation.
- Go back to Codex until I hit my weekly limit.
- Go back to Claude again. "Oh wow, Codex works SO MUCH BETTER for me."
- I actually haven't fussed with the AGENTS.md, nor do I give it a bunch of extra hand-holding. It just works really well by itself.
- Buy the OpenAI Pro plan and haven't looked back.
I haven't "coded" much since switching to Codex and couldn't be happier. I just say "Do this" and it does it. Then I say "Change this" and it does it. On the rare occasions it takes a wrong turn, I simply add a coding comment like "Create a new method that does X and use that instead" and we're right back on track.
We are 100% at a point where people can just "Tell the computer what you want in a web page, and it will work".
And I am SOOOO Excited to see what's next.
I await the good software. Where is the good software?
> I await the good software. Where is the good software?
Exactly this. It looks great on the surface until you dig in and find it using BlinkMacSystemFont and absolute positioning because it can't handle a proper grid layout.
You argue with it and it adds !important everywhere because the concept of cascading style is too much for its context window.
Someone once quipped that AI is like a college kid who has studied a few programming courses, has access to all of Stack Overflow, lives in a world where hours go by in the blink of an eye, has an IQ of 80, and is utterly incapable of learning.
Oh, also, when it broke down and I tried to restart (the data model rewrite) using a context summary, it started going backwards and migrating back to the old data model because it couldn't tell which one was which... sigh.
What languages do you use?
What kind of projects?
Do you maintain these projects or is this for greenfield development?
Could you fix any bugs without Claude?
Are these projects tested, and who writes the tests? If it's Claude, how do you know these tests actually test something sensible?
Is anybody using these projects and what do users think of using these projects?
- HTML, JavaScript, Python, PHP, Rust
What kind of projects?
- Web apps (consumer and enterprise), games
Do you maintain these projects or is this for greenfield development?
- Both, I have my own projects and client projects
Could you fix any bugs without Claude?
- Yes, I have decades of software development experience
Are these projects tested, and who writes the tests? If it's Claude, how do you know these tests actually test something sensible?
- For serious projects yes, I will define the test cases and have Claude build them out along with any additional cases it identifies. I use planning mode heavily before any code gets written.
Is anybody using these projects and what do users think of using these projects?
- Yes, these are real projects in production with real users :) They love them
If you want to keep features in separate Git worktrees, https://conductor.build/ is pretty nice.
> Since our landing page is isolated from core product code, the risk was minimal.
The real question to ask is why your landing page is so complex; it is a very standard landing page with sign-ups, pretty graphics, and links to the main bits of the website, not anything connected to a demo instance of your product or anything truly interactive.
Also, you claim this saved you from having to hire another engineer, but you then reference human feedback catching the LLM garbage being generated in the repo. It sounds like the credit is appropriately shared between yourself, the LLM, and especially the developer who shepherded this behind the scenes.
That said, I was working on implementing a redesign for my startup's website as the project for the experiment - there's no way around that as context.
> The real question to ask is why your landing page is so complex
I disagree on this; I don't think that was an issue. Our landing page would have been very easy for a developer on our team to build, that was never a question.
That said, we're a small startup team with myself, my cofounder / CTO, one engineer, and a design contractor. The two technical folks (my cofounder / CTO and the engineer) are focused on building our core product for the most part. I absolutely agree credit is due to them both for their work!
For this project, they helped me review a couple of my bigger PRs and also helped me navigate our CI/CD, testing, and build processes. I believe I mentioned their help in my blog post explicitly, but if it wasn't clear enough definitely let me know.
My goal in attempting this project was in no way to belittle the effort of actual developers or engineers on our team, whom I highly respect and admire. Instead, it was to share an experiment and my learnings as I tried to tackle our website redesign which otherwise would not have been prioritized.
Also loved how, in CTO mode, it went right away to "approve with minor comments" in the code review. This is too perfectly in character.