But I absolutely can't see how feeding the entire context into a more expensive model multiple times per task, just to propose context edits that might indirectly help, could ever be worthwhile.
I think more what's missing here is the comparison of different tries, from the same head. And there prompt caching does help!
I followed this tutorial earlier today and I'm having a lot of fun with it.
https://gist.github.com/a-n-d-a-i/cb5e929b4c87b8d185760d0264...
I added a 2nd while loop so that it takes user input. And vendored my tiny llm lib (so it's 150 lines now, and dependency free :)
---
As for context-sculpting, the economics are different when not touching the context gives you the >98% discount everyone's doing now. (Although it might be worth fiddling with the suffix... not sure yet!)
e.g. this issue: "ToolSearch saves ~15K tokens per request in prompt size, but at the cost of breaking prefix-based caching for models like DeepSeek that rely on stable prefixes. For heavy users of DeepSeek through OpenRouter, the savings from smaller prompts are dwarfed by the increased cost from cache misses."
What tends to happen is that I’ll reach a junction with Claude where I’m reading its output and starting to manually make changes yeah?. The output is usually already very good, close enough to what I want that it's way more efficient for both of us if I just carry it the rest of the way myself. Or I 'ruin' the clean flow by veering off into questions, clarifications, or broader changes. After a while of this back-and-forth review process, after definitely attaining an 'improvement checkpoint', I scroll back up to what I think of as a 'conversational context checkpoint' and then summarise all the changes we’ve converged on as succinctly as possible including the 'why' (verrry important to always tell them the why). I edit that checkpoint message, and that becomes a new branch of the conversation starting from there. All the noise, all the intermediate negotiation, disappears, and what remains is a compressed version of only what I want the model to carry forward.
Sometimes, if the changes are too extensive to easily summarise, I go even further back, one message above where Claude started generating files. I edit that instead, rebuild the context with only what matters, and then present the corrected output as if I authored it myself sometimes even saying something like, 'So Claude I tries to implement this myself, what do you think? Does it match what we agreed on? Do you disagree with anything or have questions or see issues?'. It works surprisingly well up to even catching new subtler problems in that improved code. It’s very close to your idea of context sculpting, just executed through manual, human-intuited branching rather than model-side compaction.
The interesting constraint I want to point out is judgement. The model doesn’t really have it in the human sense. Knowing what is important is exactly the hard part. Prompting it into the right shape figuring that out for itself is almost like an art form, whereas for the human there’s already an intuitive sense of... salience? running underneath everything. That's the sauce really. You, the human, are the most efficient and effective OUTER_MODEL
Another difference is structural. In this workflow, you are effectively editing your own message and then forking the conversation from that point downwards. It behaves more like a git branch than a linear chat.
I’ve never fully bought into agents either, at least not in the sense of something that meaningfully replaces the human in the loop. I don't want to either 'cause what then? Even if agents work, the question becomes who maintains the code, who understands it deeply enough to extend it without decay. So even this manual technique still depends on full human involvement. Each prompt is modular, structured, information dense, and intentionally non-open-ended. That non-open-endedness is like avoiding a circular dependency where groups of related contexts are littered all over and yu can't pick a proper checkpoint for the chunk of work you're currently doing. This again assumes you're solving problems in logical bite-sizes anyway.
But I consider prompting as less like conversation and more like maintaining modular, versioned intent. It should feel like a sciency discipline rather than purely improvisation Al. Steer it, don't let it steer and never append either; wipe the slate with an edit or new conversation but an edit keeps more of the nuances of that specific model instance. It's very interesting research you've done here. Well done to yu and codex!
EDIT: typo!