Claude always gets the syntax wrong on my tool calls.
So I did a revolutionary thing and made the error output print helpful guidance on how to correctly call the tool.
The agent tries again and always gets it right. Total time “wasted”: 1-2 seconds. It happens every session, but it only happens once per context window. After that the agent holds on to the lesson.
To do this for your own tool calls, imagine what you’d do in the agent’s place - what info you’d need so you can correct your mistake. Assume the agent wants to achieve the goal so it’ll try again. These are probabilistic systems, so we need to give them an extra loop to get the deterministic bits right.
It honestly depends on the model. For my pi-brains extension for pi
https://github.com/gitsense/pi-brains
I've found after the first hook injection they get it, but there are occasions it can forget, but since everything is driven by hooks, you can inject as often as needed.
The issue with skills is, they are a one time thing, so you really can't use skills to correct haviorial issues.
Have a look at the TDD guard at https://codeleash.dev - the scripts/tdd_log.py arguments are pretty specific but it also has guidance in CLAUDE.md and lots of helpful error messages.
The curl command is extremely popular so models seem to be really good at using it.
Also I like that curl uses a bash syntax and my platform requires JSON payloads; it makes the separation clear to the agent. I find it to be very reliable.
But:
"Now I’m somewhat worried about the track we’re on here. Alternative tool schemas might not just be unfamiliar. They might be implicitly punished by post-training that optimizes for one particular, forgiving tool ecology."
Only implicitly?
--
Many decades ago when I was working on research related to using MOOs as a learning environment, you would add "tool calls" into the stream of text that a MOO object might generate, so your rich client would e.g. show a picture, load a web page in a frame, move you on a map, trigger a change in an on-screen representation of an object.
Everyone who tried this in MUD/MUSH/MOO clients ran into more or less the same problems that LLM clients do: any attempt to shoehorn control sequences into in-band content was riddled with security risks, objects accidentally triggering the wrong interface etc.; you could never truly communicate out-of-band.
The more I read about how agentic harnesses work, the less embarrassed I feel about the code twenty-something-year-old me wrote in a MOO client.
Edit: found it, it’s called Grammar-Constrained Decoding (GCD)
Things are not quite AGI yet; which is why people are now saying that intelligence is the harness + model, because the harness makes up for limitations in generalization.
Doesn't always work, for better performance you can kneel and start begging
Is this still a thing? I thought Anthropic walked back the silent downgrades so now all the different domains downgrade non-silently.
- All models are terrible at generating line numbers for a proper diff, give up on them
- Some models (Owl-alpha) must have been post-trained on Codex transcripts, because they occasionally push its V4A patch format into any diff tool available
- Codex puts a lot of info in its system prompt about the desired patch style, making larger hunks instead of granular ones, etc
Only need ~650 tokens of system prompt for it to work. It’s pretty stellar.
> My strongest hypothesis is that this is not random deterioration but a training artifact. [...] Anthropic’s own client appears to expect and accept a fair amount of slop and repairs it, mostly silently
> If reinforcement learning happens in a harness like that, or a simulation of one, then slightly malformed tool calls can still complete the task and receive reward.
> Worse, the model may become very strongly adapted to the canonical Claude Code edit tool shape.
> Tool schemas are somewhere in the distribution and some shapes are close to what the model saw during post-training and some are far away.
Great article.
Interesting root cause hypothesis. Couldn't one simply strip the slop-handling from the RL env's harness to avoid this though?
I do agree on the walled garden being built here. Proprietary frontier models performing best in proprietary harnesses makes sense for Anthropic's interests.
It's amazing anyone watched the last 2 decades of tech's enshitification and wants to hook their wagon to this shitshow.
I definitely think models may be trained to use particular popular harnesses or expect certain fields in the editing-tool or other tool schemas. Rather than trying to conform to (or force) one particular format, my approach instead is to design flexibly enough to handle a wide array of possible inputs and tool calls, but that also help the agent recover whenever its tool calls truly can't be salvaged and have to return etrors, and to auto-normalize results whenever reasonable to do so. It really does make a very dramatic difference (I wouldn't have bothered to launch if I thought it wasn't a meaningful advance) but anyway, just wanted to share my perspective given that I live and breathe this problem all day, every day.
I have gone back and forth between Claude and Cursor and it is clear Claude just throws the kitchen sink at problems to get an edge. I write MCP tools and I see these exact problems when the inputs and outputs aren't clearly defined, the LLM just guesses and retries.