What happens if every turn in an agent loop goes through Anthropic’s Batch API instead of the normal synchronous endpoint?
The motivation was cost. The Batch API is 50% off, which sounds very attractive for agent workloads: evals, background research agents, CI agents, unattended subagents, and so on.
The result: it works, but it is awful for a single interactive agent.
In my runs, a one-entry batch usually took ~90–120 seconds to complete, so a five-turn tool loop becomes a ten-minute interaction. Waiting two minutes for the model to decide it needs to run `ls` is not a good UX.
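Under the hood, each turn reduces to submit-then-poll, and the poll loop is where those minutes go. A minimal sketch of that wait loop, with the status fetch injected so it's easy to see the latency floor (in real use the callable would wrap `client.messages.batches.retrieve(batch_id)`; the fake statuses below are illustrative):

```python
import time

def wait_for_batch(fetch_status, poll_interval_s=5.0, timeout_s=300.0):
    """Poll a batch until it finishes; returns the number of polls.

    fetch_status: callable returning the batch's processing status,
    e.g. "in_progress" or "ended". Every turn of the agent loop pays
    this full wait before the next tool call can even be chosen.
    """
    deadline = time.monotonic() + timeout_s
    polls = 0
    while time.monotonic() < deadline:
        polls += 1
        if fetch_status() == "ended":
            return polls
        time.sleep(poll_interval_s)
    raise TimeoutError("batch did not finish in time")

# Simulated: the batch reports "in_progress" twice before ending.
statuses = iter(["in_progress", "in_progress", "ended"])
polls = wait_for_batch(lambda: next(statuses), poll_interval_s=0.0)  # polls == 3
```

With a real ~90–120 s completion time per poll cycle, chaining five of these back to back is exactly the ten-minute interaction described above.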
But that was also the point of the experiment. A single REPL turn is probably the wrong unit to batch.
The interesting version is fleet-level batching:
- many agents running in parallel
- background subagents
- CI/eval jobs
- multiple harnesses sharing a local proxy
- shared prompt prefixes that may benefit from caching
In that world, the batcher should probably sit below the harness as infrastructure. Existing tools keep using the normal API shape, while a proxy decides per request whether it should go sync or async based on latency tolerance.
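As a sketch, the proxy's routing decision could be as simple as comparing a request's latency tolerance against the observed batch-completion floor. All names and thresholds here are hypothetical, not from the repo:

```python
from dataclasses import dataclass

@dataclass
class AgentRequest:
    # Hypothetical metadata a proxy could attach per caller.
    latency_tolerance_s: float  # how long the caller can afford to wait
    model: str

def route(req: AgentRequest, batch_floor_s: float = 120.0) -> str:
    """Send a request down the cheap async path only when the caller
    can absorb the observed batch-completion floor; otherwise sync."""
    return "batch" if req.latency_tolerance_s >= batch_floor_s else "sync"

# An interactive REPL turn stays sync; an overnight eval job batches.
route(AgentRequest(latency_tolerance_s=5.0, model="claude-haiku"))      # "sync"
route(AgentRequest(latency_tolerance_s=3600.0, model="claude-sonnet"))  # "batch"
```

The point of putting this below the harness is that callers never change: they speak the normal API shape, and only the proxy knows some of their traffic quietly went through a batch.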
One surprising observation: in my small, non-rigorous testing, Haiku batches often felt slower than Sonnet/Opus batches. I wouldn't treat that as a benchmark, but it does suggest routers should measure per-model batch latency rather than assuming "cheap model = batch model."
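Measuring that is cheap: the router could keep a rolling window of batch-completion times per model and route on the observed median instead of the price tier. A hypothetical sketch (the sample numbers are made up to mirror the anecdote, not measurements):

```python
import statistics
from collections import defaultdict, deque

class BatchLatencyTracker:
    """Rolling window of per-model batch-completion latencies (seconds)."""

    def __init__(self, window: int = 50):
        self.samples = defaultdict(lambda: deque(maxlen=window))

    def record(self, model: str, seconds: float) -> None:
        self.samples[model].append(seconds)

    def median(self, model: str) -> float:
        return statistics.median(self.samples[model])

tracker = BatchLatencyTracker()
# Made-up observations: the "cheap" model is not the fast batch model.
for s in (140, 155, 150):
    tracker.record("haiku", s)
for s in (95, 100, 110):
    tracker.record("sonnet", s)
tracker.median("haiku") > tracker.median("sonnet")  # True for these samples
```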
Repo is here:
https://github.com/erans/batching-harness
It is intentionally small: one Python file, a basic tool loop, a local shell tool, a stats panel, and minimal sandboxing.
The useful lesson for me was:
Batch API is terrible as an interaction pattern for one agent. It might be very useful as a hidden optimization layer for a fleet of agents.