FilterHN

Ask HN: What's your biggest LLM cost multiplier?

4 points

12 hours ago

| 3 comments

"Tokens per request" has been a misleading cost model for us in production. The real drivers seem to be multipliers: retries/429s, tool fanout, P95 context growth, and safety passes.

What’s been the biggest cost multiplier in your prod LLM systems, and what policies worked (caps, degraded mode, fallback, hard fail)?

▲

zhug3

11 hours ago

[-]

In my experience the biggest multiplier isn't any single variable it's the interaction between them. Fanout × retries × context growth compounds in ways that linear cost models completely miss.

The fix that worked for us: treat budget as a hard constraint, not a target. When you're approaching limit, degrade gracefully (shorter context, fewer tool calls, fallback to smaller model) rather than letting costs explode and cleaning up later.

Also worth tracking: the 90th percentile request often costs 10x the median. A handful of pathological queries can dominate your bill. Capping max tokens per request is crude but effective.

▲

teilom

8 hours ago

[-]

+1 on interaction terms + tails : fanout × retries × context growth is where linear token math dies.

One thing we do in enzu is make “budget as constraint” executable: we clamp `max_output_tokens` from the budget before the call, and in multi-step/RLM runs we adapt output caps downward as the budget depletes (so it naturally gets shorter/cheaper instead of spiraling). When token counting is unavailable we explicitly enter a “budget degraded” mode rather than pretending estimates are exact.

Also agree p90/p95 cost/run matters more than averages; max-output caps are crude but effective.

Docs: https://github.com/teilomillet/enzu/blob/main/docs/PROD_MULT... and https://github.com/teilomillet/enzu/blob/main/docs/BUDGET_CO...

▲

teilom

12 hours ago

[-]

If you’re trying to estimate before prod, logging these 4 things in a pilot gets you 80% there: - tokens/run (in+out) - tool calls/run (and fanout) - retry rate (timeouts/429s) - context length over turns (P50/P95)

Fanout × retries is the classic “bill exploder”, and P95 context growth is the stealth one. The point of “budget as contract” is deciding in advance what happens at limit (degraded mode / fallback / partial answer / hard fail), not discovering it from the invoice.

▲

teilom

12 hours ago

[-]

Background note I wrote (framing + “budget as contract”): https://github.com/teilomillet/enzu/blob/main/docs/BUDGETS_A...