Show HN: Ran an AI agent 100x – pass rate 70%, not 100%
I tested Claude 3 Haiku on "What is 247 * 18?" across 100 trials. Pass rate: 70%. 95% CI: 48%-85%. A task any calculator solves 100% of the time.

This is the core problem with agent evals today: one run tells you nothing. The same prompt, same model, same tools — different result every time.
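
To make that concrete: with a true pass rate around 70%, any single run is basically a coin flip, and only repeated trials recover the underlying rate. A toy simulation in plain Python (nothing agentrial-specific):

  import random

  random.seed(0)
  true_pass_rate = 0.7
  runs = [random.random() < true_pass_rate for _ in range(100)]
  # Each individual run is just True or False; only the aggregate
  # across many trials approximates the 0.7 underlying rate.
  print(sum(runs) / len(runs))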

I built agentrial to fix this. It's a pytest-style CLI that runs your agent N times and gives you:

- Wilson confidence intervals on pass rate (a quick sketch of the math follows this list)
- Step-level failure attribution (Fisher exact test pinpoints which tool call or reasoning step diverges between pass/fail runs; sketched below)
- Real API cost from response metadata
- A GitHub Action that blocks PRs when reliability drops
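
The Wilson interval is just the standard score interval for a binomial proportion; a minimal sketch of the math (not the tool's internal code):

  import math

  def wilson_interval(successes, trials, z=1.96):
      """Approximate 95% Wilson score interval for a pass rate."""
      p = successes / trials
      denom = 1 + z ** 2 / trials
      center = (p + z ** 2 / (2 * trials)) / denom
      half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
      return center - half, center + half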

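The step-level attribution amounts to a 2x2 contingency test per step: runs where that step diverged vs. not, crossed with pass vs. fail. Sketched with scipy and made-up counts (not real output):

  from scipy.stats import fisher_exact

  # Hypothetical counts for one tool-call step across 100 trials:
  # rows = step matched the passing trajectory / step diverged
  # cols = run passed / run failed
  table = [[62, 8],
           [8, 22]]
  odds_ratio, p_value = fisher_exact(table)
  # A small p_value flags this step as strongly associated with failures.
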
Usage is minimal. Write a YAML config, then run "agentrial run":

  pip install agentrial
  agentrial run

Tested extensively with LangGraph agents. 100 trials cost $0.06. MIT licensed, no telemetry, runs locally.

Looking for feedback on what metrics matter most when you're shipping agents to production.
