Show HN: Auditi – open-source LLM tracing and evaluation platform
3 points | 1 hour ago | 0 comments | github.com
I've been building AI agents at work and the hardest part isn't the prompts or orchestration – it's answering "is this agent actually good?" in production.

Tracing tells you what happened. But I wanted to know how well it happened. So I built Auditi – it captures your LLM traces and spans and automatically evaluates them with LLM-as-a-judge + human annotation workflows.

Two lines to get started (after importing the SDK):

  import auditi

  auditi.init(api_key="...")
  auditi.instrument()  # monkey-patches OpenAI/Anthropic/Gemini
Every API call is captured with full span trees, token usage, and costs. No code changes to your existing LLM calls.

The interesting technical bit: the SDK monkey-patches client.chat.completions.create() at runtime (similar to how OpenTelemetry auto-instruments HTTP libraries). It wraps streaming responses with proxy iterators that accumulate content and extract usage from the final chunk – so even streamed responses get full cost tracking without the user doing anything.
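
For anyone curious, here's a rough sketch of that pattern. This is not Auditi's actual code; it assumes the openai>=1.x Python SDK and only wraps a single client instance (the real SDK presumably hooks the libraries globally, since no code changes are needed), but it shows the proxy-iterator trick for streams:

  # Rough illustration only, not Auditi's implementation.
  # Assumes the openai>=1.x SDK and that streamed calls request usage via
  # stream_options={"include_usage": True} so the final chunk carries token counts.
  import time

  from openai import OpenAI

  def instrument_client(client):
      original_create = client.chat.completions.create

      def record_span(started_at, usage):
          # Stand-in for shipping a span to a trace collector.
          print(f"llm span: {time.time() - started_at:.2f}s, usage={usage}")

      def stream_proxy(stream, started_at):
          usage = None
          for chunk in stream:
              # Only the final chunk has usage populated (when include_usage is set).
              if getattr(chunk, "usage", None) is not None:
                  usage = chunk.usage
              yield chunk
          record_span(started_at, usage)

      def patched_create(*args, **kwargs):
          started_at = time.time()
          result = original_create(*args, **kwargs)
          if kwargs.get("stream"):
              # Hand back a proxy so the caller still iterates chunks as usual,
              # while we accumulate usage on the way through.
              return stream_proxy(result, started_at)
          record_span(started_at, getattr(result, "usage", None))
          return result

      client.chat.completions.create = patched_create

  client = OpenAI()
  instrument_client(client)

The proxy is needed because with stream=True the SDK returns a lazy iterator, so latency and cost aren't knowable until the caller has actually consumed the last chunk.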

What makes this different from just tracing:

- Built-in evaluators: 7 managed LLM judges (hallucination, relevance, correctness, toxicity, etc.) run automatically on every trace
- Span-level evaluation: scores each step in a multi-step agent, not just the final output
- Human annotation queues: when you need ground truth, not just vibes
- Dataset export: annotated traces export as JSONL/CSV/Parquet for fine-tuning (example record below)
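
For context on the fine-tuning export: chat-style fine-tuning JSONL (e.g. what OpenAI's fine-tuning endpoint accepts) is one record per line, shaped roughly like the line below. I don't know Auditi's exact export schema, so treat the field layout as illustrative.

  {"messages": [{"role": "user", "content": "Where is my order?"}, {"role": "assistant", "content": "Order #1234 shipped yesterday."}]}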

Self-host with docker compose up.
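
Assuming the usual layout, that's roughly:

  git clone https://github.com/deduu/auditi
  cd auditi
  docker compose up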

I'd love feedback from anyone running AI agents or LLMs in production. What metrics do you actually look at? How do you decide if an agent response is "good enough"?

GitHub: https://github.com/deduu/auditi
