Show HN: IncidentFox, AI SRE that auto-builds its own integrations (open source)
We've been building an AI agent that debugs production incidents, and we think most AI SRE tools are solving the wrong problem.

Everyone's focused on the reasoning: better prompts, better models, RAG over runbooks. But in our experience, the bottleneck isn't the AI's ability to think. It's access to data. At any company past 50 engineers, half the context you need during an incident lives in internal tools with no public docs: custom deploy systems, homegrown dashboards, internal CLIs. No vendor integration covers those.

So we focused on the integration problem. During setup, IncidentFox reads your codebase and Slack history and auto-generates tool integrations, including for internal tools it's never seen before. It figures out what APIs exist, how they're called, and builds the connectors. After each incident, it learns which data sources were actually useful and refines the integrations accordingly.
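
To make "auto-generates tool integrations" concrete, here's a rough sketch of the kind of connector the agent might synthesize after noticing an internal deploy CLI in your repo and Slack history. All names here (deployctl, ToolSpec, run_tool) are illustrative assumptions, not the actual format IncidentFox emits:

    # Hypothetical sketch of an auto-generated connector for an internal deploy CLI.
    # deployctl / ToolSpec / run_tool are placeholders, not IncidentFox's API.
    import json
    import subprocess
    from dataclasses import dataclass

    @dataclass
    class ToolSpec:
        name: str
        description: str              # surfaced to the agent for tool selection
        command_template: list[str]   # inferred from repo scripts / Slack usage

    # Connector the agent might synthesize after seeing `deployctl` used in #incidents
    recent_deploys = ToolSpec(
        name="list_recent_deploys",
        description="Recent deploys for a service (source: internal deployctl CLI)",
        command_template=["deployctl", "list", "--service", "{service}",
                          "--since", "30m", "--json"],
    )

    def run_tool(spec: ToolSpec, **kwargs) -> list[dict]:
        # Fill in the template, shell out, and hand structured output to the agent.
        cmd = [part.format(**kwargs) for part in spec.command_template]
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        return json.loads(out)

    # e.g. run_tool(recent_deploys, service="payments")

The point is that each connector ends up as a thin, generated wrapper the agent can call during an investigation, rather than something an engineer hand-configures.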

The agent itself is straightforward: alert fires → form hypotheses → query observability stack → correlate → report findings in Slack. For example, if latency spikes on your payments service, it'll pull the deploy from 20 minutes ago, find the relevant code change, check downstream dependencies, and open a PR with a fix.
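
In code form, that loop looks roughly like the sketch below. It's a minimal, self-contained illustration of alert -> hypotheses -> query -> correlate -> report; every function and field name is a placeholder for illustration, not IncidentFox internals.

    # Minimal sketch of the investigation loop described above (illustrative only).
    from dataclasses import dataclass

    @dataclass
    class Alert:
        service: str
        signal: str          # e.g. "p99 latency spike"
        window_minutes: int

    def form_hypotheses(alert: Alert) -> list[str]:
        # In the real agent these come from the model; hardcoded here for illustration.
        return [f"recent deploy regressed {alert.service}",
                f"a downstream dependency of {alert.service} is degraded"]

    def gather_evidence(hypothesis: str, alert: Alert) -> dict:
        # Would fan out to the auto-generated connectors (deploy history, traces, logs).
        return {"hypothesis": hypothesis, "supporting_signals": []}

    def correlate(evidence: list[dict]) -> dict:
        # Rank hypotheses by how much evidence supports them; trivial stand-in here.
        return max(evidence, key=lambda e: len(e["supporting_signals"]))

    def investigate(alert: Alert) -> None:
        evidence = [gather_evidence(h, alert) for h in form_hypotheses(alert)]
        finding = correlate(evidence)
        print(f"[#incident thread] {alert.service}: "
              f"most likely cause -> {finding['hypothesis']}")

    investigate(Alert(service="payments", signal="p99 latency spike", window_minutes=20))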

Three design decisions we're opinionated about:

1. Slack-native, not dashboard-native. At 3am you don't want another tab. Paste screenshots, drop log files, view traces — all in the thread.

2. Team-customizable agents. The SRE team's agent should behave differently from the platform team's. Engineers can configure prompts and build team-specific tools (a sketch of what that configuration could look like follows this list). One-size-fits-all doesn't work for incident response.

3. No manual MCP server setup. If you need to spend weeks configuring integrations before AI is useful, you've already lost.
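
On point 2, here's a hedged sketch of per-team configuration. The schema and keys are assumptions for illustration; the actual config format in the repo may differ.

    # Hypothetical per-team agent configs (illustrative schema, not the project's format).
    SRE_TEAM_AGENT = {
        "team": "sre",
        "system_prompt": (
            "You triage production incidents. Prefer rollback over forward-fix "
            "unless the change is a one-line config edit."
        ),
        "tools": ["list_recent_deploys", "query_prometheus", "fetch_pagerduty_timeline"],
    }

    PLATFORM_TEAM_AGENT = {
        "team": "platform",
        "system_prompt": (
            "You debug Kubernetes and networking issues. Check node pressure and "
            "recent cluster upgrades before blaming application code."
        ),
        "tools": ["kubectl_describe", "query_prometheus", "list_cluster_events"],
    }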

Apache 2.0, fully self-hostable. Try it in our Slack (https://join.slack.com/t/incidentfox/shared_invite/zt-3ojlxv...) with real telemetry (no setup), or clone and run it yourself.

The technical piece we find most interesting: given enough incidents, the agent builds an emergent model of your entire infrastructure — which services are fragile, which deploys are risky, which alerts are noise. Nobody explicitly teaches it this. It just falls out of the investigation data. We think there's something deeper here about AI agents that learn system topology from failure patterns.
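
One way to picture that emergent model, as a hedged sketch with placeholder names rather than our actual data structures: per-service and per-alert statistics accumulated from past investigations.

    # Illustrative sketch of how topology knowledge could accumulate from incident data.
    # Field names and heuristics are assumptions, not IncidentFox internals.
    from collections import defaultdict

    class InfraModel:
        def __init__(self):
            self.incidents_per_service = defaultdict(int)
            self.deploy_caused = defaultdict(int)      # incidents traced to a deploy
            self.alert_fired = defaultdict(int)
            self.alert_actionable = defaultdict(int)

        def record_investigation(self, service, caused_by_deploy, alert_name, actionable):
            self.incidents_per_service[service] += 1
            if caused_by_deploy:
                self.deploy_caused[service] += 1
            self.alert_fired[alert_name] += 1
            if actionable:
                self.alert_actionable[alert_name] += 1

        def deploy_risk(self, service):
            # Fraction of this service's incidents attributed to its own deploys.
            total = self.incidents_per_service[service]
            return self.deploy_caused[service] / total if total else 0.0

        def noisy_alerts(self, threshold=0.2):
            # Alerts that rarely lead to real findings.
            return [a for a, fired in self.alert_fired.items()
                    if fired >= 5 and self.alert_actionable[a] / fired < threshold]

Even something this simple starts answering "which deploys are risky" and "which alerts are noise" once enough investigations flow through it.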

Would love to hear people's thoughts!
