CC-Canary: Detect early signs of regressions in Claude Code
28 points
3 hours ago
| 5 comments
| github.com
redanddead
45 seconds ago
[-]
the actual canary is the need for the canary itself
reply
evantahler
2 hours ago
[-]
I feel like asking the thing that you are measuring, and don’t trust, to measure itself might not produce the best measurements.
reply
john_strinlai
2 hours ago
[-]
"we investigated ourselves and found nothing wrong"
reply
Retr0id
1 hour ago
[-]
What is "drift"? It seems to be one of those words that LLMs love to say but it doesn't really mean anything ("gap" is another one).
reply
jldugger
28 minutes ago
[-]
IDK how it applies to LLMs, but the original meaning was a change in a distribution over time. Say you had some model-based app trained on American English, but slowly more and more American Spanish users adopt your app; the training-set distribution is drifting away from the actual usage distribution.

In that situation, your model's accuracy will look good on holdout sets but underperform in users' hands.
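To make that concrete, here's a minimal sketch of catching this kind of drift with the Population Stability Index. The `psi` helper and the language mixes are illustrative, not from any particular project; a common rule of thumb is PSI < 0.1 means little drift and > 0.25 means significant drift.

```python
import math

def psi(expected: dict, actual: dict, eps: float = 1e-6) -> float:
    """Population Stability Index between two discrete distributions.

    `expected` and `actual` map category -> probability. `eps` avoids
    log(0) for categories missing from one side.
    """
    total = 0.0
    for cat in set(expected) | set(actual):
        e = expected.get(cat, 0.0) + eps
        a = actual.get(cat, 0.0) + eps
        total += (a - e) * math.log(a / e)
    return total

# Hypothetical language mix at training time vs. what users send today
train = {"en": 0.95, "es": 0.05}
live = {"en": 0.70, "es": 0.30}

print(psi(train, live))  # well above the 0.25 "significant drift" threshold
```

Holdout accuracy stays flat because the holdout set is drawn from `train`; the PSI on live traffic is what actually flags the problem.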

reply
idle_zealot
1 hour ago
[-]
I believe it's business-speak for "change." "Gap" is suit-tongue for "difference."
reply
aleksiy123
2 hours ago
[-]
Interesting approach. I've been particularly interested in tracking whether adding skills or tweaking prompts makes things better or worse.

Anyone know of other similar tools that let you track across harnesses while coding?

Running evals as a solo dev is too cost-restrictive, I think.

reply
FrankRay78
43 minutes ago
[-]
See the very last section in this doc for how I minimise token usage and track savings, all three plugins co-exist fine: https://github.com/FrankRay78/NetPace/blob/main/docs/agentic...
reply
wongarsu
2 hours ago
[-]
See also https://marginlab.ai/trackers/claude-code-historical-perform... for a more conventional approach to track regressions

This project is somewhat unconventional in its approach, but that might reveal issues that are masked in typical benchmark datasets.

reply