So we built an analytics layer for it. After connecting our own sessions, we ended up with a dataset of 1,573 real Claude Code sessions, 15M+ tokens, and 270K+ interactions.
Some things we found that surprised us:

- Skills were used in only 4% of our sessions
- 26% of sessions are abandoned, most within the first 60 seconds
- Session success rate varies significantly by task type (documentation scores highest, refactoring lowest)
- Error cascade patterns appear in the first 2 minutes and predict abandonment with reasonable accuracy
- There is no meaningful benchmark for 'good' agentic session performance; we are building one
The tool is free to use and fully open source, happy to answer questions about the data or how we built it.
LLMs are far from consistent.
This works in my experience
Starting new sessions frequently and using separate new sessions for small tasks is a good practice.
Keeping context clean and focused is a highly effective way to keep the agent on task. Having an up-to-date AGENTS.md should let new sessions get into simple tasks quickly, so you can use single-purpose sessions for small tasks without carrying the baggage of a long past context into them.
I have longer threads that I don't want to pollute with side quests. I will pull up multiple other chats and ask one or two questions about completely tangential or unrelated things.
The gates categorize issues into auto-fix or human-review. Auto-fix issues get sent back to the coding agent, which re-reviews them, and only the hard stuff makes it to me. That structure took me from about 73% first-pass acceptance to over 90%.
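The routing described above can be sketched roughly like this. Everything here is illustrative, not the commenter's actual pipeline: the `Issue` record, the gate names, and the `severity` field are all hypothetical stand-ins.

```python
from dataclasses import dataclass

# Hypothetical issue record; field names are illustrative only.
@dataclass
class Issue:
    gate: str      # which review gate flagged it, e.g. "lint", "api-design"
    severity: str  # "auto" -> mechanical fix, "human" -> needs judgment

def route(issues):
    """Split gate findings into what goes back to the coding agent
    and what escalates to a human reviewer."""
    auto_fix = [i for i in issues if i.severity == "auto"]
    human_review = [i for i in issues if i.severity == "human"]
    return auto_fix, human_review

issues = [
    Issue("lint", "auto"),
    Issue("api-design", "human"),
    Issue("naming", "auto"),
]
auto, human = route(issues)
print(len(auto), len(human))  # 2 auto-fixable, 1 for human review
```

The interesting part, per the comment, is the loop: the `auto_fix` bucket goes back to the agent for another pass, so only the `human_review` bucket costs reviewer attention.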
What I've been focused on lately is figuring out which gates actually earn their keep and which ones overlap with each other. The session-level analytics you're building would be useful on top of this, I don't have great visibility into token usage or timing per stage right now.
I wrote up the analysis: https://michael.roth.rocks/research/543-hours/
I also open sourced my log analysis tools: https://github.com/mrothroc/claude-code-log-analyzer
Saw another comment on a different platform where someone floated the idea of dynamically injecting context with hooks in the workflow to make things more deterministic.
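For reference, that pattern can be sketched as a prompt-submit hook script. My assumption here (based on my reading of Claude Code's hooks feature) is that the hook receives the event as JSON on stdin and that whatever the script prints to stdout gets appended to the prompt context; the git-branch lookup is just one example of context you might inject.

```python
import json
import subprocess
import sys

def build_context(payload: dict) -> str:
    """Return extra context to inject for this prompt.
    Injecting the current git branch is just an example."""
    try:
        branch = subprocess.run(
            ["git", "rev-parse", "--abbrev-ref", "HEAD"],
            capture_output=True, text=True,
        ).stdout.strip() or "unknown"
    except OSError:
        branch = "unknown"
    return f"Current git branch: {branch}"

if __name__ == "__main__":
    # Hook entry point: event JSON arrives on stdin (tolerate empty input),
    # and the printed text is what gets added to the context.
    raw = sys.stdin.read()
    payload = json.loads(raw) if raw.strip() else {}
    print(build_context(payload))
```

Whether this makes sessions more "deterministic" is debatable, but it does make the injected context explicit and reproducible instead of depending on what the model happens to ask for.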
It's been really helpful for me to debug my own sessions and understand what the model is seeing (system prompts, tool definitions, tracing tool calls, etc.).

I do not see any link or source for the data. I assume it is to remain closed, if it exists.
But I think the prior on 'this team fabricated these findings' is very low.
Curious what shape the benchmark takes. Are you thinking per-task-type baselines, or something more like an aggregate efficiency score?
I scrolled through and didn’t see enough to justify installing and running a thing
Thx for the link - sounds great!
With this data, you can measure whether you are spending too many tokens per session, how often sessions succeed, and what makes them successful. Developers can also share individual sessions where they struggled with their peers, compare learnings, and avoid errors that others have already hit.
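The token-spend part of that measurement can be sketched over local session logs. I'm assuming a JSONL transcript format, one file per session, where entries carry a `message.usage` block with `input_tokens`/`output_tokens`; treat those field names as assumptions if your logs differ.

```python
import json
from pathlib import Path

def session_token_totals(log_dir: str) -> dict:
    """Sum input/output tokens per session file.
    Assumes one JSONL file per session, with entries carrying a
    message.usage block (input_tokens / output_tokens)."""
    totals = {}
    for path in Path(log_dir).glob("*.jsonl"):
        used = {"input_tokens": 0, "output_tokens": 0}
        for line in path.read_text().splitlines():
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines rather than failing the run
            usage = entry.get("message", {}).get("usage", {})
            for key in used:
                used[key] += usage.get(key, 0)
        totals[path.stem] = used
    return totals
```

Once you have per-session totals like this, "too many tokens" becomes a comparison against your own median rather than a gut feeling.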
No, thanks
Or you can run your own instance, but we will need to add docs on how to configure the endpoint properly in the CLI.
TBH, I am very hesitant to upload my CC logs to a third-party service.
The learning is that it is fixable. Better CLAUDE.md instructions, clearer initial prompts, and skill configurations that reduce uncertainty cut abandonment significantly on our team.
Does this include the files being worked on by the agent in the session, or just the chat transcript?
If you don't trust us with that data though (which I can understand), you can host the thing locally on your machine.
Would love to know your actual day-to-day use case for what you built.
I would say a roughly equal number of sessions between them (very roughly).
Also, maybe 40% of coding sessions are in a large brownfield project, 50% greenfield, and the remaining 10% are non-coding tasks.
It seems to me that sometimes it's better and more effective to remove, clean up, and simplify (both in CLAUDE.md and in the code) rather than having everything documented in detail.
Therefore, from session analysis, it would be interesting to identify the relationship between documentation in CLAUDE.md and model efficiency. How often does the developer reject the LLM output in relation to the level of detail in CLAUDE.md?
It became very hard to understand what exactly is sent to the LLM as input/context and how exactly the output is processed.