So we built an analytics layer for it. After connecting our own sessions, we ended up with a dataset of 1,573 real Claude Code sessions, 15M+ tokens, 270K+ interactions.
Some things we found that surprised us:
- Skills were being used in only 4% of our sessions
- 26% of sessions are abandoned, most within the first 60 seconds
- Session success rate varies significantly by task type (documentation scores highest, refactoring lowest)
- Error cascade patterns appear in the first 2 minutes and predict abandonment with reasonable accuracy
- There is no meaningful benchmark for 'good' agentic session performance; we are building one
The tool is free to use and fully open source. Happy to answer questions about the data or how we built it.
LLMs are far from consistent.
This works in my experience
Saw another comment on a different platform where someone floated the idea of dynamically injecting context with hooks in the workflow to make things more deterministic.
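As a minimal sketch of what such a hook could look like: Claude Code supports hooks (e.g. a UserPromptSubmit hook command that receives a JSON payload on stdin, with stdout added to the model's context on exit code 0). The injection logic, file path, and prompt heuristic below are hypothetical examples, not anyone's actual setup.

```python
#!/usr/bin/env python3
"""Sketch of a UserPromptSubmit hook that injects deterministic context."""
import json
import sys
from pathlib import Path


def build_injected_context(payload: dict) -> str:
    """Return extra context to add for this prompt (may be empty)."""
    prompt = payload.get("prompt", "")
    lines = []
    # Always remind the agent of project conventions (hypothetical file).
    conventions = Path("docs/CONVENTIONS.md")
    if conventions.exists():
        lines.append(conventions.read_text())
    # Only inject test instructions when the prompt looks test-related.
    if "test" in prompt.lower():
        lines.append("Run the test suite with `make test` before finishing.")
    return "\n\n".join(lines)


if __name__ == "__main__":
    # Hook invocation: Claude Code pipes a JSON payload to stdin.
    raw = sys.stdin.read() if not sys.stdin.isatty() else ""
    if raw.strip():
        context = build_injected_context(json.loads(raw))
        if context:
            print(context)  # stdout becomes extra context for the session
```

You would register the script as a command hook under the `UserPromptSubmit` event in your Claude Code settings; see the hooks docs for the exact config shape.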
Starting new sessions frequently and using separate new sessions for small tasks is a good practice.
Keeping context clean and focused is a highly effective way to keep the agent on task. An up-to-date AGENTS.md lets new sessions get into simple tasks quickly, so you can use single-purpose sessions for small tasks without carrying the baggage of a long past context into them.
I have longer threads that I don't want to pollute with side quests. I will pull up multiple other chats and ask one or two questions about completely tangential or unrelated things.
Does this include the files being worked on by the agent in the session, or just the chat transcript?
If you don't trust us with that data though (which I can understand), you can host the thing locally on your own machine.
It seems to me that sometimes it's better and more effective to remove, clean up, and simplify (both in CLAUDE.md and in the code) rather than document everything in detail.
It would therefore be interesting to use session analysis to identify the relationship between documentation in CLAUDE.md and model efficiency: how often does the developer reject the LLM output relative to the level of detail in CLAUDE.md?
I do not see any link or source for the data. I assume it is to remain closed, if it exists.
But I think the prior on 'this team fabricated these findings' is very low.
I scrolled through and didn’t see enough to justify installing and running a thing
Thanks for the link, sounds great!
TBH, I am very hesitant to upload my CC logs to a third-party service.
With this data, you can measure whether you are spending too many tokens per session, how successful your sessions are, and what makes them successful. Developers can also share individual sessions where they struggled with their peers, exchange learnings, and avoid errors others have already hit.
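The token-budget check could be as simple as summing tokens per session over an exported log. The record shape (`session_id`, `input_tokens`, `output_tokens`) is an assumption about the export format, not the tool's actual schema.

```python
"""Sketch: flag sessions that exceed a token budget, from exported logs."""
from collections import defaultdict


def sessions_over_budget(interactions, budget=200_000):
    """Sum tokens per session and return {session_id: total} above `budget`."""
    totals = defaultdict(int)
    for rec in interactions:
        totals[rec["session_id"]] += rec["input_tokens"] + rec["output_tokens"]
    return {sid: total for sid, total in totals.items() if total > budget}


# Hypothetical exported interactions from two sessions.
interactions = [
    {"session_id": "a", "input_tokens": 120_000, "output_tokens": 90_000},
    {"session_id": "a", "input_tokens": 40_000,  "output_tokens": 10_000},
    {"session_id": "b", "input_tokens": 5_000,   "output_tokens": 2_000},
]
```

Here session "a" totals 260K tokens and would be flagged against a 200K budget, while "b" passes.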
No, thanks
Or you can run your own instance, but we will need to add docs on how to point the CLI at a custom endpoint properly.
Would love to know your actual day-to-day use case for what you built.
I would say roughly an equal number of sessions between them (very roughly).
Also, maybe 40% of coding sessions are in a large brownfield project, 50% greenfield, and the remaining 10% are non-coding tasks.
It became very hard to understand what exactly is sent to the LLM as input/context and how exactly the output is processed.