We built a shared memory layer you can drop in as a Claude Code Skill. It’s basically a tiny memory DB with recall that remembers your sessions. Not magic. Not AGI. Just state.
Install in Claude Code:
/plugin marketplace add https://github.com/mutable-state-inc/ensue-skill
/plugin install ensue-memory
# restart Claude Code
What it does: (1) persists context between sessions (2) semantic & temporal search (not just string grep). Basically git for your Claude brain.
What it doesn’t do:
- it won’t read your mind
- it’s alpha; it might break if you throw a couch at it
Repo: https://github.com/mutable-state-inc/ensue-skill
If you try it and it sucks, tell me why so I can fix it. Don't be kind, tia
Otherwise, the ability to search back through history is valuable, and a simple git log/diff or (rip)grep/jq combo over the session directory covers most of it. Simple example of mine: https://github.com/backnotprop/rg_history
I feel that way too. I have a lot of these things.
But the reality is, it doesn't really happen that often in my actual experience. Everyone as a whole is very slow to understand what these things mean, so for now you get quite a bit of runway just from an improved, customized system of your own.
https://backnotprop.com/blog/50-first-dates-with-mr-meeseeks...
So... it's tough. I think memory abstractions are generally a mistake and generally not needed. However, compacting has gotten so bad recently that they're also required until Claude Code releases a version with improved compacting.
But I don't do memory abstraction like this at all. I use skills to manage plans, and the plans are the memory abstraction.
But that is more than memory. That is also about having a detailed set of things that must occur.
I think planning is a critical part of the process. I just built https://github.com/backnotprop/plannotator for a simple UX enhancement
Before planning mode I used to write plans to a folder with descriptive file names. A simple ls was a nice memory refresher for the agent.
I am working alone. So I am instead having plans automatically update. Same conception, but without a human in the mix.
But I am utilizing skills heavily here. I also have a python script which manages how the LLM calls the plans so it's all deterministic. It happens the same way every time.
That's my big push right now. Every single thing I do, I try to make as much of it as deterministic as possible.
Do the authors have any benchmarks or tests to show that this genuinely improves outputs?
I have tried probably 10-20 other open source projects and closed source projects purporting to improve Claude Code with memory/context, and still to this date, nothing works better than simply keeping my own library of markdown files for each project specification, markdown files for decisions made etc, and then explicitly telling Claude Code to review x,y,z markdown files.
I would also suggest to the founders: don't found a startup based on improving context for Claude Code. Why? Because this is the number one thing the Claude Code developers are working on too, and it's clearly getting better and better with every release.
So not only are you competing with like 20+ other startups and 20+ other open-source projects, you are competing with Anthropic too.
And I agree with your sentiment, that this is a "business field" that will get eaten by the next generations of base models getting better.
For a single agent and a single tool, keeping project specs and decisions in markdown and explicitly pointing the model at them works well. We do that too.
What we’re focused on is a different boundary: memory that isn’t owned by a specific agent or tool.
Once you start switching between tools (Claude, Codex, Cursor, etc.), or running multiple agents in parallel, markdown stops being “the memory” and becomes a coordination mechanism you have to keep in sync manually. Context created in one place doesn’t naturally flow to another, and you end up re-establishing state rather than accumulating it.
That’s why we're not thinking about this as "improving Claude Code". We’re interested in the layer above that: a shared, external memory that can be plugged into any model and any tool, that any agent can read from or write to, and that can be selectively shared with collaborators. Context created in Claude can be reused in Codex, Manus, Cursor, or collaborators' agents - and vice versa.
If you've already built your setup around one agent in one tool and you're happy with markdown, you probably don't need this. The value shows up once agents are treated as interchangeable workers and context needs to move across tools and people without being re-explained each time.
First thought: why do I need an API key for what can be local markdown files. Make contents of CLAUDE.md be "Refer to ROBOTS.md" and you've got yourself a multi-model solution.
The main objection to corporate AI uptake is "what are you gonna do with our data?" The value prop over local markdown files here is not clear enough to even begin asking that question.
You can work around a lot of the memory issues for large and complex tasks just by making the agent keep work logs. Critical context to keep throughout large pieces of work include decisions, conversations, investigations, plans and implementations - a normal developer should be tracking these and it's sensible to have the agent track them too in a way that survives compaction.
- `FEATURE_IMPL_PLAN.md` (master plan; or `NEXT_FEATURES_LIST.md` or somesuch)
- `FEATURE_IMPL_PROMPT_TEMPLATE.md` (where I replace placeholders with the next feature to be implemented, as sketched after this list; the prompt includes various points about being thorough, making sure to validate and loop until the full test pipeline works, to git version tag upon user confirmation, etc.)
- `feature-impl-plans/` directory where Claude is to keep per-feature detailed docs (with current status) up to date - this is esp. useful for complex features which may require multiple sessions for example
- also instruct it to keep main impl plan doc up to date, but that one is limited in size/depth/scope on purpose, not to overwhelm it
- CLAUDE.md has summary of important code references (paths / modules / classes etc.) for lookup, but is also restricted in size. But it includes full (up-to-date) inventory of all doc files, for itself
- If I end up expanding CLAUDE.md for some reason or temporarily (before I offload some content to separate docs), I will say as part of prompt template to "make sure to read in the whole @CLAUDE.md without skipping any content"
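The placeholder-replacement step above is easy to script. A minimal sketch, assuming a {{FEATURE}} placeholder and a checkbox-style plan (both are assumptions for illustration; only the file names come from the list):

```python
#!/usr/bin/env python3
"""Hypothetical helper for the workflow above: pull the next unchecked feature
from the master plan and render the prompt template. The {{FEATURE}} placeholder
and the "- [ ]" checkbox convention are assumptions, not part of the original setup."""
from pathlib import Path

PLAN = Path("FEATURE_IMPL_PLAN.md")
TEMPLATE = Path("FEATURE_IMPL_PROMPT_TEMPLATE.md")

def next_feature() -> str:
    """Return the first unchecked item in the master plan."""
    for line in PLAN.read_text(encoding="utf-8").splitlines():
        if line.strip().startswith("- [ ]"):
            return line.strip()[len("- [ ]"):].strip()
    raise SystemExit("no unchecked features left in the plan")

if __name__ == "__main__":
    prompt = TEMPLATE.read_text(encoding="utf-8").replace("{{FEATURE}}", next_feature())
    print(prompt)  # paste into the session, or pipe into `claude -p` for a headless run
```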
They refuse to believe that it's possible to instruct these tools in terse plain English and get useful results.
One thing I’d clarify about what we’re building is that it’s not meant to be “the best memory for a single agent.”
The core idea is portability and sharing, not just persistence.
Concretely:
- you can give Codex access to memory created while working in Claude
- Claude Code can retrieve context from work done in other tools
- multiple agents can read/write the same memory instead of each carrying their own partial copy
- specific parts of context can be shared with teammates or collaborators
That’s the part that’s hard (or impossible) to do with markdown files or tool-local memory, and it’s also why we don’t frame this as “breaking the context limit.”
Measuring impact here is tricky, but the problem we’re solving shows up as fragmentation rather than forgetting: duplicated explanations, divergent state between agents, and lost context when switching tools or models.
If someone only uses a single agent in a single tool and is already using their customized CLAUDE.md, they probably don’t need this. The value shows up once you treat agents as interchangeable workers rather than a single long-running conversation.
I'm confused because every single thing in that list is trivial? Why would Codex have trouble reading a markdown file Claude wrote or vice versa? Why would multiple agents need their own copy of the markdown file instead of just referring to it as needed? Why would it be hard to share specific files with teammates or collaborators?
Edit - I realize I could be more helpful if I actually shared how I manage project context:
CLAUDE.md or Agents.md is not the only place to store context for agents in a project, you can just store docs at any layer of granularity you want. What's worked best for me is to:
1. Have a standards doc(s) (you can point the agents to the same standards doc in their respective claude.md/agents.md)
2. Before coding, have the agent create implementation plans that get stored into tickets (markdown files) for each chunk of work that would take about a context-window length (estimated).
3. Work through the tickets and update them as completed. Easy to refer back to when needed.
4. If you want, you can ask the agent to contribute to an overall dev log as well, but this gets long fast. It's useful for agents to refer to the last 50 lines or so to immediately get up to speed on "what just happened?", but git history could serve that purpose too.
5. Ultimately the code is going to be the real "memory" of the true state, so try to organize it in a way that's easy for agents to comb through (no 5,000-line files that agents have trouble carefully jumping around in to find what they need without eating up their entire context window immediately).
Where it stopped being trivial for us was once multiple agents were working at the same time. For example, one agent is deciding on an architecture while another is already generating code. A constraint changes mid-way. With a flat file, both agents can read it, but you’re relying on humans as the coordination layer: deciding which docs are authoritative, when plans are superseded, which tickets are still valid, and how context should be scoped for a given agent.
This gets harder once context is shared across tools or collaborators’ agents. You start running into questions like who can read vs. update which parts of context, how to share only relevant decisions, how agents discover what matters without scanning a growing pile of files, and how updates propagate without state drifting apart.
You can build conventions around this with files, and for many workflows that works well. But once multiple agents are updating state asynchronously, the complexity shifts from storage to coordination. That boundary - sharing and coordinating evolving context across many agents and tools — is what we’re focused on and what an external memory network can solve.
If you’ve found ways to push that boundary further with files alone, I’d genuinely be curious - this still feels like an open design space.
> With a flat file, both agents can read it, but you’re relying on humans as the coordination layer: deciding which docs are authoritative, when plans are superseded, which tickets are still valid, and how context should be scoped for a given agent.
So the memory system also automates project management by removing "humans as the coordination layer"? From the OP the only details we got were
"What it does: (1) persists context between sessions (2) semantic & temportal search (not just string grep)"
Which are fine, but neither it nor you explain how it can solve any of these broader problems you bring up:
"deciding which docs are authoritative, when plans are superseded, which tickets are still valid, and how context should be scoped for a given agent, questions like who can read vs. update which parts of context, how to share only relevant decisions, how agents discover what matters without scanning a growing pile of files, and how updates propagate without state drifting apart."
You're claiming that semantic and temporal search has solved all of this for free? This project was presented as a memory solution, and now it seems like you're saying it's actually an agent orchestration framework, but the gap between what you're claiming your system can achieve and how you claim it works seems vast.
None, that's what I'm trying to say. My favorite is just storing project context locally in docs that agents can discover on their own or that I can point to if needed. This doesn't require me to upload sensitive code or information to anonymous people's side projects, has an equivalent amount of hard evidence for efficacy (zero), at least has my own anecdotal evidence of helping, and doesn't invite additional security risk.
People go way overboard with MCPs and armies of subagents built on wishes and unproven memory systems because no one really knows for sure how to get past the spot we all hit where the agentic project that was progressing perfectly hits a sharp downtrend in progress. Doesn't mean it's time to send our data to strangers.
FWIW, I find this eventual degradation point comes much later and with fewer consequences when there are strict guardrails inside and outside of the LLM itself.
From what I've seen, most people try to fix only the "inside" part - by tweaking the prompts, installing 500 MCPs (which ironically pollute the context and make the problem worse), yelling in uppercase in hopes that it will remember, etc. - and ignore that automated compliance checks existed way before LLMs.
Throw the strictest and most masochistic linting rules at it in a language that is masochistic itself (e.g. Rust), add tons of integration tests that encode intent, add a stop hook in CC that runs all these checks, and you've got a system that simply isn't allowed to silently drift and can put itself back on track from the feedback it gets.
Basically, rather than trying to hypnotize an agent to remember everything by writing a 5000 line agents.md, just let the code itself scream at it and feed the context.
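As a rough illustration of the stop-hook idea (the commands and exit-code handling here are assumptions, not a documented contract), the script behind the hook can be as simple as:

```python
#!/usr/bin/env python3
"""Rough sketch of a check runner a Stop hook could invoke: run the strict
lint + test suite and exit non-zero (with the failure output on stderr) so the
failure lands back in the agent's context. Commands are examples only."""
import subprocess
import sys

CHECKS = [
    ["cargo", "clippy", "--all-targets", "--", "-D", "warnings"],  # masochistic linting
    ["cargo", "test", "--quiet"],                                  # intent-encoding tests
]

def main() -> int:
    for cmd in CHECKS:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            # The tail of the output is usually enough for the agent to self-correct.
            sys.stderr.write((result.stdout + result.stderr)[-4000:])
            return 2  # assumed convention: non-zero tells the hook runner to block/report
    return 0

if __name__ == "__main__":
    sys.exit(main())
```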
All of these systems are for managing context.
You can generally tell which ones are actually doing something if they are using skills, with programs in them.
Because then, you're actually attaching some sort of feature to the system.
Otherwise, you're just feeding in different prompts and steps, which can add some value, but okay, it doesn't take much to do that.
Like adding image generation to Claude Code with Google nano banana, via a Python script that does it.
That's actually adding something claude code doesn't have, instead of just saying "You are an expert in blah"
Another is one Claude Code ships with: using ripgrep.
Those are actual features. It's adding deterministic programs that the llm calls when it needs something.
> Otherwise, you're just feeding in different prompts and steps
"skills" are literally just .md files with different prompts and steps.
> That's actually adding something claude code doesn't have, instead of just saying "You are an expert in blah"
It's not adding anything but a prompt saying "when asked to do X invoke script Y or do steps Z"
So a better question to ask is: do you have any ideas for an objective way to measure the performance of agentic coding tools? So we can truly determine what improves performance and what doesn't.
I would hope that internal to OpenAI and Anthropic they use something similar to the harness/test cases they use for training their full models to determine if changes to claude code result in better performance.
I often use restore conversation checkpoint after successfully completing a side quest.
I’m not sure where the ‘despite’ comes in. Experts and vets have opinions and this is probably the best online forum to express them. Lots of experts and vets also dislike extremely popular unrelated tools like VB, Windows, “no-code” systems, and Google web search… it’s not a personality flaw. It doesn’t automatically mean they’re right, either, but ‘expert’ and ‘vet’ are earned statuses, and that means something. We’ve seen trends come and go and empires rise and fall, and been repeatedly showered in the related hype/PR/FUD. Not reflexively embracing everything that some critical mass of other people like is totally fine.
Because experts and vets often use these tools and find them extremely lacking?
My approach is literally just a top-level, local, git version controlled memory system with 3 commands:
- /handoff - End of session, capture into an inbox.md
- /sync - Route inbox.md to custom organised markdown files
- /engineering (or /projects, /tasks, /research) - Load context into next session
I didn't want a database or an MCP server or embeddings or auto-indexing when I can build something frictionless that works with git and markdown.
Repo: https://github.com/ossa-ma/double (just published it publicly, but it's about the idea imo)
I will typically make multiple /handoff's per day as I use Claude Code, whereas I typically run /sync at the end of the day to organise them all at once.
I think at this point in time, we both have it right.
My own fully-local, minimalistic take on this problem of "session continuation without compaction" is to rely on the session JSONL files directly rather than create separate "memory" artifacts, and seamlessly index them to enable fast full-text search. This is the idea behind the "aichat" command-group + plugin I just added to my claude-code-tools [1] repo. You can quit your Claude-Code/Codex-CLI session S and type
aichat resume <id-of-session-S-you-just-quit>
It launches a TUI, offering a few ways to continue your work:
- blind trim - clones the session, truncates large tool calls/results and older assistant messages, which can clear up as much as 50% of context depending of course on what's going on; this is a quick hack to continue your work a bit longer
- smart trim - similar, but uses a headless agent to decide what to truncate
- rollover: the one I use most frequently; it creates a new session S1 (which can optionally be a different CLI agent, allowing cross-agent work continuation) and injects back-pointers to the parent session JSONL file of S, the parent's parent, and so on (what I call session lineage) into the first user message. The user can then prompt the agent to use a sub-agent to extract arbitrary context from the ancestor sessions to continue the work. E.g. you can say, "Use sub-agent(s) to extract context from the last session shown in the lineage, about how we were working on fixing the tmux-cli issues". The repo has an aichat plugin that provides various slash commands, skills and agents, e.g. /recover-context can be used to extract context relevant to the last task in the parent session.
There is also an "aichat search" command that launches a Rust/Tantivy-based super-fast full-text search TUI to help find past work across Claude-Code/Codex-CLI sessions. (Note that claude --resume only searches session titles/names).
The search command has an agent-friendly JSONL mode for the agent/sub-agents to search for arbitrary past work across all sessions and query/filter with "jq" etc. This lets you open a fresh session and ask it recover context about your past work on X.
[1] https://github.com/pchalasani/claude-code-tools?tab=readme-o...
The project is still in alpha, so you could shape what we build next - what do you need to see, or what gets you comfortable sending proprietary code to other external services?
Honestly? It just has to be local.
At work, we have contracts with OpenAI, Anthropic, and Google with isolated/private hosting requirements, coupled with internal, custom, private API endpoints that enforce our enterprise constraints. Those endpoints perform extensive logging of everything, and reject calls that contain even small portions of code if it's identified as belonging to a secret/critical project.
There's just no way we're going to negotiate, pay for, and build something like that for every possible small AI tooling vendor.
And at home, I feed AI a ton of personal/private information, even when just writing software for my own use. I also give the AI relatively wide latitude to vibe-code and execute things. The level of trust I need in external services that insert themselves in that loop is very high. I'm just not going to insert a hard dependency on an external service like this -- and that's putting aside the whole "could disappear / raise prices / enshittify at any time" aspect of relying on a cloud provider.
We use Cursor where I work and I find it a good medium for still being in control and knowing what is happening with all of the changes being reviewed in an IDE. Claude feels more like a black box, and one with so many options that it's just overwhelming, yet I continue to try and figure out the best way to use it for my personal projects.
Claude Code suffers from initial decision fatigue, in my opinion.
Agree with the other comments: pretty much running vanilla everything and only the Playwright MCP (IMO way better than the native chrome integration) and ccstatusline (for fun). Subagents can be as simple as saying "do X task(s) with subagent(s)". Skills are just self @-ing markdown files.
Two of the most important things are 1) maintaining a short (<250 lines) CLAUDE.md and 2) having a /scratch directory where the agent can write one-off scripts to do whatever it needs to.
This helps it organize temporary things it does like debugging scripts and lets it (or me) reference/build on them later, without filling the context window. Nothing fancy, just a bit of organization that collects in a repo (Git ignored)
That said, it's well known that Anthropic uses CC for production. You just slow things down a bit, spend more time on the spec/planning stage and manually approve each change. IMO the main hurdle to broader Claude Code adoption isn't a code quality one, it's mostly getting over the "that's not how I would have written it" mindset.
I've TL'd and PM'd as well as IC'd. Now my IC work feels a lot more like a cross between being a TL and being a senior with a handful of exuberant and reasonably competent juniors. Lots of reviewing, but still having to get into the weeds quickly and then get out of their way.
Things that need special settings now won’t in the future and vice versa.
It’s not worth investing a bunch of time into learning features and prompting tricks that will be obsoleted soon
They do get better, but not enough to change any of the configuration I have.
But you are correct, there is a real possibility that the time invested will be obsolete at some point.
For sure, the work that went into MCPs is basically obsoleted by skills. These things happen.
how would that be a "skill"? just wrap the mcp in a cli?
fwiw this may be a skill issue, pun intended, but i can't seem to get claude to trigger skills, whereas it reaches for mcps more... i wonder if im missing something. I'm plenty productive in claude though.
So a Skill is just a smaller level of granularity of that concept. It's just one of the individual things an MCP can do.
This is about context management at some level. When you need to do a single thing within that full list of potential things, you don't need the instructions about a ton of other unrelated things in the context.
So it's just not that deep. It would just be a Python script or whatever that the skill calls, which returns the runtime dependencies and gives them back to the LLM so it can refactor without blindly grepping.
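For example, a minimal sketch assuming plain static Python imports - nothing fancy, just something deterministic the skill can call:

```python
#!/usr/bin/env python3
"""Sketch of a deterministic helper a skill could call: list every file in a
project that imports a given module, so the agent gets real dependencies back
instead of grepping blindly. Assumes static Python imports only."""
import ast
import sys
from pathlib import Path

def importers_of(target: str, root: str = ".") -> list[str]:
    hits = []
    for path in Path(root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                names = [node.module]
            else:
                continue
            if any(n == target or n.startswith(target + ".") for n in names):
                hits.append(str(path))
                break
    return sorted(hits)

if __name__ == "__main__":
    # e.g. `python importers.py mypkg.db` prints each file that imports mypkg.db
    print("\n".join(importers_of(sys.argv[1])))
```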
Does that make sense?
Again, going to my example: a skill to do a dependency graph would have to do a complex search, and in some languages the dependency might be hidden by macros/reflection etc., which would obscure a result obtained by grep.
How would you do this with a skill, which is just a text file nudging the LLM, whereas the MCP's server goes out and does things?
It's always interesting reading other people's approaches, because I just find them all so very different than my experience.
I need Agents, and Skills to perform well.
I agree that this level of fine-tuning feels overwhelming and might leave you doubting whether you're using Claude to its optimum. The beauty is that fine-tuning and macro usage don't interfere with each other when you stay in your lane.
For example, I don't use the planning agent anymore; instead I incorporated that process into the normal agents, much to the project's advantage. Consistency is key. Anthropic did the right thing.
Codex is quite a different beast and comes from the opposite direction so to say.
I use both, Codex and Claude Opus especially, in my daily work and have found them complementary, not mutually exclusive. It's like two evangelists who are on par with each other, exercising different tools to achieve a goal that both share.
It's also deeply interesting because it's essentially unsolved space. It's the same excitement as the beginning of the internet.
None of us know what the answers will be.
1. Current directory ./CLAUDE.md
2. User directory ~/.claude/CLAUDE.md
I stick general preferences in what it calls "user memory" and stick project-specific preferences in the working directory.
I've been trying to write blogs explaining it recently, but I don't think I'm very good at making it sound interesting to people.
What can I explain that you would be interested in?
Here was my latest attempt today.
https://vexjoy.com/posts/everything-that-can-be-deterministi...
Here is what I don't get: it's trivial to do this. Mine is of course customized to me and what I do.
The idea is to communicate the ideas, so you can use them in your own setup.
It's trivial to put, for example, my do router blog post into Claude Code and generate one customized for you.
So what does it matter to see my exact version?
These are the types of things I don't get. If I give you my details, it's less approachable for sure.
The most approachable thing I could do would be to release individual skills.
Like I have skills for generating images with google nano banana. That would be approachable and easy.
But it doesn't communicate the why. I'm trying to communicate the why.
When you've tried 10 ways of doing it but they all end up in a "feed the error back into the LLM and see what it suggests next" loop, you aren't that motivated to put much effort into trying out an 11th.
The current state of things is extremely useful for a lot of things already.
I'm not sure if the problems you run into with using LLMs will be solved if you do it my way. My problems are solved doing it my way. If I heard more about your problems, I would have a specific answer to them.
These are the solutions to where I have run into issues.
For sure, but my solutions are not feed the error back into the LLM. My solutions are varied, but as the blog shows, they are move as much as possible into scripts, and deterministic solutions, and keep the LLM to the smallest possible scope.
The current state of things is extremely useful for a subset of things. That subset of things feels small to me. But it may be every thing a certain person wants to do exists in that subset of things.
It just depends. We're all doing radically different things, and trying very different things.
I certainly understand and appreciate your perspective.
My basic problem is: "first-run" LLM agent output frequently does one or more of the following: fails to compile/run, fails existing test coverage, or fails manual verification. The first two steps have been pretty well automated by agents: inspect output, try to fix, re-run. IME this works really well for things like Python, less-well for things like certain Rust edge cases around lifetimes and such, or goroutine coordination, which require a different sort of reasoning than "typical" procedural programming.
But let's assume that the agents get even better at figuring out the deal with the more specialized languages/features and are able to iterate w/o interaction to fix things.
If the first-pass output still has issues, I still have concerns. They aren't "I'm not going to use these tools" concerns, because I also sometimes write bugs, and they can write the vast majority of code faster than I can.
But they are "I'm not gonna vibe-code my day job" concerns because the existence of trivially-catchable issues suggests that there's likely harder-to-catch issues that will need manual review to make sure (a) test coverage is sufficient, (b) the mental model being implemented is correct, (c) the outside world is interacted with correctly. And I still find bugs in these areas that I have to fix manually.
This all adds up to "these tools save me 20-30% of my time" (the first-draft coding) vs "these agents save me 90% of my time."
So I'm kinda at a plateau for a few months where it'll be hard to convince me to try new things to try to close that 20-30% -> 90% number.
The real issue is I don’t know the issues ahead of time. So each experience is an iteration stopping things I didn’t know would happen.
Thankfully, I’m not trying to sell anyone anything. I don’t even want people to use what I use. I only want people to understand the why of what I do, and how it adds me value.
I think it’s important to understand this thing we use as best we can.
The personal value you can get, is entirely up to your tolerance for it.
I just enjoy the process
For large codebases (my own has 500k lines and my company has a few tens of millions) you need something better like RPI.
If nothing else just being able to understand code questions basically instantly should give you a large speed up, even without any fancy stuff.
In some sense, computers and digital things have now just become a part of reality, blending in by force.
But the things I am doing might not be the things you are doing.
If you want proof, I intend to release a game to the App Store and steam soon. At that point you can judge if it built a thing adequately.
I hope you're just one of the ones who figured it out early and all the hype isn't fake bullshit. I'd much rather be proven wrong than for humanity to have wasted all this time and resources.
I think of this stuff as trivial to understand from my point of view. I am trying to share that.
I have nothing to sell, I don’t want anyone to use my exact setup.
I just want to communicate the value as I see it, and be understood.
The vast majority of it all is complete bullshit, so of course I am not offended that I may sound like 1000 other people trying to get you to download my awesome Claude Code Plugins repo.
Except I’m not actually providing one lol
Consider more when you're 50+ hours in and understand what more you want.
the docs if you are curious: https://www.ensue-network.ai/docs
The PMs were right all along!
Deploy the service on your cloud server or your local computer, then add the streamable MCP and skill to Claude Code.
To activate in a new conversation, simply reference the skill first: `@~/.claude/skills/mem/SKILL.md`.
If you like this project, please give it a star on GitHub!
But imagine how hard it would be if these kids had short-term memory only and didn't know what to focus on except what you tell them. You literally have to tell them "Here is A-Z, pay attention to 'X' only and go do your thing". Add in other managers for this party - a caterer, clowns, your spouse - and they also have to instruct the kids, and remember and communicate what the other managers have done. No one has solved for this, really.
This is what it felt like in 2025 to code with LLMs on non-trivial projects, with somewhat of an improvement as the year went by. But I am not sure much progress was made in fixing the process part of the problem.
Then again, this might be just me. When there's a task to be done, even without an LLM my thought process is about selecting the relevant parts of my context for solving it. What is relevant? What starting point has the best odds of being good? That translates naturally to tasking an LLM.
Let's say I have a spec I'm working on. It's based off of a requirements document. If I want to think about the spec in isolation (let's say I want to ask the LLM what requirements are actually being fulfilled by the spec), I can just pass the spec, without passing the requirements. Then I'll compare the response against the actual requirements.
At the end of the day, I guess I hate the automagicness of a silent context injection. Like I said, it also negates the perfect forgetfulness of LLMs.
Though I have found that a repo-level claude.md that is updated every time Claude makes a mistake, plus using --restore to select a previous relevant session, works well.
There is no way for Anthropic to optimize Claude code or the underlying models for these custom setups. So it’s probably better to stick with the patterns Anthropic engineers use internally.
And also - I genuinely worry about vendor lock-in, do you?
Claude Code keeps all the conversation logs stored on-disk right? Why not parse them asynchronously and then use hooks to enrich the context as the conversation goes? (I mean in the most broad and generic way, I guess we’d have to embed them, do some RAG… the whole thing)
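A minimal sketch of that idea, assuming the current on-disk layout (the ~/.claude/projects location and the "message" field are guesses at an undocumented format, not a stable API):

```python
#!/usr/bin/env python3
"""Minimal sketch of searching Claude Code's on-disk session logs. The
~/.claude/projects location and the "message" field are assumptions about
the current (undocumented) JSONL layout and may change."""
import json
import sys
from pathlib import Path

LOG_DIR = Path.home() / ".claude" / "projects"  # assumed location of session JSONL files

def search_sessions(query: str, limit: int = 10) -> list[dict]:
    hits = []
    for jsonl in sorted(LOG_DIR.rglob("*.jsonl")):
        for line in jsonl.read_text(encoding="utf-8", errors="ignore").splitlines():
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue
            text = json.dumps(record.get("message", record))
            if query.lower() in text.lower():
                hits.append({"file": str(jsonl), "snippet": text[:200]})
                if len(hits) >= limit:
                    return hits
    return hits

if __name__ == "__main__":
    for hit in search_sessions(" ".join(sys.argv[1:])):
        print(hit["file"], "::", hit["snippet"])
```

Embedding the records and doing proper RAG instead of substring matching would be the obvious next step on top of this.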
The issue we ran into when building agent systems was portability. Once you want multiple agents or models to share the same evolving context, each tool reconstructing its own memory from transcripts stops scaling.
We’re less focused on “making agents smarter” and more on avoiding fragmentation when context needs to move across agents, tools, or people — for example, using context created in Claude from Codex, or sharing specific parts of that context with a friend or a team.
That’s also why benchmarks are tricky here. The gains tend to show up as less duplication and less state drift rather than a single accuracy metric. What would constitute convincing proof in this space for you?
I’m curious how people think about portability: e.g. letting Claude Code retrieve context that was created while using Codex, Manus, or Cursor, or sharing specific parts of that context with other people or agents.
At that point, log parsing and summaries become per-tool views of state rather than shared state. Do people think a shared external memory layer is overkill here, or a necessary step once you have multiple agents/tools in play?
Quite a few of you have mentioned that you store a lot of your working context across sessions in some md file - what are you actually storing? What data do you actually go back to and refer to as you're building?
1a directly from Anthropic on agentic coding and Claude Code best practices.
"Create CLAUDE.md files"
https://www.anthropic.com/engineering/claude-code-best-pract...
It works great. You can put anything you want in there. Coding style, architecture guidelines, project explanation.
Anything the agent needs to know to work properly with your code base. Similar to an onboarding document.
Tools (Claude Code CLI, extensions) will pick them up hierarchically too if you want to get more specific about one subdirectory in your project.
AGENTS.md is similar for other AI agents (OpenAI Codex is one). It doesn't even have to be those - you can just @ the filename at the start of the chat and that information goes in the context.
The naming scheme just allows for it to be automatic.
I’m never stopped and Claude always remembers what we’re doing.
This pattern has been highly productive for 8 months.
Combined with a good AGENTS.md, it seems to be working really well.
Very clearly AI written
Even if most approaches fail, exploring that boundary feels useful - especially if the system is transparent about what it stores and why.
If you're using them though, we no longer have the problem of Claude forgetting things.
This wasn't mentioned in the first post, but the use case we’re focused on isn’t really “Claude forgetting,” but context living beyond a single agent or tool. Even if Claude remembers well within a session, that context is still owned by that agent instance.
The friction shows up when you switch tools or models (Claude → Codex / Cursor / etc.), run multiple agents in parallel, or want context created in one place to be reused elsewhere without re-establishing it.
In those cases, the problem isn’t forgetting so much as fragmentation. If someone is happy with one agent and one tool, there are probably a bunch of memory solutions to choose from. The value of this external memory network that you can plug into any model or agent shows up once context needs to move across tools and people.
Agents are an md file with instructions.
Skills are an md file with instructions.
Commands are.. you get the point.
We're just dealing with instructions. CLAUDE.md is handled by Claude Code. It is often forgotten almost entirely when the context fills.
Okay, what is an agent? An agent is basically a CLAUDE.md file, but you make it extremely granular. So it only has instructions for, let's say, TypeScript.
We're all just doing context management here. We're trying to make sure our instructions that matter stay.
To do that, we have to remove all other instructions from the picture.
When you're doing TypeScript, you only know TypeScript things.
Okay, what's a skill? A skill is doing a single thing with TypeScript. Why? So that the context is even smaller.
Instead of the agent having every single instruction you need about TypeScript, you put them in skills so they only get put into context when that thing is needed.
But skills are also where you connect deterministic programs. For example, I have a skill for creating images in nano banana.
So when the TypeScript agent needs to create an image, it calls the skill, which calls the Python script, to create images in nano banana.
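The deterministic piece behind that kind of skill can be tiny. A rough sketch using the google-genai SDK (the model id and output handling are assumptions and may need adjusting):

```python
#!/usr/bin/env python3
"""Sketch of a deterministic script a skill could call for image generation.
Assumes the google-genai SDK and GEMINI_API_KEY in the environment; the model
id below is an assumption for the "nano banana" image model."""
import sys
from google import genai

def generate(prompt: str, out_path: str = "out.png") -> None:
    client = genai.Client()
    resp = client.models.generate_content(
        model="gemini-2.5-flash-image",  # assumed model id
        contents=prompt,
    )
    for part in resp.candidates[0].content.parts:
        if getattr(part, "inline_data", None):  # image bytes come back as inline data
            with open(out_path, "wb") as f:
                f.write(part.inline_data.data)
            print(f"wrote {out_path}")
            return
    print("no image returned", file=sys.stderr)

if __name__ == "__main__":
    generate(" ".join(sys.argv[1:]) or "a banana wearing sunglasses")
```

The SKILL.md then just tells Claude when to run the script and what to pass in.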
We're managing all the context to only be available when it's needed, keeping all other instructions out.
Does that help?
I run it in automatic mode with decent namespacing, so thoughts, notes, and whole conversations just accumulate in a structured way. As I work, it stores the session and builds small semantic, entity-based hypergraphs of what I was thinking about.
Later I’ll come back and ask things like:
what was I actually trying to fix here?
what research threads exist already?
where did my reasoning drift?
Sometimes I’ll even ask Claude to reflect on its own reasoning in a past session and point out where it was being reactive or missed connections.
Claude itself can just update the claude.md file with whatever you might have forgot to put in there.
You can stick it in git and it lives with the project.
Did Claude write this?
Just their thought management git system works pretty well for me TBH. https://www.humanlayer.dev/
I’ll give this a go though and let you know!
Not this. Not that. Just something.
What it does.
What it doesn't do.
> ... fix it.
Or, over continuing the same session and compacting?
AI writing slop is infecting everything. Nothing turns me off this product more than the feeling you can’t even write about it as a human. If you can’t do that, why would I use or value it?
It’s almost as if software authors are afraid that if their project names are too descriptive, they won’t be able to pivot to some other purpose, which ends up making every project name sound at once banal and vague.
I use things like claude projects on the web app and skills and stuff, and claude code heavily.
I want to manually curate the context; adding memory is an anti-pattern for this. I don't want the LLM grabbing tokens from memory that may or may not be relevant, and most likely will be stale.
Each time an LLM looks at my project, it's like a newcomer has arrived. If it keeps repeating mistakes, it's because my project sucks.
It's a unique opportunity. You can have lots of repeated feedback from "infinite newcomers" to a project, each of their failures an opportunity to make things clearer. Better docs (for humans, no machine-specific hacks), better conventions, better examples, more intuitive code.
That, in my opinion, is how markdown (for machines only and not humans) will fall. There will be a breed of projects that thrives with minimal machine-specific context.
For example, if my project uses MIDI, I'm much better doing some specialized tools and examples that introduce MIDI to newcomers (machines and humans alike) than writing extensive "skill documents" that explain what MIDI is and how it works.
Think about it like a human would. Do you prefer being introduced to a codebase by reading lots of verbose docs, or by having some ready-to-run examples that can get you going right away? We humans also forget, or ignore, or keep redundant context sources away (for good reason).
I work primarily in Python and maintain extensive coding conventions there - patterns allowed/forbidden, preferred libs, error handling, etc. Custom slash commands like `/use-recommended-python` (loads my curated libs: pendulum over datetime, httpx over requests) and `/find-reinvented-the-wheel` to catch when Claude ignored existing utilities.
My use case: multiple smaller Python projects (similar to steipete's workflow https://github.com/steipete), so cross-project consistency matters more than single-codebase context.
Yes, ~15k tokens for CLAUDE.md + rules. I sacrifice context for consistency. Worth it.
Also baked in my dev philosophy: Carmack-style - make it work first, then fast. Otherwise Claude over-optimizes prematurely.
These memory abstractions are too complicated for me and too inconsistent in practice. I'd rather maintain a living document I control and constantly refine.
Why did you need to use AI to write this post?
I'm sold.
With that said, I can't think of a way that this would work. How does this work? I took a very quick look, and it's not obvious at first glance.
The whole problem is, the AI is short on context, it has limited memory. Of course, you can store lots of memory elsewhere, but how do you solve the problem of having the AI not know what's in the memory as it goes from step to step? How does it sort of find the relevant memory at the time that that relevance is most active?
Could you just walk through the sort of conceptual mechanism of action of this thing?
1. Embeds the current request.
2. Runs a semantic + timestamp-weighted search over your past sessions. Returns only the top N items that look relevant to this request.
3. Those get injected into the prompt as context (like extra system/user messages), so Claude sees just enough to stay oriented without blowing context limits.
Think of it like: attention over your historical work, more so than brute-force recall. Context on demand, basically giving you an infinite context window. Bookmark + semantic grep + temporal rank. It doesn’t “know everything all the time.” It just knows how to ask its own past: “What from memory might matter for this?”
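If it helps, here's an illustrative-only sketch of that kind of retrieval loop (not our exact implementation; embed() and the weighting constants are placeholders):

```python
"""Illustrative-only sketch of the recall loop described above: embed the
request, score past items by semantic similarity plus a recency decay, and
return the top N to inject as context. `embed()` and the weights are placeholders."""
import math
import time

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-9)

def recall(request: str, memory: list[dict], embed, top_n: int = 5,
           half_life_days: float = 14.0) -> list[dict]:
    query_vec = embed(request)
    now = time.time()
    scored = []
    for item in memory:  # item: {"text": ..., "vec": ..., "ts": unix seconds}
        age_days = (now - item["ts"]) / 86400
        recency = 0.5 ** (age_days / half_life_days)      # timestamp weighting
        score = cosine(query_vec, item["vec"]) * (0.7 + 0.3 * recency)
        scored.append((score, item))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:top_n]]           # injected as extra context messages
```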
When you try it, I’d love to hear where the mechanism breaks for you.
Then Claude uses the MCP tools according to the SKILL definition: https://github.com/mutable-state-inc/ensue-skill/blob/main/s...
I think of it like a file tree with proper namespacing, and I keep abstract concepts in separate directories. So my food preferences will be in something like /preferences/sandos. Or you can even do things like /system-design preferences and then load them into a relevant conversation next time.
Text Index of past conversations, using prompt-like summaries.