Show HN: ClawMem – Open-source agent memory with SOTA local GPU retrieval
So I've been building ClawMem, an open-source context engine that gives AI coding agents persistent memory across sessions. It works with Claude Code (hooks + MCP) and OpenClaw (ContextEngine plugin + REST API), and both can share the same SQLite vault, so your CLI agent and your voice/chat agent build on the same memory without syncing anything.

The retrieval architecture is a Frankenstein, which is pretty much always my process. I pulled the best parts from recent projects and research and stitched them together: [QMD](https://github.com/tobi/qmd) for the multi-signal retrieval pipeline (BM25 + vector + RRF + query expansion + cross-encoder reranking), [SAME](https://github.com/sgx-labs/statelessagent) for composite scoring with content-type half-lives and co-activation reinforcement, [MAGMA](https://arxiv.org/abs/2501.13956) for intent classification with multi-graph traversal (semantic, temporal, and causal beam search), [A-MEM](https://arxiv.org/abs/2510.02178) for self-evolving memory notes, and [Engram](https://github.com/Gentleman-Programming/engram) for deduplication patterns and temporal navigation. None of these were designed to work together. Making them coherent was most of the work.
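Two of the stitched-together pieces are easy to sketch: RRF merges the BM25 and vector rankings without any score normalization, and SAME-style half-lives decay a note's score by content type. A minimal sketch (function names and the half-life values are my illustration, not ClawMem's actual API; k=60 is the conventional RRF constant):

```typescript
// Reciprocal Rank Fusion: merge several rankings without score
// normalization. Each doc earns 1/(k + rank) from each ranking it
// appears in; docs ranked well by multiple signals rise to the top.
function rrfFuse(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((docId, rank) => {
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([docId]) => docId);
}

// Content-type half-life: a decision record decays slower than a
// scratch note. These half-life values are made up for illustration.
const HALF_LIVES_DAYS: Record<string, number> = {
  decision: 365,
  doc: 90,
  scratch: 7,
};

function decayedScore(base: number, type: string, ageDays: number): number {
  const halfLife = HALF_LIVES_DAYS[type] ?? 30;
  return base * Math.pow(0.5, ageDays / halfLife);
}
```

The point of RRF is that BM25 scores and cosine similarities live on incompatible scales, so fusing by rank sidesteps normalization entirely; the decay then biases retrieval toward memories that are either fresh or durable by nature.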

On the inference side, QMD's original stack uses a 300MB embedding model, a 1.1GB query expansion LLM, and a 600MB reranker. These run via llama-server on a GPU or in-process through node-llama-cpp (Metal, Vulkan, or CPU). But the more interesting path is the SOTA upgrade: ZeroEntropy's distillation-paired zembed-1 + zerank-2. They're currently the top-ranked embedding and reranking models on MTEB, and they're designed to work together: the reranker was distilled from the same teacher as the embedder, so they share a semantic space. You need ~12GB of VRAM to run both, but retrieval quality is noticeably better than with the default stack. There's also a cloud embedding option if you're tight on VRAM or just prefer to offload embedding.

For Claude Code specifically, it hooks into lifecycle events. Context-surfacing fires on every prompt to inject relevant memory, decision-extractor and handoff-generator capture session state, and a feedback loop reinforces notes that actually get referenced. That handles about 90% of retrieval automatically. The other 10% is 28 MCP tools for explicit queries. For OpenClaw, it registers as a ContextEngine plugin with the same hook-to-lifecycle mapping, plus 5 REST API tools for the agent to call directly.
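To give a feel for what the context-surfacing hook does on each prompt, here's a hypothetical sketch of the shape (names and the tag format are my illustration, not ClawMem's actual hook API): retrieve the top-k notes for the incoming prompt and prepend them before the agent sees it.

```typescript
type Note = { id: string; text: string; score: number };

// Hypothetical context-surfacing hook: fires on every user prompt,
// retrieves relevant memory, and injects it ahead of the prompt text.
function surfaceContext(
  prompt: string,
  retrieve: (query: string, k: number) => Note[],
  k = 5,
): string {
  const notes = retrieve(prompt, k);
  if (notes.length === 0) return prompt; // nothing relevant, pass through
  const memory = notes.map((n) => `- ${n.text}`).join("\n");
  return `<relevant-memory>\n${memory}\n</relevant-memory>\n\n${prompt}`;
}
```

The feedback loop then only has to watch which injected notes the agent actually references, which is how automatic surfacing covers most retrieval without the agent ever calling a tool.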

It runs on Bun with a single SQLite vault (WAL mode, FTS5 + vec0). Everything is on-device; no cloud dependency unless you opt into cloud embedding. The whole system is self-contained.
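The vault layout is roughly this shape. This is a hedged sketch: the table and column names are my guesses rather than the real schema, and the vec0 virtual table requires the sqlite-vec extension to be loaded.

```typescript
// Hypothetical DDL for a ClawMem-style vault: one SQLite file in WAL
// mode, FTS5 for BM25-style keyword search, vec0 (sqlite-vec) for
// vector search. Names and the embedding dimension are illustrative.
const VAULT_DDL = [
  `PRAGMA journal_mode = WAL;`,
  `CREATE TABLE IF NOT EXISTS notes (
     id INTEGER PRIMARY KEY,
     type TEXT NOT NULL,          -- e.g. 'decision', 'doc', 'scratch'
     body TEXT NOT NULL,
     created_at INTEGER NOT NULL  -- unix epoch seconds
   );`,
  `CREATE VIRTUAL TABLE IF NOT EXISTS notes_fts
     USING fts5(body, content='notes', content_rowid='id');`,
  `CREATE VIRTUAL TABLE IF NOT EXISTS notes_vec
     USING vec0(embedding float[768]);`,
];
```

WAL mode is what makes the shared-vault story work: SQLite allows concurrent readers alongside a single writer, so the CLI agent can write while the chat agent reads the same file.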

This is a polished WIP, not a finished product. I'm a solo dev. The codebase is around 19K lines and the main store module is a 4K-line god object that probably needs splitting. And of course, the system is only as good as what you index. A vault with three memory files gives deservedly thin results. One with your project docs, research notes, and decision records gives something actually useful.

Two questions I'd genuinely like input on: (1) Has anyone else tried running SOTA embedding + reranking models locally for agent memory, and is the quality difference worth the VRAM? (2) For those running multiple agent interfaces (CLI + voice/chat), how are you handling shared memory today?
