
Show HN: SiMM – Distributed KV Cache for the Long-Context and Agent Era

1 point

by SherryWong

1 hour ago


We built SiMM because LLM context lengths are growing much faster than GPU memory.

With long Chain-of-Thought reasoning and multi-turn agents, prompts are getting much longer. According to OpenRouter’s State of AI 2025, average context length has grown about 4× in the past year.

This creates two problems in inference systems:

• Slow TTFT: long contexts make prefill expensive
• High GPU memory cost: KV cache quickly exhausts HBM
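To put the memory pressure in numbers, here is a back-of-the-envelope sketch. The formula is the usual KV cache size for a transformer (one K and one V tensor per layer). The model shape used below (32 layers, 8 KV heads, head dimension 128, fp16) is an assumed example, not a configuration taken from the SiMM benchmarks.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Bytes of KV cache for one sequence: a K and a V tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes


# Assumed example shape (hypothetical, for illustration only).
per_token = kv_cache_bytes(32, 8, 128, seq_len=1)
full_context = kv_cache_bytes(32, 8, 128, seq_len=128 * 1024)

print(per_token // 1024, "KiB per token")                  # 128 KiB per token
print(full_context / 2**30, "GiB for a 128K-token context")  # 16.0 GiB
```

Under these assumptions a single 128K-token conversation takes 16 GiB of KV cache, which is why a handful of concurrent long sessions can fill HBM.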

Instead of recomputing long prompts or keeping all KV cache in GPU memory, we explored a different approach:

treat KV cache as a distributed memory system.

SiMM is an open-source distributed KV cache engine for LLM inference. It stores KV cache in a high-speed RDMA-backed memory pool and lets engines like SGLang and vLLM reuse cached states across requests.

For requests whose prefix is already cached, this turns prefill from a compute-heavy step into a fast I/O lookup.
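A minimal sketch of the general idea, assuming block-level prefix hashing of the kind used by prefix caches in inference engines. This is not SiMM's API: `ToyKVPool`, `block_keys`, and `cached_prefix_len` are hypothetical names, and a plain dict stands in for the remote memory pool.

```python
import hashlib

BLOCK_TOKENS = 4  # tiny for illustration; real engines use larger blocks


def block_keys(token_ids, block_tokens=BLOCK_TOKENS):
    """Return one chained hash key per full block of tokens.

    Each key covers every token up to the end of its block, so two
    requests produce the same key only if they share that whole prefix.
    """
    keys, running = [], hashlib.sha256()
    full = len(token_ids) - len(token_ids) % block_tokens
    for start in range(0, full, block_tokens):
        block = token_ids[start:start + block_tokens]
        running.update(",".join(map(str, block)).encode() + b";")
        keys.append(running.hexdigest())
    return keys


class ToyKVPool:
    """In-process stand-in for a remote KV pool (hypothetical)."""

    def __init__(self):
        self._blocks = {}

    def put(self, key, kv_block):
        self._blocks[key] = kv_block

    def get(self, key):
        return self._blocks.get(key)


def cached_prefix_len(pool, token_ids):
    """Number of leading tokens whose KV blocks are already in the pool."""
    hits = 0
    for key in block_keys(token_ids):
        if pool.get(key) is None:
            break
        hits += 1
    return hits * BLOCK_TOKENS


pool = ToyKVPool()
turn1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
for i, key in enumerate(block_keys(turn1)):
    pool.put(key, f"kv-block-{i}")  # placeholder for real K/V tensors

turn2 = turn1 + [10, 11, 12, 13]  # the next turn extends the same conversation
reused = cached_prefix_len(pool, turn2)
print(f"reuse {reused} tokens, prefill {len(turn2) - reused}")
```

In this toy run the second turn reuses the 8 tokens covered by cached blocks and only has to prefill the remaining 5, which is the effect a shared pool gives across requests and across inference nodes.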

In our tests with long-context multi-turn workloads:

• 3.1× speedup vs. no cache
• 2.1× speedup vs. local CPU cache
• up to 9× lower KV I/O latency

SiMM scales horizontally across nodes and fully utilizes RDMA NIC bandwidth.
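For a rough sense of why NIC bandwidth matters here, the sketch below computes a lower bound on KV fetch time at ideal line rate. All numbers are assumptions for illustration (16 GiB of KV cache, 400 Gb/s NICs, no protocol overhead); they are not measurements from SiMM.

```python
def min_fetch_seconds(kv_bytes, nic_gbit_per_s, num_nics=1):
    """Lower bound on transfer time if every NIC runs at full line rate."""
    bytes_per_s = nic_gbit_per_s * 1e9 / 8 * num_nics
    return kv_bytes / bytes_per_s


kv_bytes = 16 * 2**30  # assumed: 16 GiB of KV cache for one long context
one_nic = min_fetch_seconds(kv_bytes, nic_gbit_per_s=400)
eight_nics = min_fetch_seconds(kv_bytes, nic_gbit_per_s=400, num_nics=8)
print(f"{one_nic:.2f} s over one NIC, {eight_nics:.3f} s over eight")
```

Real transfers will be slower than this bound, but it shows how much of the latency budget depends on keeping the NICs saturated.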

GitHub: https://github.com/scitix/SiMM
