Show HN: A mini paged-KV and prefix-cache scheduler (learning inference engine)
Hi HN — I built Tailor, a small teaching/learning repo that’s basically a “mini inference engine” prototype for LLM decoding.

It includes:

1. Paged KV cache (block_size=1) + page-table semantics
2. Trie/radix prefix cache with reference-counted KV blocks (safe prefix reuse)
3. Attention metadata builder (page_table / cu_seqlens / positions / out_loc)
4. A simple KV-capacity-bounded scheduler (admission control + continue-batching)
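
To make items 1 and 2 concrete, here's a minimal sketch of the idea (in the spirit of the repo, not its actual code): a reference-counted block pool with block_size=1, plus a token-level trie for safe prefix reuse. All names here (BlockPool, TrieNode, match_prefix) are hypothetical.

    class BlockPool:
        """Fixed pool of KV blocks; with block_size=1, one block = one token's KV."""
        def __init__(self, num_blocks):
            self.free = list(range(num_blocks))
            self.refcount = [0] * num_blocks

        def alloc(self):
            block = self.free.pop()        # IndexError here = out of KV capacity
            self.refcount[block] = 1
            return block

        def retain(self, block):
            self.refcount[block] += 1      # another sequence now shares this block

        def release(self, block):
            self.refcount[block] -= 1
            if self.refcount[block] == 0:  # last owner gone -> block is free again
                self.free.append(block)

    class TrieNode:
        """Prefix-cache node: one token per edge, one KV block per node."""
        def __init__(self, block=None):
            self.block = block
            self.children = {}             # token id -> TrieNode

    def match_prefix(root, tokens, pool):
        """Walk the trie along `tokens`, retaining each matched block so the
        reused KV prefix can't be freed out from under the new sequence."""
        blocks, node = [], root
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            pool.retain(child.block)
            blocks.append(child.block)
            node = child
        return blocks

The refcounts are what make prefix reuse safe: a block only returns to the free list once every sequence referencing it has released it.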
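Item 3 is the glue between the page table and the attention kernel. A rough sketch of what such a builder might produce, using the field names from the list above (the exact shapes and dtypes in Tailor may differ):

    def build_attn_metadata(seqs):
        """Each seq dict carries its per-token KV block list and how many of
        those tokens already have KV written. Returns flat batch metadata."""
        page_table = []     # one row of block ids per sequence (pad for the kernel)
        cu_seqlens = [0]    # cumulative token counts, to split the flattened batch
        positions = []      # position index of every token computed this step
        out_loc = []        # block each newly computed token's KV is written to
        for seq in seqs:
            blocks = seq["blocks"]       # KV block id per token (block_size=1)
            cached = seq["num_cached"]   # prefix tokens whose KV already exists
            page_table.append(blocks)
            cu_seqlens.append(cu_seqlens[-1] + len(blocks))
            positions.extend(range(cached, len(blocks)))
            out_loc.extend(blocks[cached:])
        return {"page_table": page_table, "cu_seqlens": cu_seqlens,
                "positions": positions, "out_loc": out_loc}

    meta = build_attn_metadata([{"blocks": [5, 9, 2, 7], "num_cached": 3}])
    # -> positions [3], out_loc [7]: only the newest token needs its KV written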
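And item 4: the core policy is to admit a waiting request only when the pool has enough blocks for its prompt plus some decode headroom, then keep every unfinished request in the batch each step. A simplified sketch reusing the BlockPool above (the decode_budget knob and the Request fields are illustrative, not Tailor's API):

    from collections import deque
    from dataclasses import dataclass, field

    @dataclass
    class Request:
        prompt_tokens: list
        blocks: list = field(default_factory=list)
        finished: bool = False

    class Scheduler:
        def __init__(self, pool, decode_budget=64):
            self.pool = pool              # the BlockPool from the first sketch
            self.waiting = deque()
            self.running = []
            self.decode_budget = decode_budget

        def step(self):
            # Admission control: with block_size=1 a request needs one block per
            # prompt token, plus headroom for the tokens it goes on to decode.
            while self.waiting:
                req = self.waiting[0]
                need = len(req.prompt_tokens) + self.decode_budget
                if len(self.pool.free) < need:
                    break                 # not enough KV capacity; keep waiting
                self.waiting.popleft()
                req.blocks = [self.pool.alloc() for _ in req.prompt_tokens]
                self.running.append(req)

            # Continue-batching: finished requests release their blocks, and
            # every still-running request gets one block for this step's token.
            batch = []
            for req in list(self.running):
                if req.finished:
                    for b in req.blocks:
                        self.pool.release(b)
                    self.running.remove(req)
                else:
                    req.blocks.append(self.pool.alloc())
                    batch.append(req)
            return batch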

It’s inspired by nano-vllm and mini-sglang, but it isn’t a direct copy: I re-implemented the components step by step to understand how the pieces fit together, with help from GPT-5.2. The scheduler policy is intentionally simple (learning-first).

Performance note: with 80,000 KV blocks allocated, I measured ~1,990 tokens/s decoding Llama 3.2 1B on a laptop RTX 4070.

Repo: https://github.com/tyfeng1997/tailor
