In LLM serving, the input is first turned into intermediate states called the KV cache, which the model reuses to generate its answers. This data is relatively large (~1-2 GB for a long context) and is often evicted when GPU memory runs out. When that happens and a user asks a follow-up question, the engine has to recompute the same KV cache from scratch. LMCache is designed to avoid that by efficiently offloading the KV cache to DRAM and disk and loading it back when needed.
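To make that concrete, here is a toy sketch of the idea (purely illustrative; this is not LMCache's actual API, storage format, or eviction policy): a prefix-keyed store that keeps hot KV entries in DRAM and spills evicted ones to disk, so a miss in GPU memory becomes a load rather than a recompute.

```python
# Toy sketch only -- not LMCache's real API or on-disk format.
import hashlib
import os
import pickle
from collections import OrderedDict

class ToyKVStore:
    def __init__(self, dram_entries=4, disk_dir="/tmp/toy_kv"):
        self.dram = OrderedDict()          # prefix hash -> KV tensors kept in RAM
        self.dram_entries = dram_entries
        self.disk_dir = disk_dir
        os.makedirs(disk_dir, exist_ok=True)

    def _key(self, token_ids):
        return hashlib.sha256(repr(tuple(token_ids)).encode()).hexdigest()

    def put(self, token_ids, kv):
        k = self._key(token_ids)
        self.dram[k] = kv
        self.dram.move_to_end(k)
        if len(self.dram) > self.dram_entries:            # LRU-evict to disk
            old_k, old_kv = self.dram.popitem(last=False)
            with open(os.path.join(self.disk_dir, old_k), "wb") as f:
                pickle.dump(old_kv, f)

    def get(self, token_ids):
        k = self._key(token_ids)
        if k in self.dram:                                # DRAM hit
            self.dram.move_to_end(k)
            return self.dram[k]
        path = os.path.join(self.disk_dir, k)
        if os.path.exists(path):                          # disk hit: load it back
            with open(path, "rb") as f:
                return pickle.load(f)
        return None                                       # miss: prefill as usual
```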
Ask us anything!
So is this something that might turn into a commercial product in the future? Something like LangChain and the thousands of open source projects that started as "open source" but ended up putting proprietary features behind a paywall.
Future versions of LMCache are aiming to support this.
[1] CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion - https://arxiv.org/abs/2405.16444
* KV cache compression - compressing the bytes of the KV cache, taking advantage of patterns in the data, with dynamic levels of compression (a toy sketch of the quantization flavor of this follows the list)
* KV cache blending - concatenating the KV caches of multiple reused prompts with minimal KV cache recomputation, for use cases like RAG, where it's more performant than the standard lossless KV cache prefix optimization and gives better results than naively concatenating the KV caches of the reused prompts
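As a toy illustration of the compression bullet (emphatically not LMCache's actual scheme): quantize fp16 KV tensors to int8 with a per-tensor scale. "Dynamic levels" here just means you could pick the bit width per chunk; 4-bit would additionally need packing two codes per byte.

```python
import numpy as np

def compress_kv(kv: np.ndarray, bits: int = 8):
    qmax = 2 ** (bits - 1) - 1
    scale = float(np.abs(kv).max()) / qmax or 1.0
    codes = np.clip(np.round(kv / scale), -qmax - 1, qmax).astype(np.int8)
    return codes, scale

def decompress_kv(codes: np.ndarray, scale: float) -> np.ndarray:
    return (codes.astype(np.float32) * scale).astype(np.float16)

kv = np.random.randn(32, 128).astype(np.float16)   # [tokens, head_dim], one layer
codes, scale = compress_kv(kv)                     # ~2x smaller than fp16
print("max abs error:", float(np.abs(kv - decompress_kv(codes, scale)).max()))
```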
These optimizations are pretty cool and different from the standard KV cache optimizations. The title saying "lossless" seems misleading, though.
See https://arxiv.org/abs/2405.16444v3
> To speed up the prefill of the long LLM inputs, one can pre-compute the KV cache of a text and re-use the KV cache when the context is reused as the prefix of another LLM input. However, the reused text chunks are not always the input prefix, which makes precomputed KV caches not directly usable since they ignore the text’s cross-attention with the preceding texts. Thus, the benefits of reusing KV caches remain largely unrealized.
> This paper tackles just one challenge: when an LLM input contains multiple text chunks, how to quickly combine their precomputed KV caches in order to achieve the same generation quality as the expensive full prefill (i.e., without reusing KV cache)? [..] We present a scheme that reuses the pre-computed KV caches, regardless prefix or not, and selectively recomputes the KV values of a small subset of tokens to partially update each reused KV cache.
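My rough reading of that, as a sketch: reuse each chunk's precomputed KV, then recompute only a small subset of positions in the later chunks. The function names and the 15% recompute fraction are my placeholders, and the paper picks the positions by measuring KV deviation rather than just taking the first few tokens of each chunk.

```python
from typing import Callable, List, Sequence

def blend_kv_caches(
    chunk_tokens: List[Sequence[int]],   # token ids of each reused text chunk
    chunk_kv: List[list],                # precomputed per-token KV for each chunk
    recompute_kv_at: Callable[[List[int], list, List[int]], list],
    recompute_fraction: float = 0.15,    # placeholder, not the paper's number
):
    full_tokens: List[int] = []
    blended_kv: list = []
    for i, (toks, kv) in enumerate(zip(chunk_tokens, chunk_kv)):
        offset = len(full_tokens)
        full_tokens.extend(toks)
        blended_kv.extend(kv)
        if i == 0:
            continue   # the first chunk really is a prefix, so its KV is already exact
        # Later chunks were precomputed without cross-attention to the preceding
        # text, so recompute a small subset of their positions to patch that up.
        n = max(1, int(len(toks) * recompute_fraction))
        positions = list(range(offset, offset + n))
        for p, new_kv in zip(positions, recompute_kv_at(full_tokens, blended_kv, positions)):
            blended_kv[p] = new_kv
    return full_tokens, blended_kv
```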
I had recently touched on the benefits of compute-in-network for KV cache management https://news.ycombinator.com/item?id=44371227 largely making arguments contra Bluefield. The CacheBlend authors note that the delay from recomputing some tokens can be hidden by pipelining it with KV loads. Note that the various systolic array/NoC architectures are well-suited to accelerating string matching tasks. A compute-in-network FPGA could therefore manage the entire process: identify viable chunks by indexing and matching the hot substrings, prefetch the corresponding KV caches from network storage, and stitch up a new prefix before passing it to the primary inference hardware. It may well be one of those weird cases where hard-coding the algorithm is possible in theory but intractable in practice, because the optimal paths would be highly dependent on topology.
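In software terms (treating the FPGA/NoC part as a black box), the control flow being described might look roughly like this; BLOCK, hot_index, and fetch_kv_block are all my own placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK = 256   # tokens per indexed block (assumption)

def block_keys(token_ids):
    return [tuple(token_ids[i:i + BLOCK]) for i in range(0, len(token_ids), BLOCK)]

def match_and_prefetch(token_ids, hot_index, fetch_kv_block):
    """hot_index: block key -> storage handle; fetch_kv_block: handle -> KV block."""
    reusable = []
    for key in block_keys(token_ids):
        if key not in hot_index:
            break                            # without blending, only a contiguous
        reusable.append(hot_index[key])      # run of hits from the start is reusable
    with ThreadPoolExecutor() as pool:       # overlap the loads with everything else
        kv_blocks = list(pool.map(fetch_kv_block, reusable))
    return kv_blocks, len(kv_blocks) * BLOCK   # engine prefills only the remainder
```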
Nobody wants one-trick hardware.
In view of the Xilinx acquisition, reports of AMD's death in the AI space appear to be greatly exaggerated!
Once you get X karma or an account age > Y years, you can make one anonymous submission each quarter that comes from a non-user but still gets some sort of "verified" badge proving it comes from a legit user.
It gathered hundreds of GitHub stars and was on the front page all day. When some of us finally had time to look at the code we discovered they didn't invent anything new at all. They took some existing command line options for llama.cpp and then changed the wording slightly to make them appear novel.
The strangest part was that everyone who pointed it out was downvoted at first. The first comment to catch it was even flagged away! You couldn't see it unless you had showdead turned on.
At first glance I don't see this repo as being in the same category, though the "3X throughput increase" claim is very clearly dependent on the level of caching for subsequent responses and the "lossless" claim doesn't hold up as analyzed by another top-level comment.
I think AI self-promoters have realized how easy it is to game Hacker News and GitHub stars if you use the right wording. You can make some big claims that are hard to examine in the quick turnaround times of a Hacker News front page cycle.
> Please don't use HN primarily for promotion. It's ok to post your own stuff part of the time, but the primary use of the site should be for curiosity.
I quit my job at Google 2 years ago to do LLM stuff, was looking forward to having HN around, but discussions re: LLMs here are a minefield.
Why?
Everyone knows at least a little, and everyone has a strong opinion on it given its impact. People sharing stuff sell it way high, and as with any new thing where people are selling, there are a lot of skeptics. Then throw in the human bias towards disliking what seems like snark / complaining, and stuff with substance gets downvotes.
The SNR is continually decreasing.
Let's dig into why this one is weird:
My work does inference using either a 3P provider, which does caching, or llama.cpp, in which I do the caching. (Basically, picture a super expensive step that you can skip by keeping a Map<input string, gpu state>.)
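(For anyone who hasn't done it, that Map<input string, gpu state> really is about this much code; prefill and decode here are stand-ins for the expensive and cheap halves of generation.)

```python
prefill_cache = {}   # prompt string -> opaque "gpu state" (KV cache etc.)

def generate(prompt, prefill, decode):
    state = prefill_cache.get(prompt)
    if state is None:
        state = prefill(prompt)        # the super expensive step
        prefill_cache[prompt] = state
    return decode(state)               # the comparatively cheap continuation
```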
So I log into HN, see this, and say to myself: a 3x throughput increase? This is either really clever or salesmanship; no way an optimization like that has been sitting around on the ground.
So I read the GitHub, see it's just "write everyone's inputs and outputs to disk, then you can use them to cobble together what the GPU state would be for an incoming request!", and write a mostly-polite comment below flagging "hey, this means writing everything to disk".
Then I start replying to you... but then I throw away the comment, because I'm inviting drive-by downvotes. I.e. the minefield described up top: if you look like you're being mean, you'll eat downvotes, especially on a weekend.
And to your average reader, maybe I just don't understand vLLM and am taking it out on good hackers just pushing code.
Then, when I go back, I immediately see a comment from someone who does use vLLM noting it already does caching.
Sigh.
Unfortunately this isn't new. For almost as long as people have been publishing papers, people have been using them this way. arXiv arguably makes it even worse because the papers haven't even gone through the pretense of a peer review, which does serve to filter out at least some of them.
1. For long inputs and short outputs, inference can be an arbitrary number of times faster, since it avoids repeated KV computation.
2. Conversely, for short inputs and long outputs, it might be slightly slower, since loading and storing the KV cache are on the critical path of the execution. (Rough arithmetic in the sketch after this list.)
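Back-of-the-envelope version of those two cases (all constants are made up, just to show the shape of the trade-off; with a slow enough store, load_per_tok can exceed prefill_per_tok and the cached case really does lose):

```python
def latency(n_in, n_out, cached,
            prefill_per_tok=0.5e-3, decode_per_tok=20e-3, load_per_tok=0.1e-3):
    prefill = n_in * (load_per_tok if cached else prefill_per_tok)
    return prefill + n_out * decode_per_tok     # seconds

for n_in, n_out in [(30_000, 100), (200, 2_000)]:
    base, hit = latency(n_in, n_out, False), latency(n_in, n_out, True)
    print(f"in={n_in:>6} out={n_out:>5}  cache-hit speedup: {base / hit:.2f}x")
# in= 30000 out=  100  cache-hit speedup: 3.40x
# in=   200 out= 2000  cache-hit speedup: 1.00x
```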
Also, how realistic would it be to share the KV cache across vllm nodes within a data center? It would be really nice to be able to freely distribute requests to a pool of vLLM workers without worrying about prefix-aware routing, but maybe that isn't the right approach because moving the KV cache around would be too slow?
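For reference, the prefix-aware routing you'd like to avoid can be as simple as this (the worker URLs and the 2048-character window are made up; this is not vLLM's or LMCache's router): requests sharing a prefix always land on the same worker, so its local cache gets the hits. Sharing the cache across nodes instead means shipping the ~1-2 GB blobs mentioned at the top over the network, which is exactly the trade-off you're pointing at.

```python
import hashlib

WORKERS = ["http://vllm-0:8000", "http://vllm-1:8000", "http://vllm-2:8000"]
PREFIX_CHARS = 2048   # route on the first N characters of the prompt (assumption)

def pick_worker(prompt: str) -> str:
    digest = hashlib.sha256(prompt[:PREFIX_CHARS].encode()).digest()
    return WORKERS[int.from_bytes(digest[:4], "big") % len(WORKERS)]
```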
I suppose, combine this with pressure from public or private investment, and the way to get ahead is to package anything into a prospect of revenue generation. I'm sure that's part of it too. Everything has to monetize, because some business school graduate hasn't "made it" until they have a yacht like their Ivy League friends.
Eh, probably comes across as curmudgeonly or "who moved my cheese". But if there is an area that can improve this longstanding problem in tech, my guess is teaching the right skills and concepts at the collegiate level. And that's not a simple thing either.
Edit > Reading a bit more, this focuses on chat applications and seems to be a decent caching implementation tailored to that domain, which I'm guessing will allow AT&T and Verizon to save money on their gobsmackingly horrible AI chat bots in their mobile apps. As an individual, it's unclear how this benefits me, though. I don't think it does. ME: asks the chat bot a question about insurance coverage. CHATBOT: immediately serves a canned response, in no time, about how that's covered in my individual insurance plan, which I can read more about on their website (pro-tip: no, I can't, those details are actually never on the website).
It seems to me like you’re easily hand waving away a hard problem in a different part of the stack you’re less familiar with.
Again, the novelty is in getting cross-attention to work correctly despite the fact that you're stitching arbitrary caches together. It's akin to taking snippets of compressed portions of random compressed files and reconstructing a new, correct plain text. That's obviously not possible in general, yet this has been accomplished with the KV cache for arbitrary models (i.e. not trained for it), despite the KV cache working like decompression, where all the preceding bytes have to be computed correctly for the subsequent token to be correct.
"Lossless 3x Throughput Increase" == "Cache all inputs and output across everyone, in RAM and on disk, and if you assume the next request is covered by cache, its 3x faster!"
I'm more surprised it's only advertised as 3x under those conditions: my llama.cpp wrapper does the same -- caching in RAM while running locally seems fine to me -- and when input is cached, TTFT is ~instantaneous, modulo any add'l prompt you add.
I suppose it creates a little more distance, in that, instead of infinity-times-faster for latency, we measure throughput, and then our speedup can be adjusted as desired by adjusting output length, and thus we can pick a more reasonable-sounding metric like 3x. (Though the GitHub README still frames it in terms of latency / TTFT.)