Fooling around with encrypted reasoning blobs
91 points | 4 days ago | 5 comments | blog.cryptographyengineering.com
glitchc
5 hours ago
Very interesting. The state management is the really insightful find here.

I always wondered how these large AI companies managed access for millions of simultaneous users without having to allocate a dedicated LLM instance for each user. Pushing the complete state down to the user after every call makes perfect sense. The LLM itself stays memoryless and ready to respond to an arbitrary prompt. Very nice.

geocar
5 hours ago
N.B. This is exactly how seaside, vba, and even arc[1] do server-side state generally: by encrypting the blob representing the state and sending it to the client, which sends it back on future requests (where it is decrypted and rehydrated).

It's an old trick that everyone designing protocols should know, since there are lots of applications beyond AI companies.

[1]: As in, pg's lisp: https://arclanguage.github.io/ref/srv.html#:~:text=The%20pre...
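
A minimal sketch of that round trip, for anyone who hasn't seen the pattern. This is not how any of the systems named above actually implement it. To stay inside the Python standard library it only authenticates the blob with an HMAC and does not encrypt it, so the client can read the state but cannot alter it; a real deployment would use an authenticated cipher. `SECRET_KEY`, `seal_state`, `open_state`, and `handle_request` are hypothetical names.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional

SECRET_KEY = b"server-side-secret"  # hypothetical key; it never leaves the server


def seal_state(state: dict) -> str:
    """Serialize the state and prepend an HMAC tag so the client cannot alter it."""
    payload = json.dumps(state, sort_keys=True).encode()
    tag = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(tag + payload).decode()


def open_state(token: str) -> dict:
    """Verify the tag and rehydrate the state; raise ValueError if it was tampered with."""
    raw = base64.urlsafe_b64decode(token.encode())
    tag, payload = raw[:32], raw[32:]  # a SHA-256 tag is 32 bytes
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("state token failed verification")
    return json.loads(payload)


def handle_request(token: Optional[str], message: str) -> str:
    """Stateless handler: everything the server knows arrives in the token."""
    state = open_state(token) if token else {"history": []}
    state["history"].append(message)
    return seal_state(state)  # the updated state goes back to the client


# Two calls; the server keeps nothing in between.
t1 = handle_request(None, "hello")
t2 = handle_request(t1, "world")
print(open_state(t2))
```

The server only needs the key, so any instance can pick up any request. The cost is that the blob grows with the state and travels on every call.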

tn1
2 hours ago
And don't forget the venerable ASP.NET Web Forms with its kilobytes of __VIEWSTATE.

b65e8bee43c2ed0
3 hours ago
the exchange rate between text and its representation in memory is brutal. here's a bit from a recent article:

>An 82 GB footprint in DDR3 on a 2016 Xeon. About 25 GB of weights and 56 GB of KV cache at the full 262K context. The KV cache is larger than the model.

262k tokens is not much at all. with ~5 characters per token, that's only 1.3 MB of plaintext.
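
Spelling that arithmetic out, as a rough sketch. The assumptions are mine: "262K" is read as 2**18 tokens, GB and MB are decimal, and each character is one byte. The 56 GB figure is the one from the quote.

```python
CONTEXT_TOKENS = 262_144      # "262K" read as 2**18 (assumption)
CHARS_PER_TOKEN = 5           # rough figure from the comment above
KV_CACHE_BYTES = 56 * 10**9   # 56 GB, from the quoted article

# One byte per character, i.e. mostly-ASCII text (assumption).
plaintext_bytes = CONTEXT_TOKENS * CHARS_PER_TOKEN
kv_bytes_per_token = KV_CACHE_BYTES / CONTEXT_TOKENS
expansion = KV_CACHE_BYTES / plaintext_bytes

print(f"plaintext: {plaintext_bytes / 1e6:.2f} MB")
print(f"KV cache per token: {kv_bytes_per_token / 1e3:.0f} KB")
print(f"KV cache / plaintext: {expansion:,.0f}x")
```

Under those assumptions each token costs about 214 KB of cache, and the cache is roughly 43,000 times the size of the text it represents.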

londons_explore
1 hour ago
The providers must have a more efficient approach. Most cache every request for 12+ hours, and they certainly can't spare 100 GB of RAM per request for 12 hours.

dist-epoch
41 minutes ago
This is one reason why the price of SSDs also doubled, not just that of RAM.

> LMCache extends the KV Cache from the NVIDIA GPU's fast HBM (Tier 1) to larger, more cost-effective tiers like CPU RAM and local SSDs.

https://cloud.google.com/blog/topics/developers-practitioner...

choppaface
1 hour ago
or maybe they don’t actually cache (fully) but lie and just don’t charge the user right now. at least half the users, who are probably also using the most similar tokens / prompts, wouldn’t really know the difference in latency (or care)

londons_explore
31 minutes ago
If it actually cost that much RAM, they would almost certainly add extra things to the API to manage cache lifetime, i.e. a 'please cache this for X minutes' flag, or a setting for a single re-use cache (the most common use case).

londons_explore
2 hours ago
Except the providers also cache the parsing of the prompt (the KV cache), and that has substantial cost savings (easily an 80% saving on typical coding use cases).

That caching is done server side and not passed to the client. Which in turn means they still need state management on the server side, although it perhaps doesn't need the same level of global replication and availability.
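
A toy sketch of what "server-side state" means here, under heavy assumptions. `PrefixCache` is a hypothetical class, the cached value is a placeholder string rather than real KV tensors, and hashing every prefix is far cruder than whatever a production server does. It only shows why a follow-up request that shares a long prefix needs little fresh work.

```python
import hashlib


class PrefixCache:
    """Toy server-side cache keyed by a hash of each token prefix."""

    def __init__(self):
        self._store = {}  # prefix hash -> placeholder standing in for KV tensors

    @staticmethod
    def _key(tokens):
        # Join with a separator so ["ab", "c"] and ["a", "bc"] hash differently.
        return hashlib.sha256("\x1f".join(tokens).encode()).hexdigest()

    def longest_cached_prefix(self, tokens):
        """Return how many leading tokens already have cached state."""
        for n in range(len(tokens), 0, -1):
            if self._key(tokens[:n]) in self._store:
                return n
        return 0

    def process(self, tokens):
        """Cache every new prefix and return how many tokens were computed fresh."""
        reused = self.longest_cached_prefix(tokens)
        for n in range(reused + 1, len(tokens) + 1):
            self._store[self._key(tokens[:n])] = f"kv-for-{n}-tokens"
        return len(tokens) - reused


cache = PrefixCache()
first = "system prompt plus a long source file".split()
follow_up = first + ["now", "fix", "the", "bug"]
print(cache.process(first))      # every token is new on the first request
print(cache.process(follow_up))  # only the appended tokens are new
```

The store lives on the server and never reaches the client, which is the point of the comment above: the blob handed to the user does not remove the need for this second kind of state.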

Groxx
5 hours ago
One possible use for the "replay across accounts": if you can get a reasoning block that jailbreaks the model, you could share that block without sharing how you did it, and others can immediately take advantage of it too.

denysvitali
3 hours ago
Not necessarily for the "without sharing" part, but to increase the reliability of the jailbreak. The same prompt isn't guaranteed to return the same result, but combining the internal thinking with the prompt might be more effective.

Retr0id
7 hours ago
Very cool idea to use thinking duration (either in tokens or in wall time) as a side-channel!

hhh
3 hours ago
Awesome write-up. Seems like a great way to play with model responses now that prefill is gone.

Reubend
6 hours ago
Super cool side channel attack. I tend to agree that it's pretty impractical, but it's such a fun discovery!