I also have my gripes about the way 2-hop reasoning is presented here, with figure 3 being the canonical example of what I would consider too trivial/misleading (the exact text match of "Eric Watts" appearing in both the question and the context). It leads to the natural question of how the model compares to an LLM with a grep tool.
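To make the grep comparison concrete: when the question's entity appears verbatim in the context, a few lines of string search already retrieve the relevant passages, no attention required. A minimal sketch (the `grep_baseline` helper and toy context are hypothetical, mine, not from the paper):

```python
import re

def grep_baseline(question_entity: str, context: str, window: int = 80):
    """Return a snippet around every verbatim occurrence of the entity."""
    hits = []
    for m in re.finditer(re.escape(question_entity), context):
        start = max(0, m.start() - window)
        hits.append(context[start:m.end() + window])
    return hits

# Toy context standing in for a long document.
context = (
    "... unrelated filler ... Eric Watts joined the lab in 2014. "
    "Eric Watts later moved to the robotics team. ... more filler ..."
)
snippets = grep_baseline("Eric Watts", context)
print(len(snippets))  # both mentions found by exact match alone
```

Any benchmark question this trick can answer is arguably testing retrieval, not reasoning.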
What I would consider more interesting is practical synthesis over such a large context, where you can't just string-lookup answers. For example, maybe dumping all of Intel's x86 manuals into context and then asking an LLM to try to write assembly, or something like that.
The more we can drive towards selective attention over larger and larger sets of "working memory", the better, I think.
I suspect cleverer mechanisms of context injection/pruning/updating will yield effective memory more than endlessly increasing the context window will, regardless of what tricks we apply to distill attention over it.
There is probably a lot of low hanging fruit in this area.