David looks into the LLM, finds the thinking layers, duplicates them, and puts the copies back to back.
This increases the LLM's scores with basically no overhead.
Very interesting read.
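Not David's actual method, just a rough sketch of the idea as I read it: take a contiguous block of "thinking" layers in a decoder-only transformer and run it twice in a row, sharing weights so the parameter count stays the same (the extra cost is compute per token, not memory). The attribute path `model.model.layers` is an assumption (HuggingFace Llama-style models), and KV-cache / layer-index bookkeeping may need extra care in a real model.

```python
import torch.nn as nn

def duplicate_block(model, start: int, end: int):
    """Repeat layers [start, end) back to back, sharing their weights."""
    layers = list(model.model.layers)
    block = layers[start:end]                      # the "thinking" layers
    stacked = layers[:end] + block + layers[end:]  # the block followed by itself
    model.model.layers = nn.ModuleList(stacked)
    return model
```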
But what's in the context window is sharp: the exact text or video frame right in front of the model.
The goal is to bring more of the world into that context.
Compression gives it intuition. Context gives it precision.
Imagine if we could extract the model's reasoning core and plug it anywhere we want.
Training data quality does matter, but even with "perfect" data and a prompt that appears in the training data, it can still happen. LLMs don't actually know anything, and they also don't know what they don't know.
they sort of do tho:
https://transformer-circuits.pub/2025/introspection/index.ht...
I'll play along and assume this is sound. 10-40% +/- 10% is "sort of" only in a completely unreliable, unguaranteed, and unproven way, sure.
>Imagine if we could extract the model's reasoning core and plug it anywhere we want.
Aren't a lot of the latest model variants doing something very similar? Stuff more domain-relevant knowledge into the model itself on top of a core generally-good reasoning piece, to reduce the need to perfectly handle a giant context?
Zurada was one of our AI textbooks that makes it visual: from a simple classifier all the way to a large language model, we are mathematically creating a shape that the signal interacts with. More parameters mean the shape can be curved in more ways, and more data means the curve gets higher definition.
They reach something with data, treating the neural network as a black box, that could instead be derived mathematically using the information we know.
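A toy illustration of the shape analogy (mine, not from the textbook): fitting y = sin(x) with polynomials. More coefficients (parameters) let the fitted curve bend in more places; more sample points (data) pin that curve down more precisely.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_fit_error(degree: int, n_points: int) -> float:
    # Sample n_points of the target "signal" and fit a degree-d polynomial to it.
    x = np.sort(rng.uniform(0, 2 * np.pi, n_points))
    y = np.sin(x)
    coeffs = np.polyfit(x, y, degree)            # the parameters of the "shape"
    x_test = np.linspace(0, 2 * np.pi, 200)
    return float(np.abs(np.polyval(coeffs, x_test) - np.sin(x_test)).mean())

for degree in (1, 3, 9):
    for n in (10, 1000):
        print(f"degree={degree:2d} points={n:5d} mean error={mean_fit_error(degree, n):.3f}")
```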
However: the labs releasing these high-intelligence-density models are getting them by first training much larger models and then distilling down. So the most interesting question to me is, how can we accelerate learning in small networks to avoid the necessity of training huge teacher networks?
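For reference, a minimal sketch of standard logit distillation (Hinton-style), which is the usual way a large teacher's behaviour gets compressed into a smaller student; the models and data loading are assumed, only the loss is the point here.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's temperature-softened distribution (KL divergence).
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

The open question in the comment is exactly what this hides: the soft targets only exist because someone already paid for the huge teacher.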