This article goes into the implicit prior/posterior updating during LLM inference; you can even go a step further and directly implement hierarchical relationships between layers with H-Nets. However, even under an explicit Bayesian framework, there’s a stark difference in robustness between these H-Nets and the equivalent Bayesian model with the only variable being the parameter estimation process. [1]
These papers look promising, but there are a few initial strikes against them. First, the research itself was clearly done with agentic support; I'd guess from the blog post and the papers that the research was actually done by agents with human support. There are lots of persistent giveaways, like overcommitting to weird titles such as "Wind Tunnel", and all of the obvious turns of phrase in the Medium post unfortunately carry over into the papers themselves. This doesn't mean they're wrong, but I do think it means what they have is less info dense and less obviously correct, given today's state of the art with agentic research.
Upshot of the papers: there's one main claim, that each layer of a well-trained transformer network performs a Bayesian 'update' and selection of the "truth" or preference of the model; more layers in the architecture = more accuracy. Thinking models = a chance to refresh the context and go back to the start of the layers for further refinement.
There's a followup claim: that treating what the models are doing as solely updating weights for this Bayesian process will lead to more efficient training.
Data in the paper: I didn't read deeply enough to decide whether this whole "it's all Bayes all the way down" idea seems true to me. They show that if you ablate single layers then accuracy drops, but that is not news.
They do show significantly faster (per round) loss reduction using EM training vs SGD, but they acknowledge this converges to the same loss eventually (although their graphs do not show this convergence, btw), and crucially they do absolutely no reporting on compute required, or comparison with more modern methods.
Upshot - I think I'd skip this and kind of regret the time I spent reading the papers. Might be true, but a) so what, and b) we don't have anything falsifiable or genuinely useful out of the theory. Maybe if we could splice together different models in a new and cool way past merging layers, then I'd say we have something interesting out of this.
Here, the authors have set up two synthetic experiments where transformers have to learn the probability of observing events sampled from a "ground truth" Bayesian model. If the probability assigned by the transformers to the event space matches the Bayesian posterior predictive distribution, then the authors infer that the model is performing Bayesian inference for these tasks. Furthermore, they use this to argue that transformers are performing Bayesian inference in general (belief propagation throughout layers).
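To make the "matches the posterior predictive" criterion concrete, here's a toy sketch. It is not the papers' setup: I'm assuming a coin with a Beta(1, 1) prior purely for illustration, and `bayes_like` / `mle_like` are made-up stand-ins for a trained model's outputs.

```python
from fractions import Fraction

def posterior_predictive(heads: int, flips: int, a: int = 1, b: int = 1) -> Fraction:
    """P(next flip is heads | heads, flips) under an assumed Beta(a, b) prior on the bias."""
    return Fraction(heads + a, flips + a + b)

def max_abs_gap(model_probs: dict, flips: int) -> float:
    """Largest gap between a model's P(heads) and the Bayesian answer over head counts 0..flips."""
    return max(
        abs(model_probs[k] - float(posterior_predictive(k, flips)))
        for k in range(flips + 1)
    )

flips = 4
# Hypothetical predictor that smooths its estimate (coincides with Bayes under a uniform prior).
bayes_like = {k: (k + 1) / (flips + 2) for k in range(flips + 1)}
# Hypothetical predictor that just reports the raw observed frequency.
mle_like = {k: k / flips for k in range(flips + 1)}

print(max_abs_gap(bayes_like, flips))
print(max_abs_gap(mle_like, flips))
```

A gap near zero is the kind of evidence the authors lean on; the raw-frequency predictor shows what a non-Bayesian answer looks like on the same inputs.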
The transformers are trained on thousands of different "ground truth" Bayesian models, each randomly initialized, which means that there's no underlying signal to be learned besides the belief propagation mechanism itself. This makes me wonder if any sufficiently powerful maximum likelihood-based model would meet this criterion of "doing Bayesian inference" in this scenario.
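A quick way to see why I suspect this: in a toy version (my assumption, not the papers' tasks), draw a fresh random coin bias per task, show 4 flips, and fit the dumbest possible maximum-likelihood model, a lookup table of "how often was the next flip heads given k heads so far". The table has no notion of priors, yet it lands on the posterior predictive.

```python
import random
from collections import defaultdict

def fit_frequency_table(num_tasks: int, flips: int, seed: int = 0) -> dict:
    """Tabular maximum-likelihood estimate of P(next flip is heads | k heads seen)."""
    rng = random.Random(seed)
    next_heads = defaultdict(int)
    seen = defaultdict(int)
    for _ in range(num_tasks):
        theta = rng.random()  # new random "ground truth" coin for every task
        k = sum(rng.random() < theta for _ in range(flips))
        seen[k] += 1
        next_heads[k] += rng.random() < theta  # the flip the model must predict
    return {k: next_heads[k] / seen[k] for k in seen}

flips = 4
table = fit_frequency_table(200_000, flips)
for k in sorted(table):
    # Columns: heads seen, fitted frequency, posterior predictive under a uniform prior.
    print(k, round(table[k], 3), round((k + 1) / (flips + 2), 3))
```

The two right-hand columns should agree up to sampling noise. The table "does Bayesian inference" only in the sense that the training distribution makes the Bayesian answer the loss-minimizing one.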
The transformers in this paper do not intrinsically know to perform inference due to the fact that they're transformers. They perform inference because the optimal solution to the problems in the experiments is specifically to do inference, and transformers are powerful enough to model belief propagation. I find it hard to extrapolate that this is what is happening for LLMs, for example.
But a lot of people are of the opinion that for many papers it helps to have a secondary publication where the author puts the work in the appropriate context. I’m trying to build a shared mental model with the author, to help me better understand the underlying work; that is harder to do when there’s no mind behind the words.
> that is harder to do when there’s no mind behind the words.
Presumably the author read the text before publishing and agreed with the summary. What's the problem exactly?
On the other hand, if one uses AI but keeps the amount of added content at zero (e.g. grammar fixes for non-native speakers) or even negative (e.g. compressing a repetitive paragraph), I think it can be a useful net productivity boost.
Current AI can't really add information, but a lot of editing is subtracting, and as long as you check the output for hallucinations (and prompt-engineer a lot since models like to add) imo LLMs can be a subtraction-force-multiplier.
Ironically: anti-slop; or perhaps, fighting slop with slop.
The essay kind of works for me as an impressionistic context for the three papers, but without those three papers I think it confuses more than it helps.
Eg
> This suggests that the EM structure isn’t just an analogy — it’s the natural grain of the optimization landscape
I don't care if someone uses an LLM. But it shows a lack of care to do it in this blatant way without noting it. E.g. at work I'll often link the prompt and response in docs as an appendix, but I will call out the provenance.
If you find those sentences to be helpful, great! I find they decrease the signal in the article and make me skim it. If you're wondering why people complain, it's because sharing a post intended to be skimmed without saying "hey, you should skim this" is a little disrespectful of someone's time.
As someone in the field, this means nothing, and I'm very suspicious of the article as a whole because it has so many sentences like this.