LLM Architecture Gallery
355 points | 15 hours ago | 17 comments | sebastianraschka.com
libraryofbabel
7 hours ago
This is great - always worth reading anything from Sebastian. I would also highly recommend his Build an LLM From Scratch book. I feel like I didn’t really understand the transformer mechanism until I worked through that book.

On the LLM Architecture Gallery, it’s interesting to see the variations between models, but I think the 30,000ft view of this is that in the last seven years since GPT-2 there have been a lot of improvements to LLM architecture but no fundamental innovations in that area. The best open weight models today still look a lot like GPT-2 if you zoom out: it’s a bunch of attention layers and feed forward layers stacked up.

Another way of putting this is that the astonishing improvements in LLM capabilities that we’ve seen over the last seven years have come mostly from scaling up and, critically, from new training methods like RLVR, which is responsible for coding agents going from barely working to amazing in the last year.

That’s not to say that architectures aren’t interesting or important or that the improvements aren’t useful, but it is a little surprising, even though it shouldn’t be at this point: it’s probably just a version of the Bitter Lesson.
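
A minimal sketch of that zoomed-out picture, in plain Python: causal self-attention followed by a feed-forward layer, each with a residual connection, stacked several times. It is an illustration under simplifying assumptions, not any particular model: it uses a single attention head and omits layer normalization, positional information, and the embedding and output layers. All function and parameter names (`causal_attention`, `feed_forward`, `block`, `run_stack`, `Wq`, `W1`, ...) are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def causal_attention(xs, Wq, Wk, Wv):
    """Single-head self-attention; position t only attends to positions <= t."""
    d = len(Wq)
    qs = [matvec(Wq, x) for x in xs]
    ks = [matvec(Wk, x) for x in xs]
    vs = [matvec(Wv, x) for x in xs]
    out = []
    for t, q in enumerate(qs):
        scores = [
            sum(a * b for a, b in zip(q, ks[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * vs[s][j] for s, w in enumerate(weights)) for j in range(d)]
        )
    return out


def feed_forward(x, W1, W2):
    """Position-wise two-layer MLP with a ReLU in between."""
    hidden = [max(0.0, h) for h in matvec(W1, x)]
    return matvec(W2, hidden)


def block(xs, p):
    """One layer: attention plus residual, then feed-forward plus residual."""
    attn = causal_attention(xs, p["Wq"], p["Wk"], p["Wv"])
    xs = [[a + b for a, b in zip(x, y)] for x, y in zip(xs, attn)]
    ff = [feed_forward(x, p["W1"], p["W2"]) for x in xs]
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ff)]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def make_params(d, rng):
    """Random weights for one block; the hidden width 4*d is an assumption."""
    return {
        "Wq": random_matrix(d, d, rng),
        "Wk": random_matrix(d, d, rng),
        "Wv": random_matrix(d, d, rng),
        "W1": random_matrix(4 * d, d, rng),
        "W2": random_matrix(d, 4 * d, rng),
    }


def run_stack(xs, layers):
    """Apply the blocks in order: the 'stacked up' part."""
    for p in layers:
        xs = block(xs, p)
    return xs
```

Calling `run_stack` on a list of token vectors with, say, three sets of `make_params` weights returns one vector per input position. Most of the variations discussed in the thread change what happens inside `causal_attention` or `feed_forward` while this outer loop stays much the same.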

imjonse
1 hour ago
> On the LLM Architecture Gallery, it’s interesting to see the variations between models, but I think the 30,000ft view of this is that in the last seven years since GPT-2 there have been a lot of improvements to LLM architecture but no fundamental innovations in that area.

After years of showing up in papers and toy models, hybrid architectures like Qwen3.5 contain one such fundamental innovation - linear attention variants which replace the core of the transformer, the self-attention mechanism. In Qwen3.5 in particular, only one of every four layers is a self-attention layer.

MoEs are another fundamental innovation - also from a Google paper.
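
The one-in-four pattern mentioned above can be sketched as a layer schedule. This is only an illustration: the function name `layer_schedule`, its parameter `full_attention_every`, and the choice to put the full-attention layer last in each group of four are assumptions, not the actual Qwen3.5 layout.

```python
def layer_schedule(num_layers, full_attention_every=4):
    """Return a layer type per layer index.

    Assumption: the last layer of each group of `full_attention_every`
    uses full self-attention; all others use a linear attention variant.
    """
    return [
        "full_attention" if (i + 1) % full_attention_every == 0
        else "linear_attention"
        for i in range(num_layers)
    ]
```

For example, `layer_schedule(8)` marks layers 3 and 7 (zero-indexed) as full attention and the other six as linear attention.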

libraryofbabel
1 hour ago
Thanks for the note about Qwen3.5. I should keep up with this more. If only it were more relevant to my day to day work with LLMs!

I did consider MoEs but decided (pretty arbitrarily) that I wasn’t going to count them as a truly fundamental change. But I agree, they’re pretty important. There’s also RoPE, perhaps slightly less of a big deal but still a big difference from the earlier models. And of course lots of brilliant inference tricks like speculative decoding that have helped make big models more usable.

iroddis
9 hours ago
This is amazing, such a nice presentation. It reminds me of the Neural Network Zoo [1], which was also a nice visualization of different architectures.

[1] https://www.asimovinstitute.org/neural-network-zoo/

bicepjai
1 hour ago
Currently working on a similar project for myself. This looks like a great resource. Thanks for sharing. https://llm-lab.bicepjai.com/

wood_spirit
10 hours ago
Lovely!

Is there a sort order? Would be so nice to understand the threads of evolution and revolution in the progression. A bit of a family tree and influence layout? It would also be nice to have a scaled view so you can sense the difference in sizes over time.

krackers
9 hours ago
There is https://magazine.sebastianraschka.com/p/technical-deepseek which shows the evolution of the DeepSeek family

andai
1 hour ago
> The goal of the proof verifier (LLM 2) is to check the generated proofs (LLM 1), but who checks the proof verifier? To make the proof verifier more robust and prevent it from hallucinating issues, they developed a third LLM, a meta-verifier.

krackers
1 hour ago
The one thing I didn't quite understand (and wasn't mentioned in their paper unless I missed it) is why you can't keep stacking turtles. You probably get diminishing returns at some point, but why not have a meta-meta-verifier?

gasi
9 hours ago
So cool — thanks for sharing! Here’s a zoomable version of the diagram: https://zoomhub.net/LKrpB

Slugcat
8 hours ago
What tool was used to draw the diagrams?

nxobject
7 hours ago
Thank you so much! As a (bio)statistician, I've always wanted a "modular" way to go from "neural networks approximate functions" to a high-level understanding about how machine learning practitioners have engineered real-life models.

LuxBennu
7 hours ago
Interesting collection. The architecture differences show up in surprising ways when you actually look at prompt patterns across models. Longer context windows don't just let you write more, they change what kind of input structure works best.

jasonjmcghee
6 hours ago
What's the structurally simplest architecture that has worked to a reasonably competitive degree?

loveparade
6 hours ago
Competitiveness doesn't really come from architecture, but from scale, data, and fine-tuning data. There has been little innovation in architecture over the last few years, and most innovations are for the purpose of making it more efficient to run training or inference (fit in more data), not "fundamentally smarter"

bigyabai
6 hours ago
If your definition of "competitive" is loose enough, you can write your own Markov chain in an evening. Transformer models rely on a lot of prior art that has to be learned incrementally.
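
For a sense of how small the "Markov chain in an evening" end of the spectrum is, here is a minimal word-level bigram sketch. The names `train_bigram` and `generate` are hypothetical, and the model only conditions on the single previous token.

```python
import random
from collections import defaultdict


def train_bigram(tokens):
    """Map each token to the list of tokens observed right after it."""
    table = defaultdict(list)
    for cur, nxt in zip(tokens, tokens[1:]):
        table[cur].append(nxt)
    return table


def generate(table, start, length, rng):
    """Sample up to `length` tokens, stopping early at a dead end."""
    out = [start]
    for _ in range(length - 1):
        choices = table.get(out[-1])
        if not choices:
            break
        # Repeated successors appear multiple times in the list, so
        # rng.choice samples in proportion to observed counts.
        out.append(rng.choice(choices))
    return out
```

Usage would look like `generate(train_bigram(text.split()), "the", 20, random.Random(0))` for some training string `text` that contains the word "the".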

jasonjmcghee
6 hours ago
Not that loose lol.

I’m thinking it’s still Llama / dense decoder-only transformer.

travisgriggs
7 hours ago
Darn. I clicked here hoping we were having LLMs design skyscrapers, dams, and bridges.

I even brought my popcorn :(

jrvarela56
7 hours ago
Would be awesome to see something like this for agents/harnesses

charcircuit
8 hours ago
I'm surprised at how similar all of them are with the main differences being the size of layers.

arikrahman
5 hours ago
Thank you for the high quality diagrams!

mvrckhckr
10 hours ago
What a great idea and nice execution.

neuroelectron
7 hours ago
An older post from this blog; the linked article was updated recently: https://news.ycombinator.com/item?id=44622608

jawarner
3 hours ago
Looks like this may have received the HN Hug of Death. I'm getting a "Too Many Requests" error trying to load the images.

brianjking
3 hours ago
I'm getting that trying to load the content at all, text included.

FailMore
12 hours ago
Thanks! This is cool. Can you tell me if you learnt anything interesting/surprising when pulling this together? As in, did it teach you something about LLM architecture that you didn't know before you began?