As particular motivation for my intuition, I expect that we had evolutionary pressure to adapt our defense mechanisms of predicting the movements of predators and prey, to handle human opponents.
[0] https://github.com/rasbt/LLMs-from-scratch
[1] https://www.manning.com/books/build-a-large-language-model-f...
[2] https://magazine.sebastianraschka.com/p/coding-llms-from-the...
A series of Jupyter notebooks explaining the whole machine learning mechanism, from the beginning
https://github.com/nickyreinert/DeepLearning-with-PyTorch-fr...
and of course also how to build an llm from scratch
https://github.com/nickyreinert/basic-llm-with-pytorch/blob/...
The engineering was horrible and very ad-hoc but I learned a lot. Results were ok-ish (I classified tweets) but it gave me a good perspective on the sheer GPU power (and engineering challenges) one would need to do this seriously. I didn't fully grasp the potential of generating output but spent quite some time chuckling at generated tweets (was just curious to try it).
...nanoGPT targets reproducing GPT-2 (124M params) and covers a lot of ground. This project strips it down to the essentials and scales it to a ~10M param model that trains on a laptop in under an hour...
runs on a Blackwell 6000 Max-Q, using 86GB VRAM. Training supposedly takes 3h40m
I doubt you have a machine big enough to make it "Large".
I'm not saying it's worth it but you don't need to buy a GPU yourself to be able to train.
And it's paired with 48 processor cores! I mean, they don't even support AVX512 but they can do math!
I could totally train a LLM! Or at least my family could... might need my kid to pick up and carry on the project.
But in all seriousness... you either missed the point, are being needlessly pedantic, or are... wrong?
This is about learning concepts, and the rest of this is mostly moot.
On the pedantic or wrong notes--What is the documented cut-off for a "large" language model? Because GPT-2 was and is described as a "large" language model. It had 1.5B parameters. You can just about get a consumer GPU capable of training that for about $400 these days.
In my own very humble opinion, it becomes "Large" when it's out of non-specialized hardware. So currently, a model which requires more than 32GB vram is large (as that's roughly where the high-end gaming GPUs cut off).
And btw, there is no way you can train a language model on a CPU, even with ddr5, lest you wait a whole week for a single training cycle. Give it a go! I know I did, it's a magnitude away from being feasible.
GPT would have been a better term than LLM, but unfortunately became too associated with OpenAI. And then, what about non-transformer LLMs? And multimodal LLMs?
Maybe we should just give up, shrug and call it "AI".
And no one is stopping anyone from tweaking few parameters in this repo to go above 10M parameters.
But that is just me. I think is more useful to understand the how and whys before training a LLM.