Hey, I can see kamon, karai, anna, and anton in the dataset; it'd be worth using some other names: https://raw.githubusercontent.com/karpathy/makemore/988aa59/...
In 3 days they've covered machine learning, geometry, cryptography, file formats and directory services.
The "TRAINING" visualization does seem synthetic, though: the graph is a bit too "perfect", and it's odd that the generated names don't update on every step.
"How wrong was the prediction? We need a single number that captures "the model thought the correct answer was unlikely." If the model assigns probability 0.9 to the correct next token, the loss is low (0.1). If it assigns probability 0.01, the loss is high (4.6). The formula is −log(p), where p is the probability the model assigned to the correct token. This is called cross-entropy loss."
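The quoted numbers are easy to check yourself; a minimal sketch of the per-token loss (using the natural log, as the article's 0.1 and 4.6 figures imply):

```python
import math

def cross_entropy_loss(p_correct):
    """-log of the probability the model assigned to the correct token."""
    return -math.log(p_correct)

# A confident, correct prediction gives a small loss...
print(round(cross_entropy_loss(0.9), 3))   # 0.105
# ...while a near-zero probability on the right answer gives a large one.
print(round(cross_entropy_loss(0.01), 3))  # 4.605
```

In practice the loss is averaged over every token position in the batch, but each position contributes exactly this −log(p) term.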
"The MLP (multilayer perceptron) is a two-layer feed-forward network: project up to 64 dimensions, apply ReLU (zero out negatives), project back to 16"
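That block is small enough to write out directly. A NumPy sketch of the shape of the computation described in the quote (the weights here are random placeholders, not anything the model learned):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 16, 64  # dimensions from the quoted description

# Placeholder weights; in the real model these are learned parameters.
W1 = rng.normal(scale=0.02, size=(d_model, d_hidden))
W2 = rng.normal(scale=0.02, size=(d_hidden, d_model))

def mlp(x):
    """Project up to 64 dims, ReLU (zero out negatives), project back to 16."""
    h = np.maximum(0.0, x @ W1)  # up-projection + ReLU
    return h @ W2                # down-projection

x = rng.normal(size=(d_model,))
y = mlp(x)
print(y.shape)  # (16,)
```

(The real implementation also has bias terms and operates on batches of token vectors, but the two matmuls plus ReLU are the whole idea.)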
Which starts to feel pretty owly indeed.
I think the whole thing could be expanded to cover more of the material in greater depth.
For a long time, it seemed the answer was "it doesn't." But now, using Claude Code daily, it seems it does.
An enormous amount of research and engineering work (most of the work of the frontier labs) is being poured into making that 'correct' modifier happen, rather than just predicting the next token from 'the internet' (the naive original training corpus). This work takes the form of improved training data (e.g. expert annotations), human-feedback finetuning (e.g. RLHF), and most recently reinforcement learning (e.g. RLVR, RL with verifiable rewards), where the model is trained to find the correct answer to a problem without token-level guidance. RL for LLMs is a very hot research area and very tricky to get right.
Microgpt