I started to believe it after we (Joel Hestness in particular) reproduced it in so many experiments in “Deep Learning Scaling is Predictable, Empirically”.
The OpenAI work replicated it in a completely different environment, and at that point I was sure it was real.
Sometimes people ask me why I was so surprised by it. Prior work like Banko and Brill and “The Unreasonable Effectiveness of Data” argued for more data. ML theory had similar models for toy problems, e.g., coin flips.
At the time I thought deep learning was supposed to be complex. Speech and language datasets seemed much more complex than toy problems. Optimization of deep transformers was complex.
The idea that the whole thing could be governed by a three-term equation seemed too simple. The implication was that it was simple to manufacture intelligence.
Ten years later, I still think it is the most interesting observation I have seen. We are still learning what it looks like to live in a world where it is possible to manufacture intelligence.
GPT-3 was 175B parameters, yet models like Gemma4 with 31B vastly outperform it, so there is more to it.
As Karpathy noted, the initial GPTs were trained on complete garbage (literally, the average document from the Common Crawl is random nonsense), yet they worked. Now we can use present LLMs to curate the data for the next generation.
There is a lot of follow-on work that explains what happens as you change them, e.g. Scaling Laws for Transfer: https://arxiv.org/pdf/2102.01293
I think it’s fortunate that transfer works in a similar way.
Common Crawl (and Reddit, Stack Overflow, etc., but not 4chan) was much easier to get access to at the time than using Mechanical Turk.
There is certainly room for more work. There were many papers on scaling laws in NeurIPS this year.
From the Kaplan scaling laws paper:
> We have observed consistent scalings of language model log-likelihood loss with non-embedding parameter count N, dataset size D, and optimized training computation C_min, as encapsulated in Equations (1.5) and (1.6). Conversely, we find very weak dependence on many architectural and optimization hyperparameters. Since scalings with N, D, C_min are power-laws, there are diminishing returns with increasing scale.
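To make the "diminishing returns" point in that quote concrete, here is a minimal Python sketch of a pure power law in parameter count, L(N) = (N_c / N)^alpha_N. The constants are my recollection of the approximate fit reported in the Kaplan paper and should be treated as illustrative assumptions, not authoritative values; the function names are made up for this example.

```python
# Illustrative constants (assumed, roughly the L(N) fit from Kaplan et al.).
ALPHA_N = 0.076   # power-law exponent for non-embedding parameter count
N_C = 8.8e13      # scale constant, in non-embedding parameters


def loss_from_params(n_params: float) -> float:
    """Loss predicted by a pure power law in parameter count N."""
    return (N_C / n_params) ** ALPHA_N


def doubling_ratio() -> float:
    """Factor by which the predicted loss shrinks each time N doubles."""
    return 2.0 ** (-ALPHA_N)


if __name__ == "__main__":
    previous = None
    for n in (1e8, 1e9, 1e10, 1e11):
        current = loss_from_params(n)
        gain = "" if previous is None else f"  (improvement {previous - current:.3f})"
        print(f"N = {n:.0e}: loss = {current:.3f}{gain}")
        previous = current
    print(f"each doubling of N multiplies loss by {doubling_ratio():.4f}")
```

Under these assumed constants, every 10x in parameters buys a smaller absolute drop in loss than the previous 10x did, which is what a power law with a small exponent implies.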
So the skeptics are right to be skeptical of LLMs being all you need for continued advancement in this space. It seems like the believers are the ones who need to learn about the scaling laws.
And so if the datastream has been produced by something intelligent, the resulting model is indistinguishable from that intelligence through the observed data. That is the whole compression idea behind artificial intelligence.
The limit is not a bug, it's a feature!
Also, compute scales quadratically with context length because of attention, so even linear gains in context get expensive fast. Depending on context growth means taking a bath on GPUs for as long as you can, right?
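A rough sketch of the quadratic term, assuming plain dense self-attention: it counts only the two n-by-n matrix products per layer (QK^T and the attention-weighted sum of V) and ignores projections, MLP blocks, and any sparse or approximate attention variant. The function name and the example sizes are hypothetical.

```python
def attention_matmul_flops(seq_len: int, d_model: int) -> int:
    """Approximate FLOPs for the two n x n matmuls in one dense attention layer.

    QK^T costs about 2 * n^2 * d and the weighted sum over V costs about
    2 * n^2 * d, counting a multiply-add as 2 FLOPs.
    """
    return 4 * seq_len ** 2 * d_model


if __name__ == "__main__":
    d_model = 4096  # assumed model width, for illustration only
    base = attention_matmul_flops(8_192, d_model)
    for n in (8_192, 16_384, 32_768, 65_536):
        flops = attention_matmul_flops(n, d_model)
        print(f"context {n:>6}: {flops / base:.0f}x the attention FLOPs of 8192")
```

Doubling the context quadruples this term, so an 8x longer context costs about 64x as much in the attention matmuls alone under these assumptions.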
https://aclanthology.org/anthology-files/anthology-files/pdf...
"Significantly, we found that translation quality as indicated by BLEU score continues to improve with increasing language model size, at even the largest sizes considered. This finding underscores the value of being able to train and apply very large language models, and suggests that further performance gains may be had by pursuing this direction further."