The lottery ticket hypothesis: why neural networks work
31 points | 4 hours ago | 11 comments | nearlyright.com
doctoboggan
2 minutes ago
[-]
This article definitely feels like chatgptese.

Also, I don't feel like the size of LLMs comes anywhere close to what would be needed to overfit (i.e., memorize) the training data. From a very unscientific standpoint, it seems like the weights on disk would have to meet or exceed the size of the training data (modulo lossless compression techniques). Since the training data is orders of magnitude larger than the weights, isn't that evidence that the weights are some sort of generalization of the input data rather than a memorization of it?
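
A rough back-of-envelope check (the specific numbers here are my own illustrative assumptions, not from the article):

    # Assumed, illustrative figures: a ~70B-parameter model stored at
    # 2 bytes/weight vs. a ~10-trillion-token corpus at ~2 bytes/token.
    params = 70e9
    bytes_per_param = 2
    corpus_tokens = 10e12
    bytes_per_token = 2

    weights_gb = params * bytes_per_param / 1e9          # ~140 GB on disk
    corpus_gb = corpus_tokens * bytes_per_token / 1e9    # ~20,000 GB of text

    print(corpus_gb / weights_gb)  # roughly 140x more data than weights

Even generous lossless compression wouldn't close a gap that size, which is the point: the weights can't simply be storing the corpus verbatim.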

reply
highfrequency
43 minutes ago
[-]
Enjoyed the article. To play devil’s advocate, an entirely different explanation for why huge models work: the primary insight was framing the problem as next-word prediction. This immediately creates an internet-scale dataset with trillions of labeled examples, which also has rich enough structure to make huge expressiveness useful. LLMs don’t disprove bias-variance tradeoff; we just found a lot more data and the GPUs to learn from it.

It’s not like people didn’t try bigger models in the past, but either the data was too small or the structure too simple to show improvements with more model complexity. (Or they simply trained the biggest model they could fit on the GPUs of the time.)
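
To make the "trillions of labeled examples" bit concrete, here's a toy sketch (my own example) of how next-word prediction turns raw text into (context, label) pairs with no human annotation:

    def next_word_examples(text, context_size=3):
        """Turn raw text into (context, next-word) training pairs."""
        words = text.split()
        return [
            (words[max(0, i - context_size):i], words[i])
            for i in range(1, len(words))
        ]

    for context, label in next_word_examples("the cat sat on the mat"):
        print(context, "->", label)
    # ['the'] -> cat
    # ['the', 'cat'] -> sat
    # ['the', 'cat', 'sat'] -> on
    # ...

Every position in every document is a free training example, which is how the dataset gets to internet scale.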

reply
pixl97
9 minutes ago
[-]
I think a lot of it is the massive amount of compute we've gained in the last decade. While inference may have been possible on the hardware of earlier eras, the training would have taken lifetimes.
reply
nitwit005
5 minutes ago
[-]
> For over 300 years, one principle governed every learning system

This seems strangely worded. I assume that date refers to when some statistics paper was published, but there's no way to know without a definition or citation.

reply
derbOac
1 hour ago
[-]
In some sense, isn't this overfitting, but "hidden" by the typical feature sets that are observed?

Time and time again, some process will identify a simple but absurd adversarial "trick stimulus" that throws off the deep network's solution. These seem like blatant cases of overfitting that go unrecognized or unchallenged in everyday use because the sampling space of stimuli doesn't usually include the adversarial trick stimuli.

I guess I've not really thought of the bias-variance tradeoff as necessarily being about the number of parameters, but rather about the flexibility of the model relative to the learnable information in the sample space. There are some formulations (e.g., Shtarkov-Rissanen normalized maximum likelihood) that treat overfitting in terms of the ability to reproduce data wildly outside a typical training set. This is related to, but not the same as, the number of parameters per se.
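
For reference, a rough sketch of the NML idea (my notation, simplified): for a model class {p_\theta}, the normalized maximum likelihood distribution over a dataset x^n is

    p_NML(x^n) = p_{\hat\theta(x^n)}(x^n) / \sum_{y^n} p_{\hat\theta(y^n)}(y^n)

where \hat\theta(x^n) is the maximum-likelihood fit to x^n. The log of the denominator (the Shtarkov sum) acts as the complexity penalty: it measures how well the model class can fit every conceivable dataset, which tracks flexibility rather than raw parameter count.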

reply
ghssds
13 minutes ago
[-]
Can someone explain how AI research can have a 300-year history?
reply
xg15
1 hour ago
[-]
Wouldn't this imply that most of the inference time storage and compute might be unnecessary?

If the hypothesis is true, it makes sense to scale up models as much as possible during training - but once the model is sufficiently trained for the task, wouldn't 99% of the weights be literal "dead weight" - because they represent the "failed lottery tickets", i.e. the subnetworks that did not have the right starting values to learn anything useful? So why do we keep them around and waste enormous amounts of storage and compute on them?

reply
FuckButtons
12 minutes ago
[-]
For any particular pattern learned, 99% of the weights are dead weight. But it's not the same 99% for each lesson learned.
reply
tough
1 hour ago
[-]
Someone on Twitter was exploring this and linked to some related papers where you can, for example, trim experts from an MoE model if you're 100% sure they're never active for your specific task.
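
A toy sketch of the idea (hypothetical code, not from the linked papers): run a calibration set for your task through the router, count how often each expert is selected, and drop the ones that never fire.

    import numpy as np

    def experts_to_keep(router_logits, top_k=2, min_hit_rate=0.0):
        """Keep experts the router actually selects on a task-specific
        calibration set; router_logits has shape (num_tokens, num_experts)."""
        num_tokens, num_experts = router_logits.shape
        # Top-k experts the router would activate for each token.
        topk = np.argsort(router_logits, axis=1)[:, -top_k:]
        hits = np.bincount(topk.ravel(), minlength=num_experts)
        return np.where(hits / num_tokens > min_hit_rate)[0]

    # Toy example: three of eight experts are effectively dead for this task.
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(1000, 8))
    logits[:, [2, 5, 7]] -= 100.0
    print(experts_to_keep(logits))  # -> [0 1 3 4 6]

In practice you'd want a much larger calibration set (and some margin) before declaring an expert truly inactive.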

What the bigger, wider net buys you is generalization.

reply
paulsutter
37 minutes ago
[-]
That’s exactly how it works; read up on pruning. You can ignore most of the weights and still get great results. One issue is that sparse matrices are vastly less efficient to multiply.

But yes you’ve got it
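
A minimal sketch of one-shot magnitude pruning (illustrative only; real pipelines usually prune iteratively with fine-tuning in between):

    import numpy as np

    def magnitude_prune(weights, sparsity=0.9):
        """Zero out the smallest-magnitude weights, keeping the top (1 - sparsity)."""
        threshold = np.quantile(np.abs(weights), sparsity)
        mask = np.abs(weights) >= threshold
        return weights * mask, mask

    rng = np.random.default_rng(0)
    w = rng.normal(size=(512, 512))
    w_pruned, mask = magnitude_prune(w, sparsity=0.9)
    print(mask.mean())  # ~0.1 of the weights survive

The pruned matrix is mostly zeros, but unless the hardware actually exploits that sparsity, the multiply costs the same as the dense one, which is the efficiency issue mentioned above.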

reply
markeroon
52 minutes ago
[-]
Look into pruning
reply
abhinuvpitale
39 minutes ago
[-]
Interesting article. Is it concluding that different small networks are formed for the different types of problems we're trying to solve with the larger network?

How is this different from overfitting, though? (PS: overfitting isn't that bad if you think about it, as long as the test set or inference-time queries stay within the scope of the supposedly large enough training dataset.)

reply
deepfriedchokes
36 minutes ago
[-]
Rather than reframing intelligence itself, wouldn’t Occam’s Razor suggest instead that this isn’t intelligence at all?
reply
gotoeleven
18 minutes ago
[-]
This article gives a really bad/wrong explanation of the lottery ticket hypothesis. Here's the original paper:

https://arxiv.org/abs/1803.03635

reply
belter
41 minutes ago
[-]
This article is like a quick street rap: lots of rhythm, not much thesis. Big on tone, light on analysis... no actual argument beyond a feel-good factor. I want those 5 minutes back.
reply
api
1 hour ago
[-]
This sounds like it's proposing that what's happening during large model training is a little bit akin to genetic algorithms: many small networks emerge and there is a selection process, some get fixed, and the rest fade and are then repurposed/drifted into other roles, repeat.
reply