First, a couple of facts:
1) An ANN works by learning decision boundaries that separate and group training samples and their associated labels.
2) If you train an overparameterized net on random data, it will memorize it. If you train it on consistently labelled, structured data lying on some lower-dimensional manifold, it will generalize rather than memorize. So the behavior depends on the nature of the data it is trained on.
Now the hand-wavy bit:
As training progresses the weights move the decision surfaces around until each training sample maps to a region of output space corresponding to the correct label, with these regions of output/latent space being separated by the learnt decision surfaces.
Initially during training (up to the double descent phase, in cases where that happens) each of these regions of "gerrymandered" output space may correspond to only one or a few training samples, so there may be multiple disconnected regions each mapping to the label "cat", and another group of disconnected regions each mapping to the label "dog". This is the overfitting phase.
Now, if the data permits, with the data manifold being consistently labelled (nothing that looks like a cat being labelled a dog), there will often be potential to merge some of these disconnected regions of output space that map to the same label. So, for example, we might go from four small regions of "cat" space to two larger merged regions of "cat" space. This is the mechanism of generalization, with the extra space enclosed by the merged regions corresponding to interpolation - no training samples "forced" those larger merged regions, but none prevented them either (there was no "dog" that looks like a cat sitting in between).
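To make the "disconnected regions" picture a bit more concrete, here is a toy sketch. It is not a neural net: it swaps in a 1-nearest-neighbour classifier on a one-dimensional input space (my assumption, chosen only because its regions are easy to count), and the names `nearest_label` and `count_regions` are made up for illustration. It counts how many separate "cat" regions appear when labels are consistent versus when they are interleaved.

```python
def nearest_label(x, samples):
    """Label of the training sample closest to x (1-nearest-neighbour stand-in)."""
    return min(samples, key=lambda s: abs(s[0] - x))[1]


def count_regions(samples, label, grid_size=1000):
    """Count contiguous runs of `label` along an evenly spaced grid over [0, 1]."""
    regions, inside = 0, False
    for i in range(grid_size):
        x = i / (grid_size - 1)
        hit = nearest_label(x, samples) == label
        if hit and not inside:
            regions += 1  # entered a new region carrying this label
        inside = hit
    return regions


# Ten training points on a line (hypothetical data).
xs = [0.05 + 0.1 * i for i in range(10)]

# Consistent labelling: all cats on the left, all dogs on the right.
consistent = [(x, "cat" if i < 5 else "dog") for i, x in enumerate(xs)]

# Interleaved labelling: every other sample is a cat, so each cat is isolated.
scrambled = [(x, "cat" if i % 2 == 0 else "dog") for i, x in enumerate(xs)]

cat_regions_consistent = count_regions(consistent, "cat")  # 1 merged region
cat_regions_scrambled = count_regions(scrambled, "cat")    # 5 isolated regions
```

With consistent labels nothing sits between the cats, so the five cat samples share one region. With interleaved labels every cat keeps its own island, which is the memorization-like case.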
The question then remains why the dynamics of training cause the decision surfaces to be highly "gerrymandered" at first (because it's easier?), but to merge on continued training (because without any dogs among the cats there is no reason not to, and once merged there is no label error to make them unmerge - a ratcheting process from smaller to larger regions with increasing generalization?).
What's more interesting is why double descent happens.
One intuitive way of looking at it is like so - let's say that you have a gaussian-looking plot. You want to fit a gaussian. You have a stupid simple model where you can slide your gaussian left and right.
If your starting point happens to be roughly within range, great: your optimizer will take care of it and slide the gaussian into the correct place. If you start too far away, too bad: the model and the data barely overlap, so there is no meaningful gradient.
Instead, neural nets give you the option to spawn a gaussian anywhere you please. In this case, no sliding is necessary, but it comes at a heavy parametrization cost.
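A rough sketch of the single sliding gaussian described above, in plain Python. Everything here is an assumption for illustration: unit width, a target centred at 0, a squared-error loss summed over a grid, and the names `bump`, `grad` and `fit`. It only shows that gradient descent recovers the centre from a nearby start and stays put from a distant one.

```python
import math


def bump(x, mu):
    """Unit-width gaussian centred at mu (the only free parameter)."""
    return math.exp(-0.5 * (x - mu) ** 2)


DX = 0.1
XS = [i * DX for i in range(-300, 301)]   # grid over [-30, 30]
TARGET = [bump(x, 0.0) for x in XS]       # the data: a gaussian centred at 0


def grad(mu):
    """d/dmu of the loss DX * sum((bump(x, mu) - y)^2) over the grid."""
    total = 0.0
    for x, y in zip(XS, TARGET):
        f = bump(x, mu)
        # d bump / d mu = bump * (x - mu)
        total += 2.0 * (f - y) * f * (x - mu)
    return DX * total


def fit(mu, lr=0.1, steps=200):
    """Plain gradient descent on the centre mu."""
    for _ in range(steps):
        mu -= lr * grad(mu)
    return mu


near = fit(1.5)    # starts within range: slides to roughly 0
far = fit(15.0)    # starts far away: gradient is ~0, so it barely moves
```

The far start fails for no deep reason: the model and the data have essentially no overlap, so the loss is flat in `mu` there.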
1. Avoid overparameterization by design. Manually create or choose a space of functions that has limited degrees of freedom by construction.
2. Accept overparameterization and regularize.
The latter tends to be more robust, because of the bitter lesson. It's not practical to manually design an ideal, on-demand, just-right limited-parameter model for every dataset we are presented with. The best way to approach that ideal, it turns out, is really to just let the computer figure it out via regularized optimization over an overparameterized space.
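As a hedged illustration of option 2, here is a small pure-Python sketch: 40 gaussian bumps at fixed centres (far more parameters than the 10 data points), fitted with ridge regression. The data, the bump width, the noise pattern and the helper names are all made up for the example. It uses the dual form of ridge, w = Phi^T (Phi Phi^T + lam I)^-1 y, which gives the minimum-norm interpolant at lam = 0.

```python
import math


def features(x, centers, width=0.5):
    """One gaussian bump per centre, evaluated at x."""
    return [math.exp(-0.5 * ((x - c) / width) ** 2) for c in centers]


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def ridge_fit(xs, ys, centers, lam):
    """Dual-form ridge weights; lam = 0 gives the minimum-norm interpolant."""
    Phi = [features(x, centers) for x in xs]
    n = len(xs)
    G = [[sum(a * b for a, b in zip(Phi[i], Phi[j])) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(G, ys)
    return [sum(alpha[i] * Phi[i][k] for i in range(n))
            for k in range(len(centers))]


def predict(x, w, centers):
    return sum(wk * fk for wk, fk in zip(w, features(x, centers)))


def train_error(w, xs, ys, centers):
    """Largest absolute residual on the training points."""
    return max(abs(predict(x, w, centers) - y) for x, y in zip(xs, ys))


def norm(w):
    return math.sqrt(sum(wk * wk for wk in w))


# 10 points from a gaussian, with a deterministic +/-0.1 "noise" pattern.
xs = [-3.0 + 6.0 * i / 9 for i in range(10)]
ys = [math.exp(-0.5 * x * x) + (0.1 if i % 2 else -0.1) for i, x in enumerate(xs)]
centers = [-4.0 + 8.0 * k / 39 for k in range(40)]

w_interp = ridge_fit(xs, ys, centers, lam=0.0)  # fits the noise exactly
w_ridge = ridge_fit(xs, ys, centers, lam=1.0)   # smaller weights, nonzero residual
```

The unregularized fit reproduces every training point, noise included. The ridge fit gives up some training accuracy in exchange for smaller weights. Whether that helps on held-out data depends on the data and on `lam`, which this sketch does not try to tune.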
Statisticians started moving in favor of overparameterization long before deep learning got off the ground. This trend dates back at least to the machine learning bible, The Elements of Statistical Learning (2001).
Could you elaborate on this?