In hindsight it makes total sense - generative image models don't automatically start out with any notion of semantic meaning or of the world, so they have to learn one implicitly during training. That's a hard task by itself, and the network isn't trained for it specifically; it picks it up on the go, at the same time as it learns to create images. The idea of the paper, then, is to give the diffusion model a preexisting concept of the world by nudging its internal representations to be similar to the visual encoder's. As I understand it, DINO isn't even used during inference once the model is ready; it only matters for the representations learned during training.
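Roughly, the extra training objective could look something like the sketch below. This is just my reading of the idea, not the paper's actual code; the `proj` head, the function names, and the cosine-similarity choice are assumptions for illustration.

```python
# Sketch: align an intermediate hidden state of the denoiser with frozen
# DINO features of the clean image, via a small trainable projection head.
import torch
import torch.nn.functional as F

def alignment_loss(denoiser_hidden, dino_features, proj):
    """denoiser_hidden: (B, T, d_model) intermediate activations of the denoiser
    dino_features:   (B, T, d_dino) features from a frozen DINO encoder
    proj:            small trainable MLP mapping d_model -> d_dino
    """
    pred = F.normalize(proj(denoiser_hidden), dim=-1)
    target = F.normalize(dino_features.detach(), dim=-1)   # DINO stays frozen
    return 1.0 - (pred * target).sum(dim=-1).mean()        # 1 - mean cosine sim

# During training the total loss would be something like:
#   loss = denoising_loss + lambda_align * alignment_loss(h, dino(x0), proj)
# At inference time neither DINO nor the projection head is needed.
```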
I wouldn't at all describe it as "a technique for transplanting an existing model onto a different architecture". It's different from distillation because again, DINO isn't an image generation model at all. It's more like (very roughly simplifying for the sake of analogy) instead of teaching someone to cook from scratch, we're starting with a chef who already knows all about ingredients, flavors, and cooking techniques, but hasn't yet learned to create dishes. This chef would likely learn to create new recipes much faster and more effectively than someone starting from zero knowledge about food. It's different from telling them to just copy another chef's recipes.
https://arxiv.org/pdf/1912.06719v1
And, arguably, Facebook's unsupervised pre-training for their multi-modal speech-to-text models is kind of the same idea as unsupervised pre-training for a multi-modal text-to-image diffuser.
https://ai.meta.com/research/publications/wav2vec-2.0-a-fram...
Sequential generation used to be state of the art in 2016 and it's basically how current LLMs work:
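For concreteness, this is the sequential, token-by-token loop in miniature (a sketch only; `model` here is a stand-in for any next-token predictor, not a specific library API):

```python
# Autoregressive generation: sample one token, append it, repeat.
import torch

def generate(model, prompt_ids, max_new_tokens=50):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(torch.tensor([ids]))[0, -1]   # logits for the next token
        next_id = torch.distributions.Categorical(logits=logits).sample().item()
        ids.append(next_id)                          # feed it back in as context
    return ids
```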
I am not sure that a diffusion approach is all that suitable for generating language. Words are much more discrete than pixels.
Diffusion doesn't work on pixels directly either; it works on a latent representation.
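To illustrate the latent-space point, here's a rough sketch of the forward noising step applied to autoencoder latents rather than raw pixels (the `vae` object and function names are placeholders, not a particular library's API):

```python
# Forward process in latent space: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps
import torch

def noise_latents(vae, images, timesteps, alphas_cumprod):
    with torch.no_grad():
        latents = vae.encode(images)                 # (B, C, h, w) latents, not pixels
    eps = torch.randn_like(latents)
    a_bar = alphas_cumprod[timesteps].view(-1, 1, 1, 1)
    noisy = a_bar.sqrt() * latents + (1 - a_bar).sqrt() * eps
    return noisy, eps   # the denoiser is trained to recover eps from noisy latents
```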
A fairly new but promising approach for autoregression that seems to scale as well as diffusion is predicting the next image scale/resolution rather than the next image patch.
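Very roughly, that next-scale loop looks like the sketch below: the model predicts the whole token map at the next, larger resolution, conditioned on all the coarser ones. The `model` call and the scale schedule are made-up placeholders, not any real implementation.

```python
# Sketch of next-scale autoregression: coarse-to-fine, one whole scale per step.
import torch

def generate_by_scales(model, scales=(1, 2, 4, 8, 16)):
    context = []                                   # token maps at coarser scales
    for s in scales:
        logits = model(context, target_scale=s)    # (1, s*s, vocab_size)
        tokens = torch.distributions.Categorical(logits=logits).sample()
        context.append(tokens.view(1, s, s))       # next step conditions on this
    return context[-1]                             # finest-scale token map, to be decoded to pixels
```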
However, diffusion models suck at details, like how many fingers a hand has, and with language the words and characters matter: both which ones they are and where they go.
So while I'm sure diffusion could produce walls of text that look convincingly like a blog post at a glance, I'm not sure the result would hold up to anyone actually reading it.