This looks really cool. Also nice to see another architecture being used for image generation besides diffusion. It seems like every NLP problem can be solved with transformers now: text generation/understanding, image generation/understanding, translation, OCR. Perhaps llama 4/5 will have image generation as well. eidt: llama 3.2 already has image editing, they probably just don't want to release an image generator for other reasons.
From scratch training of an image synthesis model for the price of a graphic card isn't something I expected anytime soon!