Modeling data distribution is challenging; DDN adopts a simple yet fundamentally different approach compared to mainstream generative models (Diffusion, GAN, VAE, autoregressive model):
1. The model generates multiple outputs simultaneously in a single forward pass, rather than just one output. 2. It uses these multiple outputs to approximate the target distribution of the training data. 3. These outputs together represent a discrete distribution. This is why we named it "Discrete Distribution Networks".
Every generative model has its unique properties, and DDN is no exception. Here, we highlight three characteristics of DDN:
- Zero-Shot Conditional Generation (ZSCG). - One-dimensional discrete latent representation organized in a tree structure. - Fully end-to-end differentiable.
Reviews from ICLR:
> I find the method novel and elegant. The novelty is very strong, and this should not be overlooked. This is a whole new method, very different from any of the existing generative models. > This is a very good paper that can open a door to new directions in generative modeling.
The best way I can summarize it is a Mixture-of-Experts combined with an 'x0-target' latent diffusion model. The main innovation is the guided sampler (rather than router) & split-and-prune optimizer; making it easier to train.
(This is mentioned in Q1 in the "Common Questions About DDN" section at the bottom.)
Now I just need a time-turner.
A genuinely interesting and novel approach, I'm very curious how it will perform when scaled up and applied to non-image domains! Where's the best place to follow your work?
Much like DiffusionDet, which applies diffusion models to detection, DDN can adopt the same philosophy. I expect DDN to offer several advantages over diffusion-based approaches: - Single forward pass to obtain results, no iterative denoising required. - If multiple samples are needed (e.g., for uncertainty estimation), DDN can directly produce multiple outputs in one forward pass. - Easy to impose constraints during generation due to DDN's Zero-Shot Conditional Generation capability. - DDN supports more efficient end-to-end optimization, thus more suitable for integration with discriminative models and reinforcement learning.
> Many high rated papers would have been done by someone else if their authors never published them or were rejected. However, if this paper is not published, it is not likely that anyone would come up with this approach. This is real publication value. I am reminding again the original diffusion paper from 2015 (Sohl-Dickstein) that was almost not noticed for 5 years. Had it not been published, would we have had the amazing generative models we have today?
Cite from: https://openreview.net/forum?id=xNsIfzlefG¬eId=Dl4bXmujh1
Besides, we compared DDN with other approaches in the Table 1 of original paper, including VQ-VAE.
The one shown on their page is L=3.
I made an initial attempt to combine [DDN with GPT](https://github.com/Discrete-Distribution-Networks/Discrete-D...), aiming to remove tokenizers and let LLMs directly model binary strings. In each forward pass, the model adaptively adjusts the byte length of generated content based on generation difficulty (naturally supporting speculative sampling).
> To our knowledge, Taiji-DDN is the first generative model capable of directly transforming data into a semantically meaningful binary string which represents a leaf node on a balanced binary tree.
This property excites me just as much.
The current goal in research is scaling up. Here are some thoughts in blog about future directions: https://github.com/Discrete-Distribution-Networks/Discrete-D...
The zero-shot conditional generation part is wild. Most methods rely on gradients or fine-tuning, so I wonder what makes DDN tick there. Maybe the tree structure of the latent space helps navigate to specific conditions without needing retraining? Also, I'm intrigued by the 1D discrete representation—how does that even work in practice? Does it make the model more interpretable?
The Split-and-Prune optimizer sounds new—I'd love to see how it performs against Adam or SGD on similar tasks. And the fact that it's fully differentiable end-to-end is a big plus for training stability.
I also wonder about scalability—can this handle high-res images without blowing up computationally? The hierarchical approach seems promising, but I'm not sure how it holds up when moving from simple distributions to something complex like natural images.
Overall though, this feels like one of those papers that could really shift the direction of generative models. Excited to dig into the code and see what kind of results people get with it!
1. The comparison with GANs and the issue of mode collapse are addressed in Q2 at the end of the blog: https://github.com/Discrete-Distribution-Networks/Discrete-D...
2. Regarding scalability, please see “Future Research Directions” in the same blog: https://github.com/Discrete-Distribution-Networks/Discrete-D...
3. Answers or relevant explanations to any other questions can be found directly in the original paper (https://arxiv.org/abs/2401.00036), so I won’t restate them here.
Similarities: - Both map data to a discrete latent space.
Differences: - VQ-VAE needs an external prior over code indices (e.g. PixelCNN or a hierarchical prior) to model distribution. DDN builds its own hierarchical discrete distribution and can even act as the prior for a VQ-VAE-like system. - DDN’s K outputs are features that change with the input; VQ-VAE’s codebook is a set of independent parameters (embeddings) that remain fixed regardless of the input. - VQ-VAE produces a 2-D grid of code indices; DDN yields a 1-D/tree-structured latent. - VQ-VAE needs Straight-Through Estimator. - DDN supports zero-shot conditional generation.
So I’d call them complementary rather than “80 % the same.” (See the paper’s “Connections to VQ-VAE.”)