https://www.practical-diffusion.org/lectures/
There's also a more math-heavy course: https://diffusion.csail.mit.edu/2026/index.html
The most common approach to modeling a continuous distribution is to train a reversible model f that maps it to another continuous distribution P that is already known. The original image can be recovered from its latent, and the number of bits needed to encode an image x is the bits needed to encode its latent f(x) plus a correction for how much the map stretches volume:
−log P(f(x)) − log|det ∂f/∂x (x)|
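This change-of-variables formula can be checked numerically on a toy invertible map. Below, f(x) = (x − μ)/σ sends N(μ, σ²) data to the standard normal (μ and σ are made-up parameters for illustration); the formula then reproduces the exact negative log-density of the data distribution:

```python
import numpy as np

# Toy invertible "flow": f(x) = (x - mu) / sigma maps N(mu, sigma^2)
# data to the standard normal. mu, sigma are illustrative choices.
mu, sigma = 2.0, 1.5

def std_normal_neg_logpdf(z):
    # -log N(z; 0, 1)
    return 0.5 * np.log(2 * np.pi) + 0.5 * z ** 2

def nll_via_flow(x):
    z = (x - mu) / sigma               # latent f(x)
    log_det_jac = np.log(1.0 / sigma)  # log|det df/dx| = log(1/sigma) in 1D
    # -log P(f(x)) - log|det df/dx|
    return std_normal_neg_logpdf(z) - log_det_jac

def nll_direct(x):
    # exact -log N(x; mu, sigma^2), for comparison
    return 0.5 * np.log(2 * np.pi) + np.log(sigma) + (x - mu) ** 2 / (2 * sigma ** 2)

x = 3.7
print(nll_via_flow(x), nll_direct(x))  # agree up to float rounding
```

For this affine map the two quantities match exactly; for a neural-network f the log-determinant term is what gets expensive.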
This technique is known as normalizing flows, since usually a normal distribution is chosen as the known distribution. The second term can be hard to compute, so diffusion models approximate it by using a stochastic differential equation (SDE) for the mapping. When f is the solution to an ordinary differential equation, dx/dt = g(x),
then log|det ∂f/∂x (x)| = ∫ Tr(∂g(x)/∂x) dt = ∫ E_{ε∼N(0,I)} [εᵀ ∂g(x)/∂x ε] dt
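The trace identity Tr(A) = E_{ε∼N(0,I)}[εᵀAε] behind the last equality is easy to sanity-check numerically. Here A is just a random matrix standing in for the Jacobian ∂g/∂x:

```python
import numpy as np

# Check Tr(A) = E[eps^T A eps] for eps ~ N(0, I) by Monte Carlo,
# with a random matrix A standing in for the Jacobian dg/dx.
rng = np.random.default_rng(0)
d = 20
A = rng.standard_normal((d, d))

exact = np.trace(A)

n_samples = 100_000
eps = rng.standard_normal((n_samples, d))
# eps^T A eps for each sample, then average over samples
estimate = ((eps @ A) * eps).sum(axis=1).mean()

print(exact, estimate)  # estimate concentrates around the exact trace
```

The payoff in high dimensions is that εᵀ(∂g/∂x)ε only needs a Jacobian-vector product, never the full Jacobian.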
That trace identity, Tr(A) = E_{ε∼N(0,I)}[εᵀAε], is known as Hutchinson's estimator. Switching to a stochastic differential equation dx′ = g(x′)dt + ε(t)dW
and tracking the difference δx = x′ − x, the squared deviation approximately satisfies d(δxᵀδx)/dt = 2δxᵀ ∂g(x)/∂x δx,
which is close to Hutchinson's estimator, but weighted a little strangely.

Flow maps / consistency models / shortcut models instead try to learn to compress this iterative work into a single forward pass. This makes inference roughly 100x faster, since you only need to run the neural net forward once. Beyond speeding up inference, there are other benefits, such as an improved ability to perform inference-time steering.
Mathematically, learning a flow map corresponds to learning to solve an ordinary differential equation, i.e., learning the time integral of the velocity field. This foundation gives rise to the various training objectives for flow maps, which are built either on self-referential identities or on identities such as the transport equation; both are discussed in the blog post.
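A minimal 1D sketch of that idea, using a made-up Gaussian-to-Gaussian transport x₁ = μ + σx₀ with the linear interpolant x_t = (1−t)x₀ + tx₁, so the velocity field and its time integral (the flow map) are both available in closed form rather than learned:

```python
# Toy flow map: the velocity field along the interpolant
# x_t = (1 - t) x0 + t (mu + sigma * x0) works out to
#   v(x, t) = (sigma - 1)(x - t*mu) / (1 - t + t*sigma) + mu.
# mu, sigma and the pairing x1 = mu + sigma * x0 are illustrative choices.
mu, sigma = 1.0, 2.0

def velocity(x, t):
    return (sigma - 1.0) * (x - t * mu) / (1.0 - t + t * sigma) + mu

def sample_by_ode(x0, n_steps=1000):
    # Diffusion-style sampling: many small Euler steps of dx/dt = v(x, t)
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity(x, i * dt)
    return x

def flow_map(x0):
    # One-step flow map: the time integral of v, in closed form here
    return mu + sigma * x0

x0 = -0.3
print(sample_by_ode(x0), flow_map(x0))  # agree up to Euler discretization error
```

In a real model the velocity (and the flow map distilled from it) are neural networks; the point of the sketch is just that one evaluation of the integral replaces the whole Euler loop.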
Hope that helps! I'm an ML researcher currently researching flow maps.
To be able to specify that subset with relatively few examples, you need a good high-level understanding of the data distribution. The way I see it, training a diffusion model gets you to that point, and then, once you've selected the part of the distribution you actually care about, you can distill it down quite aggressively, because you no longer need all of that computation to model a much simpler distribution (sometimes all the way down to one step, though in practice it's usually a few steps).
I haven't read it carefully, but I think it's pretty comprehensive: from the SDE to the flow-matching formulation, plus the different perspectives on constructing flow maps, i.e. the x-formulation and the v-formulation. It also covers distillation and consistency, which are used for fast sampling.
Overall, it's a good read if you are new to the field.
Extreme TL;DR: Diffusion models are like getting f(x) by calculating and summing f′(0), f′(1), ..., f′(x). Flow maps are like just calculating f(x) directly.
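That analogy made literal, with the made-up example f(x) = x² at x = 4: sum the derivative over many small steps (diffusion-style) versus evaluating the function once (flow-map-style):

```python
# Diffusion-style: accumulate f'(t) over many tiny steps from 0 to 4.
def f_prime(t):
    return 2 * t  # derivative of f(x) = x**2

n_steps = 10_000
dt = 4.0 / n_steps
iterative = sum(f_prime(i * dt) * dt for i in range(n_steps))  # many cheap steps

# Flow-map-style: one evaluation of f itself.
direct = 4.0 ** 2

print(iterative, direct)  # ~16 vs 16: same answer, very different cost profile
```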