Understanding Diffusion Models

When Stable Diffusion generates a photorealistic image from a text prompt, or Sora produces a video clip from scratch, the engine underneath is the same: a diffusion model that turns pure noise into structured data, one small step at a time.

The core challenge of generative modeling is mapping a simple, easy-to-sample distribution (like standard Gaussian noise) to a highly complex distribution (like the manifold of natural images).

Direct mapping methods, such as Generative Adversarial Networks (GANs), try to achieve this transformation in a single step. However, mapping a simple Gaussian space to a complex, discontinuous image manifold is highly non-linear and unstable, leading to optimization problems like mode collapse.

Diffusion models sidestep this by breaking the journey into $T$ microscopic, tractable steps. Instead of asking a neural network to generate a clean image from pure noise in one giant leap, we train it to perform a series of tiny denoising steps, making the learning objective stable and robust.

1. The Thermodynamics of an Ink Drop

To understand the core intuition of diffusion, we can look at a simple physical analogy: releasing a drop of blue ink into a glass of still water.

Initially, the ink is concentrated in a highly structured drop. As time passes, the ink molecules collide randomly with water molecules in a chaotic trajectory known as Brownian motion. Over time, the ink spreads out until the water is a uniform blue. In physics, the Second Law of Thermodynamics states that this process is natural and one-way: entropy increases, and structure is destroyed.

Running this movie backward—forcing the pale blue water to assemble back into a concentrated drop of ink—seems impossible because it requires calculating the backward trajectories of all colliding water molecules.

Diffusion models do exactly this. They act as a time-reversal operator. By observing millions of noise-injection steps, a neural network learns the local corrective force needed to reverse chaotic collisions and guide unstructured noise back into structure.

2. The Forward Process: Destroying Information

We formalize the destruction of structure by defining a Markov chain that systematically adds Gaussian noise to a clean data sample $x_{0} \sim q (x)$ over $T$ discrete timesteps [1, 2].

The transition from timestep $t - 1$ to $t$ is governed by a variance schedule $β_{1}, β_{2}, \dots, β_{T}$ :

q (x_{t} ∣ x_{t - 1}) = N (x_{t}; \sqrt{1 - β_{t}} x_{t - 1}, β_{t} I)

Parameter Breakdown

$x_{t - 1}$ : The state of the image at the previous step.
$β_{t} \in (0, 1)$ : The variance of the noise injected at step $t$ .
$\sqrt{1 - β_{t}}$ : The attenuation factor. This slightly shrinks the signal from the previous step.
$I$ : The identity matrix, signifying that noise is added independently to every pixel.

Why shrink the signal by $\sqrt{1 - β_{t}}$ ?
If we did not shrink the signal and simply added noise ( $x_{t} = x_{t - 1} + ϵ$ ), the variance of the image would grow without bound as steps accumulate. By scaling the previous image by $\sqrt{1 - β_{t}}$ , we preserve a stationary variance. Assuming the data is normalized so that $Var (x_{t - 1}) = 1$ :

Var (x_{t}) = (\sqrt{1 - β_{t}})^{2} Var (x_{t - 1}) + β_{t} Var (ϵ) = (1 - β_{t}) (1) + β_{t} (1) = 1

This ensures that the total variance remains normalized to $1$ throughout the entire forward process, even as the content dissolves into noise. As $t \to T$ , the image converges to an isotropic Gaussian distribution: $x_{T} \sim N (0, I)$ . Try it: drag the slider all the way to the right — watch how the figure-8 structure dissolves into formless noise (Figure 1).

[Interactive figure: Forward Process (Adding Noise), with a slider for t running from x₀ (Data Manifold) to x_T (Isotropic Gaussian).]

Figure 1: The forward diffusion process. As the parameter t increases from 0 to 1 (the normalized timestep t/T), the clean data distribution on the Figure-8 manifold (dashed line) is gradually perturbed by adding isotropic Gaussian noise, eventually collapsing to an unstructured Gaussian distribution p(x_T).
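The variance argument above can be checked with a few lines of arithmetic. The sketch below tracks only the scalar variance through the chain rather than simulating samples; the function name `variance_trajectory`, the 1000-step linear schedule, and the choice to give the unscaled chain the same noise variance $β_{t}$ are all assumptions made for illustration.

```python
def variance_trajectory(betas, v0=1.0, shrink=True):
    """Track Var(x_t) along the forward chain for scalar data.

    shrink=True  models x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps
    shrink=False models x_t = x_{t-1} + sqrt(beta_t) * eps  (no attenuation)
    """
    variances = [v0]
    for beta in betas:
        prev = variances[-1]
        scale_sq = (1.0 - beta) if shrink else 1.0
        # Independent terms: variances add after scaling the signal part.
        variances.append(scale_sq * prev + beta)
    return variances


# Hypothetical linear schedule: T = 1000 steps from 1e-4 to 0.02.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

preserved = variance_trajectory(betas, v0=1.0, shrink=True)
exploding = variance_trajectory(betas, v0=1.0, shrink=False)
off_unit = variance_trajectory(betas, v0=4.0, shrink=True)

print(f"with shrink,    Var(x_T) = {preserved[-1]:.4f}")
print(f"without shrink, Var(x_T) = {exploding[-1]:.4f}")
print(f"v0 = 4, shrink, Var(x_T) = {off_unit[-1]:.4f}")
```

With attenuation the variance stays at 1, and even a chain that starts away from unit variance is pulled toward 1. Without attenuation the variance keeps accumulating.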

The Jumping Trick (Reparameterization)

To train a model, we need to generate noisy images at arbitrary timesteps $t$ . If we had to run the Markov chain step-by-step ( $x_{0} \to x_{1} \to x_{2} \to \dots \to x_{t}$ ), training would be slow.

We can solve this recursively. Let $α_{t} = 1 - β_{t}$ . We rewrite the transitions using the reparameterization trick:

x_{t} = \sqrt{α_{t}} x_{t - 1} + \sqrt{1 - α_{t}} ϵ_{t - 1} where ϵ_{t - 1} \sim N (0, I)

Expanding this recursively:

x_{t} = \sqrt{α_{t}} (\sqrt{α_{t - 1}} x_{t - 2} + \sqrt{1 - α_{t - 1}} ϵ_{t - 2}) + \sqrt{1 - α_{t}} ϵ_{t - 1}

x_{t} = \sqrt{α_{t} α_{t - 1}} x_{t - 2} + \sqrt{α_{t} (1 - α_{t - 1})} ϵ_{t - 2} + \sqrt{1 - α_{t}} ϵ_{t - 1}

Because the sum of two independent Gaussians $X \sim N (0, σ_{1}^{2} I)$ and $Y \sim N (0, σ_{2}^{2} I)$ is a new Gaussian $Z \sim N (0, (σ_{1}^{2} + σ_{2}^{2}) I)$ , we merge the noise terms. The combined variance is:

Var = α_{t} (1 - α_{t - 1}) + (1 - α_{t}) = α_{t} - α_{t} α_{t - 1} + 1 - α_{t} = 1 - α_{t} α_{t - 1}

Thus, we simplify to:

x_{t} = \sqrt{α_{t} α_{t - 1}} x_{t - 2} + \sqrt{1 - α_{t} α_{t - 1}} \bar{ϵ} where \bar{ϵ} \sim N (0, I)

Repeating this recursion all the way back to $x_{0}$ yields the closed-form transition formula:

q (x_{t} ∣ x_{0}) = N (x_{t}; \sqrt{\bar{α}_{t}} x_{0}, (1 - \bar{α}_{t}) I)

where $\bar{α}_{t} = \prod_{s = 1}^{t} α_{s}$ . This closed-form property allows us to jump to any noise level during training instantly:

x_{t} = \sqrt{\bar{α}_{t}} x_{0} + \sqrt{1 - \bar{α}_{t}} ϵ where ϵ \sim N (0, I)
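As a sanity check on the recursion, the following sketch composes the single-step transitions one at a time and compares the resulting signal coefficient and noise variance with the closed form. The helper names and the linear schedule are hypothetical; `alphas[s - 1]` stores $α_{s}$ because Python lists are 0-indexed.

```python
import math


def step_by_step_coefficients(alphas, t):
    """Compose t single-step transitions; return (signal_coeff, noise_var).

    Writing x_t = c * x_0 + noise with Var(noise) = v, one more step maps
    (c, v) -> (sqrt(alpha) * c, alpha * v + (1 - alpha)).
    """
    coeff, noise_var = 1.0, 0.0
    for alpha in alphas[:t]:
        coeff = math.sqrt(alpha) * coeff
        noise_var = alpha * noise_var + (1.0 - alpha)
    return coeff, noise_var


def closed_form_coefficients(alphas, t):
    """Same quantities from alpha_bar_t = prod_{s <= t} alpha_s."""
    alpha_bar = math.prod(alphas[:t])
    return math.sqrt(alpha_bar), 1.0 - alpha_bar


def q_sample(x0, t, alphas, eps):
    """Jump straight to x_t for scalar x0 and a given noise draw eps."""
    coeff, noise_var = closed_form_coefficients(alphas, t)
    return coeff * x0 + math.sqrt(noise_var) * eps


# Hypothetical linear schedule with T = 1000.
T = 1000
alphas = [1.0 - (1e-4 + (0.02 - 1e-4) * i / (T - 1)) for i in range(T)]

for t in (1, 10, 500, 1000):
    print(t, step_by_step_coefficients(alphas, t), closed_form_coefficients(alphas, t))
```

The two routes agree to floating-point precision, which is what lets training draw $x_{t}$ in one shot instead of iterating the chain.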

3. The Reverse Process: Reconstructing Order

To generate new data, we need to run the process in reverse: sample from $q (x_{t - 1} ∣ x_{t})$ . By Bayes' rule, the reverse transition is:

q (x_{t - 1} ∣ x_{t}) = \frac{q (x_{t} ∣ x_{t - 1}) q (x_{t - 1})}{q (x_{t})}

Here lies the problem: the term $q (x_{t})$ is the probability density of a noisy image at step $t$ . To compute this, we must integrate over all possible clean images in the data distribution:

q (x_{t}) = \int q (x_{t} ∣ x_{0}) q (x_{0}) d x_{0}

This integral is completely intractable. To evaluate it, a model would need to know the density of all possible natural images in the universe.

However, if the step size $β_{t}$ is sufficiently small, the reverse transition $q (x_{t - 1} ∣ x_{t})$ is also approximately Gaussian. While we cannot compute this true distribution, we can train a neural network $p_{θ} (x_{t - 1} ∣ x_{t})$ to approximate it [2]:

p_{θ} (x_{t - 1} ∣ x_{t}) = N (x_{t - 1}; μ_{θ} (x_{t}, t), Σ_{θ} (x_{t}, t))

The network takes the noisy image $x_{t}$ and the current timestep $t$ as input, estimating the mean $μ_{θ}$ and covariance $Σ_{θ}$ to step backward toward structure. Try it: start with the slider at the far left (pure noise) and drag right — each position corresponds to one step of the reverse chain recovering the manifold (Figure 2).

[Interactive figure: Reverse Generative Process, with a slider running from x_T (Sampled Noise) to x₀ (Reconstructed Manifold).]

Figure 2: The reverse generative process. Starting from sampled isotropic Gaussian noise p(x_T), a neural network iteratively predicts and removes noise, transporting the points back to the structured data manifold (dashed line) at t = 0.

4. Deriving the Variational Lower Bound (ELBO)

To train our neural network, we want to maximize the log-likelihood of our data: $\log p_{θ} (x_{0})$ [1, 2]. Since computing this directly is intractable, we instead optimize a variational lower bound on it, the Evidence Lower Bound (ELBO).

Think of the ELBO as a proxy. Instead of raising the peak of a mountain directly (it is hidden behind clouds), we build a floor that is guaranteed to sit at or below the peak. Because the peak can never drop beneath the floor, pushing the floor as high as possible pushes the peak up with it.

We start by expressing the negative log-likelihood:

- \log p_{θ} (x_{0}) = - \log \int p_{θ} (x_{0 : T}) d x_{1 : T}

We introduce the auxiliary distribution $q (x_{1 : T} ∣ x_{0})$ representing the forward trajectory:

- \log p_{θ} (x_{0}) = - \log \int q (x_{1 : T} ∣ x_{0}) \frac{p_{θ} (x_{0 : T})}{q (x_{1 : T} ∣ x_{0})} d x_{1 : T}

Using Jensen's Inequality (since the negative logarithm is a convex function, $- \log E [X] \leq E [- \log X]$ ):

- \log p_{θ} (x_{0}) \leq E_{q (x_{1 : T} ∣ x_{0})} [- \log \frac{p_{θ} (x_{0 : T})}{q (x_{1 : T} ∣ x_{0})}] = E_{q} [\log \frac{q (x_{1 : T} ∣ x_{0})}{p_{θ} (x_{0 : T})}]

By expanding the joint distributions of both chains:

Forward chain: $q (x_{1 : T} ∣ x_{0}) = \prod_{t = 1}^{T} q (x_{t} ∣ x_{t - 1})$
Reverse chain: $p_{θ} (x_{0 : T}) = p (x_{T}) \prod_{t = 1}^{T} p_{θ} (x_{t - 1} ∣ x_{t})$

and substituting them into the fraction, algebraic cancellations (using Bayes' rule to rewrite $q (x_{t} ∣ x_{t - 1})$ in terms of the posterior $q (x_{t - 1} ∣ x_{t}, x_{0})$ ) yields:

L = E_{q} [D_{KL} (q (x_{T} ∣ x_{0}) ∥ p (x_{T})) + \sum_{t = 2}^{T} D_{KL} (q (x_{t - 1} ∣ x_{t}, x_{0}) ∥ p_{θ} (x_{t - 1} ∣ x_{t})) - \log p_{θ} (x_{0} ∣ x_{1})]

Breaking Down the Terms: What do they actually do?

Prior Matching ( $L_{T} = D_{KL} (q (x_{T} ∣ x_{0}) ∥ p (x_{T}))$ ): Compares the distribution of the fully noised sample with standard Gaussian noise. Since the forward schedule is fixed, this term contains no learnable parameters and is ignored during training.
Denoising Transitions ( $L_{t - 1} = D_{KL} (q (x_{t - 1} ∣ x_{t}, x_{0}) ∥ p_{θ} (x_{t - 1} ∣ x_{t}))$ ): Compares our model's step $p_{θ} (x_{t - 1} ∣ x_{t})$ with the posterior $q (x_{t - 1} ∣ x_{t}, x_{0})$ . Crucially, $q (x_{t - 1} ∣ x_{t}, x_{0})$ is computable because it is conditioned on the clean image $x_{0}$ . If we know where the particle started and where it is now, we can calculate where it most likely was one step earlier:
$q (x_{t - 1} ∣ x_{t}, x_{0}) = N (x_{t - 1}; \tilde{μ}_{t} (x_{t}, x_{0}), \tilde{β}_{t} I)$
When the reverse variance is fixed to a schedule-determined constant, as in [2], our model only needs to predict the mean $\tilde{μ}_{t} (x_{t}, x_{0})$ .
Reconstruction ( $L_{0} = - \log p_{θ} (x_{0} ∣ x_{1})$ ): Measures how well the model reconstructs the original pixels at the final step ( $t = 1 \to 0$ ).
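To see why matching the mean is enough for the denoising terms, it may help to write out the KL divergence between two Gaussians with isotropic covariances. The sketch below assumes the model's covariance is fixed to $σ_{t}^{2} I$ rather than learned; the symbols $σ_{t}$ , $d$ (the data dimensionality), and $C$ (a constant that does not depend on $θ$ ) are introduced here for illustration.

```latex
\begin{aligned}
D_{KL}\big(N(\tilde{\mu}_t, \tilde{\beta}_t I) \,\|\, N(\mu_\theta, \sigma_t^2 I)\big)
&= \frac{1}{2}\left[ \frac{d\,\tilde{\beta}_t}{\sigma_t^2} - d
   + \frac{\lVert \tilde{\mu}_t - \mu_\theta \rVert^2}{\sigma_t^2}
   + d \log\frac{\sigma_t^2}{\tilde{\beta}_t} \right] \\
&= \frac{1}{2\sigma_t^2}\,
   \lVert \tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t) \rVert^2 + C
\end{aligned}
```

Only the squared distance between the two means depends on the network, so each denoising term reduces to a weighted regression onto the posterior mean.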

5. The Noise Compass and the Score Function

To train the model to predict the mean, we write out the true posterior mean $\tilde{μ}_{t} (x_{t}, x_{0})$ , using $x_{0} = (x_{t} - \sqrt{1 - \bar{α}_{t}} ϵ) / \sqrt{\bar{α}_{t}}$ to express it in terms of the noise:

\tilde{μ}_{t} (x_{t}, x_{0}) = \frac{1}{\sqrt{α_{t}}} (x_{t} - \frac{β_{t}}{\sqrt{1 - \bar{α}_{t}}} ϵ)

where $ϵ$ is the noise vector added to $x_{0}$ to produce $x_{t}$ . Since our neural network's mean $μ_{θ}$ is trying to match this, we reparameterize $μ_{θ}$ to match this exact structure:

μ_{θ} (x_{t}, t) = \frac{1}{\sqrt{α_{t}}} (x_{t} - \frac{β_{t}}{\sqrt{1 - \bar{α}_{t}}} ϵ_{θ} (x_{t}, t))

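where $ϵ_{θ} (x_{t}, t)$ is a neural network trained to predict the noise $ϵ$ . Substituting both means into the denoising terms turns each one into a weighted squared error between $ϵ$ and $ϵ_{θ}$ ; dropping the timestep-dependent weight, as in [2], gives a plain Mean-Squared Error:

L_{\text{simple}} (θ) = E_{t, x_{0}, ϵ} [∥ ϵ - ϵ_{θ} (x_{t}, t) ∥^{2}]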
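The noise form of the posterior mean can be checked numerically against the more common form written in terms of $x_{0}$ . The sketch below uses the $x_{0}$ -form coefficients as given by Ho et al. [2]; the function names and the scalar values are hypothetical and chosen only for illustration.

```python
import math


def posterior_mean_x0_form(x_t, x0, alpha_t, alpha_bar_t, alpha_bar_prev):
    """Mean of q(x_{t-1} | x_t, x_0) as a weighted sum of x_0 and x_t."""
    beta_t = 1.0 - alpha_t
    coef_x0 = math.sqrt(alpha_bar_prev) * beta_t / (1.0 - alpha_bar_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
    return coef_x0 * x0 + coef_xt * x_t


def posterior_mean_eps_form(x_t, eps, alpha_t, alpha_bar_t):
    """Same mean, rewritten in terms of the noise eps that produced x_t."""
    beta_t = 1.0 - alpha_t
    return (x_t - beta_t / math.sqrt(1.0 - alpha_bar_t) * eps) / math.sqrt(alpha_t)


# Hypothetical scalar example.
alpha_bar_prev = 0.60                 # alpha_bar_{t-1}
alpha_t = 0.95
alpha_bar_t = alpha_bar_prev * alpha_t
x0, eps = 1.3, -0.7
x_t = math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * eps

mean_from_x0 = posterior_mean_x0_form(x_t, x0, alpha_t, alpha_bar_t, alpha_bar_prev)
mean_from_eps = posterior_mean_eps_form(x_t, eps, alpha_t, alpha_bar_t)
print(mean_from_x0, mean_from_eps)
```

Both expressions give the same number, so a network that predicts $ϵ$ carries the same information as one that predicts the posterior mean directly.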

Physical Intuition: The Noise Compass

Think about what this reparameterization represents. If you are lost in a dense forest (the data manifold) and walk $10$ steps in a random direction (adding noise $ϵ$ ), you do not need to rebuild the entire forest from memory to get back. You only need a compass that tells you: 'You walked $10$ steps north-east. To get back, walk $10$ steps south-west.'

Predicting the noise $ϵ$ is exactly like predicting the direction of the step you took. By training a network to predict the noise, we are training a noise compass that points back to the data manifold. Try it: click anywhere in the visualization to place the noisy point $x_{t}$ , then lower the accuracy slider to see how an imprecise prediction misses the manifold (Figure 3).

[Interactive figure: Vector Addition Logic ( x_recon = x_t - ε_θ ), with a Network Prediction Accuracy (Compass Alignment) slider. Example readout at 70% accuracy: target x₀ = (49, 49), noisy x_t = (45, 45), reconstructed = (48, 48).]

Figure 3: Denoising as vector subtraction. The true noise vector ε (red arrow) pushes the data point x₀ off the manifold to x_t. The neural network predicts the noise vector ε_θ (green arrow). Subtracting the prediction from x_t gives x_recon, guiding the point back to the manifold as accuracy reaches 100%. (Manifold simplified to a circle here for clarity.)
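The figure drops the scaling factors for clarity. With them restored, the same subtraction becomes an inversion of the jump formula from Section 2. The sketch below is a minimal illustration on a 2-D point; `predict_x0`, the value of `alpha_bar_t`, and the 70% "accuracy" model (a noise estimate scaled by 0.7) are assumptions made for this example.

```python
import math


def predict_x0(x_t, eps_hat, alpha_bar_t):
    """Invert x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps, per coordinate."""
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [(xt - noise * e) / signal for xt, e in zip(x_t, eps_hat)]


alpha_bar_t = 0.5
x0 = [1.0, -2.0]
eps = [0.3, 0.8]
x_t = [math.sqrt(alpha_bar_t) * a + math.sqrt(1.0 - alpha_bar_t) * e
       for a, e in zip(x0, eps)]

# A perfect compass recovers the starting point.
exact = predict_x0(x_t, eps, alpha_bar_t)

# A compass that only captures 70% of the noise leaves a residual offset.
accuracy = 0.7
rough = predict_x0(x_t, [accuracy * e for e in eps], alpha_bar_t)

print("exact:", exact)
print("rough:", rough)
```

The leftover error in `rough` is the unexplained 30% of the noise, rescaled by $\sqrt{1 - \bar{α}_{t}} / \sqrt{\bar{α}_{t}}$ .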

The Score Function

In statistics, this noise compass has a name: the score function, defined as the gradient of the log-density, $\nabla_{x} \log p (x)$ [3, 7]. This vector field points in the direction of the steepest increase in data density.

Minimizing the noise-prediction MSE is equivalent, up to a timestep-dependent weighting, to Denoising Score Matching [3]. The predicted noise is proportional to the negative score:

ϵ_{θ} (x_{t}, t) \approx - \sqrt{1 - \bar{α}_{t}} \nabla_{x_{t}} \log q (x_{t})

Try it: set the noise level to maximum, then slowly drag left — notice how the diffuse, long-range arrows sharpen into precise gradients pointing directly at the manifold (Figure 4).

[Interactive figure: Noise Level (Score Field Smoothing), with a slider running from x₀ (Sharp Gradients) to x_T (Smoothed Gradients).]

Figure 4: The score field ∇_x log p_t(x). The vector field (teal arrows) points towards regions of higher density. At low noise levels (t → 0), the score is sharp and concentrated near the manifold; at high noise levels (t → 1), the score becomes smooth and diffuse.

When the noise level is high ( $t \to T$ ), the score field is smooth and diffuse, guiding particles from far away toward the general shape. When the noise level is low ( $t \to 0$ ), the score field becomes sharp and local, carving out fine, intricate details.
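For a toy case where everything is available in closed form, the relation between noise and score can be verified directly. The sketch below assumes 1-D Gaussian data $x_{0} \sim N (m, s^{2})$ , so that the noisy marginal is also Gaussian; the function names and the numeric values are hypothetical.

```python
import math


def marginal_moments(alpha_bar_t, data_mean, data_std):
    """Mean and variance of q(x_t) when x_0 ~ N(data_mean, data_std^2)."""
    mean = math.sqrt(alpha_bar_t) * data_mean
    var = alpha_bar_t * data_std ** 2 + (1.0 - alpha_bar_t)
    return mean, var


def score(x_t, alpha_bar_t, data_mean, data_std):
    """Gradient of log q(x_t) for the Gaussian marginal."""
    mean, var = marginal_moments(alpha_bar_t, data_mean, data_std)
    return -(x_t - mean) / var


def optimal_noise_prediction(x_t, alpha_bar_t, data_mean, data_std):
    """E[eps | x_t], from Cov(eps, x_t) / Var(x_t) * (x_t - E[x_t])."""
    mean, var = marginal_moments(alpha_bar_t, data_mean, data_std)
    return math.sqrt(1.0 - alpha_bar_t) * (x_t - mean) / var


m, s = 2.0, 0.1
for a_bar in (0.99, 0.5, 0.01):
    x = 1.5
    lhs = optimal_noise_prediction(x, a_bar, m, s)
    rhs = -math.sqrt(1.0 - a_bar) * score(x, a_bar, m, s)
    print(f"alpha_bar={a_bar}: eps*={lhs:.4f}, -sqrt(1-a_bar)*score={rhs:.4f}")

# Score magnitude 0.5 away from the marginal mean, at low and high noise.
sharp = abs(score(math.sqrt(0.99) * m + 0.5, 0.99, m, s))
smooth = abs(score(math.sqrt(0.01) * m + 0.5, 0.01, m, s))
print(sharp, smooth)
```

The best possible noise prediction coincides with the rescaled negative score, and the score is far steeper at low noise than at high noise, matching the behavior described for Figure 4.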

6. Training and Sampling Algorithms

Here is how you would implement training and sampling in practice, written as step-by-step engineering logic. In the sampling loop, sigma[t] is the reverse-process noise standard deviation: typically $σ_{t} = \sqrt{β_{t}}$ for the simple case, or $σ_{t} = \sqrt{\tilde{β}_{t}}$ using the posterior variance $\tilde{β}_{t}$ introduced in Section 4.

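import torch
import torch.nn.functional as F

# Assumptions (not fixed by the text): PyTorch is used; `network(x, t)` returns a
# tensor shaped like x; `optimizer` is a torch.optim optimizer over its parameters;
# alpha, alpha_bar, beta, sigma are 1-D tensors of length T + 1 on the same device
# as the data, indexed so that entry t matches the math (entry 0 is unused).

# Algorithm 1: Training Loop
def train_step(network, optimizer, x_0, alpha_bar, T):
    # 1. Sample a random timestep from 1 to T for every example in the batch
    t = torch.randint(1, T + 1, (x_0.shape[0],), device=x_0.device)

    # 2. Sample random Gaussian noise
    epsilon = torch.randn_like(x_0)

    # 3. Compute the noisy image x_t using the Jumping Trick
    #    (reshape alpha_bar[t] so it broadcasts over the non-batch dimensions)
    a_bar = alpha_bar[t].view(-1, *([1] * (x_0.dim() - 1)))
    x_t = torch.sqrt(a_bar) * x_0 + torch.sqrt(1 - a_bar) * epsilon

    # 4. Predict the noise using the neural network
    predicted_noise = network(x_t, t)

    # 5. Compute mean squared error loss
    loss = F.mse_loss(predicted_noise, epsilon)

    # 6. Take gradient step to minimize loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Algorithm 2: Sampling Loop (Reverse Generation)
@torch.no_grad()
def generate_sample(network, shape, alpha, alpha_bar, beta, sigma, T):
    # 1. Start with pure Gaussian noise
    x = torch.randn(shape, device=alpha_bar.device)

    # 2. Iterate backward from T down to 1
    for t in reversed(range(1, T + 1)):
        # Sample standard Gaussian noise (z = 0 if t = 1)
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)

        # Predict the noise vector (the network expects one timestep per example)
        t_batch = torch.full((shape[0],), t, device=x.device, dtype=torch.long)
        predicted_noise = network(x, t_batch)

        # Calculate the mean of the previous step x_(t-1)
        mean = (x - beta[t] / torch.sqrt(1 - alpha_bar[t]) * predicted_noise) / torch.sqrt(alpha[t])

        # Add scaled noise (sigma[t] is a standard deviation) to step backward stochastically
        x = mean + sigma[t] * z

    return x  # Generated sample x_0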
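To watch the sampling loop work without training anything, one option is to replace the network with an exact noise predictor for a distribution where it is known in closed form. The sketch below is a toy under stated assumptions: scalar data drawn from $N (2, 1)$ , a hypothetical 200-step linear schedule, $σ_{t} = \sqrt{β_{t}}$ , and an analytic "oracle" standing in for a trained $ϵ_{θ}$ . It uses only the Python standard library.

```python
import math
import random


def linear_schedule(T, beta_start, beta_end):
    """Return 1-indexed lists (entry 0 is padding) for beta, alpha, alpha_bar."""
    beta = [0.0] + [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]
    alpha = [1.0 - b for b in beta]
    alpha_bar = [1.0] * (T + 1)
    for t in range(1, T + 1):
        alpha_bar[t] = alpha_bar[t - 1] * alpha[t]
    return beta, alpha, alpha_bar


def make_oracle(alpha_bar, data_mean, data_std):
    """Exact E[eps | x_t] for Gaussian data; stands in for a trained network."""
    def network(x, t):
        mean = math.sqrt(alpha_bar[t]) * data_mean
        var = alpha_bar[t] * data_std ** 2 + (1.0 - alpha_bar[t])
        return math.sqrt(1.0 - alpha_bar[t]) * (x - mean) / var
    return network


def generate_sample(network, beta, alpha, alpha_bar, T, rng):
    x = rng.gauss(0.0, 1.0)                      # start from pure noise
    for t in range(T, 0, -1):
        z = rng.gauss(0.0, 1.0) if t > 1 else 0.0
        predicted_noise = network(x, t)
        mean = (x - beta[t] / math.sqrt(1.0 - alpha_bar[t]) * predicted_noise) / math.sqrt(alpha[t])
        x = mean + math.sqrt(beta[t]) * z        # sigma_t = sqrt(beta_t)
    return x


T = 200
beta, alpha, alpha_bar = linear_schedule(T, 1e-4, 0.1)
network = make_oracle(alpha_bar, data_mean=2.0, data_std=1.0)
rng = random.Random(0)

samples = [generate_sample(network, beta, alpha, alpha_bar, T, rng) for _ in range(2000)]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((v - sample_mean) ** 2 for v in samples) / len(samples))
print(f"mean = {sample_mean:.3f}, std = {sample_std:.3f}")
```

Starting from zero-mean noise, the loop should land close to the data mean of 2 and standard deviation of 1. With a real network the only change is that `network` is learned rather than derived.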

7. The Continuous-Time SDE & ODE View

What happens if we make the time steps infinitely small ( $T \to \infty$ )? The discrete Markov chain converges to a continuous path described by a Stochastic Differential Equation (SDE) [3]:

d x = f (x, t) d t + g (t) d w

where:

$d t$ represents an infinitesimal step forward in time.
$d w$ represents Brownian motion (continuous white noise).
$f (x, t)$ is the drift term, which pushes the particles in a deterministic direction.
$g (t)$ is the diffusion coefficient, representing the strength of the noise added.

By the reverse-time theorem derived by Anderson [5], if we know the forward SDE and can estimate the score function $\nabla_{x} \log p_{t} (x)$ , the reverse-time SDE is:

d x = [f (x, t) - g (t)^{2} \nabla_{x} \log p_{t} (x)] d t + g (t) d \bar{w}

where $d \bar{w}$ is a Brownian motion running backward in time and $d t$ is an infinitesimal negative timestep. Generating data is equivalent to integrating this SDE backward in time.

The Probability Flow ODE

A remarkable mathematical consequence is that for this stochastic path, there exists a deterministic sibling called the Probability Flow ODE [3]:

d x = [f (x, t) - \frac{1}{2} g (t)^{2} \nabla_{x} \log p_{t} (x)] d t

This ODE shares the exact same marginal probability densities $p_{t} (x)$ as the SDE at every point in time, but contains no noise.

Deterministic Mapping: Every noise vector $x_{T}$ maps deterministically to a single, unique data point $x_{0}$ .
Fast Sampling: Since it is an ODE, we can use off-the-shelf numerical ODE solvers (such as adaptive-step Runge-Kutta methods) to synthesize samples in far fewer steps, on the order of $10$ to $20$ with well-tuned solvers, instead of $1000$ stochastic steps.
Flow Matching [6]: The same ODE viewpoint underlies Flow Matching. Rather than modeling noise addition, flow matching trains a network to regress a velocity field along prescribed paths connecting noise to data; with straight-line paths, sampling tends to need fewer solver steps.

Try it: click "Simulate Trajectories" in both ODE and SDE modes. In ODE mode the paths never cross; in SDE mode, stochastic noise causes visible branching (Figure 5).

Figure 5: SDE vs. Probability Flow ODE. In Stochastic Flow (SDE), noise is added at each step, causing particles to take chaotic, overlapping Brownian paths. In Probability Flow (ODE), the process is completely deterministic and noise-free, tracing smooth, non-crossing streamlines that preserve the exact same marginal probability densities at all times.
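The deterministic mapping can be made concrete on a 1-D toy problem. The sketch below assumes a variance-preserving forward SDE with a constant noise rate, $f (x, t) = - \frac{1}{2} β x$ and $g (t) = \sqrt{β}$ , and Gaussian data so that the score is known exactly. `BETA`, the helper names, the fixed-step Euler integrator, and all numeric values are assumptions made for illustration.

```python
import math

BETA = 1.0  # constant noise rate (assumption)


def marginal(t, data_mean, data_std):
    """Mean and variance of p_t when x_0 ~ N(data_mean, data_std^2)."""
    decay = math.exp(-BETA * t)
    mean = data_mean * math.exp(-0.5 * BETA * t)
    var = data_std ** 2 * decay + (1.0 - decay)
    return mean, var


def score(x, t, data_mean, data_std):
    mean, var = marginal(t, data_mean, data_std)
    return -(x - mean) / var


def probability_flow_velocity(x, t, data_mean, data_std):
    """dx/dt = f(x, t) - 0.5 * g(t)^2 * score(x, t)."""
    return -0.5 * BETA * x - 0.5 * BETA * score(x, t, data_mean, data_std)


def integrate_backward(x_T, t_end, n_steps, data_mean, data_std):
    """Fixed-step Euler integration from t = t_end down to t = 0."""
    h = t_end / n_steps
    x, t = x_T, t_end
    for _ in range(n_steps):
        x -= h * probability_flow_velocity(x, t, data_mean, data_std)
        t -= h
    return x


m, s, t_end = 2.0, 0.5, 10.0
x0_a = integrate_backward(1.0, t_end, 5000, m, s)
x0_b = integrate_backward(-1.0, t_end, 5000, m, s)

# For Gaussian data the flow keeps (x - mean_t) / std_t constant, which
# gives an exact answer to compare against.
mean_T, var_T = marginal(t_end, m, s)
expected_a = m + s * (1.0 - mean_T) / math.sqrt(var_T)
print(x0_a, expected_a, x0_b)
```

The same starting noise always lands on the same data point, and two trajectories that start in a given order stay in that order, which is the non-crossing behavior shown in ODE mode.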

8. Noise Scheduling: Linear vs. Cosine

The variance schedule controls the rate of signal decay, and the choice of schedule has a significant impact on generation quality.

Linear Schedule [2]: $β_{t}$ increases linearly from $β_{1} = 10^{- 4}$ to $β_{T} = 0.02$ . It worked well in the original DDPM experiments, but Nichol and Dhariwal [4] report that it destroys information faster than necessary, particularly on lower-resolution images (such as $32 \times 32$ and $64 \times 64$ ), leaving the last portion of the forward chain almost pure noise.
Cosine Schedule [4]: Defines $\bar{α}_{t}$ using a squared cosine function:
$\bar{α}_{t} = \frac{f (t)}{f (0)}, \quad f (t) = \cos^{2} (\frac{t / T + s}{1 + s} \cdot \frac{π}{2})$
where $s$ is a small offset (typically $0.008$ ) that prevents $β_{t}$ from becoming too small near $t = 0$ . This schedule gives a smooth, gradual loss of information, preserving features (like edges and textures) longer into the forward chain.

Try it: drag to $t \approx 0.3$ ; the linear schedule (blue) has already dissolved most structure, while the cosine schedule (coral) still holds the shape (Figure 6).

[Interactive figure: Linear (Fast decay) vs. Cosine (Gradual decay). Example readout at diffusion timestep t = 0.20: Linear Schedule ᾱ = 0.658, Cosine Schedule ᾱ = 0.899.]

Figure 6: Comparison of noise schedules. The linear schedule (blue) decays the signal variance too rapidly in the early steps, completely dissolving structural features. The cosine schedule (coral) decreases the signal variance gradually, preserving structural contours longer.
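The two schedules are easy to tabulate side by side. The sketch below computes $\bar{α}_{t}$ for both, assuming $T = 1000$ (the text does not fix $T$ for this comparison) and the endpoint values quoted above; the function names are hypothetical.

```python
import math


def linear_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """alpha_bar[t] for t = 0..T under a linear beta schedule (alpha_bar[0] = 1)."""
    alpha_bar = [1.0]
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        alpha_bar.append(alpha_bar[-1] * (1.0 - beta))
    return alpha_bar


def cosine_alpha_bar(T, s=0.008):
    """alpha_bar[t] = f(t) / f(0) with the squared-cosine f from the text."""
    def f(t):
        return math.cos((t / T + s) / (1.0 + s) * math.pi / 2.0) ** 2
    return [f(t) / f(0) for t in range(T + 1)]


T = 1000  # assumption
linear = linear_alpha_bar(T)
cosine = cosine_alpha_bar(T)

for t in (200, 500, 800):
    print(f"t/T = {t / T:.1f}: linear = {linear[t]:.3f}, cosine = {cosine[t]:.3f}")
```

At $t / T = 0.2$ this gives roughly 0.66 for the linear schedule and 0.90 for the cosine schedule, in line with the readout in Figure 6, and the cosine schedule retains more signal at each of the listed timesteps.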

Conclusion

The elegance of diffusion lies in reframing an intractable generative mapping problem as a sequence of tractable denoising regressions. By establishing the equivalence between predicting Gaussian noise and estimating the score function of the data distribution, diffusion models offer a robust, theoretically grounded framework that sidesteps many of the training instabilities of earlier generative approaches.

References

[1] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). Deep Unsupervised Learning using Nonequilibrium Thermodynamics. Proceedings of the 32nd International Conference on Machine Learning (ICML).
[2] Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. Advances in Neural Information Processing Systems (NeurIPS).
[3] Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021). Score-Based Generative Modeling through Stochastic Differential Equations. International Conference on Learning Representations (ICLR).
[4] Nichol, A. Q., & Dhariwal, P. (2021). Improved Denoising Diffusion Probabilistic Models. Proceedings of the 38th International Conference on Machine Learning (ICML).
[5] Anderson, B. D. O. (1982). Reverse-time stochastic differential equations. Stochastic Processes and their Applications, 12(3), 313-326.
[6] Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. (2023). Flow Matching for Generative Modeling. International Conference on Learning Representations (ICLR).
[7] Song, Y., & Ermon, S. (2019). Generative Modeling by Estimating Gradients of the Data Distribution. Advances in Neural Information Processing Systems (NeurIPS).