Summary

The original DDPM framework produced excellent samples but suffered from poor log-likelihoods and extremely slow sampling (thousands of steps). Nichol and Dhariwal address both problems with three key modifications: (1) learning the reverse process variances via interpolation in log-space between theoretical upper and lower bounds (β_t and β̃_t), trained with a hybrid objective L_hybrid = L_simple + 0.001 · L_vlb; (2) a cosine noise schedule that destroys information more gradually than the linear schedule; and (3) importance sampling over timesteps to reduce gradient noise when optimizing the VLB.

These changes achieve competitive log-likelihoods (2.94 bits/dim on CIFAR-10, 3.53 on ImageNet 64×64) while the learned variance parameterization enables fast sampling: the model produces high-quality samples with 50-100 steps instead of 4000, because the variances automatically rescale for shorter processes. Precision/recall analysis reveals diffusion models achieve much higher recall (mode coverage) than GANs at comparable FID, and FID follows a power law with training compute.
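The automatic rescaling works because both β_t and β̃_t for a shortened chain follow directly from the subsequence's ᾱ values. A minimal NumPy sketch under that reading (the helper name `strided_betas` is mine, not from the paper):

```python
import numpy as np

def strided_betas(alpha_bar, num_steps):
    """Given cumulative products alpha_bar for the full T-step process,
    pick an evenly spaced subsequence of num_steps timesteps and derive
    the betas (and posterior variances beta_tilde) for the short chain."""
    T = len(alpha_bar)
    ts = np.linspace(0, T - 1, num_steps).round().astype(int)
    sub = alpha_bar[ts]
    prev = np.concatenate([[1.0], sub[:-1]])   # alpha_bar of previous step
    betas = 1.0 - sub / prev                   # beta_t for the short chain
    beta_tilde = (1.0 - prev) / (1.0 - sub) * betas  # posterior variances
    return ts, betas, beta_tilde
```

Because the network predicts an interpolation coefficient v rather than a variance value, the same output remains meaningful once β_t and β̃_t are recomputed for the subsequence.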

Key Contributions

  • Learned reverse process variances via log-space interpolation with hybrid training objective
  • Cosine noise schedule for more gradual information destruction
  • Importance sampling over timesteps that reduces gradient noise when optimizing the VLB directly
  • Fast sampling in 50-100 steps instead of 4000 via automatic variance rescaling
  • Competitive log-likelihoods matching best convolutional likelihood-based models
  • Precision/recall analysis showing superior mode coverage vs GANs
  • Scaling laws: FID follows power law with training compute

Methodology

  • Variance parameterization: Σ_θ(x_t, t) = exp(v · log β_t + (1−v) · log β̃_t), where v is a per-dimension output of the network
  • Hybrid loss: a stop-gradient is applied to μ_θ in the L_vlb term, so the VLB guides only the variances while L_simple remains the sole signal for the mean
  • Cosine schedule: ᾱ_t = f(t)/f(0) with f(t) = cos²((t/T + s)/(1+s) · π/2) and s = 0.008
  • Fast sampling: K evenly-spaced timesteps from [1, T]
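The schedule and variance formulas above can be sketched in a few lines of NumPy (function names are mine; the β_t clipping at 0.999 follows the paper's prescription for avoiding singularities near t = T):

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    # alpha_bar_t = f(t)/f(0), f(t) = cos^2(((t/T + s)/(1 + s)) * pi/2)
    t = np.arange(T + 1)
    f = np.cos(((t / T + s) / (1 + s)) * np.pi / 2) ** 2
    return f / f[0]

def betas_from_alpha_bar(alpha_bar, max_beta=0.999):
    # beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}, clipped at 0.999
    return np.minimum(1.0 - alpha_bar[1:] / alpha_bar[:-1], max_beta)

def sigma_from_v(v, beta_t, beta_tilde_t):
    # Log-space interpolation between the two variance bounds:
    # Sigma_theta = exp(v * log beta_t + (1 - v) * log beta_tilde_t)
    return np.exp(v * np.log(beta_t) + (1 - v) * np.log(beta_tilde_t))
```

The small offset s keeps β_t from vanishing near t = 0, and the interpolated variance always lies between the two bounds for v in [0, 1].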

Key Findings

  • L_hybrid with cosine schedule: 3.57 bits/dim on ImageNet 64×64 (vs 3.99 baseline)
  • L_vlb with importance sampling: best NLL at 3.53 bits/dim but worse FID (40.1 vs 19.2)
  • Learned variance models maintain near-optimal FID with 100 steps vs 4000
  • L_hybrid outperforms DDIM when using 50+ sampling steps
  • BigGAN-deep: lower FID but much worse recall (0.59 vs 0.72)
  • FID scales as power law with compute across model sizes (30M to 270M params)
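The importance-sampled VLB result above relies on the paper's sampler p_t ∝ sqrt(E[L_t²]), with the expectation estimated from the 10 most recent loss terms per timestep and uniform sampling until every timestep has a full history. A minimal sketch of that scheme (the class name and structure are my own framing):

```python
import numpy as np

class LossAwareSampler:
    """Importance sampler over timesteps: p_t proportional to
    sqrt(E[L_t^2]), estimated from a rolling history of recent
    per-timestep losses; uniform until every history is full."""
    def __init__(self, T, history=10):
        self.T = T
        self.history = history
        self.losses = np.zeros((T, history))
        self.counts = np.zeros(T, dtype=int)

    def weights(self):
        if np.any(self.counts < self.history):
            return np.full(self.T, 1.0 / self.T)  # warm-up: uniform
        w = np.sqrt((self.losses ** 2).mean(axis=1))
        return w / w.sum()

    def sample(self, rng):
        p = self.weights()
        t = rng.choice(self.T, p=p)
        return t, p[t]  # reweight the loss term L_t by 1 / p[t]

    def update(self, t, loss):
        self.losses[t, self.counts[t] % self.history] = loss
        self.counts[t] += 1
```

Dividing each sampled L_t by its probability keeps the estimator unbiased while concentrating samples on the noisy early timesteps that dominate the VLB.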

Important References

  1. Denoising Diffusion Probabilistic Models — Foundation paper this directly improves
  2. Score-Based Generative Modeling through Stochastic Differential Equations — Concurrent SDE framework
  3. Denoising Diffusion Implicit Models — Concurrent fast sampling approach (DDIM)

Atomic Notes


paper