Abstract
Denoising diffusion probabilistic models (DDPMs) are a class of generative models which have recently been shown to produce excellent samples. We show that with a few simple modifications, DDPMs can also achieve competitive log-likelihoods while maintaining high sample quality. Additionally, we find that learning variances of the reverse diffusion process allows sampling with an order of magnitude fewer forward passes with a negligible difference in sample quality, which is important for the practical deployment of these models. We additionally use precision and recall to compare how well DDPMs and GANs cover the target distribution.
Summary
The original DDPM framework produced excellent samples but suffered from poor log-likelihoods and extremely slow sampling (thousands of steps). Nichol and Dhariwal address both problems with three key modifications: (1) learning the reverse process variances via interpolation in log-space between theoretical upper and lower bounds (β_t and β̃_t), trained with a hybrid objective L_hybrid = L_simple + 0.001 · L_vlb; (2) a cosine noise schedule that destroys information more gradually than the linear schedule; and (3) importance sampling over timesteps to reduce gradient noise when optimizing the VLB.
These changes achieve competitive log-likelihoods (2.94 bits/dim on CIFAR-10, 3.53 on ImageNet 64×64) while the learned variance parameterization enables fast sampling: the model produces high-quality samples with 50-100 steps instead of 4000, because the variances automatically rescale for shorter processes. Precision/recall analysis reveals diffusion models achieve much higher recall (mode coverage) than GANs at comparable FID, and FID follows a power law with training compute.
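The cosine schedule mentioned above can be sketched directly from the paper's formula for ᾱ_t, with per-step βs recovered from consecutive ratios. This is a minimal numpy sketch; the function names (`cosine_alpha_bar`, `betas_from_alpha_bar`) and the clipping constant placement are my own, though the clip at 0.999 follows the paper's note about avoiding singularities near t = T.

```python
import numpy as np

def cosine_alpha_bar(T: int, s: float = 0.008) -> np.ndarray:
    """Cosine schedule: alpha_bar_t = f(t)/f(0) with
    f(t) = cos^2(((t/T + s) / (1 + s)) * pi / 2)."""
    t = np.arange(T + 1)
    f = np.cos(((t / T + s) / (1 + s)) * np.pi / 2) ** 2
    return f / f[0]

def betas_from_alpha_bar(alpha_bar: np.ndarray, max_beta: float = 0.999) -> np.ndarray:
    """Recover per-step betas via beta_t = 1 - alpha_bar_t / alpha_bar_{t-1},
    clipped to avoid a singularity as alpha_bar approaches 0 near t = T."""
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0.0, max_beta)

betas = betas_from_alpha_bar(cosine_alpha_bar(4000))
```

Compared with a linear schedule, the resulting βs grow slowly at first, which is what "destroys information more gradually" refers to.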
Key Contributions
- Learned reverse process variances via log-space interpolation with hybrid training objective
- Cosine noise schedule for more gradual information destruction
- Importance sampling for VLB optimization reducing gradient noise
- Fast sampling with 10-40x fewer steps via automatic variance rescaling
- Competitive log-likelihoods matching best convolutional likelihood-based models
- Precision/recall analysis showing superior mode coverage vs GANs
- Scaling laws: FID follows power law with training compute
Methodology
- Variance parameterization: Σ_θ(x_t, t) = exp(v · log β_t + (1−v) · log β̃_t), where v is an output of the network and β̃_t = ((1 − ᾱ_{t−1})/(1 − ᾱ_t)) · β_t is the variance of the forward-process posterior q(x_{t−1} | x_t, x_0)
- Hybrid objective: L_hybrid = L_simple + 0.001 · L_vlb, with a stop-gradient applied to μ_θ inside L_vlb so the VLB term guides only the variance
- Cosine schedule: ᾱ_t = f(t)/f(0) where f(t) = cos²((t/T + s)/(1+s) · π/2), with offset s = 0.008 so that β_t stays small near t = 0
- Fast sampling: K evenly-spaced timesteps from [1, T], with the noise schedule recomputed for the subsequence
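The variance parameterization translates directly into code. Below is a minimal numpy sketch; the helper names (`beta_tilde`, `learned_variance`) are mine, and the t = 0 entry of β̃ is zero (the paper handles the first timestep specially), so the interpolation is only evaluated for t ≥ 1 here:

```python
import numpy as np

def beta_tilde(betas: np.ndarray) -> np.ndarray:
    """Posterior variance beta_tilde_t = (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * beta_t.
    Entry t = 0 is zero (log undefined there; handled specially in the paper)."""
    alpha_bar = np.cumprod(1.0 - betas)
    alpha_bar_prev = np.append(1.0, alpha_bar[:-1])
    return (1.0 - alpha_bar_prev) / (1.0 - alpha_bar) * betas

def learned_variance(v: np.ndarray, beta: np.ndarray, beta_tl: np.ndarray) -> np.ndarray:
    """Interpolate in log space between the upper bound beta_t and the
    lower bound beta_tilde_t; v in [0, 1] is the model's variance output."""
    return np.exp(v * np.log(beta) + (1.0 - v) * np.log(beta_tl))
```

Note that v = 1 recovers β_t exactly and v = 0 recovers β̃_t, so the model can only move between the two theoretical bounds, which is what keeps training stable.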
Key Findings
- L_hybrid with cosine schedule: 3.57 bits/dim on ImageNet 64×64 (vs 3.99 baseline)
- L_vlb with importance sampling: best NLL at 3.53 bits/dim but worse FID (40.1 vs 19.2)
- Learned variance models maintain near-optimal FID with 100 steps vs 4000
- L_hybrid outperforms DDIM when using 50+ sampling steps
- BigGAN-deep: lower FID but much worse recall than the diffusion model (0.59 vs 0.72)
- FID scales as power law with compute across model sizes (30M to 270M params)
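The fast-sampling findings above rest on recomputing the noise schedule for the shortened chain: for a subsequence S of K evenly spaced timesteps, the new per-step betas are β_{S_t} = 1 − ᾱ_{S_t}/ᾱ_{S_{t−1}}, after which β̃ rescales automatically. A sketch, with the function name `subsequence_betas` assumed:

```python
import numpy as np

def subsequence_betas(betas: np.ndarray, K: int):
    """For K evenly spaced timesteps S of the original chain, recompute
    beta_{S_t} = 1 - alpha_bar_{S_t} / alpha_bar_{S_{t-1}} so the shortened
    chain preserves the original marginals q(x_{S_t} | x_0)."""
    alpha_bar = np.cumprod(1.0 - betas)
    S = np.linspace(0, len(betas) - 1, K).round().astype(int)
    ab = alpha_bar[S]
    ab_prev = np.append(1.0, ab[:-1])
    new_betas = 1.0 - ab / ab_prev
    return S, new_betas
```

Because the new βs telescope to the same ᾱ values, a model with learned variances can be sampled on the subsequence without retraining, which is where the 10-40x speedup comes from.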
Important References
- Denoising Diffusion Probabilistic Models — Foundation paper this directly improves
- Score-Based Generative Modeling through Stochastic Differential Equations — Concurrent SDE framework
- Denoising Diffusion Implicit Models — Concurrent fast sampling approach (DDIM)