Summary

The original DDPM framework produced excellent samples but suffered from poor log-likelihoods and extremely slow sampling (thousands of steps). Nichol and Dhariwal address both problems with three key modifications: (1) learning the reverse process variances via interpolation in log-space between theoretical upper and lower bounds (β_t and β̃_t), trained with a hybrid objective L_hybrid = L_simple + 0.001 · L_vlb; (2) a cosine noise schedule that destroys information more gradually than the linear schedule; and (3) importance sampling over timesteps to reduce gradient noise when optimizing the VLB.

These changes achieve competitive log-likelihoods (2.94 bits/dim on CIFAR-10, 3.53 on ImageNet 64×64) while the learned variance parameterization enables fast sampling: the model produces high-quality samples with 50-100 steps instead of 4000, because the variances automatically rescale for shorter processes. Precision/recall analysis reveals diffusion models achieve much higher recall (mode coverage) than GANs at comparable FID, and FID follows a power law with training compute.
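The automatic rescaling works because both β_t and β̃_t for a shortened chain follow directly from the subsequence's ᾱ values. A minimal NumPy sketch under that reading (the helper name `strided_betas` is mine, not from the paper):

```python
import numpy as np

def strided_betas(alpha_bar, num_steps):
    """Given cumulative products alpha_bar for the full T-step process,
    pick an evenly spaced subsequence of num_steps timesteps and derive
    the betas (and posterior variances beta_tilde) for the short chain."""
    T = len(alpha_bar)
    ts = np.linspace(0, T - 1, num_steps).round().astype(int)
    sub = alpha_bar[ts]
    prev = np.concatenate([[1.0], sub[:-1]])   # alpha_bar of previous step
    betas = 1.0 - sub / prev                   # beta_t for the short chain
    beta_tilde = (1.0 - prev) / (1.0 - sub) * betas  # posterior variances
    return ts, betas, beta_tilde
```

Because the network predicts an interpolation coefficient v rather than a variance value, the same output remains meaningful once β_t and β̃_t are recomputed for the subsequence.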

Key Contributions

  • Learned reverse process variances via log-space interpolation with hybrid training objective
  • Cosine noise schedule for more gradual information destruction
  • Importance sampling over timesteps that reduces gradient noise when optimizing the VLB directly
  • Fast sampling in 50-100 steps instead of 4000 via automatic variance rescaling
  • Competitive log-likelihoods matching best convolutional likelihood-based models
  • Precision/recall analysis showing superior mode coverage vs GANs
  • Scaling laws: FID follows power law with training compute

Methodology

  • Variance parameterization: Σ_θ(x_t, t) = exp(v · log β_t + (1−v) · log β̃_t), where v is a per-dimension output of the network
  • Hybrid loss: a stop-gradient is applied to μ_θ in the L_vlb term, so the VLB guides only the variances while L_simple remains the sole signal for the mean
  • Cosine schedule: ᾱ_t = f(t)/f(0) with f(t) = cos²((t/T + s)/(1+s) · π/2) and s = 0.008
  • Fast sampling: K evenly-spaced timesteps from [1, T]
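The schedule and variance formulas above can be sketched in a few lines of NumPy (function names are mine; the β_t clipping at 0.999 follows the paper's prescription for avoiding singularities near t = T):

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    # alpha_bar_t = f(t)/f(0), f(t) = cos^2(((t/T + s)/(1 + s)) * pi/2)
    t = np.arange(T + 1)
    f = np.cos(((t / T + s) / (1 + s)) * np.pi / 2) ** 2
    return f / f[0]

def betas_from_alpha_bar(alpha_bar, max_beta=0.999):
    # beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}, clipped at 0.999
    return np.minimum(1.0 - alpha_bar[1:] / alpha_bar[:-1], max_beta)

def sigma_from_v(v, beta_t, beta_tilde_t):
    # Log-space interpolation between the two variance bounds:
    # Sigma_theta = exp(v * log beta_t + (1 - v) * log beta_tilde_t)
    return np.exp(v * np.log(beta_t) + (1 - v) * np.log(beta_tilde_t))
```

The small offset s keeps β_t from vanishing near t = 0, and the interpolated variance always lies between the two bounds for v in [0, 1].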

Key Findings

  • L_hybrid with cosine schedule: 3.57 bits/dim on ImageNet 64×64 (vs 3.99 baseline)
  • L_vlb with importance sampling: best NLL at 3.53 bits/dim but worse FID (40.1 vs 19.2)
  • Learned variance models maintain near-optimal FID with 100 steps vs 4000
  • L_hybrid outperforms DDIM when using 50+ sampling steps
  • BigGAN-deep: lower FID but much worse recall (0.59 vs 0.72)
  • FID scales as power law with compute across model sizes (30M to 270M params)
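The importance-sampled VLB result above relies on the paper's sampler p_t ∝ sqrt(E[L_t²]), with the expectation estimated from the 10 most recent loss terms per timestep and uniform sampling until every timestep has a full history. A minimal sketch of that scheme (the class name and structure are my own framing):

```python
import numpy as np

class LossAwareSampler:
    """Importance sampler over timesteps: p_t proportional to
    sqrt(E[L_t^2]), estimated from a rolling history of recent
    per-timestep losses; uniform until every history is full."""
    def __init__(self, T, history=10):
        self.T = T
        self.history = history
        self.losses = np.zeros((T, history))
        self.counts = np.zeros(T, dtype=int)

    def weights(self):
        if np.any(self.counts < self.history):
            return np.full(self.T, 1.0 / self.T)  # warm-up: uniform
        w = np.sqrt((self.losses ** 2).mean(axis=1))
        return w / w.sum()

    def sample(self, rng):
        p = self.weights()
        t = rng.choice(self.T, p=p)
        return t, p[t]  # reweight the loss term L_t by 1 / p[t]

    def update(self, t, loss):
        self.losses[t, self.counts[t] % self.history] = loss
        self.counts[t] += 1
```

Dividing each sampled L_t by its probability keeps the estimator unbiased while concentrating samples on the noisy early timesteps that dominate the VLB.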

Important References

  1. Denoising Diffusion Probabilistic Models — Foundation paper this directly improves
  2. Score-Based Generative Modeling through Stochastic Differential Equations — Concurrent SDE framework
  3. Denoising Diffusion Implicit Models — Concurrent fast sampling approach (DDIM)

Atomic Notes


paper