Image and Video Diffusion Models

Definition: A class of generative AI models that generate data (images, video, audio, text) by iteratively denoising a sample drawn from a Gaussian distribution. The process involves two phases: a forward diffusion process that gradually adds noise to data until it becomes pure noise, and a learned reverse diffusion process that removes noise step by step to produce samples from the original data distribution.

Core Mechanism

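  • Forward Process: $q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big)$ for $t = 1, \dots, T$, where $T$ is the final time step, resulting in pure Gaussian noise $x_T$.
  • Reverse Process: Predicts the noise $\epsilon_\theta(x_t, t)$ at each timestep to reconstruct $x_{t-1}$ from $x_t$.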
  • Loss Function: Typically optimized using Mean Squared Error (MSE) between predicted and actual noise.
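
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a training recipe: it assumes the linear noise schedule and the fixed-variance reverse step from Ho et al. (2020), uses 0-indexed timesteps, and the function names (`make_schedule`, `forward_sample`, `noise_mse`, `reverse_step`) are hypothetical. In practice the predicted noise would come from a trained neural network; here the true noise stands in for a perfect predictor.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule; the default values are those used by Ho et al. (2020)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alpha_bars = np.cumprod(1.0 - betas)  # cumulative product of (1 - beta_s)
    return betas, alpha_bars

def forward_sample(x0, t, alpha_bars, rng):
    """Draw x_t from q(x_t | x_0) in closed form and return (x_t, noise)."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return x_t, noise

def noise_mse(predicted_noise, true_noise):
    """Mean squared error between predicted and actual noise."""
    return float(np.mean((predicted_noise - true_noise) ** 2))

def reverse_step(x_t, t, predicted_noise, betas, alpha_bars, rng):
    """One ancestral sampling step x_t -> x_{t-1}, assuming variance beta_t."""
    alpha_t = 1.0 - betas[t]
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * predicted_noise) / np.sqrt(alpha_t)
    if t == 0:
        return mean  # no noise is added on the final step
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

rng = np.random.default_rng(0)
betas, alpha_bars = make_schedule()
x0 = rng.standard_normal((4, 8, 8))              # stand-in for a batch of data
x_t, noise = forward_sample(x0, 500, alpha_bars, rng)
loss = noise_mse(noise, noise)                   # a perfect predictor has zero loss

# With a perfect noise prediction, the first forward step is undone exactly.
x_first, noise_first = forward_sample(x0, 0, alpha_bars, rng)
x_recovered = reverse_step(x_first, 0, noise_first, betas, alpha_bars, rng)
```

Sampling `x_t` directly from `x0` avoids simulating every intermediate step, which is why a training loop can pick a random timestep per example.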

Modalities & Applications

Image Generation

  • Dominant architecture for high-fidelity image synthesis.
  • Key models: Stable Diffusion, DALL-E 3, Midjourney.
  • Uses autoregressive or parallel denoising strategies, depending on the implementation.

Video Generation

  • Extends spatial diffusion with a temporal dimension.
  • Models must maintain coherence across frames while generating motion.
  • Examples: Sora, Runway Gen-2.
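
To make the temporal dimension concrete, the sketch below treats a video as an array with an extra leading frame axis and applies the usual noising formula to the whole clip. The shapes, the `alpha_bar_t` value, and the `frame_difference` measure are illustrative assumptions, not taken from any particular model; the frame-to-frame difference is only a crude stand-in for coherence.

```python
import numpy as np

def noise_video(video, alpha_bar_t, rng):
    """Apply closed-form forward noising to a (frames, height, width, channels) array."""
    noise = rng.standard_normal(video.shape)  # independent noise for every frame
    return np.sqrt(alpha_bar_t) * video + np.sqrt(1.0 - alpha_bar_t) * noise

def frame_difference(video):
    """Mean squared difference between adjacent frames (hypothetical coherence proxy)."""
    return float(np.mean((video[1:] - video[:-1]) ** 2))

rng = np.random.default_rng(0)
frame = rng.standard_normal((16, 16, 3))
static_video = np.repeat(frame[None, ...], 8, axis=0)   # 8 identical frames
noisy_video = noise_video(static_video, alpha_bar_t=0.5, rng=rng)

static_score = frame_difference(static_video)   # identical frames differ by nothing
noisy_score = frame_difference(noisy_video)     # per-frame noise breaks that consistency
```

Because the noise is drawn independently per frame, a denoiser that handled each frame in isolation would have no way to restore consistency between neighbors, which is one way to see why video models need to share information along the frame axis.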

Text Generation

Comparison with Autoregressive Models

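Feature          | Diffusion Models                                              | Autoregressive (LLMs)
-----------------+---------------------------------------------------------------+------------------------------------------
Generation Style | Parallel/Denoising steps                                      | Sequential token-by-token
Latency          | Can be optimized via distillation; inherently parallelizable  | Sequential bottleneck
Strengths        | High-fidelity imagery, complex distributions                  | Long-context coherence, logic, reasoning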
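
The difference in generation style can be illustrated with a toy comparison of how many model calls each approach needs. Both "models" below are placeholder lambdas and the function names are hypothetical; the point is only the loop structure, not output quality.

```python
import numpy as np

def diffusion_generate(shape, num_steps, denoise_fn, rng):
    """Start from Gaussian noise and refine every element once per denoising step."""
    x = rng.standard_normal(shape)
    calls = 0
    for step in reversed(range(num_steps)):
        x = denoise_fn(x, step)  # one call updates the whole output
        calls += 1
    return x, calls

def autoregressive_generate(length, next_token_fn):
    """Produce one token per call, each conditioned on the tokens so far."""
    tokens = []
    calls = 0
    for _ in range(length):
        tokens.append(next_token_fn(tokens))
        calls += 1
    return tokens, calls

rng = np.random.default_rng(0)
toy_denoiser = lambda x, step: 0.9 * x        # placeholder for a trained network
toy_next_token = lambda tokens: len(tokens)   # placeholder for a language model

image, diffusion_calls = diffusion_generate((64, 64), 20, toy_denoiser, rng)
tokens, ar_calls = autoregressive_generate(64, toy_next_token)
```

In this sketch the diffusion call count depends on the number of denoising steps rather than on the output size, while the autoregressive call count grows with sequence length. Reducing the step count is the lever that distillation targets.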

References

  • Ho, J., et al. “Denoising Diffusion Probabilistic Models.” NeurIPS 2020.
  • Song, Y., et al. “Score-Based Generative Modeling through Stochastic Differential Equations.” ICLR 2021.