Image and Video Diffusion Models
Definition: A class of generative AI models that generate data (images, video, audio, text) by iteratively denoising a sample of random Gaussian noise. The process involves two phases: a forward diffusion process that gradually adds noise to data until it becomes pure noise, and a reverse diffusion process, learned by the model, that removes noise step by step to produce samples from the original data distribution.
Core Mechanism
- Forward Process: $x_0 \to x_1 \to \dots \to x_T$, where $T$ is the final time step resulting in pure Gaussian noise.
- Reverse Process: Predicts noise at each timestep to reconstruct $x_0$ from $x_T$.
- Loss Function: Typically optimized using Mean Squared Error (MSE) between predicted and actual noise.
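The three bullets above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes the closed-form noising rule and linear noise schedule from Ho et al. (DDPM), and the names `q_sample`, `noise_mse`, `p_sample_step`, the value `T = 1000`, and the flattened 16-value "image" are all choices made here for illustration. In practice the predicted noise comes from a trained neural network, which is omitted.

```python
import math
import random

T = 1000  # number of diffusion steps (assumed value)
# Linear noise schedule from 1e-4 to 0.02 (an assumed, commonly used choice).
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alphas = [1.0 - b for b in betas]
alpha_bars = []  # cumulative products: alpha_bar_t = alpha_0 * ... * alpha_t
running = 1.0
for a in alphas:
    running *= a
    alpha_bars.append(running)


def q_sample(x0, t, eps):
    """Forward process in closed form: jump from clean data x0 to step t."""
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    return [signal * x + noise * e for x, e in zip(x0, eps)]


def noise_mse(eps_pred, eps):
    """Training loss: mean squared error between predicted and actual noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)


def p_sample_step(x_t, t, eps_pred, rng):
    """One reverse step from x_t to the previous step, given predicted noise."""
    coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
    mean = [(x - coef * e) / math.sqrt(alphas[t]) for x, e in zip(x_t, eps_pred)]
    if t == 0:
        return mean  # no fresh noise is added on the last step
    sigma = math.sqrt(betas[t])
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]


rng = random.Random(0)
x0 = [rng.uniform(-1.0, 1.0) for _ in range(16)]  # stand-in for a flattened image
eps = [rng.gauss(0.0, 1.0) for _ in range(16)]    # the "actual" noise
x_T = q_sample(x0, T - 1, eps)                    # nearly pure Gaussian noise
```

With this schedule `alpha_bars[-1]` is close to zero, so `x_T` carries almost no trace of `x0`. Sampling would start from fresh Gaussian noise and call `p_sample_step` for `t = T - 1` down to `0`, feeding in the network's noise prediction at each step.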
Modalities & Applications
Image Generation
- Dominant architecture for high-fidelity image synthesis.
- Key models: Stable Diffusion, DALL-E 3, Midjourney.
- Denoising updates all pixels (or latents) in parallel at each step; some image generators instead use autoregressive token prediction, depending on the implementation.
Video Generation
- Extends spatial diffusion with a temporal dimension.
- Models must maintain coherence across frames while generating motion.
- Examples: Sora, Runway Gen-2.
Text Generation
- Traditionally, text generation has been dominated by autoregressive models (e.g., decoder-only transformer language models).
- Recent Development: Non-autoregressive approaches, including diffusion over text, are emerging to reduce latency.
- See: "Text Diffusion: Google DeepMind’s Faster Parallel Text Generation via Denoising" for details on the parallel denoising approach.
Comparison with Autoregressive Models
| Feature | Diffusion Models | Autoregressive (LLMs) |
|---|---|---|
| Generation Style | Parallel/Denoising steps | Sequential token-by-token |
| Latency | Multiple denoising steps, each processing the whole output in parallel; step count can be reduced via distillation | One sequential step per token |
| Strengths | High-fidelity imagery, complex distributions | Long-context coherence, logic, reasoning |
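
The difference in generation style can be made concrete by counting sequential model calls. The sketch below is a toy under stated assumptions: `toy_next_token` and `toy_denoise` are hypothetical stand-ins for real models, and the output length of 256 and the 20 denoising steps are arbitrary illustrative values. It only counts calls; real latency also depends on the cost of each call.

```python
def autoregressive_generate(next_token, length):
    """Sequential decoding: one model call per output token."""
    tokens, calls = [], 0
    for _ in range(length):
        tokens.append(next_token(tokens))  # each call depends on the prefix so far
        calls += 1
    return tokens, calls


def diffusion_generate(denoise, length, steps):
    """Iterative denoising: one model call per step, updating every position."""
    x, calls = [1.0] * length, 0  # stand-in for an initial noise sample
    for step in reversed(range(steps)):
        x = denoise(x, step)  # all positions are refined in the same call
        calls += 1
    return x, calls


# Hypothetical stand-in "models", present only to make the sketch runnable.
def toy_next_token(prefix):
    return len(prefix)


def toy_denoise(x, step):
    return [0.5 * v for v in x]


_, ar_calls = autoregressive_generate(toy_next_token, length=256)
_, diff_calls = diffusion_generate(toy_denoise, length=256, steps=20)
print(ar_calls, diff_calls)  # 256 20
```

The autoregressive call count grows with output length, while the diffusion call count is set by the number of denoising steps, which is what distillation aims to reduce.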
References
- Ho, J., et al. “Denoising Diffusion Probabilistic Models.” NeurIPS 2020.
- Song, Y., et al. “Score-Based Generative Modeling through Stochastic Differential Equations.” ICLR 2021.