
Image and Video Diffusion Models

Definition: A class of generative AI models that generate data (images, video, audio, text) by iteratively denoising a sample of random Gaussian noise. The process involves two phases: a forward diffusion process that gradually adds noise to data until it becomes pure noise, and a reverse diffusion process in which a model learns to remove that noise, so that new samples from the data distribution can be generated starting from noise.

Core Mechanism

Forward Process: $x_{0} \to x_{1} \to \dots \to x_{T}$, where $x_{0}$ is a data sample and $T$ is the final timestep, at which $x_{T}$ is approximately pure Gaussian noise.
Reverse Process: A model predicts the noise $\epsilon_{\theta}(x_{t}, t)$ at each timestep, and that prediction is used to estimate $x_{t-1}$ from $x_{t}$.
Loss Function: Typically the Mean Squared Error (MSE) between the predicted noise $\epsilon_{\theta}(x_{t}, t)$ and the noise $\epsilon$ that was actually added.
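The three pieces above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: it assumes a DDPM-style linear noise schedule (the `beta_start` and `beta_end` values are conventional defaults, not taken from this page), zero-based timesteps, and a placeholder in place of a trained network. The function names `make_schedule`, `forward_noise`, `reverse_step`, and `noise_mse` are hypothetical.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Assumed linear beta schedule; returns betas and cumulative alpha products."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alpha_bar = np.cumprod(1.0 - betas)
    return betas, alpha_bar

def forward_noise(x0, t, alpha_bar, rng):
    """Forward process in closed form: jump from x_0 directly to x_t."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

def reverse_step(x_t, t, eps_pred, betas, alpha_bar, rng):
    """One DDPM-style reverse step: estimate x_{t-1} from x_t and predicted noise."""
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(1.0 - betas[t])
    if t == 0:
        return mean  # no noise is added at the final step
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

def noise_mse(eps_pred, eps_true):
    """Training loss: mean squared error between predicted and actual noise."""
    return float(np.mean((eps_pred - eps_true) ** 2))

rng = np.random.default_rng(0)
betas, alpha_bar = make_schedule()
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))  # toy stand-in for an image

# At the last timestep almost no signal from x0 remains.
x_T, eps_T = forward_noise(x0, t=len(betas) - 1, alpha_bar=alpha_bar, rng=rng)

# Placeholder for a trained network eps_theta(x_t, t): it always predicts zero noise.
loss = noise_mse(np.zeros_like(eps_T), eps_T)

# With a perfect noise prediction, the step at t = 0 recovers x0 exactly.
x_1, eps_1 = forward_noise(x0, t=0, alpha_bar=alpha_bar, rng=rng)
x0_rec = reverse_step(x_1, 0, eps_1, betas, alpha_bar, rng)
```

In training, `t` would be sampled at random for each example and the placeholder replaced by a network whose parameters are updated to reduce `noise_mse`.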

Modalities & Applications

Image Generation

Diffusion models are the dominant approach to high-fidelity image synthesis.

Video Generation

Extends spatial denoising to temporal coherence, treating video as a sequence of frames or 3D volumes.
Scaling Insights: Dieleman’s DeepMind Insights: Building Large-Scale Diffusion Models for Image and Video covers strategies for building large-scale diffusion models at Google DeepMind, with a focus on efficient training dynamics and architectural choices for video generation.
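The two ways of treating video mentioned above (a sequence of frames versus a 3D volume) differ mainly in how the tensor is laid out for the denoiser. The sketch below is illustrative only: the clip dimensions, the `(batch, time, height, width, channels)` layout, and the signal level `alpha_bar_t` are all assumptions, not details from the referenced talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy batch: 2 clips, 16 frames each, 32x32 RGB.
batch, frames, height, width, channels = 2, 16, 32, 32, 3
video = rng.uniform(-1.0, 1.0, size=(batch, frames, height, width, channels))

# Noise is added to every pixel of every frame, exactly as for a single image.
alpha_bar_t = 0.5  # assumed signal level at some timestep t
eps = rng.standard_normal(video.shape)
noisy_video = np.sqrt(alpha_bar_t) * video + np.sqrt(1.0 - alpha_bar_t) * eps

# Sequence-of-frames view: fold time into the batch axis, so a 2D image
# denoiser sees each frame independently.
as_frames = noisy_video.reshape(batch * frames, height, width, channels)

# 3D-volume view: keep the time axis, so a spatiotemporal denoiser can mix
# information across neighbouring frames.
as_volume = noisy_video
```

Both views hold the same values; temporal coherence has to come from the denoiser mixing information along the time axis, which the frame-wise layout alone does not provide.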

References

Dieleman’s DeepMind Insights: Building Large-Scale Diffusion Models for Image and Video

NemoClaw Knowledge Wiki
