Diffusion Models: How AI Creates Art from Noise
Midjourney and DALL-E don't paint; they 'denoise'. We explain the physics-inspired magic behind D...
Abstract Algorithms
TLDR: Diffusion models are trained by corrupting images with noise in a fixed forward process and teaching a network to undo that noise. At inference time you start from pure static and iteratively denoise into a meaningful image. They power DALL-E, Midjourney, and Stable Diffusion.
The Reverse Photograph: Making Sense from Static
Imagine you take a clear photo and run it through a machine that adds a tiny bit of static 1,000 times in a row. At step 1,000 the image is pure noise, indistinguishable from random snow on a TV.
Now you train a neural network to reverse that process: given an image at step $t$, predict what it looked like at step $t-1$. Once the model can do this reliably, you can start from pure noise at step 1,000 and run the denoiser 1,000 times to arrive at a sharp, coherent image.
That is the entire intuition behind diffusion models.
Two Processes: Forward Diffusion and Reverse Denoising
Forward process (training only):
- A fixed, predefined procedure (sampling Gaussian noise); no network needed.
- At each step $t$, add a small amount of Gaussian noise.
- After $T$ steps (~1000), the image is purely random noise.
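A convenient property of the forward process is that it has a closed form: you can jump straight to any step $t$ from the clean image, without looping. A minimal NumPy sketch (the schedule values follow the common linear schedule, but are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear beta (noise) schedule over T steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # abar_t = prod of alphas up to step t

def q_sample(x0, t, noise):
    """Sample x_t directly from x_0: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

x0 = rng.standard_normal((64, 64))   # stand-in for a normalized image
eps = rng.standard_normal(x0.shape)

early = q_sample(x0, 10, eps)        # still mostly signal
late = q_sample(x0, 999, eps)        # almost pure noise: abar_999 is tiny
```

Note how `alpha_bars` shrinks toward zero: by the final step the signal coefficient is negligible, which is exactly the "pure noise" endpoint described above.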
Reverse process (what the model learns):
- The model learns to predict and remove the noise added at each step.
- At inference time, run the reverse process from $t=T$ to $t=0$.
```mermaid
flowchart LR
    Clean[Clean Image\nt=0] -->|Add noise step by step| Noisy[Pure Noise\nt=1000]
    Noisy -->|Learned denoising| Clean2[Generated Image\nt=0]
```
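The reverse process is just a loop from $t=T$ down to $t=0$, subtracting the model's noise estimate at each step. A sketch of the DDPM-style sampling loop, with a dummy function standing in for the trained network (a real $\epsilon_\theta$ is a neural network):

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def predict_noise(x_t, t):
    """Stand-in for the trained network eps_theta(x_t, t)."""
    return np.zeros_like(x_t)  # a real model returns its noise estimate here

def ddpm_sample(shape):
    x = rng.standard_normal(shape)  # start from pure noise at t = T
    for t in reversed(range(T)):
        eps = predict_noise(x, t)
        # Subtract the predicted noise, rescaled for this step's schedule.
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / np.sqrt(alphas[t])
        if t > 0:  # add fresh noise at every step except the last
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

sample = ddpm_sample((8, 8))
```

With a real trained model in place of `predict_noise`, this loop is the entire inference procedure.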
| Aspect | Forward (training) | Reverse (inference) |
|---|---|---|
| Input | Clean image | Pure noise |
| Operation | Add Gaussian noise | Remove predicted noise |
| Known? | Yes, we add it | No, model predicts it |
| Output | Noisy image | Clean image |
What the Model Actually Learns: Predicting Noise, Not Pixels
The model is not trained to directly predict the clean image. It is trained to predict the noise that was added.
$$L = \| \epsilon - \epsilon_\theta(x_t, t) \|^2$$
- $\epsilon$: the actual Gaussian noise we added (known, because we added it)
- $\epsilon_\theta(x_t, t)$: the model's prediction of that noise
- $x_t$: the noisy image at step $t$
Plain English: "Look at this noisy image at step $t$. Guess what static I mixed in. The better you guess, the more precisely I can subtract it to recover the clean image."
The U-Net architecture (with skip connections) is commonly used as $\epsilon_\theta$ because it can process images at multiple scales while retaining fine-grained spatial information.
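The training loop above the loss is simple: pick a random timestep, noise the image, and score the model's noise guess. A sketch of one training step in NumPy, with a dummy function standing in for the U-Net:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def eps_theta(x_t, t):
    """Stand-in for the U-Net; a real model is a trained neural network."""
    return np.zeros_like(x_t)

def training_loss(x0):
    t = rng.integers(0, T)                # random timestep per example
    eps = rng.standard_normal(x0.shape)   # the noise we mix in (known!)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    # L = || eps - eps_theta(x_t, t) ||^2  -- predict the noise, not the pixels
    return np.mean((eps - eps_theta(x_t, t)) ** 2)

loss = training_loss(rng.standard_normal((64, 64)))
```

Because the noise target is known exactly, this is plain supervised regression; no adversarial training is needed.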
Stable Diffusion: Latent Space and Text Conditioning
Running 1,000 denoising steps on a full 512×512 image is slow. Stable Diffusion (Rombach et al., 2022) adds two improvements:
Latent diffusion: Compress the image into a small latent representation using a VAE first. Run the diffusion process on the smaller latent (8× smaller per spatial dimension), then decode back to full resolution.
Text conditioning (CLIP): Feed the text prompt through a text encoder (CLIP). Inject the text embedding into each denoising step via cross-attention. The model learns to denoise toward images that match the text.
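Cross-attention is the mechanism that injects the prompt: queries come from the image latents, keys and values come from the text embeddings. A minimal single-head sketch in NumPy, with random matrices standing in for learned weights (the 77-token, 768-dim text shape matches Stable Diffusion v1's CLIP encoder; the other sizes are illustrative):

```python
import numpy as np

def cross_attention(latent_tokens, text_tokens, d=64):
    """Single-head cross-attention: image latents attend to text tokens."""
    rng = np.random.default_rng(0)
    Wq = rng.standard_normal((latent_tokens.shape[-1], d))  # query proj (image side)
    Wk = rng.standard_normal((text_tokens.shape[-1], d))    # key proj (text side)
    Wv = rng.standard_normal((text_tokens.shape[-1], d))    # value proj (text side)
    Q, K, V = latent_tokens @ Wq, text_tokens @ Wk, text_tokens @ Wv
    scores = Q @ K.T / np.sqrt(d)
    # Softmax over the text tokens: each latent position weighs the prompt.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # text-informed features, one row per latent position

latents = np.random.default_rng(1).standard_normal((64, 320))  # 8x8 latent grid, flattened
text = np.random.default_rng(2).standard_normal((77, 768))     # CLIP: 77 tokens, 768-dim
out = cross_attention(latents, text)
```

In the real U-Net this block appears at multiple resolutions, so the prompt can steer both global layout and fine detail.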
```mermaid
flowchart TD
    Prompt[Text Prompt] --> CLIP[CLIP Text Encoder]
    Noise[Latent Noise] --> UNet[U-Net Denoiser\nwith Cross-Attention]
    CLIP --> UNet
    UNet --> VAE[VAE Decoder]
    VAE --> Image[Generated Image]
```
Real-World Generators and What They Use
| Product | Underlying model | Key innovation |
|---|---|---|
| Stable Diffusion | Latent diffusion (Runway ML / Stability AI) | Open weights, runs on consumer GPUs |
| DALL-E 3 (OpenAI) | Diffusion with better text alignment | Trained with synthetic high-quality captions |
| Midjourney | Proprietary diffusion | Aesthetic tuning and community ranking |
| Adobe Firefly | Diffusion trained on licensed images | IP-safe training data |
| Sora (OpenAI, video) | Diffusion Transformers (DiT) on video tokens | Temporal coherence over long clips |
Inference Speed vs Quality: Steps, Samplers, and Guidance
Steps: More denoising steps = sharper image but slower. 20–50 steps is a common sweet spot.
Samplers/Schedulers: DDPM (original) needs 1,000 steps. DDIM, DPM++, and LCM reduce this to 4–50 steps with comparable quality.
Classifier-Free Guidance (CFG scale): Trades creativity for prompt adherence.
- Low CFG (1–3): dreamlike, diverse, may ignore the prompt
- High CFG (10–20): closely follows the prompt but can look oversaturated
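The CFG scale enters as a simple extrapolation at each denoising step: run the model twice, with and without the prompt, then push the prediction past the conditional one. A sketch of the combination rule:

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction in the direction of the text-conditioned one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_u = np.zeros((4, 4))  # toy unconditional noise prediction
eps_c = np.ones((4, 4))   # toy text-conditioned noise prediction
guided = cfg_combine(eps_u, eps_c, 7.5)  # scale 1.0 would reduce to eps_cond
```

At scale 1 the formula reduces to the conditional prediction alone; larger scales exaggerate the "direction of the prompt", which is why very high values overshoot into oversaturation.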
Key Takeaways
- Diffusion models learn to remove noise iteratively; inference starts from pure noise and denoises.
- The model predicts the noise $\epsilon$, not pixels; subtracting the predicted noise recovers the clean image.
- Stable Diffusion runs in latent space (8ร compression) and uses CLIP for text conditioning.
- Common samplers (DDIM, DPM++) reduce required steps from 1,000 to ~20 with comparable quality.
- CFG scale controls the prompt-adherence vs creativity trade-off.
Test Your Understanding
- Why does training a diffusion model require adding noise to images first?
- The model outputs $\epsilon_\theta(x_t, t)$. How is this used to recover the image at step $t-1$?
- How does text conditioning work in Stable Diffusion?
- A user complains the output looks oversaturated and plasticky. Which setting might be too high?