The quality of generative visuals has leaped from blurry blobs to near-photorealistic films in a few years. Yet, the core technology—a diffusion process of adding and then removing noise—has remained consistent. Progress stems from optimizations and architectural improvements, not a complete paradigm shift.

Related Insights

The SNR-T bias (a mismatch between the noise level a sample actually has and the level the model expects at a given denoising step) can be fixed efficiently without retraining models. At each denoising step, the image is broken into frequency bands using wavelets. Each band is then given a small correction based on its specific noise mismatch before the bands are recombined. This targeted approach is computationally cheap and is presented as applicable across diffusion models.
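The split, correct, recombine idea can be sketched in a few lines. This is a minimal illustration, not the actual method: it uses a one-level Haar wavelet on a 1D signal instead of a 2D image, and the gains `low_gain` and `high_gain` are hypothetical placeholders for whatever per-band correction the noise mismatch would call for.

```python
import math

def haar_split(signal):
    """One-level Haar transform of an even-length list: (low band, high band)."""
    s = math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return low, high

def haar_merge(low, high):
    """Inverse of haar_split: rebuild the signal from its two bands."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(low, high):
        out.extend([(a + d) / s, (a - d) / s])
    return out

def correct_bands(signal, low_gain, high_gain):
    """Scale each band by its own (hypothetical) gain, then recombine."""
    low, high = haar_split(signal)
    low = [low_gain * v for v in low]
    high = [high_gain * v for v in high]
    return haar_merge(low, high)

# Assumed example: leave coarse structure alone, slightly damp fine detail.
noisy = [4.0, 2.0, 6.0, 8.0]
corrected = correct_bands(noisy, low_gain=1.0, high_gain=0.9)
```

With both gains set to 1.0 the signal comes back unchanged, so the correction only touches the bands it is told to adjust. The cost is a handful of additions and multiplications per sample, consistent with the claim that the fix is cheap.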

Diffusion models work on a continuous medium like an image by adding noise until it is unrecognizable, then training a model to reverse the process. This holistic denoising method is fundamentally different from autoregressive models like large language models, which predict data one token at a time.
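The noising half of that process has a simple closed form in the common DDPM-style formulation. The sketch below assumes a linear beta schedule with typical default values (`beta_start`, `beta_end`, and the 1000-step count are assumptions, not taken from the source) and treats a short list of numbers as a stand-in for image pixels.

```python
import math
import random

def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction alpha_bar_t for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, alpha_bar, rng):
    """Forward step: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    return [
        math.sqrt(alpha_bar) * v + math.sqrt(1.0 - alpha_bar) * rng.gauss(0.0, 1.0)
        for v in x0
    ]

def snr(alpha_bar):
    """Signal-to-noise ratio implied by the signal fraction alpha_bar."""
    return alpha_bar / (1.0 - alpha_bar)

alpha_bars = alpha_bar_schedule(1000)
rng = random.Random(0)
x0 = [1.0, -1.0, 0.5, -0.5]                   # toy "image"
x_mid = add_noise(x0, alpha_bars[499], rng)   # partly noised
x_end = add_noise(x0, alpha_bars[999], rng)   # almost pure noise
```

Training then amounts to showing the model samples like `x_mid` and asking it to predict the noise that was added, so that generation can run the schedule in reverse.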

The perceived intelligence of video generation models is often an illusion. The heavy lifting is done by a large language model that rewrites simple user prompts into highly detailed scenes. The video diffusion model itself is less intelligent, simply executing these detailed instructions literally.

While GANs failed for protein systems, diffusion models became the key primitive. Now the frontier of diffusion research lies in specialized scientific areas like 3D structure prediction, where innovation outpaces that in more mainstream AI applications like image generation.

While AI progress is marketed in revolutionary "step-changes" (e.g., GPT-3 to GPT-4), the underlying reality is more like compounding interest. A continuous stream of small, incremental improvements is accumulating, and their combined effect is what creates the feeling of an exponential leap in capability over time.

Flow matching is a technical evolution of diffusion that learns a 'flow map' which guides a noisy input toward the manifold of 'real images.' It's analogous to creating a wind map that directs a paper airplane to a specific house from anywhere in a city, resulting in a cleaner, more direct generation process.
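To make the wind-map picture concrete, here is a minimal sketch that follows a velocity field from a random starting point to a single target. In a real flow-matching model the velocity comes from a trained network; here it is replaced by the closed-form field of a straight-line path toward one known target, which is an assumption made purely for illustration. The names `velocity`, `integrate`, and `house` are hypothetical.

```python
def velocity(x, t, target):
    """Velocity of the straight-line path toward `target` (valid for t < 1)."""
    return [(g - v) / (1.0 - t) for v, g in zip(x, target)]

def integrate(start, target, num_steps):
    """Follow the velocity field with Euler steps from t = 0 to t = 1."""
    x = list(start)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

start = [3.0, -2.0]    # "anywhere in the city" (a noise sample)
house = [0.5, 0.25]    # the destination (a point on the data manifold)
end = integrate(start, house, num_steps=10)
```

Because the path is a straight line, even ten Euler steps land on the target, which hints at why straighter flows can need fewer sampling steps than a curved denoising trajectory.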

During training, diffusion models learn a perfect relationship between noise level (SNR) and denoising step (T). During inference, this relationship breaks as the model's own predictions introduce errors, creating SNR values it never trained on for a given step. This causes compounding errors and quality loss.
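A toy calculation shows how the mismatch arises. It assumes the training-time SNR at a step is alpha_bar / (1 - alpha_bar), and models inference error as extra noise variance left in the sample. Both the error model and the parameter `extra_noise_var` are assumptions for illustration, not the source's formulation.

```python
def expected_snr(alpha_bar):
    """SNR the model saw in training at the step with signal fraction alpha_bar."""
    return alpha_bar / (1.0 - alpha_bar)

def actual_snr(alpha_bar, extra_noise_var):
    """SNR when prediction error leaves extra noise variance in the sample (toy model)."""
    return alpha_bar / (1.0 - alpha_bar + extra_noise_var)

def snr_mismatch(alpha_bar, extra_noise_var):
    """Ratio of actual to expected SNR; 1.0 means the step sees what it trained on."""
    return actual_snr(alpha_bar, extra_noise_var) / expected_snr(alpha_bar)

early = snr_mismatch(alpha_bar=0.5, extra_noise_var=0.1)   # high-noise step
late = snr_mismatch(alpha_bar=0.9, extra_noise_var=0.1)    # low-noise step
```

In this toy model the same absolute error distorts the ratio more at low-noise steps (`late` is further from 1.0 than `early`), one way to see how small errors can compound as sampling proceeds.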

Diffusion models naturally reconstruct images in layers. In early denoising stages with high noise, they focus on low-frequency information like overall composition and color. As noise decreases in later steps, they add high-frequency details like textures and sharp edges. This hierarchical process is key to understanding their behavior.
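One way to see why coarse structure appears first is to compare signal and noise power per frequency. The sketch assumes natural-image power falls off roughly as 1 / frequency^2 (a common approximation) while the added Gaussian noise is flat across frequencies. The function names and the frequency list are illustrative.

```python
def band_snr(freq, alpha_bar, spectrum_exponent=2.0):
    """Per-frequency SNR, assuming image power ~ 1/freq**exponent and white noise."""
    signal_power = alpha_bar / freq ** spectrum_exponent
    noise_power = 1.0 - alpha_bar
    return signal_power / noise_power

def highest_visible_freq(alpha_bar, freqs):
    """Largest frequency whose SNR is still above 1, or None if all are buried."""
    visible = [f for f in freqs if band_snr(f, alpha_bar) > 1.0]
    return max(visible) if visible else None

freqs = [1, 2, 4, 8, 16, 32]
# alpha_bar rises toward 1 as denoising proceeds (less noise remains).
for alpha_bar in (0.6, 0.9, 0.99, 0.999):
    print(alpha_bar, highest_visible_freq(alpha_bar, freqs))
```

Under these assumptions the cutoff frequency climbs as noise is removed, so only low frequencies are recoverable early and fine detail becomes recoverable late.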

Models like Stable Diffusion achieve massive compression ratios (e.g., 50,000-to-1) because they aren't just storing data; they are learning the underlying principles and concepts. The resulting model is a compact 'filter' of intelligence that can generate novel outputs based on these learned principles.
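The ratio itself is just training-data size divided by model size. The figures below are hypothetical round numbers chosen only to reproduce a 50,000-to-1 ratio; they are not measured sizes for Stable Diffusion or its training set.

```python
def compression_ratio(dataset_bytes, model_bytes):
    """How many bytes of training data stand behind each byte of model weights."""
    return dataset_bytes / model_bytes

TB = 10 ** 12
GB = 10 ** 9

# Hypothetical sizes, for illustration only.
ratio = compression_ratio(dataset_bytes=100 * TB, model_bytes=2 * GB)
print(f"{ratio:,.0f}-to-1")
```

At a ratio like this the weights cannot hold the images themselves, which is the basis for the argument that the model stores general patterns rather than data.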

The primary performance bottleneck for LLMs is memory bandwidth (moving large weights), making them memory-bound. In contrast, diffusion-based video models are compute-bound, as they saturate the GPU's processing power by simultaneously denoising tens of thousands of tokens. The two workloads therefore call for fundamentally different optimization strategies.
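A rough roofline-style estimate illustrates the contrast. The sketch counts FLOPs and bytes moved for a single square weight matrix applied to a batch of tokens, then compares that ratio with the hardware's ridge point. The hardware figures `PEAK_FLOPS` and `MEM_BANDWIDTH`, the 16-bit value size, and the layer width are all hypothetical round numbers.

```python
def arithmetic_intensity(num_tokens, d_model, bytes_per_value=2):
    """FLOPs per byte moved for a (num_tokens x d_model) @ (d_model x d_model) matmul."""
    flops = 2 * num_tokens * d_model * d_model
    weight_bytes = d_model * d_model * bytes_per_value
    activation_bytes = 2 * num_tokens * d_model * bytes_per_value  # read input, write output
    return flops / (weight_bytes + activation_bytes)

def bottleneck(num_tokens, d_model, peak_flops, mem_bandwidth):
    """Classify the matmul against the hardware's ridge point (FLOPs per byte)."""
    ridge = peak_flops / mem_bandwidth
    intensity = arithmetic_intensity(num_tokens, d_model)
    return "compute-bound" if intensity > ridge else "memory-bound"

PEAK_FLOPS = 1e15       # hypothetical accelerator: 1 PFLOP/s
MEM_BANDWIDTH = 3e12    # hypothetical accelerator: 3 TB/s

llm_decode = bottleneck(1, 4096, PEAK_FLOPS, MEM_BANDWIDTH)         # one token per step
video_step = bottleneck(30_000, 4096, PEAK_FLOPS, MEM_BANDWIDTH)    # many tokens at once
```

With one token per step, each weight is loaded to do only a couple of operations, so bandwidth is the limit. With tens of thousands of tokens, each loaded weight is reused many times and the arithmetic units become the limit.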

Generative Image Quality Skyrocketed Without Fundamentally Changing Core Diffusion Technology | RiffOn