During training, a diffusion model only ever sees inputs whose noise level matches the timestep exactly: the forward process ties each step t to a fixed signal-to-noise ratio (SNR). During inference that tie breaks, because the model's own prediction errors leave extra residual noise in the sample, so a given step carries SNR values the model never encountered in training. The mismatch compounds from step to step and degrades output quality.
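To make the mismatch concrete, here is a minimal numerical sketch. The schedule, step indices, and error variance below are illustrative assumptions, not values from any particular model; the point is only that a nonzero epsilon-prediction error strictly lowers the SNR the model sees at the next step relative to what the schedule promises.

```python
import numpy as np

# Illustrative linear beta schedule (values assumed, not from any specific model).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
abar = np.cumprod(1.0 - betas)  # alpha-bar: cumulative signal retention

def scheduled_snr(t):
    """SNR the schedule pairs with step t: abar_t / (1 - abar_t)."""
    return abar[t] / (1.0 - abar[t])

def effective_snr(t, s, err_var):
    """SNR actually present at step s after one x0-prediction / re-noise step
    from t, when the epsilon prediction carries error variance err_var.

    With eps_hat = eps + e and Var(e) = err_var:
      x0_hat = (x_t - sqrt(1-abar_t) * eps_hat) / sqrt(abar_t)
             = x0 - sqrt((1-abar_t)/abar_t) * e
      x_s    = sqrt(abar_s) * x0_hat + sqrt(1-abar_s) * eps'
    so the noise variance at s is (1 - abar_s) plus an extra
    abar_s * (1-abar_t)/abar_t * err_var term the schedule knows nothing about.
    """
    extra = abar[s] * (1.0 - abar[t]) / abar[t] * err_var
    return abar[s] / (1.0 - abar[s] + extra)

t, s = 600, 599
print(scheduled_snr(s))           # what the model was trained to expect at s
print(effective_snr(t, s, 0.05))  # what it actually gets: strictly lower
```

A perfect prediction (`err_var = 0`) recovers the scheduled value exactly; any imperfection pushes the effective SNR below it, and repeating the step inflates the gap.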
Diffusion models naturally reconstruct images coarse to fine. In early denoising steps, when noise is high, they recover low-frequency content such as overall composition and color; as noise falls in later steps, they add high-frequency detail such as textures and sharp edges. The hierarchy follows from the noise statistics: white noise buries weak high-frequency image content long before it masks strong low-frequency content, and this frequency ordering is key to understanding the models' behavior.
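A quick way to see this ordering is to measure per-frequency-band SNR under additive white noise. The sketch below uses a synthetic 1-D signal with a 1/f amplitude spectrum as a stand-in for natural-image statistics; the signal, band edges, and noise level are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4096

# Toy 1-D "image": random phases under a 1/f amplitude spectrum, a common
# stand-in for natural-image statistics (strong low, weak high frequencies).
freqs = np.fft.rfftfreq(n)
amp = np.zeros_like(freqs)
amp[1:] = 1.0 / freqs[1:]
x0 = np.fft.irfft(amp * np.exp(2j * np.pi * rng.random(amp.shape)), n)
x0 /= x0.std()

def band_snr(sigma, lo, hi):
    """Empirical SNR of x0 + sigma * white noise, restricted to [lo, hi)."""
    band = (freqs >= lo) & (freqs < hi)
    signal_power = (np.abs(np.fft.rfft(x0)) ** 2)[band].sum()
    # White noise of std sigma has expected power ~ n * sigma**2 per FFT bin.
    noise_power = sigma ** 2 * n * band.sum()
    return signal_power / noise_power

sigma = 2.0  # heavy noise, as in an early denoising step
print(band_snr(sigma, 0.0, 0.02))  # low band: coarse structure still visible
print(band_snr(sigma, 0.25, 0.5))  # high band: fine detail is buried
```

At heavy noise the low band's SNR is orders of magnitude above the high band's, so coarse structure is recoverable first and fine detail only once the noise has dropped.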
The SNR-T bias can be corrected efficiently at inference time, with no retraining. At each denoising step, the intermediate image is decomposed into frequency bands with a wavelet transform; each band receives a small correction sized to its own noise mismatch, and the bands are recombined. Because the correction is a cheap per-band adjustment, it adds little compute and applies across models.
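The paragraph above describes a pipeline, not a specific algorithm, so the sketch below shows only the decompose/scale/recombine skeleton: a hand-rolled one-level 2-D Haar transform with per-band gains. The gains are hypothetical placeholders; deriving the actual correction from the measured per-band noise mismatch is the part this sketch leaves out.

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2-D Haar transform: split x into (LL, LH, HL, HH) sub-bands."""
    lo = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # low-pass along rows
    hi = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # high-pass along rows
    ll = (lo[:, 0::2] + lo[:, 1::2]) / np.sqrt(2.0)
    lh = (lo[:, 0::2] - lo[:, 1::2]) / np.sqrt(2.0)
    hl = (hi[:, 0::2] + hi[:, 1::2]) / np.sqrt(2.0)
    hh = (hi[:, 0::2] - hi[:, 1::2]) / np.sqrt(2.0)
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Exact inverse of haar_dwt2."""
    lo = np.empty((ll.shape[0], ll.shape[1] * 2))
    lo[:, 0::2] = (ll + lh) / np.sqrt(2.0)
    lo[:, 1::2] = (ll - lh) / np.sqrt(2.0)
    hi = np.empty_like(lo)
    hi[:, 0::2] = (hl + hh) / np.sqrt(2.0)
    hi[:, 1::2] = (hl - hh) / np.sqrt(2.0)
    x = np.empty((lo.shape[0] * 2, lo.shape[1]))
    x[0::2] = (lo + hi) / np.sqrt(2.0)
    x[1::2] = (lo - hi) / np.sqrt(2.0)
    return x

def correct_bands(x, gains):
    """Scale each sub-band by its gain (placeholder correction) and recombine."""
    return haar_idwt2(*(g * b for g, b in zip(gains, haar_dwt2(x))))

# Unit gains reproduce the input exactly (perfect reconstruction).
img = np.arange(64, dtype=float).reshape(8, 8)
assert np.allclose(correct_bands(img, (1.0, 1.0, 1.0, 1.0)), img)
```

Because the Haar transform is orthogonal with perfect reconstruction, scaling one band leaves the others untouched, which is what makes a per-band "surgical" correction possible; a real implementation would typically use a wavelet library such as PyWavelets rather than this hand-rolled version.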
