New Work from He Kaiming's Team: Diffusion Models May Have Been Used Wrong

Core Viewpoint - The article discusses a new paper by He Kaiming that challenges the mainstream approach to diffusion models. It argues for a return to the original purpose of denoising: models should directly predict clean images instead of noise [2][5][6].

Summary by Sections

Diffusion Models
- Diffusion models have grown increasingly complex over the years, and most are trained to predict noise rather than the clean image that denoising was originally meant to recover [4][6].
- The new paper emphasizes that since diffusion models are fundamentally denoising models, they should perform denoising directly [5][6].

Manifold Hypothesis
- The article explains the manifold hypothesis: natural images lie on a low-dimensional manifold within the high-dimensional pixel space, while noise spreads across the entire space [7][9].
- Because of this distinction, a neural network that tries to fit high-dimensional noise needs significant model capacity, and training often fails [9].

JiT Architecture
- The proposed architecture, JiT (Just image Transformers), is a simplified model that processes images directly, without relying on complex components such as a VAE or tokenizers [10][11].
- JiT takes raw pixel data, divides it into large patches, and sets the output target to the clean image patches [12].

Experimental Results
- Predicting noise and predicting the original image perform similarly in low-dimensional spaces, but traditional noise-prediction models fail in high-dimensional spaces, while JiT remains robust [14].
- JiT scales well, maintaining high-quality generation even when the input dimension is increased significantly [15][17].
- JiT achieved state-of-the-art FID scores of 1.82 and 1.78 on ImageNet at 256x256 and 512x512 resolution, respectively, without relying on complex components or pre-training [18][19].

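The contrast between predicting noise and predicting the clean image can be made concrete with a small sketch. It assumes a linear blend z = t*x + (1 - t)*eps between a clean sample x and Gaussian noise eps, which is an illustrative schedule chosen here and not stated in the summary; the function names and the patch sizes 16 and 32 are likewise hypothetical. Under that assumption the two targets are algebraically interchangeable given z and t, so the difference lies in what the network is asked to output, not in the information available to it.

```python
import random


def add_noise(x, eps, t):
    """Blend clean data x with noise eps (assumed schedule: t=1 clean, t=0 pure noise)."""
    return [t * xi + (1.0 - t) * ei for xi, ei in zip(x, eps)]


def x_from_eps_pred(z, eps_pred, t):
    """Clean-image estimate implied by a noise prediction."""
    return [(zi - (1.0 - t) * ei) / t for zi, ei in zip(z, eps_pred)]


def eps_from_x_pred(z, x_pred, t):
    """Noise estimate implied by a clean-image prediction."""
    return [(zi - t * xi) / (1.0 - t) for zi, xi in zip(z, x_pred)]


def patch_dim(patch_size, channels=3):
    """Number of raw pixel values in one square patch (one token)."""
    return patch_size * patch_size * channels


def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    return (image_size // patch_size) ** 2


rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(8)]   # stand-in for clean pixels
eps = [rng.gauss(0.0, 1.0) for _ in range(8)]    # Gaussian noise
t = 0.3
z = add_noise(x, eps, t)

# A perfect prediction of either target recovers the other one exactly.
x_rec = x_from_eps_pred(z, eps, t)
eps_rec = eps_from_x_pred(z, x, t)
max_err_x = max(abs(a - b) for a, b in zip(x_rec, x))
max_err_eps = max(abs(a - b) for a, b in zip(eps_rec, eps))
print("round-trip errors:", max_err_x, max_err_eps)

# Per-token dimensionality for two hypothetical patch sizes.
print(patch_dim(16), num_patches(256, 16))
print(patch_dim(32), num_patches(512, 32))
```

With these assumed patch sizes, each token already carries 768 or 3072 raw pixel values. That gives a sense of the dimensionality a noise-predicting network would have to cover, whereas the clean-image target stays near the low-dimensional manifold described above.
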
Research Focus - The primary research direction of He Kaiming includes representation learning, generative models, and their synergistic effects, aiming to build intelligent visual systems that understand the world beyond human perception [21].