New Work from He Kaiming's Team: Diffusion Models May Have Been Used Wrong

Core Insights
- The latest paper challenges the mainstream approach to diffusion models, arguing that the model should predict the clean image directly instead of predicting noise [1][2]
- The research calls for a return to the fundamental purpose of diffusion models, which is denoising, rather than adding further components to the architecture [2][3]

Summary by Sections

Diffusion Models Misuse
- Current mainstream diffusion models usually predict noise, or a mixture of image and noise, rather than the clean image itself [3][5]
- This is a significant obstacle: predicting noise requires enough model capacity to capture high-dimensional noise, which can cause training to fail [5][6]

Proposed Solution: JiT Architecture
- The paper introduces a simplified architecture called JiT (Just image Transformers), which predicts clean images directly and does not rely on components such as a VAE or tokenizer [7][8]
- JiT operates purely on pixel data and treats generation as a denoising problem, which aligns better with the original design of neural networks [6][8]

Experimental Results
- Experiments indicate that traditional noise-prediction models struggle in high-dimensional spaces, while JiT remains robust and achieves superior performance [10]
- JiT scales well, maintaining high-quality generation at larger input dimensions without increasing network width [11][13]
- JiT achieved state-of-the-art FID scores of 1.82 on ImageNet 256x256 and 1.78 on ImageNet 512x512 [13][14]

Author Background
- The lead author, Li Tianhong, is a notable researcher in representation learning and generative models who has previously collaborated with renowned researcher He Kaiming [15][17]
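
To make the contrast between noise prediction and clean-image prediction concrete, here is a minimal NumPy sketch. It is an illustration only, not the paper's implementation: the linear blend `z = t * x + (1 - t) * eps`, the function names, and the toy array shapes are all assumptions chosen for clarity. Under that assumed blend, a clean-image estimate and a noise estimate can each be converted into the other, so the two approaches differ in what the network is asked to output, not in what information the noisy input carries.

```python
import numpy as np


def make_noisy(x, eps, t):
    """Blend clean data x with noise eps (assumed schedule: t=1 clean, t=0 pure noise)."""
    return t * x + (1.0 - t) * eps


def eps_from_x(z, x_hat, t):
    """Noise estimate implied by a clean-image estimate x_hat (requires t < 1)."""
    return (z - t * x_hat) / (1.0 - t)


def x_from_eps(z, eps_hat, t):
    """Clean-image estimate implied by a noise estimate eps_hat (requires t > 0)."""
    return (z - (1.0 - t) * eps_hat) / t


def mse(pred, target):
    """Mean squared error, the usual regression loss for either target."""
    return float(np.mean((pred - target) ** 2))


rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(4, 8))   # toy stand-in for clean images
eps = rng.standard_normal(size=(4, 8))    # Gaussian noise of the same shape
t = 0.5
z = make_noisy(x, eps, t)

# A clean-image predictor is trained against x; a noise predictor against eps.
# Here the "predictions" are the ground-truth values, only to check the algebra.
x_loss = mse(x, x)
eps_loss = mse(eps_from_x(z, x, t), eps)
```

In a real training loop the ground-truth `x` passed as the prediction would be replaced by a network's output on `z` and `t`; the only change between the two approaches is whether the loss compares that output to `x` or to `eps`.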