Diffusion Models Took a Decade-Long Detour! Kaiming He's Major New Work JiT: Returning to the True Essence of "Denoising"

Core Viewpoint
- The article discusses the limitations of current diffusion models as denoisers and introduces JiT (Just image Transformers), a simplified architecture that predicts the clean image directly rather than the noise, which improves performance in high-dimensional pixel space [10][18][34].

Group 1: Diffusion Models and Noise Prediction
- Traditional diffusion models are trained to predict the noise, or a quantity that mixes noise with data, which is fundamentally different from predicting the clean image [6][7].
- The authors argue that denoising, in its essence, means letting the network predict clean data instead of noise, which simplifies the task and improves model performance [18][19].

Group 2: JiT Architecture
- JiT is a minimalist framework that operates directly on pixel patches, without a latent space, a tokenizer, or additional loss functions, making it more efficient [10][25][34].
- Even with high-dimensional patches (up to 3072 dimensions per patch), training and performance remain stable as long as the model predicts the clean image [23][30].

Group 3: Experimental Results
- In ImageNet experiments at several resolutions, JiT-G/16 reached an FID of 1.82, comparable to more complex models that rely on latent spaces [30][31].
- Performance remained stable at higher resolutions (1024×1024), showing that the model can handle high-dimensional data without increased computational cost [32][34].

Group 4: Implications for Future Research
- The JiT framework suggests a potential shift in generative modeling, emphasizing the value of working directly in pixel space for applications in embodied intelligence and scientific computing [34].
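
The contrast drawn in Group 1 and the patch sizes mentioned in Group 2 can be made concrete with a small NumPy sketch. It is illustrative only and is not the authors' implementation. The linear mixing rule `z = t * x + (1 - t) * eps`, the 3-channel RGB patches, and every function name below are assumptions introduced here; the summary does not specify the noise schedule or the loss weighting.

```python
import numpy as np


def patch_dim(patch_size: int, channels: int = 3) -> int:
    """Dimensionality of one flattened pixel patch (assumes square RGB patches)."""
    return patch_size * patch_size * channels


def mix(x: np.ndarray, eps: np.ndarray, t: float) -> np.ndarray:
    """Assumed linear schedule: t = 1 is the clean image, t = 0 is pure noise."""
    return t * x + (1.0 - t) * eps


def eps_from_x_pred(z: np.ndarray, x_pred: np.ndarray, t: float) -> np.ndarray:
    """Recover the implied noise from a clean-image prediction (valid for t < 1)."""
    return (z - t * x_pred) / (1.0 - t)


def x_prediction_loss(x_pred: np.ndarray, x: np.ndarray) -> float:
    """Mean squared error against the clean data, i.e. the network's target is x."""
    return float(np.mean((x_pred - x) ** 2))


# Hypothetical data: 4 patches of 16x16 RGB pixels scaled to [-1, 1].
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(4, patch_dim(16)))
eps = rng.standard_normal(x.shape)
t = 0.7
z = mix(x, eps, t)

# A perfect clean-image predictor has zero loss and implies the exact noise.
print(patch_dim(16), patch_dim(32))   # 768 3072
print(x_prediction_loss(x, x))        # 0.0
print(np.allclose(eps_from_x_pred(z, x, t), eps))
```

Under this assumed schedule, a clean-image prediction and a noise prediction can be converted into one another algebraically, so the choice is about what the network is asked to output rather than about what information is available. The 3072 figure quoted above matches a 32×32 RGB patch.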