JiT (Just image Transformers)
He Kaiming and a sophomore undergraduate upend diffusion-based image generation: throw out multi-step sampling and latent space, and output pixels directly in one step
量子位· 2026-02-02 05:58
Core Viewpoint
- The article discusses the introduction of a new method called Pixel Mean Flow (pMF), which simplifies the architecture of diffusion models by eliminating traditional components like multi-step sampling and latent space, allowing for direct image generation in pixel space [2][3][5].

Group 1: Methodology and Innovations
- pMF achieves significant performance improvements, with an FID score of 2.22 at a resolution of 256×256 and 2.48 at 512×512, marking it as one of the best single-step, non-latent-space diffusion models [4][27].
- Eliminating multi-step sampling and latent space reduces the complexity of the generation process, allowing for a more efficient architecture [6][36].
- The core design of pMF has the network directly output pixel-level denoised images while using a velocity field to compute the loss during training [13][25].

Group 2: Experimental Results
- In experiments, the pMF model outperformed the previous method EPG, which had an FID of 8.82, demonstrating a substantial improvement in image generation quality [27].
- Adding a perceptual loss during training reduced FID from 9.56 to 3.53, showcasing the effectiveness of this approach [26].
- pMF is computationally efficient: the GAN method StyleGAN-XL demands 1574 Gflops per forward pass, while pMF-H/16 requires only 271 Gflops [27].

Group 3: Challenges and Future Directions
- Combining single-step sampling with pixel-space modeling raises the difficulty of architecture design, necessitating advanced solutions to handle the complexities involved [10][12].
- The article emphasizes that as model capabilities improve, the historical compromises of multi-step sampling and latent-space encoding are becoming less necessary, encouraging further exploration of direct, end-to-end generative modeling [36].
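The "output denoised pixels, train with a velocity-space loss" design described above can be sketched in a few lines. This is a toy illustration only, assuming the standard rectified-flow interpolation z_t = (1 − t)·x + t·ε (the paper's actual schedule and MeanFlow objective are more involved); the helper names and the 2-D toy vectors are my own, not from the paper.

```python
# Toy sketch (not the authors' code): the network outputs a clean-image
# estimate x_hat, which is converted into an implied velocity so the loss
# can be computed in velocity space.

def interpolate(x, eps, t):
    """Noisy sample z_t on the straight path from data x to noise eps."""
    return [(1 - t) * xi + t * ei for xi, ei in zip(x, eps)]

def velocity_from_x_pred(z_t, x_hat, t):
    """Convert a predicted clean image into the implied velocity.
    From z_t = (1 - t) * x_hat + t * eps_hat it follows that
    v_hat = eps_hat - x_hat = (z_t - x_hat) / t."""
    return [(zi - xi) / t for zi, xi in zip(z_t, x_hat)]

def velocity_loss(x, eps, x_hat, t):
    """Squared error between implied and true velocity v = eps - x."""
    z_t = interpolate(x, eps, t)
    v_true = [ei - xi for xi, ei in zip(x, eps)]
    v_hat = velocity_from_x_pred(z_t, x_hat, t)
    return sum((a - b) ** 2 for a, b in zip(v_hat, v_true))

# A perfect clean-image prediction gives (up to float error) zero loss:
x, eps = [0.2, -0.5], [1.0, 0.3]
print(velocity_loss(x, eps, x_hat=x, t=0.7))  # ~0.0
```

The point of the conversion is that the regression target lives in velocity space (as in flow matching) even though the network's raw output is a denoised image.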
Diffusion models took a ten-year detour! He Kaiming's major new work JiT: returning to the true essence of "denoising"
自动驾驶之心· 2025-12-01 00:04
Core Viewpoint
- The article discusses the limitations of current diffusion models in denoising tasks and introduces a simplified architecture called JiT (Just image Transformers) that predicts clean images directly rather than noise, leading to improved performance in high-dimensional pixel spaces [10][18][34].

Group 1: Diffusion Models and Noise Prediction
- Traditional diffusion models are designed to predict the noise, or a mixture of image and noise, which is fundamentally different from predicting clean images [6][7].
- The authors argue that the essence of denoising is to let the network predict clean data instead of noise, simplifying the task and improving model performance [18][19].

Group 2: JiT Architecture
- JiT is a minimalist framework that operates directly on pixel patches without relying on latent spaces, tokenizers, or additional loss functions, making it more efficient [10][25][34].
- Even with high-dimensional patches (up to 3072 dimensions), the model maintains stable training and performance by focusing on predicting clean images [23][30].

Group 3: Experimental Results
- In experiments on ImageNet at various resolutions, JiT models achieved impressive FID scores, with JiT-G/16 reaching 1.82, comparable to complex models that utilize latent spaces [30][31].
- The model's performance remained stable even at higher resolutions (1024×1024), showcasing its capability to handle high-dimensional data without increased computational costs [32][34].

Group 4: Implications for Future Research
- The JiT framework suggests a potential shift in generative modeling, emphasizing the importance of working directly in pixel space for applications in embodied intelligence and scientific computing [34].
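The "pixel patches" setup can be made concrete with a minimal sketch. This is my own illustration of generic ViT-style patchification, not JiT's actual code; the image and patch sizes are chosen only to show how the per-token dimension reaches the 3072 figure mentioned above.

```python
# Minimal patchify sketch (an assumed setup, not JiT's implementation):
# an H x W x C image is split into p x p patches, each flattened into a
# p*p*C-dimensional token that the transformer both consumes and predicts.

def patchify(img, p):
    """img: H x W x C nested lists -> list of flattened p*p*C patch tokens."""
    H, W, C = len(img), len(img[0]), len(img[0][0])
    patches = []
    for i in range(0, H, p):
        for j in range(0, W, p):
            patches.append([img[i + di][j + dj][c]
                            for di in range(p) for dj in range(p)
                            for c in range(C)])
    return patches

# A 64x64 RGB image with 32x32 patches yields 4 tokens, each of
# 32 * 32 * 3 = 3072 dimensions -- the high-dimensional regime above.
img = [[[0.0, 0.0, 0.0] for _ in range(64)] for _ in range(64)]
tokens = patchify(img, 32)
print(len(tokens), len(tokens[0]))  # 4 3072
```

Because the model's output target is the clean patch in this same flat space, no tokenizer or latent encoder is needed between the pixels and the transformer.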
New work from He Kaiming's team: diffusion models may have been used the wrong way
36Ke· 2025-11-19 11:22
Core Insights
- The latest paper challenges the mainstream approach of diffusion models, suggesting that instead of predicting noise, models should directly generate clean images [1][2].
- The research emphasizes a return to the fundamental purpose of diffusion models, which is denoising, rather than complicating the architecture with additional components [2][3].

Summary by Sections

Diffusion Models Misuse
- Current mainstream diffusion models often predict noise, or a mixture of image and noise, rather than focusing on generating clean images [3][5].
- This creates a significant challenge: predicting noise requires large model capacity to capture the high-dimensional noise, which can lead to training failures [5][6].

Proposed Solution: JiT Architecture
- The paper introduces a simplified architecture called JiT (Just image Transformers), which directly predicts clean images without relying on complex components like a VAE or tokenizer [7][8].
- JiT operates purely on pixel data, treating the task as a denoising problem, which aligns better with the original design of neural networks [6][8].

Experimental Results
- Experimental results indicate that while traditional noise-prediction models struggle in high-dimensional spaces, JiT remains robust and achieves superior performance [10].
- JiT demonstrates excellent scalability, maintaining high-quality generation even with larger input dimensions without increasing network width [11][13].
- The architecture achieved state-of-the-art FID scores of 1.82 and 1.78 on ImageNet at 256×256 and 512×512, respectively [13][14].

Author Background
- The lead author, Li Tianhong, is a notable researcher in representation learning and generative models, having previously collaborated with the renowned researcher He Kaiming [15][17].
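The scalability claim above (larger inputs without a wider network) has a simple back-of-envelope flavor: if the patch size grows with the resolution, the transformer's token count, and hence its sequence-length cost, stays fixed, and only the per-patch dimension grows. The arithmetic below is my own illustration of that trade-off, not figures from the paper.

```python
# Back-of-envelope sketch (my own arithmetic): number of transformer
# tokens and per-token patch dimension for a square image.

def token_stats(resolution, patch, channels=3):
    """Return (number of tokens, flattened patch dimension)."""
    n_tokens = (resolution // patch) ** 2
    patch_dim = patch * patch * channels
    return n_tokens, patch_dim

# Scaling resolution and patch size together keeps the sequence length
# (and thus attention cost) constant; only the patch dimension grows.
for res, p in [(256, 16), (512, 32), (1024, 64)]:
    print(res, p, token_stats(res, p))  # 256 tokens in every case
```

Under this assumed scheme, a 1024×1024 image with 64×64 patches produces the same 256 tokens as a 256×256 image with 16×16 patches; the extra pixels show up only in the patch dimension (12288 vs. 768).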
New work from He Kaiming's team: diffusion models may have been used the wrong way
量子位· 2025-11-19 09:01
Core Viewpoint
- The article discusses a new paper by He Kaiming that challenges the mainstream approach to diffusion models, advocating a return to their original purpose of denoising: models should directly predict clean images instead of noise [2][5][6].

Summary by Sections

Diffusion Models
- Diffusion models have become increasingly complex over the years, often focusing on predicting noise rather than the clean images they were originally designed to recover [4][6].
- The new paper emphasizes that since diffusion models are fundamentally denoising models, they should directly perform denoising [5][6].

Manifold Hypothesis
- The article explains the manifold hypothesis: natural images lie on a low-dimensional manifold within the high-dimensional pixel space, while noise is distributed across the entire space [7][9].
- This distinction creates challenges when neural networks attempt to fit high-dimensional noise, requiring significant model capacity and often resulting in training failures [9].

JiT Architecture
- The proposed architecture, JiT (Just image Transformers), is a simplified model that processes images directly without relying on complex components like a VAE or tokenizer [10][11].
- JiT takes raw pixel data, divides it into large patches, and sets the output target to be the prediction of clean image patches [12].

Experimental Results
- Experimental results indicate that while noise prediction and clean-image prediction perform similarly in low-dimensional spaces, traditional noise-prediction models fail in high-dimensional spaces, whereas JiT remains robust [14].
- JiT demonstrates excellent scalability, maintaining high-quality generation even when input dimensions are significantly increased [15][17].
- The JiT architecture achieved state-of-the-art FID scores of 1.82 and 1.78 on ImageNet at 256×256 and 512×512, respectively, without relying on complex components or pre-training [18][19].
Research Focus
- The primary research direction of He Kaiming includes representation learning, generative models, and their synergistic effects, aiming to build intelligent visual systems that understand the world beyond human perception [21].
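One simple way to see why noise prediction becomes fragile, offered as my own toy illustration rather than a derivation from the paper: under the linear interpolation z_t = (1 − t)·x + t·ε, recovering the clean image from a predicted noise ε̂ requires dividing by (1 − t), so any error in ε̂ is amplified by a factor of t / (1 − t) at high noise levels, whereas a clean-image prediction is used as-is.

```python
# Toy illustration (my sketch, not from the paper): error amplification
# when a clean image is recovered from a noise prediction near t = 1.

def x_from_eps_pred(z_t, eps_hat, t):
    """Invert z_t = (1 - t) * x + t * eps for x, given a noise estimate."""
    return (z_t - t * eps_hat) / (1 - t)

x, eps, t = 0.5, 1.0, 0.9             # toy scalars; t = 0.9 is high noise
z_t = (1 - t) * x + t * eps
delta = 0.01                          # small error in the noise estimate
x_hat = x_from_eps_pred(z_t, eps + delta, t)
print(abs(x_hat - x) / delta)         # ~9.0, i.e. t / (1 - t)
```

A 1% error in the noise estimate becomes a 9% error in the recovered image at t = 0.9, and the factor diverges as t approaches 1; predicting the clean image sidesteps this inversion entirely.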