Manifold Hypothesis
New work from He Kaiming's team: diffusion models may have been used the wrong way
36Kr· 2025-11-19 11:22
Core Insights
- The latest paper challenges the mainstream approach of diffusion models by suggesting that instead of predicting noise, models should directly generate clean images [1][2]
- The research emphasizes a return to the fundamental purpose of diffusion models, which is denoising, rather than complicating the architecture with additional components [2][3]

Summary by Sections

Diffusion Models Misuse
- Current mainstream diffusion models often predict noise or a mixture of images and noise, rather than focusing on generating clean images [3][5]
- This approach creates a significant challenge, as predicting noise requires a large model capacity to capture the high-dimensional noise, leading to potential training failures [5][6]

Proposed Solution: JiT Architecture
- The paper introduces a simplified architecture called JiT (Just image Transformers), which directly predicts clean images without relying on complex components like a VAE or tokenizer [7][8]
- JiT operates purely from pixel data, treating the task as a denoising problem, which aligns better with the original design of neural networks [6][8]

Experimental Results
- Experimental results indicate that while traditional noise-prediction models struggle in high-dimensional spaces, JiT maintains robustness and achieves superior performance [10]
- JiT demonstrates excellent scalability, maintaining high-quality generation even with larger input dimensions without increasing network width [11][13]
- The architecture achieved state-of-the-art FID scores of 1.82 and 1.78 on ImageNet at 256x256 and 512x512 resolution, respectively [13][14]

Author Background
- The lead author, Li Tianhong, is a notable researcher in representation learning and generative models, having previously collaborated with renowned researcher He Kaiming [15][17]
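To make the contrast between the two prediction targets concrete, here is a minimal PyTorch sketch of a single denoising training step, based only on the summary above. The TinyDenoiser network, the linear corruption schedule, and all sizes are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """A toy stand-in for a denoising network: takes a noisy image and a
    per-sample noise level t, returns a tensor shaped like the image."""
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Conv2d(channels + 1, channels, kernel_size=3, padding=1)

    def forward(self, x_t, t):
        # broadcast the noise level into an extra conditioning channel
        t_map = t.view(-1, 1, 1, 1).expand(-1, 1, *x_t.shape[2:])
        return self.net(torch.cat([x_t, t_map], dim=1))

def training_step(model, x0, target="x"):
    """One denoising training step on a batch of clean images x0.
    target="eps": regress the added noise (the common practice criticized above).
    target="x":   regress the clean image directly (the choice the paper argues for)."""
    t = torch.rand(x0.shape[0], device=x0.device)              # noise level in [0, 1]
    eps = torch.randn_like(x0)                                  # high-dimensional Gaussian noise
    x_t = (1 - t).view(-1, 1, 1, 1) * x0 + t.view(-1, 1, 1, 1) * eps
    pred = model(x_t, t)
    return ((pred - (eps if target == "eps" else x0)) ** 2).mean()

# Usage: loss = training_step(TinyDenoiser(), torch.randn(8, 3, 32, 32), target="x")
```

The two branches differ only in the regression target, yet the argument summarized above is that this choice decides whether a capacity-limited network can cope with high-dimensional inputs.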
New work from He Kaiming's team: diffusion models may have been used the wrong way
量子位· 2025-11-19 09:01
Core Viewpoint
- The article discusses a new paper by He Kaiming's team that challenges the mainstream approach to diffusion models by advocating a return to their original purpose of denoising, suggesting that models should directly predict clean images instead of noise [2][5][6]

Summary by Sections

Diffusion Models
- Diffusion models have become increasingly complex over the years, often focusing on predicting noise rather than the clean images they were originally designed to recover [4][6]
- The new paper emphasizes that since diffusion models are fundamentally denoising models, they should perform denoising directly [5][6]

Manifold Hypothesis
- The article explains the manifold hypothesis: natural images lie on a low-dimensional manifold within the high-dimensional pixel space, while noise spreads across the entire space [7][9]
- This distinction creates difficulty when neural networks attempt to fit high-dimensional noise, requiring significant model capacity and often resulting in training failures [9]

JiT Architecture
- The proposed architecture, JiT (Just image Transformers), is a simplified model that processes images directly without relying on complex components like a VAE or tokenizer [10][11]
- JiT takes raw pixel data, divides it into large patches, and sets the output target to predict the clean image patches [12]

Experimental Results
- Experimental results indicate that predicting noise and predicting the original image perform similarly in low-dimensional spaces, but traditional noise-prediction models fail in high-dimensional spaces while JiT remains robust [14]
- JiT demonstrates excellent scalability, maintaining high-quality generation even when input dimensions are significantly increased [15][17]
- The JiT architecture achieved state-of-the-art FID scores of 1.82 and 1.78 on ImageNet at 256x256 and 512x512 resolution, respectively, without relying on complex components or pre-training [18][19]

Research Focus
- He Kaiming's primary research directions include representation learning, generative models, and their synergy, aiming to build intelligent visual systems that understand the world beyond human perception [21]
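The step of dividing raw pixels into large non-overlapping patches and predicting the clean patches can be illustrated with a short patchify/unpatchify sketch. The 16-pixel patch size and 256x256 image shape below are assumptions chosen for illustration, not the paper's settings.

```python
import torch

def patchify(images, patch=16):
    """Split images (B, C, H, W) into non-overlapping patches (B, N, C*patch*patch).
    Each patch token already lives in a fairly high-dimensional pixel space
    (16*16*3 = 768 dimensions here), the regime where fitting noise gets hard."""
    B, C, H, W = images.shape
    x = images.unfold(2, patch, patch).unfold(3, patch, patch)   # (B, C, H/p, W/p, p, p)
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)
    return x

def unpatchify(tokens, patch=16, channels=3, height=256, width=256):
    """Inverse of patchify: reassemble predicted clean patches into an image."""
    B, N, D = tokens.shape
    h, w = height // patch, width // patch
    x = tokens.reshape(B, h, w, channels, patch, patch)
    x = x.permute(0, 3, 1, 4, 2, 5).reshape(B, channels, height, width)
    return x

# Usage: unpatchify(patchify(torch.randn(2, 3, 256, 256))).shape -> (2, 3, 256, 256)
```

With larger patches, the per-token dimensionality grows quadratically, which is exactly the high-dimensional setting where the manifold-hypothesis argument above says noise prediction breaks down.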
He Kaiming's major new work: Just image Transformers bring denoising models back to basics
机器之心· 2025-11-19 02:09
Core Insights
- The article discusses the relationship between image generation and denoising diffusion models, emphasizing that high-quality image generation relies on diffusion models [1]
- It questions whether denoising diffusion models truly perform "denoising," highlighting the field's shift in focus from predicting clean images to predicting the noise itself [2][5]
- The research proposes a return to directly predicting clean data, which allows networks with seemingly insufficient capacity to operate effectively in high-dimensional spaces [7][8]

Group 1: Denoising Diffusion Models
- Denoising diffusion models do not function in the classical sense of "denoising," as they predict noise or noisy quantities instead of clean images [5][6]
- The manifold assumption suggests that natural images lie on a low-dimensional manifold while noise is off-manifold, indicating a fundamental difference between predicting clean data and predicting noisy data [4][6]
- The study introduces a model that directly predicts clean data, which could enhance the performance of diffusion models [7]

Group 2: Just Image Transformers (JiT)
- The paper presents the "Just image Transformers" (JiT) architecture, which uses simple large-patch pixel-level transformers to build powerful generative models without tokenizers or pre-training [11]
- JiT achieves competitive pixel-space image generation on ImageNet, with FID scores of 1.82 at 256x256 resolution and 1.78 at 512x512 resolution [12]
- The architecture is designed to be self-consistent and applicable across various fields involving natural data, such as protein and molecular data [12]

Group 3: Model Performance and Design
- The JiT architecture operates by dividing images into non-overlapping patches, allowing effective processing of high-dimensional data [14]
- The study finds that model performance depends strongly on the prediction target, with x-prediction yielding the best results across various loss functions [21][23]
- Increasing the number of hidden units is not necessary for good performance, as demonstrated by JiT operating effectively at higher resolutions without additional modifications [28][31]

Group 4: Scalability and Generalization
- The research emphasizes the scalability of the JiT model, showing that it maintains similar computational costs across different resolutions while achieving strong performance [42][44]
- The findings suggest that the design of the network can be decoupled from the observed data dimension, allowing flexibility in the model architecture [31]
- Introducing bottleneck structures into the network design can enhance performance by encouraging the learning of intrinsic low-dimensional representations [33]

Group 5: Conclusion and Future Implications
- The study concludes that the effectiveness of x-prediction is a natural outcome of the limitations of neural networks in modeling noise rather than data [51]
- The proposed "Diffusion + Transformer" paradigm has the potential to serve as a foundational method in various fields, particularly where obtaining a tokenizer is challenging [52]
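As a rough illustration of the "large-patch pixel-level transformer with a bottleneck" described above, here is a minimal PyTorch sketch. Every layer size, the depth, the learned positional embedding, and the omission of noise-level conditioning are simplifications assumed for brevity, not the paper's actual design.

```python
import torch
import torch.nn as nn

class JiTStylePatchPredictor(nn.Module):
    """Minimal pixel-space transformer in the spirit of JiT: large non-overlapping
    patches in, predicted clean patches out, with no VAE or tokenizer. The small
    bottleneck projection mirrors the idea of forcing an intrinsically
    low-dimensional representation of each high-dimensional pixel patch."""

    def __init__(self, patch=32, channels=3, dim=512, bottleneck=128,
                 depth=6, heads=8, num_patches=64):
        super().__init__()
        patch_dim = channels * patch * patch                  # e.g. 3*32*32 = 3072
        self.embed = nn.Sequential(
            nn.Linear(patch_dim, bottleneck),                 # compress raw pixels
            nn.Linear(bottleneck, dim),                       # expand to model width
        )
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, patch_dim)                 # predict clean pixel patches

    def forward(self, noisy_patches):
        # noisy_patches: (B, num_patches, patch_dim), taken straight from pixel space
        h = self.embed(noisy_patches) + self.pos
        h = self.blocks(h)
        return self.head(h)

# Usage: JiTStylePatchPredictor()(torch.randn(2, 64, 3 * 32 * 32)).shape -> (2, 64, 3072)
```

Note that the model width (512 here) stays fixed even though each patch token carries 3072 pixel values; the decoupling of network width from observed dimension is the property the scalability results above emphasize.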
Universe-scale compression: the boundaries of the Scaling Law, Platonic representations converging where matter and information meet, solving the P vs NP problem, the Simulation hypothesis...
AI科技大本营· 2025-11-13 05:59
Core Viewpoint
- The article discusses the successful implementation of scientific multitask learning at a cosmic scale through the BigBang-Proton project, proposing the concept of Universe Compression, which aims to pre-train models using the entirety of the universe as a unified entity [1][7]

Group 1: Scientific Multitask Learning
- Scientific multitask learning is essential for achieving Universe Compression, as it allows for the integration of highly heterogeneous datasets across various disciplines, which traditional models struggle to converge [2][4]
- The BigBang-Proton project demonstrates that with the right representation and architecture, diverse scientific data can converge, indicating the potential for transfer learning across scales and structures [2][4]

Group 2: Scaling Law and Platonic Representation
- The Scaling Law observed in language models can extend beyond language to encompass physical realities, suggesting that the limits of these models may align with the fundamental laws of the universe [5][6]
- The Platonic Representation Hypothesis posits that AI models trained on diverse datasets tend to converge on a statistical representation of reality, which aligns with the findings from the BigBang-Proton project [6][7]

Group 3: Universe Compression Plan
- The proposed Universe Compression plan involves creating a unified spacetime framework that integrates all scientific knowledge and experimental data across scales, structures, and disciplines [25][26]
- This approach aims to reveal the underlying homogeneity of structures in the universe, facilitating deep analogies across various scientific fields [26]

Group 4: Next Steps and Hypotheses
- The company proposes a second hypothesis that suggests reconstructing any physical structure in the universe through next-word prediction, enhancing the model's ability to simulate complex physical systems [28]
- This hypothesis aims to integrate embodied-intelligence capabilities, improving generalization in complex mechanical systems like aircraft and vehicles [28]