Kaiming He's team's new work: Diffusion models may have been used wrong
量子位· 2025-11-19 09:01
Kaiming He has once again gone back to basics. His latest paper overturns the mainstream recipe for diffusion models: instead of having the model predict noise, it has the model paint the clean image directly.

Wen Le | 量子位 QbitAI

If you are familiar with Kaiming He's work, you will recognize this as his typical path to innovation: rather than proposing an ever more complex architecture, he strips the problem back to its original form and lets the model do the one thing it does best.

In fact, in the years since diffusion models took off, the architectures have only grown more complicated: predicting noise, predicting velocity, aligning latents, stacking tokenizers, adding VAEs, adding perceptual losses... Yet everyone seems to have forgotten that a diffusion model is, at heart, a denoising model. The new paper puts this back on the table: if it is called a denoising model, why not just denoise?

And so, after ResNet, MAE, and others, Kaiming He's team offers another "simplicity wins" conclusion: diffusion models should return to their starting point and predict the image directly.

Diffusion models may have been used wrong

Today's mainstream diffusion models are designed and named around "denoising", yet during training the target the neural network predicts is usually not the clean image but the noise, or a velocity field that mixes image and noise. In practice, predicting noise and predicting the clean image are very different tasks. According to the manifold hypothesis, natural images lie on a low-dimensional manifold inside the high-dimensional pixel space: they are clean, structured data. Noise, by contrast, is spread uniformly over the entire high-dimensional space and has no such low-dimensional structure. Put simply, ...
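To make the distinction concrete, below is a minimal sketch of the three regression targets a diffusion network is commonly trained to predict for a noisy sample x_t = α·x0 + σ·ε. The helper name `diffusion_training_targets` and the specific α, σ values are illustrative assumptions, not code or settings from the paper.

```python
import torch

def diffusion_training_targets(x0: torch.Tensor, alpha: float, sigma: float):
    """Build one noisy sample x_t = alpha * x0 + sigma * eps and the three
    common regression targets a diffusion network may be trained to predict."""
    eps = torch.randn_like(x0)            # Gaussian noise, spread over the whole space
    x_t = alpha * x0 + sigma * eps        # noisy input that is fed to the network
    targets = {
        "eps": eps,                       # epsilon-prediction: regress the noise
        "v": alpha * eps - sigma * x0,    # v-prediction: a velocity mixing image and noise
        "x": x0,                          # x-prediction: regress the clean image itself
    }
    return x_t, targets

# Usage sketch with one fixed noise level (alpha^2 + sigma^2 = 1).
x0 = torch.randn(8, 3, 256, 256)          # stand-in for a batch of clean images
x_t, targets = diffusion_training_targets(x0, alpha=0.8, sigma=0.6)
# pred = network(x_t, t); loss = ((pred - targets["x"]) ** 2).mean()   # "just denoise"
```

The three targets are algebraically related, but only "x" lies on the image manifold; "eps" and "v" force the network to model off-manifold noise, which is the point the article raises.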
Kaiming He's major new work: Just image Transformers brings denoising models back to basics
机器之心· 2025-11-19 02:09
Core Insights
- The article discusses the relationship between image generation and denoising diffusion models, emphasizing that high-quality image generation relies on diffusion models [1]
- It questions whether denoising diffusion models truly achieve "denoising," highlighting that the prevailing practice has the network predict the noise itself rather than the clean image [2][5]
- The research proposes a return to directly predicting clean data, which allows networks with seemingly insufficient capacity to operate effectively in high-dimensional spaces [7][8]

Group 1: Denoising Diffusion Models
- Denoising diffusion models do not function in the classical sense of "denoising," as they predict noise or noisy quantities instead of clean images [5][6]
- The manifold assumption suggests that natural images exist on a low-dimensional manifold, while noise is off-manifold, indicating a fundamental difference between predicting clean data and predicting noisy quantities [4][6]
- The study introduces a model that directly predicts clean data, which could enhance the performance of diffusion models [7]

Group 2: Just Image Transformers (JiT)
- The paper presents the "Just image Transformers (JiT)" architecture, which uses simple large-patch pixel-level Transformers to build powerful generative models without tokenizers or pre-training [11]
- JiT achieves competitive pixel-space image generation on ImageNet, with FID scores of 1.82 at 256x256 resolution and 1.78 at 512x512 resolution [12]
- The architecture is designed to be self-contained and applicable across various fields involving natural data, such as protein and molecular data [12]

Group 3: Model Performance and Design
- The JiT architecture divides images into non-overlapping patches, allowing effective processing of high-dimensional data [14]
- The study finds that model performance depends heavily on the prediction target, with x-prediction (directly predicting the clean image) yielding the best results across various loss functions [21][23]
- Increasing the number of hidden units is not necessary for model performance, as demonstrated by JiT's effective operation at higher resolutions without additional modifications [28][31]

Group 4: Scalability and Generalization
- The research emphasizes the scalability of the JiT model, showing that it maintains similar computational costs across different resolutions while achieving strong performance [42][44]
- The findings suggest that the network design can be decoupled from the observed data dimensions, allowing flexibility in model architecture [31]
- Introducing bottleneck structures in the network design can enhance performance by encouraging the learning of intrinsic low-dimensional representations [33]

Group 5: Conclusion and Future Implications
- The study concludes that the advantage of x-prediction follows naturally from the fact that neural networks are limited in modeling noise rather than data [51]
- The proposed "Diffusion + Transformer" paradigm has the potential to serve as a foundational method in various fields, particularly where obtaining tokenizers is challenging [52]
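As a rough illustration of the "plain Transformer on large pixel patches, trained to predict the clean image" idea, here is a toy sketch. `JiTSketch`, its hyperparameters, and the omission of noise-level and class conditioning are assumptions made for brevity; this is not the authors' implementation. The linear patch embedding maps each large raw-pixel patch into the token dimension, and when that dimension is smaller than the patch it plays the role of the bottleneck mentioned above.

```python
import torch
import torch.nn as nn

class JiTSketch(nn.Module):
    """Toy large-patch, pixel-space Transformer trained to predict the clean image.

    Hypothetical illustration only: patchify raw pixels -> Transformer encoder ->
    unpatchify. The real JiT additionally conditions on the noise level and the
    class label, which is omitted here.
    """

    def __init__(self, image_size=64, patch_size=16, dim=128, depth=2, heads=4):
        super().__init__()
        self.patch_size = patch_size
        patch_dim = 3 * patch_size * patch_size            # raw RGB pixels per patch (768 for 16x16)
        num_patches = (image_size // patch_size) ** 2
        # Linear patch embedding straight from pixels: no tokenizer, no VAE.
        # When dim < patch_dim this projection also acts as a per-patch bottleneck.
        self.to_tokens = nn.Linear(patch_dim, dim)
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.to_pixels = nn.Linear(dim, patch_dim)          # project tokens back to pixel patches

    def patchify(self, x):
        b, c, h, w = x.shape
        p = self.patch_size
        x = x.reshape(b, c, h // p, p, w // p, p).permute(0, 2, 4, 1, 3, 5)
        return x.reshape(b, (h // p) * (w // p), c * p * p)

    def unpatchify(self, tokens, h, w):
        b, p = tokens.shape[0], self.patch_size
        x = tokens.reshape(b, h // p, w // p, 3, p, p).permute(0, 3, 1, 4, 2, 5)
        return x.reshape(b, 3, h, w)

    def forward(self, x_noisy):
        _, _, h, w = x_noisy.shape
        tokens = self.to_tokens(self.patchify(x_noisy)) + self.pos
        return self.unpatchify(self.to_pixels(self.blocks(tokens)), h, w)  # predicted clean image

# x-prediction training step: regress the clean image from its noisy version.
model = JiTSketch()
x0 = torch.randn(2, 3, 64, 64)              # stand-in clean images
x_t = 0.8 * x0 + 0.6 * torch.randn_like(x0)  # one fixed noise level for brevity
loss = ((model(x_t) - x0) ** 2).mean()       # target is x0, not the noise
loss.backward()
```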
Universe-scale compression: the boundary of the Scaling Law, Platonic representations converging where matter and information meet, solving P vs NP, the Simulation hypothesis...
AI科技大本营· 2025-11-13 05:59
Core Viewpoint
- The article discusses the successful implementation of scientific multitask learning at a cosmic scale through the BigBang-Proton project, proposing the concept of Universe Compression, which aims to pre-train models using the entirety of the universe as a unified entity [1][7]

Group 1: Scientific Multitask Learning
- Scientific multitask learning is essential for achieving Universe Compression, as it allows for the integration of highly heterogeneous datasets across various disciplines, which traditional models struggle to converge [2][4]
- The BigBang-Proton project demonstrates that with the right representation and architecture, diverse scientific data can converge, indicating the potential for transfer learning across scales and structures [2][4]

Group 2: Scaling Law and Platonic Representation
- The Scaling Law observed in language models can extend beyond language to encompass physical realities, suggesting that the limits of these models may align with the fundamental laws of the universe [5][6]
- The Platonic Representation Hypothesis posits that AI models trained on diverse datasets tend to converge on a statistical representation of reality, which aligns with the findings from the BigBang-Proton project [6][7]

Group 3: Universe Compression Plan
- The proposed Universe Compression plan involves creating a unified spacetime framework that integrates all scientific knowledge and experimental data across scales, structures, and disciplines [25][26]
- This approach aims to reveal the underlying homogeneity of structures in the universe, facilitating deep analogies across various scientific fields [26]

Group 4: Next Steps and Hypotheses
- The company proposes a second hypothesis that suggests reconstructing any physical structure in the universe through next-word prediction, enhancing the model's ability to simulate complex physical systems [28]
- This hypothesis aims to integrate embodied-intelligence capabilities, improving generalization in complex mechanical systems like aircraft and vehicles [28]