Just image Transformers (JiT)
何恺明 (Kaiming He)'s major new work: Just image Transformers take denoising models back to basics
机器之心 · 2025-11-19 02:09
Core Insights
- The article examines the relationship between image generation and denoising diffusion models, emphasizing that today's high-quality image generation relies on diffusion models [1]
- It asks whether denoising diffusion models truly perform "denoising," noting that the field has shifted from predicting the clean image to predicting the noise itself [2][5]
- The research proposes returning to direct prediction of clean data, which lets networks with seemingly insufficient capacity operate effectively in high-dimensional spaces [7][8]

Group 1: Denoising Diffusion Models
- Denoising diffusion models do not denoise in the classical sense: they predict noise or noise-related quantities rather than the clean image [5][6]
- Under the manifold assumption, natural images lie on a low-dimensional manifold while noise is off-manifold, so predicting clean data and predicting noise are fundamentally different tasks [4][6]
- The study introduces a model that directly predicts clean data, which can improve the performance of diffusion models [7]

Group 2: Just Image Transformers (JiT)
- The paper presents the "Just image Transformers (JiT)" architecture, which uses plain large-patch, pixel-level Transformers to build strong generative models without tokenizers or pre-training [11]
- JiT achieves competitive pixel-space image generation on ImageNet, with FID scores of 1.82 at 256x256 resolution and 1.78 at 512x512 resolution [12]
- The architecture is self-contained and applicable to other domains of natural data, such as protein and molecular data [12]

Group 3: Model Performance and Design
- JiT divides images into non-overlapping patches, allowing high-dimensional pixel data to be processed effectively [14]
- The choice of prediction target strongly affects performance: x-prediction (predicting the clean data) yields the best results across various loss functions, as illustrated in the first sketch after this digest [21][23]
- Increasing the network's hidden width is not required to match the patch dimension: JiT operates effectively at higher resolutions without additional modifications [28][31]

Group 4: Scalability and Generalization
- JiT keeps computational cost roughly constant across resolutions while maintaining strong performance, as illustrated in the second sketch after this digest [42][44]
- The network design can be decoupled from the observed data dimension, giving flexibility in the model architecture [31]
- Introducing bottleneck structures in the patch embedding can improve performance by encouraging the network to learn an intrinsic low-dimensional representation [33]

Group 5: Conclusion and Future Implications
- The study concludes that the advantage of x-prediction follows naturally from the fact that neural networks are better suited to modeling data than to modeling noise [51]
- The proposed "Diffusion + Transformer" paradigm could serve as a foundational method across fields, especially where tokenizers are hard to obtain [52]
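To make the x-prediction versus noise-prediction distinction discussed above concrete, here is a minimal training-loss sketch. It assumes a generic denoiser `net` and a simple linear interpolation between clean data and Gaussian noise; the function and variable names are illustrative and not taken from the paper.

```python
import torch

def training_loss(net, x, prediction="x"):
    """One toy training step for a denoising diffusion / flow-style model on clean images x.

    prediction="x"   : the network outputs an estimate of the clean image
                       (the on-manifold target the digest calls x-prediction).
    prediction="eps" : the network outputs an estimate of the noise
                       (an off-manifold target).
    The two targets are linked through z_t = (1 - t) * x + t * eps, so either can be
    recovered from the other; what differs is what the network itself must represent.
    """
    b = x.shape[0]
    t = torch.rand(b, device=x.device).view(b, 1, 1, 1)   # noise level per sample
    eps = torch.randn_like(x)                              # Gaussian noise (off-manifold)
    z_t = (1.0 - t) * x + t * eps                          # noisy interpolant fed to the net

    out = net(z_t, t.flatten())
    target = x if prediction == "x" else eps
    return torch.mean((out - target) ** 2)
```

In low dimensions the two parameterizations are largely interchangeable; the digest's point is that when each patch's pixel dimension far exceeds the network's hidden width, asking the network to output the on-manifold clean data is what keeps it workable.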
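The following sketch illustrates the large-patch, pixel-level tokenization and the optional bottleneck mentioned in Groups 3 and 4. It uses standard PyTorch layers; the class name, widths, and bottleneck size are placeholder assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PixelPatchEmbed(nn.Module):
    """Splits raw pixels into non-overlapping patches and projects each patch to the
    Transformer width. With a bottleneck, the very high-dimensional patch is first
    compressed, encouraging an intrinsic low-dimensional code."""
    def __init__(self, patch=32, channels=3, width=768, bottleneck=None):
        super().__init__()
        patch_dim = channels * patch * patch          # e.g. 3*32*32 = 3072, larger than width
        if bottleneck is None:
            self.proj = nn.Linear(patch_dim, width)
        else:
            self.proj = nn.Sequential(
                nn.Linear(patch_dim, bottleneck),     # compress below the hidden width
                nn.Linear(bottleneck, width),
            )
        self.patch = patch

    def forward(self, imgs):                          # imgs: (B, C, H, W)
        b, c, h, w = imgs.shape
        p = self.patch
        x = imgs.reshape(b, c, h // p, p, w // p, p)
        x = x.permute(0, 2, 4, 1, 3, 5).reshape(b, (h // p) * (w // p), c * p * p)
        return self.proj(x)                           # (B, num_tokens, width)

# Token count, and hence Transformer cost, stays fixed if the patch grows with the image:
# 256x256 with 16x16 patches and 512x512 with 32x32 patches both give 16*16 = 256 tokens.
emb256 = PixelPatchEmbed(patch=16)
emb512 = PixelPatchEmbed(patch=32, bottleneck=128)
print(emb256(torch.randn(2, 3, 256, 256)).shape)      # torch.Size([2, 256, 768])
print(emb512(torch.randn(2, 3, 512, 512)).shape)      # torch.Size([2, 256, 768])
```

This is why the sequence length, and so the Transformer's cost, can stay roughly constant across resolutions: scaling the patch size with the image keeps the token count fixed, and the bottleneck keeps the per-patch projection from having to widen with the raw pixel dimension.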