Unifying multimodality from vision! Latest research from Yan Shuicheng's team: stop stuffing image encoder-decoders into the LLM | ICLR'2026
QbitAI · 2026-03-10 08:00

Core Viewpoint
- The article discusses the emergence of Muddit, a new unified discrete diffusion model that challenges the traditional language-first approach in multi-modal AI by proposing a visual prior as the foundation for generating both text and images [2][5][37].

Group 1: Paradigm Shift in AI Models
- The AI industry has predominantly focused on a pre-training paradigm centered on next-token prediction, which has been successful but is now being questioned [3][4].
- NVIDIA researcher Jim Fan suggests that AI is undergoing a second paradigm shift toward world modeling, emphasizing the prediction of physical states rather than just token relationships [5][7].
- This shift prompts a reevaluation of whether future foundation models should continue to prioritize language [7][17].

Group 2: Limitations of Current Multi-Modal Models
- Many existing unified models remain fundamentally language-first: visual inputs are routed through a language backbone, leading to inefficiencies in image generation and reasoning [8][10].
- The article disputes that these models achieve true unification, since they often rely on separate mechanisms for text generation and image generation [11][19].

Group 3: Muddit's Approach
- Muddit redefines the foundation of unified models by starting from a strong visual prior rather than a language prior, allowing a more natural integration of text and image generation [15][17].
- The model employs a fully discrete diffusion framework that treats both text and images as discrete tokens, enabling a single shared generative process across tasks (see the sketch after this summary) [19][20].
- Muddit's architecture allows task switching without changing internal mechanisms, emphasizing a unified generative syntax [22][23].

Group 4: Performance and Efficiency
- Muddit is competitive in text-to-image generation, reaching an overall score of 0.61 on the GenEval benchmark and surpassing several existing models [27].
- The model also performs well on image understanding and image-to-text tasks, with strong scores across several benchmarks, indicating the effectiveness of its unified training [28][29].
- Notably, Muddit achieves these results with a relatively small training set, underscoring the efficiency of its visual prior and unified modeling approach [30].

Group 5: Future Implications
- The article posits that Muddit represents a potential shift in multi-modal foundation models, away from language-centric designs and toward designs that better reflect the structure of the world [33][34].
- Such a shift could influence future developments in video, 3D modeling, and embodied intelligence, where understanding spatial and dynamic relationships is crucial [33][35].
- Ultimately, Muddit challenges the prevailing assumption that multi-modal models must be built on language, prompting a rethink of what should serve as the foundational basis for these systems [40][41].
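To make Group 3 concrete, below is a minimal sketch of what "a shared generative process over discrete tokens" can look like: parallel mask-and-predict decoding over one token space that covers both text and image codes. Everything here is hypothetical and illustrative, not Muddit's released code: the codebook split (TEXT_VOCAB / IMAGE_VOCAB), the ToyDenoiser stand-in transformer, and the linear unmasking schedule are all assumptions for the sketch.

```python
# Minimal sketch of unified discrete-diffusion (mask-and-predict) decoding.
# Hypothetical throughout: codebook sizes, ToyDenoiser, and the unmasking
# schedule are illustrative, not Muddit's actual architecture.
import math
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB = 4096, 8192       # illustrative codebook sizes
VOCAB = TEXT_VOCAB + IMAGE_VOCAB + 1       # +1 for the [MASK] token
MASK_ID = VOCAB - 1

class ToyDenoiser(nn.Module):
    """Stand-in bidirectional transformer: token IDs in, per-position logits out."""
    def __init__(self, dim=256, layers=4):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(dim, VOCAB)

    def forward(self, ids):
        return self.head(self.backbone(self.emb(ids)))

@torch.no_grad()
def sample(model, cond_ids, target_len, steps=8):
    """One shared loop for every task: the target segment starts fully
    [MASK]ed; each step the denoiser predicts all masked positions in
    parallel, the most confident predictions are committed, and the rest
    stay masked for the next step."""
    B = cond_ids.size(0)
    target = torch.full((B, target_len), MASK_ID)
    for step in range(steps):
        still_masked = target.eq(MASK_ID)
        n_masked = int(still_masked[0].sum())
        if n_masked == 0:
            break
        logits = model(torch.cat([cond_ids, target], dim=1))[:, -target_len:]
        logits[..., MASK_ID] = float("-inf")            # never emit [MASK] itself
        conf, pred = logits.softmax(-1).max(-1)
        conf = conf.masked_fill(~still_masked, -1.0)    # skip committed slots
        k = math.ceil(n_masked / (steps - step))        # linear unmasking schedule
        idx = conf.topk(k, dim=1).indices
        target.scatter_(1, idx, pred.gather(1, idx))    # commit top-k predictions
    return target

# Same loop, opposite directions (an untrained stand-in emits arbitrary IDs):
model = ToyDenoiser().eval()
prompt = torch.randint(0, TEXT_VOCAB, (1, 16))          # "text" condition
image_tokens = sample(model, prompt, target_len=64)     # text -> image tokens
caption = sample(model, image_tokens, target_len=16)    # image -> text tokens
```

The design point mirrored here is that the "task" is just which segment starts out masked; the denoiser and the sampling loop are identical across modalities, which is what the summary means by a unified generative syntax with task switching that requires no change to internal mechanisms.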
