Challenging Autoregression: Diffusion Models Are Rewriting the Paradigm for Next-Generation General-Purpose Models
机器之心· 2025-06-04 01:59
Core Viewpoint
- The article discusses advances in diffusion language models (dLLMs), focusing on Google's Gemini Diffusion and its implications for AI development, and highlighting its speed and performance gains over traditional autoregressive models [1][8][35].

Group 1: Gemini Diffusion and Its Features
- Gemini Diffusion is noted for its generation speed, reported as five times faster than previous models, and for handling programming tasks effectively [2][8].
- The underlying diffusion mechanism allows rapid iteration and error correction during generation, distinguishing it from autoregressive models [2][3].
- Gemini Diffusion's sampling speed can reach 1,479 tokens per second, showcasing its potential across various benchmarks [8][9].

Group 2: Development of Diffusion Language Models
- Before Gemini Diffusion, several research teams explored the feasibility of diffusion-based LLMs, including Stanford's Diffusion-LM and Fudan University's DiffusionBERT [3][4].
- The introduction of LLaDA, the first 8-billion-parameter diffusion language model, marked a significant milestone, achieving performance comparable to LLaMA 3 [4][21].
- Follow-up models such as d1 and LaViDa have since emerged, further establishing LLaDA as a foundational model in dLLM research [20][21].

Group 3: Multimodal Diffusion Language Models
- Diffusion multimodal language models (dMLLMs) are emerging, with LLaDA-V and MMaDA as prominent examples that integrate visual and language processing capabilities [10][31].
- LLaDA-V combines visual instruction fine-tuning with the diffusion mechanism, demonstrating strong performance on multimodal understanding tasks [26][27].
- MMaDA introduces innovations in text reasoning and multimodal understanding, solidifying its position as a leading research result in the dMLLM space [31][32].
Group 4: Future Directions and Implications
- The article frames the shift from autoregressive to diffusion models as a significant paradigm change in AI, with broader implications for future research and applications [35][36].
- The ongoing evolution of models like LLaDA and Gemini Diffusion points to a growing ecosystem around dLLMs and dMLLMs, with potential applications extending into quantum computing [35][36].
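As a rough illustration of the parallel, iterative denoising described above, here is a toy sketch of masked-diffusion sampling. The predictor, vocabulary, and confidence-based remasking schedule are illustrative assumptions for this sketch, not Gemini Diffusion's or LLaDA's actual implementation:

```python
import random

MASK = "<mask>"

def toy_predict(seq):
    # Stand-in for the denoiser: proposes a token and a confidence
    # score for every still-masked position. A real dLLM would run a
    # full transformer pass here; this toy emits fixed placeholders.
    vocab = ["the", "cat", "sat", "on", "mat"]
    return {i: (vocab[i % len(vocab)], random.random())
            for i, t in enumerate(seq) if t == MASK}

def diffusion_sample(length=5, steps=3):
    seq = [MASK] * length
    for step in range(steps):
        proposals = toy_predict(seq)
        if not proposals:
            break
        # Commit only the most confident fraction each step and leave
        # the rest masked, so later steps can revise them -- the kind
        # of mid-generation error correction that autoregressive
        # decoding lacks, since AR models never revisit emitted tokens.
        k = max(1, len(proposals) * (step + 1) // steps)
        confident = sorted(proposals.items(),
                           key=lambda kv: kv[1][1], reverse=True)[:k]
        for i, (tok, _) in confident:
            seq[i] = tok
    return seq

print(diffusion_sample())
```

Because every masked position is predicted in parallel at each step, a handful of denoising passes can fill a whole sequence, which is what enables throughput figures like the ones cited above.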
More Versatile Than Gemini Diffusion! MMaDA, the First Multimodal Diffusion Large Language Model, Is Released, Delivering Both Strong Reasoning and High Controllability
机器之心· 2025-05-22 08:46
Core Insights
- The article discusses advances in large language models (LLMs) and their application to multimodal tasks, highlighting challenges in architectural uniformity and post-training methods [1].
- DeepMind's Gemini Diffusion demonstrated the potential of diffusion models for text modeling, paving the way for MMaDA, which integrates text reasoning, multimodal understanding, and image generation into a single unified model [1][4].

Group 1: Model Development
- MMaDA is the first systematic exploration of a diffusion architecture for multimodal foundation models, achieving breakthroughs through three core technologies [1].
- The team has open-sourced the training code, inference code, and weights for MMaDA-8B-Base, with additional weights planned for release [4].

Group 2: Performance Metrics
- MMaDA achieved state-of-the-art (SOTA) performance across three major tasks:
  - Text reasoning: an MMLU accuracy of 68.4%, surpassing models such as LLaMA-3-8B and Qwen2-7B [7].
  - Multimodal understanding: matching specialized models on benchmarks such as POPE and VQAv2 [7].
  - Image generation: a CLIP Score of 32.46, with significantly improved accuracy on cultural-knowledge generation tasks [7].

Group 3: Cross-Task Synergy
- During mixed training phases, text-reasoning and image-generation metrics improved together, indicating strong cross-task synergy [9].
- MMaDA supports three types of cross-modal completion tasks, showcasing its flexibility and generalization in complex generation and reasoning tasks [11][13].

Group 4: Key Technical Innovations
- MMaDA's architecture unifies the text and image generation processes within a single diffusion framework, eliminating the complexity of traditional mixed architectures [15].
- The model employs a mixed long-chain-of-thought fine-tuning strategy to handle complex tasks, enhancing its reasoning capabilities [15][19].
- A unified inference format ensures the model outputs cross-modal reasoning steps before generating answers [18].

Group 5: Training Strategies
- The model uses structured noise strategies and diversified reward modeling to enhance performance across different tasks [19][21].
- The UniGRPO algorithm showed a 40% improvement in convergence speed during training compared to baseline methods [21].
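The summary does not spell out UniGRPO's internals (structured noise and diversified rewards are named but not defined), so as a hedged sketch, here is the group-relative advantage normalization that GRPO-family algorithms share: several responses to the same prompt are scored by a reward model, and each reward is normalized against the group's statistics instead of a learned value network. The function name and reward values are hypothetical:

```python
import statistics

def group_relative_advantages(rewards):
    # GRPO-family methods sample a *group* of responses per prompt,
    # then compute each response's advantage relative to the group
    # mean and standard deviation -- no separate critic needed.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard identical rewards
    return [(r - mean) / std for r in rewards]

# Four sampled answers to one prompt, scored by a reward model:
advs = group_relative_advantages([0.2, 0.8, 0.5, 0.5])
print([round(a, 2) for a in advs])  # -> [-1.41, 1.41, 0.0, 0.0]
```

Responses scoring above the group mean get positive advantages and are reinforced; those below are suppressed. How UniGRPO adapts this to diffusion-style (non-autoregressive) likelihoods is the summary's claimed contribution and is not reproduced here.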