Mixed Long Chain-of-Thought Fine-Tuning

More Versatile than Gemini Diffusion! MMaDA, the First Multimodal Diffusion Large Language Model, Is Released, Achieving Both Strong Reasoning and High Controllability
机器之心 · 2025-05-22 08:46

Core Insights
- The article discusses advances in large language models (LLMs) and their application to multimodal tasks, highlighting the difficulty of unifying architectures and post-training methods across modalities [1]
- DeepMind's Gemini Diffusion has demonstrated the potential of diffusion models for text modeling, motivating the development of MMaDA, which integrates text reasoning, multimodal understanding, and image generation into a single unified model [1][4]

Group 1: Model Development
- MMaDA is the first systematic exploration of a diffusion architecture for multimodal foundation models, achieving breakthroughs through three core technologies [1]
- The team has open-sourced the training code, inference code, and weights for MMaDA-8B-Base, with additional weights planned for release [4]

Group 2: Performance Metrics
- MMaDA achieved state-of-the-art (SOTA) performance across three major tasks:
  - Text reasoning: an MMLU accuracy of 68.4%, surpassing models such as LLaMA-3-8B and Qwen2-7B [7]
  - Multimodal understanding: matching specialized models on benchmarks such as POPE and VQAv2 [7]
  - Image generation: a CLIP Score of 32.46, with significantly improved accuracy on cultural-knowledge generation tasks [7]

Group 3: Cross-Task Synergy
- During the mixed training phases, text reasoning and image generation metrics improved together, indicating strong cross-task synergy [9]
- MMaDA supports three types of cross-modal completion tasks, showcasing its flexibility and generalization in complex generation and reasoning tasks [11][13]

Group 4: Key Technical Innovations
- MMaDA's architecture unifies the text and image generation processes within a single diffusion framework, eliminating the complexity of traditional hybrid architectures [15]
- The model employs a mixed long chain-of-thought (CoT) fine-tuning strategy to handle complex tasks, strengthening its reasoning capabilities [15][19]
- A unified inference format is defined so that the model outputs cross-modal reasoning steps before generating answers (a minimal sketch of such a format appears after Group 5) [18]

Group 5: Training Strategies
- The model uses structured noise strategies and diversified reward modeling to improve performance across tasks (see the sketches below) [19][21]
- The UniGRPO algorithm converged roughly 40% faster during training than baseline methods [21]
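The digest does not reproduce MMaDA's actual prompt template, so the following Python sketch only illustrates what a unified, task-agnostic reasoning format for mixed long-CoT fine-tuning could look like. The tag names and the `build_cot_sample` helper are hypothetical, not MMaDA's real format.

```python
# A minimal sketch of a unified reasoning format for mixed long-CoT
# fine-tuning. Tag names and this helper are hypothetical illustrations.

def build_cot_sample(task: str, prompt: str, reasoning: str, answer: str) -> str:
    """Serialize one training sample so that every task -- text reasoning,
    multimodal QA, or image generation -- shares the same layout:
    reasoning steps first, final answer (text or image tokens) last."""
    return (
        f"<task>{task}</task>\n"
        f"<prompt>{prompt}</prompt>\n"
        f"<think>{reasoning}</think>\n"
        f"<answer>{answer}</answer>"
    )

# Mixing samples from different tasks into one fine-tuning stream:
samples = [
    build_cot_sample("text_reasoning", "What is 17 * 24?",
                     "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.", "408"),
    build_cot_sample("image_generation", "A red panda reading a book",
                     "The scene needs a red panda, a book, soft lighting...",
                     "<image_tokens>"),
]
```

Sharing one layout across tasks is what lets a single model learn to emit reasoning steps before any answer, regardless of output modality.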
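The article likewise gives no detail on the structured noise strategy. As a rough sketch of how masked discrete diffusion training generally proceeds (the family of methods MMaDA builds on), the step below samples a per-sequence noise level, masks tokens at that rate, and supervises only the masked positions. The uniform masking schedule, the loss weighting, and the `MASK_ID` constant are assumptions, not MMaDA's actual schedule.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical id of the [MASK] token

def masked_diffusion_step(model, tokens: torch.Tensor) -> torch.Tensor:
    """One simplified training step of masked discrete diffusion:
    sample a noise level t, mask each token independently with
    probability t, and train the model to recover masked positions.
    MMaDA's structured noise strategy additionally shapes this
    schedule per task; that shaping is not reproduced here."""
    b, n = tokens.shape
    t = torch.rand(b, 1)                 # per-sequence noise level in (0, 1)
    mask = torch.rand(b, n) < t          # mask each token with probability t
    noisy = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)
    logits = model(noisy)                # (b, n, vocab_size)
    # Cross-entropy only on the masked positions.
    return F.cross_entropy(logits[mask], tokens[mask])
```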
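UniGRPO itself is only named, not specified, in the digest. The sketch below shows the group-relative advantage normalization that GRPO-style methods share, together with a hypothetical weighted combination of diversified rewards; the reward terms and their weights are illustrative assumptions, not the paper's actual scheme.

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each sampled completion's reward
    against the group of completions drawn for the same prompt.
    UniGRPO adapts this idea to diffusion decoding; only the shared
    normalization step is shown here."""
    mean = statistics.mean(rewards)
    std = statistics.stdev(rewards) if len(rewards) > 1 else 1.0
    return [(r - mean) / (std + 1e-6) for r in rewards]

def combined_reward(correctness: float, format_ok: float, clip_score: float) -> float:
    """Diversified reward modeling: a weighted sum of task-specific
    signals. The weights here are hypothetical."""
    return 1.0 * correctness + 0.5 * format_ok + 0.1 * clip_score

# Four sampled completions for one prompt, scored and normalized:
rewards = [combined_reward(1.0, 1.0, 0.3), combined_reward(0.0, 1.0, 0.2),
           combined_reward(1.0, 0.0, 0.4), combined_reward(0.0, 0.0, 0.1)]
print(group_relative_advantages(rewards))
```

Normalizing within the group means no separate value network is needed, which is one reason GRPO-style training tends to converge quickly.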
Core Insights - The article discusses the advancements in large language models (LLMs) and their application in multimodal tasks, highlighting the challenges in architecture uniformity and post-training methods [1] - DeepMind's Gemini Diffusion has demonstrated the potential of diffusion models in text modeling, leading to the development of MMaDA, which integrates text reasoning, multimodal understanding, and image generation into a unified model [1][4] Group 1: Model Development - MMaDA is the first systematic exploration of a diffusion architecture for multimodal foundational models, achieving breakthroughs through three core technologies [1] - The team has open-sourced the training, inference, and weights for MMaDA-8B-Base, with plans to release additional weights [4] Group 2: Performance Metrics - MMaDA achieved state-of-the-art (SOTA) performance in three major tasks: - Textual reasoning with an MMLU accuracy of 68.4%, surpassing models like LLaMA-3-8B and Qwen2-7B [7] - Multimodal understanding, matching specialized models on benchmarks like POPE and VQAv2 [7] - Image generation with a CLIP Score of 32.46, significantly improving accuracy in cultural knowledge generation tasks [7] Group 3: Cross-Task Synergy - During mixed training phases, improvements in text reasoning and image generation metrics were observed, indicating a strong cross-task synergy [9] - MMaDA supports three types of cross-modal completion tasks, showcasing its flexibility and generalization capabilities in complex generation and reasoning tasks [11][13] Group 4: Key Technical Innovations - MMaDA's architecture unifies the text and image generation processes within a diffusion framework, eliminating the complexity of traditional mixed architectures [15] - The model employs a mixed long-chain thinking fine-tuning strategy to address challenges in complex tasks, enhancing its reasoning capabilities [15][19] - A unified inference format is defined to ensure the model outputs cross-modal reasoning steps before generating answers [18] Group 5: Training Strategies - The model utilizes structured noise strategies and diversified reward modeling to enhance performance across different tasks [19][21] - The UniGRPO algorithm has shown a 40% improvement in convergence speed during training compared to baseline methods [21]