Ant Group Joins Hands with Renmin University of China to Release an MoE Diffusion Model
Hua Er Jie Jian Wen·2025-09-12 06:02

Core Insights
- Ant Group and Renmin University of China jointly released "LLaDA-MoE", described as the industry's first diffusion language model built on a native MoE architecture, at the 2025 Bund Conference, marking a significant step toward AGI [1][2]
- LLaDA-MoE was trained on approximately 20 trillion (20T) tokens of data, demonstrating the scalability and stability of industrial-grade large-scale training; it outperforms earlier diffusion models such as LLaDA-1.0/1.5 and Dream-7B while retaining an inference speed advantage of several times [1][2]
- The model achieves language intelligence comparable to Qwen2.5, challenging the prevailing notion that language models must be autoregressive; it activates only 1.4 billion parameters to match the performance of a 3-billion-parameter dense model (see the MoE routing sketch after this summary) [1][2]

Model Performance and Features
- LLaDA-MoE posted an average improvement of 8.4% across 17 benchmarks, surpassing LLaDA-1.5 by 13.2% and drawing level with Qwen2.5-3B-Instruct [3]
- Development took roughly three months of rewriting the training code on top of LLaDA-1.0, using Ant Group's self-developed distributed framework ATorch for parallel acceleration [2][3]
- The architecture is a 7B-A1B MoE design (about 7 billion total parameters with roughly 1.4 billion activated per token), and training resolved core challenges such as expert load balancing and noise-sampling drift [2]

Future Developments
- Ant Group plans to open-source the model weights along with a self-developed inference engine optimized for dLLM parallel decoding, which it reports delivers a significant speedup over NVIDIA's official fast-dLLM (see the decoding sketch below) [3]
- The company says it will continue investing in the AGI field based on dLLMs, collaborating with academia and the global AI community to drive new breakthroughs [3]
- The announcement stresses that autoregressive models are not the endpoint: diffusion models can also serve as a main pathway toward AGI [3]
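The "activate 1.4B of 7B parameters" claim rests on sparse expert routing: a gating network sends each token to only a few experts, so the rest of the expert parameters sit idle for that token. The sketch below is a minimal, generic top-k MoE layer in PyTorch; the layer sizes, expert count, and top_k are illustrative placeholders, not LLaDA-MoE's actual configuration.

```python
# Minimal sketch of top-k expert routing, the generic mechanism behind
# "7B total / ~1.4B activated" MoE models. All dimensions are toy values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):
        # x: (num_tokens, d_model). Each token is routed to its top_k experts,
        # so the remaining experts' parameters stay inactive for that token.
        scores = self.gate(x)                               # (num_tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)      # (num_tokens, top_k)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                routed = idx[:, slot] == e                  # tokens sent to expert e in this slot
                if routed.any():
                    out[routed] += weights[routed, slot].unsqueeze(-1) * expert(x[routed])
        return out

tokens = torch.randn(16, 512)                               # a toy batch of 16 token vectors
print(TopKMoELayer()(tokens).shape)                         # torch.Size([16, 512])
```

In practice, MoE training typically adds an auxiliary load-balancing objective so the gate spreads tokens evenly across experts; that is the "load balancing" challenge the announcement refers to.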
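The inference-speed advantage attributed to dLLMs comes from scoring many positions in one forward pass and filling them in over a few refinement steps, rather than emitting one token at a time. The following is a toy sketch of confidence-based iterative unmasking; `toy_model`, the mask id, the step count, and the unmasking rule are illustrative assumptions, not Ant Group's released inference engine or LLaDA-MoE's actual sampler.

```python
# Toy sketch of iterative masked-diffusion decoding with parallel unmasking.
import torch

VOCAB = 1000            # hypothetical vocabulary size
MASK_ID = VOCAB         # hypothetical [MASK] token id, outside the normal vocabulary

def toy_model(tokens):
    # Stand-in for a diffusion LM forward pass: one call scores *every*
    # position at once. A real model would condition on the partially
    # unmasked sequence; random logits suffice to show the decoding loop.
    return torch.randn(tokens.shape[0], VOCAB)

def diffusion_decode(length=32, steps=8):
    tokens = torch.full((length,), MASK_ID, dtype=torch.long)
    for _ in range(steps):
        masked = tokens == MASK_ID
        if not masked.any():
            break
        probs = toy_model(tokens).softmax(dim=-1)
        conf, pred = probs.max(dim=-1)       # best guess and its confidence per position
        conf[~masked] = -1.0                 # never revisit positions already filled in
        # Commit the most confident half of the still-masked positions this step.
        k = max(1, int(masked.sum()) // 2)
        chosen = conf.topk(k).indices
        tokens[chosen] = pred[chosen]
    return tokens

print(diffusion_decode())
```

Because every remaining masked position is predicted in the same forward pass, far fewer model calls are needed per sequence than with token-by-token autoregressive decoding, which is the source of the multi-fold speed advantage the article cites.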