Diffusion Language Models
Diffusion language models now have an MoE version! Ant Group and Renmin University train LLaDA-MoE from scratch, with full open-sourcing coming soon
Ji Qi Zhi Xin· 2025-09-12 11:31
Core Viewpoint
- The article discusses the development of LLaDA-MoE, the first diffusion language model with a native MoE architecture trained from scratch, which demonstrates significant performance and efficiency advantages over traditional autoregressive models [2][15][18].

Group 1: Model Development and Performance
- LLaDA-MoE was trained on roughly 20 trillion tokens of data and activates 1.4 billion parameters per token, achieving performance comparable to dense autoregressive models such as Qwen2.5-3B while maintaining faster inference speeds [15][17][29].
- The LLaDA series has evolved rapidly, with LLaDA-MoE marking a notable milestone: it surpasses earlier models such as LLaDA 1.0/1.5 and Dream-7B across a range of benchmark tests [13][18][29].
- The architecture leaves significant room for scaling, with plans to explore higher sparsity ratios and larger MoE diffusion language models [29][40].

Group 2: Technical Innovations and Advantages
- The diffusion approach enables parallel decoding, bidirectional modeling, and iterative correction, addressing limitations of autoregressive models such as the serial decoding bottleneck and the lack of error-correction capability; see the decoding sketch after this summary [38][40].
- Evidence suggests that diffusion language models can learn more effectively than autoregressive models when data is limited, with data-utilization efficiency reported to exceed three times that of autoregressive models [40][41].
- The training framework and infrastructure developed by Ant Group, including the ATorch framework, support efficient training of large-scale MoE models [25][26].

Group 3: Strategic Vision and Future Directions
- The development of LLaDA-MoE reflects a strategic choice to explore high-potential areas of AI, moving beyond established paths to push the limits of intelligence [44][47].
- Ant Group's commitment to innovation is evident in its earlier projects and its ongoing research on dynamic MoE architectures and hybrid linear architectures, all aimed at achieving artificial general intelligence (AGI) [45][46][47].
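As a rough illustration of the parallel decoding and iterative correction described above, the sketch below runs a toy mask-based diffusion sampler: every position is predicted in parallel at each step, and the least confident positions are re-masked so later steps can revise them. The model interface, step count, and confidence-based re-masking rule are illustrative assumptions, not LLaDA-MoE's actual inference procedure.

```python
# Toy iterative parallel decoder for a mask-based diffusion language model.
# `model` is assumed to be a bidirectional network returning logits of shape
# (batch, seq_len, vocab_size); mask_id and num_steps are illustrative.
import torch

def diffusion_decode(model, seq_len, mask_id, num_steps=8):
    tokens = torch.full((seq_len,), mask_id, dtype=torch.long)   # start fully masked
    for step in range(num_steps):
        logits = model(tokens.unsqueeze(0)).squeeze(0)           # (seq_len, vocab_size)
        probs = torch.softmax(logits, dim=-1)
        confidence, prediction = probs.max(dim=-1)               # best guess per position
        tokens = prediction.clone()                              # commit every position in parallel
        if step < num_steps - 1:
            # Iterative correction: re-mask the least confident positions so the
            # next pass can revise them with more surrounding context filled in.
            num_keep = int(seq_len * (step + 1) / num_steps)
            _, low_conf = confidence.topk(seq_len - num_keep, largest=False)
            tokens[low_conf] = mask_id
    return tokens
```

Because every position is scored at every step, the decoder can exploit context on both sides of a blank, which is the bidirectional-modeling advantage the summary refers to.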
Ant Group and Renmin University jointly release an MoE diffusion model
Hua Er Jie Jian Wen· 2025-09-12 06:02
Core Insights
- Ant Group and Renmin University of China jointly released the industry's first native MoE-architecture diffusion language model, LLaDA-MoE, at the 2025 Bund Conference, marking a significant step towards AGI [1][2].
- LLaDA-MoE was trained on approximately 20 trillion tokens of data, demonstrating the scalability and stability of industrial-grade large-scale training; it outperforms previous models such as LLaDA 1.0/1.5 and Dream-7B while retaining an inference-speed advantage of several times [1][2].
- The model reaches language intelligence comparable to Qwen2.5, challenging the prevailing view that language models must be autoregressive, and activates only 1.4 billion parameters to match the performance of a 3-billion-parameter dense model [1][2].

Model Performance and Features
- LLaDA-MoE shows an average performance improvement of 8.4% across 17 benchmarks, surpassing LLaDA-1.5 by 13.2% and matching Qwen2.5-3B-Instruct [3].
- Development involved a three-month effort to rewrite the training code on top of LLaDA-1.0, using Ant Group's self-developed distributed framework ATorch for parallel acceleration [2][3].
- The architecture, based on a 7B-A1B MoE structure, successfully addressed core training challenges such as load balancing and noise-sampling drift; a routing sketch follows this summary [2].

Future Developments
- Ant Group plans to open-source the model weights along with a self-developed inference engine optimized for the parallel characteristics of dLLMs, which has shown significant acceleration over NVIDIA's official fast-dLLM [3].
- The company aims to continue investing in the AGI field around dLLMs, collaborating with academia and the global AI community to drive new breakthroughs [3].
- The statement emphasizes that autoregressive models are not the endpoint, and that diffusion models can also serve as a main pathway towards AGI [3].
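For readers unfamiliar with the "7B-A1B" notation, the sketch below shows the generic mechanism such a sparse model relies on: a router activates only the top-k experts per token, and an auxiliary penalty nudges the router to balance load across experts, one of the training challenges mentioned above. The expert count, top-k value, and loss form are assumptions for illustration, not Ant Group's actual LLaDA-MoE configuration or ATorch code.

```python
# Toy top-k MoE layer with an auxiliary load-balancing term. Expert count,
# top_k, and the penalty form are illustrative, not the LLaDA-MoE setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model, d_ff, num_experts=16, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):                                  # x: (num_tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)           # routing probabilities
        weight, idx = gate.topk(self.top_k, dim=-1)        # only top_k experts fire per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                chosen = idx[:, slot] == e                 # tokens routed to expert e in this slot
                if chosen.any():
                    out[chosen] += weight[chosen, slot].unsqueeze(-1) * expert(x[chosen])
        # Simple load-balancing penalty: large when probability mass concentrates
        # on a few experts, minimal when it is spread evenly across all of them.
        load = gate.mean(dim=0)
        aux_loss = (load * load).sum() * len(self.experts)
        return out, aux_loss
```

Only the parameters of the selected experts participate in each token's forward pass, which is how a model with several billion total parameters can run with only about a billion active parameters per token.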
Ant Group and Renmin University of China release the industry's first native MoE diffusion language model
Di Yi Cai Jing· 2025-09-12 03:08
Core Viewpoint
- Ant Group and Renmin University of China have jointly developed LLaDA-MoE, a native MoE-architecture diffusion language model (dLLM), demonstrating the scalability and stability of industrial-grade large-scale training on approximately 20 trillion tokens of data [1].

Group 1
- The LLaDA-MoE model was trained from scratch using the MoE architecture [1].
- The model will be fully open-sourced in the near future [1].
Alibaba releases the strongest language-model challenger: can diffusion models disrupt ChatGPT?
Sou Hu Cai Jing· 2025-08-20 02:41
Core Insights
- Research on diffusion language models represents a potential paradigm shift for AI dialogue systems, moving away from traditional autoregressive methods towards a more parallel and efficient approach [2][8].
- Diffusion language models generate text somewhat like an artist painting: multiple words are processed simultaneously, which significantly improves speed and contextual understanding [3][4].

Development and Mechanism
- The evolution of diffusion language models began with the D3PM model in 2021, moving from continuous to discrete spaces and ultimately leading to models such as DiffusionBERT and the LLaDA series, which operate directly in the text space [3][4].
- The training strategy for diffusion models resembles a fill-in-the-blank game, strengthening the model's ability to capture bidirectional relationships between words; see the training-objective sketch after this summary [5].

Performance and Comparison
- Recent findings indicate that diffusion language models such as LLaDA-8B can perform comparably to, or even exceed, traditional autoregressive models such as LLaMA3-8B on various benchmarks, suggesting there is no forced trade-off between speed and quality [4][5].
- The inference procedure of diffusion models allows for iterative adjustments during text generation, improving overall output quality [5][6].

Applications and Challenges
- Diffusion language models have shown promising results in applications such as code generation, mathematical reasoning, and document summarization, particularly in tasks that require global planning [6][7].
- Challenges include the "curse of parallel generation," where dependencies between simultaneously generated words may not be adequately modeled, as well as the need for infrastructure support tailored to diffusion models [6][7].

Future Directions
- Future work on diffusion language models will focus on improving training efficiency, strengthening long-text generation, and refining inference algorithms to close the gap with traditional models [7].
- Companies are beginning to commercialize diffusion language models, with models like Mercury claiming to generate thousands of words per second, indicating significant potential for real-time applications [7][8].
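A minimal sketch of the "fill-in-the-blank" training objective mentioned above, assuming a generic bidirectional model interface: a random fraction of tokens is replaced with a mask token and the model is trained to recover only the masked positions. The interface and mask handling are illustrative; published masked-diffusion objectives such as LLaDA's additionally reweight the loss by the masking ratio, which is omitted here for brevity.

```python
# Toy "fill-in-the-blank" objective for a masked diffusion language model.
# `model` is an assumed bidirectional network returning logits of shape
# (batch, seq_len, vocab_size); mask_id is an illustrative mask token id.
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, tokens, mask_id):
    """tokens: (batch, seq_len) clean token ids."""
    batch, seq_len = tokens.shape
    ratio = torch.rand(batch, 1)                        # per-sequence masking ratio in [0, 1)
    is_masked = torch.rand(batch, seq_len) < ratio      # Bernoulli(ratio) mask per position
    corrupted = torch.where(is_masked, torch.full_like(tokens, mask_id), tokens)
    logits = model(corrupted)                           # predictions for every position at once
    # Only the masked positions contribute to the loss, so the model learns to
    # reconstruct missing words from context on both sides of each blank.
    return F.cross_entropy(logits[is_masked], tokens[is_masked])
```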
What Meta didn't do, NVIDIA did: a brand-new architecture with 6x throughput, trained on 20 trillion tokens
36Kr· 2025-08-19 02:33
Core Insights
- NVIDIA has launched a new 9B model, NVIDIA Nemotron Nano 2, built on a Mamba-Transformer hybrid architecture that achieves up to 6 times higher inference throughput than the industry benchmark Qwen3-8B while matching or exceeding it on complex reasoning tasks [1][23].

Group 1: Model Architecture and Performance
- Nemotron Nano 2 is based on the Mamba-2 architecture, which replaces most of the self-attention layers in a traditional Transformer, yielding significant speed improvements on complex reasoning tasks; see the hybrid-stack sketch after this summary [10][15].
- The model shows competitive accuracy across benchmarks covering mathematics, code generation, and general reasoning, performing on par with or better than comparable open-source models such as Qwen3-8B and Gemma3-12B [23][24].
- In specific benchmarks, the model achieved notable scores, including 97.8% on MATH500 and 72.1% on AIME25, demonstrating its capabilities in mathematical reasoning and general knowledge [24].

Group 2: Training and Data Utilization
- Training used a massive dataset of 20 trillion tokens and advanced FP8 training techniques to build a 12-billion-parameter base model, which was later distilled down to 9 billion parameters [17][22].
- The training drew on high-quality data from multiple sources, with an emphasis on mathematics, code, and multilingual question answering, ensuring a robust pre-training corpus [18][25].
- NVIDIA has also released a comprehensive pre-training dataset, Nemotron-Pre-Training-Dataset-v1, containing 6.6 trillion tokens from diverse domains, further strengthening the model's training foundation [25][27].

Group 3: Open Source Commitment
- NVIDIA has committed to open-sourcing the Nemotron models on the HuggingFace platform, providing access to the 9B model, its base version, and the larger 12B model, together with the associated datasets [25][30].
- This move reflects NVIDIA's ongoing contribution to the open-source community, in contrast with other companies shifting towards more closed-source strategies [27].
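The throughput gain described above comes from replacing most quadratic self-attention layers with linear-time Mamba-style blocks. The sketch below only illustrates how such a hybrid stack could be assembled from a layer pattern; the block constructors and the pattern itself are placeholders, not NVIDIA's actual Nemotron Nano 2 layout.

```python
# Toy assembly of a Mamba-Transformer hybrid stack: most layers are linear-time
# sequence-mixing blocks ('M'), only a few keep full self-attention ('A').
import torch.nn as nn

class HybridStack(nn.Module):
    def __init__(self, pattern="MMMAMMMMAMMM", make_mamba=None, make_attention=None):
        super().__init__()
        # Callers would supply real Mamba-2 and attention block constructors;
        # Identity stubs keep this sketch self-contained and runnable.
        make_mamba = make_mamba or nn.Identity
        make_attention = make_attention or nn.Identity
        self.layers = nn.ModuleList(
            make_mamba() if kind == "M" else make_attention() for kind in pattern
        )

    def forward(self, x):
        for layer in self.layers:
            x = x + layer(x)       # residual connection around every block
        return x
```

Because only the 'A' positions in the pattern pay the quadratic attention cost, inference time grows roughly linearly with sequence length for the rest of the stack, which is where the reported throughput advantage comes from.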
ByteDance's Seed team releases a diffusion language model with an inference speed of 2,146 tokens per second
news flash· 2025-07-31 12:35
Core Insights
- ByteDance's Seed team released an experimental diffusion language model called Seed Diffusion Preview, aiming to validate the feasibility of discrete diffusion technology as a foundational framework for next-generation language models [1].

Group 1: Model Performance
- The experimental results indicate that the code-inference speed of Seed Diffusion Preview can reach 2,146 tokens per second, a 5.4x speedup over autoregressive models of similar scale [1].