Diffusion Language Models

Qwen3 turned into a diffusion language model? No training from scratch required, and a record-setting 30B parameters
机器之心· 2025-10-12 04:05
Core Insights
- The article discusses the development of RND1-Base, the largest open-source diffusion language model (DLM) to date, which addresses the training-efficiency and scalability challenges of building DLMs at scale by converting a pre-trained autoregressive (AR) model rather than training from scratch [2][3][6].

Group 1: Model Development
- RND1-Base is a 30-billion-parameter sparse mixture-of-experts (MoE) model with 3 billion active parameters, derived from the pre-trained AR model Qwen3-30B-A3B and further trained on 500 billion tokens to achieve full diffusion behavior [6].
- The research team at Radical Numerics has demonstrated that scaling diffusion language models beyond 8 billion parameters is both feasible and effective [9].

Group 2: Performance Evaluation
- RND1 was evaluated on benchmarks including MMLU, ARC-C, RACE, and BBH, showing stable performance that surpasses existing diffusion models such as Dream-7B and LLaDA-8B while retaining the strong performance of its AR foundation [7].
- RND1 was not compared with the latest LLaDA model (LLaDA-MoE-7B-A1B), however, so further comparisons are needed to determine which model is superior [9].

Group 3: Training Methodology
- The research identified key factors in the autoregressive-to-diffusion (A2D) conversion process, such as initialization strategy, layer-wise learning rates, and critical batch size, which contribute to scalability and stability [10].
- A simpler method, Simple Continuous Pretraining (SCP), was found to match the performance of more complex A2D conversion pipelines while effectively retaining knowledge from AR pre-training [13][14].

Group 4: Training Efficiency
- The study revealed that A2D conversion performs better with larger batch sizes, indicating that diffusion language models can effectively exploit large batches during continued pre-training [15][17].
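The masked diffusion objective that SCP continues training under can be illustrated with a toy sketch: randomly replace tokens with a [MASK] id (the forward corruption step), then score the model's predictions only at the masked positions. This is a minimal numpy illustration, not RND1's implementation; the random `logits` stand in for an actual model's output, and all names here are made up for the example.

```python
import numpy as np

def corrupt(ids, mask_token_id, mask_prob, rng):
    """Forward process: randomly replace tokens with [MASK]."""
    mask = rng.random(ids.shape) < mask_prob
    noisy = np.where(mask, mask_token_id, ids)
    return noisy, mask

def masked_diffusion_loss(logits, targets, mask):
    """Cross-entropy computed only on the masked (to-be-denoised) positions."""
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    picked = np.take_along_axis(probs, targets[..., None], -1)[..., 0]
    return -np.log(picked[mask]).mean()

rng = np.random.default_rng(0)
vocab, mask_id = 100, 99
ids = rng.integers(0, vocab - 1, size=(2, 16))       # toy token batch
noisy, mask = corrupt(ids, mask_id, mask_prob=0.5, rng=rng)
logits = rng.standard_normal((2, 16, vocab))         # stand-in for model(noisy)
loss = masked_diffusion_loss(logits, ids, mask)
```

Unlike the AR next-token loss, the target positions here are visible context for one another, which is why the conversion also requires bidirectional attention.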
- The article emphasizes that the conversion consists of replacing the causal attention mask with a bidirectional mask at initialization and then continuing pre-training under a masked diffusion objective [18].

Group 5: Company Vision
- Radical Numerics aims to create an automated AI research platform that recursively improves itself, with RND1 being one of the first tangible outcomes of this vision [20].
- The founding team of Radical Numerics comprises members from top institutions such as DeepMind and Stanford, focusing on hybrid architectures and innovative technologies [21].
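The causal-to-bidirectional mask swap described in the training methodology can be sketched as follows. This is an illustrative toy, assuming the common boolean convention where `True` at (i, j) means position i may attend to position j; the function names are invented for the example, not taken from RND1's code.

```python
import numpy as np

def causal_mask(n):
    """AR attention: position i may only attend to positions j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def bidirectional_mask(n):
    """Diffusion attention: every position may attend to every position."""
    return np.ones((n, n), dtype=bool)

# A2D initialization keeps the AR weights but lifts the causal restriction:
# the bidirectional mask admits all n*n pairs, the causal one only n*(n+1)/2.
n = 6
extra = bidirectional_mask(n) & ~causal_mask(n)  # newly visible "future" pairs
```

Everything else (weights, tokenizer) carries over from the AR checkpoint; only the attention mask and the training objective change, which is what makes the conversion cheap relative to training a DLM from scratch.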