Diffusion Language Models
A niche architecture wins big: an editing feature pushes a 100B diffusion model to 892 tokens/second
36Ke· 2026-02-11 05:21
Core Insights
- The article highlights the significant advancements of Ant Group's LLaDA2.1 model, which has achieved a peak speed of 892 tokens per second on complex programming tasks, outperforming mainstream autoregressive models that operate at much lower speeds [1][18][20].

Group 1: Model Development and Features
- LLaDA2.1 represents a historic shift from research model to practical tool, showcasing improved efficiency and usability [2][5].
- The model introduces a dual-mode design, allowing users to switch between Speedy Mode and Quality Mode with a single configuration, simplifying both the user experience and model management (a configuration-level sketch follows this summary) [4][6].
- Speedy Mode produces a rapid initial draft, while Quality Mode prioritizes accuracy, catering to different user needs [6][21].

Group 2: Technical Innovations
- The model employs an Error-Correcting Editable (ECE) mechanism that enables self-correction during generation, addressing the inconsistency issues common in earlier diffusion models [8][13].
- LLaDA2.1 successfully applies reinforcement learning (RL) to a 100B-parameter diffusion model, a feat previously considered impossible, enhancing its performance on alignment tasks [16][22].

Group 3: Performance Metrics
- In benchmark tests, LLaDA2.1 outperformed its predecessor LLaDA2.0 across a range of tasks, demonstrating superior performance in both speed and quality [22][23].
- The model's peak speed in Speedy Mode reached 892 tokens per second on the HumanEval+ benchmark, while the Mini version exceeded 1500 tokens per second on certain tasks [18][24].

Group 4: Industry Implications
- The advancements in LLaDA2.1 challenge the dominance of autoregressive models, suggesting a potential shift in industry standards toward more efficient and versatile architectures [20][26].
- The open-sourcing of LLaDA2.1 and its Mini version signals a strategic move to foster wider adoption and innovation within the AI community [24][27].
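None of the coverage specifies how the single-configuration mode switch is implemented. The sketch below is a minimal illustration of how one decoding loop can expose both modes through a single confidence threshold, a knob commonly used in parallel-unmasking decoders; all names (`parallel_unmask_decode`, `MASK_ID`, the threshold values) are assumptions for illustration, not LLaDA2.1's actual API.

```python
import torch

MASK_ID = 0  # hypothetical id for the mask token

def parallel_unmask_decode(model, x, threshold):
    """Toy parallel-unmasking loop. Each step, commit every masked
    position whose top-1 probability clears `threshold`; if none does,
    commit the single most confident one so the loop always progresses.
    A low threshold commits many tokens per step (speed-leaning);
    a high threshold commits few (quality-leaning)."""
    while (x == MASK_ID).any():
        probs = model(x).softmax(-1)       # (seq_len, vocab)
        probs[:, MASK_ID] = 0.0            # never predict the mask token itself
        conf, pred = probs.max(-1)         # per-position confidence and argmax
        masked = x == MASK_ID
        take = masked & (conf >= threshold)
        if not take.any():                 # fallback: commit the best masked slot
            best = torch.where(masked, conf, torch.full_like(conf, -1.0)).argmax()
            take[best] = True
        x = torch.where(take, pred, x)
    return x

# The two "modes" then differ only in one configuration value:
# speedy  = parallel_unmask_decode(model, prompt_canvas, threshold=0.5)
# quality = parallel_unmask_decode(model, prompt_canvas, threshold=0.95)
```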
A milestone moment: a 100B diffusion language model hits 892 tokens/second, and AI's alternative path is now viable
36Ke· 2026-02-11 04:31
Core Insights
- The release of LLaDA2.1 marks a significant transformation in the field of diffusion language models (dLLM), previously considered a niche area. The new release includes LLaDA2.1-Mini (16 billion parameters) and LLaDA2.1-Flash (100 billion parameters) [1][3].
- LLaDA2.1 achieves a peak speed of 892 tokens per second, demonstrating a practical efficiency advantage, and its error-correcting mechanism breaks the "fast but inaccurate" paradigm [3][10].
- The model introduces a dual-mode system that lets users switch between quality and speed, effectively addressing the trade-off between the two [15][19].

Model Performance
- LLaDA2.1's 100-billion-parameter version reached a peak speed of 892 tokens per second, which is particularly notable given the complexity of the tasks it handles, such as programming benchmarks [10][11].
- The architecture supports parallel generation and self-correction, enhancing usability compared to traditional autoregressive models, which lack this capability [13][14].
- In experimental evaluations, LLaDA2.1 outperformed its predecessor LLaDA2.0 in quality mode across various benchmarks, while also showing significant throughput improvements in speed mode [20][22].

Technical Innovations
- The Error-Correcting Editable (ECE) mechanism lets LLaDA2.1 draft answers quickly and then edit them, enabling a more flexible and accurate generation process (an illustrative sketch follows this summary) [13][18].
- A reinforcement learning phase improves the model's instruction understanding and alignment with user intent, a first for diffusion models at this scale [16][17].
- The dual-mode design lets users configure the model for either speed or quality, simplifying the user experience and model management [15][19].

Industry Implications
- LLaDA2.1's advances suggest a potential shift in the AI model landscape, challenging the dominance of autoregressive architectures and opening new avenues for research and application in language modeling [26].
- The successful implementation of a 100-billion-parameter diffusion model indicates that the barriers to scaling such models may be falling, encouraging further investment and exploration in this area [11][26].
- The model's ability to handle complex tasks efficiently positions it as a competitive alternative in the AI landscape, with potential influence on future language processing technologies [10][26].
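The summary describes the Error-Correcting Editable mechanism only at the draft-then-edit level. The following is a hedged sketch of that general idea, re-masking a finished draft's least confident positions and re-predicting them with both-sided context; the function names and the confidence-based selection rule are illustrative assumptions, not the published ECE algorithm.

```python
import torch

MASK_ID = 0  # hypothetical mask-token id

def draft_then_edit(model, draft, rounds=2, edit_frac=0.1):
    """Illustrative draft-then-edit loop: score a finished draft,
    re-mask the least confident fraction of positions, and let the
    model re-fill them while conditioning on both sides. Repeating
    this lets early mistakes be revised instead of frozen, which
    left-to-right AR decoding cannot do."""
    draft = draft.clone()
    for _ in range(rounds):
        probs = model(draft).softmax(-1)                  # (seq_len, vocab)
        conf = probs.gather(-1, draft.unsqueeze(-1)).squeeze(-1)
        k = max(1, int(edit_frac * draft.numel()))
        weakest = conf.topk(k, largest=False).indices     # least confident slots
        masked = draft.clone()
        masked[weakest] = MASK_ID
        refill = model(masked).argmax(-1)                 # re-predict in place
        draft[weakest] = refill[weakest]
    return draft
```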
A milestone moment! A 100B diffusion language model hits 892 tokens/second, and AI's alternative path is now viable
机器之心· 2026-02-11 01:59
Core Insights
- The article discusses significant advances in diffusion language models (dLLM), highlighting the release of LLaDA2.1 as a transformative moment for this research area [2][4].
- LLaDA2.1 demonstrates a peak speed of 892 tokens per second (TPS) for its 100-billion-parameter version, showcasing its efficiency and practical applicability [13][14].
- The model introduces a novel error-correcting editable mechanism that allows real-time corrections during text generation, addressing a limitation of traditional autoregressive models [16][17].

Group 1: Model Features and Innovations
- LLaDA2.1 ships in two versions, LLaDA2.1-Mini (16B) and LLaDA2.1-Flash (100B), with the latter achieving remarkable performance metrics [2][4].
- The model employs a dual-mode system, enabling users to switch between a speed-focused mode and a quality-focused mode, enhancing usability [20][26].
- Reinforcement learning in the training process allows LLaDA2.1 to better understand instructions and align with user intent, improving its overall reliability [21][22].

Group 2: Performance Metrics and Comparisons
- In benchmark tests, LLaDA2.1 outperformed its predecessor LLaDA2.0 on various tasks, particularly in quality mode, where it exceeded previous scores [24][30].
- The speed advantage is especially evident in coding tasks: the model reached a peak of 891.74 TPS on the HumanEval+ benchmark, significantly enhancing its practical value for programming (a small timing harness follows this summary) [28][30].
- Comparative performance data indicate that LLaDA2.1 consistently surpasses other models in speed and efficiency across multiple benchmarks [25][27].

Group 3: Implications for the Industry
- The advances represented by LLaDA2.1 suggest a potential shift in the AI language model landscape, moving beyond the dominance of autoregressive models to explore the capabilities of diffusion models [33].
- The successful implementation of a scalable diffusion model at the 100-billion-parameter level marks a breakthrough against previous limits on model size and performance [14][33].
- While autoregressive models have been the primary focus of the field, LLaDA2.1 illustrates the viability of alternative approaches, potentially leading to a more diverse range of solutions in the language model space [33].
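Throughput claims like the 891.74 TPS figure are easy to reproduce in form, if not in value. A minimal timing harness might look like the following, where `generate` and `my_model` are hypothetical stand-ins for any decoding function, not part of any published API.

```python
import time

def peak_tokens_per_second(generate, prompt, runs=5):
    """Measure decode throughput as generated tokens per wall-clock
    second and keep the best run, since headline figures like the
    891.74 TPS above are peaks. `generate` is any callable returning
    the list of newly generated token ids (output tokens only;
    prompt tokens are conventionally excluded from TPS)."""
    best = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        out_tokens = generate(prompt)
        elapsed = time.perf_counter() - start
        best = max(best, len(out_tokens) / elapsed)
    return best

# peak_tokens_per_second(lambda p: my_model.generate(p), "implement quicksort")
```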
A niche architecture wins big! An editing feature pushes a 100B diffusion model to 892 tokens/second!
量子位· 2026-02-11 01:55
Core Viewpoint
- The article discusses the emergence of Ant Group's LLaDA2.1 model, which has achieved a remarkable 892 tokens per second on complex programming tasks, marking a significant advance over traditional autoregressive models [1][3][11].

Group 1: Model Performance and Features
- LLaDA2.1 operates at the 100-billion-parameter scale and has transitioned from research model to practical tool, demonstrating superior efficiency [3][4].
- The model introduces a dual-mode decoding strategy that lets users switch between Speedy Mode and Quality Mode with a single configuration, enhancing usability [9][10].
- In Speedy Mode, LLaDA2.1 reaches a peak of 892 tokens per second on the HumanEval+ benchmark, while in Quality Mode it surpasses previous models on various reasoning tasks [11][31].

Group 2: Technical Innovations
- The Error-Correcting Editable (ECE) mechanism enables the model to generate drafts quickly and then refine them, addressing the limitations of traditional diffusion models [16][21].
- LLaDA2.1 successfully implements reinforcement learning (RL) at the 100B scale, improving instruction following and demonstrating that diffusion models can deliver both speed and understanding [23][26].
- The introduction of the EBPO algorithm allows efficient training and editing, a significant milestone in applying RL to diffusion models [25][28].

Group 3: Competitive Advantage
- In benchmark tests, LLaDA2.1 shows a significant advantage over mainstream autoregressive architectures, achieving high speeds without compromising quality [29][30].
- The model maintains quality even in Speedy Mode, demonstrating robustness and a genuine balance between speed and accuracy [32].
- A lighter 16-billion-parameter Mini version has been released, reaching peak speeds above 1500 tokens per second and pointing toward more lightweight deployments [33].
Stable-DiffCoder surpasses autoregressive models! A new breakthrough for diffusion models in code generation
机器之心· 2026-02-05 23:45
Core Insights
- The article discusses the launch of Stable-DiffCoder, a new diffusion language model developed by Huazhong University of Science and Technology and ByteDance, which explores whether diffusion training can push model capability beyond traditional autoregressive (AR) models [1].

Group 1: Model Performance
- Stable-DiffCoder outperformed its AR counterparts and several strong open-source models such as Qwen2.5-Coder and DeepSeek-Coder on multiple mainstream code benchmarks, demonstrating the diffusion training paradigm as a powerful form of data augmentation [1].
- In the 8B model category, Stable-DiffCoder scored 79.3 on HumanEval and 83.6 on MBPP, surpassing many existing models [23][24].

Group 2: Training Methodology
- The model uses continuous pre-training (CPT) with Block Diffusion and several stability optimization strategies to improve performance (a sketch follows this summary) [1].
- The training pipeline first compresses knowledge with AR methods before transitioning to diffusion, which makes learning a diffusion language model more efficient [15][16].

Group 3: Knowledge Learning Challenges
- The article highlights challenges in the diffusion process, such as injected noise and incorrect knowledge mapping, which can hinder effective learning [5][11].
- It emphasizes the importance of maintaining a clean sample distribution during training to ensure effective knowledge transfer [11][20].

Group 4: Future Implications
- The release of Stable-DiffCoder suggests a new path for the evolution of large models: AR models can serve as efficient knowledge compressors, while diffusion methods act as enhancers that raise model intelligence [31].
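Stable-DiffCoder's exact training loss is not given in the summary. The sketch below shows a generic masked-denoising step in the Block Diffusion style it names: earlier blocks stay clean as context while a fraction of positions inside the current block are corrupted and recovered. The shapes, the `MASK_ID` value, and the masking rule are illustrative assumptions, not the paper's recipe.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical mask-token id

def block_diffusion_loss(model, tokens, block_start, block_end, mask_rate):
    """One illustrative block-diffusion training step: corrupt a random
    fraction of positions inside the current block with MASK_ID, keep
    everything before the block clean as context, and train the model
    to recover only the corrupted positions. Earlier blocks act like an
    AR prefix; denoising happens within the block."""
    x = tokens.clone()
    noise = torch.rand(block_end - block_start) < mask_rate
    if not noise.any():
        noise[torch.randint(len(noise), (1,))] = True   # always corrupt something
    x[block_start:block_end][noise] = MASK_ID
    logits = model(x)                                   # (seq_len, vocab)
    in_block = logits[block_start:block_end][noise]
    targets = tokens[block_start:block_end][noise]
    return F.cross_entropy(in_block, targets)
```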
Yao Class legend Chen Lijie joins OpenAI! Admitted to Tsinghua at 16, UC Berkeley assistant professor at 30
创业邦· 2026-01-15 10:15
Core Insights
- Chen Lijie, a prominent figure from Tsinghua University's Yao Class, has joined OpenAI to focus on mathematical reasoning [3][6].
- His recent research is centered on diffusion language models, aligning with the current evolution of generative models [6].

Group 1: Background of Chen Lijie
- Born in 1995, Chen Lijie won a gold medal at the National Olympiad in Informatics at the age of 16 and was admitted to Tsinghua University [11].
- He became an assistant professor at UC Berkeley in 2025, specializing in computational complexity theory [11][16].
- Chen has a remarkable academic record, having published multiple papers at prestigious conferences during his undergraduate studies [14].

Group 2: Academic Achievements
- He was the first Chinese undergraduate to publish at the FOCS conference in 2017, solving significant problems in computational complexity [15].
- Chen received his PhD from MIT in 2022 and was awarded the Miller Fellowship at UC Berkeley, a prestigious honor for outstanding young scholars [15].
- His research contributions include advances in derandomization and complexity lower bounds, with a recent paper addressing a long-standing problem in complexity theory [15][19].

Group 3: Current Research Focus
- Chen's primary research areas include P vs. NP, circuit complexity, and algorithmic lower bounds, with applications in quantum physics and AI safety [19].
- His involvement with OpenAI marks a significant step in exploring AI safety, particularly in the context of his expertise in complexity theory [19].
Yao Class legend Chen Lijie joins OpenAI! Admitted to Tsinghua at 16, UC Berkeley assistant professor at 30
量子位· 2026-01-15 01:23
Core Insights
- Chen Lijie, a prominent figure from Tsinghua University's Yao Class and an assistant professor at UC Berkeley, has joined OpenAI to focus on mathematical reasoning [2][10][30].

Group 1: Chen Lijie's Background
- Chen Lijie was born in 1995 and won a gold medal in the National Olympiad in Informatics at the age of 16, leading to his admission to Tsinghua University [10][12].
- He graduated from Tsinghua University in 2017 and pursued a Ph.D. at MIT, where he researched computational complexity theory under Ryan Williams [21][22].
- Chen has published multiple papers at top-tier conferences and received several awards, including the Best Student Paper Award at FOCS 2019 [24][27].

Group 2: Research Contributions
- His research interests include the P vs. NP problem, circuit complexity, fine-grained complexity, and derandomization, with significant contributions to theoretical computer science [27][28].
- Chen's recent work has focused on the connection between derandomization and complexity lower bounds, as well as applying complexity-theoretic methods to quantum physics and AI safety [28][29].

Group 3: OpenAI Involvement
- At OpenAI, Chen will explore diffusion language models, in line with current advances in generative models [7][30].
- His previous research was cited in OpenAI's paper on language model hallucinations, indicating his influence in the field [4][30].
Sebastian Raschka's 2026 predictions: Transformers still dominate, but diffusion models are quietly rising
机器之心· 2026-01-14 07:18
Core Insights
- The article surveys the evolving landscape of large language models (LLMs) as of 2026, highlighting a shift from pure Transformer dominance toward efficiency and hybrid architectures [1][4][5].

Group 1: Transformer Architecture and Efficiency
- The Transformer architecture is expected to remain the foundation of the AI ecosystem for at least the next few years, supported by mature toolchains and optimization strategies [4].
- Recent developments point to hybrid architectures and efficiency improvements rather than a complete overhaul of existing models [5].
- The industry is increasingly focused on mixed architectures and efficiency, as demonstrated by models like DeepSeek V3 and R1, which use mixture of experts (MoE) and multi-head latent attention (MLA) to cut inference costs while keeping parameter counts large [7].

Group 2: Linear and Sparse Attention Mechanisms
- The standard Transformer attention mechanism has O(N^2) complexity, so computational cost grows quadratically with context length [9].
- Newer models like Qwen3-Next and Kimi Linear adopt hybrid strategies that interleave efficient linear layers with full attention layers, balancing long-range dependencies against inference speed (see the sketch after this summary) [14].

Group 3: Diffusion Language Models
- Diffusion language models (DLMs) are drawing attention for generating tokens quickly and cheaply through parallel generation, in contrast to the serial generation of autoregressive models [12].
- Despite these advantages, DLMs struggle to integrate tool calls within response chains because of their simultaneous-generation nature [15].
- Research indicates that DLMs may outperform autoregressive models when high-quality data is scarce, since they can benefit from multiple training epochs without overfitting [24][25].

Group 4: Data Scarcity and Learning Efficiency
- The "Crossover" concept suggests that while autoregressive models learn faster with ample data, DLMs excel when data is limited, reaching strong benchmark accuracy from relatively small datasets [27].
- DLMs show that more training epochs need not degrade downstream task performance, offering a potential path forward in an era of data scarcity [28].
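The interleaving pattern attributed to Qwen3-Next and Kimi Linear can be shown at the layer-schedule level. The ratio below is a placeholder chosen for illustration, not either model's published configuration.

```python
def hybrid_layer_plan(n_layers, full_every=4):
    """Illustrative layer schedule for a hybrid stack: most layers use
    O(N) linear attention, and every `full_every`-th layer keeps full
    O(N^2) attention to preserve long-range dependencies. The 1-in-4
    ratio here is an assumed example, not a specific model's config."""
    return ["full" if (i + 1) % full_every == 0 else "linear"
            for i in range(n_layers)]

print(hybrid_layer_plan(8))
# ['linear', 'linear', 'linear', 'full', 'linear', 'linear', 'linear', 'full']
```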
WeChat trains a diffusion language model: 3x faster than vLLM-deployed AR models, over 10x in low-entropy scenarios
机器之心· 2026-01-03 04:13
Core Viewpoint
- Tencent's WeChat AI team has introduced WeDLM (WeChat Diffusion Language Model), which achieves over 3x acceleration on mathematical reasoning tasks compared to AR models deployed with vLLM, and up to 10x in low-entropy scenarios, while maintaining or even improving generation quality [2][4][13].

Group 1: Introduction and Background
- The mainstream decoding paradigm for large language models is autoregressive (AR) generation, but token-by-token generation limits inference efficiency. Diffusion language models (Diffusion LLMs) offer an alternative by restoring multiple masked tokens in parallel, yet existing models have struggled to beat optimized AR inference engines like vLLM on speed [3].
- The key issue is that most diffusion language models use bidirectional attention, which is incompatible with standard KV caching, so the advantages of parallel prediction never translate into real speed gains [4].

Group 2: WeDLM Model Insights
- WeDLM is the first diffusion language model to surpass equivalent AR models in inference speed under industrial-grade inference-engine (vLLM) optimization [4].
- Its core insight is that mask recovery does not require bidirectional attention: each masked position only needs access to all observed tokens, which can be achieved under standard causal attention [11].
- A key metric is Prefix Cacheability: in KV-cached decoding, only tokens forming a continuous left-to-right prefix can be cached and reused, so inference efficiency depends less on how many tokens are predicted per step than on how many predictions convert into cacheable prefixes [11].

Group 3: Technical Solutions
- WeDLM uses Topological Reordering to keep causal attention while giving masked positions access to the complete observed context: all observed tokens move to the front of the physical sequence while their logical positions are preserved through RoPE positional encoding (a sketch follows this summary) [16].
- Dual-Stream Masking reduces the distribution gap between training and inference by creating a clean "memory stream" and a masked "prediction stream" that share positional encoding [18].
- During inference, Streaming Parallel Decoding submits parsed prefixes immediately rather than waiting for an entire block to complete [21].

Group 4: Performance Metrics
- On mathematical reasoning tasks, WeDLM achieves roughly 3x acceleration and significantly outperforms other diffusion models like LLaDA and Dream in both accuracy and inference speed [13].
- In benchmark evaluations, WeDLM-8B scores an average of 74.72, surpassing Qwen3-8B by 2.1 points, with notable gains on mathematical reasoning tasks such as GSM8K and MATH [24].
- The model shows significant speed advantages across task types: 3-6x on structured mathematical-reasoning outputs, 2-3x on code generation, and over 10x on low-entropy tasks like sequence counting [27].

Group 5: Conclusion
- WeDLM's contribution is to make Prefix Cacheability a primary design goal for parallel text generation. Future diffusion language models should be viewed as efficient multi-token prediction mechanisms, where the value of parallel token generation depends on how quickly those tokens convert into cacheable prefixes [31].
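The article describes Topological Reordering only at a high level. A minimal sketch of the idea, under assumed tensor layouts, is: physically sort observed tokens ahead of masked ones so that causal attention lets every masked slot see every observation, while the carried logical positions let RoPE still encode the true order.

```python
import torch

MASK_ID = 0  # hypothetical mask-token id

def topological_reorder(tokens):
    """Place observed tokens first and masked tokens last, but carry
    the original (logical) position of every token so RoPE can encode
    where each token really lives. Under causal attention, every masked
    slot can now attend to every observed token without bidirectional
    attention, and the observed prefix stays KV-cacheable."""
    logical_pos = torch.arange(tokens.numel())
    observed = tokens != MASK_ID
    order = torch.cat([logical_pos[observed], logical_pos[~observed]])
    return tokens[order], logical_pos[order]  # physical tokens, RoPE position ids

toks = torch.tensor([5, 0, 7, 0, 9])          # 0 = masked position
print(topological_reorder(toks))
# (tensor([5, 7, 9, 0, 0]), tensor([0, 2, 4, 1, 3]))
```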
Skipping "token-by-token generation": Ant Group's Zhao Junbo says diffusion models let us modify tokens directly
36Ke· 2025-12-12 07:17
Core Insights
- The piece focuses on the emerging diffusion architecture for language models, which offers advantages over traditional autoregressive models in speed and computational efficiency [1][4][20].

Group 1: Diffusion Architecture Advantages
- The diffusion architecture allows direct modification and control of tokens during inference, eliminating the need to regenerate entire segments of content as autoregressive models must (a sketch of this editing step follows the summary) [1][5].
- The newly released LLaDA 2.0 model has reached 100 billion parameters, a significant milestone for diffusion language models [1][20].
- Diffusion models are described as "data-hungry," requiring larger training datasets than autoregressive models, but they can also absorb data more quickly [5][8].

Group 2: Technical Developments
- The LLaDA model uses a "fill-in-the-blank" prediction method, in contrast to the sequential token generation of autoregressive models [6][8].
- The architecture combines global and causal attention mechanisms to improve computational efficiency while keeping generated sequences coherent [16].
- The research team has made significant progress on architectural challenges, including integrating mixture of experts (MoE) into the diffusion framework [19].

Group 3: Industry Impact and Future Directions
- Major tech companies, including Google and ByteDance, are actively exploring diffusion models, signaling growing interest in the technology [1][19].
- A new inference engine, dInfer, is expected to boost the performance of diffusion models, with the potential for significant speedups in key applications [24][25].
- The community is encouraged to collaborate on building an ecosystem around diffusion language models, which are still at an early stage of development [27].
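Zhao Junbo's point that diffusion models can modify tokens directly can be illustrated with a single re-masking step. Everything here (`edit_token`, `MASK_ID`, the greedy refill) is an assumed minimal sketch, not Ant Group's implementation.

```python
import torch

MASK_ID = 0  # hypothetical mask-token id

def edit_token(model, seq, pos):
    """Re-mask one position of an already generated sequence and let
    the model re-predict it in place. An AR decoder would have to
    regenerate everything from `pos` onward; a fill-in-the-blank
    diffusion model conditions on both sides and fills just the hole."""
    x = seq.clone()
    x[pos] = MASK_ID
    logits = model(x)                 # (seq_len, vocab)
    x[pos] = logits[pos].argmax(-1)   # greedy refill of the edited slot
    return x

# fixed = edit_token(model, generated_seq, pos=7)
```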