A niche architecture wins big: an editing mechanism pushes a 100B diffusion model to 892 tokens per second
36Kr· 2026-02-11 05:21
Core Insights
- The article highlights the significant advancements of Ant Group's LLaDA2.1 model, which has achieved a peak speed of 892 tokens per second in complex programming tasks, outperforming mainstream autoregressive models that operate at much lower speeds [1][18][20].

Group 1: Model Development and Features
- LLaDA2.1 represents a historic shift from research model to practical tool, showcasing improved efficiency and usability [2][5].
- The model introduces a dual-mode design, allowing users to switch between Speedy Mode and Quality Mode with a single configuration, simplifying user experience and model management [4][6].
- Speedy Mode allows rapid initial draft generation, while Quality Mode focuses on accuracy, catering to different user needs [6][21].

Group 2: Technical Innovations
- The model employs an Error-Correcting Editable (ECE) mechanism, enabling self-correction during the generation process, which addresses the common inconsistency issues of earlier diffusion models [8][13].
- LLaDA2.1 successfully implements reinforcement learning (RL) on a 100B-parameter diffusion model, a feat previously considered impossible, enhancing its performance on alignment tasks [16][22].

Group 3: Performance Metrics
- In benchmark tests, LLaDA2.1 outperformed its predecessor LLaDA2.0 across various tasks, demonstrating superior performance in both speed and quality [22][23].
- The model's peak speed in Speedy Mode reached 892 tokens per second on the HumanEval+ benchmark, while the mini version exceeded 1,500 tokens per second on certain tasks [18][24].

Group 4: Industry Implications
- The advancements in LLaDA2.1 challenge the dominance of autoregressive models, suggesting a potential shift in industry standards toward more efficient and versatile models [20][26].
- The open-sourcing of LLaDA2.1 and its mini version indicates a strategic move to foster wider adoption and innovation within the AI community [24][27].
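The draft-then-edit idea behind the ECE mechanism can be sketched as a toy loop. Everything below — the confidence threshold, the random stand-in "model", the function names — is invented for illustration only and is not Ant Group's implementation:

```python
import random

random.seed(0)

def draft(n_tokens):
    """Hypothetical fast draft pass: propose all positions in parallel.

    Stand-in for a diffusion model's parallel unmasking; here each token
    is just a (token_id, confidence) pair drawn at random."""
    return [(random.randrange(1000), random.random()) for _ in range(n_tokens)]

def edit_pass(tokens, threshold=0.5):
    """Hypothetical error-correcting edit pass: re-predict only the
    low-confidence positions instead of regenerating the whole sequence."""
    edited = 0
    out = []
    for tok, conf in tokens:
        if conf < threshold:
            tok, conf = random.randrange(1000), 1.0  # pretend re-prediction fixes it
            edited += 1
        out.append((tok, conf))
    return out, edited

tokens = draft(n_tokens=8)
tokens, n_edited = edit_pass(tokens)
print(f"re-edited {n_edited} of {len(tokens)} positions")
```

The key contrast with autoregressive decoding is that the edit pass touches only the positions it distrusts; an autoregressive model would have to regenerate everything after the first bad token.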
A milestone moment: a 100B diffusion language model hits 892 tokens per second, and AI's other path is now viable
36Kr· 2026-02-11 04:31
Core Insights
- The release of LLaDA2.1 marks a significant transformation in the field of diffusion language models (dLLM), previously considered a niche area. The new version includes LLaDA2.1-Mini (16 billion parameters) and LLaDA2.1-Flash (100 billion parameters) [1][3].
- LLaDA2.1 achieves a peak speed of 892 tokens per second, demonstrating a practical efficiency advantage and breaking the "fast but inaccurate" paradigm with its error-correcting mechanism [3][10].
- The model introduces a dual-mode system allowing users to switch between quality and speed, effectively addressing the trade-off between the two [15][19].

Model Performance
- LLaDA2.1's 100-billion-parameter version achieved a peak speed of 892 tokens per second, particularly notable given the complexity of the tasks it handles, such as programming benchmarks [10][11].
- The model's architecture allows parallel generation and self-correction, enhancing its usability compared with traditional autoregressive models, which lack this capability [13][14].
- In experimental evaluations, LLaDA2.1 outperformed its predecessor LLaDA2.0 in quality mode across various benchmarks, while also showing significant throughput improvements in speed mode [20][22].

Technical Innovations
- The Error-Correcting Editable (ECE) mechanism allows LLaDA2.1 to draft answers quickly and then edit them, enabling a more flexible and accurate generation process [13][18].
- The model employs a reinforcement learning phase to improve its instruction understanding and alignment with user intent, a first for diffusion models at this scale [16][17].
- The dual-mode design lets users configure the model for either speed or quality, simplifying user experience and model management [15][19].

Industry Implications
- LLaDA2.1's advancements suggest a potential shift in the landscape of AI models, challenging the dominance of autoregressive architectures and opening up new avenues for research and application in language modeling [26].
- The successful implementation of a 100-billion-parameter diffusion model indicates that the barriers to scaling such models may be diminishing, encouraging further investment and exploration in this area [11][26].
- The model's ability to handle complex tasks efficiently positions it as a competitive alternative in the AI landscape, potentially influencing future developments in language processing technologies [10][26].
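The practical weight of the 892 tokens/s figure is easiest to see with back-of-envelope arithmetic. The 892 figure comes from the article; the 60 tokens/s autoregressive baseline below is an assumed round number for illustration, not a measured comparison:

```python
# Latency comparison for generating a medium-sized output.
# diffusion_tps is the article's reported peak; ar_tps is an ASSUMED
# autoregressive baseline chosen only to make the arithmetic concrete.
diffusion_tps = 892
ar_tps = 60
n_tokens = 2_000  # e.g. a medium-sized code file

t_diffusion = n_tokens / diffusion_tps
t_ar = n_tokens / ar_tps
speedup = t_ar / t_diffusion
print(f"diffusion: {t_diffusion:.1f}s, autoregressive: {t_ar:.1f}s, "
      f"speedup: {speedup:.1f}x")
```

Under these assumptions a task that takes half a minute of serial decoding finishes in a couple of seconds, which is why the articles frame the number as a usability threshold rather than a mere benchmark win.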
A niche architecture wins big! An editing mechanism pushes a 100B diffusion model to 892 tokens per second!
量子位· 2026-02-11 01:55
Core Viewpoint
- The article discusses the emergence of the LLaDA2.1 model from Ant Group, which has achieved a remarkable 892 tokens per second in complex programming tasks, a significant advance over traditional autoregressive models [1][3][11].

Group 1: Model Performance and Features
- LLaDA2.1 operates at a 100-billion-parameter scale and has transitioned from research model to practical tool, demonstrating superior efficiency [3][4].
- The model introduces a dual-mode decoding strategy, allowing users to switch between Speedy Mode and Quality Mode with a single configuration, enhancing usability [9][10].
- In Speedy Mode, LLaDA2.1 achieves a peak speed of 892 tokens per second on the HumanEval+ benchmark, while in Quality Mode it surpasses previous models on various reasoning tasks [11][31].

Group 2: Technical Innovations
- The model employs an Error-Correcting Editable (ECE) mechanism, enabling it to generate drafts quickly and then refine them, addressing the limitations of traditional diffusion models [16][21].
- LLaDA2.1 successfully implements reinforcement learning (RL) at the 100-billion scale, enhancing its performance on instruction-following tasks and demonstrating that diffusion models can achieve both speed and understanding [23][26].
- The introduction of the EBPO algorithm enables efficient training and editing, a significant milestone in applying RL to diffusion models [25][28].

Group 3: Competitive Advantage
- LLaDA2.1's benchmark performance shows a significant advantage over mainstream autoregressive architectures, achieving high speeds without compromising quality [29][30].
- The model's ability to maintain quality even in Speedy Mode demonstrates its robustness, balancing speed and accuracy [32].
- A lighter 16-billion-parameter Mini version has been released, achieving peak speeds exceeding 1,500 tokens per second and indicating potential for more lightweight deployments [33].
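The "single configuration" switch the articles describe can be pictured as one knob that trades parallel-unmasking width against extra refinement passes. The field names and values below are invented for illustration; this is not LLaDA2.1's real API:

```python
from dataclasses import dataclass

# Hypothetical decoding configuration sketching the dual-mode idea:
# Speedy Mode unmasks more positions per step with few edit passes,
# Quality Mode unmasks fewer positions but edits more. All names and
# numbers here are assumptions, not the model's actual parameters.
@dataclass
class DecodeConfig:
    mode: str
    tokens_per_step: int  # positions unmasked in parallel each step
    edit_passes: int      # extra error-correcting passes

def make_config(mode: str) -> DecodeConfig:
    if mode == "speedy":
        return DecodeConfig(mode, tokens_per_step=8, edit_passes=1)
    if mode == "quality":
        return DecodeConfig(mode, tokens_per_step=2, edit_passes=4)
    raise ValueError(f"unknown mode: {mode}")

cfg = make_config("speedy")
print(cfg)
```

The design point is that both modes share one set of weights; only the decoding schedule changes, which is why the articles emphasize simplified model management.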
AI has learned many truths, yet it still goes mad
36Kr· 2026-02-09 06:50
Recently, many papers have been discussing the current predicament of Agents. The predicament is real: at the application layer, once an Agent is deprived of man-made crutches like Skills, it is simply unreliable on long-horizon tasks in the real world. This predicament is usually attributed to two causes. The first is the black hole of context. As CL Bench, built a couple of days ago by Tencent chief AI scientist Yao Shunyu and the Hunyuan team, pointed out, the model may simply lack the ability to digest complex context, and so it cannot possibly follow instructions properly. The second is more fatal: the collapse of long-horizon planning. Once the plan grows long, the model starts to get confused, like someone who has had too much to drink, walking straight for two steps and in circles after ten. At the end of January, Anthropic researchers published a major paper, "The Hot Mess of AI," attempting to explain the cause of the second problem, and in the attempt they clearly located the Achilles' heel of autoregressive models (which includes everything Transformer-based). We have all heard Yann LeCun's oft-repeated claim that "autoregressive models only do Next Token Prediction, and therefore can never reach understanding or AGI." Until now, though, that was a judgment or a belief, without empirical evidence. This paper provides some. And it hints at a frightening reality: as models ...
Sebastian Raschka's 2026 predictions: Transformers still reign, but diffusion models are quietly rising
机器之心· 2026-01-14 07:18
Core Insights
- The article discusses the evolving landscape of large language models (LLMs) as of 2026, highlighting a shift from the dominance of the Transformer architecture toward efficiency and hybrid architectures [1][4][5].

Group 1: Transformer Architecture and Efficiency
- The Transformer architecture is expected to remain the foundation of the AI ecosystem for at least the next few years, supported by mature toolchains and optimization strategies [4].
- Recent developments indicate a shift toward hybrid architectures and efficiency improvements rather than a complete overhaul of existing models [5].
- The industry is increasingly focusing on mixed architectures and efficiency, as demonstrated by models like DeepSeek V3 and R1, which use mixture of experts (MoE) and multi-head latent attention (MLA) to reduce inference costs while maintaining large parameter counts [7].

Group 2: Linear and Sparse Attention Mechanisms
- The standard Transformer attention mechanism has O(N^2) complexity, so computational cost grows quadratically with context length [9].
- New models like Qwen3-Next and Kimi Linear adopt hybrid strategies that combine efficient linear layers with full attention layers to balance long-distance dependencies and inference speed [14].

Group 3: Diffusion Language Models
- Diffusion language models (DLMs) are gaining attention for their ability to generate tokens quickly and cost-effectively through parallel generation, in contrast to the serial generation of autoregressive models [12].
- Despite their advantages, DLMs face challenges in integrating tool calls within response chains because of their simultaneous-generation nature [15].
- Research indicates that DLMs may outperform autoregressive models when high-quality data is scarce, since they can benefit from multiple training epochs without overfitting [24][25].

Group 4: Data Scarcity and Learning Efficiency
- The "Crossover" concept suggests that while autoregressive models learn faster with ample data, DLMs excel when data is limited, achieving significant benchmark accuracy with relatively small datasets [27].
- DLMs demonstrate that additional training epochs do not necessarily degrade downstream task performance, offering a potential advantage in an era of data scarcity [28].
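The O(N^2) cost that motivates the linear and hybrid attention work above is easy to make concrete: the attention score matrix is N×N, so doubling the context quadruples the cost of a full attention layer, while a linear-attention layer only doubles. A minimal sketch of that scaling:

```python
# Relative cost of full self-attention vs. a linear-attention layer.
# Full attention builds an N x N score matrix (cost ~ N^2); linear
# attention avoids it (cost ~ N). Constants are dropped: these are
# growth rates, not FLOP counts for any particular model.
def full_attention_cost(n: int) -> int:
    return n * n

def linear_attention_cost(n: int) -> int:
    return n

for n in (1_024, 2_048, 4_096):
    ratio = full_attention_cost(n) / linear_attention_cost(n)
    print(f"N={n}: full/linear cost ratio = {ratio:.0f}")
```

The ratio itself grows with N, which is why hybrid designs keep only a few full-attention layers for long-range dependencies and make the rest linear.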
Small-model layer counts are weirdly superstitious: 12/32/64 layers work well, 16/24/48 layers work badly
量子位· 2026-01-11 04:02
Core Insights
- The article reveals significant findings about the 70M small model, emphasizing that the choice of architecture matters less than previously thought, while the model's "shape" (depth-width ratio) is more critical [1][2].

Group 1: Model Architecture and Performance
- The optimal number of layers for small models is identified as 32, with 12 and 64 layers also performing well, while configurations with 16, 24, and 48 layers yield poor results [2][15].
- The performance gap between "good" and "bad" layer configurations exceeds 6 percentage points, with "good" configurations averaging around 38% accuracy and "bad" configurations around 32% [15][16].
- The hidden dimension must be at least 512 for optimal performance, with the 32-layer configuration achieving the highest score of 38.50% [18][23].

Group 2: Comparative Analysis of Architectures
- A comparison of 12 different architectures, including LLaMA3 and Qwen3, shows that modern architectures perform similarly in the 70M-parameter range, with average differences under 2% [25][26].
- The article notes that improvements in modern architectures are primarily designed for models with over 700 million parameters and provide no measurable advantage for 70M models [27].

Group 3: Diffusion Models vs. Autoregressive Models
- Diffusion models, while slightly lower in average accuracy (31-32%), demonstrate faster inference (3.8 times faster) and lower hallucination rates than autoregressive models [28][30].
- Adding a "Canon layer" can improve factual accuracy by 1% for autoregressive models and over 2% for diffusion models, at minimal additional parameter cost [35][36].

Group 4: New Model Development
- The Dhara-70M model is introduced, combining the best features of autoregressive and diffusion models, built on the LLaMA3-Canon architecture and converted using the WSD method [41][42].
- Dhara-70M has 71.34M parameters, 32 layers, and a hidden size of 384, and is designed for high throughput and factual accuracy [44].

Group 5: Recommendations for Model Builders
- The article advises small-language-model builders to focus on the fundamental depth-width ratio rather than chasing the latest architectural trends, especially for applications requiring high-speed processing and factual accuracy [45].
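The depth-width trade-off at a fixed budget can be sanity-checked with the standard rough estimate of ~12·d² weights per Transformer block (4·d² for attention, 8·d² for the MLP) plus an embedding table. The 32-layer/384-wide shape comes from the article's Dhara-70M specs; the 32k vocabulary and the shallower comparison shape are assumptions for illustration:

```python
# Rough parameter count for a Transformer given depth and width.
# Uses the common estimate of ~12*d^2 weights per block; vocab size
# is an ASSUMED 32k, and biases/norms are ignored.
def approx_params(layers: int, d_model: int, vocab: int = 32_000) -> int:
    blocks = layers * 12 * d_model ** 2
    embeddings = vocab * d_model
    return blocks + embeddings

# Dhara-70M-like shape from the article: 32 layers, hidden size 384
deep_narrow = approx_params(32, 384)
print(f"32 x 384: {deep_narrow / 1e6:.1f}M parameters")

# A shallower, wider shape of broadly comparable size (hypothetical)
shallow_wide = approx_params(12, 640)
print(f"12 x 640: {shallow_wide / 1e6:.1f}M parameters")
```

The estimate lands near the article's 71.34M figure for the 32×384 shape, which illustrates the point: two very different depth-width ratios can sit at almost the same parameter budget, so the "shape" is a genuinely free design variable.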
VLA-Arena: an open-source benchmark framework for systematically evaluating VLA models
具身智能之心· 2025-12-31 00:50
Research Background and Motivation
- Vision-Language-Action models (VLAs) are rapidly evolving toward general robotic policies, achieving capabilities such as cross-embodiment generalization, dexterous manipulation, and instruction following. However, there is a lack of quantitative understanding of these models' capability boundaries, limitations, and failure modes, and existing benchmarks have three core deficiencies [1][4].

Core Design: Structured Tasks and Benchmark Framework
- The VLA-Arena framework is proposed to address these issues, aiming to systematically map the capability frontiers and failure mechanisms of VLA models [1][4].
- The benchmark includes 170 tasks categorized along four dimensions, covering difficulty levels from L0 to L2 [6].

Key Components and Technical Details
- The framework extends the Behavior Domain Definition Language (BDDL) into the Constraint Behavior Domain Definition Language (CBDDL), focusing on two core enhancements [6][7].
- The VLA-Arena-S/M/L datasets are provided, categorized by task level (L0/L1) and trajectory count (10/30/50 per task), constructed from human demonstration data with preprocessing steps to ensure reproducibility [8].

Experimental Design and Main Findings
- The experimental setup evaluates models across two architectural paradigms, autoregressive models and continuous action-generation models, using success rate (SR) and cumulative cost (CC) as evaluation metrics [12][13].
- Key findings:
  1. Models exhibit a strong tendency to memorize rather than generalize, with performance declining drastically on L1 and L2 tasks [14].
  2. Robustness is asymmetric: models are generally resilient to language perturbations but vulnerable to visual disturbances [15].
  3. A trade-off exists between safety and performance, with models struggling to integrate safety constraints effectively [16].
  4. The ability to handle distractors varies, with static distractors posing greater challenges than dynamic ones, and models failing on long-horizon tasks [19].
  5. Increasing data diversity can enhance near-distribution performance but may harm far-distribution generalization [17].

Comparison with the LIBERO Benchmark
- VLA-Arena tasks require deeper language understanding than LIBERO, where performance declines less in the absence of instructions, indicating more robust semantic grounding in real-world scenarios [22].
Skipping word-by-word generation, Ant Group's Zhao Junbo: diffusion models let us edit tokens directly
36Kr· 2025-12-12 07:17
Core Insights
- The news focuses on the emerging diffusion architecture for language models, which offers advantages over traditional autoregressive models in speed and computational efficiency [1][4][20].

Group 1: Diffusion Architecture Advantages
- The diffusion architecture allows direct modification and control of tokens during inference, eliminating the need to regenerate entire segments of content as autoregressive models require [1][5].
- The newly released LLaDA 2.0 model has reached a scale of 100 billion parameters, a significant milestone in the development of diffusion language models [1][20].
- Diffusion models are described as "data-hungry," requiring larger training datasets than autoregressive models, but they can absorb data more quickly [5][8].

Group 2: Technical Developments
- The LLaDA model employs a "fill-in-the-blank" prediction method, in contrast to the sequential token generation of autoregressive models [6][8].
- The architecture includes both global and causal attention mechanisms to improve computational efficiency and maintain coherence in generated sequences [16].
- The research team has made significant strides on architectural challenges, including integrating mixture of experts (MoE) into the diffusion framework [19].

Group 3: Industry Impact and Future Directions
- Major tech companies, including Google and ByteDance, are actively exploring diffusion models, indicating growing interest in the technology [1][19].
- A new inference engine, dInfer, is expected to boost the performance of diffusion models, with potential for significant speed improvements in key applications [24][25].
- The community is encouraged to collaborate on building the ecosystem around diffusion language models, which are still at an early stage of development [27].
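The "fill-in-the-blank" decoding that distinguishes diffusion from autoregressive generation can be sketched as a toy loop: start from a fully masked sequence and unmask several positions per step in parallel. The "model" below is a random choice over a six-letter vocabulary, purely for illustration; real models predict tokens and confidences:

```python
import random

random.seed(1)

MASK = "_"
vocab = list("abcdef")

def denoise_step(seq, k):
    """Fill up to k masked positions in one step (parallel unmasking).

    A real diffusion LM would pick the positions it is most confident
    about; here positions and tokens are chosen at random."""
    masked = [i for i, t in enumerate(seq) if t == MASK]
    for i in random.sample(masked, min(k, len(masked))):
        seq[i] = random.choice(vocab)
    return seq

seq = [MASK] * 12
steps = 0
while MASK in seq:
    seq = denoise_step(seq, k=4)
    steps += 1

print("".join(seq), "in", steps, "steps")  # 12 tokens in 3 parallel steps
```

An autoregressive model would need 12 sequential steps for the same output length; the parallel schedule finishes in 3, and any already-filled position can still be re-masked and edited later, which is the "directly modify tokens" property the talk emphasizes.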
Skip word-by-word generation! Ant Group's Zhao Junbo: diffusion models let us edit tokens directly | MEET2026
量子位· 2025-12-12 03:00
Core Viewpoint
- The article discusses the shift from autoregressive models to the diffusion architecture in language models, highlighting the potential for faster generation and lower computational cost [2][8].

Group 1: Diffusion Architecture Insights
- The diffusion architecture allows direct modification and control of tokens during inference, unlike autoregressive models, which must regenerate entire segments [2][15].
- The recent release of LLaDA 2.0 marks a significant milestone, reaching a scale of 100 billion parameters for diffusion language models [4][44].
- Diffusion-model development is still in its early stages, but it has attracted attention from major companies like Google and ByteDance, as well as several startups [5][41].

Group 2: Technical Aspects and Comparisons
- Diffusion models operate on a "fill-in-the-blank" mechanism rather than sequential token generation, which can lead to more efficient data utilization [12][21].
- In terms of parameter efficiency, diffusion models can match autoregressive performance with fewer parameters under the same computational constraints [15][23].
- The unique characteristics of diffusion models allow continuous training, unlike autoregressive models, which plateau after several epochs [24][26].

Group 3: Future Directions and Community Engagement
- The article emphasizes the need to explore scaling laws specific to diffusion language models, which differ from those of autoregressive models [56].
- The community is encouraged to participate in developing and optimizing diffusion models, as the ecosystem is still in its infancy [56].
- Upcoming collaborations and API releases are planned to improve accessibility and integration of diffusion models into various applications [51].
Express | A Stanford professor's startup: Inception raises a $50 million seed round, using diffusion models to unlock real-time AI applications
Z Potentials· 2025-11-07 02:12
Core Insights
- The article discusses the current surge of funding into AI startups, highlighting it as a golden period for AI researchers to validate their ideas [1].
- Inception, a startup developing diffusion-based AI models, recently raised $50 million in seed funding, led by Menlo Ventures with participation from several notable investors [2].

Company Overview
- Inception focuses on developing diffusion models, which generate outputs through iterative refinement rather than sequential generation [3].
- Project leader Stefano Ermon researched diffusion models before the recent AI boom and aims to apply them to a broader range of tasks [3].

Technology and Innovation
- Inception has released a new version of its Mercury model, designed specifically for software development and already integrated into various development tools [3].
- Ermon claims that diffusion-based models will significantly optimize two critical metrics, latency and computational cost, stating that these models are faster and more efficient than those built by other companies [3][5].
- Diffusion models differ structurally from the autoregressive models that dominate text-based AI services and are believed to perform better when handling large volumes of text or under data constraints [5].

Performance Metrics
- Diffusion models exhibit greater flexibility in hardware utilization, which is increasingly important as AI's infrastructure demands grow [5].
- Ermon's benchmarks indicate the models can process over 1,000 tokens per second, surpassing existing autoregressive technologies thanks to inherent support for parallel processing [5].