dInfer
Skipping "Token-by-Token Generation": Ant Group's Zhao Junbo Says Diffusion Models Let Us Edit Tokens Directly
36Ke· 2025-12-12 07:17
While mainstream large language models still run on autoregressive architectures, some researchers have already set their sights on diffusion. At QbitAI's MEET2026 Intelligent Future Conference, Zhao Junbo, a researcher under Zhejiang University's Hundred Talents Program, doctoral supervisor, and senior technical expert at Ant Group, argued that a diffusion architecture can directly edit and control tokens during inference, with no need to regenerate an entire passage the way an autoregressive model must. This means that, compared with autoregressive models, diffusion models in theory promise faster generation and lower compute cost.

On that basis, he and his team are betting heavily on the diffusion architecture and are working to chart the scaling laws unique to diffusion language models. As a key milestone in that effort, they recently released and open-sourced LLaDA 2.0, the first diffusion language model scaled to the hundred-billion-parameter class. Zhao acknowledged that the field is still early in both training and inference, but it is gaining momentum fast, drawing in giants such as Google and ByteDance alongside a wave of startups.

Editor's note: shortly after the MEET2026 conference concluded, Zhao Junbo and his team also published a new technical report revealing the key technical choices behind the hundred-billion-scale diffusion language model.

Report title: LLaDA2.0: Scaling Up Diffusion Language Models to 100B
Report link (GitHub): https://github.com/inclusionAI/LLaDA2.0 ...
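Editorial aside: the contrast Zhao draws is easy to see in code. The minimal sketch below is not LLaDA's actual implementation; all names are made up for illustration. It shows why an autoregressive decoder can never revise a token once emitted, while a masked-diffusion decoder holds the whole sequence and can re-mask and rewrite low-confidence positions on later iterations.

```python
# Toy contrast between autoregressive and masked-diffusion decoding.
# Illustrative only -- not LLaDA's real code.
import random

MASK = "<mask>"

def ar_decode(model_step, length):
    """Autoregressive: each token is fixed once emitted."""
    seq = []
    for _ in range(length):
        seq.append(model_step(seq))  # no way to revise seq[:-1] afterwards
    return seq

def diffusion_decode(model_fill, length, iters=4, keep=0.7):
    """Masked diffusion: start fully masked, fill every position in
    parallel each iteration, and re-mask (i.e. revise) uncertain ones."""
    seq = [MASK] * length
    for _ in range(iters):
        proposals = model_fill(seq)  # (token, confidence) for every position
        seq = [tok if conf >= keep else MASK for tok, conf in proposals]
        if MASK not in seq:
            break
    return seq

# Toy stand-ins for real models, just to make the control flow concrete.
toy_ar = lambda prefix: f"t{len(prefix)}"
toy_fill = lambda seq: [(f"t{i}", random.random()) for i in range(len(seq))]
print(ar_decode(toy_ar, length=8))
print(diffusion_decode(toy_fill, length=8))
```

The re-masking step inside `diffusion_decode` is the "directly modify tokens" property Zhao highlights; the autoregressive loop has no analogous hook.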
10x Faster Inference: Ant Group Open-Sources dInfer, the Industry's First High-Performance Diffusion Language Model Inference Framework
Ji Qi Zhi Xin· 2025-10-13 09:24
Core Insights
- Ant Group has launched dInfer, the industry's first high-performance inference framework for diffusion large language models (dLLMs), achieving more than 10 times the inference speed of Fast-dLLM [2][29]
- dInfer sets a new performance milestone, reaching a throughput of 1011 tokens per second in single-batch inference and surpassing highly optimized autoregressive (AR) models [29]

Group 1: dInfer Framework
- dInfer supports multiple dLLM architectures, including LLaDA, LLaDA-MoE, and LLaDA-MoE-TD, with an emphasis on modularity and scalability [9][20]
- The framework integrates four core modules: Model, KV-Cache Manager, Iteration Manager, and Decoder, which developers can customize and optimize independently (see the pipeline sketch below) [11][13]
- dInfer addresses three core challenges in dLLM inference: high computational cost, KV-cache invalidation, and the complexity of parallel decoding [12][19]

Group 2: Performance Enhancements
- dInfer employs a "Vicinity KV-Cache Refresh" strategy that selectively recomputes KV caches near updated positions, cutting computation while preserving generation quality (see the cache-refresh sketch below) [15][17]
- Through a range of system-level optimizations, the framework brings the forward-pass speed of dLLMs up to parity with AR models [18]
- It introduces hierarchical and credit decoding algorithms that maximize the number of tokens decoded in parallel per iteration, without any additional training (see the decoding sketch below) [19][20]

Group 3: Performance Metrics
- In tests on 8 NVIDIA H800 GPUs, dInfer achieved an average inference speed of 681 tokens per second, 10.7 times faster than Fast-dLLM [29]
- Combined with trajectory distillation, dInfer's average inference speed rose to 847 tokens per second, exceeding AR-model performance by more than 3 times [24][29]
- dInfer's performance in code-generation tasks set a record, demonstrating significant speed advantages in latency-sensitive scenarios [29]

Group 4: Open Source and Community Engagement
- The release of dInfer marks a significant step toward practical efficiency for diffusion language models, and Ant Group invites global developers and researchers to collaborate on a more efficient and open AI ecosystem [28][25]
- The complete code, technical report, and experimental configurations for dInfer v0.1 have been open-sourced [27][28]
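The four-module decomposition described above suggests a clean separation of concerns. The sketch below is a hypothetical rendering of such a loop: the class names, signatures, and `generate` logic are illustrative assumptions, not dInfer's real API (the actual interfaces live in the open-sourced repository).

```python
# Hypothetical four-module dLLM inference loop in the spirit of dInfer's
# described architecture. All names and signatures are illustrative.
from dataclasses import dataclass, field
from typing import Protocol

class Model(Protocol):
    def forward(self, tokens: list[int], cache: dict) -> list[list[float]]: ...

class KVCacheManager(Protocol):
    def refresh(self, cache: dict, active: range) -> dict: ...

class Decoder(Protocol):
    def commit(self, logits: list[list[float]], tokens: list[int]) -> list[int]: ...

@dataclass
class IterationManager:
    """Drives denoising iterations until no masked positions remain.
    Each collaborator is swappable, mirroring the modularity the
    framework is described as emphasizing."""
    model: Model
    cache_mgr: KVCacheManager
    decoder: Decoder
    mask_id: int = 0
    cache: dict = field(default_factory=dict)

    def generate(self, tokens: list[int], max_iters: int = 64) -> list[int]:
        for _ in range(max_iters):
            self.cache = self.cache_mgr.refresh(self.cache, range(len(tokens)))
            logits = self.model.forward(tokens, self.cache)  # one parallel forward pass
            tokens = self.decoder.commit(logits, tokens)     # commit some positions
            if self.mask_id not in tokens:                   # fully decoded
                break
        return tokens
```

Because cache policy, iteration schedule, and decoding rule are separate objects, a developer can mix, say, a more aggressive cache policy with a conservative decoder without touching the loop itself.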
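"Vicinity KV-Cache Refresh" is described only at a high level in the coverage, but the underlying idea, recompute KV entries only in a window around positions that changed and reuse possibly stale entries elsewhere, can be sketched as follows. The function name, window policy, and cache layout here are assumptions for illustration, not dInfer's actual data structures.

```python
# Hedged sketch of a "vicinity refresh" cache policy: recompute K/V only
# near positions rewritten this iteration; reuse stale entries elsewhere.
def vicinity_refresh(kv_cache, changed_positions, window=8, seq_len=None):
    """Return the set of positions whose K/V must be recomputed.

    kv_cache          -- dict: position -> (key, value) entries (stale allowed)
    changed_positions -- positions whose tokens were rewritten this iteration
    window            -- radius around a change where staleness is not tolerated
    """
    seq_len = seq_len if seq_len is not None else (max(kv_cache) + 1 if kv_cache else 0)
    dirty = set()
    for p in changed_positions:
        lo, hi = max(0, p - window), min(seq_len, p + window + 1)
        dirty.update(range(lo, hi))
    # Everything outside `dirty` keeps its cached K/V from earlier iterations.
    for pos in dirty:
        kv_cache.pop(pos, None)  # force recomputation on the next forward pass
    return dirty

# Example: tokens at positions 5 and 21 were revised; only their vicinity
# is recomputed, while the rest of a 64-token cache is reused.
cache = {i: ("K", "V") for i in range(64)}
print(sorted(vicinity_refresh(cache, [5, 21], window=4)))
```

This is what makes caching viable for dLLMs at all: a naive policy would invalidate the entire cache every iteration, since any position may change.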
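The hierarchical and credit decoding algorithms themselves are not detailed in these articles. The sketch below instead shows the generic confidence-gated parallel commit that such schemes build on: in each iteration, every masked position whose prediction clears a threshold is committed at once. The threshold rule and the single-token fallback are assumptions; dInfer's variants refine how the commit set is chosen.

```python
# Hedged sketch of confidence-gated parallel decoding for a masked dLLM.
import math

def parallel_commit(logits, tokens, mask_id, threshold=0.9):
    """Commit all masked positions with softmax confidence >= threshold.
    Always commits at least the single most confident position so the
    outer loop is guaranteed to make progress."""
    best = None  # (confidence, position, token) of the best candidate seen
    for pos, tok in enumerate(tokens):
        if tok != mask_id:
            continue
        row = logits[pos]
        z = [math.exp(v - max(row)) for v in row]   # stable softmax
        probs = [v / sum(z) for v in z]
        conf = max(probs)
        cand = probs.index(conf)
        if best is None or conf > best[0]:
            best = (conf, pos, cand)
        if conf >= threshold:
            tokens[pos] = cand                       # commit in parallel
    if best and tokens[best[1]] == mask_id:          # nothing cleared the bar
        tokens[best[1]] = best[2]
    return tokens

# Example: 3 masked positions; two are confident enough to land together,
# so this iteration decodes 2 tokens instead of 1.
logits = [[5.0, 0.1, 0.1], [1.0, 1.1, 1.0], [0.2, 0.1, 6.0]]
print(parallel_commit(logits, [9, 9, 9], mask_id=9, threshold=0.9))
```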
10x Inference Performance Gain: Ant Group Open-Sources the High-Performance Diffusion Language Model Inference Framework dInfer
Huan Qiu Wang· 2025-10-13 09:03
Core Insights
- Ant Group has officially announced the open-source release of dInfer, the industry's first high-performance inference framework for diffusion language models [1][5]
- dInfer demonstrates a significant improvement in inference speed, achieving a 10.7 times increase over NVIDIA's Fast-dLLM framework and reaching 1011 tokens per second on the HumanEval code-generation task [1][4]
- The framework addresses the key challenges of diffusion language model inference: high computational cost, KV-cache invalidation, and parallel decoding [1][2]

Summary by Sections
- **Performance Metrics**
  - dInfer achieves an average inference speed of 681 tokens per second, compared to 63.6 for Fast-dLLM, a 10.7 times improvement (see the quick check below) [4]
  - Compared to the AR model Qwen2.5-3B, dInfer's average inference speed is 2.5 times faster, at 681 versus 277 tokens per second [5]
- **Technical Architecture**
  - dInfer's modular architecture comprises four core components: Model, KV-Cache Manager, Iteration Manager, and Decoder, which developers can customize and optimize independently [2]
  - Each module integrates targeted solutions to the three main challenges facing diffusion language models [2]
- **Industry Impact**
  - The launch of dInfer signifies a critical step in moving diffusion language models from theoretical feasibility to practical efficiency, connecting cutting-edge research with industrial application [5]
  - Ant Group invites global developers and researchers to explore the potential of diffusion language models and build a more efficient and open AI ecosystem [5]
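The speedup multiples quoted across these articles are mutually consistent with the raw TPS figures, as a quick check shows:

```python
# Sanity check: the claimed multiples follow from the reported TPS numbers.
fast_dllm, dinfer, qwen_vllm = 63.6, 681, 277
print(f"vs Fast-dLLM:          {dinfer / fast_dllm:.1f}x")  # -> 10.7x
print(f"vs Qwen2.5-3B on vLLM: {dinfer / qwen_vllm:.1f}x")  # -> 2.5x
```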
Surpassing Autoregressive Models for the First Time! Ant Group Open-Sources dInfer, the Industry's First High-Performance Diffusion Language Model Inference Framework
Xin Lang Ke Ji· 2025-10-13 09:00
Core Insights
- Ant Group has officially open-sourced dInfer, the industry's first high-performance diffusion language model inference framework, significantly enhancing the efficiency of diffusion language models [1][2]

Performance Metrics
- dInfer achieves a 10.7 times improvement in inference speed over NVIDIA's Fast-dLLM framework, lifting average tokens per second (TPS) from 63.6 to 681 [1]
- In the HumanEval code-generation task, dInfer reaches 1011 tokens per second in single-batch inference, surpassing autoregressive models for the first time in the open-source community [1]
- Compared with the vLLM framework running the Qwen2.5-3B model, dInfer's average inference speed is 2.5 times faster, at 681 TPS versus 277 TPS [1]

Industry Impact
- The launch of dInfer marks a critical step in moving diffusion language models from theoretical feasibility to practical efficiency, connecting cutting-edge research with industrial application [2]
- Ant Group invites global developers and researchers to explore the vast potential of diffusion language models and build a more efficient and open AI ecosystem [2]