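10x Faster Inference: Ant Group Open-Sources dInfer, the Industry's First High-Performance Inference Framework for Diffusion Language Models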

Core Insights
- Ant Group has launched dInfer, the industry's first high-performance inference framework for diffusion large language models (dLLMs), delivering more than 10 times the inference speed of Fast-dLLM [2][29]
- dInfer sets a new performance milestone, reaching a throughput of 1011 tokens per second in single-batch inference and surpassing highly optimized autoregressive (AR) models [29]

Group 1: dInfer Framework
- dInfer is designed to support multiple dLLM architectures, including LLaDA, LLaDA-MoE, and LLaDA-MoE-TD, with an emphasis on modularity and scalability [9][20]
- The framework integrates four core modules (Model, KV Cache Manager, Iteration Manager, and Decoder), each of which developers can customize and optimize with their own strategies [11][13]
- dInfer addresses three core challenges in dLLM inference: high computational cost, KV cache invalidation, and the complexity of parallel decoding [12][19]

Group 2: Performance Enhancements
- dInfer employs a "Vicinity KV-Cache Refresh" strategy that selectively recalculates KV caches, reducing computational cost while maintaining generation quality [15][17]
- Through a series of system-level optimizations, the framework brings the forward computation speed of dLLMs up to that of AR models [18]
- It introduces hierarchical decoding and credit decoding algorithms, which maximize the number of tokens decoded in parallel without additional training [19][20]

Group 3: Performance Metrics
- In tests on 8 NVIDIA H800 GPUs, dInfer achieved an average inference speed of 681 tokens per second, 10.7 times that of Fast-dLLM [29]
- Combined with trajectory distillation, dInfer's average inference speed rose to 847 tokens per second, more than 3 times that of AR models [24][29]
- dInfer set a record on code generation tasks, demonstrating a significant speed advantage in latency-sensitive scenarios [29]

Group 4: Open Source and Community Engagement
- The release of dInfer marks a significant step toward making diffusion language models efficient in practice, and invites developers and researchers worldwide to help build a more efficient and open AI ecosystem [28][25]
- The complete code, technical report, and experimental configurations for dInfer v0.1 have been open-sourced [27][28]
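
The speedup figures in Group 3 imply baseline throughputs that the summary does not state. As a back-of-the-envelope check, the sketch below derives them, assuming that "10.7 times" is a plain throughput ratio and using a ratio of exactly 3 for "more than 3 times", which gives an upper bound on the AR figure. The function name is hypothetical and the derived baselines are estimates, not reported measurements.

```python
def implied_baseline(throughput_tps: float, speedup: float) -> float:
    """Baseline throughput implied by a measured throughput and a speedup ratio."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return throughput_tps / speedup


# Figures quoted in the summary (tokens per second, 8 NVIDIA H800 GPUs).
DINFER_AVG_TPS = 681.0     # dInfer average
DINFER_TD_AVG_TPS = 847.0  # dInfer with trajectory distillation

# Assumption: both speedups are simple throughput ratios.
fast_dllm_tps = implied_baseline(DINFER_AVG_TPS, 10.7)
ar_upper_bound_tps = implied_baseline(DINFER_TD_AVG_TPS, 3.0)

print(f"Implied Fast-dLLM throughput: about {fast_dllm_tps:.1f} tokens/s")
print(f"Implied AR throughput: below about {ar_upper_bound_tps:.1f} tokens/s")
```

Under these assumptions, the Fast-dLLM baseline comes out in the low 60s of tokens per second and the AR baseline below roughly 280.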
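
To make "tokens decoded in parallel" concrete, here is a minimal sketch of generic confidence-threshold parallel decoding, a common baseline for dLLMs. It is not dInfer's hierarchical or credit decoding, whose details the summary does not give. The function name, the threshold value, and the token ids are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Prediction = Tuple[int, float]  # (predicted token id, confidence in [0, 1])


def parallel_decode_step(
    tokens: List[Optional[int]],
    predictions: Dict[int, Prediction],
    threshold: float = 0.9,
) -> List[int]:
    """Fill masked positions (None) whose confidence meets the threshold.

    If no prediction clears the threshold, commit only the single most
    confident one so that every step makes progress.
    Returns the positions decoded in this step; `tokens` is updated in place.
    """
    masked = [i for i, t in enumerate(tokens) if t is None and i in predictions]
    if not masked:
        return []
    accepted = [i for i in masked if predictions[i][1] >= threshold]
    if not accepted:
        accepted = [max(masked, key=lambda i: predictions[i][1])]
    for i in accepted:
        tokens[i] = predictions[i][0]
    return accepted


# Made-up example: one known token followed by three masked positions.
tokens: List[Optional[int]] = [101, None, None, None]
step1 = parallel_decode_step(tokens, {1: (7, 0.97), 2: (8, 0.55), 3: (9, 0.93)})
step2 = parallel_decode_step(tokens, {2: (8, 0.55)})
print(step1, step2, tokens)
```

The first step commits two positions at once and the second falls back to the single remaining low-confidence position. The more positions each step accepts, the fewer forward passes are needed, which is the quantity the decoding algorithms in Group 2 aim to maximize.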
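
The selective recalculation behind "Vicinity KV-Cache Refresh" can be pictured as choosing a small set of positions whose cached keys and values are recomputed at each step. The summary does not say how that set is chosen. The sketch below assumes it is the block currently being decoded plus a fixed number of neighboring tokens on each side. The function name and the `radius` parameter are hypothetical and are not taken from dInfer's code.

```python
from typing import List


def vicinity_refresh_positions(
    block_start: int, block_end: int, seq_len: int, radius: int
) -> List[int]:
    """Positions whose cached K/V entries would be recomputed.

    Assumption: the refreshed set is the current block
    [block_start, block_end) plus `radius` tokens on each side,
    clamped to the sequence bounds.
    """
    if not 0 <= block_start < block_end <= seq_len:
        raise ValueError("block must lie within the sequence")
    if radius < 0:
        raise ValueError("radius must be non-negative")
    lo = max(0, block_start - radius)
    hi = min(seq_len, block_end + radius)
    return list(range(lo, hi))


# Made-up sizes: a 20-token sequence, decoding block [8, 12), radius 2.
positions = vicinity_refresh_positions(8, 12, 20, 2)
refreshed_fraction = len(positions) / 20
print(positions, refreshed_fraction)
```

In this toy setting only 8 of the 20 cached entries are recomputed and the rest are reused, which illustrates the trade-off between compute cost and cache freshness that the strategy targets.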