10x Inference Speedup: Ant Group Open-Sources dInfer, a High-Performance Inference Framework for Diffusion Language Models

Core Insights
- Ant Group has officially announced the open-source release of dInfer, the industry's first high-performance inference framework for diffusion language models [1][5]
- dInfer delivers a 10.7x inference speedup over NVIDIA's Fast-dLLM framework, reaching 1011 tokens per second on the HumanEval code-generation task [1][4]
- The framework addresses the key challenges of diffusion-language-model inference: high computational cost, KV-cache invalidation, and the difficulty of parallel decoding [1][2]

Summary by Sections
- Performance Metrics
  - dInfer achieves an average inference speed of 681 tokens per second versus 63.6 tokens per second for Fast-dLLM, a 10.7x improvement (see the worked check after this summary) [4]
  - Compared with the autoregressive (AR) model Qwen2.5-3B, dInfer's average inference speed is 2.5x higher: 681 tokens per second versus 277 tokens per second [5]
- Technical Architecture
  - dInfer is designed around a modular architecture with four core components: Model, KV-Cache Manager, Iteration Manager, and Decoder, letting developers customize and optimize each part independently (a hypothetical sketch of such a pipeline follows this summary) [2]
  - Each module integrates targeted solutions to the three main challenges above [2]
- Industry Impact
  - The launch of dInfer marks a critical step in moving diffusion language models from theoretical feasibility to practical efficiency, connecting cutting-edge research with industrial applications [5]
  - Ant Group invites global developers and researchers to explore the potential of diffusion language models, aiming to build a more efficient and open AI ecosystem [5]
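As a quick check, the headline multipliers follow directly from the throughput figures cited in the Performance Metrics section:

$$
\frac{681\ \text{tok/s}}{63.6\ \text{tok/s}} \approx 10.7\times,
\qquad
\frac{681\ \text{tok/s}}{277\ \text{tok/s}} \approx 2.46 \approx 2.5\times
$$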
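The summary names dInfer's four components but not its API. The sketch below is a minimal, hypothetical illustration of how such a four-module pipeline could fit together; every class, method, and parameter name here (InferencePipeline, KVCacheManager, IterationManager, Decoder, generate, the confidence threshold) is an assumption made for illustration, not dInfer's actual interface.

```python
# Hypothetical sketch of a modular diffusion-LM inference pipeline in the
# spirit of dInfer's four components. None of these names come from the
# real dInfer API; they only illustrate the pluggable design described above.
from dataclasses import dataclass, field
from typing import Protocol


class Model(Protocol):
    """Scores every position of the sequence in one forward pass."""
    def forward(self, tokens: list[int], cache: dict) -> list[list[float]]: ...


class KVCacheManager:
    """Tracks which cached key/value entries remain valid between denoising
    iterations (diffusion LMs can invalidate cache entries because earlier
    positions may be re-predicted)."""
    def __init__(self) -> None:
        self.cache: dict = {}

    def refresh(self, changed_positions: set[int]) -> dict:
        # Drop entries whose positions were rewritten in the last iteration.
        for pos in changed_positions:
            self.cache.pop(pos, None)
        return self.cache


class IterationManager:
    """Controls how many denoising iterations run and when to stop."""
    def __init__(self, max_iters: int = 16) -> None:
        self.max_iters = max_iters

    def iterations(self) -> range:
        return range(self.max_iters)


class Decoder:
    """Commits tokens from the model's scores; a parallel decoder may
    commit several confident positions in a single iteration."""
    def __init__(self, threshold: float = 0.9) -> None:
        self.threshold = threshold

    def commit(self, scores: list[list[float]],
               tokens: list[int]) -> tuple[list[int], set[int]]:
        changed: set[int] = set()
        for pos, dist in enumerate(scores):
            best = max(range(len(dist)), key=dist.__getitem__)  # argmax token
            if dist[best] >= self.threshold and tokens[pos] != best:
                tokens[pos] = best
                changed.add(pos)
        return tokens, changed


@dataclass
class InferencePipeline:
    """Composes the four modules; any one can be swapped independently."""
    model: Model
    cache_mgr: KVCacheManager = field(default_factory=KVCacheManager)
    iter_mgr: IterationManager = field(default_factory=IterationManager)
    decoder: Decoder = field(default_factory=Decoder)

    def generate(self, tokens: list[int]) -> list[int]:
        for _ in self.iter_mgr.iterations():
            scores = self.model.forward(tokens, self.cache_mgr.cache)
            tokens, changed = self.decoder.commit(scores, tokens)
            if not changed:          # converged: no position was rewritten
                break
            self.cache_mgr.refresh(changed)
        return tokens
```

The design point the sketch tries to capture: because each concern lives in its own object, a developer can swap one module (say, a more aggressive parallel Decoder or a different cache-invalidation policy) without touching the rest of the loop, which is the kind of per-module customization the summary attributes to dInfer.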