After Google, NVIDIA enters diffusion large language models: Fast-dLLM boosts inference speed by up to 27.6x
机器之心· 2025-05-30 03:28
Core Viewpoint
- The article discusses the breakthrough in inference speed for diffusion language models brought by Fast-dLLM, a training-free acceleration approach that enhances the practical applicability of large language models (LLMs) [2][20].

Group 1: Core Technology
- Fast-dLLM employs a Block-Wise KV Cache mechanism that achieves over 90% activation reuse, significantly improving computational efficiency for long-sequence inference [6][12].
- Confidence-Aware Parallel Decoding decodes multiple tokens in parallel while preserving token dependencies, committing only tokens whose confidence clears a threshold so that generation remains coherent [9][13].
- A dual-cache strategy caches prefix and suffix attention activations simultaneously, reducing redundant computation and further improving performance [12]. Hedged sketches of the cache reuse and the parallel decoding rule appear after the summary below.

Group 2: Performance Breakthrough
- Fast-dLLM achieves a 27.6x end-to-end speedup on long-text generation tasks, reducing single-step latency from 0.26 s to 0.09 s and overall generation time from 266 s to 12 s [18].
- Accuracy loss on mainstream benchmarks stays under 2%, demonstrating that quality is maintained while speed improves [19].

Group 3: Application Value
- Because Fast-dLLM requires no training, it serves as a drop-in inference optimization that integrates quickly into existing systems without altering model architecture or training processes [20].
- It is compatible with existing diffusion models such as LLaDA and Dream, achieving significant throughput improvements while maintaining competitive accuracy [21].

Group 4: Summary and Outlook
- Fast-dLLM represents a significant advancement in inference efficiency for diffusion language models while ensuring stable generation quality, paving the way for broader applications in real-time interaction and long-text generation [23].
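
To make the block-wise, prefix/suffix cache reuse described in Group 1 concrete, here is a minimal sketch assuming a toy single-head attention with identity key/value projections. The names (`kv`, `attend`, `denoise_block`), shapes, and step counts are hypothetical stand-ins, not the Fast-dLLM implementation; the point illustrated is only that prefix and suffix activations can be computed once per block and reused across that block's refinement steps, so only the active block is re-encoded each step.

```python
# Hypothetical sketch of a block-wise dual KV cache for a diffusion LM.
# All names are illustrative assumptions, not the Fast-dLLM codebase.
import torch

D = 16  # toy hidden size

def kv(x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Stand-in key/value projection (identity for simplicity)."""
    return x, x

def attend(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Single-head scaled dot-product attention."""
    w = torch.softmax(q @ k.T / D ** 0.5, dim=-1)
    return w @ v

def denoise_block(prefix: torch.Tensor, block: torch.Tensor,
                  suffix: torch.Tensor, steps: int = 4) -> torch.Tensor:
    # Cache prefix/suffix activations once: they are not rewritten while
    # this block is being refined, which is what makes the reuse safe.
    k_pre, v_pre = kv(prefix)
    k_suf, v_suf = kv(suffix)
    for _ in range(steps):
        k_blk, v_blk = kv(block)              # only the block is re-encoded
        k = torch.cat([k_pre, k_blk, k_suf])
        v = torch.cat([v_pre, v_blk, v_suf])
        block = attend(block, k, v)           # one refinement step
    return block

out = denoise_block(torch.randn(5, D), torch.randn(3, D), torch.randn(4, D))
print(out.shape)  # torch.Size([3, 16])
```

Because the prefix and suffix key/value tensors are concatenated back in on every step rather than recomputed, the per-step cost is dominated by the small active block, which is the intuition behind the reported high activation-reuse rate.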
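
The confidence-aware parallel decoding rule can be sketched in a similarly hedged way: at each step the model scores every still-masked position, commits only predictions whose confidence clears a threshold, and leaves the rest masked for the next pass. The model call, threshold value, and vocabulary below are assumed stand-ins, not the paper's actual setup.

```python
# Hypothetical sketch of confidence-aware parallel decoding for a masked
# diffusion LM. `model_logits` stands in for the real denoiser.
import torch

VOCAB_SIZE = 32
MASK_ID = VOCAB_SIZE     # mask id outside the vocab so it never collides
THRESHOLD = 0.9          # confidence needed to commit a token in parallel

def model_logits(tokens: torch.Tensor) -> torch.Tensor:
    """Stand-in denoiser: random logits of shape (seq_len, vocab)."""
    return torch.randn(tokens.shape[0], VOCAB_SIZE)

def decode_block(tokens: torch.Tensor, max_steps: int = 16) -> torch.Tensor:
    """Iteratively unmask a block, committing only high-confidence tokens."""
    for _ in range(max_steps):
        masked = tokens == MASK_ID
        if not masked.any():
            break
        probs = torch.softmax(model_logits(tokens), dim=-1)
        conf, pred = probs.max(dim=-1)
        # Commit every masked position whose confidence clears the threshold.
        accept = masked & (conf >= THRESHOLD)
        # Always commit at least the most confident masked token so the
        # loop makes progress even when nothing clears the threshold.
        if not accept.any():
            idx = torch.where(masked, conf, torch.full_like(conf, -1.0)).argmax()
            accept[idx] = True
        tokens = torch.where(accept, pred, tokens)
    return tokens

block = torch.full((8,), MASK_ID)
print(decode_block(block))
```

Intuitively, a high threshold makes the loop fall back toward committing only a few tokens per step on hard positions while still finalizing easy tokens in parallel, which is consistent with the article's claim that speed improves with under 2% accuracy loss.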