7B diffusion language model hits 1000+ tokens/s on a single sample! Shanghai Jiao Tong University and Huawei jointly release LoPA
机器之心 · 2025-12-31 08:11
Core Insights
- The article covers a breakthrough for diffusion large language models (dLLMs): a new decoding algorithm called LoPA (Lookahead Parallel Decoding) that significantly improves inference speed and parallelism [2][3][36].

Group 1: LoPA Algorithm Features
- LoPA achieves a high degree of parallelism, raising tokens generated per step (TPF) from 3.1 to 10.1 and surpassing traditional methods [3][7].
- The algorithm is plug-and-play, requiring no retraining or fine-tuning of the model [8].
- It introduces a lookahead parallel decoding mechanism that actively explores different token-filling orders to avoid local optima [9].
- The accompanying LoPA-Dist system maximizes hardware utilization by supporting both CUDA and Ascend platforms [10].

Group 2: Performance Metrics
- LoPA demonstrated a single-sample throughput of 1073.9 tokens/s on the Huawei Ascend 910C platform, significantly outperforming baseline models [3][33].
- In experiments, LoPA integrated with D2F-Dream reached a TPF of 10.1 on the GSM8K benchmark, drastically reducing the total number of inference steps [28][31].
- These results indicate that the system converts algorithmic parallelism into substantial real-world acceleration, exceeding 1000 tokens/s on dedicated engines [34].

Group 3: System Design and Optimization
- The LoPA-Dist distributed inference system employs a new branch-parallelism strategy that can be combined with existing tensor-parallelism methods [25].
- It is optimized per hardware platform: LoPA-Dist-NV targets low-latency scenarios, while LoPA-Dist-Ascend targets high-throughput serving environments [26].

Group 4: Future Directions
- The team plans to explore applying LoPA to other dLLM architectures, such as SDAR, to further advance efficient generative models [36].
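The lookahead idea described above — exploring several token-filling orders in parallel branches rather than committing to a single greedy order — can be illustrated with a toy sketch. Everything here (the confidence model, the branch-scoring rule, the token names) is an illustrative assumption and not the actual LoPA implementation:

```python
# Toy sketch of lookahead parallel decoding (an illustrative assumption,
# not the actual LoPA code): a greedy diffusion decoder fills the
# highest-confidence masked position each step, which can get stuck in a
# local optimum; the lookahead variant explores several token-filling
# orders as parallel branches and keeps the best-scoring branch.

MASK = None  # placeholder for a masked (not-yet-generated) position

def confidences(seq):
    """Toy stand-in for per-position model confidence. A real dLLM would
    run a forward pass; here confidence rises when a neighbor is already
    filled, so different filling orders yield different total scores."""
    conf = {}
    for i, tok in enumerate(seq):
        if tok is MASK:
            base = ((i * 37) % 10) / 10.0
            has_neighbor = any(
                0 <= j < len(seq) and seq[j] is not MASK
                for j in (i - 1, i + 1)
            )
            conf[i] = base + (0.5 if has_neighbor else 0.0)
    return conf

def fill_branch(seq, rank_offset):
    """Fill all masked positions, always taking the rank_offset-th most
    confident position; offset 0 is plain greedy decoding."""
    seq = list(seq)
    score = 0.0
    while any(tok is MASK for tok in seq):
        conf = confidences(seq)
        ranked = sorted(conf, key=conf.get, reverse=True)
        pos = ranked[min(rank_offset, len(ranked) - 1)]
        seq[pos] = f"tok{pos}"
        score += conf[pos]
    return seq, score

def lookahead_decode(seq, num_branches=3):
    """Explore the branches (sequentially here; in parallel on real
    hardware) and keep the one with the best cumulative score."""
    return max((fill_branch(seq, b) for b in range(num_branches)),
               key=lambda branch: branch[1])

decoded, best_score = lookahead_decode([MASK] * 6)
```

In a real system the branches would share a forward pass in one batch, which is where the branch-parallelism strategy of LoPA-Dist (mentioned below) would come in; this sketch only shows the search structure.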
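The TPF figures translate directly into decoding step counts. A back-of-envelope calculation, where the 3.1 and 10.1 TPF values come from the article but the 1024-token completion length is a hypothetical assumption:

```python
# Back-of-envelope: how TPF (tokens generated per step) maps to step count.
# The TPF values 3.1 and 10.1 are from the article; the 1024-token
# completion length is an illustrative assumption.
import math

def steps_needed(total_tokens: int, tpf: float) -> int:
    """Number of decoding steps to emit total_tokens at a given TPF."""
    return math.ceil(total_tokens / tpf)

baseline_steps = steps_needed(1024, 3.1)    # 331 steps
lopa_steps = steps_needed(1024, 10.1)       # 102 steps
step_reduction = baseline_steps / lopa_steps  # ~3.2x fewer forward passes
```

Fewer forward passes only become wall-clock speedup if each (now larger) step stays cheap, which is why the article pairs the algorithm with the LoPA-Dist inference system.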