NeurIPS 2025 Spotlight | China Unicom Reshapes Diffusion Model Acceleration with Global Optimization

Core Insights
- Video generation models are advancing rapidly, and the Transformer-based DiT architecture now produces results approaching real filmed footage. The main bottleneck is inference: generation is slow, compute costs are high, and speeding it up is difficult [2][29].
- LeMiCa (Lexicographic Minimax Path Caching) is a training-free cache acceleration framework that treats caching as a global optimization problem while preserving image quality and consistency [2][29].

LeMiCa Framework
- LeMiCa takes up a long-standing question: does a truly "globally consistent, error-controllable, and fast" caching acceleration path exist for diffusion models? Its answer is that such a path does exist and is simpler than previously thought [2][7].
- The core idea is that caching acceleration is not a local decision problem but a global path optimization problem [7].

Technical Implementation
- The generation process of a diffusion model can be abstracted as a weighted directed acyclic graph (DAG): each node represents a time step, and each edge represents skipping computation and reusing the cache [8].
- LeMiCa introduces a new error measure that quantifies how caching affects the final video, with the static DAG constructed offline [11][12].

Optimization Strategy
- The problem is formalized as finding the optimal path from start to end under a fixed budget. A lexicographic minimax criterion minimizes the maximum error and keeps the error distribution more balanced [12][13].
- Experimental results show that LeMiCa improves on other mainstream methods in both speed and visual quality [14][19].
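
The path formulation above can be illustrated with a small Python sketch. It is not LeMiCa's implementation. It assumes that nodes 0..num_steps are time steps, that an edge (i, j) means "compute fully at step i and reuse the cache for the steps strictly between i and j", and that the budget is the number of edges on the path. The function name, the SENSITIVITY table, and the toy error function are all hypothetical; the summary describes the real edge errors only as measured offline. The sketch keeps, for every node and edge count, the path whose errors sorted from largest to smallest are lexicographically smallest.

```python
# Hypothetical per-step cost of skipping a step (earlier steps assumed more sensitive).
SENSITIVITY = [5, 4, 3, 2, 1, 1]


def toy_edge_error(i, j):
    """Toy stand-in for an offline-measured error: total cost of the skipped steps."""
    return sum(SENSITIVITY[i + 1:j])


def lexicographic_minimax_path(num_steps, budget, edge_error):
    """Find a path 0 -> num_steps with exactly `budget` edges whose edge errors,
    sorted in descending order, are lexicographically smallest."""
    # best[k][j] = (errors sorted descending, path) for reaching node j with k edges
    best = [dict() for _ in range(budget + 1)]
    best[0][0] = ((), (0,))
    for k in range(1, budget + 1):
        for j in range(1, num_steps + 1):
            for i, (errs, path) in best[k - 1].items():
                if i >= j:
                    continue  # edges only go forward in time
                new_errs = tuple(sorted(errs + (edge_error(i, j),), reverse=True))
                # Tuple comparison is lexicographic: worst error first, then the next worst.
                if j not in best[k] or new_errs < best[k][j][0]:
                    best[k][j] = (new_errs, path + (j,))
    if num_steps not in best[budget]:
        raise ValueError("no path reaches the end with this budget")
    errs, path = best[budget][num_steps]
    return list(path), list(errs)


path, errors = lexicographic_minimax_path(6, 3, toy_edge_error)
```

Comparing the sorted error tuples, rather than only the largest error, is what separates the lexicographic criterion from a plain minimax: among paths with the same worst edge, the one with the smaller second-worst edge wins, which pushes toward a more even spread of error.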

Performance Metrics
- LeMiCa delivers an inference speedup of over 2.4× while significantly improving visual consistency and quality across various video generation models [19][20].
- The framework has been validated on multiple mainstream video generation models, where it better preserves visual consistency between the original and accelerated outputs [14][19].

Robustness and Compatibility
- LeMiCa's acceleration paths are robust and remain effective even when the sampling schedule is altered [20].
- The framework is also compatible with text-to-image models: applied to Qwen-Image, it achieves similar acceleration [21].

Industry Recognition
- LeMiCa has been endorsed by top-tier multimodal model development teams, including Alibaba's Tongyi Qianwen and Zhizhu AI [24][25].

Conclusion
- LeMiCa reframes the acceleration of diffusion-based video generation as a global optimization problem, moving beyond the limits of traditional local greedy caching strategies and offering a new paradigm for fast, stable video generation [29].