Open-source project FlashInfer from Zihao Ye, Tianqi Chen, and others selected as MLSys 2025 Best Paper Awards are announced
机器之心 · 2025-05-14 04:36
Core Insights
- The article reports that two outstanding papers in the field of machine learning systems, both authored by Chinese researchers, received the Best Paper Award at the MLSys 2025 conference [1][29].

Group 1: FlashInfer
- FlashInfer, a collaborative research project initiated by the University of Washington, Carnegie Mellon University, and OctoAI, aims to build a flexible inference kernel library for large language models (LLMs) [4].
- NVIDIA has integrated FlashInfer's capabilities into several of its projects, improving LLM inference performance [2].
- FlashInfer significantly improves computational performance across a range of inference scenarios, reducing inter-token latency by 29% to 69% compared with state-of-the-art LLM deployment solutions [7].
- The system uses a block-sparse format, together with composable formats, to optimize memory access and reduce redundancy in key-value (KV) cache storage (see the first sketch following this summary) [9][11].
- FlashInfer supports Just-In-Time (JIT) compilation of customizable attention computation templates, giving different applications the flexibility to define their own attention variants (second sketch below) [9][20].
- The design includes a load-balancing scheduling algorithm that adapts to dynamic user requests while remaining compatible with static configurations (third sketch below) [9][26].

Group 2: The Hidden Bloat in Machine Learning Systems
- The second awarded paper examines software bloat in machine learning systems, meaning unused code and functionality that degrades performance and wastes resources [31].
- The proposed method, Negativa-ML, identifies and eliminates bloat in ML frameworks by analyzing their shared libraries (fourth sketch below), reducing device code size by up to 75% and host code size by up to 72% [32].
- By reducing bloat, Negativa-ML lowers peak host memory usage, peak GPU memory usage, and execution time by up to 74.6%, 69.6%, and 44.6%, respectively [32].
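To make the block-sparse KV-cache bullet concrete, here is a minimal Python sketch of a paged KV cache viewed as a CSR-style block-sparse structure, where every request owns a set of fixed-size pages in a shared pool. The page size, pool shapes, and the `gather_kv` helper are illustrative assumptions, not FlashInfer's actual API.

```python
# Hypothetical sketch: a paged KV cache as a block-sparse (CSR-style) layout.
import numpy as np

PAGE_SIZE = 16   # tokens per KV page (the block height)
NUM_PAGES = 8    # total pages in the shared pool
HEAD_DIM = 64

# Global page pool: one K tensor and one V tensor shared by all requests.
k_pool = np.random.randn(NUM_PAGES, PAGE_SIZE, HEAD_DIM).astype(np.float32)
v_pool = np.random.randn(NUM_PAGES, PAGE_SIZE, HEAD_DIM).astype(np.float32)

# CSR-style indexing: request i owns pages indices[indptr[i]:indptr[i+1]].
indptr = np.array([0, 3, 5])          # two requests: 3 pages and 2 pages
indices = np.array([0, 4, 2, 1, 5])   # which pool pages each request uses

def gather_kv(request_id):
    """Gather the contiguous K/V view for one request from the shared pool."""
    pages = indices[indptr[request_id]:indptr[request_id + 1]]
    k = k_pool[pages].reshape(-1, HEAD_DIM)  # (num_tokens, head_dim)
    v = v_pool[pages].reshape(-1, HEAD_DIM)
    return k, v

k0, v0 = gather_kv(0)
print(k0.shape)  # (48, 64): 3 pages of 16 tokens each
```

Because requests reference physical pages through the index arrays rather than owning private copies, shared content can point at the same pages instead of being duplicated, which is the storage-redundancy reduction the summary refers to.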
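The JIT-compiled attention templates can be illustrated at the interface level: the core attention loop stays fixed while a user-supplied transform is plugged into the score computation. FlashInfer compiles such variants into CUDA kernels; the pure-Python reference below, with its hypothetical `score_mod` argument, only mimics that plug-in shape and is not the library's real mechanism.

```python
# Pure-Python reference for a pluggable attention variant (illustration only).
import numpy as np

def attention(q, k, v, score_mod=None):
    """Softmax attention with an optional user-defined score transform.

    q: (m, d); k, v: (n, d). score_mod maps raw scores (m, n) -> (m, n).
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if score_mod is not None:
        scores = score_mod(scores)  # e.g. masking, ALiBi bias, logit capping
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v

def causal_mask(scores):
    """One variant, expressed purely as a score transform."""
    m, n = scores.shape
    keep = np.tril(np.ones((m, n), dtype=bool), k=n - m)
    return np.where(keep, scores, -np.inf)

q = np.random.randn(4, 64)
k = np.random.randn(4, 64)
v = np.random.randn(4, 64)
print(attention(q, k, v, score_mod=causal_mask).shape)  # (4, 64)
```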
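The load-balancing bullet describes a scheduling problem: a decode batch mixes requests with very different KV lengths, so naively assigning one request per compute unit leaves some units idle. The sketch below chunks each sequence and greedily assigns chunks to the least-loaded worker (a longest-processing-time heuristic); it is an assumed illustration of the problem, not FlashInfer's published algorithm, which additionally has to merge partial attention results across chunks.

```python
# Hypothetical chunk-and-balance scheduler for variable-length KV sequences.
import heapq

def balance(kv_lens, num_workers, chunk=512):
    """Split each sequence into <= chunk-token pieces, then LPT-assign them."""
    pieces = []
    for req, n in enumerate(kv_lens):
        for start in range(0, n, chunk):
            pieces.append((min(chunk, n - start), req, start))
    pieces.sort(reverse=True)  # longest pieces first (LPT heuristic)

    heap = [(0, w, []) for w in range(num_workers)]  # (load, worker, pieces)
    heapq.heapify(heap)
    for size, req, start in pieces:
        load, w, assigned = heapq.heappop(heap)      # least-loaded worker
        assigned.append((req, start, size))
        heapq.heappush(heap, (load + size, w, assigned))
    return sorted(heap)

for load, w, assigned in balance([8192, 300, 1500, 64], num_workers=2):
    print(f"worker {w}: {load} tokens across {len(assigned)} chunks")
```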
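For Group 2, the starting point of any bloat analysis is knowing what a shared library actually contains. The snippet below uses the binutils `nm` tool to enumerate a library's defined dynamic symbols; comparing that inventory against what is actually reached at runtime is the general shape of the idea, but Negativa-ML's actual pipeline (which also covers GPU device code) is not reproduced here, and the library path is only an example.

```python
# Rough sketch of a first step in shared-library bloat analysis: enumerate
# the symbols a library defines, so the set actually used at runtime can
# later be compared against it. Not Negativa-ML itself.
import subprocess

def defined_symbols(lib_path):
    """List symbols a shared library defines, via the binutils `nm` tool."""
    out = subprocess.run(
        ["nm", "--dynamic", "--defined-only", lib_path],
        capture_output=True, text=True, check=True,
    ).stdout
    return {line.split()[-1] for line in out.splitlines() if line.strip()}

# Example path; substitute any shared library on your system.
syms = defined_symbols("/usr/lib/x86_64-linux-gnu/libm.so.6")
print(f"{len(syms)} defined symbols")
```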