Mirage Persistent Kernel (MPK)

Ditch CUDA programming! CMU and collaborators compile LLMs into megakernels with just a few dozen lines of code, cutting inference latency by up to 6.7x
机器之心 · 2025-06-21 01:33
Core Viewpoint
- The Mirage Persistent Kernel (MPK) compiler, introduced by a team led by Zhihao Jia at CMU, reduces the inference latency of large language models (LLMs) by 1.2x to 6.7x, addressing the high manual-optimization cost and end-to-end latency of CUDA-based LLM inference [3][4][12].

Group 1: Introduction of MPK
- MPK automatically converts LLMs into optimized megakernels that execute the entire model without interruption, enhancing performance [9][10].
- Developers can compile LLMs with minimal manual effort, requiring only a few lines of Python code [5][12].

Group 2: Performance Advantages
- MPK eliminates kernel launch overhead and maximizes the overlap of computation, data loading, and inter-GPU communication, resulting in significantly lower inference latency [14][18].
- MPK's performance advantage grows with the number of GPUs, making it particularly efficient in multi-GPU deployment scenarios [18].

Group 3: Working Mechanism of MPK
- MPK consists of two main components: a compiler that transforms LLM computation graphs into fine-grained task graphs, and a runtime system that executes these task graphs within a single megakernel [19][24].
- The MPK compiler captures dependencies at a finer granularity than existing systems, allowing for more aggressive pipeline optimizations [26][27].

Group 4: Future Plans
- The team aims to enhance MPK's usability and performance, with ongoing efforts to support dynamic workloads and advanced scheduling strategies [40][43].
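The runtime model described in Group 3 can be sketched abstractly: a persistent pool of workers drains a queue of fine-grained tasks, and each task is dispatched as soon as its dependencies complete, without ever returning control to a host-side launcher between tasks. The sketch below is a toy Python illustration of that idea, not MPK's actual API; all task names (`matmul_0`, `allreduce_0`, etc.) and the `run_task_graph` helper are hypothetical.

```python
from collections import deque

def run_task_graph(tasks, deps):
    """Execute tasks as soon as their prerequisites finish.

    tasks: {name: zero-arg callable}
    deps:  {name: set of prerequisite task names}
    """
    remaining = {t: set(deps.get(t, ())) for t in tasks}
    # Reverse index: which tasks are unblocked when t completes.
    dependents = {}
    for t, d in remaining.items():
        for p in d:
            dependents.setdefault(p, []).append(t)
    ready = deque(t for t, d in remaining.items() if not d)
    order = []
    while ready:                    # persistent loop: no per-task "launch"
        t = ready.popleft()
        tasks[t]()                  # run the fine-grained task
        order.append(t)
        for child in dependents.get(t, ()):   # event-style notification
            remaining[child].discard(t)
            if not remaining[child]:
                ready.append(child)
    return order

# Hypothetical example: a matmul split into two chunks, where each
# allreduce chunk depends only on its own matmul chunk, so communication
# can begin before the whole matmul has finished.
log = []
tasks = {n: (lambda n=n: log.append(n))
         for n in ["matmul_0", "matmul_1", "allreduce_0", "allreduce_1"]}
deps = {"allreduce_0": {"matmul_0"}, "allreduce_1": {"matmul_1"}}
order = run_task_graph(tasks, deps)
```

Note how `allreduce_0` becomes runnable right after `matmul_0`, which is the chunk-level dependency tracking the summary attributes to MPK's compiler.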
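The pipelining benefit claimed in Groups 2 and 3 can be illustrated with a toy timing model: if the allreduce for chunk i may start as soon as chunk i's matmul finishes, communication overlaps computation; if the dependency is kernel-level, all communication waits for the whole matmul. The numbers below are abstract time units chosen for illustration, not measurements of MPK.

```python
N = 4          # matmul split into N chunks, 1 time unit each
COMM = 1       # allreduce cost per chunk, 1 time unit

# Coarse (kernel-level) dependency: allreduce starts only after ALL
# matmul chunks are done, so the two phases run back to back.
coarse_finish = N + N * COMM

# Fine (chunk-level) dependency: the allreduce of chunk i starts as soon
# as chunk i is computed, so communication pipelines behind compute.
comm_free = 0                        # time the communication engine frees up
for i in range(N):
    compute_done = i + 1             # chunk i finishes at t = i + 1
    comm_free = max(comm_free, compute_done) + COMM
fine_finish = comm_free
```

In this toy model the fine-grained schedule finishes at t = 5 versus t = 8 for the coarse one, and the gap widens as more of the communication can hide behind compute, which is consistent with the summary's claim that MPK's advantage grows in multi-GPU settings.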