Core Viewpoint
- The article introduces MultiKernelBench, a new open-source evaluation framework from Nanjing University and Zhejiang University for assessing how well large language models (LLMs) generate high-performance deep learning kernels across diverse hardware platforms [3][6][10].

Group 1: Background and Motivation
- Most deep learning computation runs in low-level kernels on hardware accelerators such as GPUs, NPUs, and TPUs, which are typically hand-written in specialized programming languages [2].
- Recent advances in LLM code generation have sparked interest in automating the generation of high-performance deep learning kernels [2][3].
- Existing benchmarks are limited in platform coverage, assessment dimensions, and scalability, raising the question of whether LLM strengths in the CUDA ecosystem transfer to heterogeneous platforms [3][6].

Group 2: MultiKernelBench Framework
- MultiKernelBench establishes an open evaluation setting in which LLMs automatically generate high-performance deep learning kernels across multiple platforms, marking a shift from single-platform to cross-platform capability [6][9].
- The framework is modular by design, with four core characteristics: cross-hardware-platform support, a fine-grained task system, end-to-end automated evaluation, and category-aware one-shot prompting [9][11][14][16].
- It covers 14 categories of core deep learning operators, including convolution and normalization, and mixes classic tasks with newly added ones to probe LLM capabilities comprehensively [11][12].

Group 3: Evaluation and Results
- MultiKernelBench has been used to evaluate seven major LLMs, including GPT-4o and Claude, with parameter counts ranging from 32 billion to 681 billion [19].
- The evaluation metrics are Compilation@k, Pass@k, and SpeedUp@k, measuring compilation success, functional correctness, and performance gain, respectively [21].
- Results show that while LLMs perform well on CUDA, their success rates drop sharply on non-CUDA platforms, underscoring the need for further development in this area [23][27].

Group 4: Future Directions
- The authors plan to extend support to more GPU and NPU architectures and invite hardware manufacturers to help build an open-source ecosystem [10][24].
- Future work will focus on strengthening cross-platform collaboration, improving generation quality on low-resource platforms, and integrating more hardware backends [23][24].
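The article names the metrics but does not spell out their definitions. A common convention that the "@k" naming suggests is the unbiased estimator popularized by the HumanEval benchmark, applied to compilation success for Compilation@k and to functional correctness for Pass@k. The sketch below assumes that convention; the function and field names (`estimate_at_k`, `score_task`, `compiled`, `passed`) are illustrative, not taken from MultiKernelBench:

```python
from math import comb

def estimate_at_k(n: int, c: int, k: int) -> float:
    """Unbiased @k estimator (as popularized by HumanEval's pass@k):
    the probability that at least one of k samples, drawn without
    replacement from n generations of which c succeeded, succeeds."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so any k-draw must hit a success
    return 1.0 - comb(n - c, k) / comb(n, k)

def score_task(results, k):
    """results: one dict per generated kernel for a single task, with
    'compiled' and 'passed' flags as a compile-and-run harness would report.
    Returns (Compilation@k, Pass@k) for that task."""
    n = len(results)
    compiled = sum(r["compiled"] for r in results)
    passed = sum(r["passed"] for r in results)
    return estimate_at_k(n, compiled, k), estimate_at_k(n, passed, k)
```

For example, with 10 generations of which 8 compile and 5 pass their correctness checks, `k=1` yields Compilation@1 = 0.8 and Pass@1 = 0.5, i.e. the plain per-sample success rates; larger k rewards having at least one good sample among many. SpeedUp@k would additionally compare measured runtime of passing kernels against a reference implementation, a detail the summary does not specify.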
Can large models generate high-performance kernels for different hardware platforms? Nanjing University and Zhejiang University propose MultiKernelBench, a cross-platform kernel-generation evaluation framework
机器之心 (Jiqizhixin) · 2025-08-25 02:48