AI-Generated Kernels
Stanford accidentally used AI to generate ultra-fast CUDA kernels that outperform human-expert optimizations, more than doubling native PyTorch performance; led by Chinese researchers
量子位· 2025-05-31 03:34
Core Insights
- AI-generated kernels unexpectedly outperform kernels optimized by human experts, delivering significant performance improvements on common deep learning operations [1][2][4]

Performance Metrics
- The AI-generated kernels achieved performance improvements of up to nearly 400% over native PyTorch on common deep learning operations [2]
- Specific results, reported as a percentage of the PyTorch reference's speed (see the measurement sketch after this summary):
  - Matrix multiplication (Matmul, FP32): 101.3% of torch.matmul
  - 2D convolution (Conv2D): 179.9% of torch.nn.Conv2d
  - Softmax: 111.8% of torch.softmax
  - Layer normalization (LayerNorm): 484.4% of torch.nn.LayerNorm
  - Fused Conv2D + ReLU + MaxPool: 290.1% of the PyTorch reference implementation and 189.0% of the torch.compile() reference implementation [6]

Research Methodology
- The research team initially set out to generate synthetic data for training kernel-generation models, but found that the synthetic data generation itself produced high-performance kernels [3][40]
- The optimization process inserts a natural-language reasoning step between iterations, which encourages a more diverse search [9][10]
- The team employed a multi-branch exploration strategy: multiple implementations evolve from each idea, and the best-performing kernel seeds the next round (see the search-loop sketch below) [16][19]

Implementation Details
- The kernels are written in pure CUDA-C, without relying on libraries such as CUTLASS or Triton (see the load_inline example below) [13]
- The optimization approach diverges from traditional sequential code modification: natural language is used to generate optimization ideas first, which are then translated into code [14][15]
- The generated kernels use advanced optimizations and hardware features that were previously considered difficult to implement [41]

Future Prospects
- The team expressed optimism about future developments, noting that their initial goal of merely producing functional kernels has evolved into achieving significant performance gains [47][48]
- Optimization work is ongoing, particularly for FP16 Matmul and FP16 Flash Attention, which currently reach 52% of torch.matmul and 9% of torch.nn.functional.scaled_dot_product_attention, respectively [46]
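The percentages above compare wall-clock speed against a PyTorch reference op, i.e. reference time divided by candidate time (values above 100% mean the candidate is faster). Below is a minimal sketch of how such a comparison can be measured with torch.utils.benchmark; candidate_layernorm is a hypothetical placeholder for a generated kernel and is not code from the Stanford team.

```python
# Minimal sketch: measuring a candidate kernel's speed relative to a PyTorch
# reference op, expressed as a percentage (reference_time / candidate_time * 100).
# `candidate_layernorm` is a hypothetical stand-in for an AI-generated kernel.
import torch
from torch.utils import benchmark

def candidate_layernorm(x, weight, bias):
    # Placeholder: a real candidate would call a custom CUDA kernel here.
    return torch.nn.functional.layer_norm(x, x.shape[-1:], weight, bias)

def relative_performance(stmt_candidate: str, stmt_reference: str, env: dict) -> float:
    """Return candidate speed as a percentage of the reference (higher is faster)."""
    t_ref = benchmark.Timer(stmt=stmt_reference, globals=env).timeit(100).mean
    t_cand = benchmark.Timer(stmt=stmt_candidate, globals=env).timeit(100).mean
    return 100.0 * t_ref / t_cand

if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(4096, 4096, device=device)
    w = torch.ones(4096, device=device)
    b = torch.zeros(4096, device=device)
    ln = torch.nn.LayerNorm(4096).to(device)
    env = {"x": x, "w": w, "b": b, "ln": ln, "candidate_layernorm": candidate_layernorm}
    pct = relative_performance("candidate_layernorm(x, w, b)", "ln(x)", env)
    print(f"candidate runs at {pct:.1f}% of torch.nn.LayerNorm")
```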
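The article describes the search loop as: reason about optimization ideas in natural language, branch into several candidate implementations per idea, benchmark them all, and seed the next round with the fastest correct kernel. The sketch below is a hypothetical reconstruction of that loop; the helpers propose_ideas, implement_idea, is_correct, and measure_runtime are assumed stand-ins around a language model and a CUDA benchmarking harness, not the team's actual code.

```python
# Hypothetical sketch of the multi-branch, natural-language-guided search loop
# described in the article. All helper functions are assumed stand-ins.
from dataclasses import dataclass

@dataclass
class Kernel:
    source: str        # CUDA-C source of the candidate kernel
    runtime_ms: float  # measured runtime on the target GPU

def propose_ideas(best: Kernel, n_ideas: int) -> list[str]:
    """Ask the model, in natural language, for n_ideas distinct optimization ideas."""
    raise NotImplementedError  # e.g. an LLM call conditioned on best.source

def implement_idea(best: Kernel, idea: str, n_branches: int) -> list[str]:
    """Ask the model for n_branches different CUDA-C implementations of one idea."""
    raise NotImplementedError

def is_correct(source: str) -> bool:
    """Compile the kernel and check its output against the PyTorch reference op."""
    raise NotImplementedError

def measure_runtime(source: str) -> float:
    """Benchmark the compiled kernel and return its runtime in milliseconds."""
    raise NotImplementedError

def search(seed: Kernel, rounds: int = 10, n_ideas: int = 4, n_branches: int = 4) -> Kernel:
    best = seed
    for _ in range(rounds):
        candidates = []
        # Language reasoning step: enumerate ideas before writing any code.
        for idea in propose_ideas(best, n_ideas):
            # Multi-branch step: several implementations evolve from each idea.
            for source in implement_idea(best, idea, n_branches):
                if is_correct(source):
                    candidates.append(Kernel(source, measure_runtime(source)))
        # The fastest correct kernel seeds the next round.
        if candidates:
            round_best = min(candidates, key=lambda k: k.runtime_ms)
            if round_best.runtime_ms < best.runtime_ms:
                best = round_best
    return best
```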
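For context on what "pure CUDA-C without CUTLASS or Triton" means in a PyTorch workflow, here is a minimal hand-written example of compiling and calling an inline CUDA-C kernel through torch.utils.cpp_extension.load_inline. The trivial ReLU kernel is illustrative only; the generated kernels in the article are written in the same raw CUDA-C style but implement far more aggressive, fused optimizations.

```python
# Illustrative only: a trivial hand-written CUDA-C kernel compiled and called
# from PyTorch via load_inline. Requires a CUDA GPU and the CUDA toolkit.
import torch
from torch.utils.cpp_extension import load_inline

cuda_source = r"""
__global__ void relu_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i] > 0.0f ? in[i] : 0.0f;
    }
}

torch::Tensor relu_cuda(torch::Tensor x) {
    auto out = torch::empty_like(x);
    int n = x.numel();
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    relu_kernel<<<blocks, threads>>>(
        x.data_ptr<float>(), out.data_ptr<float>(), n);
    return out;
}
"""

# Declaration visible to the C++ binding layer.
cpp_source = "torch::Tensor relu_cuda(torch::Tensor x);"

ext = load_inline(
    name="inline_relu",
    cpp_sources=cpp_source,
    cuda_sources=cuda_source,
    functions=["relu_cuda"],
)

x = torch.randn(1 << 20, device="cuda")
assert torch.allclose(ext.relu_cuda(x), torch.relu(x))
```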