CUDA Agent
A ByteDance-Tsinghua agent automatically writes CUDA kernels, with a 2.11x speed-up over torch.compile
量子位· 2026-03-03 07:02
Core Insights
- The article discusses the collaboration between ByteDance Seed and the Tsinghua AIR team to develop an AI system capable of generating high-performance GPU code [1]
- The newly open-sourced CUDA Agent achieved top performance on the GPU kernel optimization benchmark KernelBench, with a pass rate of 98.8% and a geometric-mean speed-up of 2.11x over torch.compile [2][28]

Group 1: GPU Kernel Optimization
- GPU kernel optimization has traditionally been challenging, requiring a deep understanding of GPU architecture, the memory hierarchy, and thread scheduling [6]
- The performance of model training and inference is significantly influenced by the quality of the underlying CUDA kernels [7]
- Existing AI-assisted solutions have not fundamentally improved kernel optimization capability, being either training-free iterative optimizers or fixed execution-feedback loops [8]

Group 2: CUDA Agent Development
- CUDA Agent is an end-to-end large-scale reinforcement learning system designed to learn to generate and optimize high-performance CUDA kernels [9]
- The training data for CUDA Agent was constructed through a three-phase process, yielding 6,000 synthetic training tasks [10][14]
- The training process includes a robust anti-cheating mechanism to ensure the integrity of the generated tasks [12]

Group 3: Training Methodology
- The training environment uses a ReAct-style interaction loop with performance analysis and validation, ensuring that generated kernels exceed torch.compile by at least 5% [17]
- A milestone-based discrete reward mechanism is implemented to reflect the true quality of the generated kernels [22]
- The training pipeline is divided into multiple phases to keep long-context reinforcement learning stable, reaching a context window of 128K tokens [23][27]

Group 4: Performance Evaluation
- CUDA Agent significantly outperformed commercial models, with a faster rate of 96.8% relative to torch.compile and a geometric-mean speed-up of 2.11x [28][30]
- On Level-1 and Level-2 tasks, CUDA Agent achieved a 100% pass rate; on Level-3 tasks it reached a 94% pass rate and a 90% faster rate relative to torch.compile [29][30]
- The performance gap between CUDA Agent and leading commercial models such as Claude Opus 4.5 and Gemini 3 Pro is substantial, particularly on challenging tasks [30]

Group 5: Open Source Contribution
- The team simultaneously open-sourced the training dataset CUDA-Agent-Ops-6K, including the complete filtering process and contamination-control scheme, to support future research on reinforcement-learning-based CUDA kernel optimization [32]
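The milestone-based discrete reward and the 5% speed-up gate described above can be sketched as follows. This is a minimal illustration, not the paper's actual reward function: the function name, the specific reward tiers, and every value except the 5% threshold are assumptions made for the example.

```python
# Illustrative sketch of a milestone-based discrete reward (all tier values
# are hypothetical): a kernel earns progressively higher reward for
# compiling, producing correct output, and beating the torch.compile
# baseline by at least 5%, as the summary describes.
def milestone_reward(compiled, correct, kernel_ms, baseline_ms):
    """Map kernel outcomes to discrete reward milestones."""
    if not compiled:
        return 0.0          # milestone 0: does not compile
    if not correct:
        return 0.1          # milestone 1: compiles but wrong output
    speedup = baseline_ms / kernel_ms
    if speedup < 1.05:      # must beat torch.compile by at least 5%
        return 0.3          # milestone 2: correct but too slow
    # Milestone 3: scale reward with speed-up beyond the 5% gate.
    return min(1.0, 0.5 + 0.25 * (speedup - 1.05))

# Example: a kernel running in 1.0 ms against a 2.11 ms baseline.
print(milestone_reward(True, True, 1.0, 2.11))  # → 0.765
```

Discrete milestones like these avoid rewarding a kernel merely for compiling, which is the failure mode of "fixed execution-feedback loop" approaches the summary contrasts against.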
NVIDIA's moat breached by AI: the ByteDance-Tsinghua CUDA Agent lets everyone hand-roll CUDA kernels
机器之心· 2026-03-03 02:55
机器之心 Editorial Team. Recently, CUDA Agent, new research from ByteDance's Seed team and Tsinghua University's AIR, has caused quite a stir in the AI field.

The researchers trained a model that writes fast CUDA kernels: not merely correct kernels, but genuinely optimized ones. On easy and medium kernels it outperforms torch.compile by 2x; on complex kernels it outperforms torch.compile by about 92%; even in the hardest setting it beats Claude Opus 4.5 and Gemini 3 Pro by about 40%.

Addressing this tension, CUDA Agent's core idea is simple but clever: CUDA performance is determined not by correctness but by the hardware. Warps, memory bandwidth, memory conflicts: things that are only visible in a profiler. Instead of rewarding "did it compile", the researchers reward actual GPU speed, measured from real profiling data. Reinforcement learning trains directly on performance.

Before this, large models such as GPT and Claude could already write "correct" CUDA code, and AI-generated code had seen some real adoption, but running at all and running fast are two entirely different things. GPU kernel optimization is foundational to modern deep learning, yet it remains highly specialized work, requiring deep ...
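The geometric-mean speed-up metric quoted throughout both articles (2.11x over torch.compile) can be computed as below. The function name and the sample timings are illustrative assumptions, not measurements from the paper; only the metric itself is from the source.

```python
import math

# Per-task speed-up is baseline time / kernel time; the headline number is
# the geometric mean of those ratios across the benchmark, which (unlike an
# arithmetic mean) is not dominated by a few tasks with huge speed-ups.
def geometric_mean_speedup(baseline_ms, kernel_ms):
    ratios = [b / k for b, k in zip(baseline_ms, kernel_ms)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical per-task timings in ms: torch.compile baseline vs. generated
# kernels. Each task here is exactly 2x faster, so the geomean is 2.0.
baseline = [4.0, 2.0, 8.0]
kernels = [2.0, 1.0, 4.0]
print(geometric_mean_speedup(baseline, kernels))  # → 2.0
```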