Core Insights
- The article presents StitchCUDA, a framework that shifts the focus from optimizing individual kernels to generating complete end-to-end GPU programs, achieving a 90% success rate and a 1.50× average speedup on KernelBench Level 3 tasks, significantly outperforming existing methods [2][10][31]

Background and Motivation
- CUDA code performance is critical for model training and inference; existing LLM-based methods excel at single-kernel tasks but struggle with end-to-end GPU program generation, which involves complex system-level factors [4][7]

Challenges in End-to-End CUDA Generation
- Three core challenges are identified:
1. End-to-end programs require global coordination, since performance is shaped by system-level decisions [7]
2. The Coder's CUDA programming capability must be strengthened beyond what prompt engineering alone can deliver [7]
3. Existing RL methods suffer from reward hacking and degradation behavior [7]

StitchCUDA Methodology
- StitchCUDA combines a multi-agent framework with Rubric-Reward-based Agentic RL, built from three specialized agents:
1. Planner: analyzes performance and decomposes tasks [12]
2. Coder: generates CUDA implementations from the Planner's tasks [12]
3. Verifier: validates correctness and analyzes performance bottlenecks [13]

Agentic Reinforcement Learning
- The framework introduces an Agentic RL training scheme that decomposes multi-round interactions into atomic skills, significantly reducing training time and strengthening the Coder's capabilities [14][16]

Rubric Reward Mechanism
- Rubric Reward, designed by CUDA experts, evaluates generated code across four dimensions; combined with rule-based rewards, it mitigates reward hacking and degradation behavior [17][18]

Experimental Evaluation
- Experiments on KernelBench across two NVIDIA architectures show StitchCUDA outperforming leading models and frameworks, with high correctness and speedup rates [20][21]

Key Findings
- The multi-agent framework significantly improves end-to-end correctness, with Agentic RL being crucial for system-level acceleration [22]
- StitchCUDA outperforms existing methods, including those built on larger models, indicating that RL training provides capabilities that model size alone cannot replace [22]
- The framework surpasses torch.compile, achieving a 1.29× speedup over reference code [23]

Hacking Detection
- StitchCUDA implements anti-hacking measures that prevent models from exploiting the evaluation criteria, substantially reducing hacking rates [24][26]

Ablation Studies
- Removing Rubric Reward causes a substantial drop in success rate and speedup, confirming its critical role in effective RL training [27]

Case Study
- A worked task example shows how StitchCUDA achieved a 3.75× speedup through combined system-level and kernel-level optimizations [29][30]

Conclusion
- StitchCUDA offers a comprehensive solution for end-to-end GPU program generation, achieving near-100% success rates and a 1.5× average speedup, paving the way for LLM-driven automated GPU programming [31]
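The Planner → Coder → Verifier interaction described above can be sketched as a simple iterative loop. This is a minimal illustration, not StitchCUDA's actual implementation: the class interfaces, the `Verdict` fields, and the `max_rounds` cutoff are all assumptions made for the sketch.

```python
# Hedged sketch of a Planner -> Coder -> Verifier multi-round loop.
# All interfaces here are illustrative assumptions, not StitchCUDA's API.
from dataclasses import dataclass


@dataclass
class Verdict:
    correct: bool
    bottleneck: str  # e.g. "memory-bound kernel", "redundant host-device copies"


def generate_program(planner, coder, verifier, task, max_rounds=3):
    """Iterate plan -> implement -> verify until the Verifier accepts,
    feeding the Verifier's bottleneck analysis back into the next plan."""
    feedback = None
    for _ in range(max_rounds):
        plan = planner.plan(task, feedback)       # decompose into sub-tasks
        code = coder.implement(plan)              # emit CUDA for the plan
        verdict = verifier.check(task, code)      # correctness + bottlenecks
        if verdict.correct:
            return code
        feedback = verdict.bottleneck             # inform the next round
    return None                                   # give up after max_rounds
```

The key point the sketch captures is that the Verifier's output is not just pass/fail: its bottleneck analysis becomes the Planner's input for the next round, which is what makes system-level (rather than per-kernel) optimization possible.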
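The atomic-skill decomposition used by the Agentic RL scheme might work along these lines: instead of training on full multi-round interactions, each round is turned into a standalone single-turn example. The trajectory record format and field names below are assumptions for illustration only.

```python
# Hedged sketch: splitting a multi-round agent trajectory into atomic,
# single-turn training examples. Field names are illustrative assumptions.

def to_atomic_examples(trajectory):
    """Each round of (feedback, code) becomes one standalone example whose
    prompt contains only the task plus the feedback that preceded it, so
    the Coder trains on short single-turn episodes rather than replaying
    the entire multi-round interaction (which is what reduces training time)."""
    examples = []
    for round_ in trajectory["rounds"]:
        examples.append({
            "prompt": {"task": trajectory["task"], "feedback": round_["feedback"]},
            "completion": round_["code"],
        })
    return examples
```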
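The combination of Rubric Reward with rule-based rewards described above might be wired together as follows. The article says only that CUDA experts defined four evaluation dimensions; the dimension names, the equal weighting, and the speedup-bonus formula here are all assumptions for the sketch.

```python
# Hedged sketch: combining an expert-rubric score with rule-based gates.
# Dimension names, weights, and the bonus formula are illustrative assumptions.

RUBRIC_DIMENSIONS = ("planning", "cuda_quality", "optimization", "robustness")


def rubric_score(scores):
    """Average the four rubric dimensions (each scored in [0, 1])."""
    return sum(scores[d] for d in RUBRIC_DIMENSIONS) / len(RUBRIC_DIMENSIONS)


def total_reward(scores, compiles, outputs_match, speedup):
    """Rule-based gates zero the reward for incorrect code, removing the
    incentive to game the rubric with plausible-looking but wrong programs
    (reward hacking); correct code earns rubric quality plus a speedup bonus."""
    if not (compiles and outputs_match):
        return 0.0
    return rubric_score(scores) + max(0.0, speedup - 1.0)
```

The design point is the interaction of the two signals: the hard rule-based gate enforces correctness, while the rubric supplies the dense quality gradient that a pass/fail signal alone lacks.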
A 32B Model Beats GPT-5.2: StitchCUDA, the First End-to-End GPU Programming Agent Framework
机器之心·2026-03-05 03:54