NVIDIA's moat breached by AI: ByteDance and Tsinghua's CUDA Agent lets anyone write CUDA kernels
机器之心 (Synced) · 2026-03-03 02:55
Synced editorial team. A recent study from the ByteDance Seed team and Tsinghua University's AIR institute, CUDA Agent, has caused quite a stir in the AI community. The researchers trained a model that writes fast CUDA kernels: not merely correct kernels, but genuinely optimized ones. On simple and medium kernels it outperforms torch.compile by 2x; on complex kernels it beats torch.compile by about 92%; and even in the hardest settings it outperforms Claude Opus 4.5 and Gemini 3 Pro by about 40%. The core idea behind CUDA Agent is simple yet clever: CUDA performance is determined not by correctness but by the hardware. Warps, memory bandwidth, memory bank conflicts: these are things visible only in a profiler. Instead of rewarding "did it compile", the researchers reward actual GPU speed, using real profiling data, with reinforcement learning trained directly on performance. Before this, large models such as GPT and Claude could already write "correct" CUDA code, and AI-generated code had seen some adoption, but merely running and running fast are two entirely different things. GPU kernel optimization underpins modern deep learning, yet it remains a highly specialized craft requiring deep ...
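The speed-based reward described above can be sketched as a simple function of measured runtimes. The sketch below is illustrative only: the function name, the zero reward for incorrect kernels, and the speedup-ratio shaping are assumptions for exposition, not the paper's actual reward formulation.

```python
def kernel_reward(candidate_ms: float, baseline_ms: float,
                  passed_correctness: bool) -> float:
    """Hypothetical RL reward for a generated CUDA kernel.

    Rewards measured speedup over a baseline (e.g. a torch.compile
    kernel timed under the same profiler) rather than rewarding mere
    compilation success. Assumed shaping, not the paper's formula.
    """
    if not passed_correctness:
        return 0.0  # a wrong kernel earns nothing, however fast
    # Speedup ratio: values above 1.0 mean faster than the baseline.
    return baseline_ms / candidate_ms
```

Under this shaping, a kernel matching the article's 2x result over torch.compile would score 2.0, while an incorrect kernel scores 0 regardless of speed.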
A 33%-50% H100 speedup with no CUDA code: the Flash Attention author's new work takes off
量子位 (QbitAI) · 2025-07-11 06:16
Core Viewpoint - The article discusses the introduction of QuACK, a new memory-bound kernel library that accelerates H100 GPU performance by 33%-50% without using any CUDA C++ code, relying solely on Python through CuTe-DSL [1][2][4].

Group 1: Performance Improvement
- QuACK achieves a speed increase of 33%-50% compared to highly optimized libraries like PyTorch's torch.compile and Liger on the H100 with a memory bandwidth of 3 TB/s [2].
- The focus is on optimizing memory-bound kernels, which spend most of their time on memory access rather than computation, thus improving overall performance [14][20].

Group 2: Technical Insights
- The authors emphasize that achieving "light-speed" performance for memory-bound kernels requires understanding and effectively using the modern GPU's thread and memory hierarchy [14][22].
- The article outlines the memory hierarchy of the Hopper architecture, detailing each execution granularity and its corresponding memory level, from threads and thread blocks up to distributed shared memory [22][25].

Group 3: Kernel Development
- The authors provide a tutorial for implementing QuACK, showcasing how to write efficient kernel code using CuTe DSL, which combines the ease of Python with the performance of CUDA C++ [12][92].
- The article highlights the importance of hardware-aware load and store strategies to maximize memory throughput, particularly in memory-bound kernels [30][32].

Group 4: Benchmarking and Comparisons
- Performance tests on the softmax kernel show that the CuTe DSL implementation achieves a DRAM throughput of 3.01 TB/s, significantly outperforming the Triton kernel generated by torch.compile, which reaches around 2.0 TB/s [70][81].
- The CuTe DSL implementation maintains high memory throughput even as input sizes increase, demonstrating its efficiency on large-scale data [84][90].
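The throughput figures cited for the benchmarks follow from first principles: a memory-bound kernel's achieved bandwidth is just total bytes moved divided by elapsed time. A minimal sketch, assuming a softmax-style kernel that reads each element once and writes it once (the function name and example numbers are illustrative, not taken from the article's benchmark harness):

```python
def achieved_bandwidth_tb_s(n_elems: int, bytes_per_elem: int,
                            passes: int, elapsed_s: float) -> float:
    """Achieved DRAM bandwidth in TB/s: bytes moved / time.

    For a softmax-style kernel each element is read once and written
    once (passes = 2), ignoring the small reduction traffic.
    """
    total_bytes = n_elems * bytes_per_elem * passes
    return total_bytes / elapsed_s / 1e12
```

For example, moving one billion bf16 elements (2 bytes each) in and out within 4 ms works out to 1.0 TB/s; the article's 3.01 TB/s and ~2.0 TB/s softmax figures can be back-computed the same way from problem size and kernel time.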
Group 5: Future Directions
- The authors suggest that the efficient GPU kernel development process can be automated, potentially allowing large language models to generate optimized GPU kernels in the future [96].