NVIDIA's moat breached by AI: ByteDance and Tsinghua's CUDA Agent lets anyone write CUDA kernels
机器之心 (Synced) · 2026-03-03 02:55
Synced editorial team. A recent study from the ByteDance Seed team and Tsinghua University's AIR institute, CUDA Agent, has caused quite a stir in the AI community. The researchers trained a model that writes fast CUDA kernels: not merely correct kernels, but genuinely optimized ones. On simple and medium kernels it outperforms torch.compile by 2x; on complex kernels it beats torch.compile by about 92%; and even in the hardest settings it outperforms Claude Opus 4.5 and Gemini 3 Pro by about 40%. The core idea behind CUDA Agent is simple yet clever: CUDA performance is determined not by correctness but by the hardware. Warps, memory bandwidth, memory bank conflicts: these are things visible only in a profiler. Instead of rewarding "did it compile", the researchers reward actual GPU speed, using real profiling data, with reinforcement learning trained directly on performance. Before this, large models such as GPT and Claude could already write "correct" CUDA code, and AI-generated code had seen some adoption, but merely running and running fast are two entirely different things. GPU kernel optimization underpins modern deep learning, yet it remains a highly specialized craft requiring deep ...
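The speed-based reward described above can be sketched as a simple function of measured runtimes. The sketch below is illustrative only: the function name, the zero reward for incorrect kernels, and the speedup-ratio shaping are assumptions for exposition, not the paper's actual reward formulation.

```python
def kernel_reward(candidate_ms: float, baseline_ms: float,
                  passed_correctness: bool) -> float:
    """Hypothetical RL reward for a generated CUDA kernel.

    Rewards measured speedup over a baseline (e.g. a torch.compile
    kernel timed under the same profiler) rather than rewarding mere
    compilation success. Assumed shaping, not the paper's formula.
    """
    if not passed_correctness:
        return 0.0  # a wrong kernel earns nothing, however fast
    # Speedup ratio: values above 1.0 mean faster than the baseline.
    return baseline_ms / candidate_ms
```

Under this shaping, a kernel matching the article's 2x result over torch.compile would score 2.0, while an incorrect kernel scores 0 regardless of speed.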
A 33%-50% H100 speedup with no CUDA code: the Flash Attention author's new work takes off
量子位 (QbitAI) · 2025-07-11 06:16
Core Viewpoint - The article discusses the introduction of QuACK, a new memory-bound kernel library that accelerates H100 GPU performance by 33%-50% without using any CUDA C++ code, relying solely on Python through CuTe-DSL [1][2][4].

Group 1: Performance Improvement
- QuACK achieves a speed increase of 33%-50% compared to highly optimized libraries like PyTorch's torch.compile and Liger on the H100 with a memory bandwidth of 3 TB/s [2].
- The focus is on optimizing memory-bound kernels, which spend most of their time on memory access rather than computation, thus improving overall performance [14][20].

Group 2: Technical Insights
- The authors emphasize that achieving "light-speed" performance for memory-bound kernels requires understanding and effectively using the modern GPU's thread and memory hierarchy [14][22].
- The article outlines the memory hierarchy of the Hopper architecture, detailing each execution granularity and its corresponding memory level, from threads and thread blocks up to distributed shared memory [22][25].

Group 3: Kernel Development
- The authors provide a tutorial for implementing QuACK, showcasing how to write efficient kernel code using CuTe DSL, which combines the ease of Python with the performance of CUDA C++ [12][92].
- The article highlights the importance of hardware-aware load and store strategies to maximize memory throughput, particularly in memory-bound kernels [30][32].

Group 4: Benchmarking and Comparisons
- Performance tests on the softmax kernel show that the CuTe DSL implementation achieves a DRAM throughput of 3.01 TB/s, significantly outperforming the Triton kernel generated by torch.compile, which reaches around 2.0 TB/s [70][81].
- The CuTe DSL implementation maintains high memory throughput even as input sizes increase, demonstrating its efficiency on large-scale data [84][90].
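The throughput figures cited for the benchmarks follow from first principles: a memory-bound kernel's achieved bandwidth is just total bytes moved divided by elapsed time. A minimal sketch, assuming a softmax-style kernel that reads each element once and writes it once (the function name and example numbers are illustrative, not taken from the article's benchmark harness):

```python
def achieved_bandwidth_tb_s(n_elems: int, bytes_per_elem: int,
                            passes: int, elapsed_s: float) -> float:
    """Achieved DRAM bandwidth in TB/s: bytes moved / time.

    For a softmax-style kernel each element is read once and written
    once (passes = 2), ignoring the small reduction traffic.
    """
    total_bytes = n_elems * bytes_per_elem * passes
    return total_bytes / elapsed_s / 1e12
```

For example, moving one billion bf16 elements (2 bytes each) in and out within 4 ms works out to 1.0 TB/s; the article's 3.01 TB/s and ~2.0 TB/s softmax figures can be back-computed the same way from problem size and kernel time.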
Group 5: Future Directions
- The authors suggest that the efficient GPU kernel development process can be automated, potentially allowing large language models to generate optimized GPU kernels in the future [96].