TileLang

AI Series Report (IX) / Computing Power Series Report (II): TileLang, China's CUDA and Triton
Western Securities· 2025-10-15 06:09
Investment Rating
- The industry investment rating is "Overweight" [7]

Core Insights
- Over nearly two decades, CUDA has built a significant competitive advantage for NVIDIA in high-performance computing and AI applications, reinforced by enhancements such as NVLink and mixed-precision training [12][18]
- Triton, introduced by Philippe Tillet, automates low-level optimizations for GPU programming, significantly reducing the development burden for AI applications [19][23]
- TileLang, developed at Peking University, aims to bridge the compatibility gap between domestic AI chips and established platforms like CUDA and Triton, potentially lowering development costs and accelerating commercialization [29][36]

Summary by Sections
Section 1: High-Performance Computing as the Foundation for Generative AI
- CUDA has been pivotal in establishing NVIDIA's moat by enabling GPUs to handle the parallel computing tasks essential for AI [12][18]
- The introduction of Tensor Cores and mixed-precision training has drastically improved matrix computation speeds [14][18]
Section 2: TileLang as a Potential Solution for Domestic AI Chips
- Domestic AI chip manufacturers face challenges in software compatibility and toolchain maturity compared with NVIDIA's CUDA platform [28]
- TileLang, open-sourced in January 2025, uses tiling techniques to optimize memory and scheduling, potentially enhancing the performance of AI operators (a conceptual sketch of the tiling idea follows this summary) [29][32]
- TileLang could effectively address the compatibility issues between leading AI chip companies and domestic platforms, facilitating broader adoption [36]
Section 3: Investment Opportunities
- Recommended companies to watch include AI inference chip manufacturers such as Cambricon and Haiguang Information [37]
- Notable server companies include Inspur Information, Zhongke Shuguang, Huaqin Technology, and Digital China [37]
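Because the summary above leans on TileLang's tile-based programming model, a small illustration of the underlying idea may help. The following is a minimal NumPy sketch of blocked (tiled) matrix multiplication, intended only to show the memory-locality concept that tile-oriented GPU languages automate; it is not TileLang code, and the tile size and loop structure are illustrative assumptions.

```python
# Illustrative sketch only: a blocked (tiled) matrix multiply, showing the
# locality idea that tile-based GPU languages such as TileLang automate.
# Not TileLang syntax and not DeepSeek's actual kernels.
import numpy as np

def tiled_matmul(A: np.ndarray, B: np.ndarray, tile: int = 64) -> np.ndarray:
    """Compute C = A @ B by accumulating tile-by-tile partial products."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):          # rows of C handled by one block
        for j in range(0, N, tile):      # columns of C handled by one block
            acc = np.zeros((min(tile, M - i), min(tile, N - j)), dtype=A.dtype)
            for k in range(0, K, tile):  # march over the shared K dimension
                acc += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
            C[i:i+tile, j:j+tile] = acc
    return C

if __name__ == "__main__":
    A = np.random.rand(256, 192).astype(np.float32)
    B = np.random.rand(192, 128).astype(np.float32)
    assert np.allclose(tiled_matmul(A, B), A @ B, rtol=1e-4, atol=1e-4)
```

On a GPU, each (i, j) block would map to a thread block that stages its A and B tiles in fast on-chip memory; the report's claim about memory and scheduling optimization refers to automating exactly this kind of decomposition.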
AI Special Topic: The Three Stages of DeepSeek's Development in the Post-R1 Era
Zhongyuan Securities· 2025-10-14 08:40
Investment Rating
- The report maintains an "Outperform" rating for the computer industry, indicating an expected gain of more than 10% relative to the CSI 300 index over the next six months [41].

Core Insights
- DeepSeek has drawn significant attention since the release of its R1 model earlier this year, and it has since focused on incremental updates rather than launching a more advanced R2 model. Its development falls into three main stages: performance enhancement, adoption of a hybrid reasoning architecture, and cost reduction with accelerated domestic adaptation [7][10].
- The introduction of the V3.2-Exp model has brought a substantial reduction in API calling prices, with input cache-hit prices dropping to 20% of R1's cost and output prices to 19%, enhancing the model's cost-effectiveness and market competitiveness [33][34].

Summary by Sections
Stage One: Performance Enhancement
- In March DeepSeek launched V3-0324 and in May R1-0528, which improved model capabilities through post-training and narrowed the gap with leading models [11][12].
Stage Two: Hybrid Reasoning Architecture and Agent Capability Enhancement
- From August onwards, DeepSeek aligned with global trends by releasing V3.1 and V3.1-Terminus, significantly enhancing agent capabilities and inference efficiency through extensive training on the DeepSeek-V3.1-Base model [12][18].
Stage Three: Efficiency Improvement and Accelerated Domestic Adaptation
- The V3.2-Exp model, released in September, introduced a new attention mechanism (DSA) that improved training and inference efficiency while significantly lowering costs. The model also marked a milestone for the domestic AI industry, achieving day-0 adaptation with domestic chips from Huawei and Cambricon [31][34].
Global Technology (Computers) Industry Weekly: DeepSeek-V3.2-Exp Released, Training and Inference Efficiency Improved, API Prices Cut in Step - 20251012
Huaan Securities· 2025-10-12 12:02
Global Technology (Computers) Industry Weekly: DeepSeek-V3.2-Exp Released, Training and Inference Efficiency Improved, API Prices Cut in Step
Industry rating: Overweight
Report date: 2025-10-12
[Chart: industry index versus CSI 300 performance comparison; series shown: CSI 300, Computers (SWS)]
Analysts: Jin Rong (license no. S0010521080002, jinrong@hazq.com); Lai Zuohao (license no. S0010524100001, laizh@hazq.com). Contact: Liu Zheng (license no. S0010125070006, liuzheng@hazq.com)
Related reports:
1. Computers: Moore Threads' STAR Market IPO clears review, poised to become the first A-share GPU stock. 2025-09-26
2. Computers: NVIDIA faces a further antitrust probe and invests USD 5 billion in Intel. 2025-09-21
Key views: On September 29, DeepSeek-V ...
Xinchuang ETF (159537) Rises Nearly 6%; DeepSeek-V3.2-Exp Released, Domestic Cloud Vendors Achieve Day-0 Adaptation
Mei Ri Jing Ji Xin Wen· 2025-10-09 03:28
The Xinchuang ETF (159537) tracks the CNI Xinchuang Index (CN5075), which selects listed companies from the Shanghai and Shenzhen markets in information-technology fields such as semiconductors, software development, and computer equipment as its constituents, with an emphasis on reflecting the overall performance of the IT-innovation theme. The constituents have a relatively large average market capitalization, with sector allocation concentrated in semiconductors and software development while also covering computer equipment and IT services, presenting a comprehensive picture of the Xinchuang industry's diversified development.

Guotou Securities (国投证券) notes that on September 29 DeepSeek officially released the DeepSeek-V3.2-Exp model, an experimental version. As an intermediate step toward a new-generation architecture, V3.2-Exp builds on V3.1-Terminus and introduces DeepSeek Sparse Attention (a sparse attention mechanism), carrying out exploratory optimization and validation of training and inference efficiency for long texts. During research on the new model, the company used the high-level language TileLang for rapid prototyping to support deeper exploration. The TileLang used by DeepSeek is an open-source AI operator programming language developed under the leadership of Associate Professor Yang Zhi's team at Peking University's School of Computer Science; its core value lies in automatically converting and optimizing high-level dataflow descriptions into efficient low-level code (such as CUDA or A ...

(Article source: Mei Ri Jing Ji Xin Wen)
DeepSeek and Domestic Chips: A "Two-Way Embrace"
21 Shi Ji Jing Ji Bao Dao· 2025-09-30 23:14
Core Viewpoint
- The release of the DeepSeek-V3.2-Exp model by DeepSeek marks a significant advance for the domestic AI chip ecosystem, introducing a sparse attention mechanism that reduces computational resource consumption and improves inference efficiency [1][7].

Group 1: Model Release and Features
- The DeepSeek-V3.2-Exp model incorporates DeepSeek Sparse Attention, accompanied by API price reductions of 50% to 75% across its official app, web interface, and mini-programs [1].
- The new model received immediate recognition and adaptation from several domestic chip manufacturers, including Cambricon, Huawei, and Haiguang, indicating a collaborative ecosystem [2][6].

Group 2: Industry Impact and Ecosystem Development
- The rapid adaptation of DeepSeek-V3.2-Exp by multiple companies suggests a growing consensus within the domestic AI industry about the model's significance, positioning DeepSeek as a benchmark for domestic open-source models [2][5].
- The domestic chip industry, which largely operates on a "fabless" model, is expected to progress quickly as it aligns with standards defined by DeepSeek, seen as a key player in shaping the industry's future [4][5].

Group 3: Comparison with Global Standards
- DeepSeek's swift establishment of an ecosystem contrasts with NVIDIA's two-decade build-out of its CUDA platform, highlighting the rapid evolution of the domestic AI landscape [3][8].
- The collaboration of major internet companies such as Tencent and Alibaba in adapting to domestic chips further underscores the expanding synergy within the AI hardware and software ecosystem [8].
DeepSeek and Domestic Chips Begin a "Two-Way Embrace"
21 Shi Ji Jing Ji Bao Dao· 2025-09-30 12:13
Core Insights
- DeepSeek has released the DeepSeek-V3.2-Exp model, introducing a sparse attention mechanism that significantly reduces computational resource consumption and improves inference efficiency [1].
- The new model has been accompanied by API price reductions of 50% to 75% [1].
- The release prompted immediate recognition and adaptation from several domestic chip manufacturers, indicating growing synergy within the domestic AI hardware and software ecosystem [1][2].

Group 1: Model Release and Features
- The DeepSeek-V3.2-Exp model incorporates the DeepSeek Sparse Attention mechanism, optimizing training and inference efficiency for long texts [5].
- The model remains compatible with CUDA and uses TileLang for rapid prototyping, with lower-level language implementations targeted for higher efficiency [5][6].
- The release of V3.2-Exp marks a clear shift from the previous V3.1 launch, when no company proactively announced support for its "UE8M0 floating-point format" [4][5].

Group 2: Industry Response and Ecosystem Development
- Within four minutes of the model's release, Cambricon announced its adaptation of DeepSeek-V3.2-Exp and open-sourced its large-model inference engine [2].
- Huawei and Haiguang quickly followed suit, demonstrating the domestic chip industry's rapid response to the new model [2].
- The consensus within the domestic AI industry around the DeepSeek model has allowed the company to take the lead in defining standards for domestic chips [3][4].

Group 3: Competitive Landscape
- The rapid development of the domestic chip ecosystem is highlighted by the swift adaptation of major players such as Tencent and Alibaba, which are actively integrating domestic chips into their cloud computing services [6].
- Experts believe DeepSeek's emergence has accelerated the pace of domestic chip development, with expectations of significant advances in 2025 [3].
Huawei Ascend and Cambricon Announce Adaptation of DeepSeek's Latest Model
21 Shi Ji Jing Ji Bao Dao· 2025-09-30 10:19
Core Insights
- DeepSeek officially launched the DeepSeek-V3.2-Exp model on September 29, introducing the self-developed DeepSeek Sparse Attention (DSA) mechanism, which optimizes training and inference efficiency for long texts [1][7].
- The release of the new model has brought a sharp reduction in service costs, with DeepSeek API prices dropping by more than 50% [2][10].
- The open-sourcing of the TileLang version of the operators has drawn considerable attention within the industry [3].

Technical Innovations
- The DSA mechanism is an optimization of the Transformer architecture that addresses the computational cost of traditional dense attention, which grows quadratically with text length (a toy sketch of the general sparse-attention idea follows this summary) [6][7].
- The V3.2-Exp model achieves substantial improvements in training and inference efficiency for long texts while maintaining performance on par with the previous V3.1-Terminus model [7].

Market Impact
- DeepSeek has fully open-sourced the V3.2-Exp model on platforms such as HuggingFace and ModelScope, with the accompanying research paper also published [5].
- The collaboration with domestic hardware providers such as Huawei, Cambricon, and Haiguang demonstrates the growing synergy between China's AI software and hardware ecosystems [11][12].
- The adoption of TileLang, a programming language designed to simplify GPU operator development, is expected to significantly improve the efficiency of AI operator development [12].
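To make the sparse-attention discussion above concrete, here is a toy NumPy sketch of a generic top-k sparse attention: each query keeps only its k highest-scoring keys. This is an illustration of the selection idea only; it is not DeepSeek's DSA (whose learned indexing and selection scheme is described in their paper), and a real sparse-attention kernel would avoid materializing the full score matrix in the first place.

```python
# Toy sketch of generic top-k sparse attention, not DeepSeek's actual DSA design.
# Each query attends only to its k highest-scoring keys, so the amount of useful
# work scales with k rather than with the full key count.
import numpy as np

def topk_sparse_attention(Q, K, V, k=8):
    """Single-head attention where each query keeps only its k largest scores."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n_q, n_k) raw scores
    # Keep only each row's top-k entries; mask the rest to -inf before softmax.
    kth = np.partition(scores, -k, axis=-1)[:, -k:].min(axis=-1, keepdims=True)
    masked = np.where(scores >= kth, scores, -np.inf)
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                               # (n_q, d_v)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((128, 64)) for _ in range(3))
    out = topk_sparse_attention(Q, K, V, k=16)
    print(out.shape)  # (128, 64)
```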
华为昇腾、寒武纪宣布适配DeepSeek最新模型
21 Shi Ji Jing Ji Bao Dao· 2025-09-30 10:13
Core Viewpoint
- DeepSeek has officially released the V3.2-Exp model, introducing the DeepSeek Sparse Attention (DSA) mechanism, which optimizes training and inference efficiency for long texts and has been accompanied by a reduction of more than 50% in DeepSeek API service costs [1][5].

Group 1: Model Development
- The V3.2-Exp model builds on the V3.1-Terminus version and incorporates the DSA mechanism, a sparse-attention approach that reduces computational complexity when processing long texts [1][4].
- DSA adaptively selects key attention heads and local context windows, improving efficiency and lowering costs compared with traditional dense attention mechanisms [3][4].

Group 2: Cost and Accessibility
- The introduction of the new model has significantly reduced the cost of accessing the DeepSeek API, with prices dropping by more than 50% [5].
- DeepSeek has temporarily retained additional API access to the previous V3.1-Terminus model until October 15, allowing users to run comparative tests [2].

Group 3: Open Source and Community Engagement
- DeepSeek has fully open-sourced the V3.2-Exp model on platforms such as HuggingFace and ModelScope, along with the accompanying research paper [2].
- The company has also open-sourced the TileLang version of the operators, which has drawn significant attention in the industry [1][6].

Group 4: Hardware Compatibility
- Following the release of V3.2-Exp, major domestic hardware companies such as Huawei, Cambricon, and Haiguang announced compatibility with the new model, indicating collaborative development within the domestic AI ecosystem [6][10].
- TileLang, a programming language developed to simplify GPU operator development, is recommended for use in research experiments and improves the efficiency of AI operator development [7][10].
DeepSeek Suddenly Embraces a Domestic GPU Language: TileLang Benchmarks Against CUDA and Replaces Triton, with Huawei Ascend Announcing Day-0 Adaptation Support
36Ke· 2025-09-30 02:52
DeepSeek V3.2 includes a new change that is not mentioned in the paper at all and appears only once in the official announcement, yet it has attracted intense attention.

The open-sourcing of the TileLang version of the operators has drawn even more attention than the new sparse attention mechanism DSA, as can be seen from the number of annotated reposts.

The overseas community has also noticed that DeepSeek used it instead of the Triton language developed by OpenAI.

Developers who have worked with it describe TileLang as a very elegant language: in fewer than 100 lines of code one can write an attention implementation that runs 30% faster than the original Flash Attention 2 (a NumPy sketch of the blocked-attention pattern appears after this article).

So what is TileLang, and why is it drawing so much attention?

First, TileLang is a domain-specific language designed for developing GPU kernels, with performance that can match NVIDIA's CUDA. DeepSeek officially recommends using this version for experiments, as it offers advantages in ease of debugging and rapid iteration.

More importantly, TileLang fits the domestic computing-power ecosystem; even Huawei Ascend announced its support for TileLang at the first opportunity.

At the developer day of Huawei Connect 2025 a few weeks earlier, TileLang team member Dong Yuqi presented how TileLang implements FlashAttention operator development, cutting the code from 500+ lines to 80 lines while keeping performance on par with the official version.

In addition, TileLang team member Wang Lei, together with a senior ... at Muxi (沐曦) Integrated Circuits ...
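As a rough companion to the FlashAttention claims above, the following NumPy sketch shows the blocked, online-softmax pattern that FlashAttention-style kernels are built on and that TileLang is reported to express in well under 100 lines. The block size, shapes, and pure-NumPy form are assumptions chosen for readability; this is neither the TileLang implementation nor Flash Attention 2 itself.

```python
# Minimal sketch of blocked attention with a running (online) softmax, the pattern
# behind FlashAttention-style kernels. Illustrative only; not TileLang or FA2 code.
import numpy as np

def blocked_attention(Q, K, V, block=32):
    """Attention computed key-block by key-block, keeping running softmax statistics."""
    n_q, d = Q.shape
    out = np.zeros((n_q, V.shape[1]))
    m = np.full((n_q, 1), -np.inf)   # running row-wise max of scores
    l = np.zeros((n_q, 1))           # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start+block], V[start:start+block]
        s = Q @ Kb.T / np.sqrt(d)                     # scores against this key block
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        p = np.exp(s - m_new)                         # block's unnormalised weights
        scale = np.exp(m - m_new)                     # rescale previously accumulated stats
        l = l * scale + p.sum(axis=-1, keepdims=True)
        out = out * scale + p @ Vb
        m = m_new
    return out / l

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    Q, K, V = (rng.standard_normal((64, 32)) for _ in range(3))
    ref = np.exp(Q @ K.T / np.sqrt(32))
    ref = (ref / ref.sum(-1, keepdims=True)) @ V
    assert np.allclose(blocked_attention(Q, K, V), ref)
```

The point of the pattern is that scores, weights, and outputs are produced one key block at a time, so the full attention matrix never needs to be held in memory; tile-oriented languages let this structure be written almost as directly as the sketch above.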
DeepSeek Open-Sources TileLang and CUDA Operators: A Key Attempt at Domestic Substitution in the AI Foundation Layer
小熊跑的快· 2025-09-30 01:11
Core Viewpoint
- DeepSeek's release of TileLang and CUDA operator versions represents a significant step toward "independence and control" in AI foundational technology, particularly in GPU operator development, addressing technical autonomy, domestic hardware compatibility, ecosystem collaboration, and innovation efficiency [2][11].

Group 1: Breaking the CUDA Monopoly
- The dominance of CUDA, a closed-source platform led by NVIDIA, poses risks of technological dependency for domestic developers, limiting their ability to customize operators for new model research [2][3].
- Domestic GPUs, despite improving in computational power, face high migration costs because they lack operator libraries and development tools compatible with CUDA [3][5].

Group 2: Lowering Barriers for Domestic Hardware
- DeepSeek's open-source approach with TileLang allows developers to quickly validate operator logic without relying on CUDA, reducing dependency on NVIDIA [4][6].
- The dual-version approach provides a precision baseline for domestic platforms, making it easier to verify operator implementations and lowering debugging costs (a sketch of such a baseline check follows this summary) [4][6].

Group 3: Activating Open-Source Community Collaboration
- The success of domestic alternatives depends on ecosystem collaboration; DeepSeek's open-source initiative encourages community participation in developing new operators [7][8].
- Researchers can quickly develop and share new operator prototypes using TileLang, which domestic hardware manufacturers can then adapt [8].

Group 4: Accelerating Domestic Research Pathways
- Reliance on CUDA and its tools can hinder innovation in cutting-edge fields such as large models and multimodal research, creating an "optimization black box" [9][10].
- DeepSeek's dual-version operators provide a pathway for domestic teams to innovate without the constraints of CUDA compatibility and licensing [10][11].

Group 5: From Single-Point Replacement to Ecosystem Breakthrough
- DeepSeek's actions signal a shift from passive following to active construction of the domestic AI foundational technology stack, addressing the high barriers, long cycles, and adaptation difficulties of GPU operator development [11].
- The approach of using open source to break monopolies, abstract away complexity, and foster collaboration may become a key paradigm for domestic alternatives in AI foundational technology [11].
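The "precision baseline" point in Group 2 above is, in practice, a numerical-equivalence test between a trusted reference operator and a ported or optimized one. The sketch below shows that workflow with a hypothetical softmax operator and tolerances chosen purely for illustration; it is not DeepSeek's actual validation harness.

```python
# Illustrative "precision baseline" check: compare a candidate operator against a
# trusted reference on the same inputs within a tolerance. The operator, shapes,
# and tolerances are assumptions for demonstration only.
import numpy as np

def softmax_reference(x: np.ndarray) -> np.ndarray:
    """Numerically stable reference softmax (the 'golden' version)."""
    shifted = x - x.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def softmax_candidate(x: np.ndarray) -> np.ndarray:
    """Stand-in for a ported/optimised kernel, here emulating reduced precision."""
    x_lowp = x.astype(np.float16).astype(np.float32)
    return softmax_reference(x_lowp)

def check_against_baseline(fn, ref, shape=(8, 1024), rtol=1e-2, atol=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape).astype(np.float32)
    got, expected = fn(x), ref(x)
    max_err = np.abs(got - expected).max()
    ok = np.allclose(got, expected, rtol=rtol, atol=atol)
    print(f"max abs error = {max_err:.3e}, within tolerance: {ok}")
    return ok

if __name__ == "__main__":
    assert check_against_baseline(softmax_candidate, softmax_reference)
```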