Nvidia B200

Huawei's New Technology Challenges Nvidia
半导体芯闻· 2025-08-28 09:55
Source: compiled from tomshardware.

Using its Hot Chips 2025 presentation as the occasion, Huawei introduced UB-Mesh, a technology intended to unify all interconnects inside and outside the nodes of an AI data center under a single protocol. The company also said it will announce at an event next month that the protocol will be opened to all users free of charge.

The technology aims to replace PCIe, CXL, NVLink, and TCP/IP with a single protocol in order to reduce latency, control costs, and improve the reliability of giga-scale data centers. To push the effort forward, Huawei plans to open-source the specification. But will it gain wide adoption?

"Next month we will hold an event to announce that the UB-Mesh protocol will be open to everyone, like a free license," said Heng Liao (name transliterated), chief scientist of HiSilicon, Huawei's processor division. "This is a very new technology; we see different camps racing to advance standardization. Depending on how successful we are with real-world system deployments and on the needs of partners and customers, we can discuss turning it into some kind of standard."

Although AI data centers used for training and inference should behave like one large parallel processor, they are built from separate racks, servers, CPUs, GPUs, memory, SSDs, NICs, switches, and other components, connected to one another over different buses and protocols such as UPI, PCIe, CX ...
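The latency argument for a single protocol can be made concrete with a toy model: every time a message crosses from one protocol domain to another (say NVLink to PCIe to a TCP/IP NIC), a bridge adds translation overhead that an end-to-end protocol would avoid. All figures below are illustrative assumptions, not UB-Mesh measurements.

```python
# Toy model: cumulative one-way latency of a message traversing a path of
# switching hops, with optional protocol-translation stops at domain bridges.
# Every number here is an illustrative assumption, not a measured value.

def path_latency_ns(hops: int, per_hop_ns: int,
                    bridge_crossings: int, per_bridge_ns: int) -> int:
    """Total latency: fixed cost per switching hop, plus a translation
    penalty each time the message crosses between protocol domains."""
    return hops * per_hop_ns + bridge_crossings * per_bridge_ns

# Mixed stack: six hops with two protocol translations along the way.
mixed = path_latency_ns(hops=6, per_hop_ns=300, bridge_crossings=2, per_bridge_ns=500)

# Unified mesh: same hop count, but no protocol translation anywhere.
unified = path_latency_ns(hops=6, per_hop_ns=300, bridge_crossings=0, per_bridge_ns=500)

print(mixed, unified)  # 2800 1800 under these assumed per-hop/per-bridge costs
```

Under these assumptions the unified path saves exactly the bridge overhead; the real-world case for UB-Mesh would also depend on protocol efficiency, reliability mechanisms, and switch design, none of which this sketch models.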
A 10,000-Word Deep Dive into AMD's CDNA 4 Architecture
半导体行业观察· 2025-06-18 01:26
Core Viewpoint
- AMD's CDNA 4 architecture is a moderate update over CDNA 3, focused on raising matrix-multiplication performance for the low-precision data types that dominate machine-learning workloads [2][26].

Architecture Overview
- CDNA 4 keeps a system-level architecture similar to CDNA 3's: a large chiplet setup with eight accelerator compute dies (XCDs) and a 256 MB memory-side cache [4][20].
- The architecture uses AMD's Infinity Fabric technology for coherent memory access across multiple chips [4].

Performance Comparison
- The MI355X GPU, built on CDNA 4, runs 256 compute units at 2.4 GHz, versus the MI300X's 304 compute units at 2.1 GHz: slightly fewer units, but a higher clock [5].
- The MI355X offers 288 GB of HBM3E memory with 8 TB/s of bandwidth, ahead of Nvidia's B200 at a maximum capacity of 180 GB and 7.7 TB/s [25].

Matrix and Vector Throughput
- CDNA 4 rebalances its execution units toward low-precision matrix multiplication, doubling matrix throughput per compute unit (CU) in many cases [6][39].
- The architecture supports new low-precision data formats, and the matrix-core improvements deliver nearly four times the computational throughput for low-precision formats [46][47].

Local Data Share (LDS) Enhancements
- CDNA 4 raises LDS capacity to 160 KB and doubles read bandwidth to 256 bytes per clock, improving data locality for matrix-multiplication routines [14][48].
- New instructions for reading the LDS transposed optimize memory-access patterns for matrix operations [18].

Memory Hierarchy and Cache
- The memory hierarchy includes a shared 4 MB L2 cache and a 32 KB L1 vector cache per CU, with enhancements for caching non-coherent data from DRAM [49][50].
- The Infinity Cache stays at 256 MB, providing high bandwidth for the memory demands of modern AI workloads [53].

Chiplet Architecture
- CDNA 4 continues a chiplet-based design, letting each chiplet evolve independently for better performance and manufacturability [35][36].
- Each XCD contains 36 compute units organized into arrays, with a focus on maximizing yield and operating frequency [39].

System Communication and Expansion
- The package exposes eight AMD Infinity Fabric links at improved speeds of up to 38.4 Gbps, raising communication bandwidth within server nodes [63].
- The design supports both compatibility with previous generations and progressive improvements for high-performance systems [62].

Conclusion
- AMD's CDNA 4 builds on CDNA 3's success, optimizing for machine-learning workloads while keeping a competitive edge against Nvidia [26][27].
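The new low-precision formats CDNA 4 targets include 4-bit floating point. As a rough illustration of what such a narrow format can represent, here is a host-side sketch of round-to-nearest quantization onto the FP4 (E2M1) value grid; the grid follows the common E2M1 definition (1 sign, 2 exponent, 1 mantissa bit), and nothing here models AMD's actual hardware datapath.

```python
# Sketch: round-to-nearest quantization onto the FP4 (E2M1) value grid.
# Illustrative host-side model of a 4-bit float format, not AMD hardware.

# All magnitudes representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_fp4(x: float) -> float:
    """Map x to the nearest representable FP4 value, saturating at +/-6."""
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), 6.0)  # saturate to the largest representable magnitude
    nearest = min(FP4_GRID, key=lambda g: abs(g - mag))
    return sign * nearest

print([quantize_fp4(v) for v in (0.3, 1.2, -7.0, 5.2)])
# → [0.5, 1.0, -6.0, 6.0]
```

With only eight magnitudes available, quantization error is large per element, which is why low-precision matrix engines are typically paired with scaling factors and higher-precision accumulation; this sketch shows only the per-element rounding step.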
TSMC Upends the Traditional Interposer
半导体芯闻· 2025-06-12 10:04
Core Viewpoint
- The article covers the rise of TSMC's CoWoS packaging technology, driven by surging GPU demand in the AI sector and by TSMC's deepening partnership with NVIDIA [1][3].

Group 1: CoWoS Technology and the NVIDIA Partnership
- NVIDIA has emphasized its reliance on TSMC for CoWoS, stating that it has no alternative partner in this area [1].
- TSMC has reportedly overtaken ASE Group as the largest player in the global packaging market, on the strength of demand for advanced packaging [1].
- NVIDIA's upcoming Blackwell series will use more CoWoS-L packaging, shifting production focus from CoWoS-S to CoWoS-L to meet its GPUs' high bandwidth requirements [3].

Group 2: Challenges and Innovations in CoWoS
- Growing AI-chip sizes challenge CoWoS packaging: larger chips mean fewer candidates per 12-inch wafer [4].
- TSMC faces difficulties with the flux used in CoWoS; it is essential for chip bonding but becomes problematic as interposer size grows [4][5].
- TSMC is exploring flux-free bonding technologies to improve yields and address the problems caused by flux residue [5].

Group 3: Future Developments and Alternatives
- TSMC plans to introduce a CoWoS-L variant with a 5.5x-reticle interposer by 2026, and aims for a record 9.5x-reticle version by 2027 [8].
- The company is also developing CoPoS, which replaces round wafers with panel substrates, allowing higher chip density and efficiency [9][10].
- CoPoS is positioned as a potential alternative to CoWoS-L, targeting high-performance AI and HPC systems [12].

Group 4: Technical Comparisons
- FOPLP and CoPoS both use large panel substrates but differ architecturally: FOPLP has no interposer, while CoPoS keeps one, improving signal integrity for high-performance chips [11].
- CoPoS is transitioning to glass substrates, which offer better performance characteristics than traditional organic substrates [12].
- The shift from round wafers to square panels aims to improve yield and reduce cost, making CoPoS more competitive in the AI and 5G markets [12].

Group 5: Challenges Ahead
- Moving to square-panel technology requires significant investment in materials and equipment, along with solving technical challenges around pattern precision [14].
- Demand for finer RDL line widths poses additional challenges for suppliers, requiring breakthroughs in RDL layout technology [14].

Conclusion
- The outlook for TSMC's packaging technologies is promising, with continuing innovation to meet the semiconductor industry's evolving demands [14].
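The two geometric points above, that larger interposers leave fewer candidates on a round 300 mm wafer, and that square panels pack large dies more efficiently, can be sketched numerically. The formula is the common dies-per-wafer estimate; the 510 x 515 mm panel size and the ~80 x 60 mm interposer (roughly 5.5 reticles at ~858 mm² per reticle) are assumptions for illustration, not TSMC figures.

```python
import math

# Rough comparison: large interposers per round 300 mm wafer vs per square
# panel. Panel and interposer dimensions are illustrative assumptions.

def dies_per_wafer(die_w_mm: float, die_h_mm: float, wafer_d_mm: float = 300.0) -> int:
    """Classic estimate: gross area divided by die area, minus the dies
    lost along the circular wafer edge."""
    area = die_w_mm * die_h_mm
    return math.floor(math.pi * (wafer_d_mm / 2) ** 2 / area
                      - math.pi * wafer_d_mm / math.sqrt(2 * area))

def dies_per_panel(die_w_mm: float, die_h_mm: float,
                   panel_w_mm: float = 510.0, panel_h_mm: float = 515.0) -> int:
    """Square substrate: simple grid packing with no edge-curvature loss."""
    return int(panel_w_mm // die_w_mm) * int(panel_h_mm // die_h_mm)

# A ~5.5-reticle interposer is on the order of 80 x 60 mm (4800 mm^2).
print(dies_per_wafer(80, 60), dies_per_panel(80, 60))  # → 5 48
```

Even allowing for the panel's much larger raw area, the square format also wastes none of it on edge curvature, which is the efficiency argument behind CoPoS; real counts additionally depend on scribe lanes, edge exclusion, and handling margins that this sketch ignores.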
Huawei Reportedly Developing a New AI Chip
半导体芯闻· 2025-04-28 10:15
Source: compiled from Nikkei.

The Wall Street Journal reported on Sunday that China's Huawei Technologies is preparing to test its newest and most powerful AI processor, hoping to replace some of the high-end products of US chip giant Nvidia.

According to the report, people familiar with the matter said Huawei has approached several Chinese technology companies about testing the technical feasibility of the new chip, the Ascend 910D.

The Chinese company hopes the latest version of its Ascend AI processor will be more powerful than Nvidia's H100, the report said, and expects to receive the first samples of the processor as early as late May.

Reuters reported on April 21 that Huawei plans to begin mass shipments of its advanced 910C AI chip to Chinese customers as early as next month.

For years, Huawei and its Chinese peers have struggled to compete with Nvidia in high-end chips, seeking to rival the US company's products for training models, the process of feeding data to algorithms to help them learn to make accurate decisions.

To restrict China's technological development, particularly military advances, Washington has cut off China's access to Nvidia's most advanced AI products, including its flagship ...