GB200 shipment forecasts revised upward, but the NVL72 is not yet used for large-scale training
傅里叶的猫· 2025-08-20 11:32
Core Viewpoint
- The article compares the performance and cost of NVIDIA's H100 and GB200 NVL72 GPU systems, highlighting the potential advantages and current challenges of the GB200 NVL72 for AI training [30][37].

Group 1: Market Predictions and Performance
- After the ODM earnings announcements, institutions raised the 2025 forecast for GB200/300 rack shipments from 30,000 to 34,000, with 11,600 units expected in Q3 and 15,700 in Q4 [3].
- Foxconn anticipates a 300% quarter-over-quarter increase in AI rack shipments, projecting 19,500 units for the year, roughly 57% of the market [3].
- By 2026, even with stable NVIDIA chip output, downstream assemblers could build more than 60,000 racks, thanks to an estimated 2 million Blackwell chips carried over [3].

Group 2: Cost Analysis
- Total capital expenditure (capex) is approximately $250,866 for an 8-GPU H100 server versus roughly $3,916,824 for a 72-GPU GB200 NVL72 rack, making the GB200 NVL72 about 1.6 to 1.7 times more expensive per GPU [12][13].
- Operational expenditure (opex) for the GB200 NVL72 is slightly higher than for the H100, mainly due to higher per-GPU power consumption (1200W vs. 700W) [14][15].
- The total cost of ownership (TCO) of the GB200 NVL72 is about 1.6 times that of the H100, so the GB200 NVL72 needs at least a 1.6x performance advantage to be attractive for AI training [15][30].

Group 3: Reliability and Software Improvements
- As of May 2025, the GB200 NVL72 had not been widely adopted for large-scale training due to software maturity and reliability issues; the H100 and Google TPU remained the mainstream options [11].
- Reliability is a significant concern: early operators faced numerous XID 149 errors, which complicate diagnostics and maintenance [34][36].
- Software optimizations, particularly in the CUDA stack, are expected to improve GB200 NVL72 performance significantly, but reliability remains the bottleneck [37].
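The per-GPU capex comparison above follows directly from the system prices and GPU counts; a quick calculation makes the 1.6-1.7x figure concrete. The electricity price and four-year service life below are illustrative assumptions, not figures from the article.

```python
# Per-GPU capex comparison, using the article's system prices.
h100_server_capex = 250_866      # USD, 8-GPU H100 server (from article)
gb200_rack_capex = 3_916_824     # USD, 72-GPU GB200 NVL72 rack (from article)

h100_per_gpu = h100_server_capex / 8
gb200_per_gpu = gb200_rack_capex / 72
capex_ratio = gb200_per_gpu / h100_per_gpu

print(f"H100 capex/GPU:  ${h100_per_gpu:,.0f}")
print(f"GB200 capex/GPU: ${gb200_per_gpu:,.0f}")
print(f"Per-GPU capex ratio: {capex_ratio:.2f}x")

# Rough per-GPU power cost over an ASSUMED 4-year life at an ASSUMED
# $0.10/kWh, using the article's 700W vs. 1200W per-GPU draw.
hours = 4 * 365 * 24
price_kwh = 0.10
h100_power_cost = 700 / 1000 * hours * price_kwh
gb200_power_cost = 1200 / 1000 * hours * price_kwh
print(f"4-yr power cost/GPU: H100 ${h100_power_cost:,.0f}, GB200 ${gb200_power_cost:,.0f}")
```

The per-GPU capex ratio comes out to about 1.73x, consistent with the article's 1.6-1.7x range, and the higher power draw adds a proportionally larger opex term on top.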
Group 4: Future Outlook
- By July 2025, the GB200 NVL72's performance/TCO is projected to reach 1.5 times that of the H100, with further improvements expected to make it an increasingly favorable option [30][32].
- The GB200 NVL72's architecture enables faster execution in certain scenarios, such as MoE (Mixture of Experts) models, which could sharpen its competitive edge in the market [33].
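A 1.5x performance/TCO advantage combined with the roughly 1.6x TCO cited earlier implies a raw performance multiple; the arithmetic below is my own check, not a figure stated in the article.

```python
# Implied raw performance multiple from the article's two ratios.
tco_ratio = 1.6        # GB200 NVL72 TCO relative to H100 (from article)
perf_per_tco = 1.5     # projected perf/TCO advantage by July 2025 (from article)

# perf/TCO advantage = perf_ratio / tco_ratio  =>  perf_ratio = perf_per_tco * tco_ratio
implied_perf_ratio = perf_per_tco * tco_ratio
print(f"Implied raw performance vs. H100: {implied_perf_ratio:.1f}x")  # 2.4x
```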
SemiAnalysis: Why does almost no one outside the CSPs use AMD's GPUs?
傅里叶的猫· 2025-05-23 15:46
Core Viewpoint
- The article provides a comprehensive comparison of inference performance, total cost of ownership (TCO), and market dynamics for NVIDIA and AMD GPUs, explaining why AMD products see little use outside of large-scale cloud service providers [1][2].

Testing Background and Objectives
- The research team ran a six-month analysis to test the claim that AMD's AI servers beat NVIDIA's on TCO and inference performance, finding that the results vary substantially across workloads [2][5].

Performance Comparison
- For customers using vLLM/SGLang, single-node H200 deployments sometimes offer better performance per dollar (perf/$), while the MI325X can win depending on workload and latency requirements [5].
- In most scenarios the MI300X is not competitive against the H200, but it outperforms the H100 on specific models such as Llama3 405B and DeepSeekV3 670B [5].
- For short-term GPU rentals, NVIDIA consistently offers better cost performance because far more providers rent it out; AMD's rental supply is limited, which keeps its prices high [5][26].

Total Cost of Ownership (TCO) Analysis
- AMD's MI300X and MI325X generally have lower hourly costs than NVIDIA's H100 and H200, with the MI300X at $1.34 per hour and the MI325X at $1.53 per hour [21].
- Capital cost makes up a large share of the total: 70.5% for the MI300X [21].

Market Dynamics
- AMD's share of the AI GPU market has been growing steadily but is expected to decline in early 2025 with the launch of NVIDIA's Blackwell series, since AMD's competing products will not ship until later [7].
- The rental market for AMD GPUs is constrained, with few providers, leading to artificially high prices and weaker competitiveness against NVIDIA [26][30].
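The hourly-cost figures above decompose into capital and operating components; a short sketch using the article's MI300X numbers ($1.34/hr all-in, 70.5% capital share). The MI325X capital share is not given in the summary, so only the MI300X split is computed.

```python
# Split an all-in GPU hourly cost into its capital and operating parts.
def split_hourly_cost(total_per_hour: float, capital_share: float) -> tuple[float, float]:
    """Return (capital $/hr, opex $/hr) given the total and the capital fraction."""
    capital = total_per_hour * capital_share
    opex = total_per_hour - capital
    return capital, opex

# Article's figures: MI300X at $1.34/hr with a 70.5% capital-cost share.
mi300x_capital, mi300x_opex = split_hourly_cost(1.34, 0.705)
print(f"MI300X: capital ${mi300x_capital:.3f}/hr, opex ${mi300x_opex:.3f}/hr")
```

The capital component works out to roughly $0.94/hr, which is why purchase price (and hence depreciation) dominates the rental economics for these accelerators.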
Benchmark Testing Methodology
- The benchmarks focused on real-world online inference workloads, measuring throughput and latency under various user loads, unlike traditional offline benchmarks [10][11].
- Testing covered a range of input/output token lengths to assess performance across different inference scenarios [11][12].

Benchmark Results
- On Llama3 70B FP16, the MI325X and MI300X outperformed all other GPUs in low-latency scenarios, while the H200 was superior under high concurrency [15][16].
- On Llama3 405B FP8, the MI325X consistently outperformed the H100 and H200 across latency conditions, particularly high-latency ones [17][24].

Conclusion on AMD's Market Position
- The article concludes that AMD must lower rental prices to compete effectively with NVIDIA in the GPU rental market, as its current pricing undermines its competitiveness [26][30].
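The online-benchmark methodology described above (sweeping throughput and latency across concurrency levels) can be sketched as a minimal load generator. This is an illustrative skeleton, not the article's actual harness: `send_request` is a placeholder that simulates an inference call, and the token lengths are arbitrary examples.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def send_request(prompt_tokens: int, output_tokens: int) -> float:
    """Placeholder for one inference call; returns end-to-end latency in seconds.
    A real harness would send the request to a serving endpoint here."""
    start = time.perf_counter()
    time.sleep(0.001 * output_tokens / 100)  # simulated decode time
    return time.perf_counter() - start

def sweep(concurrency_levels, n_requests=32, prompt_tokens=1024, output_tokens=256):
    """Measure throughput (req/s) and mean latency at each concurrency level."""
    results = {}
    for c in concurrency_levels:
        t0 = time.perf_counter()
        with ThreadPoolExecutor(max_workers=c) as pool:
            latencies = list(pool.map(
                lambda _: send_request(prompt_tokens, output_tokens),
                range(n_requests)))
        wall = time.perf_counter() - t0
        results[c] = (n_requests / wall, sum(latencies) / len(latencies))
    return results

for c, (tput, lat) in sweep([1, 4, 16]).items():
    print(f"concurrency={c:2d}  throughput={tput:7.1f} req/s  mean latency={lat*1000:5.1f} ms")
```

Sweeping concurrency like this is what separates online benchmarks from offline ones: a GPU that wins at batch-of-one latency can lose once dozens of users share the same instance, which is exactly the MI325X-vs-H200 pattern the article reports.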