Model FLOPs Utilization (MFU)

H100 vs GB200 NVL72 Training Benchmarks: Power, Total Cost of Ownership (TCO), and Reliability Analysis, Software Improvements Over Time (SemiAnalysis)
2025-08-20 14:50
Summary of Conference Call Notes

Company and Industry

- The discussion centers on Nvidia's GPU products, specifically the H100 and the GB200 NVL72, and their performance in machine learning training environments.

Core Points and Arguments

1. **Benchmarking and Performance Analysis**
   - The report presents benchmark results from over 2,000 H100 GPUs, analyzing metrics such as model FLOPs utilization (MFU), total cost of ownership (TCO), and cost per 1 million tokens trained [5][6][12]
   - The analysis includes energy consumption comparisons, framing power efficiency in a societal context by comparing GPU energy use to average U.S. household energy usage [5][6]
2. **Cost Analysis**
   - The price of an H100 server has decreased to approximately $10,000, with total upfront capital costs reaching around $250,000 for a typical hyperscaler [14]
   - The GB200 NVL72 server costs about $1.1 million per rack, with all-in costs reaching approximately $1.5 million per rack [15]
   - The all-in capital cost per GPU for the GB200 NVL72 is estimated to be 1.1x to 1.7x that of the H100 [15]
3. **Operational Costs**
   - The operational cost per GPU for the GB200 NVL72 is not significantly higher than that of the H100, but the GB200 consumes 1200W per chip compared to 700W for the H100, which raises overall operational expenses [17][18]
   - Total cluster operating costs per month per GPU are $249 for the H100 and $359 for the GB200 NVL72 [19]
4. **Reliability Issues**
   - The GB200 NVL72 currently faces reliability challenges: no large-scale training runs have been completed on it yet, because the software is still maturing [7][8]
   - Nvidia is expected to work closely with partners to address these reliability issues, which are critical for the ecosystem's success [8]
5. **Software Improvements**
   - Training throughput has improved significantly, with MFU increasing from 2.5% to 5% over 12 months, attributed to software optimizations [31][33]
   - The cost to train GPT-175B decreased from $218,000 in January 2022 to $12,000 by December 2022, showing the impact of software enhancements on cost efficiency [34]
6. **Recommendations for Nvidia**
   - Expand benchmarking efforts and increase transparency to aid decision-making in the ML community [22][24]
   - Broaden the benchmarking focus beyond NeMo-MegatronLM to include native PyTorch frameworks [25]
   - Accelerate the development of diagnostics and debugging tools for the GB200 NVL72 to improve reliability [25]

Other Important Content

- The report emphasizes the importance of effective training and the need for Nvidia to address reliability challenges to maintain competitiveness in the GPU market [6][8]
- The analysis of power consumption indicates that training large models like GPT-175B requires significant energy, equivalent to the annual consumption of multiple U.S. households [35][48]
- The discussion of scaling performance distinguishes strong scaling from weak scaling of compute resources, a distinction that matters when optimizing training processes [39][40]
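
To make the MFU metric and the per-GPU cost comparison above more concrete, here is a minimal Python sketch. It rests on several assumptions that are not in the summary: the common estimate of roughly 6 FLOPs per parameter per trained token for dense transformers, a peak of about 989 TFLOPS (dense BF16) per H100, and purely hypothetical throughput and cluster-size numbers in the example call. The function and constant names are illustrative only. The operating-cost and power figures are the ones quoted in the summary.

```python
# Sketch only: illustrates how MFU and per-GPU cost ratios could be computed.
# Assumptions (not from the summary): the 6 * N FLOPs-per-token estimate,
# the ~989 TFLOPS dense BF16 peak for H100, and the example throughput.

H100_PEAK_BF16_FLOPS = 989e12  # assumed dense BF16 peak per H100, in FLOP/s


def mfu(params: float, tokens_per_second: float,
        num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Model FLOPs Utilization: achieved model FLOP/s over peak FLOP/s.

    Uses the approximation that training a dense transformer costs about
    6 FLOPs per parameter per token (forward plus backward pass).
    """
    achieved_flops_per_second = 6.0 * params * tokens_per_second
    peak_flops_per_second = num_gpus * peak_flops_per_gpu
    return achieved_flops_per_second / peak_flops_per_second


# Figures quoted in the summary.
H100_OPEX_PER_GPU_MONTH = 249.0    # USD per GPU per month
GB200_OPEX_PER_GPU_MONTH = 359.0   # USD per GPU per month
H100_WATTS_PER_CHIP = 700.0
GB200_WATTS_PER_CHIP = 1200.0

# How much more a GB200 NVL72 GPU costs to operate, and how much more power
# it draws, relative to an H100.
opex_ratio = GB200_OPEX_PER_GPU_MONTH / H100_OPEX_PER_GPU_MONTH
power_ratio = GB200_WATTS_PER_CHIP / H100_WATTS_PER_CHIP

# Hypothetical run: a 175B-parameter model on 128 H100s at 50,000 tokens/s.
example_mfu = mfu(params=175e9, tokens_per_second=50_000,
                  num_gpus=128, peak_flops_per_gpu=H100_PEAK_BF16_FLOPS)

if __name__ == "__main__":
    print(f"Operating cost ratio (GB200 / H100): {opex_ratio:.2f}x")
    print(f"Power ratio per chip (GB200 / H100): {power_ratio:.2f}x")
    print(f"Example MFU (hypothetical inputs):   {example_mfu:.1%}")
```

Read this way, the operating-cost ratio is a rough break-even bar: considering operating cost alone and ignoring capital cost, a GB200 NVL72 GPU would need to deliver at least that multiple of an H100's training throughput to match it on operating cost per token.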