NVIDIA HGX B200
Inside the NVIDIA HGX B200 Cluster: A Photo-Heavy Deep Dive
Semiconductor Industry Observation· 2025-08-15 01:19
Core Insights
- The article examines the scale and technology of an NVIDIA HGX B200 AI cluster built from thousands of GPUs, deployed by Lambda in collaboration with Supermicro and Cologix [2][4][13].
Group 1: Cluster Design and Technology
- The cluster relies on air cooling, which speeds deployment and lets GPUs become available for customer rental quickly [4][8].
- Each rack holds Supermicro NVIDIA HGX B200 systems totaling 32 GPUs, for 256 GPUs across the eight-rack unit (see the back-of-envelope sketch after this summary) [5][6].
- The design includes advanced cooling to manage the heat the GPUs generate and keep operation efficient [25][59].
Group 2: Networking and Connectivity
- The cluster features a robust networking infrastructure, including NVIDIA BlueField-3 DPUs providing 400Gbps of bandwidth and multiple 400Gbps NVIDIA NDR InfiniBand cards [22][37].
- Each GPU server carries numerous network connections, supporting communication across the cluster and with external storage [37][45].
- The networking is built for the high-capacity data transfer that data-heavy AI workloads demand [45][47].
Group 3: Power and Infrastructure
- The Cologix data center has a power capacity of 36MW, with power distribution managed through advanced systems to ensure reliability [64][67].
- The cluster is backed by a combination of traditional computing resources and high-speed storage, such as VAST Data, to meet the demands of AI applications [52][54].
- The infrastructure includes many components critical to operating the AI cluster, underscoring the complexity of building such systems [83][87].
Group 4: Future Developments and Trends
- Lambda is also adopting liquid cooling in newer cluster designs, such as the NVIDIA GB200 NVL72 [88].
- The article stresses the rapid evolution of AI cluster technology and the need for seamless integration of components to optimize performance [90][92].
- It closes by reflecting on the scale of AI clusters and the intricate details behind their functionality, pointing toward more sophisticated and efficient designs across the industry [95][96].
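The per-rack and fabric figures above lend themselves to a quick back-of-envelope check. The Python sketch below tallies GPUs and aggregate compute-fabric bandwidth for one eight-rack unit; only the 32-GPUs-per-rack, eight-rack, and 400Gbps figures come from the article, while the per-server GPU count and NIC count are assumptions made here for illustration.

```python
# Back-of-envelope tally of the cluster figures quoted above.
# From the article: 32 GPUs per rack, 8 racks per unit, 400 Gbps links.
# Assumptions (not stated in the article): 8 GPUs per HGX B200 server and
# 8 x 400 Gbps NDR NICs per server (a common rail-optimized layout).

GPUS_PER_RACK = 32          # from the article
RACKS_PER_UNIT = 8          # from the article
GPUS_PER_SERVER = 8         # assumption: one HGX B200 baseboard per server
NDR_NICS_PER_SERVER = 8     # assumption: one 400 Gbps NIC per GPU
LINK_GBPS = 400             # from the article

total_gpus = GPUS_PER_RACK * RACKS_PER_UNIT                 # 256
servers = total_gpus // GPUS_PER_SERVER                     # 32
fabric_gbps = servers * NDR_NICS_PER_SERVER * LINK_GBPS     # 102,400

print(f"GPUs per eight-rack unit      : {total_gpus}")
print(f"HGX B200 servers              : {servers}")
print(f"Assumed compute-fabric uplink : {fabric_gbps / 1000:.1f} Tbps")
```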
CoreWeave Becomes First Hyperscaler to Deploy NVIDIA GB300 NVL72 Platform
PR Newswire· 2025-07-03 16:14
Core Viewpoint
- CoreWeave is the first AI cloud provider to deploy NVIDIA's latest GB300 NVL72 systems, aiming for significant global scaling of these deployments [1][5]
Performance Enhancements
- The NVIDIA GB300 NVL72 offers a 10x boost in user responsiveness, a 5x improvement in throughput per watt compared to the previous NVIDIA Hopper architecture, and a 50x increase in output for reasoning model inference (the sketch after this summary applies these multipliers to an illustrative baseline) [2]
Technological Collaboration
- CoreWeave collaborated with Dell, Switch, and Vertiv to establish the initial deployment of the NVIDIA GB300 NVL72 systems, enhancing speed and efficiency for its AI cloud services [3]
Software Integration
- The GB300 NVL72 deployment is integrated with CoreWeave's cloud-native software stack, including CoreWeave Kubernetes Service (CKS) and Slurm on Kubernetes (SUNK), along with hardware-level data integration through Weights & Biases' platform [4]
Market Leadership
- CoreWeave continues to lead in providing first-to-market access to advanced AI infrastructure, expanding its offerings with the new NVIDIA GB300 systems alongside its existing fleet [5]
Benchmark Achievement
- In June 2025, CoreWeave set a record in the MLPerf® Training v5.0 benchmark using nearly 2,500 NVIDIA GB200 Grace Blackwell Superchips, completing training of a complex model in just 27.3 minutes [6]
Company Background
- CoreWeave, recognized as one of the TIME100 most influential companies and featured in the 2024 Forbes Cloud 100 ranking, has been operating data centers across the US and Europe since 2017 [7]
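The responsiveness, efficiency, and inference-output claims above are relative multipliers rather than absolute measurements. The sketch below applies them to an assumed Hopper-generation baseline purely for illustration; the baseline values are placeholders invented here, and only the 10x, 5x, and 50x factors come from the article.

```python
# Illustrative application of the GB300 NVL72 multipliers quoted above.
# The Hopper baseline values are placeholders, not measured numbers.

hopper_baseline = {
    "tokens_per_sec_per_user": 50.0,    # assumed baseline, illustration only
    "tokens_per_sec_per_watt": 2.0,     # assumed baseline, illustration only
    "reasoning_output_tokens": 1.0e6,   # assumed baseline, illustration only
}

gb300_multiplier = {
    "tokens_per_sec_per_user": 10,      # "10x boost in user responsiveness"
    "tokens_per_sec_per_watt": 5,       # "5x improvement in throughput per watt"
    "reasoning_output_tokens": 50,      # "50x increase in reasoning inference output"
}

for metric, base in hopper_baseline.items():
    projected = base * gb300_multiplier[metric]
    print(f"{metric:>24}: {base:>14,.1f} -> {projected:>14,.1f}")
```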
Micron Innovates From the Data Center to the Edge With NVIDIA
GlobeNewswire· 2025-03-18 20:23
Core Insights
- Micron Technology, Inc. is the first and only memory company shipping both HBM3E and SOCAMM products for AI servers, reinforcing its leadership in low-power DDR for data center applications [1][2][3]
Product Innovations
- Micron's SOCAMM, developed in collaboration with NVIDIA, supports the NVIDIA GB300 Grace Blackwell Ultra Superchip, enhancing AI workload performance [2][4]
- The HBM3E 12H 36GB offers 50% higher capacity and 20% lower power consumption than competitors' offerings, while the HBM3E 8H 24GB is also available for various NVIDIA platforms [6][15]
- SOCAMM is described as the fastest, smallest, lowest-power, and highest-capacity modular memory solution, designed for AI servers and data-intensive applications [5][10]
Performance Metrics
- SOCAMM provides over 2.5 times the bandwidth of RDIMMs at the same capacity, allowing faster access to larger datasets [10]
- The HBM3E 12H 36GB delivers significant power savings and improved computational capability for GPUs, essential for AI training and inference applications [4][6]
Market Positioning
- Micron aims to maintain its technology momentum with the upcoming HBM4 solution, expected to boost performance by over 50% compared to HBM3E [7]
- The company is showcasing its complete AI memory and storage portfolio at GTC 2025, emphasizing collaboration with ecosystem partners to meet the growing demands of AI workloads [3][8]
Storage Solutions
- Micron's SSDs, including the 61.44TB 6550 ION NVMe SSD, are designed for high-performance AI data centers, delivering over 44 petabytes of storage per rack (the sketch after this summary checks the implied drive count) [11]
- Integrating Micron LPDDR5X memory on platforms like NVIDIA DRIVE AGX Orin enhances processing performance while reducing power consumption [11]
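The per-rack storage claim can be sanity-checked against the quoted per-drive capacity. The sketch below derives the drive count implied by those two figures, assuming decimal units (1 PB = 1000 TB); the drives-per-rack number is computed here, not stated in the article.

```python
# Sanity check of the "over 44 PB per rack" figure using 61.44 TB drives.
# Assumption: decimal units (1 PB = 1000 TB); drive count is derived, not quoted.
import math

DRIVE_TB = 61.44          # 6550 ION NVMe SSD capacity, from the article
RACK_TARGET_PB = 44.0     # per-rack storage figure, from the article

drives_needed = math.ceil(RACK_TARGET_PB * 1000 / DRIVE_TB)   # 717
rack_capacity_pb = drives_needed * DRIVE_TB / 1000            # ~44.1

print(f"Drives implied for {RACK_TARGET_PB:.0f} PB per rack : {drives_needed}")
print(f"Resulting rack capacity            : {rack_capacity_pb:.1f} PB")
```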