Google Cluster Teardown
HTSC· 2025-11-27 08:52
Report Industry Investment Rating

No relevant content provided.

Core Viewpoints

The report analyzes Google's clusters in depth, covering scale-up (3D structure and optical interconnection) and scale-out, and also compares GPU architectures from NVIDIA and AMD [1][2].

Summary by Directory

1. Google Cluster's Scale-up: 3D Structure

- **TPU Architecture**: The Ironwood TPU architecture includes high-performance compute components such as TensorCore, XLU, and VPU, connected by high-speed ICI. It uses HBM3 and HBM3E memory and scales up to 9216 chips [11][12].
- **From TPU to TPU Rack**: A TPU Tray contains 4 Ironwood TPUs, and a TPU Rack consists of 16 TPU Trays, for 64 TPU chips per rack. The rack has a specific physical structure and cooling system [28][29].
- **Comparison with Other GPUs**: The report compares the GPU architectures of NVIDIA (from Hopper to Blackwell) and AMD (from MI350 to MI400), highlighting their different interconnect technologies and performance parameters [20][25].

2. Google Cluster's Scale-up Optical Interconnection: Optical Path Switch

- **Optical Switch Components**: The optical path switch uses components such as 850 nm camera modules, dichroic beam splitters, fiber collimators, and 2D MEMS micromirrors to separate or combine calibration light and signal light [46].
- **TPU SuperPod Structure**: A TPU SuperPod consists of 64 Google racks, divided into 8 groups of 8 racks. It integrates 4096 chips sharing 256 TiB of HBM memory, with a total computing performance of over 1 ExaFLOP. Each group of 8 racks has a CDU for liquid cooling [60].

3. TPU Cluster: Proportion of Optical Path Switches and Optical Modules

- **TPU V4**: With 4096 TPUs, optical path switches amount to 1.1% of the TPU count, and the ratio of optical modules to TPUs is 1.5 [70][84].
- **TPU V7**: With 9216 TPUs, optical path switches amount to 0.52% of the TPU count, and the ratio of optical modules to TPUs is also 1.5 [75][89].
- **Rack-level Data**: A single rack has 6 × 16 external optical modules, 4 × 16 PCB traces, and 80 copper cables [94].

4. Google Cluster's Scale-out

- **Switch Parameters**: The Tomahawk 5 switch has 128 400G ports [103].
- **Communication Outside TPU SuperPod**: Communication outside the TPU SuperPod goes through the Data-center Network (DCN), which includes optical circuit switches and physical fibers [106][108].
- **NV Scale-out OCS**: In NVIDIA's scale-out network, OCS is used in a redundant spine-leaf structure, which improves network resilience [113][114].
- **Comparison of Interconnection Schemes in a 100,000-card Cluster**: The report compares the InfiniBand, NVIDIA Spectrum-X, and Broadcom Tomahawk 5 interconnection schemes in terms of switch count, optical module count, cost, and other factors [125].
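
The counts quoted in the summary can be cross-checked with simple arithmetic. The sketch below is only an illustration: the variable names are hypothetical, and it assumes that the "proportion" of optical path switches means switch count divided by TPU count, and that the optical-module ratio means external optical modules per TPU. The report itself may define these figures differently.

```python
# Figures taken from the summary; all names here are illustrative only.
TPUS_PER_TRAY = 4
TRAYS_PER_RACK = 16
RACKS_PER_SUPERPOD = 64                    # 8 groups of 8 racks
EXTERNAL_OPTICAL_MODULES_PER_RACK = 6 * 16
PCB_TRACES_PER_RACK = 4 * 16
COPPER_CABLES_PER_RACK = 80

# Rack and SuperPod composition.
tpus_per_rack = TPUS_PER_TRAY * TRAYS_PER_RACK             # 64
tpus_per_superpod = tpus_per_rack * RACKS_PER_SUPERPOD     # 4096

# Assumption: optical-module ratio = external optical modules per TPU.
optical_modules_per_tpu = EXTERNAL_OPTICAL_MODULES_PER_RACK / tpus_per_rack

# Assumption: switch proportion = optical path switches / TPUs,
# so the implied switch count is proportion * TPU count.
implied_ocs_v4 = 0.011 * 4096
implied_ocs_v7 = 0.0052 * 9216

# Aggregate port bandwidth of one Tomahawk 5 switch (128 ports x 400G).
tomahawk5_tbps = 128 * 400 / 1000

print("TPUs per rack:", tpus_per_rack)
print("TPUs per SuperPod:", tpus_per_superpod)
print("Optical modules per TPU:", optical_modules_per_tpu)
print("Implied optical path switches (V4):", round(implied_ocs_v4, 1))
print("Implied optical path switches (V7):", round(implied_ocs_v7, 1))
print("Tomahawk 5 aggregate bandwidth (Tbps):", tomahawk5_tbps)
```

Under these assumptions the rack-level data agrees with the cluster-level ratio: 6 × 16 external optical modules across 64 TPUs gives 1.5 modules per TPU, and 64 racks of 64 chips give the 4096-chip SuperPod.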