Google Cluster Teardown
HTSC·2025-11-27 08:52

Report Industry Investment Rating
No relevant content provided.

Core Viewpoints
The report analyzes Google's clusters in depth, covering both Scale-up (3D structure and optical interconnection) and Scale-out, and also compares the GPU architectures of NVIDIA and AMD [1][2].

Summary by Directory

1. Google Cluster's Scale-up: 3D Structure
- TPU Architecture: The Ironwood TPU architecture combines compute units such as the TensorCore, XLU, and VPU, connected by high-speed ICI. It uses HBM3 and HBM3E memory and scales up to 9216 chips [11][12].
- From TPU to TPU Rack: A TPU Tray contains 4 Ironwood TPUs, and a TPU Rack consists of 16 TPU Trays, for 64 TPU chips per rack. The rack has its own physical structure and cooling system [28][29].
- Comparison with GPUs: The report compares the GPU architectures of NVIDIA (Hopper to Blackwell) and AMD (MI350 to MI400), highlighting their different interconnect technologies and performance parameters [20][25].

2. Google Cluster's Scale-up Optical Interconnection: Optical Circuit Switch (OCS)
- Optical Switch Components: The OCS uses components such as 850 nm camera modules, dichroic beam splitters, fiber collimators, and 2D MEMS micromirrors to separate or combine calibration light and signal light [46].
- TPU SuperPod Structure: A TPU SuperPod consists of 64 Google racks, divided into 8 groups of 8 racks. It integrates 4096 chips sharing 256 TiB of HBM memory, with total compute performance of over 1 ExaFLOP. Each group of 8 racks has a CDU for liquid cooling [60].

3. TPU Cluster: Ratios of OCSs and Optical Modules to TPUs
- TPU V4: With 4096 TPUs, the ratio of OCSs to TPUs is 1.1%, and the ratio of optical modules to TPUs is 1.5 [70][84].
- TPU V7: With 9216 TPUs, the ratio of OCSs to TPUs is 0.52%, and the ratio of optical modules to TPUs is also 1.5 [75][89].
- Rack-level Data: A single rack has 6 × 16 external optical modules, 4 × 16 PCB traces, and 80 copper cables [94].

4. Google Cluster's Scale-out
- Switch Parameters: The Tomahawk 5 switch has 128 400G ports [103].
- Communication Outside the TPU SuperPod: Traffic leaving the TPU SuperPod is carried over the Data-center Network (DCN), which includes optical circuit switches and physical fibers [106][108].
- NVIDIA Scale-out OCS: In NVIDIA's scale-out design, OCSs are used in a redundant spine-leaf network, which improves network resilience [113][114].
- Comparison of Interconnection Schemes in a 100,000-card Cluster: The report compares InfiniBand, NVIDIA Spectrum-X, and Broadcom Tomahawk 5 interconnection schemes in terms of switch count, optical module count, cost, and other factors [125].
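
The rack, pod, and switch figures quoted in the summary can be cross-checked with simple arithmetic. The Python sketch below uses only numbers that appear above; the variable names are illustrative. Two values are inferences rather than figures stated in the report: the rack count for a 9216-chip domain assumes the same 64-chip rack, and the aggregate Tomahawk 5 bandwidth is simply ports multiplied by port speed.

```python
# Figures quoted in the summary (all other values below are derived from them).
TPUS_PER_TRAY = 4            # Ironwood TPUs per TPU Tray
TRAYS_PER_RACK = 16          # TPU Trays per TPU Rack
RACKS_PER_SUPERPOD = 64      # 8 groups of 8 racks
IRONWOOD_SCALE_UP_CHIPS = 9216

# Rack and SuperPod chip counts.
tpus_per_rack = TPUS_PER_TRAY * TRAYS_PER_RACK
superpod_chips = tpus_per_rack * RACKS_PER_SUPERPOD

# Assumption: a 9216-chip scale-up domain is built from the same 64-chip racks.
ironwood_racks = IRONWOOD_SCALE_UP_CHIPS // tpus_per_rack

# Rack-level interconnect counts as quoted: 6 * 16, 4 * 16, and 80.
optical_modules_per_rack = 6 * 16
pcb_traces_per_rack = 4 * 16
copper_cables_per_rack = 80

# Optical modules per TPU, to compare against the quoted ratio of 1.5.
modules_per_tpu = optical_modules_per_rack / tpus_per_rack

# Aggregate Tomahawk 5 port bandwidth in Tb/s (128 ports at 400 Gb/s each).
tomahawk5_tbps = 128 * 400 / 1000

if __name__ == "__main__":
    print("TPUs per rack:", tpus_per_rack)
    print("Chips per SuperPod:", superpod_chips)
    print("Racks for 9216 chips (assumed 64-chip racks):", ironwood_racks)
    print("Optical modules per rack:", optical_modules_per_rack)
    print("Optical modules per TPU:", modules_per_tpu)
    print("Tomahawk 5 aggregate bandwidth (Tb/s):", tomahawk5_tbps)
```

The rack-level count of 96 external optical modules for 64 TPUs reproduces the quoted ratio of 1.5, and 64 racks of 64 chips reproduce the 4096-chip SuperPod, so the summary's figures are internally consistent on these points.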