A 10,000-Word Deep Dive into AMD's CDNA 4 Architecture

Core Viewpoint
- AMD's CDNA 4 architecture is a moderate update over CDNA 3 that concentrates on raising matrix multiplication performance for low-precision data types, which are crucial for machine learning workloads [2][26].

Architecture Overview
- CDNA 4 keeps a system-level architecture similar to CDNA 3: a large chiplet design with eight accelerator compute dies (XCDs) and a 256 MB memory-side cache [4][20].
- The architecture uses AMD's Infinity Fabric technology to provide coherent memory access across multiple chips [4].

Performance Comparison
- The MI355X GPU, based on CDNA 4, has 256 compute units (CUs) at a clock speed of 2.4 GHz, compared with the MI300X's 304 CUs at 2.1 GHz: fewer CUs, partly offset by a higher clock speed [5].
- MI355X offers 288 GB of HBM3E memory with 8 TB/s of bandwidth, ahead of Nvidia's B200, which has a maximum capacity of 180 GB and 7.7 TB/s of bandwidth [25].

Matrix and Vector Throughput
- CDNA 4 rebalances its execution units toward low-precision matrix multiplication, doubling matrix throughput per CU in many cases [6][39].
- The architecture supports new low-precision data formats that significantly improve AI performance, with matrix core improvements delivering nearly four times the computational throughput for low-precision formats [46][47].

Local Data Share (LDS) Enhancements
- CDNA 4 increases LDS capacity to 160 KB and doubles read bandwidth to 256 bytes per clock, improving data locality for matrix multiplication routines [14][48].
- The architecture introduces new instructions for transposed reads from the LDS, optimizing memory access patterns for matrix operations [18].

Memory Hierarchy and Cache
- The memory hierarchy includes a shared 4 MB L2 cache and a 32 KB L1 vector cache per CU, with enhancements for caching non-coherent data from DRAM [49][50].
- The Infinity Cache remains at 256 MB, providing high bandwidth and supporting the increased memory demands of modern AI workloads [53].

Chiplet Architecture
- CDNA 4 continues to use a chiplet-based design, which lets each chiplet evolve independently for better performance and manufacturability [35][36].
- Each XCD contains 36 compute units, organized into arrays. The 256 CUs quoted for MI355X work out to 32 enabled per XCD across eight dies, a choice aimed at maximizing yield and operating frequency [39].

System Communication and Expansion
- The architecture includes eight AMD Infinity Fabric links, with speeds raised to as much as 38.4 Gbps, increasing communication bandwidth within server nodes [63].
- The design supports both drop-in compatibility with previous generations and incremental improvements for high-performance systems [62].

Conclusion
- AMD's CDNA 4 architecture builds on the success of CDNA 3, optimizing performance for machine learning workloads while maintaining a competitive edge against Nvidia [26][27].
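
The headline figures quoted in this summary can be cross-checked with a little arithmetic. The sketch below is only a back-of-the-envelope illustration: the "CU x clock" product is a hypothetical proxy introduced here (not an AMD metric), the 256 bytes-per-clock LDS read figure is assumed to be per CU, and the doubled per-CU matrix rate is assumed to apply to the data format in question. Nothing here is measured.

```python
# Figures below are the ones quoted in this summary; nothing is measured here.
MI355X = {"cus": 256, "clock_ghz": 2.4, "hbm_gb": 288, "hbm_tb_s": 8.0}
MI300X = {"cus": 304, "clock_ghz": 2.1}
B200 = {"hbm_gb": 180, "hbm_tb_s": 7.7}
LDS_READ_BYTES_PER_CLOCK = 256  # CDNA 4; assumed here to be a per-CU figure


def cu_ghz(gpu: dict) -> float:
    """CU count times clock in GHz: an illustrative proxy, not an AMD metric."""
    return gpu["cus"] * gpu["clock_ghz"]


def aggregate_lds_read_tb_s(gpu: dict, bytes_per_clock: int) -> float:
    """Peak LDS read bandwidth summed over all CUs, in TB/s (1 TB = 1e12 bytes)."""
    bytes_per_second = gpu["cus"] * bytes_per_clock * gpu["clock_ghz"] * 1e9
    return bytes_per_second / 1e12


# Ratio of the CU x clock proxy between the two generations.
cu_clock_ratio = cu_ghz(MI355X) / cu_ghz(MI300X)
# Assumes the "doubled matrix throughput per CU" applies to the chosen format.
matrix_ratio = 2 * cu_clock_ratio
# MI355X versus B200 memory capacity and bandwidth.
capacity_ratio = MI355X["hbm_gb"] / B200["hbm_gb"]
bandwidth_ratio = MI355X["hbm_tb_s"] / B200["hbm_tb_s"]
# Aggregate LDS read bandwidth if every CU reads at the peak rate.
lds_tb_s = aggregate_lds_read_tb_s(MI355X, LDS_READ_BYTES_PER_CLOCK)

print(f"CU x clock, MI355X / MI300X:       {cu_clock_ratio:.3f}")
print(f"Matrix rate with 2x per-CU gain:   {matrix_ratio:.2f}x")
print(f"HBM capacity, MI355X / B200:       {capacity_ratio:.2f}x")
print(f"HBM bandwidth, MI355X / B200:      {bandwidth_ratio:.3f}x")
print(f"Aggregate LDS read bandwidth:      {lds_tb_s:.1f} TB/s")
```

Under these assumptions the CU x clock product drops by about 4%, so a doubled per-CU matrix rate nets out to roughly 1.9x per GPU. The memory capacity lead over B200 is 1.6x, while the bandwidth lead is only about 4%.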