AAI 2025 | Powering AI at Scale: OCI Superclusters with AMD
AMD · 2025-07-15 16:01
AI Workload Challenges & Requirements
- AI workloads differ from traditional cloud workloads in their need for high throughput and low latency, especially during large language model training, where thousands of GPUs communicate with one another [2][3][4]
- Network glitches such as packet drops, congestion, or latency spikes can stall the entire training run, increasing training time and cost [3][5]
- Networks must support cluster sizes from small to very large for both inference and training workloads, requiring high performance and reliability [8]
- Networks should scale up within racks and scale out across data halls and data centers, while remaining autonomous and resilient with auto-recovery capabilities [9][10]
- Networks need to support growing East-West traffic, accommodating data transfer from sources such as on-premises data centers and other cloud locations, with that traffic expected to grow 30% to 40% [10]

OCI's Solution: Backend and Frontend Networks
- OCI addresses these requirements with a two-part network architecture: a backend network for high-performance AI and a frontend network for data ingestion [11][12]
- The backend network, designed for RDMA-intensive workloads, supports AI, HPC, Oracle databases, and recommendation engines [13]
- The frontend network provides high-throughput, reliable connectivity within OCI and to external networks, facilitating data transfer from various locations [14]

OCI's RDMA Network Performance & Technologies
- OCI uses RDMA powered by RoCEv2, enabling high-performance, low-latency RDMA traffic on standard Ethernet hardware [18]
- OCI's network supports multi-class RDMA workloads using per-class queuing (QoS) in its switches, accommodating the differing requirements of training, HPC, and databases on the same physical network (first sketch below) [20]
- Independent studies show OCI's RDMA network achieves near line-rate throughput (100 Gbps) with round-trip latency under 10 microseconds for HPC workloads [23]
- OCI testing demonstrates close to 96% of line rate (400 Gbps) on MI300X clusters, showcasing efficient network utilization (second sketch below) [25]

Future Roadmap: Zettascale Clusters with AMD
- OCI is partnering with AMD to build a zettascale cluster of over 131,000 AMD Instinct MI355X GPUs, delivering nearly triple the compute power and 50% more high-bandwidth memory than the prior generation (third sketch below) [26]
- Each GPU will feature 288 GB of HBM3E memory, enabling customers to train larger models and improve inference [26]
- The new system will use AMD AI NICs, enabling innovative, standards-based RoCE networking at peak performance [27]
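The multi-class RDMA point [20] amounts to DSCP-based traffic classing: each workload class marks its packets so switches can steer them into separate egress queues. Below is a minimal Python sketch of that idea; the class names, DSCP values, and queue numbers are illustrative assumptions, not OCI's actual switch configuration.

```python
# Minimal sketch of DSCP-based traffic classing for multi-class RDMA.
# All DSCP values and queue numbers are assumed for illustration.

from dataclasses import dataclass

# Hypothetical mapping: each RDMA workload class gets its own DSCP value
# and switch egress queue, so a training burst cannot starve database traffic.
CLASS_TO_DSCP = {
    "ai_training": 26,  # assumed: bandwidth-hungry, bursty
    "hpc": 34,          # assumed: latency-sensitive
    "database": 46,     # assumed: small messages, lowest latency
}
DSCP_TO_QUEUE = {26: 3, 34: 5, 46: 7}  # assumed queue numbering

@dataclass
class RdmaPacket:
    payload_bytes: int
    dscp: int

def mark(traffic_class: str, payload_bytes: int) -> RdmaPacket:
    """Mark a packet with the DSCP code point for its workload class."""
    return RdmaPacket(payload_bytes, CLASS_TO_DSCP[traffic_class])

def egress_queue(pkt: RdmaPacket) -> int:
    """Pick the switch egress queue from the packet's DSCP marking."""
    return DSCP_TO_QUEUE[pkt.dscp]

if __name__ == "__main__":
    pkt = mark("hpc", 4096)
    print(f"DSCP {pkt.dscp} -> queue {egress_queue(pkt)}")
```

The design intent is isolation: because each class drains from its own queue, the scheduler can give databases low-latency service while training traffic fills the remaining bandwidth on the same physical fabric.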
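To make the throughput claim [25] concrete, here is a back-of-the-envelope check of 96% of a 400 Gbps line rate and what it implies for moving a large payload between GPUs. The 16 GB payload size is an arbitrary assumption for illustration.

```python
# Back-of-the-envelope arithmetic for the quoted figures:
# ~96% of a 400 Gbps per-NIC line rate on MI300X clusters.

LINE_RATE_GBPS = 400   # per-NIC line rate quoted in the talk
EFFICIENCY = 0.96      # ~96% of line rate quoted in the talk

effective_gbps = LINE_RATE_GBPS * EFFICIENCY       # 384 Gbps
effective_gb_per_s = effective_gbps / 8            # 48 GB/s

payload_gb = 16  # assumed payload, e.g. a gradient shard
transfer_s = payload_gb / effective_gb_per_s

print(f"effective throughput: {effective_gbps:.0f} Gbps "
      f"({effective_gb_per_s:.0f} GB/s)")
print(f"time to move {payload_gb} GB: {transfer_s * 1e3:.0f} ms")
```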
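The scale of the announced cluster [26] can also be sanity-checked with simple arithmetic; the exact GPU count below is an assumption (131,072 = 2^17 is the commonly cited figure for the announcement).

```python
# Aggregate HBM of the announced zettascale cluster: the talk quotes
# more than 131,000 GPUs with 288 GB of HBM each. The exact count
# used here (131,072) is an assumption.

gpus = 131_072
hbm_per_gpu_gb = 288

total_hbm_pb = gpus * hbm_per_gpu_gb / 1e6  # GB -> PB (decimal)
print(f"aggregate HBM: {total_hbm_pb:.1f} PB across {gpus:,} GPUs")
```

That works out to roughly 37.7 PB of aggregate HBM, which is the kind of headroom the talk points to for training larger models and improving inference.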