How was Doubao forged? ByteDance releases the paper on ByteRobust, its in-house training system for 10,000-GPU-scale clusters
机器之心· 2025-10-21 09:32
Core Insights
- The article discusses the challenges and advancements in training large language models (LLMs), focusing on ByteDance's robust training infrastructure, ByteRobust, which aims to minimize training interruptions and improve fault diagnosis and recovery efficiency [3][7][25].

Group 1: Training Infrastructure and Challenges
- LLM training runs on GPU clusters whose scale now reaches tens of thousands of devices, which lengthens training runs and makes hardware failures frequent [1][2].
- ByteDance trained a 175B-parameter model on 12,288 GPUs, while Meta's 405B-parameter LLaMA 3 was pre-trained on 16,384 NVIDIA H100 GPUs over 54 days [1].
- Faults such as CUDA errors and task hangs occur frequently; Meta reported a hardware failure roughly every 2.78 hours while training on 16,000 GPUs [1][2].

Group 2: ByteRobust Overview
- ByteRobust is designed to achieve a high effective training time ratio (ETTR) by efficiently diagnosing and handling events during LLM training (a minimal ETTR calculation sketch appears after this summary) [7][25].
- The infrastructure consists of two main components: a control plane for event management and a data plane for monitoring and diagnostics [8][10].

Group 3: Control Plane and Data Plane Functions
- The control plane coordinates robust event-handling strategies, including anomaly detection and fault localization, while the data plane integrates monitoring, diagnostics, and checkpoint management [10][11].
- The Robust Controller in the control plane drives an automated fault-mitigation framework that relies on real-time monitoring for most events [10][12].

Group 4: Fault Tolerance Mechanisms
- ByteRobust prioritizes rapid fault isolation over precise fault localization to minimize GPU idling during large-scale training [13][14].
- The automated fault-tolerance framework combines real-time checks, in-depth diagnostics, and mechanisms for quick recovery from transient faults (see the fault-handling loop sketch after this summary) [19][20].

Group 5: Performance Metrics and Results
- ByteRobust has been deployed for over a year, reducing event detection time and resolving incidents through its automated framework [25].
- Over a three-month period, ByteRobust identified 38,236 explicit faults and 5,948 implicit faults across 778,135 LLM training tasks [26].
- The system achieved a maximum ETTR of 97% during intensive model training on 9,600 GPUs, and its warm standby and hot update mechanisms significantly accelerated recovery (see the failover timing sketch after this summary) [28][35].

Group 6: Model Training Insights
- ByteDance's experiments showed that the warm standby and hot update mechanisms improved recovery speed by up to 10.87x and 11.04x, respectively [28].
- The checkpoint mechanism implemented in ByteRobust incurs less than 0.9% overhead, enabling faster fault switching (see the async-checkpoint sketch after this summary) [31].
- Comparing dense and MoE training showed that dense-model training benefited from more mature performance optimizations, while MoE training introduced additional complexities that led to more manual restarts [38].
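The summary cites ETTR repeatedly without spelling out the arithmetic. Below is a minimal sketch, assuming ETTR is simply productive training time divided by total allocated wall-clock time, with fault handling, checkpoint saves, and rollback recomputation counted as lost time; the `TrainingWindow` record and its fields are hypothetical bookkeeping, not ByteRobust's data model.

```python
from dataclasses import dataclass

@dataclass
class TrainingWindow:
    """One contiguous run segment between two interruptions (hypothetical record)."""
    wall_clock_s: float        # total scheduled time for the segment
    lost_to_faults_s: float    # detection + diagnosis + restart time
    lost_to_ckpt_s: float      # time spent saving/loading checkpoints
    lost_to_rollback_s: float  # steps recomputed after resuming from an older checkpoint

def ettr(windows: list[TrainingWindow]) -> float:
    """Effective Training Time Ratio: productive time / total allocated time."""
    total = sum(w.wall_clock_s for w in windows)
    lost = sum(w.lost_to_faults_s + w.lost_to_ckpt_s + w.lost_to_rollback_s
               for w in windows)
    return (total - lost) / total if total else 0.0

# Example: a 15-hour window that loses ~26 minutes to one fault plus checkpoint
# traffic lands near the 97% ETTR figure reported for the 9,600-GPU run.
print(ettr([TrainingWindow(15 * 3600, 20 * 60, 4 * 60, 2 * 60)]))  # ~0.971
```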
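To make the control-plane/data-plane split and the "isolate fast, diagnose later" policy concrete, here is a simplified sketch of such a control loop. Every name, threshold, and action in it (`realtime_checks`, the 600-second hang threshold, `isolate_node`, and so on) is an illustrative assumption, not ByteRobust's actual interface.

```python
import time
from enum import Enum, auto

class Verdict(Enum):
    HEALTHY = auto()
    TRANSIENT = auto()      # e.g. a recoverable communication timeout: retry in place
    SUSPECT_NODE = auto()   # isolate the node first, run deep diagnostics offline later

def realtime_checks(metrics: dict) -> Verdict:
    """Cheap, always-on checks (illustrative signals and thresholds)."""
    if metrics.get("cuda_error") or metrics.get("ecc_uncorrectable", 0) > 0:
        return Verdict.SUSPECT_NODE
    if metrics.get("step_stall_s", 0) > 600:   # hang detector: no step progress
        return Verdict.TRANSIENT
    return Verdict.HEALTHY

def control_loop(poll_metrics, retry_in_place, isolate_node, restart_from_ckpt):
    """Prefer fast isolation and restart over pinpoint root-causing, so that
    healthy GPUs idle for as little time as possible."""
    while True:
        verdict = realtime_checks(poll_metrics())
        if verdict is Verdict.TRANSIENT:
            retry_in_place()        # cheapest path: resume without rescheduling
        elif verdict is Verdict.SUSPECT_NODE:
            isolate_node()          # swap in a standby node; diagnose the culprit out-of-band
            restart_from_ckpt()
        time.sleep(10)
```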
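The reported 10.87x/11.04x recovery speedups come from warm standby and hot update. The back-of-envelope below only illustrates why keeping pre-initialized standby workers shortens failover (it skips rescheduling and framework start-up); all durations are invented for illustration and do not reproduce the paper's measurements.

```python
# Illustrative failover timing (assumed numbers, not measurements from the paper).
cold_restart = {
    "reschedule_and_image_pull_s": 300,
    "framework_init_s": 240,
    "load_checkpoint_s": 120,
}
warm_standby = {
    "promote_standby_s": 10,   # worker already running, CUDA context ready
    "load_checkpoint_s": 120,
}

cold = sum(cold_restart.values())
warm = sum(warm_standby.values())
print(f"cold: {cold}s, warm: {warm}s, speedup ~{cold / warm:.1f}x")
```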
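A sub-0.9% checkpoint overhead implies checkpointing that barely blocks the training loop. One common way to get there is to snapshot state to host memory synchronously and persist it asynchronously; the sketch below shows that generic pattern in PyTorch and is not ByteRobust's implementation.

```python
import threading

import torch

def async_checkpoint(model: torch.nn.Module, step: int, path_prefix: str) -> threading.Thread:
    """Block training only for the device-to-host copy, then persist in a
    background thread so checkpoint overhead stays a small fraction of step time."""
    # Short blocking phase: snapshot parameters to CPU memory.
    cpu_state = {k: v.detach().to("cpu", copy=True) for k, v in model.state_dict().items()}

    def _persist():
        # Slow phase overlaps with subsequent training steps.
        torch.save({"step": step, "model": cpu_state}, f"{path_prefix}-{step}.pt")

    t = threading.Thread(target=_persist, daemon=True)
    t.start()
    return t
```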