Large Language Model Training
How Was Doubao Forged? ByteDance Releases a Paper on Its In-House 10,000-GPU Training System ByteRobust
机器之心 · 2025-10-21 09:32
Core Insights
- The article examines the challenges of training large language models (LLMs) at scale and presents ByteDance's robust training infrastructure, ByteRobust, which aims to minimize training interruptions and improve fault diagnosis and recovery efficiency [3][7][25].

Group 1: Training Infrastructure and Challenges
- LLM training runs on GPU clusters whose scale now reaches tens of thousands of GPUs, which lengthens training runs and makes hardware failures frequent [1][2].
- ByteDance trained a 175B-parameter model on 12,288 GPUs, while the 405B-parameter LLaMA 3 required 16,384 NVIDIA H100 GPUs and 54 days of pre-training [1].
- Faults such as CUDA errors and task hangs occur frequently; Meta reported a hardware failure roughly every 2.78 hours while training on 16,000 GPUs [1][2].

Group 2: ByteRobust Overview
- ByteRobust is designed to achieve a high effective training time ratio (ETTR) by efficiently diagnosing and handling events during LLM training (a minimal ETTR calculation sketch follows this summary) [7][25].
- The infrastructure consists of two main components: a control plane for event management and a data plane for monitoring and diagnostics [8][10].

Group 3: Control Plane and Data Plane Functions
- The control plane coordinates robust event-handling strategies, including anomaly detection and fault localization, while the data plane integrates monitoring, diagnostics, and checkpoint management [10][11].
- The Robust Controller in the control plane manages an automated fault-mitigation framework, relying on real-time monitoring for most events [10][12].

Group 4: Fault Tolerance Mechanisms
- ByteRobust prioritizes rapid fault isolation over precise fault localization to minimize GPU idling during large-scale training [13][14].
- The automated fault-tolerance framework combines real-time checks, in-depth diagnostics, and mechanisms for quick recovery from transient faults [19][20].

Group 5: Performance Metrics and Results
- ByteRobust has been deployed for over a year, shortening event detection time and resolving incidents through its automated framework [25].
- Over a three-month period, ByteRobust identified 38,236 explicit faults and 5,948 implicit faults across 778,135 LLM training tasks [26].
- The system achieved a maximum ETTR of 97% during intensive model training on 9,600 GPUs, with warm standby and hot update mechanisms delivering significant improvements in recovery speed [28][35].

Group 6: Model Training Insights
- ByteDance's experiments showed that the warm standby and hot update mechanisms improved recovery speed by up to 10.87x and 11.04x, respectively [28].
- The effective checkpoint mechanism implemented in ByteRobust incurs less than 0.9% overhead, enabling faster fault switching [31].
- Comparing dense and MoE model training showed that while dense-model training was more heavily optimized, MoE training introduced additional complexities that led to more manual restarts [38].
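The ETTR figure above can be made concrete with a small calculation. The Python sketch below is illustrative only and is not ByteRobust code: the `TrainingEvent` structure and `compute_ettr` helper are assumptions for exposition, using the common definition of ETTR as productive training time divided by total allocated wall-clock time.

```python
# A minimal sketch (not ByteRobust's actual code) of computing an effective
# training time ratio (ETTR) from a job timeline. `TrainingEvent` and
# `compute_ettr` are hypothetical names introduced for this example.

from dataclasses import dataclass

@dataclass
class TrainingEvent:
    kind: str        # e.g. "training", "fault_detected", "restart"
    start: float     # wall-clock seconds since job launch
    end: float

def compute_ettr(events, total_wall_time):
    """ETTR = productive training time / total allocated wall-clock time."""
    productive = sum(e.end - e.start for e in events if e.kind == "training")
    return productive / total_wall_time

# Example: a 100-hour job that loses 3 hours to one fault and its recovery.
timeline = [
    TrainingEvent("training", 0, 40 * 3600),
    TrainingEvent("fault_detected", 40 * 3600, 40.5 * 3600),
    TrainingEvent("restart", 40.5 * 3600, 43 * 3600),
    TrainingEvent("training", 43 * 3600, 100 * 3600),
]
print(f"ETTR = {compute_ettr(timeline, 100 * 3600):.2%}")  # -> ETTR = 97.00%
```

Under this definition, the reported 97% ETTR on 9,600 GPUs means that less than 3% of the allocated wall-clock time was lost to fault detection, diagnosis, and recovery.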
More Effective Than Adam: POET Starts from the Principle of Spectral Invariance to Make LLM Training Both Stable and Fast
机器之心 · 2025-07-15 00:59
Core Viewpoint
- The article discusses POET (reParameterized Training via Orthogonal Equivalence Transformation), a novel training paradigm for large language models (LLMs) that aims to improve training efficiency and stability from first principles [2][3].

Group 1: POET Methodology
- POET structurally reparameterizes each neuron with two learnable orthogonal matrices and a fixed random weight matrix, preserving the singular-value distribution of the weights throughout training (a NumPy sketch of this reparameterization follows this summary) [3][11].
- The method combines singular-value invariance with minimal hyperspherical energy, offering a paradigm for large-model training that is both physically interpretable and generalizes well [3][11].
- POET's training process is designed to stabilize optimization and significantly improve model generalization [3][11].

Group 2: Advantages of POET
- POET maintains the spectral properties of the weight matrix throughout training, so the singular values stay consistent with those of the randomly initialized matrix [17].
- The method enables efficient parameter control and avoids the excessively large singular values that can arise in standard LLM training [17].
- Two new initialization strategies, normalized Gaussian initialization and uniform spectrum initialization, are proposed to guarantee bounded singular values in the generated weight matrices [17].

Group 3: Training Dynamics and Performance
- Experimental results demonstrate POET's superior performance in training large language models, including lower perplexity and better training efficiency than traditional methods such as AdamW [20][24].
- POET's training proceeds in three phases: conical-shell searching, stable learning on the conical shell, and final adjusting, reflecting how the orthogonal matrices evolve during training [40][41].
- A fully stochastic sampling approach lets POET cut memory costs substantially compared with traditional methods, improving scalability [26][27].
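The spectral-invariance property at the heart of POET can be checked directly: multiplying a fixed weight matrix on both sides by orthogonal matrices leaves its singular values unchanged. The NumPy sketch below is an illustrative assumption of how such a reparameterization might look, not the paper's implementation; in POET the orthogonal factors are learnable and updated during training, whereas here they are simply sampled at random to demonstrate the invariance.

```python
# A minimal NumPy sketch of a POET-style reparameterization (illustrative only;
# the paper's actual parameterization, initialization, and optimizer differ).
# Point demonstrated: W = R @ W0 @ P with orthogonal R, P has the same singular
# values as the fixed random matrix W0.

import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(n):
    """Sample an orthogonal matrix via QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

d_out, d_in = 64, 128
W0 = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)  # fixed random weights

# In POET, R and P would be learnable orthogonal parameters; here they are
# random stand-ins used only to illustrate spectrum preservation.
R = random_orthogonal(d_out)   # left orthogonal factor
P = random_orthogonal(d_in)    # right orthogonal factor
W = R @ W0 @ P                 # effective weight used in the forward pass

sv_before = np.linalg.svd(W0, compute_uv=False)
sv_after = np.linalg.svd(W, compute_uv=False)
print(np.allclose(sv_before, sv_after))  # True: singular values are preserved
```

Because only the orthogonal factors are trained, the spectrum of the effective weight is fixed at initialization, which is why the proposed initialization schemes (normalized Gaussian and uniform spectrum) are what bound the singular values throughout training.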