Large Language Model Training
How Was Doubao Forged? ByteDance Releases a Paper on Its In-House 10,000-GPU Training System ByteRobust
机器之心 · 2025-10-21 09:32
Core Insights
- The article examines the challenges of training large language models (LLMs) at scale and presents ByteDance's robust training infrastructure, ByteRobust, which aims to minimize training interruptions and improve fault diagnosis and recovery efficiency [3][7][25].

Group 1: Training Infrastructure and Challenges
- LLM training runs on GPU clusters whose scale now reaches tens of thousands of GPUs, which lengthens training runs and makes hardware failures frequent [1][2].
- ByteDance trained a 175B-parameter model on 12,288 GPUs, while the 405B-parameter LLaMA 3 required 16,384 NVIDIA H100 GPUs and 54 days of pre-training [1].
- Faults such as CUDA errors and task hangs occur frequently; Meta reported a hardware failure roughly every 2.78 hours while training on 16,000 GPUs [1][2].

Group 2: ByteRobust Overview
- ByteRobust is designed to achieve a high effective training time ratio (ETTR) by efficiently diagnosing and handling events during LLM training (a minimal ETTR calculation sketch follows this summary) [7][25].
- The infrastructure consists of two main components: a control plane for event management and a data plane for monitoring and diagnostics [8][10].

Group 3: Control Plane and Data Plane Functions
- The control plane coordinates robust event-handling strategies, including anomaly detection and fault localization, while the data plane integrates monitoring, diagnostics, and checkpoint management [10][11].
- The Robust Controller in the control plane manages an automated fault-mitigation framework, relying on real-time monitoring for most events [10][12].

Group 4: Fault Tolerance Mechanisms
- ByteRobust prioritizes rapid fault isolation over precise fault localization to minimize GPU idling during large-scale training [13][14].
- The automated fault-tolerance framework combines real-time checks, in-depth diagnostics, and mechanisms for quick recovery from transient faults [19][20].

Group 5: Performance Metrics and Results
- ByteRobust has been deployed for over a year, shortening event detection time and resolving incidents through its automated framework [25].
- Over a three-month period, ByteRobust identified 38,236 explicit faults and 5,948 implicit faults across 778,135 LLM training tasks [26].
- The system achieved a maximum ETTR of 97% during intensive model training on 9,600 GPUs, with warm standby and hot update mechanisms delivering significant improvements in recovery speed [28][35].

Group 6: Model Training Insights
- ByteDance's experiments showed that the warm standby and hot update mechanisms improved recovery speed by up to 10.87x and 11.04x, respectively [28].
- The effective checkpoint mechanism implemented in ByteRobust incurs less than 0.9% overhead, enabling faster fault switching [31].
- Comparing dense and MoE model training showed that while dense-model training was more heavily optimized, MoE training introduced additional complexities that led to more manual restarts [38].
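The ETTR figure above can be made concrete with a small calculation. The Python sketch below is illustrative only and is not ByteRobust code: the `TrainingEvent` structure and `compute_ettr` helper are assumptions for exposition, using the common definition of ETTR as productive training time divided by total allocated wall-clock time.

```python
# A minimal sketch (not ByteRobust's actual code) of computing an effective
# training time ratio (ETTR) from a job timeline. `TrainingEvent` and
# `compute_ettr` are hypothetical names introduced for this example.

from dataclasses import dataclass

@dataclass
class TrainingEvent:
    kind: str        # e.g. "training", "fault_detected", "restart"
    start: float     # wall-clock seconds since job launch
    end: float

def compute_ettr(events, total_wall_time):
    """ETTR = productive training time / total allocated wall-clock time."""
    productive = sum(e.end - e.start for e in events if e.kind == "training")
    return productive / total_wall_time

# Example: a 100-hour job that loses 3 hours to one fault and its recovery.
timeline = [
    TrainingEvent("training", 0, 40 * 3600),
    TrainingEvent("fault_detected", 40 * 3600, 40.5 * 3600),
    TrainingEvent("restart", 40.5 * 3600, 43 * 3600),
    TrainingEvent("training", 43 * 3600, 100 * 3600),
]
print(f"ETTR = {compute_ettr(timeline, 100 * 3600):.2%}")  # -> ETTR = 97.00%
```

Under this definition, the reported 97% ETTR on 9,600 GPUs means that less than 3% of the allocated wall-clock time was lost to fault detection, diagnosis, and recovery.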
More Effective Than Adam: POET Starts from the Principle of Spectral Invariance to Make LLM Training Both Stable and Fast
机器之心 · 2025-07-15 00:59
Core Viewpoint
- The article discusses POET (reParameterized Training via Orthogonal Equivalence Transformation), a novel training paradigm for large language models (LLMs) that aims to improve training efficiency and stability from first principles [2][3].

Group 1: POET Methodology
- POET structurally reparameterizes each neuron with two learnable orthogonal matrices and a fixed random weight matrix, preserving the singular-value distribution of the weights throughout training (a NumPy sketch of this reparameterization follows this summary) [3][11].
- The method combines singular-value invariance with minimal hyperspherical energy, offering a paradigm for large-model training that is both physically interpretable and generalizes well [3][11].
- POET's training process is designed to stabilize optimization and significantly improve model generalization [3][11].

Group 2: Advantages of POET
- POET maintains the spectral properties of the weight matrix throughout training, so the singular values stay consistent with those of the randomly initialized matrix [17].
- The method enables efficient parameter control and avoids the excessively large singular values that can arise in standard LLM training [17].
- Two new initialization strategies, normalized Gaussian initialization and uniform spectrum initialization, are proposed to guarantee bounded singular values in the generated weight matrices [17].

Group 3: Training Dynamics and Performance
- Experimental results demonstrate POET's superior performance in training large language models, including lower perplexity and better training efficiency than traditional methods such as AdamW [20][24].
- POET's training proceeds in three phases: conical-shell searching, stable learning on the conical shell, and final adjusting, reflecting how the orthogonal matrices evolve during training [40][41].
- A fully stochastic sampling approach lets POET cut memory costs substantially compared with traditional methods, improving scalability [26][27].
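The spectral-invariance property at the heart of POET can be checked directly: multiplying a fixed weight matrix on both sides by orthogonal matrices leaves its singular values unchanged. The NumPy sketch below is an illustrative assumption of how such a reparameterization might look, not the paper's implementation; in POET the orthogonal factors are learnable and updated during training, whereas here they are simply sampled at random to demonstrate the invariance.

```python
# A minimal NumPy sketch of a POET-style reparameterization (illustrative only;
# the paper's actual parameterization, initialization, and optimizer differ).
# Point demonstrated: W = R @ W0 @ P with orthogonal R, P has the same singular
# values as the fixed random matrix W0.

import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(n):
    """Sample an orthogonal matrix via QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

d_out, d_in = 64, 128
W0 = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)  # fixed random weights

# In POET, R and P would be learnable orthogonal parameters; here they are
# random stand-ins used only to illustrate spectrum preservation.
R = random_orthogonal(d_out)   # left orthogonal factor
P = random_orthogonal(d_in)    # right orthogonal factor
W = R @ W0 @ P                 # effective weight used in the forward pass

sv_before = np.linalg.svd(W0, compute_uv=False)
sv_after = np.linalg.svd(W, compute_uv=False)
print(np.allclose(sv_before, sv_after))  # True: singular values are preserved
```

Because only the orthogonal factors are trained, the spectrum of the effective weight is fixed at initialization, which is why the proposed initialization schemes (normalized Gaussian and uniform spectrum) are what bound the singular values throughout training.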