Large Language Model Training

Losslessly cut activation memory by 80% and increase training sequence length 5x, with only two lines of code
机器之心· 2025-06-23 07:44
Core Insights
- The article discusses the StreamBP algorithm, which significantly reduces the memory required for training large language models (LLMs) by optimizing the backpropagation process [3][6][15].

Group 1: StreamBP Algorithm
- StreamBP reduces the memory consumption of activation values to about 20% of that required by gradient checkpointing, allowing for longer sequence lengths during training [3][6].
- Under the same memory constraints, StreamBP can achieve a maximum sequence length that is 2.8 to 5.5 times greater than that of gradient checkpointing [6][22].
- The algorithm is applicable to common LLM objective functions such as SFT, GRPO, PPO, and DPO, and its code is open-sourced for integration into existing training frameworks [6][12].

Group 2: Memory and Performance Comparison
- In terms of memory usage, StreamBP requires only 5% to 15% of the total activation memory for all layers, while a single layer's complete activation values account for over 85% of the memory [13][19].
- A comparison of memory and time costs between standard backpropagation and StreamBP shows that StreamBP significantly reduces peak memory usage while maintaining similar computational costs [14][25].

Group 3: Application in LLM Training
- StreamBP is specifically designed to optimize memory usage in the Transformer layers and lm_head layer of LLMs, effectively lowering the memory consumption of layer activations and logits [16][20].
- The algorithm allows for larger batch sizes and faster training times by enabling longer sequence lengths, which is crucial for training efficiency [25][28].
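The digest above describes the idea but not the implementation. A minimal sketch of the chunking principle it attributes to StreamBP, computing the loss and backward pass over sequence chunks so that the full sequence's logits are never materialized at once, is shown below. This is an illustrative sketch covering only the lm_head/loss step, not the open-sourced StreamBP code; the function name, chunk size, and tensor layout are all assumptions.

```python
import torch
import torch.nn.functional as F

def chunked_lm_head_backward(hidden, lm_head, labels, chunk_size=1024):
    """Illustrative sketch (not the official StreamBP code): compute the
    next-token-prediction loss and its gradients chunk-by-chunk along the
    sequence dimension, so the full [seq_len, vocab] logits tensor is never
    held in memory at once.

    hidden:  [seq_len, d_model] output of the last transformer layer (requires grad)
    lm_head: torch.nn.Linear(d_model, vocab_size)
    labels:  [seq_len] target token ids, already shifted; -100 marks ignored positions
    """
    seq_len = hidden.size(0)
    n_valid = (labels != -100).sum().clamp(min=1)
    hidden_grad = torch.zeros_like(hidden)
    total_loss = 0.0
    for start in range(0, seq_len, chunk_size):
        end = min(start + chunk_size, seq_len)
        # Detach this chunk so backward stops here; its gradient is accumulated manually.
        h = hidden[start:end].detach().requires_grad_(True)
        logits = lm_head(h)  # only this chunk's [chunk, vocab] logits are alive at any time
        loss = F.cross_entropy(logits, labels[start:end],
                               ignore_index=-100, reduction="sum") / n_valid
        loss.backward()      # accumulates lm_head.weight.grad; this chunk's graph is then freed
        hidden_grad[start:end] = h.grad
        total_loss += loss.item()
    # Push the accumulated gradient through the rest of the network in one call.
    hidden.backward(hidden_grad)
    return total_loss
```

In this sketch each chunk's logits are created, consumed by the loss, and freed within one loop iteration, while the lm_head weight gradient and the gradient with respect to the hidden states are accumulated across chunks and then propagated through the earlier layers with a single final backward call.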
The Key Figures Behind Gemini 2.5's Come-From-Behind Rise
Hu Xiu· 2025-06-05 03:14
Hongjun (泓君), founder of Silicon Valley 101 (《硅谷101》), invited Kimi Kong, co-founder of Energent.ai, and Shaun Wei, founder of HeyRevia, two former Google technical experts, to discuss the underlying logic behind the Gemini model's climb to the top.

Below is a selection from that conversation:

1. The underlying logic behind Gemini 2.5's rise

Hongjun: The Gemini 2.5 Pro that Google has just released posts the best benchmark numbers of any large model right now. Kimi, can you analyze how it achieved this? From being "precision-sniped" by OpenAI's 4o model on the eve of last year's conference to Gemini 2.5 Pro sweeping the leaderboards this year, how did Gemini complete the reversal from chaser to front-runner in barely a year?

Kimi: I have been away from DeepMind for almost a year now, so I don't really know what new innovations my former colleagues have made in that time. But the fundamental steps of training a large language model have not changed. They are these three: Pre-training, SFT (Supervised Fine-tuning), and Alignment done with RLHF (Reinforcement Learning from Human Feedback).

Around last year's NeurIPS (Conference on Neural Information Processing Systems), the industry had already broadly acknowledged that public web data has essentially all been scraped, much like fossil fuels have already ...
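The three steps Kimi lists are standard. As a point of reference for the first of them, the pre-training objective is ordinary next-token prediction: cross-entropy between the model's prediction at each position and the token that actually follows. A minimal sketch, with illustrative function and argument names that are not taken from any specific codebase:

```python
import torch
import torch.nn.functional as F

def next_token_prediction_loss(logits, input_ids, pad_token_id=0):
    """Minimal sketch of the pre-training (NTP) objective: score the model's
    prediction at position t against the actual token at position t+1.
    logits:    [batch, seq_len, vocab]
    input_ids: [batch, seq_len]; pad_token_id is assumed to mark padding."""
    # Predictions for positions 0..T-2 are compared with tokens 1..T-1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=pad_token_id,  # do not score padding positions
    )
```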
Large model training may not need "clean data"! New study from a Peking University team: random noise has limited impact, and a new method makes models more noise-robust
量子位· 2025-02-27 09:37
Core Insights
- The article challenges the traditional belief that language models require "clean data" for effective training, suggesting that exposure to noise and imperfect data can still lead to strong language capabilities [1][2].

Group 1: Research Findings
- Researchers from Peking University conducted experiments by intentionally adding random noise to training data, revealing that models could tolerate up to 20% "garbage data" with minimal impact on performance, as the Next-token Prediction (NTP) loss increased by less than 1% [2][4].
- The study utilized the OpenWebText dataset and injected random noise ranging from 1% to 20%, demonstrating that even with significant noise, the model's predictive loss remained relatively stable [3][4].
- The findings indicate a complex relationship between noise and model performance, leading to the development of a new method called "Local Gradient Matching" (LGM) to enhance model robustness in noisy environments [2][10].

Group 2: Theoretical Analysis
- The research posits that the presence of noise does not significantly alter the global minimum of the NTP loss, even when noise levels are high, due to the low probability of finding meaningful text within random noise [6][7].
- The study's assumptions can be extended to multilingual models, where different languages can be viewed as noise to each other, thus not adversely affecting the performance of individual languages [9].

Group 3: Practical Implications
- Despite the minor changes in pre-training loss, downstream tasks showed a decline in accuracy, highlighting a "loss-performance decoupling" phenomenon where pre-training metrics do not fully capture model capabilities [10].
- The proposed LGM method enhances the model's resistance to noise by constraining the gradient differences between original and perturbed features, thereby maintaining decision consistency under noise [10][12].
- Experimental results across various natural language understanding and visual classification datasets confirmed that LGM significantly improves performance for models affected by noise [11][12].

Group 4: Future Directions
- The research opens new perspectives on large-scale pre-training, suggesting that retaining some random noise can reduce data cleaning costs, particularly for resource-constrained teams [15].
- Future work will explore the dynamic relationship between noise types and model capacity, as well as the application of LGM in other modalities [14].
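The digest describes Local Gradient Matching only at a high level: constrain the difference between the gradients of the task loss taken with respect to clean features and with respect to perturbed features. A rough sketch of such a regularizer is below; it is an interpretation of that one-sentence description rather than the authors' released code, and the noise model, names, and weighting coefficient are assumptions.

```python
import torch
import torch.nn.functional as F

def lgm_regularized_loss(head, feats_clean, labels, noise_std=0.1, alpha=0.01):
    """Sketch of a Local-Gradient-Matching-style regularizer: penalize the gap
    between the gradients of the task loss w.r.t. clean features and w.r.t.
    noise-perturbed features, so the classifier reacts consistently to noise.
    The Gaussian feature noise and the weight `alpha` are assumptions."""
    feats_noisy = feats_clean + noise_std * torch.randn_like(feats_clean)

    loss_clean = F.cross_entropy(head(feats_clean), labels)
    loss_noisy = F.cross_entropy(head(feats_noisy), labels)

    # Gradients of the task loss with respect to the two versions of the features.
    g_clean, = torch.autograd.grad(loss_clean, feats_clean, create_graph=True)
    g_noisy, = torch.autograd.grad(loss_noisy, feats_noisy, create_graph=True)

    # Gradient-matching term: keep the two local gradients close.
    gm = (g_clean - g_noisy).pow(2).sum()

    return loss_clean + alpha * gm

# Toy usage: a linear probe over frozen features (shapes are illustrative).
head = torch.nn.Linear(64, 10)
feats = torch.randn(8, 64, requires_grad=True)
labels = torch.randint(0, 10, (8,))
loss = lgm_regularized_loss(head, feats, labels)
loss.backward()  # second-order backward through the gradient-matching term
```

This matches the downstream setting the digest mentions (natural language understanding and visual classification heads on top of pretrained features); how the original work schedules or weights the matching term is not stated in the article.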