Large Language Model Training

Losslessly cut activation memory by 80% and increase training sequence length 5x, with only two lines of code
机器之心· 2025-06-23 07:44
Core Insights
- The article discusses the StreamBP algorithm, which significantly reduces the memory required for training large language models (LLMs) by optimizing the backpropagation process [3][6][15].

Group 1: StreamBP Algorithm
- StreamBP reduces the memory consumption of activation values to about 20% of that required by gradient checkpointing, allowing for longer sequence lengths during training [3][6].
- Under the same memory constraints, StreamBP can achieve a maximum sequence length that is 2.8 to 5.5 times greater than that of gradient checkpointing [6][22].
- The algorithm is applicable to common LLM objective functions such as SFT, GRPO, PPO, and DPO, and its code is open-sourced for integration into existing training frameworks [6][12].

Group 2: Memory and Performance Comparison
- In terms of memory usage, StreamBP requires only 5% to 15% of the total activation memory for all layers, while a single layer's complete activation values account for over 85% of the memory [13][19].
- A comparison of memory and time costs between standard backpropagation and StreamBP shows that StreamBP significantly reduces peak memory usage while maintaining similar computational costs [14][25].

Group 3: Application in LLM Training
- StreamBP is specifically designed to optimize memory usage in the Transformer layers and lm_head layer of LLMs, effectively lowering the memory consumption of layer activations and logits [16][20].
- The algorithm allows for larger batch sizes and faster training times by enabling longer sequence lengths, which is crucial for training efficiency [25][28].
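The digest above describes the idea but not the implementation. A minimal sketch of the chunking principle it attributes to StreamBP, computing the loss and backward pass over sequence chunks so that the full sequence's logits are never materialized at once, is shown below. This is an illustrative sketch covering only the lm_head/loss step, not the open-sourced StreamBP code; the function name, chunk size, and tensor layout are all assumptions.

```python
import torch
import torch.nn.functional as F

def chunked_lm_head_backward(hidden, lm_head, labels, chunk_size=1024):
    """Illustrative sketch (not the official StreamBP code): compute the
    next-token-prediction loss and its gradients chunk-by-chunk along the
    sequence dimension, so the full [seq_len, vocab] logits tensor is never
    held in memory at once.

    hidden:  [seq_len, d_model] output of the last transformer layer (requires grad)
    lm_head: torch.nn.Linear(d_model, vocab_size)
    labels:  [seq_len] target token ids, already shifted; -100 marks ignored positions
    """
    seq_len = hidden.size(0)
    n_valid = (labels != -100).sum().clamp(min=1)
    hidden_grad = torch.zeros_like(hidden)
    total_loss = 0.0
    for start in range(0, seq_len, chunk_size):
        end = min(start + chunk_size, seq_len)
        # Detach this chunk so backward stops here; its gradient is accumulated manually.
        h = hidden[start:end].detach().requires_grad_(True)
        logits = lm_head(h)  # only this chunk's [chunk, vocab] logits are alive at any time
        loss = F.cross_entropy(logits, labels[start:end],
                               ignore_index=-100, reduction="sum") / n_valid
        loss.backward()      # accumulates lm_head.weight.grad; this chunk's graph is then freed
        hidden_grad[start:end] = h.grad
        total_loss += loss.item()
    # Push the accumulated gradient through the rest of the network in one call.
    hidden.backward(hidden_grad)
    return total_loss
```

In this sketch each chunk's logits are created, consumed by the loss, and freed within one loop iteration, while the lm_head weight gradient and the gradient with respect to the hidden states are accumulated across chunks and then propagated through the earlier layers with a single final backward call.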
The Key Figures Behind Gemini 2.5's Come-From-Behind Rise
Hu Xiu· 2025-06-05 03:14
Hongjun (泓君), founder of Silicon Valley 101 (《硅谷101》), invited Kimi Kong, co-founder of Energent.ai, and Shaun Wei, founder of HeyRevia, two former Google technical experts, to discuss the underlying logic behind the Gemini model's climb to the top.

Below is a selection from that conversation:

1. The underlying logic behind Gemini 2.5's rise

Hongjun: The Gemini 2.5 Pro that Google has just released posts the best benchmark numbers of any large model right now. Kimi, can you analyze how it achieved this? From being "precision-sniped" by OpenAI's 4o model on the eve of last year's conference to Gemini 2.5 Pro sweeping the leaderboards this year, how did Gemini complete the reversal from chaser to front-runner in barely a year?

Kimi: I have been away from DeepMind for almost a year now, so I don't really know what new innovations my former colleagues have made in that time. But the fundamental steps of training a large language model have not changed. They are these three: Pre-training, SFT (Supervised Fine-tuning), and Alignment done with RLHF (Reinforcement Learning from Human Feedback).

Around last year's NeurIPS (Conference on Neural Information Processing Systems), the industry had already broadly acknowledged that public web data has essentially all been scraped, much like fossil fuels have already ...
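The three steps Kimi lists are standard. As a point of reference for the first of them, the pre-training objective is ordinary next-token prediction: cross-entropy between the model's prediction at each position and the token that actually follows. A minimal sketch, with illustrative function and argument names that are not taken from any specific codebase:

```python
import torch
import torch.nn.functional as F

def next_token_prediction_loss(logits, input_ids, pad_token_id=0):
    """Minimal sketch of the pre-training (NTP) objective: score the model's
    prediction at position t against the actual token at position t+1.
    logits:    [batch, seq_len, vocab]
    input_ids: [batch, seq_len]; pad_token_id is assumed to mark padding."""
    # Predictions for positions 0..T-2 are compared with tokens 1..T-1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=pad_token_id,  # do not score padding positions
    )
```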
Large model training may not need "clean data"! New study from a Peking University team: random noise has limited impact, and a new method makes models more noise-robust
量子位· 2025-02-27 09:37
Core Insights
- The article challenges the traditional belief that language models require "clean data" for effective training, suggesting that exposure to noise and imperfect data can still lead to strong language capabilities [1][2].

Group 1: Research Findings
- Researchers from Peking University conducted experiments by intentionally adding random noise to training data, revealing that models could tolerate up to 20% "garbage data" with minimal impact on performance, as the Next-token Prediction (NTP) loss increased by less than 1% [2][4].
- The study utilized the OpenWebText dataset and injected random noise ranging from 1% to 20%, demonstrating that even with significant noise, the model's predictive loss remained relatively stable [3][4].
- The findings indicate a complex relationship between noise and model performance, leading to the development of a new method called "Local Gradient Matching" (LGM) to enhance model robustness in noisy environments [2][10].

Group 2: Theoretical Analysis
- The research posits that the presence of noise does not significantly alter the global minimum of the NTP loss, even when noise levels are high, due to the low probability of finding meaningful text within random noise [6][7].
- The study's assumptions can be extended to multilingual models, where different languages can be viewed as noise to each other, thus not adversely affecting the performance of individual languages [9].

Group 3: Practical Implications
- Despite the minor changes in pre-training loss, downstream tasks showed a decline in accuracy, highlighting a "loss-performance decoupling" phenomenon where pre-training metrics do not fully capture model capabilities [10].
- The proposed LGM method enhances the model's resistance to noise by constraining the gradient differences between original and perturbed features, thereby maintaining decision consistency under noise [10][12].
- Experimental results across various natural language understanding and visual classification datasets confirmed that LGM significantly improves performance for models affected by noise [11][12].

Group 4: Future Directions
- The research opens new perspectives on large-scale pre-training, suggesting that retaining some random noise can reduce data cleaning costs, particularly for resource-constrained teams [15].
- Future work will explore the dynamic relationship between noise types and model capacity, as well as the application of LGM in other modalities [14].
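The digest describes Local Gradient Matching only at a high level: constrain the difference between the gradients of the task loss taken with respect to clean features and with respect to perturbed features. A rough sketch of such a regularizer is below; it is an interpretation of that one-sentence description rather than the authors' released code, and the noise model, names, and weighting coefficient are assumptions.

```python
import torch
import torch.nn.functional as F

def lgm_regularized_loss(head, feats_clean, labels, noise_std=0.1, alpha=0.01):
    """Sketch of a Local-Gradient-Matching-style regularizer: penalize the gap
    between the gradients of the task loss w.r.t. clean features and w.r.t.
    noise-perturbed features, so the classifier reacts consistently to noise.
    The Gaussian feature noise and the weight `alpha` are assumptions."""
    feats_noisy = feats_clean + noise_std * torch.randn_like(feats_clean)

    loss_clean = F.cross_entropy(head(feats_clean), labels)
    loss_noisy = F.cross_entropy(head(feats_noisy), labels)

    # Gradients of the task loss with respect to the two versions of the features.
    g_clean, = torch.autograd.grad(loss_clean, feats_clean, create_graph=True)
    g_noisy, = torch.autograd.grad(loss_noisy, feats_noisy, create_graph=True)

    # Gradient-matching term: keep the two local gradients close.
    gm = (g_clean - g_noisy).pow(2).sum()

    return loss_clean + alpha * gm

# Toy usage: a linear probe over frozen features (shapes are illustrative).
head = torch.nn.Linear(64, 10)
feats = torch.randn(8, 64, requires_grad=True)
labels = torch.randint(0, 10, (8,))
loss = lgm_regularized_loss(head, feats, labels)
loss.backward()  # second-order backward through the gradient-matching term
```

This matches the downstream setting the digest mentions (natural language understanding and visual classification heads on top of pretrained features); how the original work schedules or weights the matching term is not stated in the article.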