Large Language Model Training
A 36-Year-Old Convolution Conjecture Is Solved, with a Chinese Mathematician as Sole Author; AI May Benefit
机器之心· 2025-11-26 05:12
Core Viewpoint
- The article discusses a significant mathematical breakthrough by Yuansi Chen, who solved the Talagrand convolution conjecture, a problem that had remained open for 36 years, with implications for modern computer science and machine learning [3][10].

Group 1: Background and Importance
- The Talagrand convolution conjecture, proposed in 1989, is one of the most important open problems in probability theory and functional analysis, concerning the regularization properties of the heat semigroup applied to L₁ functions on the Boolean hypercube [10].
- The conjecture predicts that applying the smoothing operator to any L₁ function significantly improves tail decay, which is crucial for theoretical computer science, discrete mathematics, and statistical physics [10][21].

Group 2: Key Findings
- Chen's proof shows that for any non-negative function f on the Boolean hypercube, the probability of the smoothed function exceeding a given threshold decays at a rate better than the Markov inequality, specifically with a bound involving a log log factor [6][11].
- The result gives a positive answer to whether the tail probability, rescaled by η, vanishes as η approaches infinity, marking a significant improvement over previous methods [13][21].

Group 3: Methodology
- The core of Chen's method is a coupling between two Markov jump processes, constructed through a "perturbed reverse heat process," representing a major methodological advance in discrete stochastic analysis [15][20].
- The proof combines several innovative techniques, including total variation control and a multi-stage Duhamel formula, to achieve dimension-free bounds [20][21].

Group 4: Implications for Future Research
- The remaining log log η factor presents a clear target for future research; sharper coupling distances or alternative perturbation designs could potentially eliminate it [21][25].
- The work enhances the toolbox for handling high-dimensional discrete space probability distributions and connects to current AI trends, particularly in score-based generative models [23][24].
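Read quantitatively, the tail bound described above has roughly the following shape. This is a schematic reading of the summary only: the constant C(t) and the exact power of the log log term are our notation, not quoted from the paper.

```latex
% T_t denotes the heat semigroup on the Boolean hypercube \{-1,1\}^n.
% For f \ge 0 with \mathbb{E}[f] = 1, Markov's inequality alone gives
\Pr\left[T_t f \ge \eta\right] \le \frac{1}{\eta},
% whereas the regularization effect described in the summary sharpens this
% to a dimension-free bound of the schematic shape
\Pr\left[T_t f \ge \eta\right] \le \frac{C(t)\,\log\log\eta}{\eta\,\sqrt{\log\eta}}.
```

The √(log η) gain over Markov is exactly what Talagrand conjectured; the log log η factor is the residue the summary flags as the target for future work.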
Why Choose Between RLHF and RLVR? Danqi Chen's Team's Latest Work Extends Reasoning Ability to General Intelligence
机器之心· 2025-09-28 04:50
Core Insights
- The article discusses the introduction of a new method called Reinforcement Learning with Model Thinking (RLMT), which integrates explicit reasoning into general chat models, enhancing their performance in open-ended tasks [5][7][26].

Summary by Sections

Introduction
- The article highlights the recent academic contributions of Danqi Chen of Princeton University, whose team developed the RLMT method to bridge the gap between specialized reasoning capabilities and general conversational abilities in AI [2][5].

Methodology
- RLMT combines aspects of Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) to optimize language models for open-ended tasks [6][11].
- The training process is tested in two settings: supervised fine-tuning (SFT) to teach the desired reasoning format, and a "zero" setting that applies RLMT directly to base models without prior SFT [12][14].

Results
- Models trained with RLMT demonstrated superior performance on open-ended reasoning tasks compared to non-thinking baseline models, particularly on chat and creative-writing benchmarks [18][26].
- The article presents comparative data showing that RLMT models outperformed other models, including GPT-4o and Claude-3.7-Sonnet, on various chat benchmarks [19][20].

Conclusion
- RLMT successfully extends the advantages of explicit reasoning from specialized domains to general conversational AI, indicating its potential to reshape language model training methodologies [26][29].
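The optimization loop such a method implies can be loosely illustrated with a toy policy-gradient example. Everything below is our invention for illustration: the real RLMT trains an LLM policy (which emits a reasoning trace followed by a response) against a learned reward model, not a two-action bandit.

```python
import math
import random

random.seed(0)

ANSWERS = ["concise", "reasoned"]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reward(trace, answer):
    # Stand-in for a learned reward model: it prefers answers that were
    # produced after an explicit "think" trace, mimicking RLHF-style scoring.
    return 1.0 if (trace == "think" and answer == "reasoned") else 0.0

logits_trace = [0.0, 0.0]   # policy over ["direct", "think"]
logits_ans   = [0.0, 0.0]   # policy over ANSWERS
lr = 0.5

for _ in range(300):
    p_t = softmax(logits_trace)
    t = 0 if random.random() < p_t[0] else 1
    p_a = softmax(logits_ans)
    a = 0 if random.random() < p_a[0] else 1
    r = reward("think" if t == 1 else "direct", ANSWERS[a])
    # REINFORCE update: raise the log-probability of the sampled actions
    # in proportion to the reward-model score.
    for i in range(2):
        logits_trace[i] += lr * r * ((1.0 if i == t else 0.0) - p_t[i])
        logits_ans[i]   += lr * r * ((1.0 if i == a else 0.0) - p_a[i])

print(softmax(logits_trace)[1], softmax(logits_ans)[1])
```

After a few hundred updates the policy concentrates on "think then answer," which is the qualitative behavior the article attributes to RLMT-trained chat models.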
Cut Activation Memory by 80% Without Loss and Train on 5× Longer Sequences, with Just Two Lines of Code
机器之心· 2025-06-23 07:44
Core Insights
- The article discusses the StreamBP algorithm, which significantly reduces the memory required for training large language models (LLMs) by optimizing the backpropagation process [3][6][15].

Group 1: StreamBP Algorithm
- StreamBP reduces the memory consumption of activation values to about 20% of that required by gradient checkpointing, allowing for longer sequence lengths during training [3][6].
- Under the same memory constraints, StreamBP can reach a maximum sequence length 2.8 to 5.5 times greater than that of gradient checkpointing [6][22].
- The algorithm applies to common LLM objective functions such as SFT, GRPO, PPO, and DPO, and its code is open-sourced for integration into existing training frameworks [6][12].

Group 2: Memory and Performance Comparison
- In terms of memory usage, StreamBP requires only 5% to 15% of the total activation memory for all layers, while a single layer's complete activation values account for over 85% of the memory [13][19].
- A comparison of memory and time costs between standard backpropagation and StreamBP shows that StreamBP significantly reduces peak memory usage while maintaining similar computational cost [14][25].

Group 3: Application in LLM Training
- StreamBP is specifically designed to optimize memory usage in the Transformer layers and lm-head layers of LLMs, effectively lowering the memory consumption of layer activations and logits [16][20].
- By enabling longer sequence lengths, the algorithm allows for larger batch sizes and faster training, which is crucial for training efficiency [25][28].
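The memory saving rests on a simple identity that a toy can show. The sketch below is our illustration of the chunking principle, not StreamBP's released code: because the sequence loss is a sum over positions, its gradient can be accumulated chunk by chunk, so the full sequence of logits never has to be materialized at once.

```python
import math

def ce_grad(logits_row, target):
    # Gradient of cross-entropy w.r.t. one row of logits: softmax(row) - onehot.
    m = max(logits_row)
    exps = [math.exp(v - m) for v in logits_row]
    z = sum(exps)
    g = [e / z for e in exps]
    g[target] -= 1.0
    return g

# A tiny "sequence" of 4 positions over a 3-token vocabulary.
logits = [[0.2, -0.1, 0.5], [1.0, 0.0, -1.0], [0.3, 0.3, 0.3], [-0.5, 0.2, 0.1]]
targets = [2, 0, 1, 2]

# Full-sequence backward: every row of logits alive at once.
full = [ce_grad(r, t) for r, t in zip(logits, targets)]

# Chunked backward: process two positions at a time, keeping only one
# chunk of logits in memory; gradients are accumulated chunk by chunk.
chunked = []
for s in range(0, len(logits), 2):
    chunked.extend(ce_grad(r, t)
                   for r, t in zip(logits[s:s+2], targets[s:s+2]))

print(full == chunked)
```

The two passes produce identical gradients; the only difference is peak memory, which is why the reported sequence-length gains come with no change in the training objective.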
The Key People Behind Gemini 2.5's Come-from-Behind Rise
Hu Xiu· 2025-06-05 03:14
泓君 (Hongjun), founder of 《硅谷101》 (Silicon Valley 101), invited Energent.ai co-founder Kimi Kong and HeyRevia founder Shaun Wei, both former Google technical experts, to discuss the underlying logic behind the Gemini model's climb to the top of the leaderboards.

Below is a curated selection from the conversation:

I. The Underlying Logic Behind Gemini 2.5's Rise

泓君: The Gemini 2.5 Pro that Google has just released posts the best numbers of any large model on the current benchmarks. Kimi, can you break down how it achieved this?

From being "precision-sniped" by OpenAI's 4o model on the eve of last year's conference, to Gemini 2.5 Pro sweeping the leaderboards this year: in barely a year, how did Gemini pull off the reversal from chaser to front-runner?

Kimi: I have been away from DeepMind for almost a year now, so I don't really know what new innovations my former colleagues have made in that time. But the fundamental steps of training a large language model have not changed. There are three: Pre-training, SFT (Supervised Fine-tuning), and Alignment done with RLHF (Reinforcement Learning from Human Feedback).

Around last year's NeurIPS (Conference on Neural Information Processing Systems), the industry had already broadly accepted that the public web data had essentially all been scraped, just as fossil fuels have already ...
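The three-stage pipeline Kimi describes can be sketched as a simple composition. The function names and return values below are ours, purely illustrative of the stage ordering, not any real training API.

```python
# Hedged sketch of the standard LLM training pipeline: pre-training,
# then SFT, then RLHF-based alignment, each stage building on the last.

def pretrain(corpus):
    # Stage 1: next-token prediction over raw web-scale text.
    return {"stage": "pretrained", "docs_seen": len(corpus)}

def sft(model, demos):
    # Stage 2: supervised fine-tuning on curated instruction demonstrations.
    return {**model, "stage": "sft", "demos": len(demos)}

def rlhf_align(model, preferences):
    # Stage 3: alignment via RL against a reward model fit to human preferences.
    return {**model, "stage": "aligned", "preference_pairs": len(preferences)}

model = rlhf_align(sft(pretrain(["doc1", "doc2"]), ["demo1"]), ["pair1"])
print(model["stage"])
```

The point of the composition is the ordering: each stage consumes the previous stage's model, which is why, as Kimi notes, the shape of the recipe has stayed stable even as the data and innovations inside each stage change.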
Large-Model Training May Not Need "Clean Data": New Peking University Research Finds Random Noise Has Limited Impact, and a New Method Makes Models More Noise-Robust
量子位· 2025-02-27 09:37
Core Insights
- The article challenges the traditional belief that language models require "clean data" for effective training, suggesting that exposure to noise and imperfect data can still yield strong language capabilities [1][2].

Group 1: Research Findings
- Researchers from Peking University intentionally added random noise to training data and found that models could tolerate up to 20% "garbage data" with minimal impact on performance: the Next-token Prediction (NTP) loss increased by less than 1% [2][4].
- The study used the OpenWebText dataset and injected random noise ranging from 1% to 20%, demonstrating that even with significant noise, the model's predictive loss remained relatively stable [3][4].
- The findings indicate a complex relationship between noise and model performance, motivating a new method called "Local Gradient Matching" (LGM) to enhance model robustness in noisy environments [2][10].

Group 2: Theoretical Analysis
- The analysis posits that the presence of noise does not significantly alter the global minimum of the NTP loss, even at high noise levels, because the probability of meaningful text appearing within random noise is low [6][7].
- The study's assumptions extend to multilingual models, where different languages can be viewed as noise with respect to each other without adversely affecting the performance of individual languages [9].

Group 3: Practical Implications
- Despite the minor changes in pre-training loss, downstream tasks showed a decline in accuracy, highlighting a "loss-performance decoupling" phenomenon in which pre-training metrics do not fully capture model capabilities [10].
- The proposed LGM method improves noise resistance by constraining the gradient differences between original and perturbed features, thereby maintaining decision consistency under noise [10][12].
- Experimental results across various natural language understanding and visual classification datasets confirmed that LGM significantly improves performance for models affected by noise [11][12].

Group 4: Future Directions
- The research opens new perspectives on large-scale pre-training, suggesting that retaining some random noise can reduce data-cleaning costs, particularly for resource-constrained teams [15].
- Future work will explore the dynamic relationship between noise types and model capacity, as well as the application of LGM in other modalities [14].
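A minimal numerical reading of the LGM idea, using a made-up one-parameter model (the paper applies this to deep-feature gradients of a neural network, not to a scalar toy): the penalty is the squared difference between the loss gradient at a clean input and at a noise-perturbed input, added to the task loss.

```python
def loss(w, x, y):
    # Tiny linear model with squared error, purely for illustration.
    return (w * x - y) ** 2

def grad_wrt_x(w, x, y, eps=1e-6):
    # Central-difference numerical gradient of the loss w.r.t. the input x.
    return (loss(w, x + eps, y) - loss(w, x - eps, y)) / (2 * eps)

def lgm_penalty(w, x, y, noise, lam=0.1):
    # LGM-style term: penalize gradient mismatch between clean and
    # perturbed inputs, encouraging decision consistency under noise.
    g_clean = grad_wrt_x(w, x, y)
    g_noisy = grad_wrt_x(w, x + noise, y)
    return lam * (g_clean - g_noisy) ** 2

total = loss(2.0, 1.0, 1.5) + lgm_penalty(2.0, 1.0, 1.5, noise=0.05)
print(total)
```

With zero noise the penalty vanishes and training reduces to the plain objective; with noise present, the extra term pushes the model toward locally flat, noise-insensitive behavior, which is the robustness mechanism the summary describes.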