Large Language Model Training
A 36-Year-Old Convolution Conjecture Is Solved, with a Chinese Researcher as Sole Author; AI May Benefit
机器之心· 2025-11-26 05:12
Core Viewpoint
- The article discusses a significant mathematical breakthrough by Yuansi Chen, who solved the Talagrand convolution conjecture, a problem that had remained open for 36 years, with implications for modern computer science and machine learning [3][10].

Group 1: Background and Importance
- The Talagrand convolution conjecture, proposed in 1989, is one of the most important open problems in probability theory and functional analysis, concerning the regularization properties of the heat semigroup applied to L₁ functions on the Boolean hypercube [10].
- The conjecture predicts that applying this smoothing operator to any L₁ function significantly improves tail decay, a property of central importance in theoretical computer science, discrete mathematics, and statistical physics [10][21].

Group 2: Key Findings
- Chen's proof shows that for any non-negative function f on the Boolean hypercube, the probability that the smoothed function exceeds a threshold decays at a rate better than the one given by Markov's inequality, specifically with a bound involving a log log factor [6][11].
- The result gives a positive answer to whether the tail probability vanishes as η approaches infinity, a marked improvement over previous methods [13][21].

Group 3: Methodology
- The core of Chen's method is a coupling between two Markov jump processes, constructed through a "perturbed reverse heat process"; this represents a major methodological advance in discrete stochastic analysis [15][20].
- The proof combines several innovative techniques, including total-variation control and a multi-stage Duhamel formula, to achieve dimension-free bounds [20][21].

Group 4: Implications for Future Research
- The remaining log log η factor is a clear target for future research; improvements in the coupling distance or alternative perturbation designs could eliminate it [21][25].
- The work enriches the toolbox for handling probability distributions over high-dimensional discrete spaces and connects to current AI trends, particularly score-based generative models [23][24].
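For readers who want the shape of the statement, here is a rough schematic (a sketch only: the constant C_t and the exact power c on the iterated logarithm are placeholders, not the paper's precise form):

```latex
% Setup: f >= 0 on the Boolean hypercube \{-1,1\}^n with \mathbb{E}[f] = 1,
% and T_t the heat semigroup. Markov's inequality alone gives
%   \mathbb{P}(T_t f > \eta) \le 1/\eta.
% Talagrand conjectured an extra \sqrt{\log \eta} gain, uniformly in n:
\mathbb{P}\!\left( T_t f > \eta \right) \le \frac{C_t}{\eta \sqrt{\log \eta}},
\qquad \eta \to \infty,
% and Chen's theorem establishes this up to an iterated-logarithm loss,
% schematically:
\mathbb{P}\!\left( T_t f > \eta \right)
\le \frac{C_t \,(\log\log \eta)^{c}}{\eta \sqrt{\log \eta}}.
```

The key point is that C_t does not depend on the dimension n, which is what makes the bound usable in high-dimensional settings.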
Both RLHF and RLVR: Danqi Chen's Team's Latest Work Extends Reasoning Ability to General Intelligence
机器之心· 2025-09-28 04:50
Core Insights
- The article introduces a new method called Reinforcement Learning with Model-rewarded Thinking (RLMT), which integrates explicit reasoning into general chat models, improving their performance on open-ended tasks [5][7][26].

Summary by Sections

Introduction
- The article highlights the recent academic contributions of Danqi Chen of Princeton University, whose team developed RLMT to bridge the gap between specialized reasoning capabilities and general conversational abilities in AI [2][5].

Methodology
- RLMT combines aspects of Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) to optimize language models for open-ended tasks [6][11].
- Training follows two approaches: supervised fine-tuning (SFT) to teach the desired reasoning format, and a "zero" setting that applies RLMT directly to base models without any prior training [12][14].

Results
- Models trained with RLMT outperformed non-thinking baselines on open-ended reasoning tasks, particularly on chat and creative-writing benchmarks [18][26].
- Comparative results show RLMT models surpassing other models, including GPT-4o and Claude-3.7-Sonnet, on various chat benchmarks [19][20].

Conclusion
- RLMT successfully extends the advantages of explicit reasoning from specialized domains to general conversational AI, suggesting it could reshape how language models are trained [26][29].
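The core mechanic, generating a private reasoning trace but scoring only the visible reply with a reward model, can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the `<think>...</think>` template and the word-count stand-in for a learned reward model are assumptions made here for clarity.

```python
import re

def split_trace(generation: str) -> tuple[str, str]:
    """Separate the private reasoning trace from the visible reply.

    Assumes the model wraps its reasoning in <think>...</think>;
    the actual RLMT prompt template may differ.
    """
    m = re.match(r"\s*<think>(.*?)</think>\s*(.*)", generation, re.DOTALL)
    if m is None:                      # no trace emitted: treat everything as the reply
        return "", generation.strip()
    return m.group(1).strip(), m.group(2).strip()

def rlmt_reward(generation: str, reward_model) -> float:
    """RLMT-style reward: only the final reply is scored, so the model
    may reason at length without the trace itself being judged."""
    _trace, reply = split_trace(generation)
    return reward_model(reply)

# Toy stand-in for a learned preference reward model (NOT the real one).
toy_rm = lambda reply: min(len(reply.split()) / 10.0, 1.0)

gen = "<think>The user wants a short answer.</think>Paris is the capital of France."
print(rlmt_reward(gen, toy_rm))
```

In actual training, the scalar returned here would feed a policy-gradient update (the RLHF side), while RLVR-style verifiable rewards can replace `toy_rm` on tasks with checkable answers.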
Losslessly Cut Activation Memory by 80% and Train on 5× Longer Sequences, with Just Two Lines of Code
机器之心· 2025-06-23 07:44
Core Insights
- The article presents the StreamBP algorithm, which significantly reduces the memory required for training large language models (LLMs) by restructuring the backpropagation process [3][6][15].

Group 1: StreamBP Algorithm
- StreamBP reduces activation memory consumption to about 20% of what gradient checkpointing requires, allowing longer sequence lengths during training [3][6].
- Under the same memory constraints, StreamBP achieves maximum sequence lengths 2.8 to 5.5 times greater than gradient checkpointing [6][22].
- The algorithm applies to common LLM objective functions such as SFT, GRPO, PPO, and DPO, and its code is open-sourced for integration into existing training frameworks [6][12].

Group 2: Memory and Performance Comparison
- In terms of memory usage, StreamBP retains only 5% to 15% of the total activation memory across all layers, whereas a single layer's complete activation values would otherwise account for over 85% of the memory [13][19].
- Comparing memory and time costs between standard backpropagation and StreamBP shows that StreamBP significantly reduces peak memory usage while keeping computational cost similar [14][25].

Group 3: Application in LLM Training
- StreamBP specifically targets the Transformer layers and the lm_head layer of LLMs, lowering the memory consumed by layer activations and logits [16][20].
- By enabling longer sequences, the algorithm also permits larger batch sizes and faster training, which is crucial for training efficiency [25][28].
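The sequence-chunking idea can be illustrated with the loss/logits part alone. The sketch below (a toy, not the StreamBP implementation: real StreamBP also streams the backward pass through Transformer layers) computes the mean token NLL one chunk of positions at a time, so peak memory scales with the chunk size rather than the sequence length, and the result is exactly equal to the full computation:

```python
import math

def chunked_nll(logits, targets, chunk: int) -> float:
    """Mean token negative log-likelihood, computed one sequence chunk
    at a time instead of materializing log-probs for all positions.
    Chunking changes memory behavior, not the result."""
    total, n = 0.0, len(targets)
    for start in range(0, n, chunk):
        # Only this chunk's rows are "live" at a time.
        for row, t in zip(logits[start:start + chunk],
                          targets[start:start + chunk]):
            lse = math.log(sum(math.exp(x) for x in row))  # log-sum-exp
            total += lse - row[t]                          # -log p(target)
    return total / n

logits = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0], [0.5, 0.0]]
targets = [0, 1, 0, 1]
full = chunked_nll(logits, targets, chunk=len(targets))
streamed = chunked_nll(logits, targets, chunk=2)
assert abs(full - streamed) < 1e-12   # chunking is exact, not approximate
```

This exactness is what the article's "lossless" claim refers to: the streamed computation produces identical losses and gradients, only with a smaller peak activation footprint.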
The Key Figures Behind Gemini 2.5's Overtaking of the Competition
Hu Xiu· 2025-06-05 03:14
Group 1: Core Insights on Gemini 2.5
- Gemini 2.5 Pro has achieved the best performance metrics among large models, marking a significant leap from follower to leader in the AI model landscape [2][20]
- The training process of Gemini 2.5 emphasizes three fundamental steps: pre-training, supervised fine-tuning (SFT), and Reinforcement Learning from Human Feedback (RLHF) for alignment [2][3]
- The focus on reinforcement learning, particularly for tasks with clear objectives such as mathematics and programming, has contributed to Gemini's impressive performance [3][4]

Group 2: Competitive Landscape and Model Development
- Google accumulated substantial foundational training experience from previous Gemini versions, which a greater emphasis on reinforcement learning has since amplified [3][4]
- Other companies, such as Anthropic, prioritized coding capabilities in their models, producing a notable quality gap in code generation relative to competitors [4][5]
- Shifting focus from human-preference outputs to programming capability was a strategic move for Google, allowing it to catch up with competitors like OpenAI [10][11]

Group 3: Key Personnel and Organizational Dynamics
- Key figures in Google's AI development include Jeff Dean, Oriol Vinyals, and Noam Shazeer, whose expertise in pre-training, reinforcement learning, and natural language processing has significantly shaped the model's capabilities [15][16]
- The integration of Google's and DeepMind's strengths has created a powerful synergy, enhancing the overall capabilities of the Gemini model [17]
- Sergey Brin's return to Google has reinvigorated the company's culture, fostering a more ambitious and motivated environment among employees [20]

Group 4: API Pricing Strategy
- Gemini's API pricing is significantly lower than competitors', with token costs roughly one-fifth to one-tenth of OpenAI's [21][22]
- Google's long-term investment in TPU technology has reduced its dependency on external GPU suppliers, contributing to lower operational costs [22][23]
- The ability to customize hardware and leverage extensive infrastructure resources has enabled Google to optimize model performance and pricing effectively [23][24]
Large Model Training May Not Need "Clean Data"! New PKU Study: Random Noise Has Limited Impact, and a New Method Makes Models More Noise-Robust
量子位· 2025-02-27 09:37
Core Insights
- The article challenges the traditional belief that language models require "clean data" for effective training, suggesting that exposure to noisy, imperfect data can still yield strong language capabilities [1][2].

Group 1: Research Findings
- Researchers from Peking University intentionally added random noise to training data and found that models could tolerate up to 20% "garbage data" with minimal impact on performance: the Next-token Prediction (NTP) loss increased by less than 1% [2][4].
- The study used the OpenWebText dataset and injected random noise at ratios from 1% to 20%, demonstrating that even with substantial noise the model's predictive loss remained relatively stable [3][4].
- The findings reveal a complex relationship between noise and model performance, motivating a new method called Local Gradient Matching (LGM) that improves model robustness in noisy environments [2][10].

Group 2: Theoretical Analysis
- The analysis argues that noise does not significantly shift the global minimum of the NTP loss, even at high noise levels, because the probability of meaningful text arising within random noise is vanishingly small [6][7].
- The study's assumptions extend to multilingual models, where different languages can be viewed as noise relative to one another without harming the performance of any individual language [9].

Group 3: Practical Implications
- Despite the minor changes in pre-training loss, downstream task accuracy declined, exposing a "loss-performance decoupling" phenomenon in which pre-training metrics do not fully capture model capability [10].
- The proposed LGM method improves noise resistance by constraining the gap between the gradients computed on original and perturbed features, keeping decisions consistent under noise [10][12].
- Experiments across a range of natural language understanding and visual classification datasets confirmed that LGM significantly improves performance for models affected by noise [11][12].

Group 4: Future Directions
- The research offers a new perspective on large-scale pre-training: retaining some random noise can reduce data-cleaning costs, which is particularly valuable for resource-constrained teams [15].
- Future work will explore the dynamic relationship between noise types and model capacity, as well as applying LGM to other modalities [14].
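The gradient-matching penalty at the heart of LGM can be shown on a toy scalar model. This is a sketch under simplifying assumptions (a one-parameter squared-error model with hand-derived gradients; the paper applies the constraint to deep-feature gradients), not the authors' implementation:

```python
def task_grad(w: float, x: float, y: float) -> float:
    """d/dw of the squared error 0.5*(w*x - y)**2 for one example."""
    return (w * x - y) * x

def lgm_penalty(w: float, x_clean: float, x_noisy: float,
                y: float, lam: float = 0.1) -> float:
    """Local Gradient Matching (LGM) sketch: penalize the squared gap
    between the loss gradient on clean features and on perturbed
    features, pushing the model's decisions to stay consistent
    under noise. Added to the task loss with weight `lam`."""
    g_clean = task_grad(w, x_clean, y)
    g_noisy = task_grad(w, x_noisy, y)
    return lam * (g_clean - g_noisy) ** 2

w, y = 0.5, 1.0
print(lgm_penalty(w, x_clean=2.0, x_noisy=2.0, y=y))  # identical features: zero penalty
print(lgm_penalty(w, x_clean=2.0, x_noisy=2.6, y=y))  # perturbed features: positive penalty
```

Minimizing the task loss plus this penalty is what "constraining the gradient differences between original and perturbed features" refers to above: when the two gradients agree, the penalty vanishes and training proceeds as usual.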