强化学习（RL） - filings, earnings calls, financial reports, news - Reportify

强化学习（RL）

Search documents

大模型刷数学题竟有害？CMU评估20+模型指出训练陷阱

量子位· 2025-07-07 06:13

Core Viewpoint - The article discusses the relationship between mathematical reasoning capabilities of large language models (LLMs) and their ability to transfer these skills to other tasks, highlighting that models trained with reinforcement learning (RL) show better transferability compared to those trained with supervised fine-tuning (SFT) [4][11]. Group 1: Mathematical Reasoning and Transferability - Research indicates that only models trained with RL can effectively transfer mathematical reasoning skills to other tasks, while SFT models show limited or no transfer [4][11]. - A Transferability Index (TI) is introduced to quantify the extent to which improvements in mathematical reasoning can be applied to other reasoning and non-reasoning tasks [8][9]. - If TI is greater than 0, it indicates a positive transfer effect to other tasks; if less than 0, it indicates negative transfer [9]. Group 2: Experimental Findings - The study evaluated over 20 models across various tasks, including mathematical reasoning, other reasoning tasks (like medical reasoning), and non-reasoning tasks (like common-sense dialogue) [7]. - Results show that models fine-tuned with RL consistently achieve higher transferability metrics across reasoning and non-reasoning tasks, while SFT models often experience negative transfer in non-reasoning tasks [11]. Group 3: Model Representation and Performance - PCA analysis reveals that RL fine-tuned models exhibit minimal shifts in representation space, indicating they retain previously learned knowledge while enhancing performance in specific domains [15]. - RL models demonstrate lower KL divergence in reasoning and non-reasoning tasks compared to SFT models, suggesting more stable and precise representation updates [16][18]. - The findings suggest that RL is crucial for achieving transferable reasoning capabilities in LLMs, marking another victory for reinforcement learning in this context [19].

强化学习（RL）

监督微调（SFT）

迁移能力指标（Transferability Index

Artificial Intelligence

强化学习（RL）

监督微调（SFT）

迁移能力指标（Transferability Index

Artificial Intelligence

图像目标导航的核心究竟是什么？

具身智能之心· 2025-07-04 12:07

Research Background and Core Issues - Image goal navigation requires two key capabilities: core navigation skills and direction information calculation based on visual observation and target image comparison [2] - The research focuses on whether this task can be efficiently solved through end-to-end training of complete agents using reinforcement learning (RL) [2] Core Research Content and Methods - The study explores various architectural designs and their impact on task performance, emphasizing implicit correspondence computation between images [3][4] - Key architectures discussed include Late Fusion, ChannelCat, SpaceToDepth + ChannelCat, and Cross-attention [4] Main Findings - Early patch-level fusion methods (like ChannelCat and Cross-attention) are more critical than late fusion methods (Late Fusion) for supporting implicit correspondence computation [8] - The performance of different architectures varies significantly under different simulator settings, particularly the "Sliding" setting [8][10] Performance Metrics - The success rate (SR) and success path length (SPL) metrics are used to evaluate the performance of various models [7] - For example, when Sliding=True, ChannelCat (ResNet9) achieved an SR of 83.6%, while Late Fusion only reached 13.8% [8] Transferability of Abilities - Some learned capabilities can transfer to more realistic environments, especially when including the weights of the perception module [10] - Training with Sliding=True and then fine-tuning in a Sliding=False environment improved SR from 31.7% to 38.5% [10] Relationship Between Navigation and Relative Pose Estimation - A correlation exists between navigation performance and relative pose estimation accuracy, indicating the importance of direction information extraction in image goal navigation [12] Conclusion - Architectural designs that support early local fusion (like Cross-attention and ChannelCat) are crucial for implicit correspondence computation [15] - The simulator's Sliding setting significantly affects performance, but transferring perception module weights can help retain some capabilities in real-world scenarios [15] - Navigation performance is related to relative pose estimation ability, confirming the core role of direction information extraction in image goal navigation [15]

图像目标导航

强化学习（RL）

相对位姿估计

交叉注意力（Cross-attention）

晚期融合（Late Fusion）

图像目标导航

强化学习（RL）

相对位姿估计

交叉注意力（Cross-attention）

晚期融合（Late Fusion）

ToMAP：赋予大模型「读心术」，打造更聪明的AI说服者

机器之心· 2025-06-24 14:07

Core Viewpoint - The article introduces ToMAP, a new persuasion model that integrates Theory of Mind (ToM) mechanisms to enhance the persuasive capabilities of AI, addressing the limitations of current large language models in understanding opponents' perspectives and adapting strategies accordingly [4][19]. Summary by Sections Introduction to Persuasion - Persuasion is a complex communication process that influences beliefs, attitudes, and behaviors, and serves as a test for advanced large language models [2]. Limitations of Current Models - Top-tier large models can generate coherent persuasive text but lack mental perception, which hinders their ability to effectively persuade [3][4]. ToMAP Model Overview - ToMAP introduces two key mental modules: the Refutation Predictor and the Attitude Predictor, enabling AI to anticipate opposing viewpoints and assess the opponent's attitude dynamically [9][19]. Refutation Predictor - The Refutation Predictor simulates human-like anticipation of counterarguments, allowing the model to address concerns proactively. It can identify common objections, such as "cooking is troublesome" or "the taste is bad" in discussions about vegetarian recipes [9][10]. Attitude Predictor - The Attitude Predictor evaluates the opponent's stance towards counterarguments, determining whether they are firmly against, neutral, or persuaded. This module uses dialogue history and arguments to dynamically assess the opponent's attitude [9][11]. Training Methodology - ToMAP employs reinforcement learning (RL) to train the model through numerous dialogues, rewarding it based on a "persuasiveness score" that measures attitude changes before and after interactions [11][19]. Experimental Results - The model was tested across various datasets, showing that ToMAP significantly outperforms baseline models and even larger models like GPT-4o, demonstrating its effectiveness despite having fewer parameters [14][20]. Performance Insights - ToMAP maintains a low level of repetition while increasing the diversity of outputs, indicating effective use of the mental modules. It also shows a higher depth of thought compared to baseline models, favoring rational strategies over emotional appeals [15][16]. Long-term Persuasiveness - Unlike baseline models that plateau or decline in effectiveness over extended dialogues, ToMAP continues to improve its persuasiveness, showcasing its adaptability and diverse argumentation [17][20]. Conclusion - ToMAP represents a significant advancement in AI persuasion frameworks, integrating social cognition features that allow for a more human-like understanding of opponents' cognitive structures and attitudes [20][21].

心智理论（ToM）

强化学习（RL）

反驳预测器

态度预测器

心智理论（ToM）

强化学习（RL）

反驳预测器

态度预测器

搜索智能体RAG落地不佳？UIUC开源s3，仅需2.4k样本，训练快效果好

机器之心· 2025-06-17 00:10

Core Insights - The article discusses the emergence of Agentic RAG (Retrieval-Augmented Generation) as a key method for large language models to access external knowledge, highlighting the limitations of current reinforcement learning (RL) training methods in achieving stable performance [1][8]. Group 1: Development of RAG Systems - The evolution of RAG systems is categorized into three stages: Classic RAG, Pre-RL-Zero Active RAG, and RL-Zero stage, with each stage introducing new methodologies to enhance retrieval and generation capabilities [7][8]. - The RL-based methods, while promising, face challenges such as misalignment of optimization goals with actual downstream tasks and the coupling of retrieval and generation processes, which complicates performance evaluation [9][12]. Group 2: Limitations of Current RL Methods - Current RL methods like Search-R1 and DeepRetrieval focus on Exact Match (EM) as a reward metric, which can lead to suboptimal training outcomes due to its strictness and insensitivity to semantic variations [9][10]. - The coupling of retrieval and generation in training can obscure the true performance improvements, making it difficult to discern whether gains are due to better search or enhanced language generation [11][12]. - Existing evaluation metrics fail to accurately measure the contribution of search quality to overall performance, leading to bottlenecks in assessment, training, and generalization [14]. Group 3: Introduction of s3 Framework - The s3 framework, proposed by UIUC and Amazon, aims to improve training efficiency and effectiveness by decoupling the search and generation processes, focusing solely on optimizing the searcher with a new reward function called Gain Beyond RAG (GBR) [1][17]. - s3 demonstrates significant efficiency, requiring only 2.4k training samples and achieving superior performance compared to larger baseline models, with a total training time of just 114 minutes [21][22][25]. Group 4: Experimental Results - In general QA tasks, s3 outperformed both Search-R1 and DeepRetrieval across multiple datasets, showcasing its strong generalization capabilities [23][25]. - In medical QA tasks, s3 exhibited remarkable cross-domain performance, indicating its robustness and adaptability to different datasets and contexts [26][27]. Group 5: Design and Optimization Insights - The design of s3 emphasizes the importance of starting retrieval from the original query, which helps maintain focus and improves search outcomes [31]. - The document selection mechanism within s3 significantly reduces token consumption, enhancing efficiency and minimizing noise in the generation process [31][30].

Retrieval-Augmented Generation (RAG)

强化学习（RL）

生成式人工智能

Retrieval-Augmented Generation (RAG)

强化学习（RL）

生成式人工智能

揭秘LLM“思考”之谜：推理即“梯度下降”，元学习框架解构训练过程，还给优化提供新思路

量子位· 2025-06-10 04:05

Core Insights - The article introduces the Reasoning as Meta-Learning (RaML) framework, which aims to reveal how large language models (LLMs) "think" by drawing parallels between reasoning and gradient descent optimization [1][2] - RaML posits that the reasoning trajectory generated by LLMs during problem-solving acts as a form of implicit parameter updates, leading to improved model performance [2][4] Group 1: RaML Framework and Mechanism - RaML's core insight is that the reasoning trajectory in LLMs resembles a "pseudo-gradient descent" process, where each reasoning step adjusts the model's internal state towards a better solution [2] - The framework decomposes the training process of LLMs into two levels: "inner-loop optimization" for specific tasks and "outer-loop optimization" for learning strategies across multiple tasks [8][9] - The study emphasizes that longer reasoning trajectories typically lead to better optimization outcomes, akin to more iterations in traditional optimization algorithms [14] Group 2: Empirical Validation and Performance - The QwQ-32B model's reasoning on the AIME24 dataset demonstrated that confidence in correct answers increases with the decoding of reasoning trajectories, supporting the idea of parameter updates through reasoning [3][4] - The comparison between supervised fine-tuning (SFT) and reinforcement learning (RL) models showed that SFT models outperform RL models in mathematical benchmarks, highlighting the benefits of guided learning [10][12] Group 3: Reflection Tokens and Optimization - The article discusses the role of "reflection" tokens in reasoning trajectories, which help the model reassess its outputs and improve performance by escaping local optima [15][17] - It contrasts "thinking" and "non-thinking" modes, indicating that forced early termination of reasoning can lead to suboptimal solutions, similar to premature stopping in gradient descent [18][20] Group 4: Generalization and Meta-Learning - The research indicates that LLMs trained on specific reasoning tasks can generalize to unseen tasks, leveraging learned universal features from various problems [21][23] - The RaML framework provides practical strategies for enhancing training performance by increasing the number of reasoning trajectories for each problem, akin to expanding the support set in meta-learning [25] Group 5: Future Directions and Efficiency - The article suggests exploring methods to extract shorter, equivalent optimization trajectories from longer reasoning paths to reduce decoding overhead while maintaining performance [27][30] - Initial experiments show that summarizing long reasoning trajectories can yield comparable results with significantly reduced computational costs, indicating a potential area for future research [30][31] Conclusion - The RaML framework offers a novel perspective on understanding LLM reasoning and training, revealing the intricate connections between reasoning, meta-learning, and gradient descent [32]

大语言模型（LLM）

元学习（Meta-Learning）

有监督微调（SFT）

强化学习（RL）

QwQ - 32B模型

大语言模型（LLM）

元学习（Meta-Learning）

有监督微调（SFT）

强化学习（RL）

QwQ - 32B模型

英伟达揭示RL Scaling魔力！训练步数翻倍=推理能力质变，小模型突破推理极限

机器之心· 2025-06-04 04:41

Core Insights - The article discusses the potential of Prolonged Reinforcement Learning (ProRL) in enhancing reasoning capabilities in language models, suggesting that it can lead to significant improvements in model performance rather than merely optimizing existing knowledge retrieval [1][15]. Group 1: ProRL Framework - ProRL framework significantly increases the training steps from hundreds to over 2000, unlocking the hidden potential of smaller models [3]. - The framework incorporates a diverse set of verifiable rewards from various domains, providing reliable supervision signals for RL training [5]. - The combination of GRPO and DAPO algorithms enhances training efficiency by avoiding policy update imbalances and filtering ineffective samples [7]. Group 2: Performance Improvements - The Nemotron-Research-Reasoning-Qwen-1.5B model demonstrates remarkable performance across various tasks, outperforming larger models in specific areas [9][10]. - ProRL leads to a 14.7% improvement in mathematical tasks, surpassing 7B models, and a 6.5% lead in code generation over DeepCoder-1.5B [12]. - In logical reasoning, accuracy improves by 54.8%, showcasing the model's enhanced capabilities [12][13]. Group 3: Creativity and Reasoning Expansion - ProRL enables models to solve problems that base models could not, achieving a pass@k of 100% in previously unsolvable tasks [13]. - The training process fosters creativity, allowing models to generate new problem-solving paths rather than relying on rote answers [6][14]. - The longer the training, the stronger the model's ability to deviate from pre-training data, resulting in richer and more creative reasoning strategies [14]. Group 4: Future Implications - The research indicates that ProRL could be the key to developing small language models with strong reasoning capabilities, low deployment costs, and high generalization abilities [16][17].

Nvidia(US:NVDA)

强化学习（RL）

小语言模型

ProRL（Prolonged Reinforcement Learning）框架

强化学习（RL）

小语言模型

ProRL（Prolonged Reinforcement Learning）框架

SFT在帮倒忙？新研究：直接进行强化学习，模型多模态推理上限更高

机器之心· 2025-06-01 03:30

Core Insights - The article discusses the limitations of the "Supervised Fine-Tuning (SFT) + Reinforcement Learning (RL)" paradigm in developing large vision-language models (LVLM), suggesting that SFT may hinder learning and lead to superficial reasoning paths, while RL promotes genuine multimodal reasoning [3][11][21]. Group 1: Research Findings - A study from the University of California, Santa Cruz, and the University of Texas at Dallas reveals that SFT can obstruct learning, often resulting in "pseudo-reasoning paths" that lack depth [3][11]. - The research team created the VLAA-Thinking dataset to systematically investigate the roles of SFT and RL in multimodal reasoning, highlighting the unique contributions of each method [4][8]. - The findings indicate that while SFT improves performance on standard tasks, it falls short in enhancing complex reasoning capabilities, leading to a 47% relative performance decline in a 7B model [11][13]. Group 2: Data and Methodology - The VLAA-Thinking dataset comprises 203,182 samples, with 126,413 for SFT and 25,195 for RL, designed to facilitate high-quality reasoning chains [5][6]. - The research employed a six-stage data processing workflow to effectively transfer reasoning capabilities from pure text models to LVLMs [6][8]. - A mixed reward function was innovatively designed within the GRPO framework to optimize RL in visual contexts, incorporating various reward types for different problem categories [8][19]. Group 3: Performance Analysis - The study found that SFT's imitative reasoning patterns can limit the exploration space during the RL phase, suggesting that direct learning from reward signals is more effective [15][26]. - Models trained solely with GRPO outperformed those that underwent SFT, with the VLAA-Thinker-Qwen2.5-VL-3B model ranking first in the Open LMM reasoning leaderboard for 4B models, achieving a 1.8% record improvement [15][31]. - The analysis revealed that response length and reward scores do not correlate significantly with performance, challenging previous assumptions about their relationship [24][26]. Group 4: Implications for Future Research - The findings suggest that SFT is currently incompatible with GRPO in the context of multimodal reasoning, potentially damaging the performance of both foundational and instruction-tuned LVLMs [21][22]. - The research emphasizes the need for high-quality instruction tuning to enhance model performance in RL settings, indicating that better instruction tuning leads to improved reasoning capabilities post-RL training [31].

监督微调（SFT）

强化学习（RL）

多模态推理

视觉 - 语言大模型（LVLM）

VLAA-Thinking数据集

VLAA-Thinker-Qwen2.5VL-3B模型

监督微调（SFT）

强化学习（RL）

多模态推理

视觉 - 语言大模型（LVLM）

VLAA-Thinking数据集

VLAA-Thinker-Qwen2.5VL-3B模型

LLM加RL遭质疑：故意用错奖励，数学基准也显著提升，AI圈炸了

机器之心· 2025-05-28 08:09

Core Insights - The article discusses a recent paper that challenges the effectiveness of reinforcement learning (RL) in training large language models (LLMs), particularly in the context of using false rewards to enhance performance [3][4][5]. Group 1: Findings on Reinforcement Learning - The study reveals that using false rewards, including random and incorrect rewards, can significantly improve the performance of the Qwen2.5-Math-7B model on the MATH-500 benchmark, with random rewards improving scores by 21% and incorrect rewards by 25% compared to a 28.8% improvement with true rewards [5][10]. - The research questions the traditional belief that high-quality supervision signals are essential for effective RL training, suggesting that even minimal or misleading signals can yield substantial improvements [7][19]. Group 2: Model-Specific Observations - The effectiveness of RL with false rewards appears to be model-dependent, as other models like Llama3 and OLMo2 did not show similar performance gains when subjected to false rewards [16][17]. - The Qwen model demonstrated a unique ability to leverage code generation for mathematical reasoning, achieving a code generation frequency of 65% prior to RL training, which increased to over 90% post-training [28][34]. Group 3: Implications for Future Research - The findings indicate that future RL research should explore the applicability of these methods across diverse model families, rather than relying solely on a single model's performance [25][49]. - Understanding the pre-existing reasoning patterns learned during pre-training is crucial for designing effective RL training strategies, as these patterns significantly influence downstream performance [50].

大语言模型（LLM）

强化学习（RL）

可验证奖励强化学习（RLVR）

Qwen2.5-Math-7B

Qwen2.5-Math-1.5B

大语言模型（LLM）

强化学习（RL）

可验证奖励强化学习（RLVR）

Qwen2.5-Math-7B

Qwen2.5-Math-1.5B

MiniMax开源首个视觉RL统一框架，闫俊杰领衔！推理感知两手抓，性能横扫MEGA-Bench

量子位· 2025-05-27 12:31

Core Insights - The article discusses the introduction of the V-Triune framework by MiniMax, which allows for unified learning of visual reasoning and perception tasks within a single reinforcement learning (RL) system [1][11] - The framework addresses the limitations of traditional RL methods that typically focus on either reasoning or perception tasks, enabling a more comprehensive approach to visual tasks [2][8] Framework and Model Development - V-Triune employs a three-layer component design and a dynamic Intersection over Union (IoU) reward mechanism to effectively balance multiple tasks [2][22] - The Orsta model series, developed based on V-Triune, ranges from 7 billion to 32 billion parameters and has shown significant performance improvements in the MEGA-Bench Core benchmark, with enhancements ranging from +2.1% to +14.1% [3][30] Technical Implementation - The framework allows for sample-level data formatting, enabling custom reward settings and verifiers for each sample, thus supporting dynamic routing and weight adjustments [13][14] - An asynchronous client-server architecture is utilized to decouple reward calculation from the main training loop, enhancing flexibility in task expansion and reward logic updates [15][18] Monitoring and Stability - The system includes a monitoring mechanism that tracks various metrics such as reward values, IoU, mean Average Precision (mAP), response length, and reflection rates to ensure learning stability [19][21] - Dynamic IoU rewards are introduced to alleviate cold start issues and guide models in improving localization accuracy through phased threshold adjustments [22][24] Performance Metrics - The Orsta models have been trained on a diverse dataset covering four types of reasoning tasks and four types of perception tasks, leading to significant improvements in performance metrics, particularly in perception tasks [30][31] - The article highlights the effectiveness and scalability of the unified approach, as evidenced by the substantial gains in mAP metrics during testing [30] Company Background - MiniMax, recognized as one of the "Six Little Giants" in AI, has been actively expanding its capabilities in the multimodal field, developing models that span language, audio, and video [32] - The company aims to innovate in multimodal architecture, focusing on a unified generative understanding model [35]

强化学习（RL）

多模态领域

Artificial Intelligence

强化学习（RL）

多模态领域

Artificial Intelligence

微软副总裁X上「开课」，连更关于RL的一切，LLM从业者必读

机器之心· 2025-05-26 01:28

Core Viewpoint - The article discusses the educational series on artificial intelligence initiated by Nando de Freitas, focusing on reinforcement learning (RL) and its applications in large language models (LLMs) [1][2]. Summary by Sections Introduction to AI Education - Nando de Freitas aims to educate readers on AI through a series of posts on X, starting with reinforcement learning and gradually covering diffusion and flow matching technologies [1][2]. Learning Types - The article highlights that there is no ultimate conclusion on unsupervised learning, supervised learning, and reinforcement learning [8][19]. - Supervised learning is described as basic imitation, requiring high-quality expert data for effective learning [9]. - Reinforcement learning focuses on selective imitation, allowing agents to learn from suboptimal experiences and improve their performance [10][11]. Distributed Reinforcement Learning Systems - Modern distributed RL systems consist of two main components: Actors and Learners, where Actors interact with the environment and collect data, while Learners update the policy network based on this data [23][24]. - The importance of measuring operational durations and communication bandwidth in such systems is emphasized [24][27]. Offline Reinforcement Learning - Offline RL has unique value in scenarios like post-training LLMs, where it can leverage historical data for learning [28][29]. Single-step and Multi-step RL - The article differentiates between single-step and multi-step RL problems, with single-step focusing on immediate actions and multi-step involving planning over a series of interactions [35][39]. - The complexity of multi-step RL is noted, particularly in credit assignment issues where multiple decisions affect outcomes [40][41]. Policy Gradient and Techniques - Policy gradient methods are discussed, including the use of baseline subtraction to reduce variance in reward signals [49][56]. - The article also covers the significance of KL divergence in maintaining proximity to supervised fine-tuning strategies during post-training [69]. Importance Sampling and PPO - Importance sampling is introduced as a method to correct off-policy sample bias, with Proximal Policy Optimization (PPO) being a key technique to manage policy updates [73][78]. - The integration of various techniques in training models like DeepSeek-R1 is highlighted, showcasing the complexity of modern RL systems [81]. Future Directions - Freitas plans to expand the discussion from single-step to multi-step RL, indicating ongoing developments in the field [82].

Microsoft(US:MSFT)

强化学习（RL）

无监督学习

近端策略优化（PPO）

强化学习（RL）

无监督学习

近端策略优化（PPO）