Reinforcement Learning with Verifiable Rewards (RLVR)
These Top Researchers' Papers at Meta: Read One, and There's One Fewer Left
36Kr · 2025-11-17 09:52
Core Insights
- The article discusses a perplexing phenomenon in large model reinforcement learning (RL) training, where significant performance improvements occur despite minimal parameter changes [1][3]

Group 1: Research Findings
- The paper analyzes the training dynamics of Reinforcement Learning with Verifiable Rewards (RLVR), debunking the misconception that sparse parameter updates are merely superficial; instead, it reveals a consistent optimization bias inherent in RLVR [3][5]
- The research introduces a new framework called the Three-Gate Theory, which explains how RLVR parameter updates are steered towards specific parameter regions [5][7]

Group 2: Parameter Update Characteristics
- The study highlights a paradox where RL training yields high performance gains with sparse parameter updates, in contrast with the dense updates seen in supervised fine-tuning (SFT) [5][6]
- The sparsity of updates in RL training ranges from 36% to 92%, while SFT shows sparsity between 0.6% and 18.8%, indicating a significant difference in update density [5][6]

Group 3: Three-Gate Theory Components
- The first gate, KL Anchoring, ensures that RL updates do not deviate significantly from the model's original output distribution, keeping parameter drift small [8]
- The second gate, Model Geometry, indicates that RL updates prefer low-curvature directions in the optimization landscape, preserving the model's original weight structure [9]
- The third gate, Precision, explains that the limited precision of bfloat16 can mask small RL updates, producing the appearance of sparsity (illustrated in the sketch after this summary) [11]

Group 4: Implications for Parameter-Efficient Fine-Tuning
- The findings suggest that many parameter-efficient fine-tuning (PEFT) methods from the SFT era do not transfer well to RLVR, particularly those built on sparse or low-rank priors [17]
- The study indicates that updating non-principal, low-amplitude weights aligns better with RLVR's optimization trajectory, while methods like PiSSA may not provide additional benefits and can lead to instability [17]
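To make the Precision gate concrete, here is a minimal PyTorch sketch, not the paper's measurement code: the tensor size and update scale are illustrative assumptions. It shows how bfloat16 storage can round away small RL-sized updates, so a checkpoint diff looks sparse even though the optimizer moved essentially every weight.

```python
import torch

def update_sparsity(before: torch.Tensor, after: torch.Tensor) -> float:
    """Fraction of stored weights whose value did not change at all between checkpoints."""
    return (before == after).float().mean().item()

# Toy illustration of the "Precision" gate: a small RL-sized update applied in
# float32 changes (almost) every weight, but storing the same weights in
# bfloat16 rounds many of those tiny changes away, so the diff looks sparse.
torch.manual_seed(0)
w = torch.randn(10_000) * 0.1          # stand-in for one weight matrix
delta = torch.randn(10_000) * 1e-4     # small, RL-scale update

fp32_sparsity = update_sparsity(w, w + delta)
bf16_sparsity = update_sparsity(w.to(torch.bfloat16), (w + delta).to(torch.bfloat16))

print(f"update sparsity when stored in float32 : {fp32_sparsity:.3f}")   # ~0.00
print(f"update sparsity when stored in bfloat16: {bf16_sparsity:.3f}")   # well above 0
```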
Boosting Reasoning Performance Without Changing the Model? ICLR Submission Proposes OTV, a New Test-Time Scaling Paradigm
量子位· 2025-10-23 00:08
Core Insights
- The article discusses the challenges faced by large language models, including hallucinations, logical errors, and reasoning flaws, prompting researchers to explore new methods to enhance output reliability [1]
- A novel approach called One-Token Verification (OTV) is introduced, which allows models to monitor their reasoning process in real time without altering the original model's structure or parameters [2]

Summary by Sections

Current Mainstream Paradigms
- LoRA fine-tuning is highlighted as a popular parameter-efficient tuning method that avoids full-parameter training and is easy to deploy, but it often relies on detailed supervised data and can lead to "forgetting effects" [3]
- Quality screening of generated results can enhance output credibility but tends to be reactive, making it difficult to correct the model's reasoning in real time and offering no insight into the internal reasoning process [4]

Parallel Thinking Framework
- The article introduces the concept of Parallel Thinking, which lets language models generate multiple reasoning paths simultaneously and then filter them through a specific mechanism [5]
- OTV builds on this framework by focusing on efficiently selecting correct reasoning paths at a lower cost, rather than on generating more paths [5]

OTV Mechanism
- OTV employs an internal verifier that analyzes the reasoning process using a lightweight role vector implemented via LoRA, running in parallel with the original model [9]
- The internal verifier utilizes the key-value cache (KV Cache) of the Transformer architecture to capture rich information about the model's internal dynamics during reasoning [9]
- A special token, referred to as the "Token of Truth" (ToT), is inserted during the verification phase to assess the correctness of the reasoning path [9]

Training and Efficiency
- OTV's internal verifier is designed to be lightweight, with a training scheme that assigns heuristic pseudo-labels based on the correctness of the final answer [10]
- The training process is highly parallelized, allowing simultaneous scoring predictions for all positions, making it computationally comparable to conventional LoRA fine-tuning [10]

Experimental Validation
- OTV was systematically evaluated on various open-source models, demonstrating superior accuracy and a preference for shorter, more accurate reasoning paths compared to baseline methods [14]
- The results indicate that OTV can read both the internal reasoning state and the output quality, significantly outperforming general methods that rely solely on output text [15]

Dynamic Control of Computational Costs
- OTV enables models to dynamically control computational expense by eliminating low-quality paths in real time based on confidence scores, reducing computational load by nearly 90% while maintaining optimal accuracy (see the sketch after this summary) [17]

Future Prospects
- The OTV framework opens avenues for deeper integration with the original model and for a three-state scheme that adds an "uncertain" state, enhancing selective prediction capabilities [25][26]
- The approach could also be extended to different model architectures, optimizing KV cache structures to further improve reasoning efficiency and representation utilization [26]
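As a rough sketch of the selection step described above, a lightweight verifier head can score each parallel reasoning path from the hidden state at the inserted verification token and prune low-confidence paths before further decoding. This is not the authors' implementation: the hidden size, the random stand-in states, and the 0.5 cut-off are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TinyVerifierHead(nn.Module):
    """Lightweight scorer reading the hidden state at a special verification token."""
    def __init__(self, hidden_size: int = 1024):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, tot_hidden: torch.Tensor) -> torch.Tensor:
        # tot_hidden: (num_paths, hidden_size), one row per parallel reasoning path
        return torch.sigmoid(self.score(tot_hidden)).squeeze(-1)  # (num_paths,)

torch.manual_seed(0)
num_paths, hidden_size = 8, 1024
tot_hidden = torch.randn(num_paths, hidden_size)  # stand-in for states a real run would read from the KV cache

verifier = TinyVerifierHead(hidden_size)
confidence = verifier(tot_hidden)                 # one correctness score per path

keep_threshold = 0.5                              # illustrative cut-off
kept = torch.nonzero(confidence >= keep_threshold).flatten()
print("path confidences:", [round(c, 2) for c in confidence.detach().tolist()])
print("paths kept for further decoding:", kept.tolist())
```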
OpenAI's Roadmap Questioned; Meta Researcher: Superintelligence Simply Cannot Be Built
36Kr · 2025-06-20 12:00
Core Insights
- The pursuit of "superintelligence" represents a significant ambition among leading AI companies like Meta, OpenAI, and Google DeepMind, with substantial investments being made in this direction [1][3][4]
- Sam Altman of OpenAI suggests that building superintelligence is primarily an engineering challenge, indicating a belief in a feasible path to achieve it [3][4]
- Meta AI researcher Jack Morris argues that the current approach of using large language models (LLMs) and reinforcement learning (RL) may not be sufficient to construct superintelligence [1][2]

Group 1: Current Approaches and Challenges
- Morris outlines three potential methods for building superintelligence: purely supervised learning (SL), RL from human validators, and RL from automated validators (see the toy comparison after this summary) [2]
- The integration of non-text data into models is believed not to enhance overall performance, as human-written text carries intrinsic value that sensory inputs do not [2][6]
- The concept of a "data wall" or "token crisis" is emerging, where the availability of text data for training LLMs is becoming a concern, leading to extensive efforts to scrape and transcribe data from various sources [8][19]

Group 2: Learning Algorithms and Their Implications
- The two primary learning methods identified for potential superintelligence are SL and RL, with SL being more stable and efficient for initial training [10][22]
- The hypothesis that superintelligence could emerge from SL alone is challenged by the limitations of current models, which may not exhibit human-level general intelligence despite excelling in specific tasks [15][16]
- The combination of SL and RL is proposed as a more viable path, leveraging human feedback or automated systems to refine model outputs [20][22][28]

Group 3: Future Directions and Speculations
- The potential for RL to effectively transfer learning across various tasks remains uncertain, raising questions about the scalability of this approach to achieve superintelligence [34]
- The competitive landscape among AI companies is likely to intensify as they seek to develop the most effective training environments for LLMs, potentially leading to breakthroughs in superintelligence [34]
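The distinction between the two RL signals above is easiest to see in code. The sketch below is purely illustrative and is not Morris's formulation; the boxed-answer convention is an assumption. An automated validator yields a verifiable, programmatic reward, whereas RL from human validators would replace this check with collected or learned preference scores.

```python
import re

def automated_validator_reward(model_output: str, ground_truth: str) -> float:
    """Verifiable reward: 1.0 iff the final boxed answer matches the ground truth."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    answer = match.group(1).strip() if match else ""
    return 1.0 if answer == ground_truth.strip() else 0.0

# A human-validator setup has no programmatic check to call; the reward would
# instead come from preference labels or a reward model trained on them.
print(automated_validator_reward(r"... therefore the result is \boxed{42}", "42"))  # 1.0
print(automated_validator_reward(r"... therefore the result is \boxed{41}", "42"))  # 0.0
```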
LLM + RL Under Question: Even Deliberately Wrong Rewards Yield Significant Gains on Math Benchmarks, and the AI Community Erupts
机器之心· 2025-05-28 08:09
Core Insights
- The article discusses a recent paper that challenges conventional assumptions about reinforcement learning (RL) for training large language models (LLMs), showing that even false rewards can improve performance [3][4][5]

Group 1: Findings on Reinforcement Learning
- The study reveals that false rewards, including random and incorrect rewards, can significantly improve the performance of the Qwen2.5-Math-7B model on the MATH-500 benchmark: random rewards improve scores by 21% and incorrect rewards by 25%, compared with a 28.8% improvement from true rewards (the reward variants are sketched after this summary) [5][10]
- The research questions the traditional belief that high-quality supervision signals are essential for effective RL training, suggesting that even minimal or misleading signals can yield substantial improvements [7][19]

Group 2: Model-Specific Observations
- The effectiveness of RL with false rewards appears to be model-dependent; other models such as Llama3 and OLMo2 did not show similar performance gains when trained with false rewards [16][17]
- The Qwen model demonstrated a distinctive tendency to leverage code generation for mathematical reasoning, with a code-generation frequency of 65% before RL training that rose to over 90% afterward [28][34]

Group 3: Implications for Future Research
- The findings indicate that future RL research should verify the applicability of these methods across diverse model families rather than relying on a single model's performance [25][49]
- Understanding the reasoning patterns learned during pre-training is crucial for designing effective RL training strategies, as these patterns significantly influence downstream performance [50]
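For concreteness, the reward variants compared in the study can be sketched as follows. This is not the authors' code; the is_correct checker is a simplified placeholder. The true reward depends on answer correctness, the random reward ignores the answer entirely, and the incorrect reward pays out only for wrong answers.

```python
import random

def is_correct(answer: str, ground_truth: str) -> bool:
    # Simplified placeholder; a real pipeline would normalize and compare math answers.
    return answer.strip() == ground_truth.strip()

def true_reward(answer: str, ground_truth: str) -> float:
    return 1.0 if is_correct(answer, ground_truth) else 0.0

def random_reward(answer: str, ground_truth: str, p: float = 0.5) -> float:
    return 1.0 if random.random() < p else 0.0                    # ignores the answer

def incorrect_reward(answer: str, ground_truth: str) -> float:
    return 1.0 if not is_correct(answer, ground_truth) else 0.0   # rewards only wrong answers

print(true_reward("42", "42"), incorrect_reward("42", "42"))  # 1.0 0.0
```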