Workflow
Reinforcement Learning
ByteDance & MAP reshape the priorities of large-model reasoning algorithm optimization: reinforcement learning relies on efficient exploration to raise the ceiling of LLMs
量子位· 2025-08-07 10:13
Core Viewpoint
- The article discusses the limitations of traditional reinforcement learning (RL) frameworks in large language models (LLMs), particularly the issue of premature convergence leading to a lack of exploration and diversity in generated outputs [1][2].

Group 1: Introduction to FR3E
- The FR3E framework, inspired by the concept of "First Return, Then Explore," aims to address the exploration challenges in RL by balancing exploitation and exploration [2][4].
- This new structured exploration framework is developed by a collaborative team from ByteDance, MAP, and the University of Manchester [2][5].

Group 2: Algorithm Framework
- The FR3E algorithm consists of two phases: First Return and Entropy-Eliciting Explore [10][14].
- In the First Return phase, the model performs multiple rollouts for each prompt, exploring potential solutions and collecting trajectories and reward signals [12].
- The Entropy-Eliciting Explore phase utilizes a dynamic advantage modulation mechanism to fine-tune learning signals based on the marginal improvement in value from one state to another (see the sketch at the end of this summary) [16][18].

Group 3: Data Construction
- The team employs a mixed-difficulty strategy for data construction, using low-difficulty data for stable training and high-difficulty data to challenge the model's reasoning capabilities [23].

Group 4: Experimental Results
- The effectiveness of FR3E was evaluated across several authoritative mathematical reasoning benchmarks, including GSM8K, Math500, and others, using various model sizes [24].
- FR3E outperformed the strong baseline GRPO++ across multiple benchmarks, demonstrating superior generalization and reasoning capabilities [25][28].
- Notably, FR3E exhibited prolonged exploration behavior, with slower entropy decay and longer response lengths, successfully overcoming the "stagnation" issue seen in traditional methods [26][27].

Group 5: Conclusion
- FR3E presents an innovative and efficient structured exploration paradigm that directly addresses the core bottleneck of insufficient exploration in LLMs [28].
- The method's principles of "structured feedback + adaptive adjustment" show promising scalability and potential for future RL training in large models [29].
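For intuition, here is a minimal, hypothetical sketch of the two-phase idea summarized above: group-relative advantages from the "First Return" rollouts are rescaled by the value gains measured between intermediate anchor states. The anchor-value bookkeeping and the tanh scaling rule are illustrative assumptions, not the published FR3E code.

```python
# Hypothetical sketch of the two-phase idea described above; NOT the official FR3E code.
import numpy as np

def group_advantages(rewards):
    """'First Return': rewards from several rollouts of the same prompt,
    turned into group-relative advantages (reward minus group mean)."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards - rewards.mean()

def modulate_by_value_gain(advantages, anchor_values):
    """'Entropy-Eliciting Explore': scale each trajectory's advantage by the average
    marginal value improvement between consecutive intermediate (anchor) states."""
    out = []
    for adv, values in zip(advantages, anchor_values):
        gains = np.diff(values)                      # value(next anchor) - value(current anchor)
        scale = 1.0 + np.tanh(gains.mean()) if len(gains) else 1.0
        out.append(adv * scale)                      # boost learning signal where value rises
    return np.array(out)

# Toy usage: 4 rollouts of one prompt, each with estimated values at 3 anchor states.
adv = group_advantages([1.0, 0.0, 1.0, 0.0])
print(modulate_by_value_gain(adv, [[0.2, 0.5, 0.9], [0.3, 0.3, 0.2],
                                   [0.1, 0.6, 0.8], [0.4, 0.2, 0.1]]))
```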
Reinforcement learning + MCP = a winning combination? An open-source framework teaches AI to master tools within MCP to solve tasks, with real-world results surpassing GPT!
量子位· 2025-08-07 10:13
Core Viewpoint
- The article discusses the introduction of OpenPipe's new open-source reinforcement learning framework, MCP·RL, which allows agents to autonomously discover tools, generate tasks, and learn optimal strategies through closed-loop feedback without extensive manual configuration [2][14][23].

Group 1: MCP·RL Overview
- MCP·RL enables agents to automatically connect to an MCP Server, discover available tools, and generate training tasks based on tool information [18].
- The framework achieves state-of-the-art (SOTA) performance in two-thirds of benchmark tests, demonstrating its effectiveness [4][21].
- Unlike traditional methods that require extensive setup, MCP·RL simplifies the process by allowing the model to learn from experience without data annotation or custom MCP interfaces [23][24].

Group 2: Learning Process
- The training process of MCP·RL consists of four steps: discovering tools, generating tasks, learning how to use the tools, and testing the effectiveness of the learned strategies (a toy sketch follows at the end of this summary) [18][19].
- The framework emphasizes a "learning by doing" approach, where agents learn through practical experience rather than predefined configurations [7][14].
- The shift from humans using MCP to AI using MCP marks a significant change in how agents interact with tools [20].

Group 3: Practical Applications
- MCP·RL is designed to be applicable to any server and is ready to use out of the box, making it versatile for various applications [23].
- The Agent Reinforcement Trainer (ART) component of MCP·RL allows for real-world training and evaluation of agent strategies, enhancing reliability [24][25].
- Previous tests with ART on the Qwen 2.5-14B model showed superior performance in email retrieval tasks, achieving SOTA results [26].
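As a rough illustration of that four-step loop, the toy below wires together tool discovery, task generation, closed-loop reward, and a final inspection of what was learned. The fake server, task generator, scorer, and bandit-style update are assumptions for illustration only, not OpenPipe's MCP·RL API.

```python
# Toy, self-contained sketch of the four-step loop described above; not OpenPipe's API.
import random

def discover_tools(server):                    # 1. ask the MCP server what tools it exposes
    return list(server["tools"])

def generate_tasks(tools, n=30):               # 2. synthesize tasks from the tool descriptions
    return [random.choice(tools) for _ in range(n)]   # here a "task" is just the tool it needs

def train(server, rounds=500):
    tools = discover_tools(server)
    tasks = generate_tasks(tools)
    prefs = {t: 0.0 for t in tools}            # stand-in for the policy's parameters
    for _ in range(rounds):                    # 3. learn by doing, with closed-loop feedback
        needed = random.choice(tasks)
        chosen = max(tools, key=lambda t: prefs[t] + random.gauss(0, 1))  # noisy exploration
        reward = 1.0 if chosen == needed else 0.0        # automatic scoring, no human labels
        prefs[chosen] += 0.05 * (reward - 0.5)           # crude policy-gradient-style update
    return prefs

print(train({"tools": ["search_email", "fetch_page", "summarize"]}))  # 4. inspect what was learned
```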
Does DeepSeek's GRPO cause model collapse? A look at Qwen3's new paradigm, GSPO
机器之心· 2025-08-07 09:42
Core Viewpoint
- The article discusses the evolution of reinforcement learning techniques in the post-training phase of large language models (LLMs), highlighting the introduction of Group Sequence Policy Optimization (GSPO) as a solution to the instability issues associated with Group Relative Policy Optimization (GRPO) [2][10][31].

Group 1: Training Phases and Techniques
- The training of large language models typically consists of two phases: pre-training and post-training, where the latter focuses on improving the model's understanding and execution of human instructions [1].
- The post-training phase employs reinforcement learning, with initial methods like Reinforcement Learning from Human Feedback (RLHF) being time-consuming and costly due to reliance on human annotators [2][3].

Group 2: Innovations and Comparisons
- DeepSeek introduced an automated approach to RLHF, significantly reducing costs and improving efficiency by allowing the model to learn through reward signals rather than manual evaluations [2].
- The DeepSeek team proposed the Group Relative Policy Optimization (GRPO) algorithm, which they believe is more effective than the Proximal Policy Optimization (PPO) used by OpenAI in ChatGPT [3][5].

Group 3: Issues with GRPO
- The Qwen team identified serious stability issues with GRPO, particularly due to its reliance on token-level importance sampling, which can lead to high variance and training instability [10][11][12].
- The instability stems from applying importance sampling weights at the token level, where high variance can accumulate over long sequences and exacerbate the training challenges [15][16][17].

Group 4: Introduction of GSPO
- To address the issues with GRPO, the Qwen team proposed Group Sequence Policy Optimization (GSPO), which uses sequence-level importance sampling to enhance training stability (a minimal code sketch contrasting the two ratios follows at the end of this summary) [10][22][31].
- GSPO's design mitigates the accumulation of variance seen in token-level sampling, leading to improved training efficiency and stability [23][24].

Group 5: Experimental Evidence and Advantages
- Experimental results demonstrated that GSPO outperformed GRPO in various tasks, showcasing better scalability and efficiency in training [20][30].
- The Qwen team highlighted that GSPO simplifies the training of Mixture-of-Experts (MoE) models by eliminating the need for auxiliary strategies such as Routing Replay, which were necessary for GRPO to achieve stable convergence [25][27][30].
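The difference between the two importance-sampling schemes fits in a few lines. The sketch below assumes per-token log-probabilities under the old and new policies are available; GSPO is shown with a single length-normalized sequence-level ratio, as commonly described, while the shapes and test values are illustrative rather than the Qwen implementation.

```python
# Minimal sketch contrasting GRPO's token-level ratios with GSPO's sequence-level ratio.
import torch

def grpo_token_ratios(logp_new, logp_old):
    """GRPO: one importance weight per token; variance can accumulate over long sequences."""
    return (logp_new - logp_old).exp()                      # shape [seq_len]

def gspo_sequence_ratio(logp_new, logp_old):
    """GSPO: a single length-normalized importance ratio for the whole sequence."""
    return (logp_new - logp_old).mean().exp()               # scalar

logp_old = torch.randn(512) - 2.0                           # fake per-token log-probs
logp_new = logp_old + 0.05 * torch.randn(512)
print(grpo_token_ratios(logp_new, logp_old).std())          # per-token weights spread out
print(gspo_sequence_ratio(logp_new, logp_old))              # one stable sequence-level weight
```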
The Embodied Intelligence Heart technology exchange group has been established!
具身智能之心· 2025-08-07 02:38
Group 1
- The newly established Embodied Intelligence Heart Technology Exchange Group covers a range of advanced topics, including VLA, VLN, teleoperation, Diffusion Policy, reinforcement learning, VLA+RL, sim2real, multimodal large models, simulation, motion control, target navigation, mapping and localization, and navigation [1].
- Interested individuals can add the assistant's WeChat, AIDriver005, to join the community [2].
- To expedite the joining process, it is recommended to include a note with your institution/school, name, and research direction [3].
Success rate up 57%, the latest in VLA+RL! CO-RFT: efficient fine-tuning of VLA models (Beihang University, Tsinghua University, et al.)
具身智能之心· 2025-08-07 00:03
Core Insights
- The article discusses the development of a new reinforcement learning framework called Chunked RL, specifically designed for fine-tuning Vision-Language-Action (VLA) models, which show great potential in real-world robotic control [4][8].
- The proposed CO-RFT algorithm demonstrates significant improvements over traditional supervised fine-tuning methods, achieving a 57% increase in success rate and a 22.3% reduction in cycle time in real-world environments [4][29].

Section Summaries

Introduction
- VLA models integrate perception and language understanding for embodied control, showing promise in developing general strategies for real-world robotic control [6].
- The challenges faced in fine-tuning VLA models primarily stem from the dependency on the quality and quantity of task-specific data, which limits generalization to out-of-distribution (OOD) scenarios [6][7].

Methodology
- The article introduces Chunked RL, a novel reinforcement learning framework that incorporates action chunking to enhance sample efficiency and stability, particularly suited for VLA models (a sketch of the idea follows at the end of this summary) [8][12].
- The CO-RFT algorithm consists of two phases: imitation learning for initializing the backbone network and policy, followed by offline RL with action chunking to optimize the pre-trained policy [16][18].

Experimental Analysis
- The experiments were conducted on a robotic platform with six dexterous manipulation tasks, evaluating the performance of the CO-RFT algorithm against traditional methods [20][23].
- Results indicate that CO-RFT significantly outperforms supervised fine-tuning (SFT), achieving a 57% increase in success rate and a 22.3% decrease in average cycle time across various tasks [29][30].

Position Generalization
- CO-RFT exhibits strong position generalization capabilities, achieving a 44.3% success rate in previously unseen locations, outperforming SFT by 38% in OOD scenarios [4][29].

Importance of Data Diversity
- Data diversity plays a crucial role in the performance of CO-RFT, with models trained on diverse datasets showing significantly better generalization capabilities compared to those trained on fixed datasets [32][33].
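To make the action-chunking ingredient concrete, here is a minimal sketch, assuming a chunk of K actions per decision: the policy emits a whole chunk, the critic scores (observation, chunk) pairs, and the TD backup spans K environment steps at once. Network shapes and the bootstrapping rule are illustrative assumptions, not the CO-RFT implementation.

```python
# Minimal sketch of chunk-level RL; shapes and targets are illustrative, not CO-RFT code.
import torch
import torch.nn as nn

K, STATE_DIM, ACT_DIM = 8, 32, 7       # chunk length, observation size, action size

class ChunkPolicy(nn.Module):
    """Maps one observation to a chunk of K future actions (as in imitation-learned VLA heads)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, K * ACT_DIM))
    def forward(self, obs):
        return self.net(obs).view(-1, K, ACT_DIM)

class ChunkCritic(nn.Module):
    """Scores an (observation, action-chunk) pair, so one backup spans K real steps."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + K * ACT_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, 1))
    def forward(self, obs, chunk):
        return self.net(torch.cat([obs, chunk.flatten(1)], dim=-1)).squeeze(-1)

def chunk_td_target(rewards, next_obs, next_chunk, critic, gamma=0.99):
    """K-step discounted return inside the chunk, bootstrapped at the next chunk boundary."""
    discounts = gamma ** torch.arange(rewards.shape[1], dtype=rewards.dtype)
    return (rewards * discounts).sum(dim=1) + (gamma ** rewards.shape[1]) * critic(next_obs, next_chunk)

# Shape check: batch of 4 transitions, reusing obs/chunk as stand-ins for the next step.
obs = torch.randn(4, STATE_DIM)
policy, critic = ChunkPolicy(), ChunkCritic()
print(chunk_td_target(torch.rand(4, K), obs, policy(obs), critic).shape)   # torch.Size([4])
```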
Embodied Intelligence Heart is recruiting research mentors! Academic experts, take a look~
具身智能之心· 2025-08-06 08:30
Embodied Intelligence Heart is recruiting research mentors! If you work in embodied intelligence and have multiple top-conference or top-journal publications, we welcome you to help drive the academic community forward with us.

Research Directions
Including but not limited to: VLA, VLN, teleoperation, Diffusion Policy, reinforcement learning, VLA+RL, sim2real, multimodal large models, simulation, motion control, and target navigation, among other directions.

Requirements
A doctoral degree or above (current students included), with at least two publications at A-tier conferences or in Q1-or-above journals/conferences; mentoring experience is preferred.

Compensation
Shared industry resources, paper authorship, and cash incentives! For more details, contact the assistant on WeChat: oooops-life.

...
The next leap for large models? OpenAI's "new breakthrough": the Universal Validator
硬AI· 2025-08-05 16:02
Core Viewpoint
- The introduction of the "Universal Validator" technology in GPT-5 is seen as a potential "secret weapon" for OpenAI to gain a competitive edge in the AI market [2][3].

Group 1: Technology Overview
- The "Universal Validator" employs a "prover-verifier game" mechanism, where one AI model acts as a verifier to assess the answers generated by another, prover model, enhancing output quality through internal competition (a toy sketch follows at the end of this summary) [3][4].
- This technology aims to address the challenges of verifying answers in subjective fields like creative writing and complex mathematical proofs, which have been difficult for reinforcement learning methods [3][6].
- The framework includes roles such as a reliable prover, a deceptive prover, and a small verifier, which work together to improve the model's ability to distinguish between correct and incorrect solutions [6][7].

Group 2: Historical Context
- The technology is considered a legacy of OpenAI's former "Superalignment" team, which was focused on controlling future superintelligent AI, although the team was disbanded after key members left [10].
- Despite the team's dissolution, the technology has been integrated into OpenAI's core product development, addressing alignment and reliability issues in current models [10].

Group 3: Market Implications
- The advancements brought by the "Universal Validator" are directly linked to the anticipated performance of GPT-5, with expectations heightened by statements from OpenAI's CEO regarding the model's superior capabilities [11].
- Competitors like xAI and Google are also investing heavily in reinforcement learning, making the "Universal Validator" a crucial asset for OpenAI to maintain its lead in the intensifying AI race [11].

Group 4: Challenges and Opportunities
- The "Universal Validator" is noted for its versatility, improving model performance in both easily verifiable tasks and more subjective areas, indicating a shift in AI capabilities [14].
- However, the development of GPT-5 faces significant challenges, including a scarcity of high-quality training data and diminishing returns from large-scale pre-training, which could impact the model's expected breakthroughs [14].
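A toy, self-contained sketch of the prover-verifier game is shown below: solutions are reduced to 2-D feature vectors, the "provers" are fixed random generators, and only the small verifier is trained. It illustrates the adversarial structure only and is not OpenAI's training setup.

```python
# Toy sketch of the prover-verifier game; feature vectors and provers are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(2)                                  # tiny logistic-regression verifier

def verifier(x):                                 # P(solution is correct) according to the verifier
    return 1.0 / (1.0 + np.exp(-x @ w))

def helpful_prover():                            # correct solutions cluster around (+1, +1)
    return rng.normal([+1.0, +1.0], 0.5), 1.0

def sneaky_prover():                             # wrong-but-persuasive solutions around (-1, +1)
    return rng.normal([-1.0, +1.0], 0.5), 0.0

for step in range(2000):                         # adversarial loop
    x, label = helpful_prover() if step % 2 == 0 else sneaky_prover()
    w += 0.05 * (label - verifier(x)) * x        # verifier learns to accept helpful, reject sneaky
    # (In the full game the provers are also RL-trained against the verifier's reward;
    #  here they stay fixed to keep the example short.)

print("accept rate, helpful:", np.mean([verifier(helpful_prover()[0]) for _ in range(200)]))
print("accept rate, sneaky: ", np.mean([verifier(sneaky_prover()[0]) for _ in range(200)]))
```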
OpenAI's "new breakthrough": the Universal Validator
Hu Xiu· 2025-08-05 07:04
Core Insights
- OpenAI's "Universal Validator" technology is expected to enhance the market competitiveness of the upcoming GPT-5 model, addressing key challenges in AI commercialization, particularly in terms of reliability and credibility [2][12].

Group 1: Technology Overview
- The "Universal Validator" operates through a "prover-verifier game," where one AI model acts as a verifier to assess the outputs of another model, systematically improving output quality through internal feedback [2][4].
- This technology is designed to overcome the limitations of reinforcement learning (RL) in subjective areas like creative writing and complex mathematical proofs [2][13].
- The mechanism is likened to Generative Adversarial Networks (GANs), where a discriminator learns to distinguish real data from AI-generated data, pushing the generator to improve [5].

Group 2: Development and Team Dynamics
- The technology is considered a legacy of OpenAI's former "Superalignment" team, which was focused on controlling future superintelligence but was disbanded after key members left [9][10].
- Despite the dissolution of the team, the technological advancements have been integrated into OpenAI's core product development, addressing alignment and reliability issues [11].

Group 3: Market Expectations and Competitive Landscape
- There is heightened anticipation for GPT-5, with indications that a self-critique system trialed in GPT-4 has been officially incorporated into GPT-5, raising expectations for its performance [12].
- OpenAI's CEO, Sam Altman, has publicly endorsed GPT-5, claiming it surpasses previous models in intelligence, intensifying market interest [12].
- Competitors like xAI and Google are also investing heavily in reinforcement learning as a key technology path, making the competitive landscape increasingly intense [12].

Group 4: Challenges Ahead
- The "Universal Validator" is noted for its versatility, aiding OpenAI models in both easily verifiable tasks and more subjective domains, indicating a shift in AI capabilities [13].
- However, the development of GPT-5 faces significant challenges, including a scarcity of high-quality training data and diminishing returns from large-scale pre-training [13].
- Performance degradation from internal testing to public deployment remains a concern, as evidenced by the drop in performance of the "o3" model in real-world applications [13].
A professor from Tsinghua's Institute for Interdisciplinary Information Sciences walks you through writing reinforcement learning code
机器之心· 2025-08-05 04:09
Core Insights
- The article introduces AReaL-lite, a reinforcement learning training framework designed for algorithm developers: users can implement a variety of RL training algorithms and custom agent workflows by modifying a single file, while fully asynchronous RL delivers strong model performance [1][10].

Group 1: Event Details
- The sharing session will feature Professor Wu Yi from Tsinghua University's Institute for Interdisciplinary Information Sciences and core members of the AReaL team, who will teach RL through a multi-turn math reasoning example [2][10].
- The live session is scheduled for August 7, 19:30-20:30 Beijing time; participants are encouraged to prepare a GPU server, preferably with four GPUs [8][10].

Group 2: AReaL-lite Features
- AReaL-lite's key characteristics include:
  - Fully asynchronous RL for rapid training (a minimal sketch follows at the end of this summary) [10].
  - An ecosystem-friendly design, compatible with various open-source ecosystems [10].
  - An algorithm-first approach, ensuring that even complex algorithms require only minimal file modifications [10].

Group 3: Team Introduction
- The team includes:
  - Wu Yi, Assistant Professor at Tsinghua University and Chief Scientist of the AReaL team [10].
  - Fu Wei, a PhD student at Tsinghua University and core member of the AReaL project [10].
  - Mei Zhiyu, a researcher at Ant Group's reinforcement learning lab and a PhD graduate of Tsinghua University [10].
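The "fully async" idea can be sketched in a few lines: rollout generation and training run concurrently and exchange trajectories through a queue, so neither side waits for the other. The worker count and the toy generate/train bodies below are illustrative assumptions, not the AReaL-lite API.

```python
# Minimal sketch of fully asynchronous RL: decoupled rollout workers and trainer.
import queue, threading, time

buffer = queue.Queue(maxsize=256)

def rollout_worker(worker_id):
    while True:
        trajectory = {"worker": worker_id, "reward": 1.0}   # stand-in for LLM generation
        time.sleep(0.01)                                    # generation latency
        buffer.put(trajectory)                              # hand off without blocking training

def trainer(steps=100, batch_size=8):
    for step in range(steps):
        batch = [buffer.get() for _ in range(batch_size)]   # consume whatever is ready
        # ...compute advantages and take a policy-gradient step on `batch` here...
        if step % 20 == 0:
            print(f"step {step}: trained on {len(batch)} trajectories")

for i in range(4):                                          # several generation workers
    threading.Thread(target=rollout_worker, args=(i,), daemon=True).start()
trainer()
```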
Altman: ChatGPT was just an accident, the general-purpose AI agent is the real goal; Karpathy: I thought of this seven years ago
36Ke· 2025-08-04 09:37
Core Insights
- The article highlights the evolution of OpenAI's MathGen team, which has been pivotal in enhancing AI's mathematical reasoning capabilities, leading to significant advancements in AI agents [2][6][9].
- OpenAI's CEO, Altman, emphasizes the transformative potential of AI agents, which are designed to autonomously complete tasks assigned by users, marking a strategic shift in AI development [11][28].
- The competition for top talent in AI has intensified, with major companies like Meta aggressively recruiting from OpenAI, indicating a fierce race in the AI sector [13][15][36].

Group 1: Development of AI Capabilities
- The MathGen team, initially overlooked, is now recognized as a key contributor to OpenAI's success in the AI industry, particularly in mathematical reasoning [2][4].
- OpenAI's recent breakthroughs in AI reasoning have led to its model winning a gold medal at the International Mathematical Olympiad (IMO), showcasing its advanced capabilities [6][20].
- The integration of reinforcement learning and innovative techniques has significantly improved AI's problem-solving abilities, allowing it to tackle complex tasks more effectively [17][21][25].

Group 2: Strategic Vision and Market Position
- OpenAI's long-term vision is to create a general AI agent capable of performing a wide range of tasks, which is seen as the culmination of years of strategic planning [8][9][11].
- The upcoming release of the GPT-5 model is expected to further solidify OpenAI's leadership in the AI agent space, with ambitions to create an intuitive assistant that understands user intent [35][39].
- The competitive landscape is becoming increasingly crowded, with various companies vying for dominance in AI technology, raising questions about OpenAI's ability to maintain its edge [36][38].