Reinforcement Learning
Both RLHF and RLVR: Chen Danqi's team's latest work extends reasoning capability to general intelligence
机器之心· 2025-09-28 04:50
Core Insights
- The article introduces Reinforcement Learning with Model Thinking (RLMT), a method that integrates explicit reasoning into general chat models and improves their performance on open-ended tasks [5][7][26].

Summary by Sections

Introduction
- The article highlights the recent academic contributions of Chen Danqi's team at Princeton University, whose RLMT method aims to bridge the gap between specialized reasoning capabilities and general conversational abilities in AI [2][5].

Methodology
- RLMT combines aspects of Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) to optimize language models for open-ended tasks [6][11] (a toy sketch of this loop follows the summary).
- Training proceeds along two routes: a supervised fine-tuning (SFT) warm start that teaches the desired thinking format, and a "zero" setting that applies RLMT directly to base models without any prior training [12][14].

Results
- Models trained with RLMT outperformed non-thinking baseline models on open-ended reasoning tasks, particularly on chat and creative-writing benchmarks [18][26].
- Comparative results show that RLMT models surpassed other models, including GPT-4o and Claude-3.7-Sonnet, on various chat benchmarks [19][20].

Conclusion
- RLMT extends the advantages of explicit reasoning from specialized domains to general conversational AI, suggesting it could reshape how language models are trained [26][29].
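As summarized above, RLMT has the chat model produce an explicit thinking trace before its reply and optimizes it with an RLHF-style reward signal on open-ended prompts. The sketch below is a minimal, hypothetical rendering of that sample-and-score loop: `generate_with_thinking`, `reward_model`, and the `<think>` tag convention are placeholder assumptions rather than the paper's actual code or API, and the collected rollouts would feed a policy-gradient optimizer such as PPO or GRPO.

```python
import re
import random
from typing import List, Tuple

def generate_with_thinking(prompt: str) -> str:
    """Placeholder policy: a real RLMT setup would sample from the chat model here."""
    return "<think>outline the answer, weigh alternatives</think>final response to: " + prompt

def split_thinking(output: str) -> Tuple[str, str]:
    """Separate the hidden reasoning trace from the user-visible reply."""
    match = re.match(r"<think>(.*?)</think>(.*)", output, flags=re.S)
    return (match.group(1), match.group(2)) if match else ("", output)

def reward_model(prompt: str, response: str) -> float:
    """Placeholder preference reward model (RLHF-style scalar score on the reply only)."""
    return random.random()

def rlmt_step(prompts: List[str], group_size: int = 4) -> List[Tuple[str, str, float]]:
    """One on-policy step: sample thinking + reply, score the reply, return data for the RL update."""
    rollouts = []
    for prompt in prompts:
        for _ in range(group_size):
            output = generate_with_thinking(prompt)
            _thinking, reply = split_thinking(output)
            score = reward_model(prompt, reply)  # reward looks only at the final reply
            rollouts.append((prompt, output, score))
    return rollouts  # fed to a policy-gradient optimizer (e.g. PPO/GRPO) in the full pipeline

if __name__ == "__main__":
    for prompt, output, score in rlmt_step(["Plan a weekend trip to Kyoto."]):
        print(round(score, 3), output[:60])
```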
Why hasn't reinforcement learning landed well in autonomous driving?
自动驾驶之心· 2025-09-28 03:50
Core Viewpoint
- The article discusses the challenges of implementing reinforcement learning (RL) in autonomous driving, focusing on reward hacking and the trade-off between safety and efficiency [2][3].

Group 1: Challenges in Reinforcement Learning for Autonomous Driving
- Reinforcement learning suffers from reward hacking, and raising safety requirements tends to reduce efficiency, and vice versa [2].
- Designing a balanced reward system that improves overall performance is complex, because achieving equilibrium among multiple competing reward terms is hard [2] (see the reward sketch after this summary).
- Applying RL to autonomous driving is further complicated by the need to obey many driving rules throughout a trip, unlike embodied-intelligence settings that focus mainly on local motion [2].

Group 2: Need for a Suitable Framework
- A key prerequisite for deploying RL in autonomous driving is a robust architecture that integrates cleanly with RL [3].
- Existing autonomous-driving models are unlikely to work with RL directly without significant modification [3].

Group 3: Community and Resources
- The "Autonomous Driving Knowledge Planet" community, with over 4,000 members, aims to provide a comprehensive platform for technical exchange and learning in autonomous driving [6][10].
- It offers learning roadmaps, technical discussions, and access to industry experts for both beginners and advanced practitioners [6][10].
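To make the reward-balancing difficulty above concrete, here is a toy, hypothetical per-step driving reward in which safety, efficiency, and rule-compliance terms compete; the field names and weights are illustrative assumptions, not part of any production autonomous-driving stack.

```python
from dataclasses import dataclass

@dataclass
class StepInfo:
    collided: bool          # any contact with another agent or obstacle
    min_gap_m: float        # closest distance to surrounding traffic
    speed_mps: float        # current ego speed
    progress_m: float       # distance advanced along the route this step
    rule_violations: int    # e.g. ran a red light, crossed a solid line

def driving_reward(info: StepInfo,
                   w_safety: float = 1.0,
                   w_efficiency: float = 1.0,
                   w_rules: float = 1.0) -> float:
    """Toy per-step reward: safety, efficiency, and rule terms pull in different directions."""
    safety = -10.0 if info.collided else min(info.min_gap_m, 2.0) / 2.0
    efficiency = info.progress_m + 0.1 * info.speed_mps
    rules = -2.0 * info.rule_violations
    return w_safety * safety + w_efficiency * efficiency + w_rules * rules

# Raising w_safety makes the gap term dominate, so the agent hangs back and progress drops --
# the safety-vs-efficiency trade-off described above. A policy can also reward-hack, e.g. by
# crawling along to avoid any violation penalty while technically still collecting progress reward.
print(driving_reward(StepInfo(False, 1.5, 8.0, 4.0, 0), w_safety=3.0))
```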
NeurIPS 2025 | The SURDS dataset and GRPO comprehensively strengthen spatial reasoning for autonomous driving
自动驾驶之心· 2025-09-27 23:33
Source: 深蓝AI (author: 深蓝学院), a learning platform focused on artificial intelligence, robotics, and autonomous driving. This article is shared for academic purposes.

Abstract
Amid the rapid progress of large models, getting multimodal large language models (VLMs) to perform accurate spatial reasoning over autonomous-driving scene images remains a major challenge in AI. Academia has long lacked a large-scale benchmark for spatial reasoning in driving scenes, and existing methods often rely on external expert models, making it hard to assess model capability comprehensively. In sharp contrast, humans can easily judge an object's orientation in an image or reason about the relative positions of multiple objects from prior knowledge; VLMs possess similarly rich knowledge yet still underperform on such tasks. To address this, Wuhan University, together with the Institute of Automation of the Chinese Academy of Sciences, the Beijing Academy of Artificial Intelligence (BAAI), and other institutions, released SURDS, the first large-scale benchmark for VLM spatial reasoning in driving scenes. It systematically evaluates general-purpose models, including the GPT series, as well as spatial-reasoning models such as SpatialRGPT, comprehensively exposing the shortcomings of current VLMs in spatial understanding. The research team, by designing "perception accuracy" and " ...
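The article's title pairs SURDS with GRPO for strengthening spatial reasoning. For readers unfamiliar with the algorithm, the snippet below sketches GRPO's core idea of group-relative advantages, which removes the need for a learned critic; the reward values stand in for the kind of "perception accuracy" scores mentioned above and are purely illustrative.

```python
from statistics import mean, pstdev
from typing import List

def grpo_advantages(group_rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO: normalize each sampled answer's reward against its own group (same question),
    so no separate value function / critic is needed."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# Hypothetical rewards for 4 sampled answers to one spatial-reasoning question,
# e.g. a perception-accuracy score in [0, 1]:
rewards = [0.9, 0.2, 0.6, 0.2]
print([round(a, 2) for a in grpo_advantages(rewards)])
# Answers above the group mean receive positive advantage and are reinforced.
```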
SOTA even with scarce data? Tsinghua & Shanghai AI Lab crack two major bottlenecks in robot RL
具身智能之心· 2025-09-27 01:33
Core Insights
- The article discusses SimpleVLA-RL, a new framework designed to enhance the training and generalization capabilities of Visual-Language-Action (VLA) models in robotics, addressing key limitations of existing training paradigms [4][14].

Group 1: Key Contributions of SimpleVLA-RL
- SimpleVLA-RL addresses three major bottlenecks in VLA model training: high data-collection costs, insufficient generalization ability, and the need for large-scale demonstration data [6][11].
- The framework demonstrates state-of-the-art (SoTA) performance on standard benchmarks such as LIBERO and RoboTwin, with significant improvements in success rate even under limited data [6][21].
- With only a single demonstration per task, the average success rate of OpenVLA-OFT on LIBERO increased from 48.9% to 96.9%, and on long-sequence tasks from 17.3% to 91.7% [6][21].

Group 2: Training Mechanism and Innovations
- The training mechanism combines interactive trajectory sampling, result reward modeling, and exploration enhancement, which together improve data efficiency and model performance [15][16][17] (a sketch of the outcome reward follows this summary).
- The result reward model reduces the reward to a binary outcome (success or failure), keeping training focused on the task goal and avoiding the complexity of process rewards [16][21].
- The exploration-enhancement strategy encourages diverse exploration during training, preventing the model from converging to narrow solutions [17][19].

Group 3: Performance Metrics and Benchmark Results
- SimpleVLA-RL achieved an average success rate of 99.1% on the LIBERO benchmark, with long-sequence tasks improving by 12.0 percentage points [23].
- On RoboTwin1.0, the average success rate improved from 39.8% to 70.4%, with notable gains on specific tasks such as "Blocks Stack," which rose by 33.1 percentage points [25].
- The framework also showed significant improvements on RoboTwin2.0, with average success rates rising from 38.3% to 68.8%, surpassing previous models [27].

Group 4: Real-World Application and Generalization
- A model trained solely on simulation data adapted better to real-world tasks, with average success rates increasing from 17.5% to 38.5% in practical applications [30].
- The emergence of the "Pushcut" phenomenon indicates the model can autonomously discover new strategies beyond human demonstrations, showcasing its potential for adaptive learning [32][34].
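The "result reward" described above collapses a whole rollout to a single success/failure signal. The sketch below illustrates one way such an outcome reward could be attached to a sampled trajectory; the Gym-style environment interface, the `is_success` flag, and the function names are assumptions for illustration, not the SimpleVLA-RL implementation.

```python
from typing import Callable, List, Tuple

Action = List[float]  # e.g. end-effector deltas or joint targets

def rollout_with_outcome_reward(
    policy: Callable[[object], Action],
    env,                 # assumed Gym-style env with reset()/step() and an "is_success" info flag
    max_steps: int = 200,
) -> Tuple[List[Tuple[object, Action]], float]:
    """Collect one trajectory and attach a binary reward: 1.0 if the task succeeded, else 0.0."""
    obs, _ = env.reset()
    trajectory = []
    success = False
    for _ in range(max_steps):
        action = policy(obs)
        obs, _, terminated, truncated, info = env.step(action)
        trajectory.append((obs, action))
        if info.get("is_success", False):
            success = True
        if terminated or truncated:
            break
    return trajectory, 1.0 if success else 0.0  # outcome reward only, no per-step shaping
```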
OpenAI's two chiefs reveal a lot in their latest interview: the ultimate goal is an "automated researcher", and hiring isn't about finding the most high-profile people
36Kr· 2025-09-26 12:15
Core Insights
- OpenAI's leadership discussed the advances and future direction of GPT-5, emphasizing its role in bringing reasoning capabilities and agentic behavior into the mainstream [6][7][9].
- The company aims to develop an automated researcher that can discover new ideas and contribute to scientific progress [13][25].
- OpenAI's research philosophy prioritizes foundational research over short-term product competition, focusing on long-term goals [25][28].

Group 1: GPT-5 and Reasoning
- GPT-5 represents a strategic shift toward integrating reasoning capabilities into mainstream applications, moving beyond previous models that focused on immediate responses [6][7].
- The evaluation metrics used in the past are nearing saturation, prompting OpenAI to seek new ways to assess models by their ability to discover new information and make practical progress in economically relevant areas [8][9].

Group 2: Automated Researcher Goal
- OpenAI's long-term objective is an automated researcher capable of independently generating new ideas, starting with automating internal research before expanding to other scientific fields [13][25].
- Current reasoning capabilities already let models perform complex tasks in a significantly reduced timeframe, with ongoing work to extend this capability [13][14].

Group 3: Reinforcement Learning (RL)
- OpenAI's reinforcement learning approach remains robust, with ongoing work expected to simplify reward models and bring them closer to how humans learn [16][17].
- The company emphasizes the need for flexibility in understanding RL, since its tools and methodologies continue to evolve rapidly [17].

Group 4: Programming and Coding
- GPT-5-codex aims to optimize programming tasks, addressing previous inefficiencies in how models allocated time to problem-solving [18][19].
- Coding practice is shifting toward "vibe coding," in which intuition plays a significant role, reflecting a generational change in how programming is approached [21][22].

Group 5: Talent Acquisition and Research Culture
- OpenAI seeks individuals with perseverance and a solid technical foundation, rather than social-media prominence or flashy accomplishments [22][24].
- The company fosters a culture that values foundational research and encourages researchers to pursue significant long-term questions without being distracted by immediate market pressure [25][28].

Group 6: Resource Allocation
- Asked how additional resources would be spent, OpenAI's leadership said they would go toward computational power, highlighting its critical role in research and development [26][27].
- The company acknowledges that computational limits remain a persistent constraint shaping the balance between product development and research initiatives [27][28].
OpenAI's two chiefs reveal a lot in their latest interview! The ultimate goal is an "automated researcher", and hiring isn't about finding the most high-profile people
量子位· 2025-09-26 04:56
Core Insights
- OpenAI's latest interview reveals significant progress on GPT-5, focused on long-horizon reasoning and bringing agentic behavior into mainstream applications [1][7][9].
- The company emphasizes protecting foundational research while avoiding distraction from short-term product competition [6][48].

Group 1: GPT-5 Developments
- GPT-5 aims to mainstream reasoning capabilities, moving beyond previous models that focused on immediate responses [8][10].
- The model represents a strategic shift toward stronger reasoning and agentic behavior, making these capabilities more accessible to users [9][10].

Group 2: Evaluation and Progress
- Current evaluation metrics are nearing saturation, necessitating new methods for assessing a model's ability to discover new insights and make practical progress in economically relevant areas [12][13].
- OpenAI plans to track the time span over which models can reason and make progress, with current capabilities reaching approximately 1 to 5 hours [23][25].

Group 3: Automation and Research Goals
- OpenAI's long-term goal is an automated researcher capable of discovering new ideas, starting with automating internal research [20][21].
- The company is interested in measuring the duration of autonomous operation as a key evaluation metric [25].

Group 4: Reinforcement Learning (RL)
- Despite skepticism, reinforcement learning continues to thrive, with OpenAI exploring new directions and ideas [27][29].
- The evolution of reward models is expected to accelerate, simplifying the process of building effective fine-tuning datasets [29][30].

Group 5: Programming and Coding
- GPT-5-codex is designed to optimize programming tasks, addressing previous models' inefficient allocation of problem-solving time [32][34].
- The current state of coding tools is likened to the "uncanny valley": effective, but not yet fully comparable to human performance [37][41].

Group 6: Talent Acquisition and Research Culture
- OpenAI prioritizes persistence and the ability to learn from failure in its research culture, seeking individuals with a solid technical foundation [44][46].
- The company focuses on foundational research rather than merely following competitors, fostering an innovative environment [46][48].

Group 7: Resource Allocation
- Given additional resources, OpenAI would prioritize computational power, recognizing its critical role in research and development [49][51].
- The company maintains a long-term research focus, emphasizing the importance of computational resources and physical constraints for future advances [52].
SOTA even with scarce data? Tsinghua & Shanghai AI Lab crack two major bottlenecks in robot RL
量子位· 2025-09-26 02:08
Core Viewpoint
- The article discusses SimpleVLA-RL, an end-to-end online training solution for Visual-Language-Action (VLA) models, aimed at enhancing the flexibility and performance of robots in complex environments while addressing existing training bottlenecks [3][12].

Group 1: Key Challenges in Existing Training Paradigms
- Current training paradigms face significant challenges, including high data-collection costs and insufficient generalization capability [2][8].
- Reliance on large-scale, high-quality robot operation trajectories limits scalability and increases costs, making data acquisition a major hurdle [8].
- Models struggle to generalize, particularly on out-of-distribution tasks and in new environments, leading to performance drops on long-sequence dependencies and combinatorial tasks [8][9].

Group 2: SimpleVLA-RL Framework
- SimpleVLA-RL employs a combination of interactive trajectory sampling, result-based rewards, and enhanced exploration to tackle the three core challenges of VLA model training [5][6] (a generic sketch of this exploration knob follows this summary).
- The framework demonstrates state-of-the-art (SoTA) performance on standard benchmarks such as LIBERO and RoboTwin, achieving significant improvements even with limited data [5][21].
- With single demonstration data, the average success rate on LIBERO increased from 48.9% to 96.9% after applying SimpleVLA-RL [5].

Group 3: Performance Metrics and Results
- SimpleVLA-RL achieved an average success rate of 99.1% on LIBERO, with long-sequence tasks improving by 12.0 percentage points [21].
- On RoboTwin1.0, the average success rate rose from 39.8% to 70.4%, with specific tasks such as "Blocks Stack" improving by 33.1 percentage points [23].
- The framework also demonstrated a significant increase in performance on RoboTwin2.0, with average success rates improving from 38.3% to 68.8% [25].

Group 4: Innovations and Discoveries
- Training gave rise to new operational strategies, such as the "Pushcut" phenomenon, in which the model autonomously discovers more efficient methods beyond the human demonstrations [10][31].
- This phenomenon indicates that reinforcement learning can enable VLA models to surpass the limitations of human demonstration patterns, paving the way for future adaptive VLA model development [31].
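One common reading of the "enhanced exploration" ingredient listed above is widening the sampling distribution during rollout so the policy tries more varied action or token sequences. The snippet below is a generic illustration of temperature-scaled sampling offered under that assumption; it is not taken from the SimpleVLA-RL code.

```python
import math
import random
from typing import List

def sample_with_temperature(logits: List[float], temperature: float = 1.0) -> int:
    """Sample an index from logits; temperature > 1 flattens the distribution (more exploration),
    temperature < 1 sharpens it (more exploitation)."""
    scaled = [l / max(temperature, 1e-6) for l in logits]
    m = max(scaled)
    probs = [math.exp(l - m) for l in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1

logits = [2.0, 1.0, 0.5, 0.1]
print("low temperature (exploitative):", [sample_with_temperature(logits, 0.5) for _ in range(10)])
print("high temperature (exploratory):", [sample_with_temperature(logits, 1.5) for _ in range(10)])
```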
Discussing the evolution of RL infra architecture through today's mainstream RL libraries
自动驾驶之心· 2025-09-25 23:33
Core Viewpoint
- Reinforcement Learning (RL) is transitioning from a supportive technology to a core driver of model capability, focusing on multi-step, interactive agent training as a route toward Artificial General Intelligence (AGI) [2][6].

Group 1: Modern RL Infrastructure Architecture
- The core components of modern RL infrastructure are a Generator, which interacts with the environment to produce trajectories and compute rewards, and a Trainer, which updates model parameters from that trajectory data [6][4] (a minimal Ray sketch of this split follows this summary).
- The generator-trainer architecture, combined with a distributed coordination layer such as Ray, forms the "gold standard" for RL systems [6][4].

Group 2: Primary Development
- Primary-development frameworks serve as the foundational frameworks for building RL training pipelines, providing core algorithm implementations and integration with the underlying training and inference engines [8][7].
- TRL (Transformer Reinforcement Learning) is a user-friendly RL framework from Hugging Face with support for a range of algorithms [9][10].
- OpenRLHF, developed by a collaborative team including ByteDance and NetEase, aims to provide an efficient and scalable RLHF and agentic RL framework [11][14].
- veRL, developed by ByteDance's Seed team, is one of the most comprehensive frameworks, with extensive algorithm support [16][19].
- AReaL (Asynchronous Reinforcement Learning) is designed for large-scale, high-throughput RL training with a fully asynchronous architecture [20][21].
- NeMo-RL, launched by NVIDIA, integrates into its extensive NeMo ecosystem and focuses on production-grade RL [24][28].
- ROLL, an Alibaba open-source framework, emphasizes asynchronous and agentic capabilities for large-scale LLM RL [30][33].
- slime, developed by Tsinghua and Zhipu, is a lightweight framework focused on seamless integration of SGLang with Megatron [34][36].

Group 3: Secondary Development
- Secondary-development frameworks are built on top of primary frameworks and target specific downstream scenarios such as multimodal reasoning, multi-agent systems, and GUI automation [44][3].
- Agentic RL frameworks such as verl-agent optimize asynchronous rollout and training, addressing the core challenge of multi-round interaction with external environments [46][47].
- Multimodal RL frameworks such as VLM-R1 and EasyR1 focus on training visual-language reasoning models, addressing data-processing and loss-function design challenges [53][54].
- Multi-agent RL frameworks such as MARTI integrate multi-agent reasoning and reinforcement learning for complex collaborative tasks [59][60].

Group 4: Summary and Trends
- RL infrastructure is evolving from a "workshop" model to a "standardized pipeline," with framework designs becoming increasingly modular [65].
- Asynchronous architectures are becoming essential to handle the computational asymmetry between rollout and training [66].
- High-performance inference engines such as vLLM and SGLang significantly accelerate the rollout phase [66].
- The evolution from RLHF to agentic RL reflects the growing complexity of the tasks new frameworks must support [66].
- The choice of distributed training framework, such as Megatron-LM or DeepSpeed, is critical for large-scale model training [66].
- Scenario-driven secondary-development frameworks are addressing the unique challenges of vertical domains [66].
- The importance of orchestrators for managing the distributed components of RL systems is becoming widely recognized [66].
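To make the generator/trainer split described above concrete, here is a minimal Ray sketch of the two roles and the coordination loop between them. The class and method names are placeholders; real frameworks such as veRL or OpenRLHF layer inference engines (vLLM/SGLang), weight synchronization, and distributed trainers (Megatron/FSDP) on top of this skeleton.

```python
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
class Generator:
    """Rollout worker: would wrap an inference engine (vLLM/SGLang) in a real system."""
    def __init__(self):
        self.weights_version = 0

    def generate(self, prompts):
        # Placeholder rollout: return (prompt, response, reward) triples.
        return [(p, f"response[v{self.weights_version}] to {p}", 1.0) for p in prompts]

    def sync_weights(self, version):
        # In practice this would pull updated policy weights from the trainer.
        self.weights_version = version

@ray.remote
class Trainer:
    """Learner: would run distributed policy-gradient updates in a real system."""
    def __init__(self):
        self.version = 0

    def update(self, trajectories):
        # Pretend we did a gradient step on the batch and bumped the weights version.
        self.version += 1
        return self.version

generator, trainer = Generator.remote(), Trainer.remote()
for step in range(3):
    batch = ray.get(generator.generate.remote([f"prompt-{step}"]))   # rollout
    new_version = ray.get(trainer.update.remote(batch))              # train
    ray.get(generator.sync_weights.remote(new_version))              # weight sync
print("completed", new_version, "generator-trainer iterations")
```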
AI is stealing white-collar jobs: OpenAI pours $1 billion into teaching AI to work, and your perfect replacement is about to start
36Kr· 2025-09-25 09:32
Core Insights
- Major AI companies such as Anthropic and OpenAI plan to invest $1 billion annually to train AI to work like humans, using reinforcement learning environments and expert knowledge [1][4][21].
- There are concerns that AI could eliminate a significant share of entry-level white-collar jobs within the next 1-5 years, potentially raising the U.S. unemployment rate to 10-20% [1][2].

Investment and Development
- Anthropic and OpenAI are each allocating $1 billion per year to AI training, with OpenAI predicting this investment will rise to $8 billion by 2030 [4][10].
- The funding aims to overcome the limits of traditional training methods and explore new monetization avenues, such as workplace software and AI agents [4][10].

AI Training Methodology
- AI is being trained to handle complex tasks in various applications, including Salesforce and Zendesk, with a focus on executing real-world tasks [3][5].
- Turing has developed over 1,000 reinforcement learning environments that simulate real-world applications for AI training [12][13] (a toy example of such an environment follows this summary).

Expert Involvement
- The trend is shifting toward hiring experienced professionals from various fields to provide real-world task examples for AI to learn from [15][20].
- The cost of hiring experts is increasing, with some contracts exceeding $120 per hour, and projections suggest rates could rise to $150-$250 per hour in the next 18 months [11][10].

Future Implications
- As AI learns from expert knowledge and workplace applications, it is expected to gradually take over human jobs across various industries [24][21].
- Integrating AI into the economy could lead to a transformation in which the entire economic system operates like a reinforcement learning machine [21][1].
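The "reinforcement learning environments" referred to above are, at bottom, simulators of a software workflow with observations, an action space, and a task-completion reward. The sketch below is a generic, made-up example of such an environment (a toy ticket-triage queue); it is not Turing's or OpenAI's actual tooling.

```python
from typing import Dict, Tuple

class TicketTriageEnv:
    """Toy 'workplace' environment: the agent must route support tickets to the right team."""

    def __init__(self):
        self.tickets = [("refund not received", "billing"),
                        ("app crashes on login", "bug"),
                        ("reset my password", "account")]
        self.idx = 0

    def reset(self) -> str:
        self.idx = 0
        return self.tickets[self.idx][0]  # observation: the ticket text

    def step(self, action: str) -> Tuple[str, float, bool, Dict]:
        _, correct_route = self.tickets[self.idx]
        reward = 1.0 if action == correct_route else 0.0  # task-completion reward
        self.idx += 1
        done = self.idx >= len(self.tickets)
        obs = "" if done else self.tickets[self.idx][0]
        return obs, reward, done, {}

env = TicketTriageEnv()
obs, done, total = env.reset(), False, 0.0
while not done:
    action = "bug" if "crash" in obs else "billing"  # placeholder policy, not a trained agent
    obs, reward, done, _ = env.step(action)
    total += reward
print("episode reward:", total)
```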
WeChat-YATT bursts onto the scene: where is Tencent's reinforcement learning strategy aimed?
Sohu Caijing· 2025-09-24 09:56
Core Insights
- Tencent's open-sourcing of the WeChat-YATT training library marks a strategic move in the competitive landscape of AI model training, timed as OpenAI's GPT-5 approaches release [1][2].
- WeChat-YATT is designed around reinforcement learning and multimodal models, differentiating it from mainstream frameworks such as TensorFlow and PyTorch [2].

Group 1: WeChat-YATT's Innovations
- WeChat-YATT claims breakthroughs in three areas: more efficient parameter updates for reinforcement learning, flexible interfaces for multimodal data fusion, and a modular design that lowers the barrier to distributed training [2][4].
- Its emphasis on "ease of extensibility" reflects Tencent's recognition that large-model training demands rapid iteration [4].

Group 2: Competitive Positioning
- Compared with Meta's PyTorch, WeChat-YATT is positioned as stronger in reinforcement learning support; against Google's JAX, it claims advantages in Chinese-language scenarios and multimodal processing [4].
- Its deep integration with the WeChat ecosystem sets it apart from comparable reinforcement learning frameworks such as Ray RLlib [4].

Group 3: Strategic Implications
- The release of WeChat-YATT aligns with Tencent's broader AI strategy, which includes trademark applications for a "WeChat AI Service Platform" and deployment of the Hunyuan model in business scenarios [7].
- Tencent aims to create a closed-loop AI ecosystem through foundational technology breakthroughs and application deployment, with WeChat-YATT serving as a critical component of this strategy [7].
- The focus on reinforcement learning indicates Tencent's commitment to key areas such as gaming, recommendation systems, and autonomous driving, positioning itself for future AI applications [7].

Group 4: Long-term Vision
- The name WeChat-YATT, "Yet Another Transformer Trainer," reflects both a sense of humor and Tencent's long-term investment in AI infrastructure [6].
- Competition in the era of large models is fundamentally competition over infrastructure, and WeChat-YATT represents one piece of Tencent's broader AI blueprint [7].