Reinforcement Learning
NeurIPS 2025 | SURDS Dataset and GRPO Comprehensively Strengthen Spatial Reasoning for Autonomous Driving
自动驾驶之心· 2025-09-27 23:33
Core Insights
- The article discusses the challenge of achieving accurate spatial reasoning in autonomous driving scenarios using Vision Language Models (VLMs), highlighting the lack of large-scale benchmarks in this area [2][20].
- A new benchmark called SURDS has been introduced to systematically evaluate the spatial reasoning capabilities of VLMs, revealing significant shortcomings in current models [4][20].

Benchmark Overview
- SURDS is a large-scale benchmark built on the nuScenes dataset, consisting of 41,080 visual-question training instances and 9,250 evaluation samples, covering six spatial categories: direction recognition, pixel-level localization, depth estimation, distance comparison, left-right ordering, and front-back relationships [4][20].
- The dataset includes diverse multimodal information collected from urban environments in Boston and Singapore, ensuring realistic testing scenarios [6][20].

Model Training and Evaluation
- The research emphasizes the importance of data generation and introduces a novel automated process for generating high-quality reasoning chains, which enhances the model's spatial reasoning capabilities [8][10].
- A reinforcement learning framework combining spatial localization rewards with logical consistency objectives was designed, leading to significant performance improvements across tasks [11][20].

Experimental Results
- The evaluation shows notable differences among models on spatial reasoning tasks, with the proposed model achieving a nearly 60% improvement in depth estimation accuracy over the second-best model [14][20].
- Most existing models struggle with single-object tasks, often performing close to random, indicating a need to better learn absolute pose and metric information [16][20].

Training Strategy Insights
- Ablation studies indicate that combining localization and logical rewards significantly enhances model performance, underscoring the foundational role of localization ability in spatial reasoning [16][18].
- The research also finds that model parameter scale does not directly correlate with spatial understanding, suggesting that simply increasing model size is insufficient [16][20].
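The combined reward described above can be sketched as a weighted sum of a localization term and a logical-consistency term. The function below is only an illustrative guess at the shape of such a reward: the IoU-based localization term, the binary consistency term, and the weights `w_loc`/`w_logic` are assumptions, not the paper's actual formulation.

```python
def spatial_reward(pred_box, gt_box, answer_correct, w_loc=0.5, w_logic=0.5):
    """Combine a localization reward (IoU of predicted vs. ground-truth box,
    both as (x1, y1, x2, y2)) with a logical-consistency reward (whether the
    final answer matches the label), as a weighted sum in [0, 1]."""
    # Intersection rectangle of the two boxes
    ix1, iy1 = max(pred_box[0], gt_box[0]), max(pred_box[1], gt_box[1])
    ix2, iy2 = min(pred_box[2], gt_box[2]), min(pred_box[3], gt_box[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    def area(b):
        return (b[2] - b[0]) * (b[3] - b[1])

    union = area(pred_box) + area(gt_box) - inter
    iou = inter / union if union > 0 else 0.0
    return w_loc * iou + w_logic * (1.0 if answer_correct else 0.0)
```

A perfectly localized, correctly answered sample scores 1.0; a correct answer with a badly placed box earns only the logical term, which matches the ablation finding that localization ability is foundational.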
SOTA Even Without Data? Tsinghua & Shanghai AI Lab Crack Two Major Bottlenecks in Robot RL
具身智能之心· 2025-09-27 01:33
Core Insights
- The article discusses the development of SimpleVLA-RL, a new framework designed to enhance the training and generalization capabilities of Vision-Language-Action (VLA) models in robotics, addressing key limitations in existing training paradigms [4][14].

Group 1: Key Contributions of SimpleVLA-RL
- SimpleVLA-RL effectively addresses three major bottlenecks in VLA model training: high data collection costs, insufficient generalization ability, and the need for large-scale demonstration data [6][11].
- The framework demonstrates state-of-the-art (SoTA) performance on standard benchmarks such as LIBERO and RoboTwin, achieving significant improvements in success rates even with limited data [6][21].
- With only a single demonstration per task, the average success rate of OpenVLA-OFT on LIBERO increased from 48.9% to 96.9%, and on long-horizon tasks it improved from 17.3% to 91.7% [6][21].

Group 2: Training Mechanism and Innovations
- The training mechanism includes interactive trajectory sampling, outcome-based reward modeling, and exploration enhancement, which together improve data efficiency and model performance [15][16][17].
- The outcome reward model simplifies the reward structure to binary outcomes (success or failure), keeping the focus on the training objective and avoiding the complexity of process rewards [16][21].
- The exploration-enhancement strategy encourages diverse exploration during training, preventing the model from converging to narrow solutions [17][19].

Group 3: Performance Metrics and Benchmark Results
- SimpleVLA-RL achieved an average success rate of 99.1% on the LIBERO benchmark, with success rates on long-horizon tasks increasing by 12.0 percentage points [23].
- On RoboTwin1.0, the average success rate improved from 39.8% to 70.4%, with notable gains on specific tasks such as "Blocks Stack," which saw a 33.1-percentage-point increase [25].
- The framework also delivered significant gains on RoboTwin2.0, with average success rates rising from 38.3% to 68.8%, surpassing previous models [27].

Group 4: Real-World Application and Generalization
- A model trained solely on simulation data showed enhanced adaptability to real-world tasks, with average success rates increasing from 17.5% to 38.5% in practical applications [30].
- The emergence of the "Pushcut" phenomenon indicates that the model can autonomously discover new strategies beyond human demonstrations, showcasing its potential for adaptive learning [32][34].
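A binary success/failure reward plugs naturally into a group-relative policy update: each rollout's score is normalized against the other rollouts in its group. The sketch below illustrates that idea; the GRPO-style mean/std normalization is a common choice and an assumption here, not necessarily SimpleVLA-RL's exact update rule.

```python
import statistics

def group_advantages(rewards):
    """Compute group-relative advantages from binary outcome rewards
    (1.0 = task success, 0.0 = failure): subtract the group mean and
    divide by the group standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All rollouts succeeded or all failed: no contrast, no signal.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]
```

One consequence visible even in this toy form: a group where every rollout fails (or every rollout succeeds) contributes zero gradient signal, which is one reason diverse exploration during sampling matters.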
OpenAI's Two Chiefs Share a Highly Informative Interview: the Ultimate Goal Is an "Automated Researcher," and Hiring Isn't About Finding the Most "High-Profile" People
36Kr· 2025-09-26 12:15
Core Insights
- OpenAI's leadership discussed the advancements and future direction of GPT-5, emphasizing its role in mainstreaming reasoning capabilities and agentic behavior [6][7][9].
- The company aims to develop an automated researcher that can discover new ideas and contribute to scientific progress [13][25].
- OpenAI's research philosophy prioritizes foundational research over short-term product competition, focusing on long-term goals [25][28].

Group 1: GPT-5 and Reasoning
- GPT-5 represents a strategic shift toward integrating reasoning capabilities into mainstream applications, moving beyond previous models that focused on immediate responses [6][7].
- Past evaluation metrics are nearing saturation, prompting OpenAI to seek new ways to assess models by their ability to discover new information and achieve practical advances in economically relevant areas [8][9].

Group 2: Automated Researcher Goal
- OpenAI's long-term objective is to create an automated researcher capable of independently generating new ideas, starting with automating internal research before expanding to other scientific fields [13][25].
- Current reasoning capabilities have reached a level where models can perform complex tasks in a significantly reduced timeframe, with ongoing efforts to extend this capability [13][14].

Group 3: Reinforcement Learning (RL)
- OpenAI's reinforcement learning approach remains robust, with ongoing developments expected to simplify reward models and better align them with human learning processes [16][17].
- The company emphasizes the importance of flexibility in understanding RL, as tools and methodologies continue to evolve rapidly [17].

Group 4: Programming and Coding
- The introduction of GPT-5-codex aims to optimize programming tasks, addressing previous inefficiencies in how models allocated time to problem-solving [18][19].
- Coding practice is shifting toward "vibe coding," where intuition plays a significant role, reflecting a generational change in how programming is approached [21][22].

Group 5: Talent Acquisition and Research Culture
- OpenAI seeks individuals with perseverance and a solid technical foundation, rather than those who are merely prominent on social media or have flashy accomplishments [22][24].
- The company fosters a culture that values foundational research and encourages researchers to pursue significant long-term questions without being distracted by immediate market pressures [25][28].

Group 6: Resource Allocation
- Asked about resource allocation, OpenAI's leadership indicated that additional resources would be directed toward computational power, highlighting its critical role in research and development [26][27].
- The company acknowledges the ongoing challenges posed by computational limits, which continue to shape the balance between product development and research initiatives [27][28].
OpenAI's Two Chiefs Share a Highly Informative Interview! The Ultimate Goal Is an "Automated Researcher," and Hiring Isn't About Finding the Most "High-Profile" People
量子位· 2025-09-26 04:56
Core Insights
- OpenAI's latest interview reveals significant advancements in GPT-5, focusing on long-horizon reasoning and bringing agentic behavior into mainstream applications [1][7][9].
- The company emphasizes the importance of protecting foundational research while avoiding distractions from short-term product competition [6][48].

Group 1: GPT-5 Developments
- GPT-5 aims to mainstream reasoning capabilities, moving beyond previous models that focused on immediate responses [8][10].
- The model represents a strategic shift toward enhancing reasoning and agentic behaviors, making them more accessible to users [9][10].

Group 2: Evaluation and Progress
- Current evaluation metrics are nearing saturation, necessitating new methods to assess models' ability to discover new insights and achieve practical advances in economically relevant areas [12][13].
- OpenAI plans to focus on the time span over which models can reason and make progress, with current capabilities reaching approximately 1 to 5 hours [23][25].

Group 3: Automation and Research Goals
- OpenAI's long-term goal is to develop an automated researcher capable of discovering new ideas, starting with automating internal research [20][21].
- The company is interested in measuring the duration of autonomous operation as a key evaluation metric [25].

Group 4: Reinforcement Learning (RL)
- Despite skepticism, reinforcement learning continues to thrive, with OpenAI exploring new directions and ideas [27][29].
- The evolution of reward models is expected to accelerate, simplifying the process of building effective fine-tuning datasets [29][30].

Group 5: Programming and Coding
- OpenAI's GPT-5-codex is designed to optimize programming tasks, addressing previous models' inefficiency in allocating problem-solving time [32][34].
- The current state of coding tools is likened to the "uncanny valley": effective, but not yet fully comparable to human performance [37][41].

Group 6: Talent Acquisition and Research Culture
- OpenAI prioritizes persistence and the ability to learn from failure in its research culture, seeking individuals with a solid technical foundation [44][46].
- The company focuses on foundational research rather than merely following competitors, fostering an innovative environment [46][48].

Group 7: Resource Allocation
- Given additional resources, OpenAI would prioritize computational power, recognizing its critical role in research and development [49][51].
- The company maintains a long-term research focus, emphasizing the importance of computational resources and physical constraints in future advancements [52].
SOTA Even Without Data? Tsinghua & Shanghai AI Lab Crack Two Major Bottlenecks in Robot RL
量子位· 2025-09-26 02:08
Core Viewpoint
- The article discusses the development of SimpleVLA-RL, an end-to-end online training solution for Vision-Language-Action (VLA) models, aimed at enhancing the flexibility and performance of robots in complex environments while addressing existing training bottlenecks [3][12].

Group 1: Key Challenges in Existing Training Paradigms
- Current training paradigms face significant challenges, including high data collection costs and insufficient generalization [2][8].
- Reliance on large-scale, high-quality robot operation trajectories limits scalability and raises costs, making data acquisition a major hurdle [8].
- Models struggle to generalize, particularly on out-of-distribution tasks and in new environments, with performance dropping on long-horizon dependencies and combinatorial tasks [8][9].

Group 2: SimpleVLA-RL Framework
- SimpleVLA-RL combines interactive trajectory sampling, outcome-based rewards, and enhanced exploration to tackle the core challenges of VLA model training [5][6].
- The framework demonstrates state-of-the-art (SoTA) performance on standard benchmarks like LIBERO and RoboTwin, achieving significant improvements even with limited data [5][21].
- With only a single demonstration per task, the average success rate on LIBERO increased from 48.9% to 96.9% after applying SimpleVLA-RL [5].

Group 3: Performance Metrics and Results
- SimpleVLA-RL achieved an average success rate of 99.1% on LIBERO, with long-horizon tasks improving by 12.0 percentage points [21].
- On RoboTwin1.0, the average success rate rose from 39.8% to 70.4%, with specific tasks like "Blocks Stack" improving by 33.1 percentage points [23].
- The framework also showed a significant performance increase on RoboTwin2.0, with average success rates improving from 38.3% to 68.8% [25].

Group 4: Innovations and Discoveries
- Training led to the emergence of new operational strategies, such as the "Pushcut" phenomenon, where the model autonomously discovers more efficient methods beyond human demonstrations [10][31].
- This phenomenon indicates that reinforcement learning can enable VLA models to surpass the limits of human demonstration patterns, paving the way for future adaptive VLA model development [31].
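One common way to implement the kind of exploration enhancement described above is to sample actions at a higher softmax temperature, flattening the action distribution so the policy does not collapse onto the single demonstrated trajectory. The snippet is a generic sketch of that trick; the temperature value and the mechanism itself are illustrative assumptions, not SimpleVLA-RL's documented implementation.

```python
import math
import random

def sample_action(logits, temperature=1.2):
    """Sample an action index from temperature-scaled softmax over logits.
    temperature > 1 flattens the distribution (more exploration);
    temperature < 1 sharpens it (more exploitation)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(probs)), weights=probs, k=1)[0]
```

During rollout collection the temperature would be raised to diversify trajectories; at evaluation time it is typically lowered (or replaced by argmax) for deterministic behavior.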
A Look at the Evolution of RL Infra Architecture Through Today's Mainstream RL Libraries
自动驾驶之心· 2025-09-25 23:33
Core Viewpoint
- Reinforcement Learning (RL) is transitioning from a supporting technology to a core driver of model capabilities, with multi-step, interactive agent training seen as a path toward Artificial General Intelligence (AGI) [2][6].

Group 1: Modern RL Infrastructure Architecture
- The core components of modern RL infrastructure are a Generator, which interacts with the environment to produce trajectories and compute rewards, and a Trainer, which updates model parameters from the trajectory data [6][4].
- The generator-trainer architecture, combined with a distributed coordination layer such as Ray, forms the "gold standard" for RL systems [6][4].

Group 2: Primary Development
- Primary-development frameworks serve as foundations for building RL training pipelines, providing core algorithm implementations and integration with underlying training/inference engines [8][7].
- TRL (Transformer Reinforcement Learning) is a user-friendly RL framework from Hugging Face that supports a variety of algorithms [9][10].
- OpenRLHF, developed by a collaborative team including ByteDance and NetEase, aims to provide an efficient, scalable RLHF and agentic RL framework [11][14].
- veRL, developed by ByteDance's Seed team, is one of the most comprehensive frameworks, with extensive algorithm support [16][19].
- AReaL (Asynchronous Reinforcement Learning) targets large-scale, high-throughput RL training with a fully asynchronous architecture [20][21].
- NeMo-RL, launched by NVIDIA, integrates into the broader NeMo ecosystem and focuses on production-grade RL [24][28].
- ROLL, an Alibaba open-source framework, emphasizes asynchronous and agentic capabilities for large-scale LLM RL [30][33].
- slime, developed by Tsinghua and Zhipu, is a lightweight framework focused on seamless integration of SGLang with Megatron [34][36].

Group 3: Secondary Development
- Secondary-development frameworks are built on top of primary frameworks and target specific downstream application scenarios such as multimodal, multi-agent, and GUI automation [44][3].
- Agentic RL frameworks, such as verl-agent, optimize asynchronous rollout and training, addressing the core challenge of multi-round interaction with external environments [46][47].
- Multimodal RL frameworks, like VLM-R1 and EasyR1, focus on training visual-language reasoning models, addressing data-processing and loss-function design challenges [53][54].
- Multi-agent RL frameworks, such as MARTI, integrate multi-agent reasoning and reinforcement learning for complex collaborative tasks [59][60].

Group 4: Summary and Trends
- RL infrastructure is evolving from a "workshop" model to a "standardized pipeline," with framework designs becoming increasingly modular [65].
- Asynchronous architectures are becoming essential to address the computational asymmetry between rollout and training [66].
- High-performance inference engines like vLLM and SGLang significantly accelerate the rollout phase [66].
- The evolution from RLHF to agentic RL reflects the growing complexity of tasks the new frameworks must support [66].
- The choice of distributed training framework, such as Megatron-LM or DeepSpeed, is critical for large-scale model training [66].
- Scenario-driven secondary-development frameworks are addressing the unique challenges of vertical domains [66].
- The importance of orchestrators for managing the distributed components of RL systems is becoming widely recognized [66].
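The generator-trainer split described above can be illustrated with a deliberately tiny, dependency-free loop: a Generator rolls out the current policy and attaches rewards, and a Trainer updates the policy from those trajectories. The one-step environment, the single-parameter policy, and the update rule are invented purely for illustration; real frameworks replace them with an inference engine (vLLM/SGLang), a gradient-based trainer (Megatron-LM/DeepSpeed), and a coordination layer such as Ray.

```python
import random

class ToyEnv:
    """One-step toy environment: action 1 earns reward 1.0, action 0 earns 0.0."""
    def step(self, action):
        return 1.0 if action == 1 else 0.0

class Generator:
    """Interacts with the environment under the current policy and
    returns a batch of trajectories as (action, reward) pairs."""
    def __init__(self, env):
        self.env = env
    def rollout(self, p_action1, n):
        batch = []
        for _ in range(n):
            a = 1 if random.random() < p_action1 else 0
            batch.append((a, self.env.step(a)))
        return batch

class Trainer:
    """Updates the policy parameter from trajectory data by moving the
    probability of action 1 toward whichever action earned more reward."""
    def __init__(self, lr=0.3):
        self.p, self.lr = 0.5, lr
    def update(self, batch):
        r1 = [r for a, r in batch if a == 1]
        r0 = [r for a, r in batch if a == 0]
        m1 = sum(r1) / len(r1) if r1 else 0.0
        m0 = sum(r0) / len(r0) if r0 else 0.0
        target = 1.0 if m1 >= m0 else 0.0
        self.p += self.lr * (target - self.p)

random.seed(0)
generator, trainer = Generator(ToyEnv()), Trainer()
for _ in range(20):
    trainer.update(generator.rollout(trainer.p, 32))
```

The structural point survives the simplification: generation and training touch the same policy but have very different compute profiles, which is exactly the asymmetry the asynchronous architectures in Group 4 are built to exploit.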
AI Is Stealing White-Collar Jobs: OpenAI Pours $1 Billion into Teaching AI to Work, and Your Perfect Successor Is About to Clock In
36Kr· 2025-09-25 09:32
Core Insights
- Major AI companies like Anthropic and OpenAI plan to invest $1 billion annually to train AI to work like humans, using reinforcement learning environments and expert knowledge [1][4][21].
- There are concerns that AI could eliminate a significant share of entry-level white-collar jobs within the next 1-5 years, potentially raising the U.S. unemployment rate to 10-20% [1][2].

Investment and Development
- Anthropic and OpenAI are each allocating $1 billion per year to AI training, with OpenAI predicting this investment will rise to $8 billion by 2030 [4][10].
- The funding aims to overcome the limits of traditional training methods and open new monetization avenues, such as workplace software and AI agents [4][10].

AI Training Methodology
- AI is being trained to handle complex tasks in applications such as Salesforce and Zendesk, with a focus on real-world task execution [3][5].
- Turing has built more than 1,000 reinforcement learning environments that simulate real-world applications for AI training [12][13].

Expert Involvement
- The trend is shifting toward hiring experienced professionals from various fields to provide real-world task examples for AI to learn from [15][20].
- The cost of hiring experts is rising, with some contracts exceeding $120 per hour, and rates projected to reach $150-$250 per hour within the next 18 months [11][10].

Future Implications
- As AI learns from expert knowledge and workplace applications, it is expected to gradually take over human jobs across industries [24][21].
- Integrating AI into the economy could lead to a transformation in which the entire economic system operates as a reinforcement learning machine [21][1].
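Reinforcement learning environments that simulate workplace applications, like those described above, typically expose a gym-style reset/step interface the agent interacts with. The sketch below shows such an interface for a made-up support-ticket triage task; the task, queue names, and reward scheme are illustrative assumptions, not taken from Turing's actual environments.

```python
class TicketTriageEnv:
    """Minimal gym-style RL environment simulating a workplace task:
    route each support ticket to the correct queue. Reward is 1.0 per
    correctly routed ticket, 0.0 otherwise (a hypothetical scheme)."""
    QUEUES = ("billing", "technical", "account")

    def __init__(self, tickets):
        self.tickets = tickets  # list of (ticket_text, correct_queue)
        self.i = 0

    def reset(self):
        """Start a new episode and return the first ticket's text."""
        self.i = 0
        return self.tickets[0][0]

    def step(self, action):
        """Route the current ticket; return (next_obs, reward, done)."""
        _, correct = self.tickets[self.i]
        reward = 1.0 if action == correct else 0.0
        self.i += 1
        done = self.i >= len(self.tickets)
        obs = None if done else self.tickets[self.i][0]
        return obs, reward, done
```

A production version would wrap a sandboxed copy of the real application (forms, APIs, UI state) behind the same reset/step contract, which is what makes such environments expensive to build at scale.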
WeChat-YATT Arrives: Where Is Tencent's Reinforcement Learning Strategy Headed?
Sou Hu Cai Jing· 2025-09-24 09:56
Core Insights
- Tencent's open-sourcing of the WeChat-YATT training library marks a strategic move in the competitive AI model-training landscape, particularly as OpenAI's GPT-5 approaches release [1][2].
- WeChat-YATT is designed around reinforcement learning and multimodal models, differentiating it from mainstream frameworks like TensorFlow and PyTorch [2].

Group 1: WeChat-YATT's Innovations
- WeChat-YATT claims significant breakthroughs in three areas: optimized parameter-update efficiency for reinforcement learning, flexible multimodal data-fusion interfaces, and a modular design that lowers the barrier to distributed training [2][4].
- The library's emphasis on "ease of extensibility" reflects Tencent's recognition that large-model training demands rapid iteration [4].

Group 2: Competitive Positioning
- Compared with Meta's PyTorch, WeChat-YATT is stronger in reinforcement learning support; against Google's JAX, it holds advantages in Chinese-language scenarios and multimodal processing [4].
- WeChat-YATT's deep integration with the WeChat ecosystem sets it apart from comparable reinforcement learning frameworks like Ray RLlib [4].

Group 3: Strategic Implications
- The release of WeChat-YATT aligns with Tencent's broader AI strategy, which includes trademark applications for a "WeChat AI Service Platform" and deployment of the Hunyuan model in business scenarios [7].
- Tencent aims to build a closed-loop AI ecosystem through foundational technology breakthroughs and application deployment, with WeChat-YATT serving as a critical component of this strategy [7].
- The focus on reinforcement learning signals Tencent's commitment to key areas such as gaming, recommendation systems, and autonomous driving, positioning it for future AI applications [7].

Group 4: Long-Term Vision
- The name WeChat-YATT, "Yet Another Transformer Trainer," reflects both a sense of humor and Tencent's long-term investment in AI infrastructure [6].
- Competition in the era of large models is fundamentally a competition over infrastructure, and WeChat-YATT represents one piece of Tencent's broader AI blueprint [7].
Find Your AI Kindred Spirits | New "Jinqiu Dinner Table" Events Announced
锦秋集· 2025-09-23 09:44
Core Viewpoint
- The article promotes a series of networking events called "Jinqiu Dinner Table," aimed at entrepreneurs and tech innovators who share insights and experiences in a casual setting, emphasizing collaboration and innovation in the tech industry [22][23][24].

Event Details
- Upcoming events include:
  - AI Agent in Shenzhen on September 26, 2025 [3][50]
  - Embodied Intelligence in Beijing on October 10, 2025 [5][12]
  - Robot Party in Shenzhen on October 17, 2025 [19][50]

Networking Concept
- "Jinqiu Dinner Table" is described as an informal gathering for entrepreneurs, product technologists, and innovators to discuss topics rarely addressed in formal settings, focusing on genuine exchange and practical insight [22][23].
- The initiative has hosted 31 sessions covering various technology and investment topics, creating a platform for sharing the challenges and decision-making processes of entrepreneurship [24].

AI and Decision-Making Insights
- The article discusses the limits of large language models (LLMs) in serious decision-making tasks, noting that traditional reinforcement learning models perform better in high-stakes environments [25][26].
- It emphasizes the need for high-quality decision-making knowledge and data, which existing LLMs currently lack [26][27].

Agent Architecture and Applications
- The article outlines the evolution of AI agent architectures, including single-agent and multi-agent systems, and their application to complex problems [36][38].
- It highlights the importance of clear, structured requirements for AI agents to deliver expected outcomes, stressing that vague instructions lead to poor performance [38].

Future Trends in AI Interaction
- New ways of interacting with AI, such as voice commands and proactive AI hardware, could transform user experiences and task execution [42][43].
- The development of specialized browsers for AI could enhance performance by providing better context understanding and data access [46].

Investment Opportunities
- Jinqiu Capital's "Soil Seed Special Plan" is introduced, aimed at supporting early-stage AI entrepreneurs with funding to help them realize their innovative ideas [57][59].
Charging into the New-Energy First Tier: Buick Zhijing L7, the "New Benchmark for Extended-Range Luxury Sedans," Makes Its Nationwide Debut
Zhong Guo Qi Che Bao Wang· 2025-09-23 05:51
Core Insights
- The Buick Zhijing L7, a luxury electric sedan, has been unveiled as the flagship model of Buick's high-end electric sub-brand, showcasing advanced technology and luxury features [1][3][21].

Group 1: Product Features
- The Zhijing L7 is built on Buick's new "Xiaoyao" super-fusion architecture, integrating top technologies in driving, assisted driving, and luxury comfort [3][5].
- It features the "Zhenlong" range-extender system, which delivers a maximum power output of 252 kW, equivalent to a 3.0T V6 engine, with 0-100 km/h acceleration in just 5.9 seconds [5][8].
- The vehicle offers a pure-electric range of 302 km and a total range of 1,420 km, addressing common concerns about electric vehicle range [5][8].
- The Zhijing L7 is equipped with a high-performance battery rated for 640,000 km with low degradation, ensuring safety and longevity [8].

Group 2: Intelligent Features
- The Zhijing L7 introduces the "Xiaoyao Zhixing" assisted-driving system, featuring the Momenta R6 flywheel model based on end-to-end reinforcement learning, providing comprehensive driving assistance [9][11].
- It includes a 50-inch panoramic AR-HUD head-up display and a 15.6-inch smart central control screen, enhancing user interaction and information display [11][16].
- The intelligent cockpit is powered by Qualcomm's latest SA8775P chip, delivering high computational power across smart driving scenarios [13][11].

Group 3: Luxury and Comfort
- The Zhijing L7 offers a spacious interior measuring 5032 mm x 1952 mm x 1500 mm with a 3000 mm wheelbase, befitting its status as a luxury sedan [14][19].
- The interior design combines high-quality materials and advanced sound insulation, creating a serene, luxurious atmosphere [15][19].
- It offers unique seating configurations, including the industry's first dual 120° zero-gravity seats for enhanced comfort [19][21].

Group 4: Market Positioning
- The Zhijing L7 aims to redefine luxury standards in the electric vehicle market, combining advanced range-extender technology with top-tier intelligent features and luxury experiences [21].
- The vehicle is positioned to compete in the high-end electric segment, leveraging Buick's heritage and innovation to attract consumers [21].