Reinforcement Learning
The global RL+VLA paradigm and PI*0.6 both carry this company's technical groundwork
具身智能之心· 2025-12-13 01:02
This article originally appeared on 具身纪元 (author: 具身纪元; "witnessing the embodied-AI wave, writing a new era of intelligence"). Edited by 机器之心. In Physical Intelligence's latest π0.6 paper, the authors describe where the iterative reinforcement-learning idea behind π0.6 came from: it includes familiar work from Yuke Zhu, some of their own research (Chelsea Finn, Sergey Levine) that we have been tracking and covering, and work from domestic embodied-intelligence teams such as Tsinghua University and 星动纪元. With the release of π*0.6, VLA + online RL has become an industry-consensus, highly promising research direction (see our earlier pieces "A deep dive into the π*0.6 paper: it goes beyond real-world reinforcement learning" and "NVIDIA is also working on real-world self-improvement for VLA"), and the path large language models took from SFT to RL is likewise becoming clear in embodied research. 1. Why VLA+RL matters. Figure caption: VLA models rely on supervised fine-tuning. In embodied intelligence (Embodi ...
The 具身智能之心 paper-tutoring service is officially launched — the most professional mentors in China are here!
具身智能之心· 2025-12-12 07:59
If you have any paper-publication needs, including consultations on topics or research directions, contact us on WeChat: paperguidance. The 具身智能之心 paper-tutoring service is officially launched, with the most professional mentors in China. Relevant directions include large models, VLA, VLA+RL, vision-language navigation, end-to-end driving, reinforcement learning, Diffusion Policy, sim2real, embodied interaction, pose estimation, robot decision-making and planning, motion planning, 3DGS, SLAM, tactile perception, bipedal/quadruped robots, teleoperation, zero-shot learning, and more. Services offered: full-process paper guidance; experiment guidance; PhD-application guidance; topic selection; targets including top embodied-AI conferences and journals (CCF-A/B/C), SCI Q1–Q4, CAS Zone 1–4, EI and Chinese core journals; support for theses, PhD applications, and competitions. The acceptance rate is high: mentored papers have already been accepted at CVPR, AAAI, ECCV, CoRL, ICLR, IROS, ICRA, ACL, and other top venues. Tutoring prices vary with the target paper tier. For more details, contact the research assistant on WeChat: paperguidance.
First prize in a national-level science and technology award: NetEase Fuxi's industry-academia-research collaborative innovation earns authoritative recognition
Sou Hu Cai Jing· 2025-12-12 04:15
Recently, the China Society of Image and Graphics (CSIG) announced the results of its 2025 Science and Technology Award. The project "Key Technologies and Applications of Reinforcement-Learning-Based Intelligent Decision-Making," a collaboration between NetEase (Hangzhou) Network Co., Ltd., Tianjin University, the University of Science and Technology of China, and the 17th Institute of the Fourth Academy of China Aerospace Science and Industry Corporation, won the society's 2025 Science and Technology Progress Award (First Prize), another fruit of industry-academia-research collaboration. 《逆水寒》 is one of the project's deployment scenarios. As NetEase's AI laboratory, NetEase Fuxi (网易伏羲) was deeply involved in 《逆水寒》's exploration of graphics technology and artificial intelligence, using innovative game-AI practice to help a digital entertainment work push beyond the boundary of "games," and earning national-level academic recognition. About the award-winning project: it systematically tackled key technologies for reinforcement-learning-based intelligent decision-making, focusing on three core challenges — low-quality reward signals, difficulty reusing experience, and large environmental fluctuations — and proposed three innovations: reward-signal generation based on spatiotemporal decomposition, experience-representation extraction based on self-supervised learning, and policy-model optimization based on evolutionary reinforcement learning. These delivered breakthroughs in policy performance, learning efficiency, and cross-task generalization, reaching an internationally leading level.
The global RL+VLA paradigm and PI*0.6 both carry this Chinese company's technical groundwork
机器之心· 2025-12-12 03:41
Core Insights
- The article discusses the significance of integrating Vision-Language-Action (VLA) models with Reinforcement Learning (RL) in the field of Embodied AI, emphasizing the limitations of imitation learning and the necessity for robust learning methods [1][2][4].

Group 1: Importance of VLA+RL
- VLA models are being developed to apply powerful Vision-Language Models (VLM) in the control of robots, primarily through supervised fine-tuning (SFT) [2].
- Imitation learning alone is insufficient for robots to handle novel situations, necessitating the use of RL to enhance robustness and persistence in task execution [4].

Group 2: Challenges in Applying RL to VLA
- The integration of RL with VLA faces three main challenges: environmental differences, model instability, and computational demands [6].
- Direct application of RL algorithms to large VLA models can lead to catastrophic forgetting and training collapse, making it difficult to maintain performance [6].

Group 3: Solutions to VLA's RL Challenges
- The industry has proposed three types of solutions to address the challenges faced by VLA in RL applications, with a focus on internalizing high-value behaviors through SFT [7][13].
- The iRe-VLA model introduces a two-phase iterative learning process that alternates between online RL for exploration and supervised learning for consolidation [10][15].

Group 4: iRe-VLA Model Architecture
- The iRe-VLA model consists of a VLM backbone for understanding images and instructions, and an Action Head for translating features into control signals [11].
- The use of Low-Rank Adaptation (LoRA) technology allows for efficient training without the need for full-model fine-tuning [12].

Group 5: Experimental Results and Analysis
- Extensive experiments in both simulated environments and real-world scenarios demonstrate the effectiveness of the iRe-VLA method, showing significant improvements in task success rates [26][30].
- The iRe-VLA model outperformed traditional methods, achieving a success-rate increase from 43% to 83% in benchmark tasks [30].

Group 6: Conclusion and Future Implications
- The article concludes that the iRe-VLA approach provides a viable solution to the challenges of deploying large models in robotic control, ensuring stability and continuous learning [37][42].
- Future research directions include efficient exploration and learning of new skills under sparse rewards, as well as developing scalable RL algorithms for large VLA models [40].
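To make the alternation concrete, here is a minimal, runnable NumPy toy of an iRe-VLA-style two-phase loop — a sketch, not the authors' implementation: the "backbone" is a frozen random feature map, Phase 1 improves only a small action head with a crude perturbation-based RL step, and Phase 2 consolidates demonstrations plus newly successful rollouts with supervised regression. The environment, feature sizes, and update rules are all invented for illustration.

```python
# Illustrative iRe-VLA-style loop (toy, not the paper's code):
# Phase 1 = online RL on the action head only; Phase 2 = supervised consolidation.
import numpy as np

rng = np.random.default_rng(0)

D_OBS, D_FEAT = 4, 16
BACKBONE = rng.normal(size=(D_FEAT, D_OBS))          # stands in for the frozen VLM

def features(obs):
    return np.tanh(BACKBONE @ obs)                   # frozen representation

def rollout(head, horizon=20):
    """Roll out a toy point-mass task: drive the state toward the origin."""
    state = rng.normal(size=D_OBS)
    feats, acts, cost = [], [], 0.0
    for _ in range(horizon):
        f = features(state)
        a = head @ f + 0.05 * rng.normal(size=D_OBS)  # exploration noise
        feats.append(f)
        acts.append(a)
        state = state + a                             # trivial dynamics
        cost += float(np.sum(state ** 2))
    success = bool(np.linalg.norm(state) < 0.5)
    return np.array(feats), np.array(acts), -cost, success

def rl_update(head, lr=0.02, pop=8):
    """Crude RL step: evaluate random perturbations of the head, keep the best."""
    candidates = [head + lr * rng.normal(size=head.shape) for _ in range(pop)]
    returns = [rollout(c)[2] for c in candidates]
    return candidates[int(np.argmax(returns))]

def supervised_consolidation(buffer, reg=1e-3):
    """'SFT' phase: ridge-regress recorded actions onto the frozen features."""
    F = np.concatenate([f for f, a in buffer])
    A = np.concatenate([a for f, a in buffer])
    W = np.linalg.solve(F.T @ F + reg * np.eye(D_FEAT), F.T @ A)
    return W.T                                        # new action head

head = 0.01 * rng.normal(size=(D_OBS, D_FEAT))
buffer = []                                           # demos + successful rollouts
for it in range(5):
    for _ in range(20):                               # Phase 1: online RL, head only
        head = rl_update(head)
    f, a, ret, ok = rollout(head)
    if ok:
        buffer.append((f, a))                         # keep high-value behavior
    if buffer:                                        # Phase 2: supervised consolidation
        head = supervised_consolidation(buffer)
    print(f"iter {it}: return={ret:.1f} success={ok}")
```

The structural point is the one the summary emphasizes: exploration only touches the lightweight head, while consolidation folds high-value behavior back into a supervised objective, which is what keeps a large model stable.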
Classes officially begin! 7 projects to understand the current state of end-to-end deployment
自动驾驶之心· 2025-12-12 03:02
Core Insights
- The article discusses the evolving recruitment landscape in the autonomous driving industry, highlighting a shift in demand from perception roles to end-to-end, VLA, and world-model positions [2].
- A new advanced course focused on end-to-end production in autonomous driving has been designed, emphasizing practical applications and real-world experience [2][4].

Course Overview
- The course is structured into eight chapters, covering various aspects of end-to-end algorithms, including task overview, two-stage and one-stage frameworks, navigation information applications, reinforcement learning, trajectory optimization, and production experience sharing [5][7][8][9][10][11][12][13][14].
- The first chapter introduces the integration of perception tasks and learning-based control algorithms, which are essential skills for companies in the end-to-end era [7].
- The second chapter focuses on the two-stage end-to-end algorithm framework, discussing its modeling and information transfer between perception and planning [8].
- The third chapter covers one-stage end-to-end algorithms, emphasizing their performance advantages and various frameworks [9].
- The fourth chapter highlights the critical role of navigation information in autonomous driving and its integration into end-to-end models [10].
- The fifth chapter introduces reinforcement learning algorithms, addressing the limitations of imitation learning and the need for generalization [11].
- The sixth chapter involves practical projects on trajectory-output optimization, combining imitation and reinforcement learning [12].
- The seventh chapter discusses post-processing logic for trajectory smoothing and reliability in production [13].
- The final chapter shares production experience from multiple perspectives, focusing on tools and strategies for real-world applications [14].

Target Audience
- The course is aimed at advanced learners with a foundational understanding of autonomous driving algorithms, reinforcement learning, and programming skills [15][17].
i6/i8/MEGA deliveries of 6,798/6,719/680 | Li Auto's November 2025 numbers
理想TOP2· 2025-12-11 06:09
In November 2025 Li Auto delivered 33,181 vehicles: 18,984 extended-range (EREV) and 14,197 battery-electric (BEV). The L6/L7/L8/L9 delivered 9,434/5,212/2,130/2,208 respectively, and the i6/i8/MEGA delivered 6,798/6,719/680.

On November 14, 2025, 晚点 reported that the i6's gross margin during its launch period was about 10%; see "平替时代:一家车企、一个行业如何被自己的成功困住" ("The era of the affordable substitute: how a carmaker and an industry became trapped by their own success").

| Month | Total | EREV | BEV | L6 | L7 | L8 | L9 | i6 | i8 | MEGA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Nov 2025 | 33181 | 18984 | 14197 | 9434 | 5212 | 2130 | 2208 | 6798 | 6719 | 680 |
| Oct 2025 | 31767 | 18340 | 13427 | 9680 | 4347 | 2183 | 2130 | 5775 | 5749 | 1903 |
| Sep 2025 | 33951 | 24554 | 9397 | ... |
A year on, DiffusionDrive upgrades to v2 and sets a new record!
自动驾驶之心· 2025-12-11 03:35
Core Insights
- The article discusses the upgrade of DiffusionDrive to version 2, highlighting its advancements in end-to-end autonomous-driving trajectory planning through the integration of reinforcement learning to address the challenges of diversity and sustained high quality in trajectory generation [1][3][10].

Background Review
- The shift towards end-to-end autonomous driving (E2E-AD) has emerged as traditional tasks like 3D object detection and motion prediction have matured. Early methods faced limitations in modeling, often generating single trajectories without alternatives in complex driving scenarios [5][10].
- Previous diffusion models applied to trajectory generation struggled with mode collapse, leading to a lack of diversity in generated behaviors. DiffusionDrive introduced a Gaussian Mixture Model (GMM) to define prior distributions for initial noise, promoting diverse behavior generation [5][13].

Methodology
- DiffusionDriveV2 introduces a novel framework that utilizes reinforcement learning to overcome the limitations of imitation learning, which previously led to a trade-off between diversity and sustained high quality in trajectory generation [10][12].
- The framework incorporates intra-anchor GRPO and inter-anchor truncated GRPO to manage advantage estimation within specific driving intentions, preventing mode collapse by avoiding inappropriate comparisons between different intentions [9][12][28].
- The method employs scale-adaptive multiplicative noise to enhance exploration while maintaining trajectory smoothness, addressing the inherent scale inconsistency between proximal and distal segments of trajectories [24][39].

Experimental Results
- Evaluations on the NAVSIM v1 and NAVSIM v2 datasets demonstrated that DiffusionDriveV2 achieved state-of-the-art performance, with a PDMS score of 91.2 on NAVSIM v1 and 85.5 on NAVSIM v2, significantly outperforming previous models [10][33].
- The results indicate that DiffusionDriveV2 effectively balances trajectory diversity and sustained quality, achieving optimal performance in closed-loop evaluations [38][39].

Conclusion
- The article concludes that DiffusionDriveV2 successfully addresses the inherent challenges of imitation learning in trajectory generation, achieving an optimal trade-off between planning quality and diversity through innovative reinforcement-learning techniques [47].
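The grouping idea behind intra-anchor GRPO can be illustrated with a short, runnable sketch (not the paper's code): advantages are standardized only among trajectory samples drawn from the same anchor, so one driving intention is never penalized merely because another intention happens to score higher. The reward values and anchor labels below are made up, and the inter-anchor truncated term and the scale-adaptive noise are not shown.

```python
# Hedged illustration of intra-anchor grouping for group-relative advantages.
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: standardize rewards within one group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Toy rewards for N sampled trajectories per anchor (driving intention).
rewards_per_anchor = {
    "go_straight": [0.90, 0.85, 0.88, 0.40],   # one bad sample
    "turn_left":   [0.55, 0.60, 0.20, 0.58],   # lower absolute scores overall
}

# Intra-anchor: each intention is normalized against itself.
intra = {k: grpo_advantages(v) for k, v in rewards_per_anchor.items()}

# Naive pooling across anchors (the comparison the grouping is designed to avoid).
pooled = grpo_advantages(sum(rewards_per_anchor.values(), []))

print("intra-anchor advantages:", intra)
print("pooled advantages:      ", pooled)
# In the pooled case every turn_left sample gets a negative advantage simply
# because straight-driving scores higher here, which would collapse that mode;
# intra-anchor grouping preserves a positive signal for the best left turns.
```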
Say goodbye to expert dependence: robots learn to reference themselves, with performance soaring to 99.2% in only 200 steps
具身智能之心· 2025-12-11 02:01
Core Insights
- The article discusses the development of the Self-Referential Policy Optimization (SRPO) framework, which addresses the limitations of existing Vision-Language-Action (VLA) models in robotic tasks by enabling robots to learn from their own experience without relying on external expert data [3][10][56].

Motivation and Contribution
- SRPO aims to overcome the challenge of sparse reward signals in reinforcement learning, particularly in the VLA domain, by utilizing self-generated successful trajectories to provide progressive rewards for failed attempts [6][10].
- The framework eliminates the need for costly expert demonstrations and task-specific reward engineering, thus enhancing the efficiency of the learning process [10][12].

Technical Approach
- SRPO collects trajectories generated during policy inference and categorizes them into successful and failed attempts, using a latent world representation to model behavioral similarity [16][17].
- The framework employs a progressive reward mechanism based on the distance of failed trajectories to successful-trajectory representations, allowing for a more nuanced evaluation of task progress [22][24].

Experimental Results
- SRPO achieved a success rate of 99.2% on the LIBERO benchmark with only 200 steps of reinforcement learning, significantly outperforming traditional methods that rely on sparse rewards [29][30].
- In the LIBERO-Plus generalization tests, SRPO demonstrated a performance improvement of 167%, showcasing robust generalization without the need for additional training data [31][32].

Efficiency and Real-World Application
- SRPO's efficiency is highlighted by its ability to raise success rates from 17.3% to 98.6% on long-horizon tasks with minimal training steps, outperforming other models in training efficiency [36][39].
- The framework has been tested in real-world scenarios, showing significant improvements in success rates compared to supervised fine-tuning baselines [41][39].

Conclusion
- SRPO represents a significant advance in robot learning, allowing for autonomous exploration and creativity by enabling robots to learn from their own successes and failures, thus paving the way for a new approach to VLA reinforcement learning [56].
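Below is a minimal sketch of the self-referential reward idea, assuming a placeholder trajectory encoder and an exponential distance-to-reward mapping (neither is specified by the summary): failed rollouts are scored by how close their representation lies to the policy's own successful rollouts, turning a sparse 0/1 success signal into a dense, progressive one.

```python
# Toy illustration of an SRPO-style progressive reward (not the actual implementation).
import numpy as np

rng = np.random.default_rng(0)

def encode(trajectory):
    """Stand-in for a latent world representation: pool per-step features."""
    return trajectory.mean(axis=0)

def progressive_reward(failed_traj, success_embeddings, temperature=1.0):
    """Map distance to the nearest self-generated success into (0, 1)."""
    z = encode(failed_traj)
    dists = np.linalg.norm(success_embeddings - z, axis=1)
    return float(np.exp(-dists.min() / temperature))

# Toy data: trajectories are (T, feature_dim) arrays produced by the policy itself.
successes = [rng.normal(loc=1.0, size=(30, 8)) for _ in range(5)]
success_embeddings = np.stack([encode(t) for t in successes])

near_miss = rng.normal(loc=0.9, size=(30, 8))    # almost reached the goal
early_fail = rng.normal(loc=-1.0, size=(30, 8))  # failed far from any success

print("near miss reward :", progressive_reward(near_miss, success_embeddings))
print("early fail reward:", progressive_reward(early_fail, success_embeddings))
# The near miss scores a noticeably higher shaped reward, giving the RL update a
# gradient even though both rollouts would earn zero under a sparse success flag.
```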
AI大家说 | With heavyweight guests gathered, what has the Dwarkesh Podcast been discussing lately?
红杉汇· 2025-12-11 00:04
Key takeaways at a glance: In 2025, a podcast called the Dwarkesh Podcast became one of the most important channels for first-hand information inside the AI industry — arguably must-watch viewing for the Silicon Valley AI circle. From Satya Nadella to Andrej Karpathy to Ilya Sutskever, core industry figures who are normally hard to book have all chosen to sit for long, in-depth conversations there. In this issue we share the newest and most widely discussed episodes and their core ideas.

Ilya Sutskever — former OpenAI chief scientist, computer scientist, founder of SSI

■ Insight 1: the era of the "brute-force aesthetic" of mindlessly stacking compute has already turned the page. For the past five years everyone has been talking about scaling laws, as if with enough GPUs and enough data — feed in the whole internet — AGI would emerge automatically. But Ilya poured cold water on this directly. He said pre-training has begun to wane: the data is nearly used up, and at the next stage (RL and post-training) "big" alone is no longer enough. We are back to a pre-2012-style, hand-crafted era that runs on taste and intuition (an "Age of Research").

■ Insight 2: "emotion" is not a burden on humans but a gift from evolution. We usually think of AI as rational and humans as emotional ...
Nanjing University, LibLib.ai, and the Institute of Automation, Chinese Academy of Sciences jointly propose PosterCopilot, a poster-design large model for layout reasoning and precise editing
机器之心· 2025-12-10 08:13
Core Viewpoint
- The article discusses the development of PosterCopilot, a professional-level poster design and editing model that addresses significant challenges in graphic design automation, particularly in layout reasoning and controllable editing [2][6][40].

Industry Pain Points
- Graphic design faces substantial challenges in achieving true automation, with existing models like Stable Diffusion struggling with layered structures, leading to material distortion and lack of fine control [6].
- Current multimodal models exhibit four critical shortcomings: severe element overlap, lack of visual feedback, regression to a single ground truth, and inability to perform layer-specific edits [8][10].

Core Achievements
- PosterCopilot aims to bridge the gap between single-step generation and professional workflows through a systematic solution that incorporates a three-stage training strategy [13][14].
- The innovative three-stage training includes: 1. Perturbation Supervised Fine-Tuning (PSFT) to address geometric distortions [15]. 2. Visual-Reality Alignment Reinforcement Learning (RL-VRA) to correct overlaps and proportional issues [15]. 3. Aesthetic Feedback Reinforcement Learning (RLAF) to encourage exploration beyond ground-truth layouts [15].

Generative Agent
- PosterCopilot functions as a comprehensive design assistant, facilitating seamless transitions from abstract design concepts to concrete materials through a reception model and T2I model [16][17].
- The model supports various professional scenarios, including full poster generation from provided assets, intelligent completion of missing materials, global theme transitions, intelligent size reconstruction, and multi-round fine-grained editing [21][23][28][29][31].

Experimental Results
- PosterCopilot outperforms existing commercial competitors and state-of-the-art models across multiple metrics, achieving an average win rate exceeding 74% in human evaluations [34][35].
- In assessments of layout rationality, text legibility, and element preservation, PosterCopilot demonstrates superior performance compared to models like Microsoft Designer and CreatiPoster [35][37].

Conclusion and Outlook
- By decoupling layout reasoning from generative editing and incorporating reinforcement learning to align with human aesthetics, PosterCopilot sets a new benchmark for intelligent design tools and offers a new paradigm for AI-assisted creative workflows [40].
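As one concrete, hedged illustration of the kind of signal RL-VRA targets (the actual reward design is not described here), the snippet below scores a candidate layout by penalizing pairwise overlap between element bounding boxes; the box format and weighting are assumptions made for this example.

```python
# Illustrative layout-overlap reward (not PosterCopilot's actual reward).
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in normalized page coords

def overlap_area(a: Box, b: Box) -> float:
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    return w * h

def layout_reward(boxes: List[Box], weight: float = 5.0) -> float:
    """1.0 for a perfectly non-overlapping layout, lower as elements collide."""
    penalty = sum(
        overlap_area(boxes[i], boxes[j])
        for i in range(len(boxes))
        for j in range(i + 1, len(boxes))
    )
    return 1.0 - weight * penalty

clean = [(0.05, 0.05, 0.95, 0.30), (0.05, 0.35, 0.60, 0.90)]  # title + image apart
messy = [(0.05, 0.05, 0.95, 0.30), (0.05, 0.20, 0.60, 0.90)]  # image under the title

print("clean layout reward:", layout_reward(clean))  # 1.0
print("messy layout reward:", layout_reward(messy))  # penalized for the overlap
```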