Reinforcement Learning
Ant Group Hiring: Large Model Data Intelligence Algorithm Engineer (Internal Referral Available)
自动驾驶之心· 2025-09-15 23:33
Core Viewpoint
- The article discusses the responsibilities and requirements for a position focused on developing advanced algorithms for large model data production, emphasizing the importance of data knowledge systems, automatic classification, authoritative evaluation sets, quality assessment, and innovative solutions in the field of artificial intelligence and deep learning [1][2][3].

Group 1: Responsibilities
- The role involves designing and developing algorithms to address key issues in large model data production, including data knowledge system generation, automatic corpus classification, authoritative evaluation set construction, and quality assessment of training data [1][5].
- Specific tasks include researching automatic knowledge graph generation based on LLMs, developing classification algorithms, and creating standardized evaluation sets to assess model performance [1][5].
- The position also requires establishing a data-driven system for quality assessment, identifying low-quality data, and synthesizing training data to improve model performance [1][5].

Group 2: Requirements
- Candidates should hold a master's degree or higher in computer science, artificial intelligence, deep learning, or a related field, and be proficient in deep learning frameworks such as PyTorch and TensorFlow [2][6].
- Strong problem-solving skills, self-motivation, and the ability to analyze and address issues are essential, along with effective communication and coordination abilities [2][6].
- Preference is given to candidates with practical experience in large model data system design, corpus classification, evaluation set construction, and data annotation algorithms [3][4][6].
Paper Walkthrough: HKUST's PLUTO, the First Planner to Surpass Rule-Based Baselines!
自动驾驶之心· 2025-09-15 23:33
Core Viewpoint
- The article discusses the development and features of the PLUTO model within the end-to-end autonomous driving domain, emphasizing its unique two-stage architecture and its direct encoding of structured perception outputs for downstream control tasks [1][2].

Summary by Sections

Overview of PLUTO
- PLUTO is characterized by its three main losses: regression loss, classification loss, and imitation learning loss, which collectively contribute to the model's performance [7].
- Additional auxiliary losses are incorporated to aid model convergence [9].

Course Introduction
- The article introduces a new course titled "End-to-End and VLA Autonomous Driving," developed in collaboration with top algorithm experts from domestic leading manufacturers, aimed at addressing the challenges faced by learners in this rapidly evolving field [12][15].

Learning Challenges
- The course addresses the difficulties learners face due to the fast-paced development of the technology and the fragmented nature of knowledge across domains, which make it hard for beginners to grasp the necessary concepts [13].

Course Features
- The course is designed to provide quick entry into the field, build a framework for research capabilities, and combine theory with practical applications [15][16][17].

Course Outline
- The course consists of several chapters covering topics such as the history and evolution of end-to-end algorithms, background knowledge on various technologies, and detailed discussions of both one-stage and two-stage end-to-end methods [20][21][22][29].

Practical Application
- The course includes practical assignments, such as RLHF fine-tuning, allowing students to apply their theoretical knowledge in real-world scenarios [31].

Instructor Background
- The instructor, Jason, has a strong academic and practical background in cutting-edge end-to-end and large-model algorithms, contributing to the course's credibility [32].

Target Audience and Expected Outcomes
- The course is aimed at individuals with a foundational understanding of autonomous driving and related technologies, with the goal of elevating their skills to the level of an end-to-end autonomous driving algorithm engineer within a year [36].
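The three-loss structure described above can be sketched as a weighted sum. The helper below is an illustrative reconstruction in plain NumPy; the function names, weights, and tensor shapes are assumptions for exposition, not PLUTO's published configuration.

```python
import numpy as np

def smooth_l1(pred, gt):
    # Huber-style regression loss, elementwise then averaged
    d = np.abs(pred - gt)
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).mean()

def cross_entropy(logits, target_idx):
    # classification loss over candidate-trajectory scores
    z = logits - logits.max()
    logp = z - np.log(np.exp(z).sum())
    return -logp[target_idx]

def total_loss(pred_traj, gt_traj, score_logits, target_idx,
               student_act, expert_act, aux=0.0,
               w=(1.0, 1.0, 1.0, 0.5)):
    reg = smooth_l1(pred_traj, gt_traj)             # regression loss
    cls = cross_entropy(score_logits, target_idx)   # classification loss
    imi = ((student_act - expert_act) ** 2).mean()  # imitation learning loss
    return w[0] * reg + w[1] * cls + w[2] * imi + w[3] * aux  # + auxiliary terms
```

In practice the auxiliary terms mentioned in the article would be additional scalar losses summed into `aux` to stabilize convergence.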
This ByteDance Paper Could Help Li Auto
理想TOP2· 2025-09-15 15:32
On September 11, 2025, ByteDance released "Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents". It matters for Li Auto because Li Auto intends to build agents and will very likely draw on this work, since it will run into the same problem: a built-in, harmful coupling between the strength of the learning signal (gradient magnitude) and the model's uncertainty at decision time (entropy).

This is actually quite similar to how humans learn. As long as the outcome is correct, the correctness of each step tends to be over-reinforced (by analogy: once sales are high, everything you did looks right). On a wrong path taken with high confidence, there is little reflection and the error never gets corrected; an error hit during tentative exploration makes the learner timid and afraid to keep exploring.

Confident and correct steps that should be strongly reinforced receive only a slight update; confident but wrong steps that should be heavily penalized likewise receive only a slight update; meanwhile, the uncertain exploratory steps that should be treated with caution absorb the most violent rewards and punishments, making training very unstable. ByteDance's paper offers an approach to solving this class of problem.

A more detailed discussion follows. At its core, the paper tackles a central dilemma in current LLM agent training: in long tasks where the final outcome is all-or-nothing (i.e., sparse reward), how do you know which step's decision to reward or punish? In traditional reinforcement learning, the agent ...
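The coupling described above falls out of the softmax policy gradient itself: the gradient of log pi(a) with respect to the logits shrinks as confidence in a grows, so confident steps (right or wrong) get tiny updates while uncertain ones get large swings. A minimal NumPy sketch; the rescaling at the end is a hypothetical illustration of "entropy modulation", not the exact rule from the ByteDance paper.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def logit_grad(probs, a):
    # For a softmax policy, d/dz log pi(a) = one_hot(a) - probs:
    # the update vanishes as pi(a) -> 1 and is largest when uncertain.
    g = -probs.copy()
    g[a] += 1.0
    return g

confident = softmax(np.array([6.0, 0.0, 0.0]))   # p(a=0) ~ 0.995
uncertain = softmax(np.array([0.1, 0.0, 0.0]))   # p(a=0) ~ 0.36

g_conf = np.linalg.norm(logit_grad(confident, 0))
g_unc = np.linalg.norm(logit_grad(uncertain, 0))
assert g_conf < g_unc  # the confident (possibly crucial) step gets the weaker update

def entropy(p):
    return -(p * np.log(p)).sum()

def modulated_grad(probs, a, eps=1e-2):
    # Hypothetical entropy modulation: damp updates on high-entropy
    # (uncertain) steps relative to low-entropy ones. Illustrative only.
    return logit_grad(probs, a) / (entropy(probs) + eps)
```

The assertion makes the pathology concrete: the near-certain step's gradient norm is two orders of magnitude smaller than the uncertain step's, which is exactly the inverted weighting the paper argues against.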
Charging into the New Energy First Tier: Buick Zhijing L7, the "New Benchmark for Range-Extended Luxury Sedans," Makes Its Nationwide Debut
Yang Zi Wan Bao Wang· 2025-09-15 13:57
Core Viewpoint
- The Buick Zhijing L7, a luxury electric vehicle, has been unveiled as the flagship model of Buick's high-end electric sub-brand, showcasing advanced technology and luxury features aimed at redefining the range-extended vehicle segment [1][3].

Group 1: Vehicle Features
- The Zhijing L7 is built on the new Buick "Xiaoyao" super fusion architecture, integrating top technologies in driving, assisted driving, and luxury comfort [3].
- It features the "Zhenlong" range-extending system, which offers a maximum power output of 252 kW, equivalent to a 3.0T V6 engine, achieving 0-100 km/h in 5.9 seconds with a combined fuel consumption of only 0.5 L per 100 km [5][7].
- The vehicle offers a pure electric range of 302 km and a total range of 1,420 km, addressing common concerns about range anxiety [5][7].

Group 2: Intelligent Driving and Experience
- The Zhijing L7 introduces the "Xiaoyao Zhixing" assisted driving system, featuring the Momenta R6 flywheel model based on end-to-end reinforcement learning, capable of handling complex driving scenarios [8].
- The assisted driving technology has accumulated over 1 billion kilometers of safe driving, positioning the vehicle among the top tier of intelligent driving experiences [8].

Group 3: Interior and Comfort
- The interior design emphasizes luxury with a spacious cabin, featuring the industry's first dual 120° zero-gravity seats for enhanced comfort [18][20].
- It is equipped with a 27-speaker Buick Sound theater-level audio system, providing an immersive sound experience akin to a top-tier concert hall [18].

Group 4: Design and Aesthetics
- The Zhijing L7 showcases a striking exterior design inspired by nature, with a luxurious silhouette and advanced features such as lidar and high-end lighting [14][16].
- The interior adopts a new pure "floating island" design aesthetic, creating a sophisticated and elegant atmosphere [16].

Group 5: Market Positioning
- As a representative of Buick's redefined brand value in the new energy era, the Zhijing L7 aims to compete in the first tier of the new energy vehicle market, leveraging its advanced range-extending technology and superior luxury experience [20].
Zhang Xiaojun in Conversation with OpenAI's Yao Shunyu: A System for Generating New Worlds
Founder Park· 2025-09-15 05:59
Core Insights
- The article discusses the evolution of AI, particularly focusing on the transition to the "second half" of AI development, emphasizing the importance of language and reasoning in creating more generalizable AI systems [4][62].

Group 1: AI Evolution and Language
- The concept of AI has evolved from rule-based systems to deep reinforcement learning, and now to language models that can reason and generalize across tasks [41][43].
- Language is highlighted as a fundamental tool for generalization, allowing AI to tackle a variety of tasks by leveraging reasoning capabilities [77][79].

Group 2: Agent Systems
- The definition of an "Agent" has expanded to include systems that interact with their environment and make decisions based on reasoning, rather than just following predefined rules [33][36].
- The development of language agents represents a significant shift, as they can perform tasks in more complex environments, such as coding and internet navigation, which were previously challenging for AI [43][54].

Group 3: Task Design and Reward Mechanisms
- The article emphasizes the importance of defining effective tasks and environments for AI training, suggesting that the current bottleneck lies in task design rather than model training [62][64].
- A focus on intrinsic rewards, which are based on outcomes rather than processes, is proposed as a key factor for successful reinforcement learning applications [88][66].

Group 4: Future Directions
- The future of AI development is seen as a combination of enhancing agent capabilities through better memory systems and intrinsic rewards, and exploring multi-agent systems [88][89].
- The potential for AI to generalize across various tasks is highlighted, with coding and mathematical tasks serving as prime examples of areas where AI can excel [80][82].
Cracking RL's "Slowest Link"! SJTU and ByteDance Join Forces to Speed Up Large Model RL Training by 2.6x
量子位· 2025-09-13 08:06
Yunzhong | QbitAI (WeChat official account: QbitAI)

Reinforcement learning training efficiency is far too low! As DeepSeek, GPT-4o, Gemini, and other models battle it out, reinforcement learning (RL) is undoubtedly the key behind large models' "deep thinking" ability.

Yet behind this race, a huge bottleneck is quietly limiting every player's speed: compared with pretraining and inference, RL training looks more like an inefficient "manual workshop", with enormous input but slow output. The Rollout (response generation) stage, which takes up over 80% of the time, has become the acknowledged Achilles' heel of the entire AI infrastructure, owing to its memory-bandwidth limits and autoregressive nature.

How to conquer this last stronghold of AI infrastructure? Now, a research team from Shanghai Jiao Tong University and ByteDance has given a brand-new answer. Their jointly developed RhymeRL starts from an overlooked phenomenon, cleverly turning historical data from waste into treasure, and boosts RL training throughput by 2.6x without sacrificing accuracy.

Model-generated answers exhibit two kinds of "historical similarity"

The team analyzed a large number of RL training runs and found that, across two adjacent training epochs, even though the model weights have been updated, the answers (rollouts) the model generates for the same prompt exhibit two kinds of "historical similarity": First, sequence similarity. The new answer ...
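The reuse of "historically similar" rollouts can be illustrated with a speculative-decoding-style accept-prefix loop. This toy sketch is an assumed mechanism loosely inspired by the description above, not RhymeRL's actual algorithm; in a real system the draft tokens would be verified in one batched forward pass, which is where the throughput win comes from.

```python
# Toy accept-prefix reuse of a previous epoch's rollout for the same prompt.
def generate_with_history(next_token, prompt, old_rollout, max_new=32):
    out, ctx = [], list(prompt)
    for draft in old_rollout:           # replay last epoch's answer as a draft
        tok = next_token(ctx)
        if tok is None:                 # end of sequence
            return out
        out.append(tok)
        ctx.append(tok)
        if tok != draft:                # policies diverged: stop reusing history
            break
    while len(out) < max_new:           # finish with ordinary decoding
        tok = next_token(ctx)
        if tok is None:
            break
        out.append(tok)
        ctx.append(tok)
    return out

# Usage with a deterministic stand-in for the updated policy:
target = ["B", "C", "D", "E"]
def next_token(ctx):                    # always continues toward `target`
    i = len(ctx) - 1                    # ctx starts with the 1-token prompt
    return target[i] if i < len(target) else None

# The old rollout agreed on the first two tokens, then diverged:
assert generate_with_history(next_token, ["A"], ["B", "C", "X", "Y"]) == ["B", "C", "D", "E"]
```

The longer the agreed prefix between epochs (the "sequence similarity" the team observed), the more generation work can be amortized into cheap verification.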
How Should You Prepare for RL Interview Questions?
自动驾驶之心· 2025-09-12 16:03
Author | Abel chen  Editor | 自动驾驶之心
Original link: https://zhuanlan.zhihu.com/p/1948681769332240910

1. Is GRPO on-policy or off-policy? Why?

Short answer: GRPO, as originally designed and commonly implemented, is on-policy (online / proximal-policy style); but it can be extended to off-policy, and there is existing work specifically studying such extensions and their trade-offs.

Why it is on-policy (explanation)

Why some argue it can be off-policy (extension)

Recent work has generalized the GRPO idea to off-policy settings (for instance, using data from other policies or old batches to estimate advantages and apply corrections), reporting potential benefits and trade-offs in sample efficiency and stability. In other words, although GRPO is essentially built on an on-policy surrogate objective, mathematically and in engineering practice one can design importance sampling, in-batch normalization, or clipping tricks to turn it into an off-policy version.

Practical advice (brief) ...
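A minimal sketch of the mechanics discussed in this Q&A: GRPO's group-relative advantage plus a PPO-style clipped ratio, where feeding in log-probabilities from an older sampling policy is exactly the kind of off-policy extension the answer mentions. The function names are illustrative, not from any specific paper.

```python
import numpy as np

def grpo_advantages(rewards):
    # Group-relative advantage: normalize rewards across the group of
    # responses sampled for the same prompt (the core of GRPO; no critic).
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def clipped_surrogate(logp_new, logp_old, advantages, clip=0.2):
    # PPO-style clipped objective. With logp_old taken from an older
    # sampling policy, the importance ratio corrects for mild off-policy
    # reuse, the kind of extension discussed above.
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip) * advantages
    return -np.minimum(unclipped, clipped).mean()
```

In the strictly on-policy case `logp_new == logp_old`, the ratio is 1 everywhere and the objective reduces to the plain advantage-weighted loss; the clipping only bites once the data is stale.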
Why Doesn't GPT-5 "Talk Nonsense" Anymore? OpenAI's New Paper Explains It
腾讯研究院· 2025-09-12 08:58
Core Viewpoint
- The article discusses the advancements and challenges of OpenAI's GPT-5, particularly focusing on the significant reduction in hallucination rates compared to previous models, while also highlighting the underlying mechanisms and implications of these changes [5][6][25].

Group 1: Hallucination Rates and Mechanisms
- GPT-5 has a hallucination rate approximately 45% lower than GPT-4 and about 80% lower than OpenAI's earlier models [6].
- The reduction in hallucination rates is attributed to enhanced reinforcement learning techniques that allow models to refine their reasoning processes and recognize their errors [8][9].
- The paper published by OpenAI indicates that hallucinations are an inevitable byproduct of the statistical learning nature of language models, making it harder to generate reliable information than to assess its reliability [12][16].

Group 2: Theoretical Framework
- OpenAI introduces a theoretical "Is-It-Valid" (IIV) judgment mechanism that determines the validity of generated sentences based on their internal probabilities [13].
- The model's tendency to generate plausible-sounding but incorrect information is exacerbated by data sparsity, complexity, and noise in the training data [14][16].
- The paper's mathematical conclusion is that the error rate of generative models is at least double the IIV judgment error rate, indicating a compounding effect of judgment mistakes on hallucinations [15][16].

Group 3: Post-Training Challenges
- Post-training processes have not effectively mitigated hallucinations, as current evaluation metrics tend to reward models for providing confident but potentially incorrect answers [18][24].
- The article critiques the binary scoring systems used in mainstream AI evaluations, which penalize uncertainty and discourage models from answering "I don't know" [21][24].
- Reinforcement learning processes that rely on binary reward paths may inadvertently promote overconfidence in models, leading to increased hallucination rates [27][29].

Group 4: Future Directions and Solutions
- The article suggests that introducing a penalty-based scoring mechanism during post-training could help models better calibrate their confidence levels and reduce hallucinations [33].
- A shift from score optimization to a truth-oriented approach is proposed as a potential solution to the hallucination problem [34].
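The case for a penalty-based scoring mechanism comes down to a two-line expected-value calculation: under binary scoring a model should always guess, while a wrong-answer penalty makes abstaining optimal below a confidence threshold. A minimal sketch, with the scoring parameters as illustrative assumptions rather than any benchmark's actual rubric:

```python
def expected_score(confidence, wrong_penalty):
    # Answering scores +1 if right, -wrong_penalty if wrong;
    # saying "I don't know" scores 0.
    return confidence * 1.0 - (1.0 - confidence) * wrong_penalty

def should_answer(confidence, wrong_penalty):
    return expected_score(confidence, wrong_penalty) > 0.0

# Binary scoring (penalty 0): guessing beats abstaining at any confidence,
# which is exactly the incentive for overconfident hallucination.
assert should_answer(0.05, wrong_penalty=0.0)

# With a 1-point penalty, guessing only pays off above 50% confidence.
assert not should_answer(0.40, wrong_penalty=1.0)
assert should_answer(0.60, wrong_penalty=1.0)
```

The threshold is confidence > wrong_penalty / (1 + wrong_penalty), so the penalty directly sets how calibrated a model must be before answering is worthwhile.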
Going Viral Overnight: 27-Year-Old Yao Shunyu Leaves OpenAI; Is the Tsinghua Yao Class Prodigy Turning Product Manager?
36Kr· 2025-09-12 04:04
Core Insights
- The news highlights the significant attention surrounding Shunyu Yao, a prominent AI talent, and the implications of his potential recruitment by Tencent, which has been officially denied [1][6].
- Yao's expertise and contributions to OpenAI's Deep Research make him a highly sought-after figure in the AI industry, with rumors of a 100 million RMB salary circulating, reflecting the competitive landscape for top AI talent [3][4].

Group 1: Shunyu Yao's Background and Achievements
- Shunyu Yao, aged 27, is a graduate of Tsinghua University and Princeton University, recognized for his exceptional academic performance and contributions to AI research [7][11].
- He has been a core contributor to OpenAI's projects, including the development of intelligent agents and digital automation tools, which are pivotal for advancing AI capabilities [5][11].
- His research has garnered significant recognition, with over 15,000 citations, indicating his influence in the field of AI [11][12].

Group 2: Industry Implications
- The recruitment of top AI talent like Yao signals a deeper shift in the global AI talent ecosystem, as companies vie for expertise to drive innovation [6][19].
- Yao's view that evaluation matters more than training in AI development suggests a potential paradigm shift in how AI models are assessed and improved, emphasizing the need for practical applications [18][20].
- Competitive salary offers from companies like Meta, which reportedly reached 100 million USD for core researchers, highlight the escalating financial stakes in attracting leading AI professionals [3][4].
Bund Summit Dispatch (1): Sutton Proposes a New Paradigm for AI Development, with Reinforcement Learning and Multi-Agent Collaboration as the Keys
Investment Rating
- The report does not explicitly provide an investment rating for the industry or the specific companies within it.

Core Insights
- Richard Sutton proposes that we are entering an "Era of Experience" characterized by autonomous interaction and environmental feedback, emphasizing the need for systems that can create new knowledge through direct interaction with their environments [1][8].
- Sutton argues that public fears regarding AI, such as bias and unemployment, are overstated, and that multi-agent cooperation can lead to win-win outcomes [9].
- The report highlights continual learning and meta-learning as the key areas for unlocking the potential of reinforcement learning [3][13].

Summary by Sections

Event
- Sutton's presentation at the 2025 INCLUSION Conference outlines a shift from static knowledge transfer to dynamic agent-environment interactions, marking the transition to an "Era of Experience" [1][8].
- He identifies reinforcement learning as crucial for this transition, but notes that its full potential is contingent on advances in continual learning and meta-learning [1][8].

Commentary
- The report discusses the shift from "data as experience" to "capability as interaction," suggesting that firms need to develop systems that can actively engage with their environments to generate new knowledge [2][11].
- It emphasizes that the real bottleneck in reinforcement learning is not model parameters but the ability to handle time and task sequences, highlighting the need for continual-learning and meta-learning capabilities [3][13].

Technical Bottlenecks
- The report identifies two main constraints in reinforcement learning: the need for continual learning to avoid catastrophic forgetting, and the need for meta-learning to enable rapid adaptation across tasks [3][13].
- It suggests that R&D should focus on long-horizon evaluation and the integration of memory mechanisms and planning architectures [3][13].

Decentralized Collaboration
- The report posits that decentralized collaboration is not only a technical choice but also a governance issue, requiring clear incentives and transparent protocols to function effectively [4][12].
- It outlines three foundational institutional requirements for effective decentralized collaboration: open interfaces, cooperation-competition testbeds, and auditability [4][12].

Replacement Dynamics
- Sutton's view on "replacement" suggests that it will occur at the task level rather than across entire job roles, urging organizations to proactively deconstruct tasks and redesign processes for human-AI collaboration [5][15].
- The report recommends establishing a clear human-AI division of labor and reforming performance metrics to focus on collaborative efficiency [5][15].