Reinforcement Learning

Tencent Research Institute AI Express 20250929
腾讯研究院· 2025-09-28 16:01
Group 1: OpenAI and Model Changes
- OpenAI has been reported to reroute models like GPT-4 and GPT-5 to lower-capacity sensitive models without user knowledge [1]
- The rerouting occurs when the system detects sensitive topics, and this judgment is based on subjective context [1]
- OpenAI's VP stated that the changes are temporary and part of testing a new safety routing system, raising user concerns about rights [1]

Group 2: Tencent's Hunyuan Image 3.0
- Tencent launched Hunyuan Image 3.0, the first industrial-grade native multimodal model with 80 billion parameters, recognized as the largest open-source model of its kind [2]
- The model excels in semantic understanding, capable of parsing complex semantics and generating both long and short texts with high aesthetic quality [2]
- Hunyuan Image 3.0 is based on Hunyuan-A13B, trained on 5 billion image-text pairs and 6 trillion tokens, and is available under the Apache 2.0 license [2]

Group 3: Kuaishou's KAT Series
- Kuaishou's Kwaipilot team introduced the KAT-Dev-32B (open-source) and KAT-Coder (closed-source) models, with KAT-Dev-32B achieving a 62.4% solution rate on SWE-Bench Verified [3]
- KAT-Coder reached a 73.4% solution rate, comparable to top closed-source models, utilizing a chained training structure [3]
- The team developed entropy-based tree pruning technology and a large-scale reinforcement learning training framework, observing new capabilities in dialogue and tool usage [3]

Group 4: AI Teachers by TAL Education
- TAL Education's CTO proposed a grading theory for AI teachers, evolving from assistants (L2) to true teacher roles (L3) [4]
- L3 AI teachers can observe students' problem-solving steps in real time and provide targeted guidance, forming a data feedback loop [5]
- The "XiaoSi AI One-on-One" program supports personalized education across various learning environments, achieving 98.1% accuracy in math problem-solving [5]

Group 5: Meta's Humanoid Robots
- Meta plans to invest billions in humanoid robot development, equating its importance to its augmented reality projects [6]
- The focus will be on software development rather than hardware manufacturing, aiming to create industry standards [6]
- A new "Superintelligent AI Lab" is collaborating with robotics teams to build a "world model" simulating real physical laws [6]

Group 6: Richard Sutton's Critique of Language Models
- Richard Sutton criticized large language models as a flawed starting point, emphasizing that true intelligence comes from experiential learning [7]
- He argued that large models lack the ability to predict real-world events and do not adapt to changes in the external world [7]
- Sutton advocates a learning approach based on actions, observations, and continuous learning as the essence of intelligence [7]

Group 7: RLMT Method by Chen Danqi
- Chen Danqi's team proposed the RLMT method, integrating explicit reasoning into general chat models to bridge the gap between specialized reasoning and general dialogue capabilities [8]
- RLMT combines preference alignment and reasoning abilities, requiring models to generate reasoning paths before final answers [8]
- Experiments show RLMT models excel on chat benchmarks, shifting reasoning styles toward the iterative thinking of skilled writers [9]

Group 8: DeepMind's Veo 3 Emergence
- DeepMind's Veo 3 demonstrates four progressive capabilities: perception, modeling, manipulation, and reasoning [10]
- The concept of Chain-of-Frames (CoF) allows Veo 3 to perform cross-temporal reasoning through frame-by-frame video generation [10]
- Quantitative assessments indicate significant improvements over Veo 2, suggesting video models are becoming foundational for visual tasks [10]

Group 9: NVIDIA's Future in AI Infrastructure
- NVIDIA is transitioning from a chip company to an AI infrastructure partner, focusing on total cost advantages rather than individual chips [11]
- AI inference is expected to grow by a factor of a billion, driven by three scaling laws, potentially accelerating global GDP growth [11]
- Huang Renxun (Jensen Huang) emphasizes the need for independent AI infrastructure in the sovereign AI era, advocating for maximizing influence through technology exports [11]
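The "entropy-based tree pruning" mentioned in Group 3 can be illustrated with a generic sketch: discard candidate branches of a search tree whose next-token distribution is high-entropy, i.e., where the model is uncertain. The threshold, data layout, and function names below are invented for illustration; this is not Kwaipilot's published algorithm.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def prune_branches(branches, max_entropy=1.0):
    """Keep only branches where the model is confident (low entropy).
    A toy sketch of the entropy-based pruning idea, not the actual KAT code."""
    return [b for b in branches if token_entropy(b["probs"]) <= max_entropy]

branches = [
    {"id": "a", "probs": [0.9, 0.05, 0.05]},        # confident -> kept
    {"id": "b", "probs": [0.25, 0.25, 0.25, 0.25]},  # uncertain -> pruned
]
kept = prune_branches(branches)
```

A uniform distribution over four tokens has entropy ln 4 ≈ 1.39 and is pruned at the default threshold, while the peaked distribution (≈ 0.39) survives.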
Do humanoid robots need a "third hand"? Tsinghua University professor Zhao Mingguo: intelligence advances through gradual breakthroughs
Zhong Guo Jing Ying Bao· 2025-09-28 14:41
Zhao Mingguo argues that we should not use a single, one-size-fits-all percentage to describe and position how intelligent humanoid robots are. "People should handle what people ought to do, and robots should handle what robots ought to do; as time goes on, humanoid robots' perception and understanding abilities will gradually strengthen. The intelligence of humanoid robots is a process of gradual breakthroughs."

When we point the camera at humanoid robots, we can see them fall on the track and get back up, "walk blind" over complex obstacle-strewn ground while keeping their balance, or complete pass after pass, reception, and shot in a 5v5 robot soccer match. These moments are impressive enough.

However, a reporter from China Business Journal noted that once they leave the controlled arena or certain preset scenarios, when facing an unlabeled obstacle or an unfolded piece of clothing, they often need a human to step in and teleoperate them. "Manual teleoperation" has thus become an indispensable "third hand" for humanoid robots. This is the two-sided reality of humanoid robot development: humanoid robots already have strong motor capabilities, while environmental understanding and cross-scenario generalization remain unsolved weaknesses.

"Today's humanoid robots, relying on their proprioceptive abilities, can walk, run, jump, do somersaults, or walk over ground with obstacles; completing such complex movements requires a high level of intelligence. However, many humanoid robots still lack the ability to understand their environment. For example, when there is an obstacle ahead, can the robot autonomously go around it? Can it open a door on its own? There are still shortcomings in these areas ...
Express | Applied Compute, founded by former OpenAI employees, raises at a $500 million valuation in a round led by Lux Capital
Z Potentials· 2025-09-28 14:29
Core Insights
- Investors are increasingly funding startups that focus on automating tasks using reinforcement learning (RL) technology, as developers rely more on this approach to optimize AI models [1][4]
- Applied Compute, founded by three former OpenAI employees, is negotiating a new funding round at a valuation of $500 million, just three months after raising $100 million [1][2]

Group 1: Company Overview
- Applied Compute aims to assist software developers and enterprises in utilizing RL technology to create customized AI systems for specific sectors such as law and finance [2][3]
- The company has previously raised $20 million from investors including Benchmark, Conviction, and Sequoia Capital [2]

Group 2: Founders and Background
- The founders, Rhythm Garg, Yash Patil, and Linden Li, are Stanford University alumni who worked on developing ChatGPT's reasoning model and other AI tools at OpenAI before founding the company [3]
- Other companies, such as Thinking Machines Lab, co-founded by former OpenAI CTO Mira Murati, are also planning to offer RL services to enterprises [3][4]

Group 3: Market Trends and Technology
- Reinforcement learning is becoming a key technology for AI developers, helping improve models by rewarding desired behaviors and penalizing others [4]
- The potential for RL to automate tasks in various fields is significant, with expectations that the entire economy could evolve into a "reinforcement learning machine" [4]
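The reward-shaping idea the article describes, rewarding desired behaviors and penalizing undesired ones, can be sketched minimally. The behavior strings and scoring below are hypothetical; production RL fine-tuning uses learned reward models over full trajectories, not substring checks.

```python
def reward(transcript, desired, undesired):
    """Toy reward: +1 per desired behavior observed in the transcript,
    -1 per undesired one. Illustrates the shaping idea only."""
    score = 0
    for d in desired:
        if d in transcript:
            score += 1
    for u in undesired:
        if u in transcript:
            score -= 1
    return score

r = reward("cites the statute and stays on topic",
           desired=["cites the statute"],
           undesired=["hallucinated case"])
```

A policy trained against such a signal is nudged toward the desired behaviors; the article's point is that startups like Applied Compute package this loop for domain-specific systems.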
With a limited-time promotional price of 169,900 yuan, the Buick Zhijing L7 goes on sale
Bei Jing Shang Bao· 2025-09-28 13:32
Core Viewpoint
- SAIC-GM's Buick brand officially launched the Zhijing L7, offering five models with a limited-time price range of 169,900 to 215,900 yuan [1]

Group 1: Product Features
- The Zhijing L7 is the first flagship sedan of Buick's high-end new energy sub-brand "Zhijing," built on the Buick "Xiaoyao" super fusion architecture [3]
- It is equipped with the "Zhenlong" range extension system, featuring a 252 kW range-extended single electric drive, a dedicated 1.5T hybrid engine, and a 100 kW peak-power generator [3]
- The vehicle achieves a low comprehensive energy consumption of 0.5 liters per 100 kilometers and offers a pure electric range of 302 kilometers, with a total range of 1,420 kilometers [3]
- The "Zhenlong" range extension system supports 130 kW fast charging, allowing for a 30% to 80% charge in just 18 minutes [3]

Group 2: Technology and Innovation
- The Zhijing L7 features Buick's "Xiaoyao Zhixing" assisted driving system and is the first globally to adopt the Momenta R6 flywheel model based on end-to-end "reinforcement learning" [3]
- It is also equipped with Qualcomm's latest SA8775P chip, providing 72 TOPS of AI computing power to enhance the intelligent cabin experience, offering immersive and natural interaction tailored to various travel scenarios [3]
RLHF and RLVR, both at once: the latest work from Chen Danqi's team extends reasoning ability to general intelligence
机器之心· 2025-09-28 04:50
机器之心 report | Editor: 冷猫

A month ago, we reported that Chen Danqi, a Tsinghua Yao Class alumna and Princeton professor, appeared to be joining Thinking Machines Lab. Some reports suggested she would leave Princeton after a year of sabbatical and join Thinking Machines Lab full time.

Recently, Chen Danqi's team at Princeton University published its latest research, showing that the RLVR paradigm remains effective beyond verifiable domains. The work proposes Reinforcement Learning with Model-rewarded Thinking (RLMT), a method that integrates explicit chain-of-thought reasoning into general chat models.

Paper title: Language Models that Think, Chat Better
Paper link: https://www.arxiv.org/overview/2509.20357v1

Thinking about the consequences of one's own actions and correcting course when necessary is one of the core traits of human intelligence.

As is well known, large language models traditionally follow a multi-stage training paradigm: pre-training on large-scale text corpora, then supervised fine-tuning to learn instruction following, and finally reinforcement learning to align with human preferences. This approach has indeed produced powerful conversational AI systems, but a key limitation remains: the reasoning abilities acquired through reinforcement learning with verifiable rewards (RLVR) in domains such as math and programming, ...
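The core mechanic of RLMT, the model emits an explicit reasoning trace before its final answer, can be sketched as a parsing step. The `<think>...</think>` delimiter here is an assumption chosen for illustration; the paper's actual trace format may differ.

```python
import re

def split_thought_and_answer(completion):
    """Split an RLMT-style completion into its reasoning trace and final
    answer. Returns (None, completion) when no explicit trace is present.
    The <think> delimiter is illustrative, not the paper's exact format."""
    m = re.match(r"<think>(.*?)</think>\s*(.*)", completion, re.DOTALL)
    if not m:
        return None, completion
    return m.group(1).strip(), m.group(2).strip()

thought, answer = split_thought_and_answer(
    "<think>Plan: greet, then answer briefly.</think>Hello! Paris."
)
```

During training, only the answer span would be judged by the reward model, while the trace gives the policy room to plan, which is the gap-bridging behavior the summary describes.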
Why hasn't reinforcement learning been successfully deployed in autonomous driving?
自动驾驶之心· 2025-09-28 03:50
Core Viewpoint
- The article discusses the challenges of implementing reinforcement learning (RL) in the field of autonomous driving, particularly focusing on the issue of reward hacking and the balance between safety and efficiency [2][3]

Group 1: Challenges in Reinforcement Learning for Autonomous Driving
- Reinforcement learning faces a significant issue known as reward hacking, where increasing safety requirements can lead to decreased efficiency, and vice versa [2]
- Designing a balanced reward system that can enhance overall performance in RL models is complex, as achieving equilibrium among multiple rewards is challenging [2]
- The application of RL in autonomous driving is complicated by the need to adhere to various driving rules during the driving process, unlike in embodied intelligence where the focus is primarily on local motion [2]

Group 2: Need for a Suitable Framework
- A crucial factor for the successful implementation of RL in autonomous driving is the development of a robust architecture that can effectively integrate with RL [3]
- Existing models in autonomous driving are unlikely to be directly applicable to RL without significant modifications [3]

Group 3: Community and Resources
- The "Autonomous Driving Knowledge Planet" community aims to provide a comprehensive platform for technical exchange and learning in the field of autonomous driving, with over 4,000 members [6][10]
- The community offers a variety of resources, including learning routes, technical discussions, and access to industry experts, to assist both beginners and advanced practitioners in the field [6][10]
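The safety/efficiency tension behind reward hacking can be made concrete with a minimal scalarized-reward sketch. The weights and scores below are invented for illustration; real driving rewards involve many more terms (comfort, rule compliance, progress).

```python
def driving_reward(safety, efficiency, w_safety=0.7, w_eff=0.3):
    """Weighted scalarization of two competing objectives in [0, 1].
    A toy sketch of the article's point: raising w_safety lets a policy
    'hack' the reward by driving arbitrarily slowly, and vice versa."""
    return w_safety * safety + w_eff * efficiency

# A policy that never moves maxes out safety and still scores 0.7,
# not far from a policy that actually drives well (0.87):
parked   = driving_reward(safety=1.0, efficiency=0.0)
balanced = driving_reward(safety=0.9, efficiency=0.8)
```

The small gap between the two scores is exactly why tuning such weights rarely yields a reward that improves all objectives at once, which the article identifies as a core obstacle.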
NeurIPS 2025 | The SURDS dataset and GRPO comprehensively strengthen spatial reasoning for autonomous driving
自动驾驶之心· 2025-09-27 23:33
The following article comes from 深蓝AI; author: 深蓝学院. 深蓝AI is a learning platform focused on artificial intelligence, robotics, and autonomous driving.

Author | 深蓝学院
Source | 深蓝AI

Abstract

Amid the rapid development of large models, getting multimodal large language models (VLMs) to perform accurate spatial reasoning on autonomous driving scene images remains a major challenge in artificial intelligence. Academia has long lacked a large-scale benchmark for reasoning in driving scenes, and existing methods often rely on external expert models, making it difficult to comprehensively measure model capability.

In sharp contrast, humans can easily judge the orientation of an object in an image or reason about the relative positions of multiple objects from prior knowledge. VLMs likewise possess rich knowledge, yet still fall short on such tasks.

To address this, Wuhan University, together with the Institute of Automation of the Chinese Academy of Sciences, the Beijing Academy of Artificial Intelligence (BAAI), and other institutions, released SURDS, the first large-scale benchmark for VLM spatial reasoning in driving scenarios. It systematically evaluates general-purpose models including the GPT series as well as spatial-reasoning models such as SpatialRGPT, comprehensively revealing current VLMs' shortcomings in spatial understanding. By designing "perception accuracy" and ...
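The GRPO in the headline (group relative policy optimization) replaces a learned value function with group-relative advantages: each sampled response's reward is normalized against the mean and standard deviation of its own sampling group. A minimal sketch of that core step, with toy rewards:

```python
import statistics

def grpo_advantages(group_rewards):
    """Group-relative advantages as used in GRPO-style training: normalize
    each reward against its own group's mean/std, so no critic is needed.
    Sketch of the core idea only, not a full training loop."""
    mu = statistics.fmean(group_rewards)
    sigma = statistics.pstdev(group_rewards) or 1.0  # guard degenerate groups
    return [(r - mu) / sigma for r in group_rewards]

# Four responses sampled for one prompt, rewarded 1 (correct) or 0 (wrong):
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Correct responses get positive advantage and wrong ones negative, purely from within-group comparison, which is what makes the method cheap enough for benchmark-driven fine-tuning like SURDS.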
SOTA even without much data? Tsinghua and Shanghai AI Lab crack two major bottlenecks in robot RL
具身智能之心· 2025-09-27 01:33
Core Insights
- The article discusses the development of SimpleVLA-RL, a new framework designed to enhance the training and generalization capabilities of Visual-Language-Action (VLA) models in robotics, addressing key limitations in existing training paradigms [4][14]

Group 1: Key Contributions of SimpleVLA-RL
- SimpleVLA-RL effectively addresses three major bottlenecks in VLA model training: high data collection costs, insufficient generalization ability, and the need for large-scale demonstration data [6][11]
- The framework demonstrates state-of-the-art (SoTA) performance in standard benchmarks such as LIBERO and RoboTwin, achieving significant improvements in success rates even with limited data [6][21]
- In scenarios with single demonstration data, the average success rate of OpenVLA-OFT in LIBERO increased from 48.9% to 96.9%, and for long-sequence tasks, it improved from 17.3% to 91.7% [6][21]

Group 2: Training Mechanism and Innovations
- The training mechanism includes interactive trajectory sampling, result reward modeling, and exploration enhancement, which collectively improve data efficiency and model performance [15][16][17]
- The result reward model simplifies the reward structure to binary outcomes (success or failure), allowing for better focus on training objectives and avoiding the complexities of process rewards [16][21]
- The exploration enhancement strategy encourages diverse exploration during training, preventing the model from converging to narrow solutions [17][19]

Group 3: Performance Metrics and Benchmark Results
- SimpleVLA-RL achieved an average success rate of 99.1% in the LIBERO benchmark, with specific improvements in long-sequence tasks, where success rates increased by 12.0 percentage points [23]
- In RoboTwin1.0, the average success rate improved from 39.8% to 70.4%, with notable gains in specific tasks such as "Blocks Stack," which saw a 33.1 percentage point increase [25]
- The framework also demonstrated significant performance improvements in RoboTwin2.0, with average success rates rising from 38.3% to 68.8%, surpassing previous models [27]

Group 4: Real-World Application and Generalization
- The model trained solely on simulation data showed enhanced adaptability to real-world tasks, with average success rates increasing from 17.5% to 38.5% in practical applications [30]
- The emergence of the "Pushcut" phenomenon indicates that the model can autonomously discover new strategies beyond human demonstrations, showcasing its potential for adaptive learning [32][34]
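The "result reward modeling" described above reduces the reward to a binary outcome per rollout, with no per-step process reward. A minimal sketch of that design choice (the trajectory format and task names are invented for illustration):

```python
def result_reward(trajectory, success):
    """SimpleVLA-RL-style outcome reward: the whole rollout gets 1.0 on
    task success and 0.0 otherwise. `trajectory` is deliberately unused,
    since no per-step (process) reward is assigned."""
    return 1.0 if success else 0.0

rollouts = [(["grasp", "lift", "place"], True),
            (["grasp", "drop"], False)]
rewards = [result_reward(t, ok) for t, ok in rollouts]
```

Because every action in a successful rollout shares the same credit, the framework pairs this sparse signal with the exploration-enhancement strategy to keep trajectory sampling diverse.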
OpenAI's two chiefs give a highly informative interview: the ultimate goal is an "automated researcher," and hiring isn't about finding the most high-profile people
36Ke· 2025-09-26 12:15
Core Insights
- OpenAI's leadership discussed the advancements and future direction of GPT-5, emphasizing its role in mainstreaming reasoning capabilities and agentic behavior [6][7][9]
- The company aims to develop an automated researcher that can discover new ideas and contribute to scientific progress [13][25]
- OpenAI's research philosophy prioritizes foundational research over short-term product competition, focusing on long-term goals [25][28]

Group 1: GPT-5 and Reasoning
- GPT-5 represents a strategic shift towards integrating reasoning capabilities into mainstream applications, moving beyond previous models that focused on immediate responses [6][7]
- The evaluation metrics used in the past are nearing saturation, prompting OpenAI to seek new ways to assess models based on their ability to discover new information and achieve practical advancements in economically relevant areas [8][9]

Group 2: Automated Researcher Goal
- OpenAI's long-term objective is to create an automated researcher capable of independently generating new ideas, starting with internal research automation before expanding to other scientific fields [13][25]
- The current reasoning capabilities of models have reached a level where they can perform complex tasks in a significantly reduced timeframe, with ongoing efforts to extend this capability [13][14]

Group 3: Reinforcement Learning (RL)
- OpenAI's reinforcement learning approach remains robust, with ongoing developments expected to simplify reward models and enhance their alignment with human learning processes [16][17]
- The company emphasizes the importance of flexibility in understanding RL, as the tools and methodologies continue to evolve rapidly [17]

Group 4: Programming and Coding
- The introduction of GPT-5-codex aims to optimize programming tasks, addressing previous inefficiencies in how models handled problem-solving [18][19]
- The evolution of coding practices is shifting towards "vibe coding," where intuition plays a significant role, reflecting a generational change in how programming is approached [21][22]

Group 5: Talent Acquisition and Research Culture
- OpenAI seeks individuals with perseverance and a solid technical foundation, rather than those who are merely prominent on social media or have flashy accomplishments [22][24]
- The company fosters a culture that values foundational research and encourages researchers to explore significant long-term questions without being distracted by immediate market pressures [25][28]

Group 6: Resource Allocation
- When considering resource allocation, OpenAI's leadership indicated that additional resources would be directed towards computational power, highlighting its critical role in research and development [26][27]
- The company acknowledges the ongoing challenges posed by computational limitations, which continue to influence the balance between product development and research initiatives [27][28]
OpenAI's two chiefs give a highly informative interview! The ultimate goal is an "automated researcher," and hiring isn't about finding the most high-profile people
量子位· 2025-09-26 04:56
(Netizen 1): Deep and interesting!

OpenAI Chief Scientist Jakub Pachocki and Chief Research Officer Mark Chen went into full on-stage disclosure mode: in this latest a16z interview, the two not only dug into how GPT-5 introduces long-horizon reasoning, how to measure progress once benchmarks saturate, and why reinforcement learning keeps surprising the skeptics, but also systematically laid out OpenAI's hiring criteria, future roadmap, and compute allocation.

In short, they touched on almost every question you might be curious about regarding OpenAI.

一水, from 凹非寺
量子位 | WeChat account QbitAI

The interview ran under an hour, yet its information density was explosive:

- The next step after vibe coding may be vibe researching;
- OpenAI's ultimate goal is to realize the automated researcher;
- Existing evaluation metrics are approaching saturation; the next milestone will involve actual discoveries and real progress on economically relevant things;
- The secret to success lies in protecting foundational research and not being dragged along by short-term product competition;
- ...

(Netizen 2): Sounds like a strong team with a clear vision.

Without further ado, here are the interview highlights.

GPT-5: bringing reasoning and agentic behavior into the mainstream

The first part of the interview is mainly about GPT-5. Mark Chen said GPT-5 is OpenAI's attempt to bring reasoning capability into the mainstream. ...