Reinforcement Learning

New pure-vision SOTA! AdaThinkDrive: a more flexible chain-of-thought for autonomous-driving VLA (Tsinghua & Xiaomi)
自动驾驶之心· 2025-09-18 23:33
Core Viewpoint
- The article discusses the limitations of existing Chain-of-Thought (CoT) reasoning methods in Vision-Language-Action (VLA) models for autonomous driving, particularly in simple scenarios where they do not improve decision quality and introduce unnecessary computational overhead. It introduces AdaThinkDrive, a new VLA framework that employs a dual-mode reasoning mechanism inspired by the "fast and slow thinking" theory, allowing the model to adaptively choose when to reason based on scene complexity [3][4][10].

Group 1: Introduction and Background
- The shift from traditional modular approaches to end-to-end architectures in autonomous driving is highlighted: while modular methods offer flexibility, they suffer from information loss between components, leading to cumulative errors in complex scenarios. End-to-end methods mitigate this issue but remain limited by their reliance on supervised data [7].
- The article categorizes current VLA methods into two paradigms: meta-action methods that focus on high-level guidance, and planning-based methods that predict trajectories directly from raw inputs. CoT techniques are increasingly applied, particularly in complex scenarios, but their effectiveness in simple scenarios is questioned [14][15].

Group 2: AdaThinkDrive Framework
- AdaThinkDrive is proposed as an end-to-end VLA framework that incorporates a "fast answer/slow thinking" mechanism, allowing the model to switch adaptively between direct prediction and explicit reasoning based on scene complexity. This is achieved through a three-stage adaptive reasoning strategy [11][18].
- The framework's performance is validated through extensive experiments on the NAVSIM benchmark, achieving a Predictive Driver Model Score (PDMS) of 90.3, 1.7 points higher than the best pure-vision baseline. The model demonstrates superior adaptive reasoning, selectively enabling CoT in 96% of complex scenarios and defaulting to direct trajectory prediction in 84% of simple scenarios [4][18][50].

Group 3: Experimental Results and Analysis
- A comprehensive evaluation against existing models shows that AdaThinkDrive outperforms both "always think" and "never think" baselines, with PDMS improvements of 2.0 and 1.4 points, respectively. Reasoning time is also reduced by 14% relative to the "always think" baseline, indicating a balance between accuracy and efficiency [4][18][58].
- The results indicate that the optimal reasoning strategy is not universal but depends on scene complexity, emphasizing the need for models to enable reasoning adaptively based on context [10][18].

Group 4: Conclusion
- Reasoning in simple scenarios often increases computational cost without enhancing decision quality. AdaThinkDrive addresses this by letting agents learn when to think, guided by an adaptive thinking reward mechanism. Experimental results on the NAVSIM benchmark demonstrate state-of-the-art performance, underscoring the importance of adaptive thinking for accurate and efficient decision-making in autonomous driving [66].
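The summary above does not give the paper's actual formulation of the adaptive thinking reward; a minimal Python sketch of the idea — reward accuracy, and penalize chain-of-thought in simple scenes as well as its absence in complex ones — might look like the following (the function name, inputs, and penalty weight are illustrative assumptions, not AdaThinkDrive's actual design):

```python
def adaptive_thinking_reward(correct: bool, used_cot: bool,
                             scene_is_complex: bool,
                             cot_penalty: float = 0.2) -> float:
    """Illustrative adaptive-thinking reward: accuracy first, plus a small
    penalty for reasoning (CoT) in simple scenes and for skipping it in
    complex ones. All weights are made up for illustration."""
    reward = 1.0 if correct else 0.0
    if used_cot and not scene_is_complex:
        reward -= cot_penalty   # discourage needless slow thinking
    if not used_cot and scene_is_complex:
        reward -= cot_penalty   # discourage skipping needed reasoning
    return reward
```

Under such a shaping scheme, a policy trained with reinforcement learning is pushed toward exactly the behavior the paper reports: CoT mostly in complex scenes, direct prediction mostly in simple ones.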
Chinese scholars publish 11 Nature papers in a single day
生物世界· 2025-09-18 10:05
Written by Wang Cong, edited by Wang Duoyu, layout by Shui Chengwen. On September 17, 2025, the leading international journal Nature published 24 papers online, 10 of them from Chinese scholars (counting papers with Chinese corresponding or first authors). On September 17, Ren Guangyu (City University of Hong Kong), Zhang Jie (Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences), Wu Shengfan (Lingnan University, Hong Kong), and Jiang Qing (Jilin University), as co-corresponding authors (with Wenlin Jiang and Geping Qu as co-first authors), published a research paper in Nature titled "Toughened self-assembled monolayers for durable perovskite solar cells" [1]. On September 17, Zhihao Luo of Princeton University (now at the University of Utah), as first author and co-corresponding author, published a Nature paper titled "Transitions in dynamical regime and neural mode during perceptual decisions" ...
DeepSeek responds for the first time to the "distilling OpenAI" allegations
第一财经· 2025-09-18 05:34
Core Viewpoint
- DeepSeek's R1 model has gained significant attention after being published in the prestigious journal Nature, showcasing its ability to enhance reasoning capabilities through reinforcement learning without relying heavily on supervised data [3][11].

Group 1: Model Development and Training
- The training cost for the DeepSeek-R1 model was approximately $294,000: R1-Zero training cost $202,000, SFT dataset creation cost $10,000, and R1 training cost $82,000 [10].
- DeepSeek-R1 was trained on 64×8 H800 GPUs, taking about 198 hours for R1-Zero and around 80 hours for R1 [10].
- Even including the earlier V3 model, total training cost remains significantly lower than competitors': roughly $6 million for V3 plus $294,000 for R1 [10].

Group 2: Model Performance and Validation
- DeepSeek's approach yields significant improvements in reasoning capabilities through large-scale reinforcement learning, even without supervised fine-tuning [13].
- The model's ability to self-validate and reflect on its answers enhances its performance on complex programming and scientific problems [13].
- R1 has become the most popular open-source reasoning model globally, with over 10.9 million downloads on Hugging Face [10].

Group 3: Industry Impact and Peer Review
- The publication of the R1 model in Nature sets a precedent for transparency in AI research, addressing concerns about the reliability of benchmark tests and the potential for manipulation [11].
- The research emphasizes the importance of independent peer review in validating the capabilities of AI systems, which is crucial in an industry facing scrutiny over performance claims [11].
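As a quick sanity check, the component costs reported above do sum to the headline figure:

```python
# Reported DeepSeek-R1 cost breakdown (USD), per the summary above
costs_usd = {
    "R1-Zero training": 202_000,
    "SFT dataset creation": 10_000,
    "R1 training": 82_000,
}
total = sum(costs_usd.values())
print(total)  # 294000 — matches the reported ~$294,000 total
```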
DeepSeek responds for the first time to the "distilling OpenAI" allegations
Di Yi Cai Jing· 2025-09-18 04:34
Core Insights
- DeepSeek's research paper, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning," has been published in the prestigious journal Nature, highlighting significant advancements in AI reasoning capabilities [1][11].

Group 1: Research and Development
- The initial version of the paper was released on arXiv in January; the Nature version adds more detailed model specifications and reduces anthropomorphism in its descriptions [5].
- DeepSeek-R1's training cost was reported at $294,000, broken down into $202,000 for R1-Zero training, $10,000 for SFT dataset creation, and $82,000 for R1 training [9].
- Training used A100 GPUs for smaller preliminary models before scaling up to the 660-billion-parameter R1 model, demonstrating a scalable approach to model development [8][10].

Group 2: Model Performance and Validation
- DeepSeek-R1 has become the most popular open-source reasoning model globally, with over 10.9 million downloads on Hugging Face, and is the first mainstream large language model to undergo peer review [11].
- The research shows that significant reasoning capabilities can be achieved through reinforcement learning without relying heavily on supervised fine-tuning, a departure from traditional methods [13].
- Training used a reward mechanism that encourages correct reasoning, allowing the model to self-validate and improve its performance on complex tasks [13].

Group 3: Industry Implications
- The findings could set a precedent for future AI model development, particularly in enhancing reasoning capabilities without extensive data requirements [11][13].
- The independent peer-review process adds credibility to the model's performance claims, addressing concerns about potential manipulation in AI benchmarking [11].
The top open-source agent model is now Alibaba's Tongyi DeepResearch
量子位· 2025-09-18 04:20
Core Viewpoint
- Alibaba has open-sourced its first deep-research agent model, Tongyi DeepResearch, which outperforms existing models such as OpenAI's Deep Research and DeepSeek-V3.1 on several authoritative evaluation sets [1][3].

Data Strategy
- The model's capability gains are attributed to a multi-stage data strategy designed to generate high-quality training data without relying on expensive manual annotation [4][5].
- The team introduced Agentic CPT for incremental pre-training, establishing a solid foundation for the agent [6].
- A systematic, scalable data-synthesis scheme was developed to create a positive feedback loop for data generation [7].

Data Construction
- An open-world knowledge memory was constructed from a wide range of knowledge documents, web-crawl data, knowledge graphs, and trajectory data from post-training [8].
- Three types of action data were created from diverse question styles and historical trajectory data, enabling extensive exploration of the reasoning-action space [9].

Post-training Data
- The team developed a fully automated synthetic-data generation scheme that produces datasets surpassing the quality of manual annotation [11][12].
- A new pipeline extracts information from real website data, preserving authentic data structures while increasing question complexity [14].

Reasoning Modes
- Tongyi DeepResearch offers both a native ReAct Mode and a Heavy Mode for complex multi-step research tasks [15][18].
- The IterResearch paradigm deconstructs tasks into a series of research rounds, allowing the agent to maintain cognitive focus and high-quality reasoning [20].

Training Process
- The training process was redesigned to connect Agentic CPT, Agentic SFT, and Agentic RL, forming a new paradigm for agent-model training [25][27].
- The team emphasized that data quality and training-environment stability mattered more than algorithmic factors to the success of their reinforcement-learning projects [37][39].

Application Deployment
- Tongyi DeepResearch powers multiple applications within Alibaba, including the Gaode travel agent, which integrates complex query capabilities into its app [42][43].
- A simulated training environment was built to address the high cost and inconsistency of developing against real-time web APIs [44].

Legal AI Application
- Tongyi Law Rui, a legal AI agent, aims to provide professional legal services, leveraging an innovative agent architecture and iterative planning technology for complex reasoning tasks [46].
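The native ReAct Mode mentioned above follows the general ReAct pattern of interleaving free-form reasoning with tool calls. A minimal, framework-agnostic sketch of that loop — the `llm` and `tools` interfaces here are hypothetical stand-ins, not Tongyi DeepResearch's actual API:

```python
def react_loop(llm, tools: dict, question: str, max_rounds: int = 8) -> str:
    """Minimal ReAct-style loop (illustrative): the model alternates
    free-form 'Thought' text and 'Action: tool[input]' calls, each action
    feeding an 'Observation' back into the transcript, until it emits a
    final answer or the round budget is exhausted."""
    transcript = f"Question: {question}\n"
    for _ in range(max_rounds):
        step = llm(transcript)  # model emits a Thought/Action or Final Answer
        transcript += step + "\n"
        if step.startswith("Final Answer:"):
            return step.removeprefix("Final Answer:").strip()
        if step.startswith("Action:"):
            name, _, arg = step.removeprefix("Action:").strip().partition("[")
            observation = tools[name](arg.rstrip("]"))
            transcript += f"Observation: {observation}\n"
    return "no answer within budget"
```

The IterResearch paradigm described above differs from this plain loop mainly in that it restarts with a condensed workspace each round instead of growing one monolithic transcript.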
"This gap has finally been filled": paper co-authored by Liang Wenfeng makes the cover of Nature
Guan Cha Zhe Wang· 2025-09-18 03:27
Science and Technology Daily reported that the study, in which Liang Wenfeng participated, shows that the reasoning ability of large language models can be improved through pure reinforcement learning, reducing the human input needed to boost performance. The resulting model outperforms conventionally trained large language models on tasks such as mathematics and graduate-level STEM problems.

DeepSeek-R1 includes an in-depth training phase under human supervision to optimize the reasoning process. Liang Wenfeng's team reports that the model used reinforcement learning, rather than human examples, to develop its reasoning steps, reducing training cost and complexity. After being shown high-quality problem-solving cases, DeepSeek-R1 receives a template for producing its reasoning process; that is, the model is rewarded for solving problems, which reinforces the learning effect. Both DeepSeek-R1-Zero and DeepSeek-R1 performed exceptionally well across the tests used to evaluate AI performance.

According to a Zhitong Finance report on September 18, the research paper on the DeepSeek-R1 reasoning model, completed jointly by the DeepSeek team with Liang Wenfeng as corresponding author, has made the cover of the authoritative international journal Nature.

Compared with the initial DeepSeek-R1 paper released in January this year, this version discloses more details of model training and directly addresses the distillation allegations raised when the model was first released. DeepSeek-R1 is also the world's first mainstream large language model to undergo peer review. Nature commented: at present ...
DeepSeek paper makes the cover of Nature; R1 becomes the first large model to undergo rigorous academic review
Xin Lang Cai Jing· 2025-09-18 02:23
Core Insights
- DeepSeek's R1 model has been recognized as the first major language model to be peer-reviewed and published in the prestigious journal Nature, a significant milestone in AI research [1][2].
- The R1 model has passed 10.9 million downloads on Hugging Face, making it the most popular open-source reasoning model globally [2].
- DeepSeek's approach uses pure reinforcement learning to enhance reasoning capabilities, diverging from traditional human-imitation methods [2][3].

Company Developments
- The R1 model was developed at a training cost of only $294,000, far below the training costs of AI models from OpenAI and Google, which can run into the millions [2].
- The company released an upgraded version, DeepSeek-V3.1, featuring a hybrid reasoning architecture, improved thinking efficiency, and enhanced agent capabilities [3].
- DeepSeek was founded in 2023 in Hangzhou, backed by the quantitative fund High-Flyer (幻方), with a team drawn from top universities and international institutions [3].

Industry Context
- The publication of DeepSeek's research is seen as a critical step toward curbing rampant speculation and unverified claims in the AI industry, underscoring the importance of independent peer review [3].
- Nature's recognition of DeepSeek's work highlights China's advances in foundational large-model research and its contribution to the global AI landscape [2].
Just now: the DeepSeek-R1 paper makes the cover of Nature, with Liang Wenfeng as corresponding author
机器之心· 2025-09-17 17:00
Core Viewpoint
- The article highlights the significance of DeepSeek-R1 as the first large language model (LLM) to pass peer review at a prestigious academic journal, Nature. The achievement marks a pivotal shift in the AI industry toward rigorous scientific validation of AI models, moving from mere technical competition to scientific discipline and public trust [5][11][12].

Summary by Sections

DeepSeek-R1 Overview
- DeepSeek-R1 is trained with reinforcement learning: the model receives rewards for correct answers and penalties for incorrect ones, enabling it to develop reasoning capabilities similar to human problem-solving [7][8].
- The model's ability to self-validate and reflect on its performance makes it more effective at programming and advanced scientific problems [7].

Peer Review Significance
- The peer-review process serves as a critical gatekeeper, requiring AI companies to substantiate their claims with solid evidence rather than self-promotion [10].
- Rigorous evaluation of DeepSeek-R1's methodology and limitations by external experts helps curb inflated claims in the AI industry [9][10].

Training Methodology
- DeepSeek-R1 employs a novel multi-stage pipeline that enhances reasoning capabilities without relying heavily on supervised data [15].
- The model uses Group Relative Policy Optimization (GRPO) to reduce training costs, together with a dual reward mechanism based on accuracy and format [16][17].
- A structured training template guides the model to articulate its reasoning process before giving final answers, making its learning progress directly observable [18].

Performance and Limitations
- DeepSeek-R1 demonstrates advanced self-evolution, autonomously developing higher-order reasoning skills during training [20].
- Despite these advances, the model still exhibits issues such as poor readability and language mixing in its outputs [21][26].

Cold Start and Reinforcement Learning
- The development team collected a small amount of long chain-of-thought (CoT) data to stabilize the model in the early stages of reinforcement learning [22].
- Language-consistency rewards were integrated during training to improve readability, though they may slightly reduce performance [23].

Distillation and Model Efficiency
- The team successfully distilled DeepSeek-R1's reasoning capabilities into smaller models, significantly improving their performance [29].
- Benchmark tests show DeepSeek-R1 competing effectively with state-of-the-art models on reasoning tasks, demonstrating its robust capabilities [30][31].
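The group-relative idea behind GRPO can be sketched compactly: instead of training a separate value network (critic), each sampled response's reward is normalized against the statistics of its own sampling group. A simplified illustration — the full algorithm also involves clipped policy ratios and a KL penalty, omitted here:

```python
from statistics import mean, pstdev

def grpo_advantages(group_rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages: normalize each sampled response's reward
    by the mean and standard deviation of its own sampling group, so no
    learned critic is needed to estimate a baseline."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# Example: four responses to one prompt, two judged correct (reward 1.0)
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # ≈ [1.0, -1.0, 1.0, -1.0]
```

These advantages then weight the policy-gradient update, pushing probability mass toward above-average responses within each group.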
Buick Zhijing L7 debuts: first model to carry Qualcomm's SA8775P cockpit chip, equipped with the "Xiaoyao Zhixing" driver-assistance system
Xin Lang Ke Ji· 2025-09-17 14:37
Core Viewpoint
- Buick's high-end new-energy sub-brand "Zhijing" has unveiled its flagship sedan, the Zhijing L7, which draws on more than a century of Buick's experience and a significant investment of resources [2].

Group 1: Product Features
- The Zhijing L7 is built on Buick's new "Xiaoyao" super-integration vehicle architecture and is now available at Buick dealerships; an early-bird plan offers lifetime free maintenance for orders placed before September 28 [2].
- It features the "Zhenlong" range-extension system with 252 kW of output, 0-100 km/h acceleration in 5.9 seconds, and fuel consumption as low as 0.5 L per 100 km [2].
- The vehicle offers a pure-electric range of 302 km and a combined range of 1,420 km, with fast charging from 30% to 80% in just 18 minutes [2].

Group 2: Advanced Technology
- The Zhijing L7 carries Buick's "Xiaoyao Zhixing" advanced driver-assistance system, built on the Momenta R6 flywheel model for full-scenario driving assistance, including "no-stop" city NOA and the industry's first "no-parking one-button parking" [3].
- It uses Qualcomm's latest SA8775P chip with 72 TOPS of compute, a 50-inch panoramic AR-HUD head-up display, and a 15.6-inch smart central control screen [3].

Group 3: Design and Comfort
- The vehicle measures 5032 mm × 1952 mm × 1500 mm with a 3000 mm wheelbase, featuring a starry-wing exterior design and a sleek coupe silhouette [3].
- The interior adopts a new pure floating-island design aesthetic, with high-quality Nappa leather seats and a 27-speaker Buick Sound theater-grade audio system [3][4].

Group 4: Chassis and Suspension
- The Zhijing L7 uses a front double-wishbone and rear five-link suspension, with RTD continuously variable damping for real-time body-posture control, enhancing ride comfort and stability [4].
Zhihui Jun's robot steals the show: world-first "Webster flip that every real man should master"
量子位· 2025-09-17 11:06
Jin Lei, reporting from Aofeisi. QbitAI | Official account QbitAI.

A big, capitalized "wow": Zhiyuan's Lingxi X2 has become the world's first robot to complete a Webster flip!

The Webster flip is an advanced aerial, at an intermediate-to-high skill level. Completing it normally requires one leg to push off the ground powerfully while the other swings to drive the body's rotation, demanding considerable leg explosiveness and coordination. On Douyin, humans frequently post videos built around being able to pull off this move, for example "Chongqing Xuanyang Stunts Dong-ge" (image source: the Douyin account of the same name).

After watching Lingxi X2's Webster flip, netizens flooded the comments with the famous line: "Every real man should master the Webster." Now, it seems, the line needs updating: "Every real robot should master the Webster, too." Zhihui Jun quipped: "Lingxi X2 has pulled off a move even I can't do."

About the robot: according to the official introduction, Lingxi X2 stands about 1.3 meters tall with 25-31 degrees of freedom across its body (including 2 in the head). Since the Lingxi X2 that performed the Webster had its head removed, it presumably had 2 fewer degrees of freedom. In terms of movement, Lingxi X2's motor interaction is already at a basic human level; fundamentals like running can handle all kinds of terrain. Without navigation, Lingxi X2 can also autonomously ...