The No. 1 spot on the open-source agent model leaderboard now belongs to Alibaba's Tongyi DeepResearch
量子位· 2025-09-18 04:20
Core Viewpoint
- Alibaba has open-sourced its first deep research agent model, Tongyi DeepResearch, which outperforms existing models such as OpenAI's Deep Research and DeepSeek-V3.1 across various authoritative evaluation sets [1][3]

Data Strategy
- The model's capability enhancement is attributed to a multi-stage data strategy designed to generate high-quality training data without relying on expensive manual annotation [4][5]
- The team introduced Agentic CPT for incremental pre-training, establishing a solid foundation for the agent [6]
- A systematic, scalable data synthesis scheme was developed to create a positive feedback loop for data generation [7]

Data Construction
- An open-world knowledge memory was constructed from a wide range of knowledge documents, web crawler data, knowledge graphs, and trajectory data from post-training [8]
- Three types of action data were created from diverse question styles and historical trajectory data, enabling extensive exploration of the reasoning-action space [9]

Post-training Data
- The team developed a fully automated synthetic data generation scheme to produce datasets that surpass the quality of manual annotations [11][12]
- A new process extracts information from real website data, preserving the authenticity of data structures while increasing question complexity [14]

Reasoning Modes
- Tongyi DeepResearch features both a native ReAct mode and a Heavy Mode for handling complex multi-step research tasks [15][18]
- The IterResearch paradigm deconstructs tasks into a series of research rounds, allowing the agent to maintain cognitive focus and high-quality reasoning [20]

Training Process
- The training process was innovated to connect Agentic CPT, Agentic SFT, and Agentic RL, yielding a new paradigm for agent model training [25][27]
- The team emphasized that data quality and training-environment stability matter more than algorithmic factors to the success of reinforcement learning projects [37][39]

Application Deployment
- Tongyi DeepResearch has powered multiple internal applications within Alibaba, including the Gaode travel agent, which integrates complex query capabilities into its app [42][43]
- A simulated training environment was created to address the high costs and inconsistencies of developing against real-time web APIs [44]

Legal AI Application
- Tongyi Law Rui, a legal AI agent, aims to provide professional legal services, leveraging an innovative agent architecture and iterative planning technology for complex reasoning tasks [46]
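The native ReAct mode mentioned above follows the well-known Thought-Action-Observation cycle. The sketch below is a minimal, generic ReAct-style loop with toy stand-ins (`search_tool`, `toy_policy` are invented for illustration), not Tongyi DeepResearch's actual implementation.

```python
# Minimal ReAct-style agent loop: the agent alternates between choosing an
# action (a tool call), observing the result, and eventually answering.
# Every component here is a toy stand-in, not the real model.

def search_tool(query):
    """Toy 'web search' returning canned observations."""
    corpus = {
        "capital of france": "Paris is the capital of France.",
    }
    return corpus.get(query.lower(), "No results found.")

def toy_policy(history):
    """Stand-in for the LLM: picks the next step from the trajectory so far."""
    if not any(step[0] == "observation" for step in history):
        return ("action", "capital of france")
    return ("answer", "Paris")

def react_loop(question, max_steps=5):
    history = [("question", question)]
    for _ in range(max_steps):
        kind, content = toy_policy(history)
        if kind == "answer":
            history.append(("answer", content))
            return content, history
        # Execute the tool call and feed the observation back into context.
        obs = search_tool(content)
        history.append(("action", content))
        history.append(("observation", obs))
    return None, history

answer, trace = react_loop("What is the capital of France?")
```

The key property is that each observation is appended to the trajectory before the next decision, so the policy always conditions on the full tool-use history.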
"This gap has finally been filled": Liang Wenfeng's paper makes the cover of Nature
Guan Cha Zhe Wang· 2025-09-18 03:27
Science and Technology Daily reported that the research Liang Wenfeng participated in shows that the reasoning ability of large language models can be improved through pure reinforcement learning, reducing the human input needed to boost performance. The resulting models outperform conventionally trained large language models on tasks such as mathematics and graduate-level STEM problems.

DeepSeek-R1 includes an in-depth training stage under human supervision to optimize the reasoning process. Liang Wenfeng's team reports that the model used reinforcement learning rather than human examples to develop its reasoning steps, reducing training cost and complexity. After being shown high-quality problem-solving cases, DeepSeek-R1 receives a template for producing its reasoning process; the model is rewarded for solving problems, which reinforces the learning effect. In evaluations of AI performance, both DeepSeek-R1-Zero and DeepSeek-R1 performed exceptionally well.

According to a Zhitong Finance report on September 18, the DeepSeek-R1 reasoning-model research paper, completed jointly by the DeepSeek team with Liang Wenfeng as corresponding author, has made the cover of the authoritative international journal Nature.

Compared with the initial DeepSeek-R1 paper released in January this year, this paper discloses more details of model training and directly responds to the distillation doubts raised when the model was first released. DeepSeek-R1 is also the world's first mainstream large language model to undergo peer review. Nature commented that at present ...
DeepSeek paper makes the cover of Nature; R1 becomes the first large model to undergo rigorous academic review
Xin Lang Cai Jing· 2025-09-18 02:23
Core Insights
- DeepSeek's R1 model has been recognized as the first major language model to be peer-reviewed and published in the prestigious journal Nature, marking a significant milestone in AI research [1][2]
- The R1 model has exceeded 10.9 million downloads on Hugging Face, making it the most popular open-source reasoning model globally [2]
- DeepSeek's innovative approach uses pure reinforcement learning to enhance reasoning capabilities, diverging from traditional human-imitation methods [2][3]

Company Developments
- DeepSeek's R1 model was developed with a training cost of only $294,000, significantly lower than the costs OpenAI and Google incur training AI models, which can reach millions [2]
- The company released an upgraded version, DeepSeek-V3.1, which features a hybrid reasoning architecture, improved thinking efficiency, and enhanced agent capabilities [3]
- DeepSeek was founded in 2023 in Hangzhou, backed by the quantitative trading firm High-Flyer, with a team composed of experts from top universities and international institutions [3]

Industry Context
- The publication of DeepSeek's research is seen as a critical step in addressing the rampant speculation and unverified claims within the AI industry, emphasizing the importance of independent peer review [3]
- The recognition of DeepSeek's work by Nature highlights China's advancements in foundational research on large models, contributing to the global AI landscape [2]
Breaking: DeepSeek-R1 paper makes the Nature cover, with Liang Wenfeng as corresponding author
机器之心· 2025-09-17 17:00
Core Viewpoint
- The article highlights the significance of DeepSeek-R1, recognized as the first large language model (LLM) to pass peer review in a prestigious academic journal, Nature. This achievement marks a pivotal shift in the AI industry toward more rigorous scientific validation of AI models, moving from mere technical competition to a focus on scientific discipline and public trust [5][11][12]

Summary by Sections

DeepSeek-R1 Overview
- DeepSeek-R1 is trained using reinforcement learning, in which the model receives rewards for correct answers and penalties for incorrect ones, enabling it to develop reasoning capabilities similar to human problem-solving [7][8]
- The model's ability to self-validate and reflect on its performance enhances its effectiveness in programming and advanced scientific inquiries [7]

Peer Review Significance
- The peer review process serves as a critical gatekeeper, requiring AI companies to substantiate their claims with solid evidence rather than self-promotion [10]
- The rigorous evaluation of DeepSeek-R1's methodology and limitations by external experts helps to mitigate inflated claims in the AI industry [9][10]

Training Methodology
- DeepSeek-R1 employs a novel multi-stage pipeline that enhances reasoning capabilities without relying heavily on supervised data [15]
- The model utilizes Group Relative Policy Optimization (GRPO) to reduce training costs and incorporates a dual reward mechanism based on accuracy and format [16][17]
- A structured training template guides the model to articulate its reasoning process before providing final answers, allowing clear observation of its learning progress [18]

Performance and Limitations
- DeepSeek-R1 demonstrates advanced self-evolution capabilities, developing higher-order reasoning skills autonomously during training [20]
- Despite its advancements, the model still faces challenges such as poor readability and language mixing in its outputs [21][26]

Cold Start and Reinforcement Learning
- The development team collected a small amount of long chain-of-thought (CoT) data to stabilize the model during the early stages of reinforcement learning [22]
- The integration of language-consistency rewards during training aims to improve the model's readability, although it may slightly affect performance [23]

Distillation and Model Efficiency
- The team successfully distilled the reasoning capabilities of DeepSeek-R1 into smaller models, significantly enhancing their performance [29]
- Benchmark tests indicate that DeepSeek-R1 competes effectively with state-of-the-art models in reasoning tasks, showcasing its robust capabilities [30][31]
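The core of GRPO, as described in the paper, is computing advantages relative to a group of responses sampled for the same prompt instead of training a separate value model. The sketch below shows only that group-normalization step; the toy dual reward (accuracy plus format bonus) and its weights are illustrative assumptions, not DeepSeek's actual reward values.

```python
# GRPO advantage step: sample a group of responses per prompt, score each,
# then normalize each reward against the group's mean and standard deviation.
# No learned critic is needed; the group itself serves as the baseline.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Per-response advantage = (reward - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def toy_reward(correct, well_formatted):
    """Toy dual reward: accuracy term plus a smaller format term
    (weights here are invented for illustration)."""
    return (1.0 if correct else 0.0) + (0.5 if well_formatted else 0.0)

# Four sampled responses to one prompt, scored by the toy reward.
rewards = [toy_reward(True, True), toy_reward(True, False),
           toy_reward(False, True), toy_reward(False, False)]
advs = group_relative_advantages(rewards)
```

Because the baseline is the group mean, advantages always sum to (approximately) zero: above-average responses are pushed up and below-average ones pushed down, which is what makes a critic model unnecessary.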
Buick Zhijing L7 makes its debut: first model with the Qualcomm SA8775P cockpit chip, featuring the "Xiaoyao Zhixing" driver-assistance system
Xin Lang Ke Ji· 2025-09-17 14:37
Core Viewpoint
- Buick's high-end new-energy sub-brand "Zhijing" has unveiled its flagship sedan, the Zhijing L7, which integrates over a century of Buick's experience and significant investment in resources [2]

Group 1: Product Features
- The Zhijing L7 is built on the new Buick "Xiaoyao" super integration vehicle architecture and is now available at Buick dealerships, with an early-bird plan offering lifetime free maintenance for orders placed before September 28 [2]
- It features the "Zhenlong" range-extension system with a power output of 252 kW, achieving 0-100 km/h acceleration in 5.9 seconds and fuel consumption as low as 0.5 L per 100 km [2]
- The vehicle offers a pure electric range of 302 km and a comprehensive range of 1420 km, with fast charging from 30% to 80% in just 18 minutes [2]

Group 2: Advanced Technology
- The Zhijing L7 is equipped with the Buick "Xiaoyao Zhixing" advanced driver-assistance system, featuring the Momenta R6 flywheel model for full-scenario driving assistance, including "no-stop" city NOA and the industry's first "no-parking one-button parking" [3]
- It incorporates Qualcomm's latest SA8775P chip with a computing power of 72 TOPS, a 50-inch panoramic AR-HUD head-up display, and a 15.6-inch smart central control screen [3]

Group 3: Design and Comfort
- The vehicle dimensions are 5032 mm x 1952 mm x 1500 mm with a wheelbase of 3000 mm, featuring a starry-wing exterior design and a sleek coupe shape [3]
- The interior features a new pure floating-island design aesthetic, with high-quality Nappa leather seats and a 27-speaker Buick Sound theater-level audio system [3][4]

Group 4: Chassis and Suspension
- The Zhijing L7 utilizes a front double-wishbone and rear five-link suspension structure, with RTD continuous damping variable suspension for real-time body posture control, enhancing ride comfort and stability [4]
Zhihui Jun's robot steals the show: world's first demo of the "Webster flip every real man should master"
量子位· 2025-09-17 11:06
Core Viewpoint
- The article highlights the achievement of the Lingxi X2 robot, which has become the first robot globally to complete a Webster flip, a complex acrobatic maneuver that demonstrates advanced capabilities in robotics [1][7]

Group 1: Robot Capabilities
- The Lingxi X2 robot stands approximately 1.3 meters tall and possesses 25-31 degrees of freedom, although it lost 2 degrees due to the removal of its head for the Webster flip [13][14]
- The robot can perform basic movements such as running and can traverse various terrains without navigation systems, showcasing its autonomous obstacle-avoidance capabilities [16][19]
- The successful execution of the Webster flip required overcoming significant challenges, including high dynamic complexity, real-time perception and feedback, and high hardware reliability [23][24]

Group 2: Technological Innovations
- The achievement is attributed to the Lingchuan platform, an AI-enhanced tool for robot motion and expression creation, which allows for the design and secondary development of robot movements [20][19]
- The robot's motion capabilities are based on a reinforcement learning strategy that uses human video data to train its movements, ensuring precise execution in real-world scenarios [24]

Group 3: Future Developments
- The Lingxi X2 series includes other models such as Lingxi X2-W and Lingxi X2-N, designed for different operational capabilities, including task intelligence and adaptability to various terrains [26][34]
- The company plans to scale production of the Lingxi X2 by the second half of 2025, with an expected output of several thousand units by the end of 2026 [36]
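Training locomotion from human video typically means rewarding the policy for tracking a reference motion extracted from the footage. The sketch below shows that generic tracking-reward recipe; it is a plausible reading of the approach described, not Lingxi's actual reward function, and the joint values and scale are invented.

```python
# Generic motion-imitation tracking reward: the closer the robot's joint
# pose is to the reference pose from the human video at the same timestep,
# the higher the reward. exp(-scale * MSE) gives 1.0 at perfect tracking
# and decays smoothly as the pose diverges.
import math

def tracking_reward(robot_joints, reference_joints, scale=2.0):
    err = sum((a - b) ** 2 for a, b in zip(robot_joints, reference_joints))
    err /= len(reference_joints)
    return math.exp(-scale * err)

# Toy joint angles (radians) for illustration only.
perfect = tracking_reward([0.1, 0.5, -0.3], [0.1, 0.5, -0.3])
off = tracking_reward([0.4, 0.9, 0.1], [0.1, 0.5, -0.3])
```

The exponential shaping keeps the reward bounded and dense, which matters for dynamic maneuvers like flips where sparse success/failure signals alone are hard to learn from.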
"One hundred percent a Chinese car": Buick's first extended-range sedan, the Zhijing L7, unveiled
Guan Cha Zhe Wang· 2025-09-17 10:38
Core Viewpoint
- The Buick Zhijing L7, the first extended-range sedan from SAIC-GM Buick, was unveiled on September 15, 2025, and is touted as the "strongest extended-range luxury sedan" in the industry, developed entirely in China [1][3]

Group 1: Product Features
- The Buick Zhijing L7 is built on the "Xiaoyao" super fusion architecture and features the "Zhenlong" extended-range technology, which delivers a maximum power output of 252 kW and accelerates from 0 to 100 km/h in just 5.9 seconds [5]
- The vehicle boasts a comprehensive fuel consumption of 0.5 L per 100 km, with a pure electric range of up to 302 km and a total range of 1420 km [5]
- It supports the fastest charging in its class at 130 kW, allowing a 30% to 80% charge in just 18 minutes [5]

Group 2: Technological Advancements
- The Zhijing L7 is equipped with the latest Qualcomm SA8775P chip, providing a neural-network computing power of 72 TOPS, and features a 50-inch panoramic AR-HUD and a 15.6-inch smart central control screen [9]
- It incorporates the "Xiaoyao Zhixing" advanced driver-assistance system, which includes full-scene driving assistance capabilities and the industry's first "no-stop one-button parking" feature [7]

Group 3: Design and Comfort
- The vehicle's dimensions are 5032 mm in length, 1952 mm in width, and 1500 mm in height, with a wheelbase of 3000 mm, positioning it as a C-class sedan with a sleek fastback design [11]
- The interior features a premium design with high-quality materials, including a 27-speaker Buick Sound theater-level audio system and multi-mode headrest speakers [11]

Group 4: Market Positioning
- The Zhijing L7 will compete with domestic electric vehicles such as the Xiangjie S9 and Avita 12, and its brand strength in the new-energy era remains to be validated [13]
Tencent AI Lab pioneers the RL framework Parallel-R1, teaching large models "parallel thinking"
机器之心· 2025-09-17 09:37
Ever since Google Gemini attributed part of its math-olympiad success to "parallel thinking," how to give large models this ability to explore multiple reasoning paths in parallel has become a focus of academic attention.

However, existing methods mostly rely on supervised fine-tuning (SFT): the model can only imitate pre-constructed parallel-thinking data and struggles to generalize to real, complex tasks; moreover, this approach is data-hungry, typically requiring a complex data pipeline for construction.

To address these problems, researchers from Tencent AI Lab Seattle, the University of Maryland, Carnegie Mellon University, UNC Chapel Hill, City University of Hong Kong, Washington University in St. Louis, and other institutions (first author Tong Zheng is a PhD student at the University of Maryland; the work was completed during his internship at Tencent AI Lab Seattle) created the Parallel-R1 framework, the first to teach large models parallel thinking on general mathematical reasoning tasks via reinforcement learning (RL). Through its novel "progressive curriculum" and "alternating reward" design, the framework solves the cold-start and reward-design challenges of RL training.

Experiments show that Parallel-R1 not only delivers an average accuracy gain of up to 8.4% across multiple math benchmarks, but also, via a "mid-training scaffold" strategy, achieves a 42.9% performance leap on the AIME25 test ...
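One way to read the "alternating reward" idea is a schedule that switches between rewarding correctness alone and additionally rewarding use of the parallel-thinking format, so the model neither ignores the format nor games it. The sketch below is a guess at that mechanism for illustration; the phase length, bonus weight, and switching rule are all invented, not Parallel-R1's actual design.

```python
# Hypothetical alternating-reward schedule: even phases score correctness
# only; odd phases add a small bonus when the response actually used the
# parallel-thinking structure. All constants are illustrative assumptions.

def alternating_reward(step, correct, used_parallel, period=2, bonus=0.2):
    r = 1.0 if correct else 0.0
    in_structure_phase = (step // period) % 2 == 1
    if in_structure_phase and used_parallel:
        r += bonus
    return r

# Phase 0 (steps 0-1): correctness only, format ignored.
# Phase 1 (steps 2-3): parallel structure earns an extra bonus.
r_plain = alternating_reward(0, correct=True, used_parallel=True)
r_bonus = alternating_reward(2, correct=True, used_parallel=True)
```

Alternating rather than always adding the format bonus is one way to avoid reward hacking, where the model emits the parallel-thinking template without improving its answers.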
The next stop of the AI revolution: Anthropic and OpenAI pour billions into building "virtual employees"
36Ke· 2025-09-17 05:11
Such training does not come cheap. According to people familiar with the matter, Anthropic plans to invest $1 billion over the coming year specifically to build simulated office platforms known as "reinforcement learning environments" or "gyms." OpenAI is likewise spending heavily: its data-related expenditures are expected to reach $1 billion this year and grow to $8 billion by 2030. The money goes both to building virtual office environments and to paying expert salaries.

On September 17, it was reported that the two AI giants, Anthropic and OpenAI, are working to develop "AI colleagues" capable of replacing humans on complex work. Their core method is to train AI models with simulated enterprise software so that they can understand and operate real workflows like human employees.

To accelerate this, Anthropic plans to invest $1 billion next year in building large-scale AI training "gyms." OpenAI, for its part, believes the entire economy could become one giant "reinforcement learning machine" in which AI continuously evolves through collaboration with and feedback from humans, fundamentally reshaping productivity and work patterns.

At up to $250 an hour, "AI tutors" are teaching large models how to do office work

Anthropic and OpenAI are doing something unprecedented: sending large language models into the "office" to learn to be competent "digital employees."

These AI models are undergoing intensive vocational training, learning to operate all kinds of professional office software, from Salesforce's customer management system to Ze ...
Brief | OpenAI and Anthropic's new battleground: training AI to operate enterprise software, with costs soaring toward $8 billion a year
Z Potentials· 2025-09-17 03:34
AI developers such as Anthropic and OpenAI are putting large language models "to work at the office."

These AI models are learning to use tools ranging from Salesforce's customer relationship management software to Zendesk's customer support system to Cerner's medical records application. The aim is to teach AI to handle some of the complex tasks faced by white-collar workers.

This training regime differs from anything AI models have undergone before. Researchers give the AI simulated applications to practice interacting with, while hiring domain experts to demonstrate how to operate these applications.

These techniques are not cheap. According to a person familiar with the matter, Anthropic executives have internally discussed spending $1 billion over the coming year to build these "enterprise application clones," also known as reinforcement learning environments or training grounds.

Hiring human experts in fields such as biology, software programming, and medicine to teach models new knowledge and office-software operation is also increasingly costly. OpenAI forecast earlier this year that it plans to spend about $1 billion on data-related costs this year (including fees for human experts and reinforcement learning training grounds), climbing to $8 billion by 2030.

If successful, these AI training methods could help OpenAI and Anthropic overcome some of the limitations that traditional training techniques have recently encountered ...
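The simulated-application "training grounds" described above can be thought of as RL environments wrapping a fake business app, where valid workflow steps earn reward and out-of-order operations are penalized. The toy environment below illustrates that shape; the ticketing app, actions, and reward values are all invented for illustration, not any lab's actual setup.

```python
# Toy "enterprise app clone" as an RL environment: the agent operates a
# simulated helpdesk and is rewarded for completing the workflow
# open -> assigned -> resolved, and penalized for invalid operations.

class TicketEnv:
    def __init__(self, tickets):
        self.state = {t: "open" for t in tickets}

    def step(self, action, ticket):
        """Apply one agent action to one ticket; return a scalar reward."""
        transitions = {("assign", "open"): "assigned",
                       ("resolve", "assigned"): "resolved"}
        new = transitions.get((action, self.state[ticket]))
        if new is None:
            return -1.0          # penalty: action invalid in current state
        self.state[ticket] = new
        return 1.0 if new == "resolved" else 0.1  # big reward on completion

env = TicketEnv(["T1"])
rewards = [env.step("assign", "T1"), env.step("resolve", "T1")]
```

Scaled up, the same pattern (stateful app clone, action interface, workflow-completion reward) is what makes it possible to train on office tasks without touching real customer systems.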