Reinforcement Learning

OpenAI reveals the principles behind ChatGPT Agent: reinforcement learning lets the model autonomously explore the best tool combinations
量子位· 2025-07-23 10:36
Core Insights
- The article discusses the technical details and implications of OpenAI's newly launched ChatGPT Agent, marking a significant step in the development of intelligent agents [1][2]

Group 1: ChatGPT Agent Overview
- ChatGPT Agent consists of four main components: Deep Research, Operator, and additional tools such as a terminal and image generation [3][9]
- The integration of Deep Research and Operator was driven by user demand for a more versatile tool that could handle both research and visual-interaction tasks [6][11]

Group 2: Training Methodology
- The training method integrates all tools into a single virtual-machine environment, allowing the model to autonomously explore the best tool combinations through reinforcement learning [12]
- The model learns to switch between tools seamlessly, completing tasks efficiently without explicit instructions on tool usage [13][14]

Group 3: Team Structure and Collaboration
- The ChatGPT Agent team is a merger of the Deep Research and Operator teams, consisting of roughly 20 to 35 members who collaborated closely to complete the project in a few months [19][20]
- The team emphasizes a user-scenario-driven approach, with application engineers participating in model training and researchers involved in deployment [21][22]

Group 4: Challenges and Future Directions
- The main training challenges were stability issues and the need for robustness against external factors such as website downtime and API limitations [24]
- Future development aims at a general-purpose super agent capable of handling a wide range of tasks, with a focus on adaptability and user-feedback integration [25][26]

Group 5: Security Measures
- The team has implemented multi-layered security measures to address potential risks, including monitoring for abnormal behavior and requiring user confirmation for sensitive actions [27]
- Special attention is given to biological risks, ensuring that the agent cannot be misused for harmful purposes [24][27].
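The idea of a model learning from reward alone which tool to reach for can be illustrated with a minimal sketch. This is not OpenAI's training setup: the tool names, reward values, and the epsilon-greedy bandit policy below are invented stand-ins for the concept.

```python
import random

# Hedged sketch: an epsilon-greedy policy that learns which tool tends to
# succeed for a task type. Tool names and rewards are invented for illustration.
TOOLS = ["browser", "terminal", "image_gen"]

class ToolPolicy:
    def __init__(self, tools, epsilon=0.1):
        self.epsilon = epsilon
        self.value = {t: 0.0 for t in tools}   # running mean reward per tool
        self.count = {t: 0 for t in tools}

    def choose(self):
        if random.random() < self.epsilon:     # explore other tools
            return random.choice(list(self.value))
        return max(self.value, key=self.value.get)  # exploit best-so-far

    def update(self, tool, reward):
        self.count[tool] += 1
        # incremental running-mean update
        self.value[tool] += (reward - self.value[tool]) / self.count[tool]

policy = ToolPolicy(TOOLS)
random.seed(0)
for _ in range(500):
    tool = policy.choose()
    # pretend the browser succeeds most often for this task type
    reward = 1.0 if (tool == "browser" and random.random() < 0.8) else 0.0
    policy.update(tool, reward)

print(max(policy.value, key=policy.value.get))
```

The real system credits tools by whether the overall task succeeds rather than per-call rewards, but the exploration/exploitation trade-off is the same.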
A long-form summary of end-to-end autonomous driving
自动驾驶之心· 2025-07-23 09:56
Core Viewpoint
- The article discusses the current development status of end-to-end autonomous driving algorithms, comparing them with traditional algorithms and highlighting their advantages and limitations [1][3][53]

Summary by Sections

Traditional vs. End-to-End Algorithms
- Traditional autonomous driving algorithms follow a pipeline of perception, prediction, and planning, where each module has distinct inputs and outputs [3]
- End-to-end algorithms take raw sensor data as input and directly output path points, simplifying the process and reducing error accumulation [3][5]
- Traditional algorithms are easier to debug and offer some interpretability, but they suffer from cumulative errors because the perception and prediction modules cannot be made fully accurate [3][5]

Limitations of End-to-End Algorithms
- End-to-end algorithms struggle with corner cases because they rely heavily on data-driven methods [7][8]
- Imitation learning in these algorithms makes it difficult to learn an optimal ground truth and to handle exceptional cases [53]
- Current end-to-end paradigms include imitation learning (behavior cloning and inverse reinforcement learning) and reinforcement learning, with evaluation methods categorized as open-loop or closed-loop [8]

Current Implementations
- The ST-P3 algorithm is highlighted as an early end-to-end autonomous driving work, using a framework that includes perception, prediction, and planning modules [10][11]
- ST-P3's innovations include a perception module with an ego-centric cumulative alignment technique and a prediction module with a dual-path prediction mechanism [11][13]
- The planning phase of ST-P3 refines predicted trajectories by incorporating traffic-light information [14][15]

Advanced Techniques
- The UniAD system employs a full Transformer framework for end-to-end autonomous driving, integrating multiple tasks to enhance performance [23][25]
- The TrackFormer framework focuses on the collaborative updating of track queries and detect queries to improve prediction accuracy [26]
- The VAD (Vectorized Autonomous Driving) method introduces vectorized representations for better structural information and faster trajectory-planning computation [32][33]

Future Directions
- End-to-end algorithms still rely primarily on imitation-learning frameworks, whose inherent limitations need further exploration [53]
- Introducing more constraints and multi-modal planning methods aims to address trajectory-prediction instability and improve model performance [49][52]
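The behavior-cloning objective at the heart of these imitation-learning planners can be sketched in a few lines: regress predicted future path points against expert trajectories. The feature dimension, horizon, and linear "policy head" below are invented for illustration; real systems use deep networks over camera/LiDAR features.

```python
import numpy as np

# Hedged sketch of the behavior-cloning (BC) objective: minimize the
# mean-squared error between predicted and expert waypoints.
rng = np.random.default_rng(0)

FEAT_DIM = 64    # stand-in for pooled BEV/sensor features
HORIZON = 6      # number of future (x, y) waypoints to predict

W = rng.normal(scale=0.01, size=(FEAT_DIM, HORIZON * 2))  # linear policy head

def bc_loss(features, expert, W):
    # mean-squared error between predicted and expert waypoints
    pred = features @ W
    return float(np.mean((pred - expert) ** 2))

features = rng.normal(size=(32, FEAT_DIM))
expert = rng.normal(size=(32, HORIZON * 2))   # synthetic "expert" trajectories

initial_loss = bc_loss(features, expert, W)
for _ in range(200):
    grad = 2 * features.T @ (features @ W - expert) / features.shape[0]
    W -= 0.01 * grad                           # plain SGD step
final_loss = bc_loss(features, expert, W)

print(initial_loss > final_loss)
```

The cumulative-error and corner-case problems discussed above follow directly from this objective: the model only sees states the expert visited, so it has no supervised signal for recovering from its own mistakes.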
Quark Health Model research report surfaces: the first in China, a look at the deep engineering behind a chief-physician-level "AI brain"
机器之心· 2025-07-23 08:57
Core Insights
- The Quark Health Model has passed assessments in 12 core medical disciplines, making it the first AI model in China to achieve this milestone and demonstrating its advanced capabilities in the healthcare sector [1][3]

Group 1: Research Summary
- Developing high-performance reasoning models for healthcare remains challenging despite rapid advances in general AI models. The Quark Health Model established a comprehensive process that improves performance and interpretability by clearly defining data sources and learning methods [3][5]
- The team emphasizes high-quality thinking data (Chain-of-Thought, CoT) as the foundational material for enhancing the model's reasoning capabilities through reinforcement learning [5][6]

Group 2: Data Production Lines
- The model employs two parallel data production lines, one for verifiable data and one for non-verifiable data, ensuring a systematic approach to data quality and model training [6][17]
- The first production line focuses on cold-start data and model fine-tuning, using high-quality data generated by state-of-the-art language models and validated by medical professionals for accuracy and reliability [19][24]

Group 3: Reinforcement Learning and Training
- The reinforcement-learning phase is critical for enhancing reasoning, with a focus on generating diverse, high-quality outputs through iterative training and data selection [24][26]
- The training process incorporates mechanisms to evaluate and improve reasoning quality, including preference reward models and verification systems that ensure the accuracy and relevance of outputs [33][38]

Group 4: Quality Assessment and Challenges
- The model addresses multi-solution and multi-path scenarios in healthcare with an evaluation system that recognizes the value of diverse reasoning paths and outputs [31][32]
- Training includes strategies to mitigate "cheating" behaviors, ensuring outputs are not only structurally sound but also medically accurate and reliable [40][42]
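The "verifiable data" production-line idea above can be sketched as a filter: sample several chain-of-thought traces per question and keep only those whose final answer a programmatic checker accepts. The generator and checker below are invented stand-ins; the real pipeline uses language models and physician-validated references.

```python
# Hedged sketch of rejection-style filtering for verifiable CoT training data.
# All names and sample traces here are hypothetical.

def fake_generate_traces(question):
    # stand-in for sampling several CoT traces from a language model
    return [
        {"cot": "2 + 2 equals 4", "answer": "4"},
        {"cot": "2 + 2 equals 5", "answer": "5"},    # flawed trace
        {"cot": "two plus two is four", "answer": "4"},
        {"cot": "unsure, guessing", "answer": "22"}, # flawed trace
    ]

REFERENCE_ANSWERS = {"2+2": "4"}  # verifiable ground truth

def build_sft_data(questions):
    kept = []
    for q in questions:
        for trace in fake_generate_traces(q):
            if trace["answer"] == REFERENCE_ANSWERS[q]:  # checker accepts
                kept.append({"question": q, **trace})
    return kept

data = build_sft_data(["2+2"])
print(len(data))  # two traces survive the filter
```

The non-verifiable production line cannot use an exact-match checker like this, which is why the report pairs it with preference reward models instead.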
Alibaba open-sources its strongest coding model: 480 billion parameters, agent scores surpassing Kimi K2, training details published
36Kr· 2025-07-22 23:53
Core Insights
- Alibaba's Qwen team has released its latest flagship programming model, Qwen3-Coder-480B-A35B-Instruct, claimed to be the most powerful open-source programming model to date, featuring 480 billion parameters and supporting up to 1 million tokens of context [1][2][16]
- The model has achieved state-of-the-art performance in various programming and agent tasks, surpassing other open-source models and even competing with proprietary models like GPT-4.1 [1][3][20]
- Qwen3-Coder is designed to significantly enhance productivity, allowing novice programmers to accomplish tasks in a fraction of the time experienced developers would need [2][24]

Model Specifications
- Qwen3-Coder comes in multiple sizes; the current release is the most powerful variant at 480 billion parameters, larger than Alibaba's previous flagship Qwen3 at 235 billion parameters but smaller than Kimi K2 at 1 trillion parameters [2][3]
- The model supports a native context of 256K tokens, extensible to 1 million tokens, and is optimized for programming tasks [16][20]

Performance Metrics
- In benchmark tests, Qwen3-Coder outperformed other models in categories such as Agentic Coding, Agentic Browser Use, and Agentic Tool Use, achieving the best performance among open-source models [1][3][20]
- Specific results include 69.6 on SWE-bench Verified and 77.5 on TAU-Bench Retail, showcasing its capabilities on real-world programming tasks [3][20]

Pricing Structure
- The Qwen3-Coder API is available on Alibaba Cloud's platform with tiered pricing based on input-token volume, ranging from $1 to $6 per million input tokens and $5 to $60 per million output tokens depending on the token range [4][5][24]
- The pricing is competitive compared to other models like Claude Sonnet 4, which has lower input and output costs [4][5]

User Experience and Applications
- Qwen3-Coder is available for free on the Qwen Chat web platform, letting users experience its capabilities firsthand [6][24]
- Users have reported impressive results in tasks including game development and UI design, with the model demonstrating high completion rates and aesthetic quality [9][11][12]

Future Developments
- The Qwen team is actively working on improving the model's performance and exploring self-improvement capabilities for coding agents [24]
- More model sizes are expected, aiming to balance deployment cost and performance [24]
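The tiered pricing above can be made concrete with a small cost calculator. The article gives only the overall ranges ($1 to $6 per million input tokens, $5 to $60 per million output tokens); the specific tier boundaries below are invented for illustration and are not Alibaba Cloud's published tiers.

```python
# Hedged sketch of a tiered token-pricing calculator with hypothetical tiers.
TIERS = [
    # (max input tokens in request, $ per 1M input, $ per 1M output)
    (32_000,     1.0,  5.0),
    (128_000,    1.8,  9.0),
    (256_000,    3.0, 15.0),
    (1_000_000,  6.0, 60.0),
]

def request_cost(input_tokens, output_tokens):
    # the tier is selected by the request's input-token count
    for limit, in_price, out_price in TIERS:
        if input_tokens <= limit:
            return (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    raise ValueError("input exceeds maximum context")

print(round(request_cost(100_000, 2_000), 4))
```

Selecting the tier from the input size alone matches the article's description of pricing "based on input token volume"; a production billing system would also need to handle the extended-context boundary explicitly.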
ByteDance releases the GR-3 model, opening a new era for the general-purpose robot "brain"
Jing Ji Guan Cha Bao· 2025-07-22 07:23
Core Insights
- ByteDance's Seed team launched a new Vision-Language-Action model (VLA) named GR-3, which boasts strong generalization, understanding of abstract concepts, and the ability to manipulate flexible objects [2][3]

Model Features
- GR-3's key advantage lies in its exceptional generalization ability and understanding of abstract concepts, allowing efficient fine-tuning with minimal human data [3]
- The model uses a Mixture-of-Transformers (MoT) architecture, integrating vision-language and action-generation modules into an end-to-end model with 4 billion parameters [3]
- GR-3 can perform a series of actions from verbal commands such as "clean the table," executing tasks like packing leftovers and disposing of trash [3]

Training Methodology
- GR-3 employs a three-in-one data training method, combining teleoperated robot data, human VR trajectory data, and publicly available image-text data to enhance model performance [4]
- Teleoperated robot data ensures stability and accuracy on basic tasks, while human VR trajectory data enables rapid learning of new tasks at nearly double the efficiency of traditional methods [4]

Application and Performance
- In practical applications, GR-3 performs strongly on general pick-and-place tasks, maintaining high command adherence and success rates even in unfamiliar environments [6]
- On long-horizon table-cleaning tasks, GR-3 achieves an average completion rate above 95% from the single command "clean the table" [6]
- The model exhibits remarkable flexibility and robustness in delicate operations, successfully completing tasks like hanging clothes regardless of garment type [6]

Future Developments
- The Seed team plans to expand the model's scale and training data to further improve GR-3's generalization to unknown objects [7]
- Future enhancements will introduce reinforcement learning (RL) so the robot can learn from trial and error during real operations [7]
- The release of GR-3 is seen as a significant step toward a general-purpose robotic "brain," with aspirations for robots to assist in daily human tasks [7]
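The "three-in-one" data recipe above amounts to sampling each training example from a weighted mixture of sources. The weights, source names, and contents in this sketch are invented; ByteDance has not published a mixing ratio in this summary.

```python
import random

# Hedged sketch of weighted mixture sampling over three data sources.
SOURCES = {
    "teleop":    ["teleop_ep_%d" % i for i in range(100)],
    "human_vr":  ["vr_traj_%d" % i for i in range(100)],
    "web_pairs": ["img_text_%d" % i for i in range(100)],
}
WEIGHTS = {"teleop": 0.5, "human_vr": 0.3, "web_pairs": 0.2}  # hypothetical

def sample_batch(batch_size, rng):
    names, weights = zip(*WEIGHTS.items())
    batch = []
    for _ in range(batch_size):
        src = rng.choices(names, weights=weights)[0]  # pick a source
        batch.append((src, rng.choice(SOURCES[src])))  # pick an example
    return batch

rng = random.Random(0)
batch = sample_batch(1000, rng)
counts = {name: sum(1 for s, _ in batch if s == name) for name in WEIGHTS}
print(counts)
```

Keeping the grounded teleoperation data dominant while mixing in cheaper VR and web data is one plausible reading of why the method preserves stability on basic tasks.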
On robot data: reinforcement-learning expert Sergey Levine just wrote a good article
机器之心· 2025-07-22 04:25
Core Viewpoint
- The article discusses the challenges and limitations of using alternative data to train large models in artificial intelligence, particularly robotics, emphasizing that while alternative data reduces cost, it often compromises the model's generalization capabilities [6][30][40]

Group 1: Challenges in Training Large Models
- Training large models, especially in robotics, requires vast amounts of real-world interaction data, which is costly to obtain [2][4]
- Researchers are exploring alternative data sources to balance cost and training effectiveness, but achieving this balance is complex [5][8]

Group 2: Alternative Data Strategies
- Methods for obtaining alternative data include simulation, human videos, and handheld gripper devices, each with its own strengths and weaknesses [10][12][13]
- While these methods have produced significant research results, they are compromises that can weaken the inherent capabilities of large-scale learning models [14]

Group 3: Limitations of Alternative Data
- Relying on alternative data can create a disconnect between the training environment and real-world applications, limiting effective generalization [26][28]
- The design decisions made when creating alternative data strongly affect the overlap between strategies that succeed in the real world and those learned from alternative data [23][24]

Group 4: Importance of Real-World Data
- Real-world data is essential for broad generalization, as it lets models learn the true mechanisms of the world [36]
- Alternative data should be viewed as a supplementary source of knowledge, not a replacement for real-world experience [37][38]

Group 5: The Concept of "Sporks"
- The term "sporks" describes alternative-data approaches that try to combine the benefits of large-scale training with the cost-effectiveness of alternative data [39][40]
- Other "spork" methods include hybrid systems that combine manual design with learned components, aiming to mitigate the high data demands of machine learning [41][42]
Computer industry commentary report on Kimi: dual breakthroughs with Researcher and K2, driven by the twin engines of reinforcement-learning innovation and open-source intelligence
Huaxin Securities· 2025-07-21 13:34
Investment Rating
- The report maintains a "Recommended" rating for the computer industry, indicating an expected outperformance of more than 10% relative to the benchmark index [10]

Core Insights
- The report highlights significant advances in AI and computer technology, particularly Moonshot AI's Kimi-Researcher and Kimi K2 models, which demonstrate breakthroughs in end-to-end reinforcement learning and open-source intelligence [5][6]
- The computer industry has outpaced the broader market, gaining 12.1% over the past month and a remarkable 60.5% over the past year, versus a 14.7% gain for the CSI 300 index [2][3]

Summary by Sections

Market Performance
- The computer industry shows strong relative performance: up 12.1% over 1 month, 10.3% over 3 months, and 60.5% over 12 months [2]

Investment Highlights
- Kimi-Researcher, launched in June 2025, achieved a Pass@1 score of 26.9% on the Humanity's Last Exam benchmark, setting a new record in the field [5]
- The Kimi K2 model, released in July 2025, features a MuonClip optimizer that improves training stability and supports complex task processing with a 16K context length, achieving a Pass@1 score of 65.8% on the SWE-bench Verified benchmark [6]
- The Kimi series is positioned to drive the democratization of AI, with API tools enabling developers to integrate intelligent agents into various applications [8]

Investment Recommendations
- The report suggests focusing on leading AI and computer companies with core innovative capabilities to capture long-term structural growth opportunities [9]
- Notable companies to watch include Google (GOOGL.O) and Microsoft (MSFT.O), which are expected to leverage their AI and cloud-computing positions for future growth [9]
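Pass@1 scores like those cited above are conventionally computed with the standard unbiased pass@k estimator popularized by code-generation benchmarks. A small sketch, assuming n sampled solutions per problem of which c pass the tests:

```python
from math import comb

# Unbiased pass@k estimator: probability that at least one of k draws
# (without replacement) from n samples is one of the c correct ones.
def pass_at_k(n, c, k):
    if n - c < k:
        return 1.0  # every k-subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# with k=1 this reduces to the fraction of correct samples, c/n
print(pass_at_k(10, 3, 1))
```

The report does not state how many samples per problem were drawn for the quoted scores; the estimator is shown only to clarify what a Pass@1 number measures.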
Why is reinforcement-learning research not recommended for graduate students?
自动驾驶之心· 2025-07-21 11:18
Original link: https://www.zhihu.com/question/1900927726795334198

Preface

I haven't answered academic questions in a long time, but half of the grant applications I've been reviewing lately are related to reinforcement learning, so Zhihu keeps recommending RL content to me... so let me talk briefly about reinforcement learning.

If you study reinforcement learning only as far as a master's degree, even at Tsinghua or Peking University, the most important fundamental skill is calling libraries: figuring out which package to call and when is enough. Beyond that, it is about how to arrange and combine components and how to shrink the solution space; for some algorithms, a basic procedural understanding suffices.

If you are doing a PhD, I suggest switching directions. In my view, carving fine details into today's reinforcement learning is a waste of time and life. Of course, if your goal is to publish many papers and land a faculty position, that works; you just may go a long time without producing truly excellent work, and if you are only making a living, that will not bother you.

My impression of reinforcement learning is that it is ancient and primitive. It feels as if I were still holding a ...
Built for embodied learning: after 12 hardware iterations, this bipedal robot platform's stability improved by 300%......
具身智能之心· 2025-07-21 08:24
Core Viewpoint
- TRON1 is a cutting-edge research platform designed for educational and scientific use, featuring a modular design that supports multiple locomotion forms and algorithms to maximize research flexibility [1]

Function Overview
- TRON1 serves as a humanoid-gait development platform, well suited to reinforcement-learning research, and supports external devices for navigation and perception [6][4]
- The platform supports development in C++ and Python, making it accessible to users without C++ experience [6]

Features and Specifications
- The comprehensive perception expansion kit includes:
  - GPU: NVIDIA Ampere architecture with 1024 CUDA Cores and 32 Tensor Cores
  - AI compute: 157 TOPS (sparse) and 78 TOPS (dense)
  - Memory: 16GB LPDDR5 with 102.4 GB/s bandwidth [16]
- TRON1 can integrate various sensors, including LiDAR and depth cameras, to facilitate 3D mapping, localization, navigation, and dynamic obstacle avoidance [13]

Development and Customization
- The SDK and development documentation are well structured, making secondary development easy even for beginners [34]
- Software and model structures can be updated online for convenience [36]

Additional Capabilities
- TRON1 supports voice interaction, including voice wake-up and control, suitable for educational and interactive applications [18]
- The platform can be fitted with robotic arms for various mobile manipulation tasks, supporting both single-arm and dual-leg configurations [11]

Product Variants
- TRON1 is available in standard and EDU versions, both featuring a modular design and similar mechanical parameters, including a maximum payload of approximately 10 kg [26]
Humanoid robot industry chain update
2025-07-21 00:32
Summary of Key Points from the Conference Call

Industry Overview
- The humanoid robot industry is growing rapidly, with many large companies entering the market, including traditional automotive-parts manufacturers, smartphone companies, and internet firms, which accelerates development and the exploration of practical applications [1][8][10]

Company-Specific Insights

Tesla
- Tesla is considering replacing its harmonic gear reducer due to wear under high-intensity use, which may delay the launch of its third-generation robot by 4-6 months, now expected in Q3 or Q4 of this year [1][2][5]
- The company is making hardware adjustments to improve the robot's durability and impact resistance, indicating that the original design's stability was insufficient for long-term use [2][14]
- New gear structures, such as cycloidal pinwheel gears, are being tested, but their maturity and reliability still need validation [13][22]

Yush Robot
- Yush Robot is a leading player in the domestic robot industry, with high product maturity and strong after-sales service, nearing commercialization through software-development partnerships [3][7]

Zhiyuan Company
- Zhiyuan recently acquired a listed company but has not yet triggered a backdoor-listing concept. Its recent demonstration of a robot with a wheeled chassis and dual-arm structure was deemed technically unremarkable [4][6]

Technological Developments
- Core humanoid-robot technologies focus on VLA operation, VLA post-training, and reinforcement learning, aiming to raise operation success rates for commercial applications [1][11]
- The dexterous-hand market is diverging: some companies see reduced orders due to ineffective grasping algorithms, leading many to switch to specialized grippers [12][25][26]

Market Trends
- Component maturity has improved significantly, especially for joint parts such as harmonic gear reducers, but new designs still require extensive testing [13][22]
- The entry of large companies into the humanoid-robot sector is accelerating development and enhancing supply-chain management and ecosystem building [10]

Challenges and Future Outlook
- General-purpose robots face challenges in achieving intelligent capabilities; expectations are that it may take several years before they can enter the household market [32][33]
- Transitional robotic solutions, such as wheeled mobility and specialized grippers, are seen as more feasible in the near term than fully humanoid robots [34]

Additional Insights
- The industry is witnessing a split in the performance of dexterous-hand manufacturers, with some companies thriving while others struggle for lack of effective grasping algorithms [12][25][26]
- Data collection for dexterous hands is challenging due to high precision requirements and immature collection methods, leading to reliance on virtual simulation environments [28]

This summary encapsulates the key points discussed in the conference call, highlighting the current state and future direction of the humanoid robot industry and the specific companies involved.