Reinforcement Learning
Alibaba's Qwen Proposes GSPO, a New Reinforcement Learning Algorithm
news flash· 2025-07-27 15:20
Core Insights
- The article discusses the introduction of the Group Sequence Policy Optimization (GSPO) algorithm by Tongyi Qwen to enhance Reinforcement Learning (RL) capabilities [1]
Group 1
- GSPO defines importance ratios at the sequence level, differentiating it from previous RL algorithms [1]
- The algorithm executes clipping, rewards, and optimization at the sequence level [1]
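The sequence-level importance ratio is the key departure from token-level methods: GSPO's ratio is the length-normalized likelihood ratio of the whole response, and clipping is applied once per sequence rather than per token. A minimal sketch of that objective, assuming per-token log-probabilities are already available (this is an illustration of the published definition, not the Qwen team's training code):

```python
import numpy as np

def gspo_objective(logp_new, logp_old, advantages, eps=0.2):
    """Toy GSPO-style objective for one group of sampled responses.

    logp_new, logp_old: per-token log-probs of each response under the
    current and old policies. advantages: group-normalized advantage
    per response (one scalar per sequence).
    """
    objs = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Sequence-level, length-normalized importance ratio:
        #   s = (pi_new(y|x) / pi_old(y|x)) ** (1 / |y|)
        s = np.exp((np.sum(lp_new) - np.sum(lp_old)) / len(lp_new))
        # Clipping acts on the whole sequence, not on individual tokens.
        objs.append(min(s * adv, float(np.clip(s, 1 - eps, 1 + eps)) * adv))
    return float(np.mean(objs))
```

When the policies agree (`logp_new == logp_old`), the ratio is 1 and the objective reduces to the mean advantage; large sequence-level drifts are clipped away in one step instead of accumulating token by token.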
At the China Internet Conference, why did so many exhibiting AI application companies independently choose the same development model?
Mei Ri Jing Ji Xin Wen· 2025-07-26 16:19
Core Viewpoint
- The China Internet Conference, held from July 23 to 25, showcased numerous AI-related technologies, with many companies opting for an open-source development model, indicating a significant shift in commercial strategies and business models in the tech industry [1][2]
Group 1: Open Source Development
- Open source represents a different technical route compared to closed source, reflecting substantial differences in business strategies and profit distribution [2]
- Companies are choosing open source to enhance collaboration and innovation, allowing for faster development and a more vibrant ecosystem [6]
Group 2: Robotics and AI Applications
- A robotics company showcased bipedal robots that perform dynamic movements, primarily focusing on providing open interfaces for schools and developers to customize functionalities [3][5]
- The robots are designed with 3D-printed shells and electric motors, emphasizing the importance of motor torque in robotic performance [5]
- The company aims to support humanoid robot manufacturers, including notable firms like Boston Dynamics, by providing foundational technology [5]
Group 3: Xiaomi's Open Source Initiatives
- Xiaomi presented its Vela operating system, which is now fully open source, aimed at enhancing the development efficiency of other manufacturers and promoting interoperability among devices [6]
- The Xiaomi AIoT training box, also open source, is designed for educational purposes, allowing partner institutions to implement the system as part of collaborative projects [9]
- A holographic digital human showcased at the event utilizes DeepSeek's open-source code, allowing for cost-effective deployment without licensing fees, with expenses primarily related to computational power for training [9]
A New Two-Stage End-to-End SOTA! HKUST's FiM: Rethinking Trajectory Prediction from a Planning Perspective (ICCV'25)
自动驾驶之心· 2025-07-26 13:30
Core Viewpoint
- The article presents a novel approach to trajectory prediction in autonomous driving, emphasizing a "First Reasoning, Then Forecasting" strategy that integrates intention reasoning to enhance prediction accuracy and reliability [2][4][47]
Group 1: Methodology
- The proposed method introduces an intention reasoner based on a query-centric Inverse Reinforcement Learning (IRL) framework, which explicitly incorporates behavioral intentions as spatial guidance for trajectory prediction [2][5][47]
- A bidirectional selective state space model (Bi-Mamba) is developed to improve the accuracy and confidence of trajectory predictions by capturing sequential dependencies in trajectory states [9][47]
- The approach utilizes a grid-level graph representation to model participant behavior, formalizing the task as a Markov Decision Process (MDP) to define future intentions [5][6][21]
Group 2: Experimental Results
- Extensive experiments on large-scale datasets such as Argoverse and nuScenes demonstrate that the proposed method significantly enhances trajectory prediction confidence, achieving competitive performance compared to state-of-the-art models [2][33][36]
- The method outperforms existing models in various metrics, including Brier score and minFDE6, indicating its robustness in complex driving scenarios [33][35][36]
- The integration of a spatial-temporal occupancy grid map (S-T OGM) enhances the model's ability to predict future interactions among participants, further improving prediction quality [9][39]
Group 3: Contributions
- The article highlights the critical role of intention reasoning in motion prediction, establishing a promising baseline model for future research in trajectory prediction [47]
- The introduction of a reward-driven intention reasoning mechanism provides valuable prior information for trajectory generation, addressing the inherent uncertainties in driving behavior [8][47]
- The work emphasizes the potential of reinforcement learning paradigms in modeling driving behavior, paving the way for advancements in autonomous driving technology [5][47]
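The grid-level MDP framing above can be made concrete with a toy example. The sketch below runs soft (MaxEnt-style) value iteration over a reward grid to produce a policy whose action probabilities encode "intentions" toward high-reward regions; FiM's actual intention reasoner is a learned, query-centric IRL module, not this tabular loop, and the reward values here are hypothetical:

```python
import numpy as np

def soft_value_iteration(reward, gamma=0.95, n_iters=100):
    """Toy MaxEnt-style soft value iteration on a 2D grid MDP.

    reward: (H, W) array of per-cell rewards (hypothetical stand-in for
    the rewards FiM recovers via IRL). Returns a soft state value V and
    a stochastic policy over 4-neighbour moves.
    """
    H, W = reward.shape
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
    V = np.zeros((H, W))
    for _ in range(n_iters):
        Q = np.full((H, W, len(moves)), -1e9)   # invalid moves stay ~-inf
        for i in range(H):
            for j in range(W):
                for a, (di, dj) in enumerate(moves):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < H and 0 <= nj < W:
                        Q[i, j, a] = reward[ni, nj] + gamma * V[ni, nj]
        m = Q.max(axis=-1)                       # stable log-sum-exp
        V = m + np.log(np.exp(Q - m[..., None]).sum(axis=-1))
    policy = np.exp(Q - Q.max(axis=-1, keepdims=True))
    policy /= policy.sum(axis=-1, keepdims=True)
    return V, policy

# A goal cell in one corner induces intentions that point toward it.
r = np.full((4, 4), -0.1)
r[3, 3] = 1.0
V, pi = soft_value_iteration(r)
```

The soft-max backup (rather than a hard max) is what keeps multiple plausible intentions alive, matching the article's emphasis on handling the inherent uncertainty of driving behavior; the Bi-Mamba trajectory decoder is a separate component not sketched here.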
A Developer Treat! One Machine Covers Humanoid Motion Control, Reinforcement Learning, and VLN/VLA
具身智能之心· 2025-07-25 07:11
Core Viewpoint
- TRON1 is a cutting-edge research platform designed for educational and scientific purposes, featuring a modular design that supports multiple robotic forms and algorithms, catering to diverse research needs [1]
Group 1: Product Features
- TRON1 supports humanoid gait development and is well suited to reinforcement learning research, with the EDU version allowing for external camera integration for navigation and perception tasks [6][24]
- The platform supports development in both C++ and Python, making it accessible for users without C++ knowledge [6]
- It features a "three-in-one" modular design that allows for quick switching between bipedal, point-foot, and wheeled locomotion [1]
Group 2: Technical Specifications
- The platform is compatible with major simulation platforms like NVIDIA Isaac, MuJoCo, and Gazebo, enhancing validation efficiency and lowering research barriers [9]
- TRON1 can be equipped with a robotic arm for various mobile manipulation tasks, supporting both single-arm and dual-foot configurations [11]
- It integrates LiDAR and depth cameras for 3D mapping, localization, navigation, and dynamic obstacle avoidance [13]
Group 3: Hardware and Performance
- The TRON1 standard version and EDU version share similar mechanical parameters, with a payload limit of approximately 10 kg and a maximum speed of 5 m/s in wheeled locomotion [26]
- The platform is powered by an 8-core Arm Cortex-A78AE CPU and features an NVIDIA Ampere-architecture GPU with AI computing power of 157 TOPS (sparse) and 78 TOPS (dense) [16][19]
- The battery supports a maximum power draw of 1000 W, with a runtime of over 2 hours under rated conditions [26]
Group 4: User Support and Development
- Comprehensive user manuals and development guides are provided, ensuring ease of use and support for new users [29][33]
- The platform offers one year of after-sales service from acceptance, with paid maintenance and parts support available thereafter [40]
NVIDIA's Latest! ThinkAct: Few-Shot Adaptation and Long-Horizon Planning in Complex Embodied Tasks
具身智能之心· 2025-07-24 09:53
Core Insights
- The article introduces ThinkAct, a dual-system framework designed to enhance the reasoning capabilities of multi-modal large language models (MLLMs) in physical environments by connecting high-level reasoning with low-level action execution [4][9][12]
- ThinkAct aims to address the limitations of existing VLA models, which struggle with long-term planning and adapting to complex tasks, by utilizing reinforced visual latent planning [4][6][9]
Group 1: Framework and Methodology
- ThinkAct employs a structured approach to VLA reasoning tasks, where the model receives visual observations and textual instructions to predict actions, effectively linking abstract planning with low-level control [12][21]
- The framework utilizes reinforcement learning to enhance the reasoning capabilities of MLLMs, encouraging them to generate low-level actions after reasoning through the task [13][19]
- A novel action-aligned visual feedback mechanism is introduced to capture long-term goals and encourage visual associations during the planning process [14][18]
Group 2: Performance Evaluation
- ThinkAct demonstrates superior performance in various robotic manipulation tasks, achieving a top success rate of 84.4% on the LIBERO benchmark and outperforming models such as DiT-Policy and CoT-VLA [25][26]
- In the SimplerEnv evaluation, ThinkAct outperformed baseline action models by significant margins, achieving overall scores of 71.5%, 65.1%, and 43.8% across different settings [25]
- The framework also excels in embodied reasoning tasks, showing advantages in long-term and multi-step planning, as evidenced by its performance on the EgoPlan-Bench2 and RoboVQA benchmarks [26][27]
Group 3: Qualitative Insights
- The article provides qualitative examples illustrating ThinkAct's reasoning process and task execution, showcasing its ability to decompose instructions into meaningful sub-goals and visualize planning trajectories [30][31]
- The framework's reinforcement learning adjustments significantly enhance its reasoning capabilities, allowing it to better understand tasks and environments compared to cold-start models [31][32]
Group 4: Adaptability and Error Correction
- ThinkAct demonstrates effective few-shot adaptation, successfully generalizing to unseen environments and new skills with minimal demonstration samples [35][37]
- The framework can detect execution errors and perform self-correction, using structured reasoning to reconsider tasks and generate corrective plans when faced with failures [37][38]
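To give a feel for what "action-aligned visual feedback" might reward, here is a deliberately simplified toy: a goal term (do the plan's endpoints match a demonstration?) plus a trajectory term (does the whole path agree?). The weighting, the 2D-point representation, and the exact distance terms are all illustrative assumptions, not ThinkAct's published reward:

```python
import numpy as np

def action_aligned_reward(pred_traj, demo_traj, w_goal=0.5, w_traj=0.5):
    """Toy sketch of an action-aligned visual feedback reward.

    pred_traj, demo_traj: (T, 2) arrays of 2D end-effector positions in
    [0, 1] image coordinates (a hypothetical simplification of the
    visual plan ThinkAct scores during RL).
    """
    # Goal term: agreement of the plan's start and end points with the demo.
    goal = 1.0 - 0.5 * (np.linalg.norm(pred_traj[0] - demo_traj[0])
                        + np.linalg.norm(pred_traj[-1] - demo_traj[-1]))
    # Trajectory term: mean pointwise agreement along the path.
    traj = 1.0 - np.mean(np.linalg.norm(pred_traj - demo_traj, axis=1))
    return w_goal * max(goal, 0.0) + w_traj * max(traj, 0.0)
```

A dense, alignment-based signal like this is what lets RL shape the latent plan toward long-term goals instead of rewarding only final task success.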
The Future of AI May Be Hidden in Our Brain's Evolutionary Code | 红杉Library
红杉汇· 2025-07-24 06:29
Core Viewpoint
- The article discusses the evolution of the human brain and its implications for artificial intelligence (AI), emphasizing that understanding the brain's evolutionary breakthroughs may unlock new advancements in AI capabilities [2][7]
Summary by Sections
Evolutionary Breakthroughs
- The evolution of the brain is categorized into five significant breakthroughs that can be linked to AI development [8]
1. **First Breakthrough - Reflex Action**: This initial function allowed primitive brains to distinguish between good and bad stimuli using a few hundred neurons [8]
2. **Second Breakthrough - Reinforcement Learning**: This advanced the brain's ability to quantify the likelihood of achieving goals, paralleling how AI learns through rewards [8]
3. **Third Breakthrough - Neocortex Development**: The emergence of the neocortex enabled mammals to plan and simulate actions mentally, akin to slow thinking in AI models [9]
4. **Fourth Breakthrough - Theory of Mind**: This allowed primates to understand others' intentions and emotions, which is still a developing area for AI [10]
5. **Fifth Breakthrough - Language**: Language as a learned social system has allowed humans to share complex knowledge, a capability that AI is beginning to grasp [11]
AI Development
- Current AI systems have made strides in areas like language understanding but still lag in aspects such as emotional intelligence and self-planning [10][11]
- The article illustrates the potential future of AI through a hypothetical robot's evolution, showcasing how it could develop from simple reflex actions to complex emotional understanding and communication [13][14]
Historical Context
- The narrative emphasizes that significant evolutionary changes often arise from unexpected events, suggesting that future breakthroughs in AI may similarly emerge from unforeseen circumstances [15][16]
Large Models Achieve Gold-Medal-Level Results at the International Mathematical Olympiad
Ke Ji Ri Bao· 2025-07-24 00:07
Core Insights
- Google DeepMind and OpenAI have both announced that their AI models achieved gold-medal-level results at the recent International Mathematical Olympiad (IMO), marking a significant milestone in AI's mathematical reasoning capabilities [1]
- Last year, DeepMind's AI models "AlphaProof" and "AlphaGeometry" achieved silver-medal-level results, indicating a progression in AI performance [1]
- OpenAI's new AI system solved 5 of the 6 IMO problems in 4.5 hours, and DeepMind's "Gemini Deep Think" system achieved the same result shortly after [1]
Group 1
- The IMO is considered a benchmark for evaluating AI systems' mathematical reasoning abilities [1]
- Both teams had their models reason and write proofs in natural language, unlike previous systems that were purpose-built for the IMO and used the formal proof language "Lean" [1]
- DeepMind's developers explained that reinforcement learning, a branch of machine learning, is key to their success in AI applications, as with their earlier "AlphaZero" [1]
Group 2
- Mathematician Terence Tao expressed excitement about the progress but emphasized the need for reproducible research data to support these claims [2]
- IMO gold medalist Joseph Meyer noted that while natural-language proofs have readability advantages, lengthy arguments may complicate verification [2]
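The contrast between natural-language and formal proofs is worth making concrete. In a formal language like Lean, every step is machine-checked, so verification is trivial but writing is hard; a minimal Lean 4 example using the standard-library lemma `Nat.add_comm`:

```lean
-- A trivially formalized statement: the proof term is checked by the
-- compiler, so there is nothing left for a human referee to verify.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

A natural-language IMO proof carries no such guarantee, which is exactly the verification concern raised in Group 2.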
OpenAI Officially Reveals the Principles Behind ChatGPT Agent: Reinforcement Learning Lets the Model Autonomously Explore the Best Tool Combinations
量子位· 2025-07-23 10:36
Core Insights
- The article discusses the technical details and implications of OpenAI's newly launched ChatGPT Agent, marking a significant step in the development of intelligent agents [1][2]
Group 1: ChatGPT Agent Overview
- ChatGPT Agent consists of four main components: Deep Research, Operator, and additional tools such as a terminal and image generation [3][9]
- The integration of Deep Research and Operator was driven by user demand for a more versatile tool that could handle both research and visual-interaction tasks [6][11]
Group 2: Training Methodology
- The training method integrates all tools into a virtual machine environment, allowing the model to autonomously explore the best tool combinations through reinforcement learning [12]
- The model learns to switch between tools seamlessly, enhancing its ability to complete tasks efficiently without explicit instructions on tool usage [13][14]
Group 3: Team Structure and Collaboration
- The ChatGPT Agent team is a merger of the Deep Research and Operator teams, consisting of around 20 to 35 members who collaborated closely to complete the project in a few months [19][20]
- The team emphasizes a user-scenario-driven approach, with application engineers participating in model training and researchers involved in deployment [21][22]
Group 4: Challenges and Future Directions
- The main challenges during training included stability issues and the need for robustness against external factors like website downtime and API limitations [24]
- Future development aims at a general-purpose super agent capable of handling a wide range of tasks, with a focus on enhancing adaptability and user-feedback integration [25][26]
Group 5: Security Measures
- The team has implemented multi-layered security measures to address potential risks, including monitoring for abnormal behavior and requiring user confirmation for sensitive actions [27]
- Special attention is given to biological risks, ensuring that the agent cannot be misused for harmful purposes [24][27]
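The core idea of letting reward, rather than explicit instruction, determine tool choice can be illustrated with a toy bandit. This is emphatically not OpenAI's training setup (which runs full multi-step tool-use episodes inside a virtual machine); the tool names and reward function below are hypothetical:

```python
import random

def learn_tool_preferences(tools, simulate_task, episodes=2000, eps=0.1):
    """Toy epsilon-greedy bandit sketch of RL-driven tool selection.

    simulate_task(tool) returns a reward for attempting the task with
    that tool; the policy discovers the best tool purely from reward.
    """
    q = {t: 0.0 for t in tools}        # running value estimate per tool
    n = {t: 0 for t in tools}
    for _ in range(episodes):
        if random.random() < eps:      # explore an arbitrary tool
            t = random.choice(tools)
        else:                          # exploit the current best estimate
            t = max(tools, key=lambda x: q[x])
        r = simulate_task(t)           # reward: did the tool succeed?
        n[t] += 1
        q[t] += (r - q[t]) / n[t]      # incremental mean update
    return max(tools, key=lambda x: q[x])
```

The real system faces a far harder sequential version of this problem (tool *combinations* over long horizons), but the principle is the same: no one tells the model which tool to use; the reward signal does.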
A Comprehensive Long-Form Summary of End-to-End Autonomous Driving
自动驾驶之心· 2025-07-23 09:56
Core Viewpoint
- The article surveys the current state of end-to-end autonomous driving algorithms, comparing them with traditional pipelines and highlighting their advantages and limitations [1][3][53]
Summary by Sections
Traditional vs. End-to-End Algorithms
- Traditional autonomous driving algorithms follow a pipeline of perception, prediction, and planning, where each module has distinct inputs and outputs [3]
- End-to-end algorithms take raw sensor data as input and directly output path points, simplifying the process and reducing error accumulation [3][5]
- Traditional algorithms are easier to debug and offer some interpretability, but they suffer from cumulative errors because the perception and prediction modules cannot be made fully accurate [3][5]
Limitations of End-to-End Algorithms
- End-to-end algorithms face challenges such as limited ability to handle corner cases, as they rely heavily on data-driven methods [7][8]
- The use of imitation learning in these algorithms can make it difficult to learn optimal ground truth and to handle exceptional cases [53]
- Current end-to-end paradigms include imitation learning (behavior cloning and inverse reinforcement learning) and reinforcement learning, with evaluation methods categorized as open-loop or closed-loop [8]
Current Implementations
- The ST-P3 algorithm is highlighted as an early work on end-to-end autonomous driving, using a framework that includes perception, prediction, and planning modules [10][11]
- Innovations in ST-P3 include a perception module with an ego-centric cumulative alignment technique and a prediction module employing a dual-path prediction mechanism [11][13]
- The planning phase of ST-P3 refines predicted trajectories by incorporating traffic-light information [14][15]
Advanced Techniques
- The UniAD system employs a full Transformer framework for end-to-end autonomous driving, integrating multiple tasks to enhance performance [23][25]
- The TrackFormer framework focuses on the collaborative updating of track queries and detect queries to improve prediction accuracy [26]
- The VAD (Vectorized Autonomous Driving) method introduces vectorized representations for better structural information and faster trajectory-planning computation [32][33]
Future Directions
- End-to-end algorithms still rely primarily on imitation learning frameworks, whose inherent limitations need further exploration [53]
- The introduction of more constraints and multi-modal planning methods aims to address trajectory-prediction instability and improve model performance [49][52]
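Since most of the systems surveyed bottom out in behavior cloning, the core training step is worth seeing explicitly: regress the planner's predicted waypoints onto the expert's future trajectory. A minimal sketch with a toy linear "planner" (real systems use a Transformer over camera/LiDAR features, but the imitation loss is the same idea):

```python
import numpy as np

def behavior_cloning_step(W, feats, expert_wps, lr=0.01):
    """One gradient step of behavior cloning on a linear planner.

    feats: (B, D) scene features; expert_wps: (B, K) flattened expert
    waypoints; W: (D, K) planner weights. Returns updated W and the loss.
    """
    pred = feats @ W                          # predicted waypoints
    err = pred - expert_wps                   # imitation error
    loss = float(np.mean(err ** 2))           # L2 imitation loss
    grad = 2 * feats.T @ err / len(feats)     # d(loss)/dW
    return W - lr * grad, loss
```

This directness is both the appeal and the weakness the article notes: the loss only ever pulls the policy toward demonstrated behavior, so corner cases absent from the data are never explicitly penalized.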
An In-Depth Research Report on the Quark Health Model Surfaces: A First in China! Inside the Deep Engineering Behind a Chief-Physician-Level "AI Brain"
机器之心· 2025-07-23 08:57
Core Insights
- The Quark Health Model has passed assessments in 12 core medical disciplines, making it the first AI model in China to achieve this milestone and demonstrating its advanced capabilities in the healthcare sector [1][3]
Group 1: Research Summary
- Building high-performance reasoning models for healthcare remains challenging despite rapid advances in general AI models; the Quark Health Model establishes a comprehensive process that improves performance and interpretability by clearly defining data sources and learning methods [3][5]
- The team emphasizes high-quality thinking data (Chain-of-Thought, CoT) as the foundational material for strengthening the model's reasoning through reinforcement learning [5][6]
Group 2: Data Production Lines
- The model employs two parallel data production lines, one for verifiable data and one for non-verifiable data, ensuring a systematic approach to data quality and model training [6][17]
- The first production line covers cold-start data and model fine-tuning, using high-quality data generated by state-of-the-art language models and then validated by medical professionals for accuracy and reliability [19][24]
Group 3: Reinforcement Learning and Training
- The reinforcement learning phase is critical for enhancing the model's reasoning, focusing on generating diverse, high-quality outputs through iterative training and data selection [24][26]
- The training process incorporates various mechanisms to evaluate and improve reasoning quality, including preference reward models and verification systems that check the accuracy and relevance of outputs [33][38]
Group 4: Quality Assessment and Challenges
- The model addresses the complexity of multi-solution, multi-path scenarios in healthcare with an evaluation system that recognizes the value of diverse reasoning paths and outputs [31][32]
- Training includes strategies to mitigate reward-hacking ("cheating") behaviors, ensuring that outputs are not only structurally sound but also medically accurate and reliable [40][42]
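The data-selection step the report describes (generate many CoT candidates, verify, keep only high-quality ones while preserving diverse reasoning paths) can be sketched as a small rejection-sampling filter. The field names and the single `verify` function are illustrative assumptions; in the actual pipeline this role is played by reward models plus clinician review, not one scoring function:

```python
def select_cot_samples(candidates, verify, min_score=0.8):
    """Toy rejection-sampling filter for chain-of-thought (CoT) data.

    candidates: dicts with 'question', 'cot', and 'answer' keys
    (hypothetical schema). verify(candidate) returns a [0, 1] quality
    score. Exact duplicates are dropped, but distinct reasoning paths
    for the same question are deliberately kept.
    """
    kept, seen = [], set()
    for c in candidates:
        key = (c["question"], c["cot"])
        if key in seen:                 # drop exact duplicate samples
            continue
        score = verify(c)
        if score >= min_score:          # keep only high-quality CoT
            kept.append({**c, "score": score})
            seen.add(key)
    return kept
```

Deduplicating on `(question, cot)` rather than `(question, answer)` is the detail that matches the report's point about multi-path scenarios: two different valid reasoning chains to the same answer both survive the filter.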