A Crash Course on Planning for Autonomous Driving Perception Engineers
自动驾驶之心· 2025-08-08 16:04
Core Insights
- The article discusses the evolution and importance of planning modules in autonomous driving, emphasizing that engineers need to understand both traditional and machine-learning-based approaches to address challenges in the field effectively [5][8][10].

Group 1: Importance of Planning
- Understanding planning is crucial for engineers, especially in the context of autonomous driving, because it enables better service to downstream customers and strengthens problem-solving capabilities [8][10].
- The transition from rule-based to machine-learning planning systems will likely involve a long period of coexistence, with the usage ratio shifting gradually from roughly 8:2 to 2:8 [8][10].

Group 2: Planning System Overview
- The planning system in an autonomous vehicle is responsible for generating safe, comfortable, and efficient driving trajectories, relying on inputs from perception outputs [11][12].
- Traditional planning modules consist of global path planning, behavior planning, and trajectory planning, with behavior and trajectory planning often working in tandem [12].

Group 3: Challenges in Planning
- A significant challenge in the planning technology stack is the lack of standardized terminology, which causes confusion in both academic and industrial contexts [15].
- The article highlights the need for a unified approach to behavior planning, as the current lack of consensus on semantic actions limits the effectiveness of planning systems [18].

Group 4: Planning Techniques
- The article outlines three primary tools used in planning: search, sampling, and optimization, each with its own methodologies and applications in autonomous driving [24][41].
- Search methods such as the Dijkstra and A* algorithms are popular for path planning, while sampling methods such as Monte Carlo are used to evaluate many candidate options quickly; a minimal A* sketch follows this summary [25][32].

Group 5: Industrial Practices
- The article distinguishes between decoupled and joint spatiotemporal planning methods, with decoupled solutions being easier to implement but potentially less optimal in complex scenarios [52][54].
- The Apollo EM planner is presented as an example of a decoupled approach, which simplifies the problem by breaking the spatiotemporal problem into two-dimensional path and speed subproblems [56][58].

Group 6: Decision-Making in Autonomous Driving
- Decision-making in autonomous driving focuses on interactions with other road users, addressing the uncertainty and dynamic behavior that complicate planning [68][69].
- Markov Decision Process (MDP) and Partially Observable Markov Decision Process (POMDP) frameworks are essential for handling the probabilistic nature of interactions in driving scenarios; a value-iteration sketch follows this summary [70][74].
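The article only names the planning toolbox (search, sampling, optimization) and the MDP/POMDP framing without giving implementations, so the two snippets below are generic illustrations rather than the article's code. First, a minimal A* search on a small occupancy grid; the grid, the Manhattan heuristic, and the unit step costs are assumptions chosen for brevity.

```python
import heapq

def a_star(grid, start, goal):
    """A* on a 4-connected occupancy grid: 0 = free, 1 = blocked."""
    rows, cols = len(grid), len(grid[0])
    heuristic = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # Manhattan distance
    open_set = [(heuristic(start), 0, start, None)]                  # (f, g, cell, parent)
    came_from, g_cost = {}, {start: 0}
    while open_set:
        _, g, cell, parent = heapq.heappop(open_set)
        if cell in came_from:                    # already expanded with an equal or better cost
            continue
        came_from[cell] = parent
        if cell == goal:                         # reconstruct path goal -> start, then reverse
            path = [cell]
            while came_from[path[-1]] is not None:
                path.append(came_from[path[-1]])
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < g_cost.get((nr, nc), float("inf")):
                    g_cost[(nr, nc)] = ng
                    heapq.heappush(open_set, (ng + heuristic((nr, nc)), ng, (nr, nc), cell))
    return None                                  # no path found

grid = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0]]
print(a_star(grid, (0, 0), (2, 0)))  # [(0,0), (0,1), (0,2), (1,2), (2,2), (2,1), (2,0)]
```

Second, a minimal sketch of value iteration for a discrete MDP, the basic solver behind the MDP framing mentioned in Group 6; the toy lane-change states, transition probabilities, and rewards are invented purely for illustration.

```python
def value_iteration(states, actions, transition, reward, gamma=0.95, tol=1e-6):
    """transition[s][a] -> list of (prob, next_state); reward[s][a] -> float."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(reward[s][a] + gamma * sum(p * V[ns] for p, ns in transition[s][a])
                       for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

# Toy example: stay in lane, probe a gap, and merge; "merged" is the rewarding absorbing state.
states, actions = ["lane", "gap", "merged"], ["keep", "change"]
transition = {
    "lane":   {"keep": [(1.0, "lane")],   "change": [(0.8, "gap"), (0.2, "lane")]},
    "gap":    {"keep": [(1.0, "gap")],    "change": [(0.9, "merged"), (0.1, "lane")]},
    "merged": {"keep": [(1.0, "merged")], "change": [(1.0, "merged")]},
}
reward = {
    "lane":   {"keep": 0.0,  "change": -0.1},
    "gap":    {"keep": -0.05, "change": -0.1},
    "merged": {"keep": 1.0,  "change": 0.0},
}
print(value_iteration(states, actions, transition, reward))
```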
Task-Level Rewards Boost App Agent Reasoning: Taotian Proposes Mobile-R1, a 3B Model That Can Surpass 32B
量子位· 2025-07-20 02:49
Core Insights
- The article discusses the limitations of existing Mobile/APP Agents, which rely primarily on action-level rewards that restrict their adaptability in dynamic environments [1][2].
- A new interactive reinforcement learning framework called Mobile-R1 is proposed, which incorporates task-level rewards to enhance agent adaptability and exploration capabilities [5][30].
- The training process for Mobile-R1 consists of three stages: format fine-tuning, action-level training, and task-level training, which together improve the model's performance [6][31].

Summary by Sections

Existing Limitations
- Current Mobile/APP Agents struggle with real-time adaptability because of their reliance on action-level rewards, making it difficult to handle changing mobile environments [1][2].
- An example illustrates how existing models fail at complex multi-step tasks [3].

Proposed Solution
- A collaboration between Taotian Group's algorithm team and the Future Life Lab introduces a multi-round, task-oriented learning approach that combines online learning with trajectory correction [4].
- Mobile-R1 is designed to use task-level rewards, which are more effective at guiding agents through complex tasks [5].

Training Methodology
- The training process is divided into three stages (a sketch of a group-relative advantage computation follows this summary):
  1. **Format Fine-tuning**: initial adjustment via supervised fine-tuning on high-quality trajectory data [16].
  2. **Action-level Training**: uses group relative policy optimization (GRPO) to evaluate action correctness with action-level rewards [17].
  3. **Task-level Training**: improves generalization and exploration through multi-step, task-level training [18][20].

Experimental Results
- Mobile-R1 demonstrated superior performance across various benchmarks, achieving a task success rate of 49.40%, significantly higher than the best baseline model [26].
- The results indicate that the three-stage training process effectively improves the model's robustness and adaptability, particularly in dynamic environments [29][30].
- The article concludes that Mobile-R1's combination of interactive reinforcement learning and task-level rewards significantly enhances the capabilities of visual-language-model-based mobile agents [30][32].
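The summary names group relative policy optimization (GRPO) and task-level rewards but includes no formulas, so the following is a minimal sketch of how a group-relative advantage might blend per-step action rewards with a trajectory-level task reward. The `task_weight` blend, the array shapes, and the example reward values are assumptions for illustration, not Mobile-R1's actual reward design.

```python
import numpy as np

def group_relative_advantages(action_rewards, task_rewards, task_weight=0.5, eps=1e-8):
    """
    action_rewards: (group_size, num_steps) per-step rewards (e.g. action-format/correctness checks).
    task_rewards:   (group_size,) scalar reward per rollout (did the whole task succeed?).
    Returns one advantage per rollout, normalized within the sampled group, GRPO-style.
    """
    action_rewards = np.asarray(action_rewards, dtype=float)
    task_rewards = np.asarray(task_rewards, dtype=float)
    # Blend the step-level signal (mean over steps) with the trajectory-level task signal.
    total = (1 - task_weight) * action_rewards.mean(axis=1) + task_weight * task_rewards
    # Group-relative baseline: standardize each rollout against the others in the same group.
    return (total - total.mean()) / (total.std() + eps)

# Four rollouts of the same task: only the last one completes the task end to end.
adv = group_relative_advantages(
    action_rewards=[[1, 1, 0], [1, 0, 0], [1, 1, 1], [1, 1, 1]],
    task_rewards=[0, 0, 0, 1],
)
print(adv)  # the successful rollout receives the largest positive advantage
```

In GRPO-style training these normalized advantages would then weight the policy-gradient update for every action in the corresponding rollout, which is how a sparse task-level success signal can still shape individual steps.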
HKUST | End-to-End LiDAR Omnidirectional Obstacle Avoidance System for Quadruped Robots (Unitree G1/Go2 + PPO)
具身智能之心· 2025-06-29 09:51
Core Viewpoint
- The article discusses the Omni-Perception framework developed by a team from the Hong Kong University of Science and Technology, which enables quadruped robots to navigate complex dynamic environments by directly processing raw LiDAR point cloud data for omnidirectional obstacle avoidance [2][4].

Group 1: Omni-Perception Framework Overview
- The Omni-Perception framework consists of three main modules: the PD-RiskNet perception network, a high-fidelity LiDAR simulation tool, and a risk-aware reinforcement learning strategy [4].
- The system takes raw LiDAR point clouds as input, extracts environmental risk features with PD-RiskNet, and outputs joint control signals, forming a complete closed control loop [5].

Group 2: Advantages of the Framework
- Directly using the spatiotemporal information avoids the information loss incurred when converting point clouds into grids or maps, preserving the precise geometric relationships of the original data [7].
- Dynamic adaptability is achieved through reinforcement learning, allowing the robot to optimize obstacle-avoidance strategies for previously unseen obstacle shapes [7].
- Computational efficiency improves because intermediate processing steps are reduced compared with traditional SLAM and planning pipelines [7].

Group 3: PD-RiskNet Architecture
- PD-RiskNet employs a hierarchical risk-perception network that processes near-field and far-field point clouds differently to capture local and global environmental features; a point-cloud preprocessing sketch follows this summary [8].
- Near-field processing uses farthest point sampling (FPS) to reduce data density while retaining key geometric features, and gated recurrent units (GRU) to capture local dynamic changes [8].
- Far-field processing uses average down-sampling to reduce noise and extract spatiotemporal features of the distant environment [8].

Group 4: Reinforcement Learning Strategy
- The obstacle-avoidance task is modeled as an infinite-horizon discounted Markov decision process, with a state space that includes the robot's kinematic information and historical LiDAR point cloud sequences [10].
- The action space directly outputs target joint positions, allowing the policy to learn the mapping from raw sensor inputs to control signals without complex inverse kinematics [11].
- The reward function combines obstacle-avoidance and distance-maximization terms to encourage the robot to seek open paths while penalizing deviations from target speeds [13][14].

Group 5: Simulation and Real-World Testing
- The framework was validated against real LiDAR data collected with the Unitree G1 robot, demonstrating high consistency in point cloud distribution and structural integrity between simulated and real data [21].
- The Omni-Perception simulation tool showed a clear advantage in rendering efficiency, with rendering time growing linearly as the number of environments increased, whereas traditional methods exhibited exponential growth [22].
- Across the tests, the framework achieved a 100% success rate in static-obstacle scenarios and outperformed traditional methods in dynamic environments [26][27].
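The summary describes PD-RiskNet's near-field/far-field split (farthest point sampling plus a GRU for nearby points, average down-sampling for distant points) but gives no code, so the sketch below illustrates only the point-cloud preprocessing stage. The 2 m near/far radius, the sample counts, and the function names are hypothetical, and the downstream GRU/encoder and policy are omitted.

```python
import numpy as np

def farthest_point_sampling(points, k):
    """Greedy FPS: keep k points that are maximally spread out (points is an N x 3 array)."""
    selected = [0]
    dists = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        idx = int(np.argmax(dists))             # point farthest from the current selection
        selected.append(idx)
        dists = np.minimum(dists, np.linalg.norm(points - points[idx], axis=1))
    return points[selected]

def average_downsample(points, group_size):
    """Average every `group_size` consecutive points to suppress far-field noise."""
    n = (len(points) // group_size) * group_size
    return points[:n].reshape(-1, group_size, 3).mean(axis=1)

def split_near_far(points, near_radius=2.0, near_k=256, far_group=8):
    """Hierarchical preprocessing in the spirit of PD-RiskNet: near-field FPS, far-field averaging."""
    ranges = np.linalg.norm(points, axis=1)
    near, far = points[ranges <= near_radius], points[ranges > near_radius]
    near_feat = farthest_point_sampling(near, min(near_k, len(near)))
    far_feat = average_downsample(far, far_group)
    return near_feat, far_feat                  # fed to GRU / feature encoders downstream

cloud = np.random.randn(4096, 3) * 3.0          # fake LiDAR scan centered on the robot
near_feat, far_feat = split_near_far(cloud)
print(near_feat.shape, far_feat.shape)
```

The design intuition matches the summary: averaged far-field points act as a cheap noise filter for coarse global context, while FPS preserves the near-field geometry that matters most for immediate collision risk.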