A survey of nearly 300 works! The development of manipulation tasks viewed through "high-level planning and low-level control"
具身智能之心· 2026-01-06 00:32
Core Insights
- The article surveys the transformative advances in robotic manipulation driven by the rapid development of visual, language, and multimodal learning, emphasizing the role of large foundation models in enhancing robots' perception and semantic representation capabilities [1][2].

Group 1: High-Level Planning
- High-level planning clarifies action intentions, organizes action sequences, and allocates environmental attention, providing structured guidance for low-level execution [4].
- Its core components are task decomposition and decision guidance, integrating multimodal information to answer "what to do" and "in what order" [4].
- Task planning based on large language models (LLMs) maps natural language to task steps, with methods like SayCan and Grounded Decoding improving execution skill selection and planning [5].
- Multimodal large language models (MLLMs) break the limitation of text-only input by integrating visual and language reasoning, with models like PaLM-E and VILA demonstrating strong performance on embodied tasks [8].
- Code generation techniques convert plans into executable programs, improving the precision of language-based plans through methods like Code as Policies and Demo2Code [9].
- Motion planning uses LLMs and VLMs to generate continuous motion targets, linking high-level reasoning with low-level trajectory optimization [10].
- Affordance learning establishes intrinsic associations between perception and action across geometric, visual, semantic, and multimodal dimensions [11].
- 3D scene representation turns environmental perception into structured action proposals, bridging perception and action through techniques like Gaussian splatting [12].

Group 2: Low-Level Learning Control
- Low-level control translates high-level plans into precise physical actions, addressing the "how to do" aspect of robotic manipulation [14].
- Learning strategies for skill acquisition fall into three main types, including pre-training and model-free reinforcement learning [16].
- Input modeling defines how robots perceive the world, emphasizing the integration of multimodal signals through reinforcement learning and imitation learning [18].
- Visual-action models use both 2D and 3D visual inputs to enhance action generation, while vision-language-action models integrate semantic, spatial, and temporal information [19].
- Additional modalities such as tactile and auditory signals improve robustness in contact-rich manipulation scenarios [20].

Group 3: Challenges and Future Directions
- Despite significant technological advances, robotic manipulation faces four core challenges: the lack of universal architectures, data and simulation bottlenecks, insufficient multimodal physical interaction, and safety and collaboration issues [23][27][28][29].
- Future research directions include developing a "robotic brain" with flexible modal interfaces, establishing autonomous data-collection mechanisms, strengthening multimodal physical interaction, and ensuring safety in human-robot collaboration [30].
- The review calls for a unified framework integrating high-level planning and low-level control, with a focus on overcoming data-efficiency, physical-interaction, and safety-collaboration bottlenecks so that robotic manipulation can move from the laboratory to real-world applications [31].
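The SayCan-style task planning mentioned above pairs an LLM's next-step preference with an affordance estimate of what the robot can actually do right now. A minimal sketch of that scoring loop, with hard-coded stand-ins for both the LLM and the affordance model (the instruction, skill names, and scores are illustrative, not from the survey):

```python
# SayCan-style planning sketch: pick the skill maximizing
# (LLM next-step likelihood) x (affordance/feasibility score).
# Both scorers are toy stand-ins, not real models.

def llm_score(instruction, history, skill):
    """Stand-in for an LLM's probability of `skill` as the next step."""
    preferred = {
        "clean up the spill": ["find sponge", "pick up sponge", "wipe table"],
    }
    plan_steps = preferred.get(instruction, [])
    if len(history) < len(plan_steps) and skill == plan_steps[len(history)]:
        return 0.9  # the step the "LLM" would suggest next
    return 0.05     # everything else is unlikely

def affordance_score(skill, state):
    """Stand-in for a value function: can this skill succeed right now?"""
    return 1.0 if skill in state["feasible"] else 0.0

def plan(instruction, skills, state, max_steps=5):
    """Greedily chain skills until no candidate is confidently executable."""
    history = []
    for _ in range(max_steps):
        scored = [(llm_score(instruction, history, s) * affordance_score(s, state), s)
                  for s in skills]
        best_score, best_skill = max(scored)
        if best_score < 0.5:  # no step is both likely and feasible
            break
        history.append(best_skill)
    return history

skills = ["find sponge", "pick up sponge", "wipe table", "open fridge"]
state = {"feasible": {"find sponge", "pick up sponge", "wipe table"}}
print(plan("clean up the spill", skills, state))
# -> ['find sponge', 'pick up sponge', 'wipe table']
```

The product of the two scores is the key design choice: the LLM alone might propose a sensible but currently infeasible step, while the affordance term grounds the choice in the robot's actual state.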
Policy learning boosts LLM inference efficiency: MIT and Google teams propose a new asynchronous parallel generation paradigm
机器之心· 2025-05-21 04:00
Core Insights
- The article covers research on asynchronous generation in large language models (LLMs) by MIT and Google, highlighting the transition from traditional sequential decoding to a more efficient parallel generation approach [5][25].

Group 1: Asynchronous Generation Paradigm
- The asynchronous generation paradigm decodes semantically independent content blocks in parallel, achieving a geometric-mean speedup of 1.21x to 1.93x over traditional methods, with quality changes ranging from +2.2% to -7.1% [4][21].
- The research introduces PASTA-LANG, a markup language designed specifically for asynchronous generation, built around three core tags: <promise/>, <async>, and <sync/> [8][10].

Group 2: PASTA System and Training
- The PASTA system is trained in two stages: supervised fine-tuning on a dataset annotated with PASTA-LANG tags, followed by preference optimization via policy learning [16][18].
- The PASTA model improves both speed and output quality, adaptively choosing the best asynchronous generation strategy based on content characteristics [6][21].

Group 3: Performance and Scalability
- Experimental results show PASTA balances performance and quality, delivering substantial speedups even when quality is prioritized [23].
- The method scales well: continued preference optimization keeps improving model performance, indicating a sustainable path for efficiency gains [23][24].
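The core idea summarized above can be illustrated with a toy scheduler: chunks the planner marks as independent are decoded concurrently, then merged in order at a sync point. This is a simplified sketch of the concept only; the real PASTA-LANG tag semantics and decoder are far richer, and `decode_chunk` here is a stand-in for the model:

```python
# Toy sketch of asynchronous parallel generation: semantically
# independent chunks are decoded concurrently and joined in order
# at an implicit sync point. The "decoder" is a stand-in function.
from concurrent.futures import ThreadPoolExecutor

def decode_chunk(prompt):
    """Stand-in for the LLM decoding one independent chunk."""
    return f"[text for: {prompt}]"

def generate(steps):
    """`steps` is a list of ("async", [prompts...]) or ("seq", prompt).
    Async prompts run concurrently; results are merged in plan order."""
    parts = []
    with ThreadPoolExecutor() as pool:
        for kind, payload in steps:
            if kind == "async":
                # pool.map preserves input order, so the merged text
                # reads the same as a sequential decode would.
                parts.extend(pool.map(decode_chunk, payload))
            else:
                parts.append(decode_chunk(payload))
    return " ".join(parts)

steps = [
    ("seq", "introduction"),
    ("async", ["pros section", "cons section"]),  # independent -> parallel
    ("seq", "conclusion"),
]
print(generate(steps))
```

The speedup in the paper comes from exactly this structure: wall-clock time for the async block is bounded by its slowest chunk rather than the sum of all chunks, while the ordered merge keeps the output identical in structure to sequential generation.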