A Survey of Nearly 300 Papers! Tracing the Development of Manipulation Tasks Through "High-Level Planning and Low-Level Control"

Core Insights
- The article discusses how rapid progress in vision, language, and multimodal learning is transforming robotic manipulation, and emphasizes the role of large foundation models in strengthening robots' perception and semantic representation capabilities [1][2].

Group 1: High-Level Planning
- High-level planning clarifies action intentions, organizes action sequences, and allocates attention over the environment, providing structured guidance for low-level execution [4].
- Its core components are task decomposition and decision guidance, which integrate multimodal information to answer "what to do" and "in what order" [4].
- Task planning based on large language models (LLMs) maps natural language to task steps, with methods like SayCan and Grounded Decoding improving skill selection and planning [5].
- Multimodal large language models (MLLMs) overcome the limitations of text-only input by combining visual and language reasoning, with models like PaLM-E and VILA demonstrating superior performance in embodied tasks [8].
- Code generation techniques convert plans into executable programs, improving the precision of language-based plans through methods like Code as Policies and Demo2Code [9].
- Motion planning uses LLMs and VLMs to generate continuous motion targets, linking high-level reasoning with low-level trajectory optimization [10].
- Affordance learning focuses on establishing intrinsic associations between perception and action across geometric, visual, semantic, and multimodal dimensions [11].
- 3D scene representation transforms environmental perception into structured action proposals, bridging perception and action through techniques like Gaussian splatting [12].

Group 2: Low-Level Learning Control
- Low-level control translates high-level plans into precise physical actions, addressing the "how to do it" aspect of robotic manipulation [14].
- Learning strategies for skill acquisition fall into three main types, among them pre-training and model-free reinforcement learning [16].
- Input modeling defines how robots perceive the world, emphasizing the integration of multimodal signals through reinforcement learning and imitation learning [18].
- Vision-action models use both 2D and 3D visual inputs to improve action generation, while vision-language-action models integrate semantic, spatial, and temporal information [19].
- Additional modalities such as tactile and auditory signals improve robustness in contact-rich manipulation scenarios [20].

Group 3: Challenges and Future Directions
- Despite significant technological advances, robotic manipulation faces four core challenges: the lack of universal architectures, data and simulation bottlenecks, insufficient multimodal physical interaction, and safety and collaboration issues [23][27][28][29].
- Future research directions include developing a "robotic brain" with flexible modality interfaces, establishing autonomous data collection mechanisms, enhancing multimodal physical interaction, and ensuring safety in human-robot collaboration [30].
- The review emphasizes the need for a unified framework that integrates high-level planning and low-level control, focused on overcoming bottlenecks in data efficiency, physical interaction, and safe collaboration so that robotic manipulation can move from laboratory settings to real-world applications [31].
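
To make the split between high-level planning (Group 1) and low-level control (Group 2) more concrete, here is a minimal sketch, not taken from the survey. It assumes a planner that scores each candidate skill by how relevant it is to the instruction, a separate feasibility estimate for the current scene, and a selection rule that multiplies the two, loosely in the spirit of skill selection in methods like SayCan. The names `Skill`, `llm_score`, `affordance`, `select_skill`, and `execute`, as well as all the numbers, are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    llm_score: float   # hypothetical: relevance of the skill to the instruction, in [0, 1]
    affordance: float  # hypothetical: chance the skill succeeds in the current scene, in [0, 1]


def select_skill(skills):
    """High-level step: pick the skill with the best relevance x feasibility product."""
    if not skills:
        raise ValueError("no candidate skills")
    return max(skills, key=lambda s: s.llm_score * s.affordance)


def execute(skill):
    """Low-level step: a stand-in that returns a label instead of sending motor commands."""
    return f"executing:{skill.name}"


# Illustrative values only; a real system would obtain them from a language
# model and a learned value function or affordance model.
candidates = [
    Skill("pick up sponge", llm_score=0.6, affordance=0.9),
    Skill("pick up apple", llm_score=0.8, affordance=0.1),
    Skill("go to table", llm_score=0.3, affordance=0.95),
]

chosen = select_skill(candidates)
print(execute(chosen))  # prints "executing:pick up sponge"
```

In this toy setup the skill the planner rates most relevant ("pick up apple") is not selected, because its feasibility in the current scene is low. That illustrates why the summary describes high-level planning as guidance that still has to be grounded in what the low-level controller can actually carry out.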