原力灵机 Proposes ManiAgent! It Can "Act," "Think," and Even "Collect Data"!
具身智能之心· 2025-10-20 10:00
Core Insights
- The article introduces ManiAgent, an agentic framework for general robotic manipulation tasks that addresses the limitations of existing Vision-Language-Action (VLA) models in complex reasoning and long-horizon task planning [1][2][26].

Group 1: Framework Overview
- ManiAgent consists of multiple agents that collaborate on environment perception, sub-task decomposition, and action generation, enabling it to handle complex manipulation scenarios efficiently [2][10].
- The framework relies on four key techniques: tool invocation, context engineering, real-time optimization, and automated data collection. Together they form a complete pipeline from perception to action execution [8][12].

Group 2: Performance Metrics
- On the SimplerEnv benchmark, ManiAgent achieved a task success rate of 86.8%; in real-world pick-and-place tasks, the success rate reached 95.8% [2][10][28].
- These success rates indicate that ManiAgent can serve as an effective automated data collection tool: models trained on the data it generates can match the performance of models trained on manually annotated datasets [2][10].

Group 3: Methodology
- The framework includes four types of agents:
1. Scene perception agent, which uses vision-language models to generate task-relevant scene descriptions [11].
2. Reasoning agent, which uses large language models to evaluate the task state and propose achievable sub-tasks [11].
3. Object-level perception agent, which identifies target objects and extracts the detailed information needed for action generation [11].
4. Controller agent, which generates executable action sequences from the sub-task description and object details [11].

Group 4: Data Collection and Optimization
- The automated data collection system is designed to run with minimal human intervention, significantly reducing labor costs while still producing high-quality data for VLA model training [12][21].
- The framework incorporates a context processing mechanism that keeps the information passed between agents relevant to the task, alongside a caching mechanism that reduces action generation latency [12][17].

Group 5: Experimental Results
- In the SimplerEnv simulation environment, the average success rate across tasks was 86.8% [22][28].
- Real-world experiments with the WidowX 250S robotic arm covered a range of tasks; the reported success rates (95.8% on pick-and-place) indicate that the framework generalizes across different operating contexts [25][28].
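
The four-agent loop from Group 3 and the caching idea from Group 4 can be illustrated with a toy sketch. This is not ManiAgent's implementation: the vision-language and language model calls are replaced with trivial stand-in functions, and every name here (`World`, `SubTask`, `ControllerAgent`, `run_task`, `BIN`) is hypothetical. The cache is assumed to store one action template per sub-task type, which is only one possible reading of the "caching mechanism" the article mentions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Position = Tuple[float, float, float]
Action = Tuple[str, Position]  # (primitive name, target position)

BIN: Position = (0.5, 0.0, 0.1)  # hypothetical drop-off location


@dataclass
class World:
    """Toy stand-in for camera input: object name -> position."""
    objects: Dict[str, Position]
    placed: List[str] = field(default_factory=list)


@dataclass(frozen=True)
class SubTask:
    kind: str    # e.g. "pick_and_place"
    target: str  # object name


def scene_perception_agent(world: World) -> List[str]:
    """Stand-in for a VLM call: list the objects still relevant to the task."""
    return sorted(name for name in world.objects if name not in world.placed)


def reasoning_agent(scene: List[str]) -> Optional[SubTask]:
    """Stand-in for an LLM call: propose the next sub-task, or None when done."""
    return SubTask("pick_and_place", scene[0]) if scene else None


def object_perception_agent(world: World, subtask: SubTask) -> Position:
    """Stand-in for object detection: look up the target's position."""
    return world.objects[subtask.target]


class ControllerAgent:
    """Turns a sub-task plus object details into an action sequence."""

    def __init__(self) -> None:
        # Assumed cache design: one reusable action template per sub-task kind.
        self._cache: Dict[str, Callable[[Position], List[Action]]] = {}
        self.cache_hits = 0

    def _plan(self, kind: str) -> Callable[[Position], List[Action]]:
        # Stand-in for a slow model call that would write the action template.
        if kind != "pick_and_place":
            raise ValueError(f"unsupported sub-task kind: {kind}")
        return lambda pos: [("move_to", pos), ("grasp", pos),
                            ("move_to", BIN), ("release", BIN)]

    def act(self, subtask: SubTask, pos: Position) -> List[Action]:
        if subtask.kind in self._cache:
            self.cache_hits += 1  # template reused, no new planning call
        else:
            self._cache[subtask.kind] = self._plan(subtask.kind)
        return self._cache[subtask.kind](pos)


def run_task(world: World, max_steps: int = 10) -> Tuple[List[Action], int]:
    """Run the perceive -> reason -> locate -> act loop until nothing is left."""
    controller = ControllerAgent()
    log: List[Action] = []
    for _ in range(max_steps):
        subtask = reasoning_agent(scene_perception_agent(world))
        if subtask is None:
            break
        pos = object_perception_agent(world, subtask)
        log.extend(controller.act(subtask, pos))
        world.placed.append(subtask.target)  # toy execution: assume success
    return log, controller.cache_hits
```

Calling `run_task(World(objects={"cup": (0.1, 0.2, 0.0), "apple": (0.3, -0.1, 0.0)}))` yields eight actions (four per object) and one cache hit, since the second pick-and-place reuses the template planned for the first. In a real system the logged actions, paired with observations, would be the kind of trajectory an automated data collection pipeline could store.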