End-to-End Vision-Language-Action (VLA) Models

End-to-end model! GraphCoT-VLA: a VLA model for manipulation tasks with ambiguous instructions
具身智能之心· 2025-08-13 00:04
Core Viewpoint
- The article introduces GraphCoT-VLA, an end-to-end model designed to improve robot manipulation under ambiguous instructions and in open-world conditions, delivering significantly higher task success rates and faster response times than existing methods [3][15][37].

Group 1: Introduction and Background
- The Vision-Language-Action (VLA) model has become a key paradigm in robotic manipulation, integrating perception, understanding, and action to interpret and execute natural language commands [5].
- Existing VLA models struggle with ambiguous language instructions and unknown environmental states, which limits their effectiveness in real-world applications [3][8].

Group 2: GraphCoT-VLA Model
- GraphCoT-VLA addresses these limitations with a structured Chain-of-Thought (CoT) reasoning module that improves both the understanding of ambiguous instructions and task planning [3][15].
- The model maintains a real-time updatable 3D pose-object graph that captures the spatial configuration of the robot's joints and the topological relationships among objects in three-dimensional space, enabling better interaction modeling (a minimal sketch of such a graph appears after this summary) [3][9].

Group 3: Key Contributions
- A novel CoT architecture enables dynamic observation analysis, interpretation of ambiguous instructions, generation of failure feedback, and prediction of future object states and robot actions (see the structured-CoT sketch below) [15][19].
- A dropout-based mixed reasoning strategy balances rapid inference against deep reasoning, preserving real-time performance (see the mixed-reasoning sketch below) [15][27].

Group 4: Experimental Results
- Experiments show that GraphCoT-VLA significantly outperforms existing methods in task success rate and action fluidity, particularly under ambiguous instructions [37][40].
- In the "food preparation" task, GraphCoT-VLA improved accuracy by 10% over the best baseline; in the "outfit selection" task, it outperformed the leading model by 18.33% [37][38].

Group 5: Ablation Studies
- Adding the pose-object graph raised success rates by up to 18.33% and improved both accuracy and the fluidity of generated actions [40].
- The CoT module markedly improved the model's ability to interpret and respond to ambiguous instructions, demonstrating stronger task planning and future-action prediction [41].
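To make the pose-object graph in Group 2 concrete, here is a minimal sketch of a real-time updatable graph over robot-joint and object nodes. The article does not publish an implementation; the node fields, the distance-threshold edge rule, and the `edge_radius` value are all assumptions made for illustration.

```python
# Minimal sketch of a real-time updatable 3D pose-object graph.
# Node and edge definitions are assumptions inferred from the article,
# not the authors' published implementation.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Node:
    name: str             # joint name ("gripper") or object label ("cup")
    kind: str             # "joint" or "object"
    position: np.ndarray  # 3D position in the world frame

@dataclass
class PoseObjectGraph:
    nodes: dict[str, Node] = field(default_factory=dict)
    edge_radius: float = 0.5  # hypothetical distance threshold (metres)

    def update(self, name: str, kind: str, position) -> None:
        """Insert or refresh a node from the latest perception frame."""
        self.nodes[name] = Node(name, kind, np.asarray(position, dtype=float))

    def edges(self) -> list[tuple[str, str, float]]:
        """Connect node pairs closer than edge_radius; the distance
        becomes the edge feature describing their spatial relation."""
        names = list(self.nodes)
        out = []
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                d = float(np.linalg.norm(
                    self.nodes[a].position - self.nodes[b].position))
                if d < self.edge_radius:
                    out.append((a, b, d))
        return out

# Example: rebuild the graph every perception tick.
g = PoseObjectGraph()
g.update("gripper", "joint", [0.30, 0.10, 0.50])
g.update("cup", "object", [0.32, 0.12, 0.48])
print(g.edges())  # [('gripper', 'cup', 0.034...)]
```

Calling `update` once per perception tick keeps the graph current, which is what "real-time updatable" suggests; a learned model would presumably consume the node features and edge list rather than print them.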
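The structured CoT from Group 3 can be pictured as a record with one field per reasoning step. The field names and the tagged serialization below are hypothetical; the article only lists the four reasoning capabilities, not an output schema.

```python
# Sketch of a structured chain-of-thought record covering the four
# reasoning steps the article attributes to GraphCoT-VLA. Field names
# and the tag format are assumptions, not the published schema.
from dataclasses import dataclass

@dataclass
class StructuredCoT:
    observation_analysis: str  # what the scene/graph currently shows
    instruction_reading: str   # concrete reading of an ambiguous command
    failure_feedback: str      # why the last attempt failed, if it did
    future_prediction: str     # expected object states and next actions

    def as_prompt(self) -> str:
        """Serialize the reasoning into a tagged block that a policy
        head could condition on before emitting actions."""
        return (
            f"<observe>{self.observation_analysis}</observe>\n"
            f"<interpret>{self.instruction_reading}</interpret>\n"
            f"<feedback>{self.failure_feedback}</feedback>\n"
            f"<predict>{self.future_prediction}</predict>"
        )

cot = StructuredCoT(
    observation_analysis="Gripper is 3 cm from the cup; table is clear.",
    instruction_reading="'Get me something to drink' -> pick up the cup.",
    failure_feedback="None; no prior attempt this episode.",
    future_prediction="Cup lifted; move toward the user's position.",
)
print(cot.as_prompt())
```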
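The dropout-based mixed reasoning strategy in Group 3 is described only at a high level. A plausible reading is that the CoT text is randomly withheld during training so the action head learns to act both with and without explicit reasoning, giving a fast path at deployment; the sketch below follows that assumed reading.

```python
# Minimal sketch of a dropout-based mixed reasoning strategy, assuming
# (the article does not spell out the mechanism) that the CoT block is
# randomly withheld during training so the action head learns to act
# both with and without explicit reasoning.
import random

COT_DROP_PROB = 0.5  # hypothetical dropout rate

def build_policy_input(observation: str, instruction: str,
                       cot_text: str, training: bool) -> str:
    """Concatenate the policy conditioning; randomly drop the CoT block
    while training, and allow skipping it at test time for speed."""
    keep_cot = cot_text and not (training and random.random() < COT_DROP_PROB)
    parts = [f"[OBS] {observation}", f"[INSTR] {instruction}"]
    if keep_cot:
        parts.append(f"[COT] {cot_text}")
    return "\n".join(parts)

# Fast path at deployment: pass cot_text="" and the policy still acts.
# Deep path: run the CoT module first and feed its output through.
print(build_policy_input("cup at (0.3, 0.1, 0.5)", "get me a drink",
                         "", training=False))
```

Under this reading, the dropout rate trades off how often the policy practices acting without reasoning against how much it relies on the CoT, which is consistent with the article's claim of balancing rapid inference and deep reasoning.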