Workflow
New at CoRL 2025! ControlVLA: a robot masters a task after watching just 10 demonstrations, another capability upgrade for the "通智大脑" (general-purpose intelligent brain)!
具身智能之心·2025-09-25 09:54

Core Insights
- The article presents ControlVLA, a framework that lets robots learn complex manipulation tasks from only a handful of human demonstrations, reaching a success rate above 75%, nearly four times that of traditional methods [1][10][15].

Group 1: Research Background
- Robots still struggle to perform tasks in real-world scenarios when demonstrations are scarce. Existing few-shot learning methods often rely on simulation-augmented data or pre-built modules, both of which suffer from the gap between simulation and reality [7][8].
- Recent Vision-Language-Action (VLA) models show promise in improving robot performance across many tasks and environments, but efficiently adapting them to specific tasks under data scarcity remains an open challenge [8][9].

Group 2: ControlVLA Framework
- ControlVLA couples a pre-trained VLA model with object-centric representations to enable efficient few-shot fine-tuning for robot manipulation tasks. It adopts a ControlNet-style architecture that preserves the rich prior knowledge of the VLA model while steering attention toward task-critical objects [9][10] (minimal code sketches of this conditioning and of the fine-tuning loop follow after Group 4).
- The ControlVLA workflow consists of three main steps:
  1. Pre-train a large-scale VLA model on diverse manipulation datasets to learn the conditional distribution from visual observations and language instructions to the action space [12].
  2. Extract object-centric representations from demonstration videos, capturing the geometric and positional features of task-relevant objects [12].
  3. Fine-tune the model with a dual attention mechanism that injects the object information while preserving the pre-trained policy [12].

Group 3: Experimental Results
- The research team deployed ControlVLA on the Astribot S1 robot and showed it can complete both short-horizon and complex long-horizon tasks from only 10-20 demonstrations [14][15].
- Across eight real-world tasks, ControlVLA achieved an overall success rate of 76.7%, far above the 20.8% of traditional methods [15][19].
- On long-horizon tasks it maintained an average success rate of 60%, roughly three times that of the best existing methods, indicating that it curbs error accumulation during execution [19][24].

Group 4: Generalization and Cost Efficiency
- ControlVLA generalized robustly, keeping a 60%-70% success rate on unseen objects and new backgrounds, which indicates adaptability to dynamic environments [24][26].
- The framework sharply cuts the cost of collecting real-world demonstrations: it reached an 80% success rate on the OrganizeToy task with only 20 demonstrations, whereas competing methods needed 100 to reach similar performance [21][26].
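The ControlNet-style conditioning described in Group 2 is the core trick: new object-centric inputs are wired into the frozen pre-trained policy through a zero-initialized projection, so fine-tuning starts from exactly the pre-trained behavior. Below is a minimal PyTorch sketch of that idea; the module name, dimensions, and attention layout are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of ControlNet-style object-centric conditioning.
# The zero-initialized output projection means the adapter contributes
# nothing at step 0, so the pre-trained VLA prior is preserved.
import torch
import torch.nn as nn

class ObjectCentricAdapter(nn.Module):
    """Injects object-centric tokens into a pre-trained VLA block (hypothetical module)."""
    def __init__(self, hidden_dim: int, obj_dim: int, num_heads: int = 8):
        super().__init__()
        self.obj_proj = nn.Linear(obj_dim, hidden_dim)   # embed object features
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.zero_out = nn.Linear(hidden_dim, hidden_dim)
        nn.init.zeros_(self.zero_out.weight)             # zero-init: identity at start
        nn.init.zeros_(self.zero_out.bias)

    def forward(self, policy_tokens: torch.Tensor, obj_tokens: torch.Tensor) -> torch.Tensor:
        # policy_tokens: (B, T, hidden_dim) hidden states of the frozen VLA backbone
        # obj_tokens:    (B, K, obj_dim)    object-centric features from the demos
        obj = self.obj_proj(obj_tokens)
        attended, _ = self.cross_attn(policy_tokens, obj, obj)
        # Residual add through the zero-initialized projection: no effect at init.
        return policy_tokens + self.zero_out(attended)
```

Because the output projection starts at zero, the first gradient steps cannot move the policy away from its pre-trained behavior, which is plausibly what lets 10-20 demonstrations suffice without catastrophic forgetting.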
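And a runnable toy sketch of the few-shot fine-tuning recipe from step 3 of the workflow, using the ObjectCentricAdapter above: freeze the backbone, train only the adapter on a handful of demonstrations. DummyVLABackbone, the tensor shapes, and the MSE behavior-cloning loss are stand-ins for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DummyVLABackbone(nn.Module):
    """Placeholder for a large pre-trained VLA policy (illustrative only)."""
    def __init__(self, obs_dim=128, hidden_dim=512, action_dim=7):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_dim)      # stands in for the vision-language encoder
        self.action_head = nn.Linear(hidden_dim, action_dim)

    def encode(self, obs):  # obs: (B, T, obs_dim) -> (B, T, hidden_dim)
        return self.encoder(obs)

backbone = DummyVLABackbone()
adapter = ObjectCentricAdapter(hidden_dim=512, obj_dim=64)

for p in backbone.parameters():   # freeze the pre-trained prior
    p.requires_grad = False

optim = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

# Fake "10-20 demonstrations": (observation sequence, object tokens, target action).
demos = [(torch.randn(1, 16, 128), torch.randn(1, 4, 64), torch.randn(1, 7))
         for _ in range(15)]

for epoch in range(3):
    for obs, obj_tokens, target in demos:
        h = backbone.encode(obs)               # frozen features
        h = adapter(h, obj_tokens)             # object-conditioned features
        pred = backbone.action_head(h[:, -1])  # predict action from the last step
        loss = F.mse_loss(pred, target)        # behavior-cloning objective (stand-in)
        optim.zero_grad()
        loss.backward()
        optim.step()
```

Gradients flow through the frozen backbone's action head into the adapter, so only the small adapter is updated, which is consistent with the data-efficiency claims reported in Groups 3 and 4.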