Latest from AAAI 2026! OC-VLA: Solving the Perception-Action Misalignment Problem
具身智能之心 (Heart of Embodied AI) · 2026-01-19 00:49

Core Viewpoint
- The article introduces the Observation-Centric VLA (OC-VLA) model, which addresses the misalignment between perception and action in robotic control by redefining action predictions in the camera coordinate system, enhancing generalization and robustness across viewpoints [3][24].

Background and Motivation
- Pre-trained vision-language models perceive the world in camera coordinates, while robot control signals are defined in the robot base coordinate system; this mismatch hinders effective learning [2].
- The disparity between the perception and action spaces complicates transferring pre-trained visual models to robotic control tasks, especially when data is collected from diverse camera viewpoints [2].

Methodology
- OC-VLA decouples action supervision from the robot base coordinate system and predicts actions directly in the third-person camera coordinate system [3][5].
- Aligning action targets with visual observations reduces the ambiguity caused by varying camera angles and improves the model's ability to learn the spatial relationship between robot and camera [3][10].

Training and Inference
- During training, the robot's pose is transformed from the robot base coordinate system into the camera coordinate system, giving the images and the predicted actions a unified reference frame [6][7].
- During inference, the model's predictions are converted back into the robot base coordinate system for control [8].

Experimental Results
- Experiments using the Dita model architecture, over both discrete and continuous action spaces, demonstrate significant improvements in task success rates when actions are predicted in the camera coordinate system [11][15].
- In the ManiSkill2 simulation, camera-frame actions raised the success rate of discrete-action models by approximately 14% [15].
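The training and inference steps described above amount to changing the reference frame of an end-effector pose using the camera extrinsics, then inverting that change for control. Below is a minimal sketch with homogeneous 4×4 transforms; the names `T_cam_in_base` (camera pose expressed in the robot base frame), `pose_to_mat`, and the example poses are illustrative assumptions, not identifiers from the paper:

```python
import numpy as np

def pose_to_mat(position, rotation):
    """Build a 4x4 homogeneous transform from a 3-vector and a 3x3 rotation."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = position
    return T

def base_to_camera(T_pose_in_base, T_cam_in_base):
    """Training direction: re-express a base-frame pose in the camera frame."""
    return np.linalg.inv(T_cam_in_base) @ T_pose_in_base

def camera_to_base(T_pose_in_cam, T_cam_in_base):
    """Inference direction: map a camera-frame prediction back to the base frame."""
    return T_cam_in_base @ T_pose_in_cam

# Hypothetical extrinsics: camera rotated 90° about z, offset from the base.
Rz = np.array([[0.0, -1.0, 0.0],
               [1.0,  0.0, 0.0],
               [0.0,  0.0, 1.0]])
T_cam_in_base = pose_to_mat(np.array([0.5, 0.0, 0.4]), Rz)

# Hypothetical end-effector pose in the base frame.
T_ee_in_base = pose_to_mat(np.array([0.3, 0.1, 0.2]), np.eye(3))

# Round trip: base -> camera (supervision target) -> base (control command).
T_ee_in_cam = base_to_camera(T_ee_in_base, T_cam_in_base)
T_back = camera_to_base(T_ee_in_cam, T_cam_in_base)
```

Because the two mappings are exact inverses, supervising in the camera frame loses no information: the round trip `T_back` recovers the original base-frame pose.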
Performance Evaluation
- OC-VLA was tested under varied conditions, including fixed camera placements and slight camera perturbations, showing stronger performance and robustness in zero-shot scenarios [21].
- Its ability to generalize across camera viewpoints was validated, indicating practical value for real-world robotic applications [21][24].

Conclusion
- OC-VLA offers a simple yet effective framework for camera-coordinate action prediction, resolving the spatial misalignment in existing models and showing significant potential for building general-purpose robot policies [24].
