Cross-View Generalization
New at AAAI 2026! OC-VLA: Resolving Perception-Action Misalignment with an Observation-Centric VLA Paradigm
具身智能之心 · 2026-01-19 00:49
Core Viewpoint
- The article introduces the Observation-Centric VLA (OC-VLA) model, which addresses the misalignment between perception and action in robotic control by redefining action predictions in the camera coordinate system, improving generalization and robustness across viewpoints [3][24].

Background and Motivation
- Pre-trained visual-language models perceive the world in camera coordinates, while robot control signals are defined in robot base coordinates; this misalignment hinders effective learning [2].
- The disparity between the perception and action spaces complicates the transfer of pre-trained visual models to robotic control tasks, especially when data is collected from diverse camera perspectives [2].

Methodology
- OC-VLA decouples the supervision of robot actions from the robot base coordinate system, predicting actions directly in the third-person camera coordinate system [3][5].
- This aligns the action targets with the visual observations, reducing the ambiguity caused by varying camera angles and making the spatial relationship between robot and camera easier to learn [3][10].

Training and Inference
- During training, the robot's pose is transformed from the robot base coordinate system into the camera coordinate system, giving images and predicted actions a unified reference frame [6][7]; a minimal sketch of this transform appears after the conclusion below.
- During inference, the model's predicted poses or actions are mapped back into the robot base coordinate system for control [8]; the inverse mapping is also sketched below.

Experimental Results
- Experiments built on the Dita model structure, covering both discrete and continuous action spaces, demonstrate significant improvements in task success rates when actions are predicted in the camera coordinate system [11][15].
- On the ManiSkill2 simulation suite, defining actions in the camera coordinate system consistently improved success rates across tasks, with discrete-action models gaining approximately 14% [13][15].

Performance Evaluation
- OC-VLA was evaluated under various camera settings, including fixed positions and slight perturbations, showing enhanced performance and robustness in zero-shot scenarios [21]; an illustrative perturbation sketch follows the transform sketches below.
- Because the method only changes the action's reference frame, it integrates with existing frameworks without additional computational cost while improving learning stability and convergence efficiency [10][24].
- The model's ability to generalize across different camera perspectives was validated, indicating its practical value in real-world robotic applications [21][24].

Conclusion
- OC-VLA presents a simple yet effective framework for camera-coordinate action prediction, resolving spatial misalignment issues in existing models and demonstrating significant potential for general robot policy learning [24].
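To make the training-time step concrete, here is a minimal sketch, assuming actions are end-effector poses and that the camera extrinsics T_cam_base (the pose of the robot base expressed in the camera frame) are known for each episode. The function names and the SE(3) matrix parameterization are illustrative assumptions, not the paper's API.

```python
import numpy as np

def pose_to_matrix(position: np.ndarray, rotation: np.ndarray) -> np.ndarray:
    """Pack a 3-vector position and a 3x3 rotation into a 4x4 SE(3) matrix."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = position
    return T

def relabel_action_to_camera(T_cam_base: np.ndarray,
                             T_base_ee: np.ndarray) -> np.ndarray:
    """Re-express an end-effector pose, recorded in the robot base frame,
    in the third-person camera frame: T_cam_ee = T_cam_base @ T_base_ee.
    Applied to every action label before training, this gives the images
    and the action targets a shared reference frame."""
    return T_cam_base @ T_base_ee
```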
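At inference time the mapping is inverted before the command is sent to the controller. A hedged sketch under the same assumptions, using the closed-form inverse of a rigid transform:

```python
import numpy as np

def invert_se3(T: np.ndarray) -> np.ndarray:
    """Closed-form inverse of a rigid transform: [R t]^-1 = [R^T, -R^T t]."""
    R, t = T[:3, :3], T[:3, 3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T
    T_inv[:3, 3] = -R.T @ t
    return T_inv

def recover_base_action(T_cam_base: np.ndarray,
                        T_cam_ee_pred: np.ndarray) -> np.ndarray:
    """Map a pose predicted in the camera frame back into the robot base
    frame for control: T_base_ee = inv(T_cam_base) @ T_cam_ee."""
    return invert_se3(T_cam_base) @ T_cam_ee_pred
```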
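The summary does not say how the "slight perturbations" of the camera were generated; purely as an illustration (not the paper's protocol), one common way to build such a robustness condition is to jitter the camera extrinsics with a small random rotation and translation. The magnitudes below are arbitrary placeholders.

```python
import numpy as np

def perturb_extrinsics(T_cam_base: np.ndarray,
                       rot_deg: float = 5.0,
                       trans_m: float = 0.02,
                       rng: np.random.Generator | None = None) -> np.ndarray:
    """Apply a small random rigid perturbation to the camera extrinsics."""
    rng = np.random.default_rng() if rng is None else rng
    # Random unit axis and a small signed angle.
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = np.deg2rad(rot_deg) * rng.uniform(-1.0, 1.0)
    # Rodrigues' formula: R = I + sin(a) K + (1 - cos(a)) K^2.
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    R = np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)
    T_delta = np.eye(4)
    T_delta[:3, :3] = R
    T_delta[:3, 3] = rng.uniform(-trans_m, trans_m, size=3)
    return T_delta @ T_cam_base
```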