Latest work from He Wang's team! Tackling VLA models' reliance on single-view images and their lack of precise geometric information
具身智能之心 (Heart of Embodied Intelligence) · 2026-01-04 08:58

Core Viewpoint
- The article discusses the development of the StereoVLA model, which integrates stereo vision into Vision-Language-Action (VLA) models to enhance spatial perception and improve robotic manipulation capabilities.

Group 1: Challenges in Existing VLA Models
- Existing VLA models face three core challenges in spatial perception: the limitations of single-modal vision, difficulties in integrating geometric and semantic information, and the constraints of current sensor technologies [4][5][6].

Group 2: Technical Architecture of StereoVLA
- StereoVLA is built on a three-layer technical architecture - feature extraction, auxiliary training, and data support - which allows deep integration of geometric perception and semantic understanding [8][10].
- The feature extraction module efficiently combines geometric cues from stereo vision with semantic information from single-view images, enhancing the model's performance [12].

Group 3: Performance Validation
- StereoVLA demonstrates significant performance improvements over baseline models across three key tasks, achieving near-perfect success rates in specific object-manipulation scenarios [13].
- In a comparison of camera configurations, StereoVLA shows superior robustness to camera-pose variations, outperforming other setups across a range of scenarios [14][17].

Group 4: Key Findings from Ablation Studies
- Ablation studies confirm the necessity of the key design choices: removing semantic features leads to a significant drop in success rates, highlighting the importance of geometric-semantic integration [15][18].

Group 5: Limitations and Future Directions
- While StereoVLA represents a breakthrough in integrating stereo vision with VLA models, there is room for optimization, including better capture of long-term dependencies and adaptation to multi-robot scenarios [16][18].
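The geometric-semantic fusion described above can be sketched in code. This is a minimal illustration only: it assumes a simple concatenate-and-project fusion of stereo-derived geometric tokens with single-view semantic tokens, and every dimension, variable, and function name here is invented for the example rather than taken from the StereoVLA paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions (illustrative assumptions, not the paper's).
GEO_DIM, SEM_DIM, FUSED_DIM = 64, 128, 256

def fuse_features(geometric: np.ndarray, semantic: np.ndarray,
                  w: np.ndarray) -> np.ndarray:
    """Concatenate stereo-derived geometric features with single-view
    semantic features per token, then project into a shared space."""
    joint = np.concatenate([geometric, semantic], axis=-1)  # (N, GEO+SEM)
    return np.tanh(joint @ w)                               # (N, FUSED)

# Toy inputs: N patch tokens per modality.
N = 16
geo = rng.standard_normal((N, GEO_DIM))   # stands in for stereo disparity cues
sem = rng.standard_normal((N, SEM_DIM))   # stands in for single-view semantics
W = rng.standard_normal((GEO_DIM + SEM_DIM, FUSED_DIM)) * 0.05

fused = fuse_features(geo, sem, W)
print(fused.shape)  # (16, 256)
```

In a real system the projection would be learned jointly with the policy head, so that the geometric and semantic streams are integrated rather than merely stacked.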