SpatialPoint
Making Depth Information a Core VLM Input! 视启未来 × Tsinghua × IDEA Help Robots Understand the Physical World
量子位· 2026-03-30 03:39
Core Viewpoint
- The article discusses the limitations of current vision-language models (VLMs) in physical interaction and introduces the SpatialPoint framework, which integrates depth information to enhance the spatial perception and interaction capabilities of AI systems [5][11][32].

Group 1: Limitations of Current VLMs
- Current VLMs can recognize objects but struggle with spatial operations because they rely on RGB images without accurate depth information, leading to failures such as misgrasping and collisions [6][8].
- Traditional VLMs output 2D bounding boxes and semantic labels, which lack the actionable 3D coordinates robots need for execution, creating a gap between perception and action [8][9].
- Existing approaches treat real and virtual points separately and lack a unified framework that can predict both types of critical spatial points needed for effective interaction [9][12].

Group 2: Introduction of the SpatialPoint Framework
- SpatialPoint is designed to address the shortcomings of traditional VLMs by taking structured depth information as a core input alongside RGB and language, enabling direct output of actionable 3D coordinates [11][12].
- The framework employs a two-stage training strategy to integrate depth information without compromising the existing capabilities of the pre-trained VLM (a training sketch follows this summary) [17][19].
- SpatialPoint predicts both TouchablePoints (real points) and AirPoints (virtual points) simultaneously, significantly improving the efficiency and accuracy of robotic tasks [11][13].

Group 3: Technical Implementation
- A depth-encoding step converts single-channel depth maps into a format compatible with RGB inputs, ensuring aligned feature extraction (see the depth-encoding sketch below) [16].
- Multi-modal collaborative reasoning is enabled by dedicated boundary markers around depth tokens, allowing integrated processing of RGB, depth, and language features (see the token-splicing sketch below) [17][18].
- The output is a structured 3D coordinate (u, v, Z) that robotic systems can interpret directly, reducing the work of translating model predictions into executable actions (see the back-projection sketch below) [18].

Group 4: Experimental Results
- SpatialPoint demonstrated a significant improvement in identifying effective operating positions, locating TouchablePoints with a 79% success rate versus 74.1% and 50.3% for the other models evaluated [23].
- For AirPoints, the model achieved a 50.71% success rate in direction finding and a 33.47% success rate in locating positions to within 5 centimeters, outperforming traditional models [26].
- On complex spatial-positioning tasks, the framework's performance consistently exceeded that of other models, indicating robustness across a variety of scenarios [28].

Group 5: Practical Applications
- SpatialPoint has been validated in real-world robotic applications, successfully executing tasks such as object retrieval and navigation without model fine-tuning [29][30].
- The framework's unified visual interface allows integrated multi-task operation, improving the efficiency of robotic systems in dynamic environments [31].
- By addressing the core challenges of spatial interaction, SpatialPoint aims to help AI move from virtual environments to real-world applications, contributing to the development of embodied intelligence [32][36].
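The article describes the two-stage training strategy only at a high level. Below is a minimal sketch of one plausible reading, assuming stage 1 trains only newly added depth modules while the pre-trained VLM stays frozen, and stage 2 fine-tunes everything jointly; the class and module names are hypothetical, not from the paper.

```python
import torch.nn as nn

# Hypothetical module layout: the article only says depth is integrated in two
# stages without compromising the pre-trained VLM. The freezing scheme below
# is an assumed, common recipe, not SpatialPoint's published specification.
class SpatialPointModel(nn.Module):
    def __init__(self, vlm: nn.Module, depth_encoder: nn.Module, depth_proj: nn.Module):
        super().__init__()
        self.vlm = vlm                      # pre-trained RGB + language backbone
        self.depth_encoder = depth_encoder  # new module: encodes depth maps
        self.depth_proj = depth_proj        # new module: maps depth features to token space

def configure_stage(model: SpatialPointModel, stage: int) -> None:
    """Stage 1: only the new depth pathway learns, so the pre-trained VLM is
    untouched. Stage 2: everything is unfrozen for joint fine-tuning."""
    for p in model.vlm.parameters():
        p.requires_grad = (stage == 2)
    for module in (model.depth_encoder, model.depth_proj):
        for p in module.parameters():
            p.requires_grad = True
```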
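For the depth-encoding step, the article states only that single-channel depth maps are converted into an RGB-compatible format. A minimal sketch, assuming a clip-normalize-replicate scheme; the depth range and three-channel replication are assumptions, not the paper's actual encoding.

```python
import numpy as np

def encode_depth_for_rgb_encoder(depth_m: np.ndarray,
                                 d_min: float = 0.1,
                                 d_max: float = 10.0) -> np.ndarray:
    """Convert a single-channel metric depth map (H, W) into a 3-channel
    array (H, W, 3) that an RGB-pretrained vision encoder can consume.

    Clipping to [d_min, d_max], normalizing to [0, 1], and replicating across
    three channels is one common recipe, used here as an assumption.
    """
    d = np.clip(depth_m, d_min, d_max)
    d = (d - d_min) / (d_max - d_min)           # normalize to [0, 1]
    return np.repeat(d[..., None], 3, axis=-1)  # (H, W) -> (H, W, 3)
```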
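For the boundary markers around depth tokens, the article does not name the marker tokens or give the sequence layout. Below is a sketch of how the fused sequence might be assembled at the embedding level; the marker names and the RGB-then-depth ordering are illustrative assumptions.

```python
import torch

# Illustrative marker tokens; the actual special tokens used by SpatialPoint
# are not given in the article.
DEPTH_START, DEPTH_END = "<depth_start>", "<depth_end>"

def splice_multimodal_embeds(rgb_embeds: torch.Tensor,
                             depth_embeds: torch.Tensor,
                             text_embeds: torch.Tensor,
                             start_embed: torch.Tensor,
                             end_embed: torch.Tensor) -> torch.Tensor:
    """Assemble one input sequence for the language model:
    [RGB tokens] <depth_start> [depth tokens] <depth_end> [text tokens].
    Token tensors are (num_tokens, hidden_dim); markers are (hidden_dim,).
    The markers let the model distinguish depth tokens from RGB and text."""
    return torch.cat([rgb_embeds,
                      start_embed.unsqueeze(0),
                      depth_embeds,
                      end_embed.unsqueeze(0),
                      text_embeds], dim=0)
```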
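The (u, v, Z) output pairs a pixel location with a metric depth. The article says robots can consume it directly; the standard way a controller lifts such a triple into a 3D camera-frame point is pinhole back-projection with the camera intrinsics, sketched below (the intrinsics values in the example are made up).

```python
import numpy as np

def uvz_to_camera_xyz(u: float, v: float, z: float, K: np.ndarray) -> np.ndarray:
    """Back-project a (u, v, Z) prediction into a 3D point in the camera frame
    using pinhole intrinsics K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]:
        X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

# Example: a predicted grasp point at pixel (320, 240) with 0.85 m depth,
# under example intrinsics where (320, 240) is the principal point.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
print(uvz_to_camera_xyz(320.0, 240.0, 0.85, K))  # -> [0.  0.  0.85]
```

In practice, a hand-eye calibration transform would then map this camera-frame point into the robot base frame before execution.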