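Complex spatial instructions understood in seconds? RoboRefer lets robots understand and reason about space, and act precisely even in the open world!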
机器之心· 2025-07-06 06:06
Core Viewpoint - The article describes the development and capabilities of RoboRefer, a multimodal large model built for spatial referring tasks in robotics, with an emphasis on its spatial understanding and reasoning abilities.

Group 1: RoboRefer Model Overview
- RoboRefer is a multimodal large model with three-dimensional spatial understanding and reasoning capabilities; it uses separate image and depth encoders [12]
- The model can accurately answer a variety of spatial perception questions and perform complex compositional reasoning over multiple spatial relationships [12][13]

Group 2: Training Techniques
- RoboRefer uses supervised fine-tuning (SFT) with full-parameter tuning to strengthen spatial perception, and reinforcement fine-tuning (RFT) to improve generalizable reasoning [15][16]
- Training includes a process-based reward function that improves the quality of intermediate reasoning steps, which in turn improves multi-step reasoning [17]

Group 3: Performance Metrics
- After SFT, RoboRefer achieved an average success rate of 89.6% on spatial understanding tasks, a new state of the art [21]
- On RefSpatial-Bench, a high-difficulty benchmark for spatial referring, the RFT-trained RoboRefer outperformed all other models, surpassing Gemini-2.5-Pro by 17.4% in average accuracy [22]

Group 4: Dataset Development
- The research team built RefSpatial, a large-scale, high-quality dataset with 2.5 million samples and 20 million question-answer pairs, significantly larger than comparable datasets [20]
- RefSpatial includes detailed multi-step reasoning processes, covers a wide range of everyday interaction scenarios, and integrates 31 types of spatial relationships [20]

Group 5: Real-World Application
- RoboRefer can be integrated into different types of robots, such as UR5 robotic arms and G1 humanoid robots, enabling precise execution of complex, dynamic, multi-step tasks in real-world environments [9]
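
To make the process-based reward mentioned under Group 2 more concrete, the following is a minimal sketch of how an outcome reward might be combined with rewards for intermediate reasoning steps. It is not RoboRefer's actual reward function, which the article does not specify. The `Step` type, the linear distance-based `point_reward`, the `tolerance` value, and the `process_weight` parameter are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Step:
    """One intermediate reasoning step (hypothetical): a predicted 2D point
    in normalized image coordinates and the reference point for that step."""
    predicted: tuple[float, float]
    reference: tuple[float, float]


def point_reward(predicted, reference, tolerance=0.1):
    """Reward in [0, 1] that decays linearly with Euclidean distance.

    Assumption: the real reward shape is not given in the article.
    """
    distance = hypot(predicted[0] - reference[0], predicted[1] - reference[1])
    return max(0.0, 1.0 - distance / tolerance)


def combined_reward(steps, final_predicted, final_reference, process_weight=0.5):
    """Outcome reward plus a weighted mean of per-step (process) rewards."""
    outcome = point_reward(final_predicted, final_reference)
    if not steps:
        # No intermediate steps to score: fall back to the outcome reward.
        return outcome
    process = sum(point_reward(s.predicted, s.reference) for s in steps) / len(steps)
    return outcome + process_weight * process


# Usage: one correct step, one badly wrong step, and a correct final answer.
steps = [
    Step(predicted=(0.2, 0.3), reference=(0.2, 0.3)),
    Step(predicted=(0.9, 0.9), reference=(0.1, 0.1)),
]
reward = combined_reward(steps, final_predicted=(0.5, 0.5), final_reference=(0.5, 0.5))
```

In this sketch a response with a correct final point still earns less than the maximum when an intermediate step is wrong, which is the general idea behind rewarding the reasoning process as well as the outcome.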