Workflow
5,700 QA pairs put AI's sense of space to a comprehensive test! The latest spatial intelligence benchmark is here | Zhejiang University & UESTC & CUHK
QbitAI · 2025-06-02 04:13

Core Insights
- The article discusses the limitations of current Vision-Language Models (VLMs) in spatial reasoning and multi-perspective understanding, highlighting the need for AI systems that can collaborate effectively with humans [1][3][20].

Group 1: ViewSpatial-Bench Development
- A new benchmark called ViewSpatial-Bench has been developed by research teams from Zhejiang University, the University of Electronic Science and Technology of China, and The Chinese University of Hong Kong to evaluate VLMs' spatial reasoning capabilities across multiple perspectives [4][33].
- ViewSpatial-Bench includes 5 task types and over 5,700 question-answer pairs, assessing models from both the camera's perspective and a human observer's perspective [5][7] (a minimal sketch of such an evaluation setup appears after Group 3 below).
- The benchmark aims to address the fragmented understanding of spatial information in VLMs, which often leads to performance issues in multi-perspective tasks [2][20].

Group 2: Model Performance Evaluation
- Evaluation of leading models, including GPT-4o and Gemini 2.0, showed that their understanding of spatial relationships remains inadequate, with low overall accuracy [19][20].
- The results revealed a significant performance gap between camera-perspective tasks and human-perspective tasks, suggesting that current VLMs lack a unified spatial cognitive framework [22][23].
- The Multi-View Spatial Model (MVSM) was introduced to enhance cross-perspective spatial understanding, achieving a 46.24% absolute performance improvement over its backbone model [27][28].

Group 3: Future Directions
- The findings highlight a structural imbalance in the perspective distribution of training data, pointing to future work on data construction and model optimization [26].
- MVSM and ViewSpatial-Bench provide a feasible path for AI systems to acquire human-like spatial cognition, which is crucial for the next generation of robots and multimodal assistants [34].
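To make the evaluation setup concrete, below is a minimal sketch of how per-perspective accuracy on a multiple-choice spatial QA benchmark of this kind might be computed. The item fields (`image_path`, `question`, `options`, `answer`, `perspective`), the `ask_model` callable, and the answer format are illustrative assumptions, not ViewSpatial-Bench's actual schema or API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class SpatialQAItem:
    """One multiple-choice item; field names are hypothetical, not the released schema."""
    image_path: str
    question: str
    options: list[str]   # e.g. ["left", "right", "in front", "behind"]
    answer: str          # ground-truth option
    perspective: str     # "camera" or "human"

def evaluate(items: Iterable[SpatialQAItem],
             ask_model: Callable[[str, str, list[str]], str]) -> dict[str, float]:
    """Compute accuracy per perspective and overall.

    ask_model(image_path, question, options) is assumed to return the
    model's chosen option as a string.
    """
    correct: defaultdict[str, int] = defaultdict(int)
    total: defaultdict[str, int] = defaultdict(int)
    for item in items:
        prediction = ask_model(item.image_path, item.question, item.options)
        total[item.perspective] += 1
        total["overall"] += 1
        if prediction.strip().lower() == item.answer.strip().lower():
            correct[item.perspective] += 1
            correct["overall"] += 1
    return {key: correct[key] / total[key] for key in total}

# Hypothetical usage: inspect the camera-vs-human gap the benchmark is designed to expose.
# scores = evaluate(load_items("viewspatial_bench.jsonl"), ask_model=my_vlm)
# print(scores["camera"] - scores["human"])
```

Splitting accuracy by perspective, rather than reporting a single overall score, is what surfaces the camera-versus-human gap that the article attributes to the absence of a unified spatial cognitive framework in current VLMs.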