Large Models Get a "Neck" for the First Time! NYU Team Achieves 360-Degree Human-Like Visual Search
量子位·2025-11-27 07:30

Core Insights

- The research introduces a new task, Humanoid Visual Search (HVS), in which models perform 360-degree visual searches in real-world environments such as train stations and shopping malls [6][10][12]
- A new benchmark, HBench, evaluates the search capabilities of intelligent agents in complex environments, moving beyond traditional simple household scenarios [7][8][9]
- The study aims to shift visual spatial reasoning from a "disembodied passive paradigm" to an "embodied active paradigm," strengthening a model's ability to couple physical actions with visual reasoning [9][12]

Group 1: Humanoid Visual Search

- HVS lets intelligent agents autonomously rotate their heads to search for target objects or paths in immersive environments [6][12]
- The task covers two search problems, Humanoid Object Search (HOS) and Humanoid Path Search (HPS), each with difficulty levels that vary with target visibility and environmental cues [12][16]
- HOS involves locating and fixating on target objects, while HPS requires identifying navigable paths and adjusting body orientation [16][12]

Group 2: Benchmark and Dataset

- The H dataset consists of approximately 3,000 labeled task instances drawn from diverse high-resolution panoramic videos, giving broad geographical coverage [21][22]
- The benchmark spans six main scene categories: retail environments, transportation hubs, urban streets, public institutions, offices, and entertainment venues [24]
- Initializing agents from four different starting directions expands the dataset to 12,000 search rounds [22]

Group 3: Model Training and Performance

- The research frames HVS as a multi-modal reasoning task, using a policy network that interleaves tool use with head rotation to improve the model's decision-making [17][28]
- After training, search accuracy improves substantially for object search (from 14.83% to 47.38%) and path search (from 6.44% to 24.94%) [28]
- Larger model sizes do not necessarily guarantee better performance; smaller models outperform larger counterparts on certain tasks [33][34]

Group 4: Challenges and Insights

- Despite improvements in low-level perception and motion, the research identifies fundamental bottlenecks in advanced reasoning that requires physical, spatial, and social common sense [34][36]
- Errors in HOS stem primarily from insufficient perception in cluttered environments, while HPS errors are more complex, reflecting a lack of physical and social common sense [36]
- The study emphasizes that active visual search (rotating within panoramic views) is more intuitive and effective than passive analysis of static images [36]
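The active-search idea described above (an agent rotating its head in a panorama until the target comes into view) can be sketched in a few lines. This is a minimal toy illustration, not the paper's actual architecture: the `PanoramicAgent` class, the `search` helper, and the visibility callback are all hypothetical names introduced here, and a real system would replace the callback with a vision-language model's detection output.

```python
# Toy sketch of an active visual search loop in a 360-degree panorama,
# in the spirit of Humanoid Object Search: rotate in discrete steps
# until a visibility check succeeds. All names are illustrative.

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PanoramicAgent:
    heading: float = 0.0  # degrees; 0 = the agent's initial facing direction

    def rotate(self, delta: float) -> None:
        # Head rotation wraps around the full 360-degree panorama.
        self.heading = (self.heading + delta) % 360.0


def search(
    agent: PanoramicAgent,
    is_target_visible: Callable[[float], bool],
    step: float = 30.0,
    max_steps: int = 12,
) -> Optional[float]:
    """Rotate until the target is visible; return the final heading, else None."""
    for _ in range(max_steps):
        if is_target_visible(agent.heading):
            return agent.heading
        agent.rotate(step)
    return None


# Toy stand-in for a perception model: the target spans headings 85-95.
agent = PanoramicAgent()
found = search(agent, lambda h: 85.0 <= h <= 95.0)
# found == 90.0: the agent stops after rotating 0 -> 30 -> 60 -> 90 degrees
```

In the actual task the visibility check is where the learned model does its work, and the policy must also decide rotation direction and step size rather than sweeping uniformly; the loop above only shows the embodied "act, then look again" structure that distinguishes HVS from passive analysis of a single static image.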