MVGGT
Farewell to the "god's-eye view": robots lock onto 3D targets from just a few images, setting SOTA on a new benchmark
量子位 · 2026-01-23 05:03
Core Insights
- The article discusses the challenges embodied agents face in understanding 3D environments from limited, sparse visual data, and proposes a new task, Multiview 3D Referring Expression Segmentation (MV-3DRES), to address these issues [4][10][30]

Group 1: Problem Statement
- Embodied agents rarely have a comprehensive view of their surroundings; they must rely on sparse RGB images, which yield incomplete and noisy 3D reconstructions [2][9]
- Existing 3D referring segmentation methods rest on the idealized assumption of dense, reliable point cloud inputs, which does not reflect real-world conditions [3][9]

Group 2: Proposed Solution
- The proposed solution, MVGGT (Multimodal Visual Geometry Grounded Transformer), uses a dual-branch architecture that combines geometric and language features to improve 3D scene understanding and segmentation [4][11]
- The architecture pairs a frozen geometric reconstruction branch, which supplies stable 3D geometric priors, with a trainable multimodal branch that integrates language instructions with visual features [13][15]; a minimal sketch of this dual-branch setup follows at the end of this summary

Group 3: Optimization Strategy
- The research identifies a core optimization challenge, Foreground Gradient Dilution (FGD), which complicates training because target instances occupy only a small fraction of the sparse views [20][18]
- To address this, the team introduces the PVSO (Per-View No-Target Suppression Optimization) strategy, which amplifies meaningful gradient signals from effective views while suppressing misleading signals from no-target views [22][18]; a sketch of such a per-view re-weighting also follows below

Group 4: Experimental Results
- The team built a benchmark dataset, MVRefer, to evaluate the MV-3DRES task, simulating deployment scenarios with eight randomly collected sparse views per scene [23][24]
- Experimental results show that MVGGT significantly outperforms existing baseline methods across metrics, especially in challenging scenarios where target pixel ratios are low [25][26]

Group 5: Practical Implications
- The work underlines the practical significance of aligning 3D grounding with real-world perception conditions, pointing to new directions for improving the perception capabilities of embodied intelligence in constrained environments [30]
- The research team invites further exploration and improvement on the established benchmark to advance sparse perception for embodied intelligence [30]
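For readers who want a concrete picture of the dual-branch design summarized under Group 2, the following is a minimal PyTorch-style sketch. The `geometry_backbone` and `multimodal_backbone` modules are hypothetical stand-ins for the frozen reconstruction branch and the trainable multimodal branch; the fusion layer, feature shapes, and mask head are illustrative assumptions, not the authors' actual MVGGT implementation.

```python
import torch
import torch.nn as nn

class DualBranchGrounding(nn.Module):
    """Sketch of a frozen-geometry / trainable-multimodal dual branch.

    Assumes both backbones emit per-view feature maps of shape
    (B, V, hidden_dim, h, w); these modules are hypothetical placeholders.
    """

    def __init__(self, geometry_backbone: nn.Module,
                 multimodal_backbone: nn.Module, hidden_dim: int = 256):
        super().__init__()
        self.geometry_backbone = geometry_backbone
        # Freeze the reconstruction branch so it only supplies stable 3D priors.
        for p in self.geometry_backbone.parameters():
            p.requires_grad = False
        self.multimodal_backbone = multimodal_backbone            # trainable
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)         # merge the two streams
        self.mask_head = nn.Conv2d(hidden_dim, 1, kernel_size=1)  # per-view mask logits

    def forward(self, images: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # images: (B, V, 3, H, W) sparse multi-view RGB; text_tokens: (B, L)
        with torch.no_grad():
            geo_feat = self.geometry_backbone(images)              # (B, V, C, h, w) geometric priors
        lang_feat = self.multimodal_backbone(images, text_tokens)  # (B, V, C, h, w) language-aware features
        fused = torch.cat([geo_feat, lang_feat], dim=2)            # (B, V, 2C, h, w)
        fused = self.fuse(fused.permute(0, 1, 3, 4, 2))            # (B, V, h, w, C)
        fused = fused.permute(0, 1, 4, 2, 3)                       # back to (B, V, C, h, w)
        B, V, C, h, w = fused.shape
        logits = self.mask_head(fused.reshape(B * V, C, h, w))     # (B*V, 1, h, w)
        return logits.view(B, V, h, w)                             # per-view segmentation logits
```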
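Similarly, the loss below is a hedged sketch of the kind of per-view re-weighting described under Group 3: views whose ground-truth mask is empty are down-weighted so the small foreground signal in effective views is not diluted. The function name, the `no_target_weight` value, and the binary cross-entropy choice are assumptions for illustration, not the authors' exact PVSO formulation.

```python
import torch
import torch.nn.functional as F

def per_view_weighted_loss(logits: torch.Tensor, masks: torch.Tensor,
                           no_target_weight: float = 0.1) -> torch.Tensor:
    """Illustrative per-view re-weighting against foreground gradient dilution.

    logits, masks: (B, V, H, W); `no_target_weight` is an assumed
    hyper-parameter, not a value from the paper.
    """
    # Per-view binary cross-entropy, averaged over pixels: shape (B, V).
    per_view_bce = F.binary_cross_entropy_with_logits(
        logits, masks.float(), reduction="none").mean(dim=(2, 3))
    # A view is "effective" if it actually contains target pixels.
    has_target = (masks.sum(dim=(2, 3)) > 0).float()               # (B, V)
    weights = has_target + no_target_weight * (1.0 - has_target)
    # Normalize so the loss scale does not depend on how many views see the target.
    return (weights * per_view_bce).sum() / weights.sum().clamp(min=1e-6)
```

In practice the weight on no-target views would be tuned, or their contribution restricted to penalizing false positives only, which is closer in spirit to the "suppression" the article describes.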