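ICCV 2025 | Mamba-3VL: A Single Model Tackles 18 Types of Heterogeneous Tasks, Redefining the Capability Boundaries of Large Models for Embodied Intelligence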

Core Insights

- The article discusses the Mamba-3VL model, which integrates state space modeling into 3D vision-language learning to address the challenge of task adaptability in embodied intelligence [2][3][18]
- Mamba-3VL handles 18 heterogeneous tasks across various domains, marking a significant advance in embodied intelligence [3][11][17]

Summary by Sections

1. Core Method Innovations

- Mamba-3VL introduces three key technological breakthroughs to overcome the limitations of traditional embodied models, particularly those based on the Transformer architecture [3][5]
- The model uses a multi-modal Mamba Mixer module to efficiently fuse 3D point clouds, visual data, and language inputs, enhancing spatial relationship modeling [5][6]
- A dynamic position encoding mechanism, IDPA, combines geometric priors and semantic modulation to adapt to varying task precision requirements [6][9]
- The unified query decoding framework allows flexible output across multiple tasks without the need for module reconstruction [6][10]

2. Comprehensive Task Coverage

- Mamba-3VL supports 18 distinct tasks categorized into four major dimensions, showcasing its versatility in both foundational and advanced embodied interactions [11][12]
- The tasks include basic 3D perception, language reasoning, instance segmentation, and advanced interaction and planning tasks [11][14]

3. Performance and Generalization

- The model sets new performance records on key benchmarks, demonstrating superior capabilities in handling large-scale 3D data with linear computational complexity [15][16]
- Mamba-3VL achieves state-of-the-art results in various tasks, including dense description generation and robotic operations, indicating strong generalization abilities [15][17]

4. Research Significance

- The advances presented by Mamba-3VL redefine the direction of general embodied intelligence, suggesting applications in robotics, autonomous driving, virtual reality, and smart home control [17][18]
- The model's ability to adapt to 18 heterogeneous tasks without extensive retraining paves the way for future developments in multi-task embodied intelligence [20]
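
The linear computational complexity mentioned under Performance and Generalization comes from the state space formulation, which processes a sequence with one recurrent update per token instead of comparing every token with every other token. The sketch below illustrates that idea only. It is not the Mamba-3VL mixer: the function name `ssm_scan` and the fixed scalar parameters `a`, `b`, and `c` are assumptions made for illustration, whereas Mamba-style models use learned, input-dependent parameters over vector-valued states.

```python
from typing import List


def ssm_scan(inputs: List[float], a: float, b: float, c: float) -> List[float]:
    """Toy scalar state space recurrence (illustrative, not the paper's module).

    h_t = a * h_{t-1} + b * x_t   (state update)
    y_t = c * h_t                 (readout)

    The loop body runs once per input element, so the cost grows
    linearly with sequence length.
    """
    h = 0.0  # hidden state starts at zero
    outputs = []
    for x in inputs:
        h = a * h + b * x      # carry a compressed summary of the past forward
        outputs.append(c * h)  # emit the output for this step
    return outputs


if __name__ == "__main__":
    # A single impulse decays geometrically through the state.
    print(ssm_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=2.0))
```

Because each step depends only on the previous state, doubling the number of input tokens (for example, points in a 3D scene) roughly doubles the work, while pairwise attention over the same tokens would roughly quadruple it.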