ICCV 2025! SJTU & CAS MambaFusion: The First SOTA Mamba-Based Multi-Modal 3D Detection
自动驾驶之心 · 2025-07-10 12:40
Core Viewpoint
- The article presents MambaFusion, a state-of-the-art (SOTA) framework for multi-modal 3D object detection that uses a pure Mamba module for efficient dense global fusion, achieving significant performance gains in camera-LiDAR integration [1][3][30].

Summary by Sections

Introduction
- 3D object detection is essential for modern autonomous driving, providing the environmental understanding required by downstream tasks such as perception and motion planning. Multi-sensor fusion, particularly between LiDAR and cameras, improves detection accuracy and robustness thanks to the sensors' complementary strengths [4].

Methodology
- The proposed method includes a high-fidelity LiDAR encoding that compresses voxel features in continuous space, preserving precise height information and improving feature alignment between camera and LiDAR [2][18] (a minimal sketch follows the Conclusion).
- The Hybrid Mamba Block (HMB) is introduced, combining local and global context learning to strengthen multi-modal 3D detection performance [15][11] (a second sketch follows the Conclusion).

Key Contributions
1. Introduction of the Hybrid Mamba Block, the first dense global fusion module built on pure linear attention, balancing efficiency and global perception [11].
2. Development of a high-fidelity LiDAR encoding that significantly improves multi-modal alignment accuracy [11][18].
3. Validation of the feasibility of pure linear fusion, achieving SOTA performance in camera-LiDAR 3D object detection [11][30].

Experimental Results
- The method achieves 75.0 NDS on the nuScenes validation set, outperforming a range of top-tier methods while also offering faster inference [2][24].
- Compared with IS-FUSION, MambaFusion delivers a 50% higher inference speed while maintaining competitive detection accuracy [24][30].

Conclusion
- MambaFusion represents a significant advance in multi-modal 3D object detection, demonstrating effective dense global fusion and precise cross-modal feature alignment, with implications for further research in the field [30].
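To make the "high-fidelity LiDAR encoding" idea concrete, here is a minimal PyTorch sketch. It is an illustration under stated assumptions, not the authors' code: it pools per-point features into a BEV grid while keeping each cell's continuous mean height as an extra channel, so precise z information survives the height compression. The function name, grid size, and extent are hypothetical.

```python
import torch

def high_fidelity_bev_encode(points: torch.Tensor, feats: torch.Tensor,
                             grid: int = 128, extent: float = 51.2) -> torch.Tensor:
    """Sketch (assumed, not the paper's exact method): pool point features
    into a BEV grid while keeping continuous height.

    points: (N, 3) xyz coordinates in meters, x/y roughly in [-extent, extent).
    feats:  (N, C) per-point (or per-voxel-center) features.
    Returns (C + 1, grid, grid): mean features per BEV cell plus the cell's
    continuous mean height, so z is not quantized away during compression.
    """
    cell = 2.0 * extent / grid
    ij = ((points[:, :2] + extent) / cell).long().clamp_(0, grid - 1)
    idx = ij[:, 0] * grid + ij[:, 1]                      # flat BEV cell index
    ones = torch.ones(points.shape[0])
    count = torch.zeros(grid * grid).index_add_(0, idx, ones).clamp_(min=1.0)
    bev = torch.zeros(grid * grid, feats.shape[1]).index_add_(0, idx, feats)
    z = torch.zeros(grid * grid).index_add_(0, idx, points[:, 2])
    out = torch.cat([bev / count[:, None], (z / count)[:, None]], dim=1)
    return out.t().reshape(-1, grid, grid)
```

A continuous height channel of this kind gives the camera branch a precise z reference when image features are projected into BEV, which is the cross-modal alignment benefit the Methodology section describes.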
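The summary describes the Hybrid Mamba Block as pairing local and global context learning at linear cost. The sketch below is one plausible reading, not the paper's implementation: camera and LiDAR tokens are concatenated, mixed first within fixed windows (local) and then across the full sequence (global), with a gated cumulative recurrence standing in for the Mamba selective scan. The class names LinearScan and HybridMambaBlockSketch and the window size are hypothetical.

```python
import torch
import torch.nn as nn

class LinearScan(nn.Module):
    """Gated running-mean recurrence: an O(N) stand-in for a Mamba scan."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj_in = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)
        self.proj_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, C)
        h = torch.cumsum(self.proj_in(x), dim=1)          # causal prefix sum, O(N)
        t = torch.arange(1, x.shape[1] + 1, device=x.device).view(1, -1, 1)
        h = h / t                                         # running mean over the scan
        return self.proj_out(h * torch.sigmoid(self.gate(x)))

class HybridMambaBlockSketch(nn.Module):
    """Local pass inside fixed windows, then one global pass over all tokens."""
    def __init__(self, dim: int, window: int = 16):
        super().__init__()
        self.window = window
        self.local_mixer = LinearScan(dim)
        self.global_mixer = LinearScan(dim)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, cam_tokens: torch.Tensor, lidar_tokens: torch.Tensor):
        x = torch.cat([cam_tokens, lidar_tokens], dim=1)  # fused sequence (B, N, C)
        B, N, C = x.shape
        pad = (-N) % self.window                          # pad so N splits into windows
        xp = nn.functional.pad(x, (0, 0, 0, pad))
        w = xp.reshape(-1, self.window, C)                # scan each window independently
        local = self.local_mixer(w).reshape(B, -1, C)[:, :N]
        x = self.norm1(x + local)                         # local context
        x = self.norm2(x + self.global_mixer(x))          # dense global mixing, still O(N)
        return x
```

As a quick smoke test, `HybridMambaBlockSketch(dim=64)(torch.randn(2, 100, 64), torch.randn(2, 300, 64))` returns a (2, 400, 64) fused sequence; both passes scale linearly with token count, which is the "pure linear attention" property the contributions list highlights.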