Autonomous Driving Paper Express | Visual Reconstruction, RV Fusion, Reasoning, VLMs, and More
自动驾驶之心 · 2025-08-16 09:43
Core Insights
- The article covers two innovative approaches in autonomous driving technology: Dream-to-Recon for monocular 3D scene reconstruction and SpaRC-AD for radar-camera fusion in end-to-end autonomous driving [2][13].

Group 1: Dream-to-Recon
- Dream-to-Recon is a method developed by the Technical University of Munich that enables monocular 3D scene reconstruction using only a single image for training [2][6].
- The method integrates a pre-trained diffusion model with a deep network through a three-stage framework (a schematic sketch follows Group 2 below):
  1. A View Completion Model (VCM) handles occlusion filling and image distortion correction, achieving a PSNR of 23.9 [2][6].
  2. A Synthetic Occupancy Field (SOF) constructs dense 3D scene geometry from multiple synthetic views, with occlusion reconstruction accuracy (IE_acc) reaching 72%-73%, surpassing multi-view supervised methods by 2%-10% [2][6].
  3. A lightweight distilled model converts the generated geometry into a real-time inference network, achieving overall accuracy (O_acc) of 90%-97% on KITTI-360/Waymo with a 70x speed improvement (75 ms/frame) [2][6].
- The method offers a new paradigm for efficient 3D perception in autonomous driving and robotics without complex sensor calibration [2][6].

Group 2: SpaRC-AD
- SpaRC-AD is the first radar-camera fusion baseline framework for end-to-end autonomous driving, also developed by the Technical University of Munich [13][16].
- The framework combines sparse 3D feature alignment with Doppler velocity measurements (see the second sketch below), achieving a 4.8% improvement in 3D detection mAP, an 8.3% increase in tracking AMOTA, a 4.0% reduction in motion prediction mADE, and a 0.11 m decrease in trajectory planning L2 error [13][16].
- The radar-based fusion strategy significantly enhances performance across multiple tasks, including 3D detection, multi-object tracking, online mapping, and motion prediction [13][16].
- Comprehensive evaluations on the open-loop nuScenes and closed-loop Bench2Drive benchmarks demonstrate its advantages in perception range, motion modeling accuracy, and robustness under adverse conditions [13][16].
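To make the three-stage Dream-to-Recon pipeline above concrete, here is a minimal Python sketch of the data flow: the VCM completes synthetic views, the SOF aggregates them into dense geometry, and a lightweight student is distilled for real-time inference. All names, shapes, and module internals are illustrative assumptions; the article does not show the authors' code.

```python
# A minimal sketch of the three-stage Dream-to-Recon flow summarized above.
# All names, shapes, and module internals are illustrative assumptions made
# for exposition; the article does not show the authors' actual code.
from typing import Callable

import numpy as np


def view_completion_model(image: np.ndarray, novel_pose: np.ndarray) -> np.ndarray:
    """Stage 1 (VCM): a pre-trained diffusion model fills occlusions and corrects
    distortions when the input view is warped to a novel pose. Placeholder only."""
    return image  # a real VCM would run diffusion-based view completion here


def build_synthetic_occupancy_field(views: list) -> np.ndarray:
    """Stage 2 (SOF): aggregate many synthetic views into dense 3D geometry.
    Placeholder: an empty 128^3 voxel occupancy grid."""
    grid = np.zeros((128, 128, 128), dtype=np.float32)
    # a real SOF would fuse per-view depth and visibility into the voxel grid
    return grid


def distill_student(targets: list) -> Callable[[np.ndarray], np.ndarray]:
    """Stage 3: distill the generated geometry into a lightweight network so
    inference is a single forward pass (~75 ms/frame per the article)."""
    def student(image: np.ndarray) -> np.ndarray:
        return np.zeros((128, 128, 128), dtype=np.float32)  # dummy prediction
    return student


if __name__ == "__main__":
    image = np.zeros((376, 1408, 3), dtype=np.uint8)  # a KITTI-360-sized frame
    poses = [np.eye(4) for _ in range(8)]             # 8 synthetic camera poses
    views = [view_completion_model(image, p) for p in poses]
    sof = build_synthetic_occupancy_field(views)
    student = distill_student([sof])
    print(student(image).shape)  # real-time occupancy from a single image
```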
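For SpaRC-AD, the following sketch illustrates one plausible reading of sparse radar-camera alignment: projecting radar returns (3D position plus Doppler velocity) into the camera feature map and fusing their cues at the cells they hit. The function names, shapes, and the fusion rule itself are assumptions for exposition, not the framework's actual implementation.

```python
# Illustrative sketch of sparse radar-to-camera alignment: each radar return
# (position + Doppler velocity) is projected via a pinhole model and its cues
# are scattered onto the camera feature map. Shapes and names are assumptions.
import numpy as np


def project_to_image(points_xyz: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Pinhole projection of Nx3 camera-frame points to Nx2 pixel coordinates."""
    uvw = (K @ points_xyz.T).T
    return uvw[:, :2] / uvw[:, 2:3]


def fuse_radar_into_camera(feat: np.ndarray, radar_xyz: np.ndarray,
                           radar_doppler: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Scatter each radar return's (range, Doppler) cue onto the camera feature
    map cell it projects into. feat has shape (H, W, C) with C >= 2."""
    H, W, _ = feat.shape
    uv = project_to_image(radar_xyz, K)
    out = feat.copy()
    for (u, v), xyz, vel in zip(uv, radar_xyz, radar_doppler):
        col, row = int(u), int(v)
        if 0 <= row < H and 0 <= col < W:
            out[row, col, 0] += np.linalg.norm(xyz)  # range cue
            out[row, col, 1] += vel                  # radial velocity cue
    return out


if __name__ == "__main__":
    K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
    feat = np.zeros((480, 640, 8), dtype=np.float32)             # camera features
    radar_xyz = np.array([[2.0, 0.5, 12.0], [-3.0, 0.2, 25.0]])  # camera frame, m
    doppler = np.array([-4.2, 1.1])                              # radial m/s
    fused = fuse_radar_into_camera(feat, radar_xyz, doppler, K)
    print(fused.shape)
```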
Breaking the High-Resolution Image Reasoning Bottleneck: Fudan University and Nanyang Technological University Propose MGPO, a Multi-Turn Reinforcement Learning Framework Based on Visual Grounding
机器之心 · 2025-07-21 04:04
Core Insights
- The article discusses the development of a multi-turn reinforcement learning method called MGPO, which enhances the visual reasoning capabilities of large multi-modal models (LMMs) when processing high-resolution images [1][8][21]
- MGPO allows LMMs to automatically predict key area coordinates and crop sub-images based on the question, improving the model's ability to focus on relevant information without requiring expensive grounding annotations [2][21]

Summary by Sections

Introduction
- Current LMMs, such as Qwen2.5-VL, face challenges in processing high-resolution images because images are converted into a large number of visual tokens, many of which are irrelevant to the task [5][6]
- The human visual system employs a task-driven visual search strategy, which MGPO aims to replicate by enabling LMMs to focus on key areas of images [6][7]

Method Overview
- MGPO simulates a multi-step visual reasoning process in which the model first predicts key area coordinates and then crops sub-images for further reasoning (see the rollout sketch after the Conclusion) [10][21]
- The method overcomes the limitation of traditional visual grounding models, which require extensive grounding annotations for training [7][21]

Key Innovations of MGPO
- A top-down, interpretable visual reasoning mechanism that allows LMMs to conduct question-driven visual searches [2]
- The ability to accurately identify relevant area coordinates in high-resolution images even when visual tokens are limited [2]
- The model can be trained on standard Visual Question Answering (VQA) datasets without additional grounding annotations, relying solely on answer correctness for feedback (see the reward sketch below) [2][21]

Experimental Results
- MGPO demonstrated significant performance improvements over SFT and GRPO, with gains of 5.4% and 5.2%, respectively, in benchmark tests [18][19]
- The model outperformed OpenAI's models despite being trained on a smaller dataset, showcasing its effectiveness [18][19]
- The proportion of valid grounding coordinates generated by MGPO increased significantly during training, indicating its ability to develop robust visual grounding capabilities autonomously [20]

Conclusion
- MGPO effectively addresses visual token redundancy and key information loss in high-resolution image processing [21]
- The method shows that reinforcement learning can foster robust grounding capabilities without the need for costly annotations, enhancing the efficiency of LMMs [21]
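As referenced in the Method Overview, here is a minimal sketch of MGPO's multi-turn loop: the model first emits key-area coordinates for the question, a sub-image is cropped from the high-resolution input, and the model answers from the crop. The `lmm` callable and its prompt and coordinate formats are assumptions; the article does not specify MGPO's actual interface.

```python
# A minimal, assumption-laden sketch of the multi-turn grounding loop MGPO is
# described as using. The `lmm` callable and its "x1,y1,x2,y2" reply format
# are hypothetical stand-ins for a real large multi-modal model.
from typing import Callable, Tuple

from PIL import Image


def mgpo_rollout(lmm: Callable[[str, Image.Image], str],
                 image: Image.Image, question: str) -> Tuple[str, tuple]:
    # Turn 1: ask the model where to look; assume it replies with pixel
    # coordinates "x1,y1,x2,y2" of the question-relevant region.
    coords_text = lmm(
        f"Question: {question}\nReturn the key region as x1,y1,x2,y2.", image)
    x1, y1, x2, y2 = (int(v) for v in coords_text.split(","))
    # Clamp to the image so an out-of-range prediction still yields a valid crop.
    w, h = image.size
    box = (max(0, x1), max(0, y1), min(w, x2), min(h, y2))
    crop = image.crop(box)
    # Turn 2: answer using the high-resolution crop of the key area.
    answer = lmm(
        f"Question: {question}\nAnswer using this zoomed-in region.", crop)
    return answer, box


if __name__ == "__main__":
    # Stub LMM so the sketch runs end to end without a real model.
    def stub_lmm(prompt: str, img: Image.Image) -> str:
        return "100,100,400,300" if "x1,y1,x2,y2" in prompt else "red"

    img = Image.new("RGB", (2048, 1536))
    print(mgpo_rollout(stub_lmm, img, "What color is the traffic light?"))
```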
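And here is a sketch of the annotation-free feedback the article attributes to MGPO training: a binary answer-correctness reward on standard VQA data, normalized within a rollout group in the usual GRPO style. The exact-match check and normalization constant are assumptions, not the paper's exact recipe.

```python
# Sketch of an answer-correctness-only reward with GRPO-style group
# normalization. The exact-match criterion and the 1e-6 stabilizer are
# illustrative assumptions.
from statistics import mean, pstdev


def answer_reward(pred: str, gold: str) -> float:
    """1.0 if the predicted answer matches the ground truth, else 0.0."""
    return float(pred.strip().lower() == gold.strip().lower())


def grpo_advantages(rewards: list) -> list:
    """Group-relative advantages: standardize rewards within one rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-6) for r in rewards]


if __name__ == "__main__":
    group = ["a red light", "a green light", "a red light", "red light"]
    gold = "a red light"
    rewards = [answer_reward(p, gold) for p in group]
    print(rewards)                   # [1.0, 0.0, 1.0, 0.0]
    print(grpo_advantages(rewards))  # positive for correct rollouts
```

Because the reward depends only on whether the final answer is right, no grounding boxes need to be labeled: rollouts whose intermediate crops help the model answer correctly are reinforced, which is how the article explains the rise in valid grounding coordinates during training.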