Vision-Language-Action (VLA) Framework
Latest from HKUST & Li Auto: OmniReason, a new temporally guided VLA decision framework
自动驾驶之心· 2025-09-10 23:33
Core Insights
- The article covers OmniReason, a novel Vision-Language-Action (VLA) framework that strengthens spatiotemporal reasoning in autonomous driving by integrating dynamic 3D environment modeling with the decision-making process [2][6][8].

Data and Framework
- OmniReason-Data comprises two large-scale VLA datasets, OmniReason-nuScenes and OmniReason-Bench2Drive, which provide dense spatiotemporal annotations and natural-language explanations while preserving physical realism and temporal coherence [2][6][8].
- The OmniReason-Agent architecture combines a sparse temporal memory module for persistent scene-context modeling with an explanation generator for human-interpretable decision-making, capturing spatiotemporal causal reasoning patterns (an illustrative sketch of such a memory module follows this summary) [2][7][8].

Performance and Evaluation
- Extensive experiments on open-loop planning tasks and visual question answering (VQA) benchmarks show state-of-the-art performance, enabling interpretable, time-aware autonomous vehicles in complex dynamic environments [3][8][25][26].
- In open-loop planning, OmniReason-Agent reports an average L2 error of 0.34 meters, matching the top method ORION, while setting a new best (lowest) violation rate of 3.18% (the L2 metric is sketched after this summary) [25][26].

Contributions
- The comprehensive VLA datasets emphasize causal reasoning grounded in spatial and temporal context, setting a new benchmark for interpretability and authenticity in autonomous driving research [8].
- A template-based annotation framework yields high-quality, interpretable language-action pairs across diverse driving scenarios, reducing hallucination and supplying rich multimodal reasoning information (a toy template example is given after this summary) [8][14][15].

Related Work
- The article reviews the evolution of autonomous driving datasets, highlighting the shift from single-task annotations to comprehensive scene understanding, and discusses the limitations of existing vision-language models (VLMs) in dynamic environments [10][11].
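The summary describes a sparse temporal memory module for persistent scene-context modeling but gives no implementation details. Below is a minimal sketch of what such a module could look like, assuming the memory is a fixed-length buffer of per-frame tokens that are sparsified by a learned saliency score and fused with the current frame via cross-attention; the class name `SparseTemporalMemory` and all hyperparameters are illustrative assumptions, not OmniReason's actual design.

```python
# Hypothetical sparse temporal memory sketch; not the OmniReason implementation.
import torch
import torch.nn as nn


class SparseTemporalMemory(nn.Module):
    """Keeps the top-k most salient tokens of recent frames and attends to them."""

    def __init__(self, dim: int, max_frames: int = 8, top_k: int = 32, num_heads: int = 8):
        super().__init__()
        self.max_frames = max_frames      # number of past frames retained
        self.top_k = top_k                # tokens kept per frame (the "sparse" part)
        self.score = nn.Linear(dim, 1)    # learned saliency score for token selection
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.memory = []                  # ring buffer of sparsified frame tokens

    def _sparsify(self, tokens):
        # tokens: (B, N, D) -> keep the top_k tokens with the highest saliency
        scores = self.score(tokens).squeeze(-1)                  # (B, N)
        idx = scores.topk(self.top_k, dim=1).indices             # (B, top_k)
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))  # (B, top_k, D)
        return torch.gather(tokens, 1, idx)

    def forward(self, frame_tokens):
        # Push the sparsified current frame into the buffer (detached so gradients
        # do not flow back through past frames), dropping the oldest frame.
        self.memory.append(self._sparsify(frame_tokens).detach())
        if len(self.memory) > self.max_frames:
            self.memory.pop(0)
        # Fuse the current frame with the accumulated temporal memory.
        mem = torch.cat(self.memory, dim=1)                      # (B, T*top_k, D)
        fused, _ = self.cross_attn(frame_tokens, mem, mem)
        return frame_tokens + fused                              # residual fusion


# Toy usage: four frames of 256 scene tokens with 128-d features.
module = SparseTemporalMemory(dim=128)
for _ in range(4):
    out = module(torch.randn(2, 256, 128))
print(out.shape)  # torch.Size([2, 256, 128])
```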
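The 0.34 m figure refers to the standard open-loop planning metric: the average Euclidean (L2) distance between predicted and ground-truth ego waypoints over the planning horizon. A minimal sketch of that computation follows; the exact protocol (which horizons are averaged, any masking of invalid frames) is defined by the benchmark, and the function name `average_l2_error` is purely illustrative.

```python
# Sketch of the open-loop planning L2 metric: mean Euclidean distance between
# predicted and ground-truth ego waypoints. Horizon handling here is only
# approximate relative to the nuScenes-style evaluation protocol.
import numpy as np


def average_l2_error(pred_traj: np.ndarray, gt_traj: np.ndarray) -> float:
    """pred_traj, gt_traj: (T, 2) arrays of future (x, y) waypoints in meters."""
    assert pred_traj.shape == gt_traj.shape
    per_step = np.linalg.norm(pred_traj - gt_traj, axis=-1)  # (T,) distances
    return float(per_step.mean())


# Toy usage: a 6-step (3 s at 2 Hz) predicted trajectory vs. ground truth.
pred = np.array([[1.0, 0.1], [2.0, 0.2], [3.1, 0.2], [4.0, 0.4], [5.2, 0.5], [6.0, 0.7]])
gt = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0], [5.0, 0.0], [6.0, 0.0]])
print(f"avg L2: {average_l2_error(pred, gt):.2f} m")
```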
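The summary credits the template-based annotation framework with reducing hallucination because explanations are assembled from structured scene facts rather than free-form model generation. The toy sketch below illustrates that idea only; the fact schema, template wording, and names are invented for this example and are not taken from the OmniReason datasets.

```python
# Toy illustration of template-based language-action annotation: structured
# scene facts are slotted into a fixed sentence template and paired with the
# recorded driving action. Schema and wording are hypothetical.
from dataclasses import dataclass


@dataclass
class SceneFacts:
    lead_vehicle: str   # e.g. "a braking sedan"
    distance_m: float   # gap to the lead vehicle in meters
    ego_action: str     # recorded maneuver, e.g. "decelerate"


TEMPLATE = "Because {lead} is {dist:.0f} m ahead and closing, the ego vehicle should {action}."


def annotate(facts: SceneFacts) -> dict:
    """Return a language-action pair built from structured scene facts."""
    explanation = TEMPLATE.format(lead=facts.lead_vehicle,
                                  dist=facts.distance_m,
                                  action=facts.ego_action)
    return {"explanation": explanation, "action": facts.ego_action}


print(annotate(SceneFacts("a braking sedan", 18.0, "decelerate")))
```

Grounding the explanation in labeled facts is what keeps such annotations consistent with the scene, which is the property the article highlights when it says the framework reduces hallucination.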