Vision-Language Large Models

Gaode's TrafficVLM Model Upgraded Again: AI Grants a "Heavenly Eye" Perspective That Anticipates Network-Wide Traffic; When AI "Sees" Real-Time Traffic, the Smart Navigation Experience May Be Redefined
Yang Zi Wan Bao Wang· 2025-09-19 08:39
Core Insights
- The article discusses the challenges drivers face in modern traffic environments, particularly how limited local visibility hinders optimal decision-making. To address this, Gaode Navigation has upgraded its TrafficVLM model, giving users a more comprehensive view of traffic conditions and improving the driving experience [1][2].

Group 1: TrafficVLM Model Capabilities
- TrafficVLM gives users a "heavenly eye" perspective, providing a complete picture of the traffic situation and enabling better decisions in complex environments [2][4].
- The model runs in real time, continuously analyzing traffic conditions and offering timely suggestions for navigating around potential congestion [4][11].
- TrafficVLM is backed by an underlying system that builds dynamic twin video streams from real-time traffic data, keeping the digital view accurately synchronized with the real world [5].

Group 2: Intelligent Decision-Making
- The model can identify traffic incidents, such as accidents, and predict their impact on traffic flow, allowing proactive navigation suggestions [4][11].
- TrafficVLM covers the entire traffic analysis process, from perception to decision-making, forming a complete intelligent feedback loop (a minimal sketch follows this summary) [11].
- The integration of traffic twin restoration with visual language models lets TrafficVLM actively perceive and understand traffic dynamics, making navigation more intuitive and efficient [11].
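The summary describes a perception-to-decision feedback loop built on twin video streams. The sketch below illustrates, in schematic form, how such a loop could turn per-road observations into reroute advice; the class names, fields, and the simple threshold rule are illustrative assumptions, not Gaode's actual system or API.

```python
# Hedged sketch of a perception -> understanding -> decision loop in the spirit of
# what the article attributes to TrafficVLM. All names here are hypothetical.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TrafficFrame:
    """One observation from a (simulated) digital-twin video stream."""
    road_id: str
    timestamp: float
    vehicle_count: int
    incident: Optional[str] = None  # e.g. "accident", "stalled_vehicle"

@dataclass
class RouteAdvice:
    road_id: str
    action: str   # "keep" or "reroute"
    reason: str

def assess_frame(frame: TrafficFrame, capacity: int = 40) -> RouteAdvice:
    """Toy decision rule standing in for incident detection + congestion prediction."""
    if frame.incident is not None:
        return RouteAdvice(frame.road_id, "reroute", f"incident detected: {frame.incident}")
    if frame.vehicle_count > capacity:
        return RouteAdvice(frame.road_id, "reroute", "predicted congestion")
    return RouteAdvice(frame.road_id, "keep", "traffic flowing normally")

def advise(stream: List[TrafficFrame]) -> List[RouteAdvice]:
    """Run the loop over a stream, mimicking continuous real-time analysis."""
    return [assess_frame(f) for f in stream]

if __name__ == "__main__":
    demo = [
        TrafficFrame("G2-km120", 0.0, vehicle_count=18),
        TrafficFrame("G2-km121", 1.0, vehicle_count=55),
        TrafficFrame("G2-km122", 2.0, vehicle_count=12, incident="accident"),
    ]
    for advice in advise(demo):
        print(advice)
```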
Closed-Loop End-to-End Performance Jumps 20%! HUST & Xiaomi Build the Open-Source Framework ORION
自动驾驶之心· 2025-08-30 16:03
Core Viewpoint
- The article discusses advances in end-to-end (E2E) autonomous driving, focusing on the ORION framework, which integrates vision-language models (VLMs) for improved decision-making in complex environments [3][30].

Summary by Sections

Introduction
- Recent E2E autonomous driving methods still struggle with complex closed-loop interactions because of limited causal reasoning capabilities [3][12].
- VLMs offer new hope for E2E autonomous driving, but a significant gap remains between a VLM's semantic reasoning space and the numerical action space required for driving [3][17].

ORION Framework
- ORION is proposed as an end-to-end autonomous driving framework that uses visual-language instructions for trajectory generation [3][18].
- The framework combines QT-Former for aggregating long-term historical context, a VLM for scene understanding and reasoning, and a generative model that aligns the reasoning and action spaces (a minimal sketch of this pipeline follows the summary) [3][16][18].

Performance Evaluation
- ORION achieved a driving score of 77.74 and a success rate of 54.62% on the challenging Bench2Drive benchmark, outperforming previous state-of-the-art (SOTA) methods by 14.28 points and 19.61% in success rate [5][24].
- The framework showed superior performance in specific driving scenarios such as overtaking (71.11%), emergency braking (78.33%), and traffic sign recognition (69.15%) [26].

Key Contributions
- The article highlights three key contributions of ORION:
  1. QT-Former improves the model's understanding of historical scenes by effectively aggregating long-term visual context [20].
  2. The VLM enables multi-dimensional analysis of driving scenes, integrating user instructions and historical information for action reasoning [21].
  3. The generative model aligns the VLM's reasoning space with the action space for trajectory prediction, ensuring reasonable driving decisions in complex scenarios [22].

Conclusion
- ORION offers a novel solution for E2E autonomous driving by aligning semantic and action spaces, aggregating long-term context, and jointly optimizing visual understanding and path planning [30].
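To make the reasoning-to-action alignment more concrete, here is a minimal numerical sketch of the pipeline the summary attributes to ORION: learnable queries pool long-horizon frame features (QT-Former-style), the pooled context is fused with a reasoning embedding, and a small decoder maps the result into (x, y) waypoints. All shapes, the attention pooling, and the linear decoder are assumptions for illustration, not the paper's actual architecture.

```python
# Hedged sketch: history aggregation + semantic-to-action decoding.
# Sizes and operations are illustrative stand-ins, not ORION's real layers.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_history(frame_feats, queries):
    """Cross-attention-style pooling: Q learnable queries attend over T frame features."""
    # frame_feats: (T, D), queries: (Q, D) -> (Q, D)
    attn = softmax(queries @ frame_feats.T / np.sqrt(frame_feats.shape[1]), axis=-1)
    return attn @ frame_feats

def decode_trajectory(context, reasoning_emb, w_out):
    """Map fused semantic context into K (x, y) waypoints (the numerical action space)."""
    fused = np.concatenate([context.mean(axis=0), reasoning_emb])  # (2D,)
    return (fused @ w_out).reshape(-1, 2)                          # (K, 2)

D, T, Q, K = 64, 12, 4, 6
frame_feats   = rng.normal(size=(T, D))                 # stand-in for per-frame visual features
queries       = rng.normal(size=(Q, D))                 # stand-in for learnable history queries
reasoning_emb = rng.normal(size=D)                      # stand-in for the VLM's reasoning token
w_out         = rng.normal(size=(2 * D, K * 2)) * 0.05  # stand-in for the generative decoder

context = aggregate_history(frame_feats, queries)
waypoints = decode_trajectory(context, reasoning_emb, w_out)
print(waypoints.round(2))  # K planned (x, y) offsets, purely illustrative
```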
5,700 QA Pairs Put AI's Spatial Sense to a Comprehensive Test! The Latest Spatial Intelligence Benchmark Arrives | Zhejiang University & UESTC & CUHK
量子位· 2025-06-02 04:13
Core Insights
- The article discusses the limitations of current visual language models (VLMs) in spatial reasoning and multi-perspective understanding, highlighting the need for AI systems that can collaborate more effectively with humans [1][3][20].

Group 1: ViewSpatial-Bench Development
- A new benchmark, ViewSpatial-Bench, has been developed by research teams from Zhejiang University, the University of Electronic Science and Technology of China, and The Chinese University of Hong Kong to evaluate VLMs' spatial reasoning across multiple perspectives [4][33].
- ViewSpatial-Bench includes 5 task types and over 5,700 question-answer pairs, assessing models from both camera and human perspectives (a minimal scoring sketch follows this summary) [5][7].
- The benchmark targets the fragmented spatial understanding of VLMs, which often causes performance problems in multi-perspective tasks [2][20].

Group 2: Model Performance Evaluation
- Evaluation of leading models, including GPT-4o and Gemini 2.0, showed that their understanding of spatial relationships remains inadequate, with low overall accuracy [19][20].
- The results revealed a significant performance gap between camera-perspective and human-perspective tasks, suggesting current VLMs lack a unified spatial cognitive framework [22][23].
- The Multi-View Spatial Model (MVSM) was introduced to strengthen cross-perspective spatial understanding, achieving a 46.24% absolute performance improvement over its backbone model [27][28].

Group 3: Future Directions
- The findings point to a structural imbalance in how perspectives are distributed in training data, indicating a need for better data construction and model optimization [26].
- Together, MVSM and ViewSpatial-Bench offer a feasible path toward human-like spatial cognition in AI systems, which is crucial for the next generation of robots and multimodal assistants [34].
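As a concrete illustration of perspective-aware scoring, the sketch below computes accuracy keyed by (perspective, task) over multiple-choice records. The record schema, task names, and the baseline stub are hypothetical; ViewSpatial-Bench's actual data format and evaluation protocol may differ.

```python
# Hedged sketch of scoring a multi-perspective spatial benchmark by perspective and task.
# Field names and the toy model are assumptions, not the benchmark's real interface.
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]  # expects keys: "perspective", "task", "question", "answer"

def evaluate(records: List[Record], model: Callable[[str], str]) -> Dict[Tuple[str, str], float]:
    """Return accuracy keyed by (perspective, task)."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        key = (r["perspective"], r["task"])
        total[key] += 1
        if model(r["question"]).strip().lower() == r["answer"].strip().lower():
            correct[key] += 1
    return {k: correct[k] / total[k] for k in total}

if __name__ == "__main__":
    # Tiny synthetic sample standing in for the 5,700+ real question-answer pairs.
    sample = [
        {"perspective": "camera", "task": "relative_direction",
         "question": "Is the chair left or right of the table?", "answer": "left"},
        {"perspective": "human", "task": "relative_direction",
         "question": "From the person's view, is the cup in front or behind?", "answer": "behind"},
    ]
    always_left = lambda q: "left"  # trivially weak baseline model
    for (view, task), acc in evaluate(sample, always_left).items():
        print(f"{view:6s} {task:20s} acc={acc:.2f}")
```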