Vision-Language Large Models

Gaode's TrafficVLM Model Upgraded Again: AI Grants a "Heavenly Eye" Perspective That Anticipates Network-Wide Traffic; When AI "Sees" Real-Time Traffic, the Smart Navigation Experience May Be Redefined
Yang Zi Wan Bao Wang· 2025-09-19 08:39
Core Insights
- The article discusses the challenges drivers face in modern traffic environments, particularly how limited local visibility hinders optimal decision-making. To address this, Gaode Navigation has upgraded its TrafficVLM model, giving users a more comprehensive view of traffic conditions and improving the driving experience [1][2].

Group 1: TrafficVLM Model Capabilities
- TrafficVLM gives users a "heavenly eye" perspective, providing a complete picture of the traffic situation and enabling better decisions in complex environments [2][4].
- The model runs in real time, continuously analyzing traffic conditions and offering timely suggestions for navigating around potential congestion [4][11].
- TrafficVLM is backed by an underlying system that builds dynamic twin video streams from real-time traffic data, keeping the digital view accurately synchronized with the real world [5].

Group 2: Intelligent Decision-Making
- The model can identify traffic incidents, such as accidents, and predict their impact on traffic flow, allowing proactive navigation suggestions [4][11].
- TrafficVLM covers the entire traffic analysis process, from perception to decision-making, forming a complete intelligent feedback loop (a minimal sketch follows this summary) [11].
- The integration of traffic twin restoration with visual language models lets TrafficVLM actively perceive and understand traffic dynamics, making navigation more intuitive and efficient [11].
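The summary describes a perception-to-decision feedback loop built on twin video streams. The sketch below illustrates, in schematic form, how such a loop could turn per-road observations into reroute advice; the class names, fields, and the simple threshold rule are illustrative assumptions, not Gaode's actual system or API.

```python
# Hedged sketch of a perception -> understanding -> decision loop in the spirit of
# what the article attributes to TrafficVLM. All names here are hypothetical.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TrafficFrame:
    """One observation from a (simulated) digital-twin video stream."""
    road_id: str
    timestamp: float
    vehicle_count: int
    incident: Optional[str] = None  # e.g. "accident", "stalled_vehicle"

@dataclass
class RouteAdvice:
    road_id: str
    action: str   # "keep" or "reroute"
    reason: str

def assess_frame(frame: TrafficFrame, capacity: int = 40) -> RouteAdvice:
    """Toy decision rule standing in for incident detection + congestion prediction."""
    if frame.incident is not None:
        return RouteAdvice(frame.road_id, "reroute", f"incident detected: {frame.incident}")
    if frame.vehicle_count > capacity:
        return RouteAdvice(frame.road_id, "reroute", "predicted congestion")
    return RouteAdvice(frame.road_id, "keep", "traffic flowing normally")

def advise(stream: List[TrafficFrame]) -> List[RouteAdvice]:
    """Run the loop over a stream, mimicking continuous real-time analysis."""
    return [assess_frame(f) for f in stream]

if __name__ == "__main__":
    demo = [
        TrafficFrame("G2-km120", 0.0, vehicle_count=18),
        TrafficFrame("G2-km121", 1.0, vehicle_count=55),
        TrafficFrame("G2-km122", 2.0, vehicle_count=12, incident="accident"),
    ]
    for advice in advise(demo):
        print(advice)
```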
Closed-Loop End-to-End Performance Jumps 20%! HUST & Xiaomi Build the Open-Source Framework ORION
自动驾驶之心· 2025-08-30 16:03
Core Viewpoint
- The article discusses advances in end-to-end (E2E) autonomous driving, focusing on the ORION framework, which integrates vision-language models (VLMs) for improved decision-making in complex environments [3][30].

Summary by Sections

Introduction
- Recent E2E autonomous driving methods still struggle with complex closed-loop interactions because of limited causal reasoning capabilities [3][12].
- VLMs offer new hope for E2E autonomous driving, but a significant gap remains between a VLM's semantic reasoning space and the numerical action space required for driving [3][17].

ORION Framework
- ORION is proposed as an end-to-end autonomous driving framework that uses visual-language instructions for trajectory generation [3][18].
- The framework combines QT-Former for aggregating long-term historical context, a VLM for scene understanding and reasoning, and a generative model that aligns the reasoning and action spaces (a minimal sketch of this pipeline follows the summary) [3][16][18].

Performance Evaluation
- ORION achieved a driving score of 77.74 and a success rate of 54.62% on the challenging Bench2Drive benchmark, outperforming previous state-of-the-art (SOTA) methods by 14.28 points and 19.61% in success rate [5][24].
- The framework showed superior performance in specific driving scenarios such as overtaking (71.11%), emergency braking (78.33%), and traffic sign recognition (69.15%) [26].

Key Contributions
- The article highlights three key contributions of ORION:
  1. QT-Former improves the model's understanding of historical scenes by effectively aggregating long-term visual context [20].
  2. The VLM enables multi-dimensional analysis of driving scenes, integrating user instructions and historical information for action reasoning [21].
  3. The generative model aligns the VLM's reasoning space with the action space for trajectory prediction, ensuring reasonable driving decisions in complex scenarios [22].

Conclusion
- ORION offers a novel solution for E2E autonomous driving by aligning semantic and action spaces, aggregating long-term context, and jointly optimizing visual understanding and path planning [30].
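To make the reasoning-to-action alignment more concrete, here is a minimal numerical sketch of the pipeline the summary attributes to ORION: learnable queries pool long-horizon frame features (QT-Former-style), the pooled context is fused with a reasoning embedding, and a small decoder maps the result into (x, y) waypoints. All shapes, the attention pooling, and the linear decoder are assumptions for illustration, not the paper's actual architecture.

```python
# Hedged sketch: history aggregation + semantic-to-action decoding.
# Sizes and operations are illustrative stand-ins, not ORION's real layers.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_history(frame_feats, queries):
    """Cross-attention-style pooling: Q learnable queries attend over T frame features."""
    # frame_feats: (T, D), queries: (Q, D) -> (Q, D)
    attn = softmax(queries @ frame_feats.T / np.sqrt(frame_feats.shape[1]), axis=-1)
    return attn @ frame_feats

def decode_trajectory(context, reasoning_emb, w_out):
    """Map fused semantic context into K (x, y) waypoints (the numerical action space)."""
    fused = np.concatenate([context.mean(axis=0), reasoning_emb])  # (2D,)
    return (fused @ w_out).reshape(-1, 2)                          # (K, 2)

D, T, Q, K = 64, 12, 4, 6
frame_feats   = rng.normal(size=(T, D))                 # stand-in for per-frame visual features
queries       = rng.normal(size=(Q, D))                 # stand-in for learnable history queries
reasoning_emb = rng.normal(size=D)                      # stand-in for the VLM's reasoning token
w_out         = rng.normal(size=(2 * D, K * 2)) * 0.05  # stand-in for the generative decoder

context = aggregate_history(frame_feats, queries)
waypoints = decode_trajectory(context, reasoning_emb, w_out)
print(waypoints.round(2))  # K planned (x, y) offsets, purely illustrative
```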
5,700 QA Pairs Put AI's Spatial Sense to a Comprehensive Test! The Latest Spatial Intelligence Benchmark Arrives | Zhejiang University & UESTC & CUHK
量子位· 2025-06-02 04:13
Core Insights
- The article discusses the limitations of current visual language models (VLMs) in spatial reasoning and multi-perspective understanding, highlighting the need for AI systems that can collaborate more effectively with humans [1][3][20].

Group 1: ViewSpatial-Bench Development
- A new benchmark, ViewSpatial-Bench, has been developed by research teams from Zhejiang University, the University of Electronic Science and Technology of China, and The Chinese University of Hong Kong to evaluate VLMs' spatial reasoning across multiple perspectives [4][33].
- ViewSpatial-Bench includes 5 task types and over 5,700 question-answer pairs, assessing models from both camera and human perspectives (a minimal scoring sketch follows this summary) [5][7].
- The benchmark targets the fragmented spatial understanding of VLMs, which often causes performance problems in multi-perspective tasks [2][20].

Group 2: Model Performance Evaluation
- Evaluation of leading models, including GPT-4o and Gemini 2.0, showed that their understanding of spatial relationships remains inadequate, with low overall accuracy [19][20].
- The results revealed a significant performance gap between camera-perspective and human-perspective tasks, suggesting current VLMs lack a unified spatial cognitive framework [22][23].
- The Multi-View Spatial Model (MVSM) was introduced to strengthen cross-perspective spatial understanding, achieving a 46.24% absolute performance improvement over its backbone model [27][28].

Group 3: Future Directions
- The findings point to a structural imbalance in how perspectives are distributed in training data, indicating a need for better data construction and model optimization [26].
- Together, MVSM and ViewSpatial-Bench offer a feasible path toward human-like spatial cognition in AI systems, which is crucial for the next generation of robots and multimodal assistants [34].
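As a concrete illustration of perspective-aware scoring, the sketch below computes accuracy keyed by (perspective, task) over multiple-choice records. The record schema, task names, and the baseline stub are hypothetical; ViewSpatial-Bench's actual data format and evaluation protocol may differ.

```python
# Hedged sketch of scoring a multi-perspective spatial benchmark by perspective and task.
# Field names and the toy model are assumptions, not the benchmark's real interface.
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]  # expects keys: "perspective", "task", "question", "answer"

def evaluate(records: List[Record], model: Callable[[str], str]) -> Dict[Tuple[str, str], float]:
    """Return accuracy keyed by (perspective, task)."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        key = (r["perspective"], r["task"])
        total[key] += 1
        if model(r["question"]).strip().lower() == r["answer"].strip().lower():
            correct[key] += 1
    return {k: correct[k] / total[k] for k in total}

if __name__ == "__main__":
    # Tiny synthetic sample standing in for the 5,700+ real question-answer pairs.
    sample = [
        {"perspective": "camera", "task": "relative_direction",
         "question": "Is the chair left or right of the table?", "answer": "left"},
        {"perspective": "human", "task": "relative_direction",
         "question": "From the person's view, is the cup in front or behind?", "answer": "behind"},
    ]
    always_left = lambda q: "left"  # trivially weak baseline model
    for (view, task), acc in evaluate(sample, always_left).items():
        print(f"{view:6s} {task:20s} acc={acc:.2f}")
```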