Vision-Language Models (VLMs)
The Second-Generation AI Pretraining Paradigm: Predicting the Next Physical State
机器之心· 2026-02-04 11:20
Core Viewpoint
- The article discusses the shift from the first generation of AI models, primarily based on "next word prediction," to a second generation focused on "world modeling," or "predicting the next physical state," highlighting the limitations of current AI applications in the physical world [4][8].

Group 1: Current AI Paradigms
- The first generation of AI models, exemplified by large language models (LLMs), has achieved significant success but struggles with real-world applications [4].
- The second generation, as proposed by Jim Fan, emphasizes world modeling, which involves predicting reasonable physical states under specific actions, marking a transformative shift in AI development [8].

Group 2: World Modeling Definition and Implications
- World modeling is defined as predicting the next physical state based on specific actions, with video generation models serving as a practical example [8].
- The article anticipates that 2026 will be a pivotal year for large world models (LWMs) in robotics and multimodal AI, establishing a real foundation for future advancements [8].

Group 3: Comparison of AI Models
- Vision-language models (VLMs) are described as "language-first," with visual information treated as secondary, leading to a gap in physical understanding relative to their linguistic competence [9].
- The design of VLA (vision-language-action) models prioritizes language over physical interaction, resulting in inefficiencies in physical AI applications [10].

Group 4: Biological Insights and Future Directions
- The article draws parallels between human cognitive processing and AI, noting that a significant portion of the human brain is dedicated to visual processing, which is crucial for physical interaction [11].
- The emergence of world modeling is seen as a response to the limitations of current AI paradigms, with potential for new types of reasoning and simulation that do not rely on language [12].

Group 5: Challenges and Future Research
- The article raises open questions about the future of AI, including how to decode action instructions and whether pixel reconstruction is the optimal training objective [13].
- It emphasizes the need for further exploration in the field, suggesting a return to fundamental research principles as the industry seeks a "GPT-3 moment" in robotics [13].
FutureSightDrive: Unified Training of World Models & VLMs
自动驾驶之心· 2025-10-13 23:33
Author | 么么牛  Editor | 自动驾驶之心
Original link: https://zhuanlan.zhihu.com/p/1961012043571266494
Paper: https://arxiv.org/pdf/2505.17685

Q1: What problem does this paper try to solve?
The paper addresses the ambiguous spatio-temporal relationships and loss of fine-grained information that arise when vision-language models (VLMs) perform trajectory planning and scene understanding in autonomous driving. Existing VLMs typically process the current scene with a discrete, text-based chain of thought (Chain-of-Thought, CoT), which is in essence a highly abstract, symbolic compression of visual information; this can blur spatio-temporal relationships, discard fine-grained detail, and introduce a modality-conversion gap. The paper proposes a spatio-temporal CoT that lets the model reason visually, enabling more effective trajectory planning and scene understanding.

Q2: What related work is there?
The paper cites the following related research: unified multimodal understanding ...
Goodbye High Latency! SJTU's Prune2Drive: A Pruning Tool for Autonomous-Driving VLMs, 6x Speedup with Performance Preserved
自动驾驶之心· 2025-08-28 23:32
Core Viewpoint
- The article discusses the Prune2Drive framework, developed by Shanghai Jiao Tong University and Shanghai AI Lab, which achieves a 6.4x acceleration in visual token processing while reducing performance by only 3%, using a pruning method that eliminates 90% of visual tokens [2][3][25].

Group 1: Research Background and Challenges
- Vision-language models (VLMs) provide a unified framework for perception, reasoning, and decision-making in autonomous driving, enhancing scene understanding and reducing error propagation [2].
- Deploying VLMs in real driving scenarios faces significant computational challenges: high-resolution images from multiple cameras lead to increased inference latency and memory consumption [3].
- Existing token pruning methods adapt poorly to multi-view scenarios, often neglecting spatial-semantic diversity and the varying contributions of different camera views [4].

Group 2: Prune2Drive Framework
- Prune2Drive introduces a Token-wise Farthest Point Sampling (T-FPS) mechanism that maximizes the semantic and spatial coverage of multi-view tokens rather than relying solely on individual token significance [6].
- T-FPS uses cosine distance to measure semantic similarity between tokens, ensuring that the selected tokens are non-redundant and semantically rich [10][11].
- A view-adaptive pruning controller optimizes the pruning ratio for each view, allocating the token budget according to each view's contribution to driving decisions [11][12].

Group 3: Experimental Design and Results
- Experiments on two multi-view VLM benchmarks (DriveLM, DriveLMM-o1) validate the performance retention and efficiency gains of Prune2Drive against baseline methods [16].
- Even with a 90% token reduction, the framework maintained a risk-assessment accuracy of 68.34, outperforming several baselines [22].
- The efficiency gains were substantial: a 6.4x acceleration on the DriveMM model and a 2.64x acceleration on the DriveLMM-o1 model [25].

Group 4: Key Findings and Advantages
- Prune2Drive effectively captures critical information in driving scenarios, outperforming other methods at identifying key objects across views [26].
- The framework is plug-and-play, requires no retraining of the VLM, and is compatible with efficient implementations such as Flash Attention [31].
- It balances performance and efficiency, substantially reducing computational load while preserving essential semantic information [31].
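The coverage-maximizing selection behind T-FPS can be illustrated with a minimal sketch. This is not the paper's implementation, only the generic greedy farthest-point-sampling loop under cosine distance that the summary describes: each step keeps the token farthest from everything already selected, so redundant (near-duplicate) tokens are naturally skipped. The function name `tfps_prune` and the seed-token choice are assumptions for illustration.

```python
import numpy as np

def tfps_prune(tokens: np.ndarray, keep: int, seed_idx: int = 0) -> np.ndarray:
    """Greedy farthest point sampling over visual tokens (T-FPS-style sketch).

    tokens: (N, D) array of visual token embeddings.
    keep:   number of tokens to retain.
    Returns the indices of the kept tokens.
    """
    # Normalize rows so that cosine distance = 1 - dot product.
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    selected = [seed_idx]
    # Cosine distance from every token to its nearest selected token.
    min_dist = 1.0 - normed @ normed[seed_idx]
    for _ in range(keep - 1):
        nxt = int(np.argmax(min_dist))  # token farthest from the selected set
        selected.append(nxt)
        # Update nearest-selected distances with the newly added token.
        min_dist = np.minimum(min_dist, 1.0 - normed @ normed[nxt])
    return np.array(selected)
```

Because already-selected tokens have distance zero to the set, they are never re-chosen, and a token nearly identical to a kept one scores close to zero, which is exactly the non-redundancy property the summary attributes to T-FPS.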
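The view-adaptive controller's core idea, giving more informative camera views a larger share of the token budget, can also be sketched. This is a hypothetical proportional-allocation scheme, not the paper's actual controller: the function `allocate_view_budgets`, the use of generic per-view importance scores, and the per-view floor are all assumptions for illustration.

```python
import numpy as np

def allocate_view_budgets(view_scores, total_keep: int, min_per_view: int = 1):
    """Split a global token budget across camera views by importance.

    view_scores: per-view importance scores (e.g. attention mass), length V.
    total_keep:  total number of visual tokens to retain across all views.
    Returns a list of per-view budgets summing to total_keep.
    """
    scores = np.asarray(view_scores, dtype=float)
    weights = scores / scores.sum()
    # Reserve a floor for every view, then allocate the rest proportionally.
    spare = total_keep - min_per_view * len(scores)
    budgets = min_per_view + np.floor(weights * spare).astype(int)
    # Hand the rounding remainder to the highest-weight views.
    for i in np.argsort(-weights)[: total_keep - budgets.sum()]:
        budgets[i] += 1
    return budgets.tolist()
```

For example, with a front camera scored 3 and five side/rear cameras scored 1 each, a 60-token budget concentrates tokens on the front view while the floor guarantees no view is pruned away entirely, matching the contribution-aware allocation described in Group 2.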