Vision-Language Models

ICCV 2025 Highlight | A New Paradigm for 3D Ground-Truth Generation: Automated Semantic Occupancy Annotation for Open Driving Scenes!
机器之心· 2025-08-29 00:15
Both the first author and the corresponding author of this paper are from the VDIG (Visual Data Interpreting and Generation) lab at the Wangxuan Institute of Computer Technology, Peking University. The first author is PhD student Zhou Xiaoyu, and the corresponding author is associate researcher and doctoral advisor Wang Yongtao. In recent years the VDIG lab has published a series of high-impact results at top venues including IJCV, CVPR, AAAI, ICCV, ICML, and ECCV, has repeatedly won first and second prizes in major domestic and international computer-vision competitions, and collaborates widely with well-known universities and research institutes in China and abroad. This article introduces AutoOcc, the latest work from Wang Yongtao's team at the Wangxuan Institute and collaborators. For open autonomous-driving scenes, the work proposes an efficient, high-quality open-ended framework for annotating 3D semantic occupancy ground truth: without any human annotation, it surpasses existing automated occupancy-annotation and prediction pipelines while showing strong versatility and generalization. The paper has been accepted to ICCV 2025 as a Highlight. Paper title: AutoOcc: Automatic Open-Ended Semantic Occupancy Annotation via Vision-Language Guided Gaussian Splatting ...
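The excerpt above is truncated, but the end product of such a pipeline, a semantic occupancy grid, can be illustrated at a toy level. Below is a minimal sketch that voxelizes semantically labeled 3D points (standing in for the centers of reconstructed Gaussians) by majority vote; the function and grid conventions are illustrative assumptions, not AutoOcc's actual method, which works with full vision-language guided Gaussian splatting rather than bare points.

```python
import numpy as np

def voxelize_semantic_points(points, labels, grid_min, grid_max,
                             voxel_size, num_classes):
    """Turn semantically labeled 3D points into a voxel occupancy grid.

    points: (N, 3) xyz positions; labels: (N,) integer class per point.
    Returns an (X, Y, Z) int grid where -1 marks free/unobserved voxels
    and every other cell holds the majority class of the points inside it.
    """
    lo = np.asarray(grid_min, dtype=np.float32)
    hi = np.asarray(grid_max, dtype=np.float32)
    dims = np.ceil((hi - lo) / voxel_size).astype(int)

    # One vote counter per (voxel, class) pair.
    votes = np.zeros((dims[0], dims[1], dims[2], num_classes), dtype=np.int32)
    idx = np.floor((points - lo) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < dims), axis=1)
    idx, lab = idx[inside], labels[inside]
    np.add.at(votes, (idx[:, 0], idx[:, 1], idx[:, 2], lab), 1)

    occ = votes.argmax(axis=-1)
    occ[votes.sum(axis=-1) == 0] = -1  # no points -> voxel stays free
    return occ
```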
ExploreVLM: A Closed-Loop Robot Exploration and Task-Planning Framework Based on Vision-Language Models
具身智能之心· 2025-08-20 00:03
Research Background and Core Issues
- The development of embodied intelligence has brought robots into daily life as human assistants, requiring them to interpret high-level instructions, perceive dynamic environments, and adjust plans in real time [3]
- Vision-Language Models (VLMs) have become a major direction for robot task planning, but existing methods fall short in three areas: insufficient interactive exploration, limited perception accuracy, and poor planning adaptability [6]

Proposed Framework
- The ExploreVLM framework integrates perception, planning, and execution verification in a closed-loop design to address these limitations [5]

Core Framework Design
- ExploreVLM runs a "perception-planning-execution-verification" closed loop, targeting the three failure modes of prior methods:
 1. Insufficient interactive exploration in scenarios that require actively gathering information [6]
 2. Limited perception accuracy in capturing object spatial relationships and dynamic changes [6]
 3. Poor planning adaptability, since open-loop static planning can fail in complex environments [6]

Key Module Analysis (illustrative sketches follow this summary)
1. **Goal-Centric Spatial Relation Graph (Scene Perception)**
 - Builds a structured graph representation to support complex reasoning, extracting object categories, attributes, and spatial relations from the initial RGB image and the task goal [8]
 - A two-stage planner generates sub-goals and action sequences for the exploration and completion phases, refined through self-reflection [8]
 - The execution validator compares pre- and post-execution states to generate feedback and dynamically adjust the plan until the task completes [8]
2. **Dual-Stage Self-Reflective Planner**
 - Separates "exploring unknown information" from "achieving the goal," using a self-reflection mechanism to correct plans and catch logical errors [10]
 - The exploration stage generates sub-goals for information gathering; the completion stage generates action sequences based on the exploration results [10]
3. **Execution Validator**
 - Implements step-by-step validation so real-time feedback flows back into the closed loop, supporting dynamic adjustment [14]

Experimental Validation
1. **Setup** - Run on a real robot platform over five tasks of increasing complexity, against the baselines ReplanVLM and VILA, with a 50% action-failure rate injected to test robustness [15]
2. **Core Results**
 - ExploreVLM reached an average success rate of 94%, far above ReplanVLM (22%) and VILA (30%) [16]
 - The framework performed effective action validation and logical-consistency checks, ensuring task goals were met [17]
3. **Ablation Studies** - Performance dropped sharply when core modules were removed, underscoring how the three modules work together [19]

Comparison with Related Work
- ExploreVLM overcomes the limitations of existing methods through structured perception, dual-stage planning, and stepwise closed-loop verification, improving task execution and adaptability [20]
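To make the first module concrete, here is a minimal sketch of a goal-centric spatial relation graph as a plain data structure. All names here are illustrative assumptions; in ExploreVLM the graph is extracted from RGB images and the task goal by a VLM, not built by hand.

```python
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    name: str                                             # e.g. "mug"
    attributes: list[str] = field(default_factory=list)   # e.g. ["red"]

@dataclass
class SpatialRelationGraph:
    """Objects as nodes, spatial relations as (subject, relation, object)
    edges, organized around a task goal."""
    goal: str
    nodes: dict[str, SceneObject] = field(default_factory=dict)
    edges: list[tuple[str, str, str]] = field(default_factory=list)

    def add_relation(self, subj: str, relation: str, obj: str) -> None:
        self.edges.append((subj, relation, obj))

    def relations_of(self, name: str) -> list[tuple[str, str, str]]:
        return [e for e in self.edges if name in (e[0], e[2])]

# Example: the mug the robot must fetch is hidden inside a closed drawer.
g = SpatialRelationGraph(goal="retrieve the mug")
g.nodes["drawer"] = SceneObject("drawer", ["closed"])
g.nodes["mug"] = SceneObject("mug", ["red"])
g.add_relation("mug", "inside", "drawer")
```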
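The closed loop itself can be summarized as a control-flow skeleton, with the four VLM-backed modules abstracted as callables. This is a sketch of the loop described above, under the assumption that the planner returns an empty action list once it judges the goal satisfied; it is not ExploreVLM's implementation.

```python
def run_closed_loop(perceive, plan, execute, verify, goal, max_steps=20):
    """Perception-planning-execution-verification loop (sketch).

    perceive() -> scene graph; plan(graph, goal, feedback) -> action list;
    execute(action) acts on the world; verify(before, after, action)
    -> (ok, feedback) compares pre-/post-execution states.
    """
    feedback = None
    for _ in range(max_steps):
        graph = perceive()                      # goal-centric scene graph
        actions = plan(graph, goal, feedback)   # two-stage, self-reflective
        if not actions:                         # planner judges the goal met
            return True
        for action in actions:
            before = perceive()
            execute(action)
            after = perceive()
            ok, feedback = verify(before, after, action)
            if not ok:                          # failed step: replan with feedback
                break
    return False                                # step budget exhausted
```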
XPeng's Latest! NavigScene: Global Navigation for Beyond-Visual-Range Autonomous Driving VLA (ACM MM '25)
自动驾驶之心· 2025-07-14 11:30
Today we share NavigScene, new work from the University of Central Florida and XPeng Motors accepted at ACM MM 2025: it connects local perception with global navigation to enable beyond-visual-range autonomous driving!

Paper authors | Qucheng Peng et al. Editor | 自动驾驶之心

Foreword & the author's take

Autonomous-driving systems have made remarkable progress in perception, prediction, and planning from local visual information, but they struggle to integrate the broader navigation context that human drivers routinely use. To address this, XPeng's team proposes NavigScene, an auxiliary navigation-guided natural-language dataset that simulates a human-like driving environment within autonomous-driving systems, closing the key gap between local sensor data and global navigation information. The team also develops three complementary methods to exploit NavigScene: (1) navigation-guided reasoning, which enhances vision-language models by incorporating navigation context into the prompt (a prompt-construction sketch follows this summary); (2) navigation-guided preference optimization, which is a ...
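Of the three methods, the first, navigation-guided reasoning, is the simplest to illustrate: fold navigation context into the VLM prompt. A minimal sketch follows; the template wording and the nav_context fields are assumptions for illustration, not NavigScene's actual annotation schema.

```python
def build_navigation_guided_prompt(question: str, nav_context: dict) -> str:
    """Prepend beyond-visual-range navigation context to a driving-scene query.

    The fields (road, next_maneuver, distance_m) are hypothetical; NavigScene
    provides navigation guidance as natural language, which may differ.
    """
    nav_text = (
        f"You are driving on {nav_context['road']}. "
        f"The route requires a {nav_context['next_maneuver']} "
        f"in {nav_context['distance_m']} meters."
    )
    return (
        "Navigation context (beyond current sensor range):\n"
        f"{nav_text}\n\n"
        f"Question about the current scene:\n{question}"
    )

prompt = build_navigation_guided_prompt(
    "Should the ego vehicle change lanes now?",
    {"road": "a three-lane highway", "next_maneuver": "right exit",
     "distance_m": 400},
)
```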
One RL to See Them All? One Reinforcement Learning Framework to Unify Vision-Language Tasks!
机器之心· 2025-05-27 04:11
Core Insights
- The article introduces V-Triune, a unified reinforcement learning system from MiniMax that trains visual-language models (VLMs) on both visual reasoning and perception tasks within a single training process [2][4][5].

Group 1: V-Triune Overview
- V-Triune consists of three complementary components: Sample-Level Data Formatting, Verifier-Level Reward Computation, and Source-Level Metric Monitoring, which work together to handle diverse tasks [3][8].
- The system uses a novel dynamic IoU reward mechanism that provides adaptive feedback for perception tasks, yielding gains on both reasoning and perception benchmarks [3][4].

Group 2: Performance Improvements
- Orsta, the model produced by V-Triune, posted significant gains on the MEGA-Bench Core benchmark, ranging from +2.1 to +14.1 across model variants [4][49].
- Training on diverse datasets covering a wide range of visual reasoning and perception tasks underpins the model's broad capabilities [3][49].

Group 3: Sample-Level Data Formatting
- Because different tasks require distinct reward types and configurations, MiniMax defines rewards at the sample level, allowing dynamic routing and fine-grained weighting during training (a routing sketch follows this summary) [9][13][16].
- This design merges diverse datasets seamlessly into one training process while keeping reward control flexible and scalable [16].

Group 4: Verifier-Level Reward Computation
- MiniMax runs an independent, asynchronous reward server to generate reinforcement-learning signals, improving modularity and scalability (a toy server sketch also follows) [17][19].
- The architecture allows new tasks or updated reward logic to be added without modifying the core training process [20].

Group 5: Source-Level Metric Monitoring
- Source-level monitoring records key performance indicators per data source for each training batch, enabling targeted debugging and insight into how different data sources interact [21][24].
- Continuously tracked, per-source metrics include the dynamic IoU reward, perception-task IoU/mAP, response length, and reflection rate [24][22].

Group 6: Dynamic IoU Reward Strategy
- The dynamic IoU reward adjusts the IoU threshold during training, starting relaxed and progressively tightening it, to balance learning efficiency against final accuracy (a schedule sketch follows this summary) [26][25].
- This approach guides the model's learning smoothly in early training while still demanding high precision in the later stages [26].

Group 7: Training Methodology
- V-Triune supports scalable data, task, verifier, and metric systems, but early experiments showed that naive joint training could become unstable [28][29].
- MiniMax applied targeted adjustments, including freezing ViT parameters to prevent gradient explosion and managing memory during large-scale training [34][35].

Group 8: Experimental Results
- MiniMax ran experiments with Qwen2.5-VL-7B-Instruct and Qwen2.5-VL-32B-Instruct as base models, on a dataset of 20,600 perception samples and 27,100 reasoning samples [46].
- The results indicate that V-Triune significantly improves reasoning and perception performance, particularly in task areas with rich training data [49][55].
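A minimal sketch of what sample-level reward routing could look like: each training sample declares which verifier scores it and with what weight, so heterogeneous tasks can share one RL loop. The schema and verifier names below are assumptions, not MiniMax's actual format.

```python
# Verifier registry: map a name in the sample spec to a scoring function.
VERIFIERS = {
    "exact_match": lambda pred, sample: float(pred.strip() == sample["answer"]),
    "contains":    lambda pred, sample: float(sample["answer"] in pred),
}

def compute_reward(pred: str, sample: dict) -> float:
    """Route a sample to the verifiers its own spec declares, with weights."""
    return sum(
        spec["weight"] * VERIFIERS[spec["verifier"]](pred, sample)
        for spec in sample["rewards"]
    )

# Example: a reasoning sample scored purely by exact answer match.
sample = {"answer": "42",
          "rewards": [{"verifier": "exact_match", "weight": 1.0}]}
print(compute_reward("The answer is 42", sample))  # 0.0 (not an exact match)
print(compute_reward("42", sample))                # 1.0
```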
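The verifier-level design runs reward computation as a separate asynchronous service. A toy stand-in using a thread pool, purely to show the decoupling (the trainer keeps generating rollouts while verifiers score earlier ones), could look like this:

```python
from concurrent.futures import Future, ThreadPoolExecutor

class RewardServer:
    """Toy asynchronous reward service: submit (prediction, sample) pairs
    and collect scores later, so scoring never blocks rollout generation."""

    def __init__(self, verify_fn, workers: int = 8):
        self._verify = verify_fn            # e.g. compute_reward from above
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def submit(self, pred: str, sample: dict) -> Future:
        # Returns immediately; call .result() on the Future for the reward.
        return self._pool.submit(self._verify, pred, sample)
```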
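The dynamic IoU reward can be written as a threshold schedule. The linear ramp and the zero-below-threshold reward shape below are assumptions about the general idea (start lenient, end strict), not the paper's exact rule.

```python
def dynamic_iou_reward(iou: float, step: int, total_steps: int,
                       start_thr: float = 0.5, end_thr: float = 0.95) -> float:
    """Detection reward whose IoU threshold tightens as training progresses.

    Early in training a loose threshold lets imperfect boxes earn reward,
    which smooths learning; late in training only near-perfect boxes score.
    """
    progress = min(step / max(total_steps, 1), 1.0)
    thr = start_thr + (end_thr - start_thr) * progress  # linear schedule
    return iou if iou >= thr else 0.0
```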
Group 9: Conclusion
- Overall, MiniMax's findings suggest that reinforcement learning can effectively enhance visual reasoning and perception capabilities within a unified framework, delivering continuous performance improvements across diverse tasks [55][56].