Vision-Language Models
Midea team share! Finding the key from reasoning to execution across seven works for building a general-purpose dexterous VLA model
具身智能之心· 2025-09-05 00:45
Talk introduction: head of Midea's embodied foundation model ...... 2. Expanding the capability boundary of VLA models 3. Improving the generalization of VLA models
Warm-up materials:
1. DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control.
2. ChatVLA-2: Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge.
3. ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model.
4. Diffusion-VLA: Generalizable and Interpretable Robot Foundation Model via Self-Generated Reason ...
ICCV 2025 Highlight | A new paradigm for 3D ground-truth generation: automated semantic occupancy annotation for open driving scenes!
机器之心· 2025-08-29 00:15
Core Viewpoint
- The article presents AutoOcc, an innovative framework for automatic open-ended 3D semantic occupancy annotation that surpasses existing methods without requiring human labeling, showcasing excellent generalization capabilities [5][11][26]

Summary by Sections
Introduction
- AutoOcc is developed by the VDIG laboratory at Peking University, led by researchers Zhou Xiaoyu and Wang Yongtao, and has been recognized in top conferences and competitions in the computer vision field [2][4]
Problem Statement
- The challenge of generating accurate and complete semantic occupancy annotations from raw sensor data at low cost remains significant in the fields of autonomous driving and embodied intelligence [5][8]
Methodology
- AutoOcc utilizes a vision-language model (VLM) to create semantic attention maps for scene description and dynamically expands the semantic list, while a self-estimating optical flow module identifies and processes dynamic objects in temporal rendering (see the sketch after this summary) [5][11][17]
Key Innovations
- The framework introduces a 3D Gaussian representation (VL-GS) that effectively models complete 3D geometry and semantics in driving scenarios, demonstrating superior representation efficiency, accuracy, and perception capabilities [6][17]
Experimental Results
- Extensive experiments indicate that AutoOcc outperforms existing automated 3D semantic occupancy annotation methods and exhibits remarkable zero-shot generalization across datasets [7][21][22]
Comparison with Existing Methods
- AutoOcc is compared with traditional methods that rely on human labeling and extensive post-processing, highlighting its speed and open-ended semantic annotation capabilities [14][21]
Performance Metrics
- The framework shows significant advantages in robustness and open semantic labeling ability, achieving state-of-the-art performance both on specific semantic categories and across datasets [20][21]
Efficiency Evaluation
- AutoOcc demonstrates a notable reduction in computational cost while improving annotation quality, achieving a balance between efficiency and flexibility without relying on human annotations [24][25]
Conclusion
- The article concludes that AutoOcc represents a significant advancement in automated open semantic 3D occupancy annotation, integrating vision-language model guidance with differentiable 3D Gaussian techniques [26]
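A minimal, hypothetical sketch of how such a VLM-guided, open-vocabulary annotation loop could be organized. All function and class names below are illustrative stubs rather than the authors' AutoOcc code; the VLM call, flow estimator, and Gaussian-scene fusion are replaced with placeholders.

```python
import numpy as np

def vlm_semantic_attention(image, semantic_list):
    """Stand-in for a VLM that returns one semantic attention map per category."""
    h, w, _ = image.shape
    return {name: np.random.rand(h, w) for name in semantic_list}

def estimate_flow(prev_frame, cur_frame):
    """Stand-in for the self-estimated optical flow used to flag dynamic objects."""
    return np.zeros(prev_frame.shape[:2] + (2,))

def annotate_sequence(frames, init_semantics=("road", "car", "vegetation")):
    semantic_list = list(init_semantics)        # open-ended list, expandable on the fly
    labels, prev = [], None
    for frame in frames:
        attn = vlm_semantic_attention(frame, semantic_list)
        if prev is not None:
            _flow = estimate_flow(prev, frame)  # separates dynamic objects over time
        # In AutoOcc the attention maps guide a differentiable 3D Gaussian scene
        # (VL-GS); here a per-pixel argmax stands in for that fusion step.
        stacked = np.stack([attn[c] for c in semantic_list], axis=0)
        labels.append(stacked.argmax(axis=0))
        prev = frame
    return labels

occ = annotate_sequence([np.random.rand(64, 96, 3) for _ in range(3)])
print(len(occ), occ[0].shape)
```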
ExploreVLM: A Closed-Loop Robot Exploration Task-Planning Framework Based on Vision-Language Models
具身智能之心· 2025-08-20 00:03
Research Background and Core Issues
- The development of embodied intelligence has led to the integration of robots into daily life as human assistants, necessitating their ability to interpret high-level instructions, perceive dynamic environments, and adjust plans in real time [3]
- Vision-Language Models (VLMs) have emerged as a significant direction for robot task planning, but existing methods exhibit limitations in three areas: insufficient interactive exploration capabilities, limited perception accuracy, and poor planning adaptability [6]

Proposed Framework
- The ExploreVLM framework is introduced, which integrates perception, planning, and execution verification through a closed-loop design to address the identified limitations [5]

Core Framework Design
- ExploreVLM operates on a "perception-planning-execution-verification" closed loop (a minimal sketch follows this summary), designed to address the three limitations above:
  1. Insufficient interactive exploration capabilities in scenarios requiring active information retrieval [6]
  2. Limited perception accuracy in capturing object spatial relationships and dynamic changes [6]
  3. Poor planning adaptability, primarily relying on open-loop static planning, which can fail in complex environments [6]

Key Module Analysis
1. **Goal-Centric Spatial Relation Graph (Scene Perception)**
  - Constructs a structured graph representation to support complex reasoning, extracting object categories, attributes, and spatial relationships from initial RGB images and task goals [8]
  - A two-stage planner generates sub-goals and action sequences for the exploration and completion phases, refined through self-reflection [8]
  - The execution validator compares pre- and post-execution states to generate feedback and dynamically adjust plans until task completion [8]
2. **Dual-Stage Self-Reflective Planner**
  - Designed to separate the needs of "unknown information exploration" and "goal achievement," employing a self-reflection mechanism to correct plans and address logical errors [10]
  - The exploration phase generates sub-goals for information retrieval, while the completion phase generates action sequences based on exploration results [10]
3. **Execution Validator**
  - Implements a step-by-step validation mechanism so that real-time feedback is integrated into the closed loop, supporting dynamic adjustments [14]

Experimental Validation
1. **Experimental Setup**
  - Conducted on a real robot platform with five tasks of increasing complexity, comparing against the baseline methods ReplanVLM and VILA, with a 50% action failure rate introduced to test robustness [15]
2. **Core Results**
  - ExploreVLM achieved an average success rate of 94%, significantly outperforming ReplanVLM (22%) and VILA (30%) [16]
  - The framework demonstrated effective action validation and logical consistency checks, ensuring task goals were met [17]
3. **Ablation Studies**
  - Performance dropped significantly when core modules were removed, highlighting the importance of the collaborative function of the three modules [19]

Comparison with Related Work
- ExploreVLM addresses the limitations of existing methods through structured perception, dual-stage planning, and step-wise closed-loop verification, enhancing task execution and adaptability [20]
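A minimal, hypothetical sketch of the perceive-plan-execute-verify loop described above. The helper names and the scene-graph structure are illustrative stubs, not the ExploreVLM authors' API; the VLM calls are replaced with canned returns.

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    objects: list = field(default_factory=list)
    relations: list = field(default_factory=list)  # (object, relation, object) triples

def perceive(image, goal):
    """Stand-in for building the goal-centric spatial relation graph from an RGB image."""
    return SceneGraph(objects=["drawer", "cup"], relations=[("cup", "inside", "drawer")])

def plan(graph, goal, feedback=None):
    """Stand-in for the dual-stage self-reflective planner: explore first, then complete."""
    exploration = ["open the drawer to inspect its contents"]
    completion = ["pick up the cup", "place the cup on the table"]
    return exploration + completion

def execute(step):
    print("executing:", step)
    return True  # a real robot action may fail; the validator catches that below

def verify(before, after, step):
    """Compare pre-/post-execution scene graphs; return an error message on failure."""
    return None  # stub: a VLM would check whether the step's expected effect holds

def run(goal, get_image, max_rounds=5):
    feedback = None
    for _ in range(max_rounds):
        steps = plan(perceive(get_image(), goal), goal, feedback)
        for step in steps:
            before = perceive(get_image(), goal)
            execute(step)
            feedback = verify(before, perceive(get_image(), goal), step)
            if feedback is not None:  # failed step: feed the error back and replan
                break
        else:
            return True               # every step verified; task complete
    return False

print(run("put the cup on the table", get_image=lambda: None))
```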
Latest from XPeng! NavigScene: Global Navigation Enables Beyond-Visual-Range Autonomous Driving VLA (ACM MM'25)
自动驾驶之心· 2025-07-14 11:30
Core Insights
- The article discusses the development of NavigScene, a novel dataset aimed at bridging the gap between local perception and global navigation in autonomous driving systems, enhancing their reasoning and planning capabilities [2][12][14]

Group 1: Overview of NavigScene
- NavigScene is designed to integrate local sensor data with global navigation context, addressing the limitations of existing autonomous driving models that rely primarily on immediate visual information [5][9]
- The dataset includes two subsets, NavigScene-nuScenes and NavigScene-NAVSIM, which provide paired data to facilitate comprehensive scene understanding and decision-making [9][14]

Group 2: Methodologies
- Three complementary paradigms are proposed to leverage NavigScene (see the sketch after this summary):
  1. Navigation-guided reasoning (NSFT) enhances visual-language models by incorporating navigation context [10][19]
  2. Navigation-guided preference optimization (NPO) improves generalization in new scenarios through reinforcement learning [24][26]
  3. The navigation-guided visual-language-action (NVLA) model integrates navigation guidance with traditional driving models for better performance [27][28]

Group 3: Experimental Results
- Experiments demonstrate that integrating global navigation knowledge significantly improves the performance of autonomous driving systems in tasks such as perception, prediction, and planning [12][34][39]
- Models trained with NavigScene outperform baseline models across various metrics, including BLEU-4, METEOR, and CIDEr, showcasing enhanced reasoning capabilities [32][34]

Group 4: Practical Implications
- The integration of NavigScene allows autonomous systems to make more informed decisions in complex driving environments, leading to improved safety and reliability [12][42]
- The findings highlight the importance of incorporating beyond-visual-range (BVR) knowledge for effective navigation and planning in autonomous driving applications [8][12]
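A hypothetical sketch of how a navigation-guided sample might be paired and prompted for the supervised fine-tuning paradigm. The field names and prompt layout are illustrative assumptions, not taken from the NavigScene release.

```python
from dataclasses import dataclass

@dataclass
class NavigSample:
    camera_paths: list       # local sensor context (e.g. surround-view camera frames)
    navigation_text: str     # beyond-visual-range guidance, e.g. turn-by-turn directions
    question: str            # driving QA / planning query
    answer: str              # reference reasoning or planning response

def build_prompt(sample: NavigSample) -> str:
    # The navigation text is prepended so the model conditions its local
    # reasoning on global route context.
    return (
        f"Navigation: {sample.navigation_text}\n"
        f"Question: {sample.question}\n"
        "Answer:"
    )

sample = NavigSample(
    camera_paths=["cam_front.jpg"],
    navigation_text="In 200 m, turn right onto the highway on-ramp.",
    question="What should the ego vehicle do at the next intersection?",
    answer="Keep to the right lane and prepare to merge onto the on-ramp.",
)
print(build_prompt(sample))
```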
One RL to See Them All? A Single Reinforcement Learning Framework Unifies Vision-Language Tasks!
机器之心· 2025-05-27 04:11
Core Insights
- The article discusses the introduction of V-Triune, a unified reinforcement learning system by MiniMax that enhances visual-language models (VLMs) for both visual reasoning and perception tasks within a single training process [2][4][5]

Group 1: V-Triune Overview
- V-Triune consists of three complementary components: Sample-Level Data Formatting, Verifier-Level Reward Computation, and Source-Level Metric Monitoring, which work together to handle diverse tasks [3][8]
- The system utilizes a novel dynamic IoU reward mechanism that provides adaptive feedback for perception tasks, leading to performance improvements in both reasoning and perception [3][4]

Group 2: Performance Improvements
- Orsta, the model trained with V-Triune, achieved significant gains on the MEGA-Bench Core benchmark, with improvements ranging from +2.1 to +14.1 across different model variants [4][49]
- Training on diverse datasets covering various visual reasoning and perception tasks contributed to the model's broad capabilities [3][49]

Group 3: Sample-Level Data Formatting
- MiniMax addresses the challenge of different tasks requiring distinct reward types and configurations by defining rewards at the sample level, allowing for dynamic routing and fine-grained weighting during training [9][13][16]
- This design enables seamless integration of diverse datasets into a unified training process while allowing flexible and scalable reward control [16]

Group 4: Verifier-Level Reward Computation
- MiniMax employs an independent, asynchronous reward server to generate reinforcement learning signals, enhancing modularity and scalability [17][19]
- The architecture allows new tasks to be added or reward logic to be updated without modifying the core training process [20]

Group 5: Source-Level Metric Monitoring
- The Source-Level Metric Monitoring strategy records key performance indicators by data source for each training batch, facilitating targeted debugging and insights into the interactions between different data sources [21][24]
- Key monitored metrics include dynamic IoU rewards, perception-task IoU/mAP, response length, and reflection rate, all tracked continuously by data source [24][22]

Group 6: Dynamic IoU Reward Strategy
- The dynamic IoU reward strategy adjusts the IoU threshold during training to balance learning efficiency and final accuracy, starting with a relaxed threshold and progressively tightening it (a sketch of such a schedule follows this summary) [26][25]
- This approach aims to guide the model's learning process smoothly while ensuring high performance in the later stages of training [26]

Group 7: Training Methodology
- V-Triune supports scalable data, task, validator, and metric systems, but early experiments indicated that joint training could lead to instability [28][29]
- To address this, MiniMax implemented targeted adjustments, including freezing ViT parameters to prevent gradient explosion and managing memory during large-scale training [34][35]

Group 8: Experimental Results
- MiniMax conducted experiments using Qwen2.5-VL-7B-Instruct and Qwen2.5-VL-32B-Instruct as base models, with a training set comprising 20,600 perception samples and 27,100 reasoning samples [46]
- The results indicate that V-Triune significantly enhances performance in reasoning and perception tasks, particularly in areas with rich training data [49][55]
Group 9: Conclusion
- Overall, MiniMax's findings suggest that reinforcement learning can effectively enhance visual reasoning and perception capabilities within a unified framework, demonstrating continuous performance improvements across various tasks [55][56]
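A minimal sketch of what a progressively tightening IoU reward schedule combined with sample-level reward routing could look like. The thresholds, linear schedule, and function names are illustrative assumptions, not MiniMax's published configuration.

```python
def dynamic_iou_threshold(step, total_steps, start=0.5, end=0.95):
    """Relaxed IoU threshold early in training, progressively tightened later.
    The 0.5 -> 0.95 range and the linear ramp are assumptions for illustration."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac

def iou(box_a, box_b):
    """Standard intersection-over-union for axis-aligned (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def perception_reward(pred_box, gt_box, step, total_steps):
    """Binary reward gated by the dynamic IoU threshold."""
    return 1.0 if iou(pred_box, gt_box) >= dynamic_iou_threshold(step, total_steps) else 0.0

def reasoning_reward(pred_answer, gt_answer):
    """Exact-match reward, a common choice for verifiable reasoning answers."""
    return 1.0 if pred_answer.strip() == gt_answer.strip() else 0.0

# Sample-level routing: each sample carries its own reward spec, so mixed batches
# of detection and reasoning samples can share a single training loop.
def compute_reward(sample, prediction, step, total_steps):
    if sample["reward_type"] == "perception":
        return perception_reward(prediction, sample["gt_box"], step, total_steps)
    return reasoning_reward(prediction, sample["gt_answer"])

print(dynamic_iou_threshold(step=0, total_steps=1000),
      dynamic_iou_threshold(step=1000, total_steps=1000))
print(compute_reward({"reward_type": "perception", "gt_box": (0, 0, 10, 10)},
                     (1, 1, 10, 10), step=100, total_steps=1000))
```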