Vision-Language-Action Models

VLA+RL or Pure RL? The Development Path of Reinforcement Learning as Seen from 200+ Papers
具身智能之心· 2025-08-18 00:07
Core Insights
- The article provides a comprehensive analysis of the intersection of reinforcement learning (RL) and visual intelligence, focusing on the evolution of strategies and key research themes in visual reinforcement learning [5][17][25].

Group 1: Key Themes in Visual Reinforcement Learning
- The article categorizes over 200 representative studies into four main pillars: multimodal large language models, visual generation, unified model frameworks, and visual-language-action models [5][17].
- Each pillar is examined for algorithm design, reward engineering, and benchmark progress, highlighting trends and open challenges in the field [5][17][25].

Group 2: Reinforcement Learning Techniques
- Various reinforcement learning techniques are discussed, including Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), which are used to enhance stability and efficiency in training (a minimal GRPO sketch follows this summary) [15][16].
- The article emphasizes the importance of reward models, such as those based on human feedback and verifiable rewards, in guiding the training of visual reinforcement learning agents [10][12][21].

Group 3: Applications in Visual and Video Reasoning
- The article outlines applications of reinforcement learning in visual reasoning tasks, including 2D and 3D perception, image reasoning, and video reasoning, showcasing how these methods improve task performance [18][19][20].
- Specific studies are highlighted that utilize reinforcement learning to enhance capabilities in complex visual tasks, such as object detection and spatial reasoning [18][19][20].

Group 4: Evaluation Metrics and Benchmarks
- The article discusses the need for new evaluation metrics tailored to large-model visual reinforcement learning, combining traditional metrics with preference-based assessments [31][35].
- It provides an overview of various benchmarks that support training and evaluation in the visual domain, emphasizing the role of human preference data in shaping reward models [40][41].

Group 5: Future Directions and Challenges
- The article identifies key challenges in visual reinforcement learning, such as balancing depth and efficiency in reasoning processes, and suggests future research directions to address these issues [43][44].
- It highlights the importance of developing adaptive strategies and hierarchical reinforcement learning approaches to improve the performance of visual-language-action agents [43][44].
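To make the GRPO idea mentioned under Group 2 concrete, here is a minimal sketch, assuming a per-prompt group of scalar rewards (the function and variable names are illustrative, not from the survey): GRPO replaces PPO's learned value baseline by standardizing each sampled response's reward against its group's mean and standard deviation.

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Compute group-relative advantages for one prompt.

    group_rewards: shape (G,), a scalar reward for each of G sampled responses
    to the same prompt (e.g. from a visual reward model or a verifiable checker).
    Each response's advantage is its reward standardized within the group,
    so no separate value network is needed (unlike PPO).
    """
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + eps)

# Toy usage: four sampled answers to the same image-question pair.
rewards = torch.tensor([0.0, 1.0, 1.0, 0.5])
print(grpo_advantages(rewards))
```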
Latest Survey on Visual Reinforcement Learning: A Review Across All Areas (NUS, Zhejiang University & CUHK)
自动驾驶之心· 2025-08-16 00:03
Figure 1: Timeline of representative visual reinforcement learning models. The figure chronologically surveys key Visual RL models from 2023 to 2025 and groups them into four domains: Multimodal LLMs, Visual Generation, Unified Models, and Vision-Language-Action (VLA) Models.

In the world of large language models (LLMs), reinforcement learning (RL), and especially reinforcement learning from human feedback (RLHF), is hardly a new term. It is RL that, like a grandmaster with deep inner strength, injected a "soul" into models such as GPT, Qwen, and DeepSeek, making their answers align so closely with human reasoning and values. This RL-led revolution has fundamentally changed how we interact with AI.

Yet just as everyone assumed that reinforcement learning's stage was confined to text, the same wave has swept, with remarkable speed, into a far broader field: computer vision (CV).

Preface: When RLHF sweeps into ...
What Is the VLM Often Mentioned in Autonomous Driving, and How Does It Differ from VLA?
自动驾驶之心· 2025-08-08 16:04
Core Viewpoint
- The article discusses the significance of Vision-Language Models (VLM) in the context of autonomous driving, highlighting their ability to integrate visual perception and natural language processing to enhance vehicle understanding and interaction with complex road environments [4][19].

Summary by Sections

What is VLM?
- VLM stands for Vision-Language Model, which combines the capabilities of understanding images and text within a single AI system. It enables deep comprehension of visual content and natural language interaction, enhancing applications like image retrieval, writing assistance, and robotic navigation [6].

How to Make VLM Work Efficiently?
- VLM processes raw road images into feature representations using visual encoders, such as Convolutional Neural Networks (CNN) and Vision Transformers (ViT). Language encoders and decoders handle natural language input and output, learning semantic relationships between tokens [8].

Key Mechanism of VLM
- The alignment of visual features and language modules is crucial for VLM. Cross-attention mechanisms allow the language decoder to focus on relevant image areas when generating text, ensuring high consistency between generated language and actual scenes (see the sketch after this summary) [9].

Training Process of VLM
- The training process for VLM typically involves pre-training on large datasets followed by fine-tuning with specific datasets related to autonomous driving scenarios, ensuring the model can accurately recognize and respond to traffic signs and conditions [11].

Applications of VLM
- VLM supports various intelligent functions, including real-time scene alerts, interactive semantic Q&A, and recognition of road signs and text. It can generate natural language prompts based on visual inputs, enhancing driver awareness and decision-making [12].

Real-time Operation of VLM
- VLM operates in a "cloud-edge collaboration" architecture, where large-scale pre-training occurs in the cloud and optimized lightweight models are deployed in vehicles for real-time processing. This setup allows for quick responses to safety alerts and complex analyses [14].

Data Annotation and Quality Assurance
- Data annotation is critical for VLM deployment, requiring detailed labeling of images under various conditions. This process ensures high-quality training data, which is essential for the model's performance in real-world scenarios [14].

Safety and Robustness
- Safety and robustness are paramount in autonomous driving. VLM must quickly assess uncertainties and implement fallback measures when recognition errors occur, ensuring reliable operation under adverse conditions [15].

Differences Between VLA and VLM
- VLA (Vision-Language-Action) extends VLM by integrating action decision-making capabilities. While VLM focuses on understanding and expressing visual information, VLA encompasses perception, cognition, and execution, making it essential for real-world applications like autonomous driving [18].

Future Developments
- The continuous evolution of large language models (LLM) and large vision models (LVM) will enhance VLM's capabilities in multi-modal integration, knowledge updates, and human-machine collaboration, leading to safer and more comfortable autonomous driving experiences [16][19].
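To illustrate the cross-attention alignment described under "Key Mechanism of VLM", the sketch below is a minimal, assumed implementation (module names, dimensions, and the residual layout are illustrative choices, not the architecture of any particular production VLM) in which text tokens query image patch features so that generated language can attend to the relevant image regions.

```python
import torch
import torch.nn as nn

class VisionTextCrossAttention(nn.Module):
    """Text tokens attend over image patch features (simplified sketch)."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens: torch.Tensor, image_patches: torch.Tensor) -> torch.Tensor:
        # text_tokens:   (B, T, dim) from the language encoder
        # image_patches: (B, P, dim) from a CNN/ViT visual encoder
        fused, _ = self.attn(query=text_tokens, key=image_patches, value=image_patches)
        return self.norm(text_tokens + fused)  # residual connection keeps the text stream intact

# Toy usage: one image split into 196 patches, a 12-token instruction.
layer = VisionTextCrossAttention()
out = layer(torch.randn(1, 12, 256), torch.randn(1, 196, 256))
print(out.shape)  # torch.Size([1, 12, 256])
```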
Mimicking the Brain's Functional Specialization: Fast-in-Slow VLA Unifies "Fast Action" and "Slow Reasoning"
具身智能之心· 2025-07-13 09:48
Core Viewpoint
- The article introduces the Fast-in-Slow (FiS-VLA) model, a novel dual-system vision-language-action model that integrates high-frequency response and complex reasoning in robotic control, showing significant gains in control frequency and task success rates [5][29].

Group 1: Model Overview
- FiS-VLA combines a fast execution module with a pre-trained vision-language model (VLM), achieving a control frequency of up to 117.7 Hz, significantly higher than existing mainstream solutions [5][25].
- The model employs a dual-system architecture inspired by Kahneman's dual-system theory, where System 1 handles rapid, intuitive decision-making and System 2 handles slower, deeper reasoning (a scheduling sketch of this split follows this summary) [9][14].

Group 2: Architecture and Design
- The architecture of FiS-VLA includes a visual encoder, a lightweight 3D tokenizer, and a large language model (LLaMA2-7B), with the last few transformer layers repurposed as the execution module [13].
- The model uses heterogeneous input modalities: System 2 processes 2D images and language instructions, while System 1 requires real-time sensory inputs, including 2D images and 3D point cloud data [15].

Group 3: Performance and Testing
- In simulation tests, FiS-VLA achieved an average success rate of 69% across various tasks, outperforming models such as CogACT and π0 [18].
- Real-world testing on robotic platforms showed success rates of 68% and 74% on different tasks, demonstrating superior performance in high-precision control scenarios [20].
- The model exhibited robust generalization, with a smaller accuracy decline than baseline models when faced with unseen objects and varying environmental conditions [23].

Group 4: Training and Optimization
- FiS-VLA employs a dual-system collaborative training strategy, enhancing System 1's action generation through diffusion modeling while retaining System 2's reasoning capabilities [16].
- Ablation studies indicated that System 1 performs best when sharing two transformer layers, and that the best operating-frequency ratio between the two systems is 1:4 [25].

Group 5: Future Prospects
- The authors suggest that future enhancements could include dynamic adjustment of the shared structure and of the collaborative frequency strategy, further improving the model's adaptability and robustness in practical applications [29].
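The 1:4 fast/slow cooperation reported above can be pictured with a simple scheduling sketch. This is a hypothetical illustration: the placeholder policies and environment hooks below are assumptions, and the real FiS-VLA shares transformer layers between the two systems rather than calling two separate functions.

```python
import numpy as np

def slow_reasoning(image_2d, instruction):
    """Placeholder for System 2: VLM-style deliberate reasoning (low frequency)."""
    return np.random.randn(64)  # a latent "plan" / conditioning vector

def fast_execution(latent_plan, obs_2d, obs_pointcloud):
    """Placeholder for System 1: lightweight action head (high frequency)."""
    return np.tanh(latent_plan[:7])  # e.g. a 7-DoF action

def get_observation(t):
    """Assumed environment hook returning 2D image, point cloud, and instruction."""
    return np.zeros((224, 224, 3)), np.zeros((1024, 3)), "pick up the red block"

def apply_action(action):
    """Assumed robot interface; a no-op here."""
    pass

def control_loop(steps: int = 20, slow_every: int = 4):
    """Run System 2 once per `slow_every` System 1 steps (1:4 is the ratio the paper reports as best)."""
    latent_plan = None
    for t in range(steps):
        obs_2d, obs_pc, instruction = get_observation(t)
        if t % slow_every == 0:
            latent_plan = slow_reasoning(obs_2d, instruction)  # refresh the plan at low rate
        action = fast_execution(latent_plan, obs_2d, obs_pc)   # emit an action every step
        apply_action(action)

control_loop()
```

The point of the split is that the expensive VLM-style forward pass runs at roughly a quarter of the control rate, while the lightweight execution head keeps the high-frequency loop responsive.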
A First: World Model and Action Model Fused in the Fully Autoregressive WorldVLA
机器之心· 2025-07-03 08:01
Core Viewpoint
- Alibaba's Damo Academy has introduced WorldVLA, a model that integrates a world model and an action model into a unified autoregressive framework, enhancing understanding and generation across text, images, and actions [1][4].

Summary by Sections

Research Overview
- The development of Vision-Language-Action (VLA) models has become a significant focus in robotic action modeling; such models are typically built on large-scale pretrained multimodal language models (MLLMs) with added action output capabilities [4].
- Existing VLA models often lack a deep understanding of actions, treating them merely as output rather than analyzing them as input [5].

Model Description
- WorldVLA addresses the limitations of both VLA models and world models by using a unified autoregressive mechanism for action and image understanding and generation [5][10].
- It employs three independent encoders for processing images, text, and action data, sharing the same vocabulary to facilitate cross-modal tasks [12].

Mechanism and Strategy
- The world-model component generates visual representations based on input actions, learning the physical dynamics of the environment, while the action-model component enhances visual understanding [7].
- An action attention masking strategy is introduced to mitigate error accumulation during the generation of multiple actions, significantly improving performance in action chunking tasks (a sketch of such a mask follows this summary) [8][14].

Experimental Results
- On the LIBERO benchmark, WorldVLA achieved a 4% improvement in grasp success rate compared to traditional action models and a 10% reduction in Fréchet Video Distance (FVD) compared to traditional world models [8].
- The attention mask strategy improved grasp success rates by 4% to 23% in action chunking tasks [8].

Comparative Analysis
- WorldVLA outperformed other models across various metrics, demonstrating its effectiveness in integrating action and world modeling [18].
- The model's ability to generate the next frame based on actions and images showcases its advanced capabilities in visual prediction [24].
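As a rough illustration of the action attention masking idea, the sketch below builds a causal mask in which an action token may attend to image and text tokens but not to earlier action tokens, which is one way to keep errors in previously generated actions from propagating. The exact mask layout used by WorldVLA is not specified here, so treat this as an assumed variant rather than the paper's implementation.

```python
import torch

def action_chunk_attention_mask(modalities: list) -> torch.Tensor:
    """Build a (L, L) boolean attention mask (True = attention allowed).

    modalities: per-token tags for one sequence, e.g. ["img", "img", "txt", "act", "act", "act"].
    Starts from a standard causal mask, then blocks attention from each action
    token to earlier action tokens, so generated actions condition on vision
    and text but not on previously generated actions.
    """
    L = len(modalities)
    mask = torch.tril(torch.ones(L, L, dtype=torch.bool))  # causal baseline
    for i, tag_i in enumerate(modalities):
        if tag_i != "act":
            continue
        for j in range(i):
            if modalities[j] == "act":
                mask[i, j] = False  # action token i may not see earlier action token j
    return mask

tags = ["img", "img", "txt", "act", "act", "act"]
print(action_chunk_attention_mask(tags).int())
```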
What Is the VLA Often Mentioned in Autonomous Driving?
自动驾驶之心· 2025-06-18 13:37
Core Viewpoint
- The article discusses the Vision-Language-Action (VLA) model, which integrates visual perception, language understanding, and action decision-making into a unified framework for autonomous driving, enhancing system generalization and adaptability [2][4][12].

Summary by Sections

Introduction to VLA
- VLA stands for Vision-Language-Action, aiming to unify the processes of environmental observation and control command output in autonomous driving [2].
- The model represents a shift from traditional modular approaches to an end-to-end system driven by large-scale data [2][4].

Technical Framework of VLA
- The VLA model consists of four key components (a minimal end-to-end sketch follows this summary):
  1. Visual Encoder: extracts features from images and point cloud data [8].
  2. Language Encoder: utilizes pre-trained language models to understand navigation instructions and traffic rules [11].
  3. Cross-Modal Fusion Layer: aligns and integrates visual and language features for unified environmental understanding [11].
  4. Action Decoder: generates control commands based on the fused multi-modal representation [8][11].

Advantages of VLA
- VLA enhances scene generalization and contextual reasoning, allowing for quicker and more reasonable decision-making in complex scenarios [12].
- The integration of language understanding allows for more flexible driving strategies and improved human-vehicle interaction [12].

Industry Applications
- Various companies, including DeepMind and Yuanrong Qixing, are applying VLA concepts in their autonomous driving research, showcasing its potential in real-world applications [13].
- The RT-2 model by DeepMind and the "end-to-end 2.0 version" by Yuanrong Qixing highlight the advancements in intelligent driving systems [13].

Challenges and Future Directions
- Despite its advantages, VLA faces challenges such as lack of interpretability, high data quality requirements, and significant computational resource demands [13][15].
- Solutions being explored include integrating interpretability modules, optimizing trajectory generation, and combining VLA with traditional control methods to enhance safety and robustness [15][16].
- The future of VLA in autonomous driving looks promising, with expectations of becoming a foundational technology as advancements in large models and edge computing continue [16].
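Putting the four listed components together, a minimal end-to-end sketch might look like the following. All module choices, dimensions, and the three-dimensional control output (e.g. steering, throttle, brake) are assumptions for illustration, not the design of any specific production system.

```python
import torch
import torch.nn as nn

class TinyVLA(nn.Module):
    """Toy VLA: visual encoder + language encoder + cross-modal fusion + action decoder."""

    def __init__(self, dim: int = 128, vocab: int = 1000, action_dim: int = 3):
        super().__init__()
        self.visual_encoder = nn.Sequential(          # stands in for a CNN/ViT backbone
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(), nn.Linear(16 * 16, dim))
        self.language_encoder = nn.EmbeddingBag(vocab, dim)  # stands in for a pretrained LM
        self.fusion = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.action_decoder = nn.Linear(dim, action_dim)     # e.g. steering, throttle, brake

    def forward(self, image: torch.Tensor, instruction_ids: torch.Tensor) -> torch.Tensor:
        v = self.visual_encoder(image)               # (B, dim) visual features
        l = self.language_encoder(instruction_ids)   # (B, dim) instruction features
        fused = self.fusion(torch.cat([v, l], dim=-1))
        return self.action_decoder(fused)            # (B, action_dim) control command

# Toy usage: two 64x64 camera frames with 8-token instructions.
model = TinyVLA()
cmd = model(torch.randn(2, 3, 64, 64), torch.randint(0, 1000, (2, 8)))
print(cmd.shape)  # torch.Size([2, 3])
```

In practice the toy encoders above would be replaced by a pretrained ViT backbone and a large language model, and the concatenation-based fusion by cross-attention, but the encode-fuse-decode flow matches the four-component framework described in the summary.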