Vision-Language-Action Models
Just in: Figure 03 makes a stunning debut, with 100,000 units to be built in four years, leaving human nannies facing mass unemployment
36Kr· 2025-10-10 10:50
The dawn of general-purpose robots is here. Today, Figure 03 was officially unveiled, purpose-built for the Helix "brain", its cold metal body now covered in fabric. More notably, the 03's palm carries a camera, and its fingertips can sense a force of just 3 grams. Figure 03 has arrived. A year on, Figure has finally delivered its next-generation humanoid, Figure 03, formally opening the era of general-purpose robots at scale. The humanoid is designed for Helix, for home use, and for large-scale deployment worldwide. In terms of exterior design, Figure 03 is a major upgrade; most notably, the whole machine is wrapped in a soft fabric outer layer in place of a hard mechanical shell. Even each palm integrates a wide-angle camera. It can not only perform human-like tasks but also learn directly from interaction with people, showing unprecedented intelligence and adaptability. From watering plants and serving tea to tidying up and playing with children, it can handle all kinds of everyday chores. More striking still, each fingertip of Figure 03 can sense a force of 3 grams; it can even detect the weight of a paper clip resting on a fingertip. Today, Figure 03 also appeared on the cover of TIME. CEO Brett Adcock said, "In the future, every household will own a humanoid robot." This generation of Figure 03 brings four main highlights: Helix: equipped with a newly designed sensor suite and hand system, purpose-built to unlock Hel ...
Robots teach themselves new skills by "watching videos": NovaFlow extracts actionable flow from generated videos for zero-shot manipulation
机器之心· 2025-10-09 02:24
The co-first authors of this paper are Hongyu Li (PhD student at Brown University) and Lingfeng Sun (researcher at the Robotics and AI Institute, PhD from UC Berkeley). Corresponding author Jiahui Fu is a researcher at the Robotics and AI Institute with a PhD from MIT. George Konidaris is an associate professor at Brown University. Building general-purpose robots that can perform diverse tasks in new environments without any task-specific training has long been a holy grail of robotics. In recent years, with the rapid progress of large language models (LLMs) and vision-language models (VLMs), many researchers have pinned their hopes on vision-language-action (VLA) models, expecting them to replicate the generalization successes of LLMs and VLMs. Reality, however, has not lived up to the ideal. The end-to-end training paradigm of VLA models demands massive amounts of robot-specific "vision-language-action" data. Unlike the web-scale data readily available to LLMs and VLMs, robot data is extremely costly and difficult to collect, creating a severe "data bottleneck". Could this bottleneck be bypassed, so that robots can learn new skills without relying on expensive "first-hand" data? Recently, a team from Brown University and the Robotics and AI ...
Yuanrong Qixing launches an all-new assisted driving platform
Shen Zhen Shang Bao· 2025-08-27 07:05
Core Viewpoint
- Yuanrong Qixing launched its next-generation assisted driving platform, DeepRoute IO 2.0, which features a self-developed VLA (Vision-Language-Action) model that significantly improves safety and comfort compared to traditional end-to-end models [1]

Group 1: Technology and Innovation
- The VLA model integrates visual perception, semantic understanding, and action decision-making, making it more adept at handling complex road conditions [1]
- DeepRoute IO 2.0 is designed around a "multi-modal + multi-chip + multi-vehicle" adaptation concept, supporting both LiDAR and pure-vision versions for customized deployment across mainstream passenger-car platforms (a hypothetical configuration sketch follows below) [1]
- The VLA model addresses the "black box" issue of traditional models by linking and analyzing information to infer causal relationships, and it is natively coupled to a vast knowledge base, enhancing its generalization ability in dynamic real-world environments [1]

Group 2: Commercialization and Partnerships
- Yuanrong Qixing has established a solid foundation for mass production and commercialization, securing design-win partnerships covering more than 10 vehicle models [1]
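To make the "multi-modal + multi-chip + multi-vehicle" adaptation idea concrete, here is a hypothetical configuration sketch in Python. Every name in it (PlatformConfig, the sensor and chip labels, the vehicle keys) is invented for illustration and is not DeepRoute's actual API; it only shows how one model family might be parameterized by sensor suite, target chip, and vehicle platform.

```python
# Hypothetical deployment-variant selection; all identifiers are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class PlatformConfig:
    sensor_suite: str      # "lidar_plus_vision" or "vision_only"
    soc: str               # target compute chip, e.g. "chip_a" / "chip_b" (placeholders)
    vehicle_platform: str  # which passenger-car platform this build targets

CONFIGS = {
    "premium_sedan": PlatformConfig("lidar_plus_vision", "chip_a", "sedan_x"),
    "entry_suv":     PlatformConfig("vision_only",       "chip_b", "suv_y"),
}

def select_config(vehicle: str) -> PlatformConfig:
    """Pick the deployment variant for a given vehicle program."""
    return CONFIGS[vehicle]

print(select_config("entry_suv"))
```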
VLA+RL or pure reinforcement learning? The development path of RL seen through more than 200 papers
具身智能之心· 2025-08-18 00:07
Core Insights
- The article provides a comprehensive analysis of the intersection of reinforcement learning (RL) and visual intelligence, focusing on the evolution of strategies and key research themes in visual reinforcement learning [5][17][25]

Group 1: Key Themes in Visual Reinforcement Learning
- The article categorizes over 200 representative studies into four main pillars: multimodal large language models, visual generation, unified model frameworks, and vision-language-action models [5][17]
- Each pillar is examined for algorithm design, reward engineering, and benchmark progress, highlighting trends and open challenges in the field [5][17][25]

Group 2: Reinforcement Learning Techniques
- Various reinforcement learning techniques are discussed, including Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), which are used to enhance stability and efficiency in training (a minimal sketch of both objectives follows below) [15][16]
- The article emphasizes the importance of reward models, such as those based on human feedback and verifiable rewards, in guiding the training of visual reinforcement learning agents [10][12][21]

Group 3: Applications in Visual and Video Reasoning
- The article outlines applications of reinforcement learning in visual reasoning tasks, including 2D and 3D perception, image reasoning, and video reasoning, showing how these methods improve task performance [18][19][20]
- Specific studies are highlighted that use reinforcement learning to strengthen performance on complex visual tasks such as object detection and spatial reasoning [18][19][20]

Group 4: Evaluation Metrics and Benchmarks
- The article discusses the need for new evaluation metrics tailored to large-model visual reinforcement learning, combining traditional metrics with preference-based assessments [31][35]
- It provides an overview of benchmarks that support training and evaluation in the visual domain, emphasizing the role of human preference data in shaping reward models [40][41]

Group 5: Future Directions and Challenges
- The article identifies key challenges in visual reinforcement learning, such as balancing depth and efficiency in reasoning, and suggests future research directions to address them [43][44]
- It highlights the importance of developing adaptive strategies and hierarchical reinforcement learning approaches to improve the performance of vision-language-action agents [43][44]
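Since the summary leans on PPO and GRPO, a minimal sketch of the two objectives may help. This is my own toy illustration, not code from any surveyed paper: `ppo_clipped_loss` implements the standard clipped surrogate objective, and `grpo_advantages` shows the group-relative normalization that lets GRPO drop a learned value critic.

```python
# Toy PPO / GRPO objectives with plain PyTorch tensors (illustrative only).
import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate objective (returned as a loss to minimize)."""
    ratio = torch.exp(logp_new - logp_old)                    # pi_new / pi_old per action
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

def grpo_advantages(group_rewards, eps=1e-6):
    """GRPO-style advantages: normalize each reward against its own group
    of sampled responses, removing the need for a value critic."""
    mean = group_rewards.mean(dim=-1, keepdim=True)
    std = group_rewards.std(dim=-1, keepdim=True)
    return (group_rewards - mean) / (std + eps)

# Toy usage: 2 prompts, 4 sampled responses each, scalar rewards per response.
rewards = torch.tensor([[1.0, 0.0, 0.5, 0.0],
                        [0.2, 0.9, 0.1, 0.8]])
adv = grpo_advantages(rewards)                                 # shape (2, 4)
logp_old = torch.randn(2, 4)
logp_new = logp_old + 0.05 * torch.randn(2, 4)
print(adv, ppo_clipped_loss(logp_new, logp_old, adv).item())
```

In practice these objectives are usually paired with a KL regularizer against a reference policy and applied token-wise over sampled responses.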
The latest survey of visual reinforcement learning: a field-wide overview (NUS, Zhejiang University & CUHK)
自动驾驶之心· 2025-08-16 00:03
Core Insights
- The article discusses the integration of reinforcement learning with computer vision, marking a paradigm shift in how AI interacts with visual data [3][4]
- It highlights the potential for AI not only to understand but also to create and optimize visual content based on human preferences, transforming AI from a passive observer into an active decision-maker [4]

Research Background and Overview
- The emergence of visual reinforcement learning (VRL) is driven by the successful application of reinforcement learning in large language models (LLMs) [7]
- The article identifies three core challenges in the field: stable policy optimization under complex reward signals, efficient processing of high-dimensional visual inputs, and scalable reward-function design for long-horizon decision-making [7][8]

Theoretical Foundations of Visual Reinforcement Learning
- The theoretical framework formalizes the problem as a Markov Decision Process (MDP), unifying RL for text and visual generation [15]
- Three main alignment paradigms are proposed: RL with human feedback (RLHF), Direct Preference Optimization (DPO), and Reinforcement Learning with Verifiable Rewards (RLVR); a minimal DPO sketch follows below [16][18]

Core Applications of Visual Reinforcement Learning
- The article categorizes VRL research into four main areas: multimodal large language models (MLLM), visual generation, unified models, and vision-language-action (VLA) models [31]
- Each area is further divided into specific tasks, with representative works analyzed for their contributions [31][32]

Evaluation Metrics and Benchmarking
- A layered evaluation framework is proposed, detailing specific benchmarks for each area to ensure reproducibility and comparability in VRL research [44][48]
- The article emphasizes the need for effective metrics that align with human perception and can validate the performance of VRL systems [61]

Future Directions and Challenges
- The article outlines four key challenges for the future of VRL: balancing depth and efficiency in reasoning, handling long-horizon RL in VLA tasks, designing reward models for visual generation, and improving data efficiency and generalization [50][52][54]
- It suggests that future research focus on integrating model-based planning, self-supervised visual pre-training, and adaptive curriculum learning to broaden the practical applications of VRL [57]
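Of the three alignment paradigms named above, DPO is the easiest to show in a few lines. The sketch below is my own minimal illustration rather than the survey's code: it computes the standard DPO loss from per-response log-probabilities under the policy and a frozen reference model, with `beta` controlling how sharply preferences are enforced.

```python
# Minimal DPO loss sketch (illustrative; tensors stand in for summed log-probs).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization: push the policy to prefer the chosen
    response more strongly than the reference model does."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    logits = beta * (policy_margin - ref_margin)
    return -F.logsigmoid(logits).mean()

# Toy usage with two preference pairs.
pc = torch.tensor([-12.0, -9.5])   # policy log p(chosen)
pr = torch.tensor([-13.5, -9.0])   # policy log p(rejected)
rc = torch.tensor([-12.5, -9.8])   # reference log p(chosen)
rr = torch.tensor([-13.0, -9.1])   # reference log p(rejected)
print(dpo_loss(pc, pr, rc, rr).item())
```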
What is the VLM so often mentioned in autonomous driving, and how does it differ from VLA?
自动驾驶之心· 2025-08-08 16:04
Core Viewpoint
- The article discusses the significance of Vision-Language Models (VLM) in autonomous driving, highlighting their ability to integrate visual perception and natural language processing so the vehicle can better understand and interact with complex road environments [4][19]

Summary by Sections

What is VLM?
- VLM stands for Vision-Language Model, which combines image understanding and text understanding in a single AI system. It enables deep comprehension of visual content and natural-language interaction, supporting applications such as image retrieval, writing assistance, and robot navigation [6]

How to Make VLM Work Efficiently?
- VLM turns raw road images into feature representations using visual encoders such as convolutional neural networks (CNN) and Vision Transformers (ViT). Language encoders and decoders handle natural-language input and output, learning semantic relationships between tokens [8]

Key Mechanism of VLM
- Aligning visual features with the language module is crucial. Cross-attention lets the language decoder focus on the relevant image regions while generating text, keeping the generated language highly consistent with the actual scene (a toy cross-attention sketch follows below) [9]

Training Process of VLM
- Training typically involves pre-training on large datasets followed by fine-tuning on datasets specific to autonomous-driving scenarios, so the model can accurately recognize and respond to traffic signs and road conditions [11]

Applications of VLM
- VLM supports intelligent functions such as real-time scene alerts, interactive semantic Q&A, and recognition of road signs and text. It can generate natural-language prompts from visual inputs, improving driver awareness and decision-making [12]

Real-time Operation of VLM
- VLM runs in a "cloud-edge collaboration" architecture: large-scale pre-training happens in the cloud, while optimized lightweight models are deployed in the vehicle for real-time processing, allowing quick safety alerts alongside more complex analyses [14]

Data Annotation and Quality Assurance
- Data annotation is critical for VLM deployment, requiring detailed labeling of images under varied conditions. This process ensures high-quality training data, which is essential for real-world performance [14]

Safety and Robustness
- Safety and robustness are paramount in autonomous driving. VLM must quickly assess uncertainty and fall back to safe behavior when recognition errors occur, ensuring reliable operation under adverse conditions [15]

Differences Between VLA and VLM
- VLA (Vision-Language-Action) extends VLM with action decision-making. While VLM focuses on understanding and describing visual information, VLA spans perception, cognition, and execution, which is essential for real-world applications such as autonomous driving [18]

Future Developments
- The continued evolution of large language models (LLM) and large vision models (LVM) will strengthen VLM's multi-modal integration, knowledge updating, and human-machine collaboration, leading to safer and more comfortable autonomous driving [16][19]
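The cross-attention alignment described under "Key Mechanism of VLM" can be illustrated with a toy PyTorch snippet. This is an assumption-laden sketch rather than any production VLM: text tokens are the queries, ViT-style patch features are the keys and values, and the returned attention weights indicate which image regions each generated token relied on.

```python
# Toy cross-attention between decoder text tokens and image patch features.
import torch
import torch.nn as nn

d_model = 256
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

text_tokens = torch.randn(1, 12, d_model)     # 12 decoder tokens (queries)
image_patches = torch.randn(1, 196, d_model)  # 14x14 ViT patch features (keys/values)

fused, attn_weights = cross_attn(query=text_tokens,
                                 key=image_patches,
                                 value=image_patches)
# attn_weights[0, i] shows which image patches token i attended to, which is
# also a convenient hook for visualizing how generated text is grounded.
print(fused.shape, attn_weights.shape)        # (1, 12, 256), (1, 12, 196)
```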
Mimicking the brain's functional specialization: Fast-in-Slow VLA makes "fast action" and "slow reasoning" work as one
具身智能之心· 2025-07-13 09:48
Core Viewpoint
- The article introduces the Fast-in-Slow (FiS-VLA) model, a novel dual-system vision-language-action model that unifies high-frequency response and complex reasoning in robotic control, with significant gains in control frequency and task success rate [5][29]

Group 1: Model Overview
- FiS-VLA combines a fast execution module with a pre-trained vision-language model (VLM), reaching a control frequency of up to 117.7 Hz, significantly higher than existing mainstream solutions [5][25]
- The design follows Kahneman's dual-system theory: System 1 handles rapid, intuitive decision-making, while System 2 performs slower, deeper reasoning (a schematic control loop follows below) [9][14]

Group 2: Architecture and Design
- The architecture includes a visual encoder, a lightweight 3D tokenizer, and a large language model (LLaMA2-7B), with the last few transformer layers repurposed as the execution module [13]
- The two systems consume heterogeneous inputs: System 2 processes 2D images and language instructions, while System 1 requires real-time sensing, including 2D images and 3D point-cloud data [15]

Group 3: Performance and Testing
- In simulation, FiS-VLA achieved an average success rate of 69% across tasks, outperforming models such as CogACT and π0 [18]
- Real-world testing on robotic platforms showed success rates of 68% and 74% on different task suites, with superior performance in high-precision control scenarios [20]
- The model generalized robustly, with a smaller accuracy drop than baseline models when faced with unseen objects and varying environmental conditions [23]

Group 4: Training and Optimization
- FiS-VLA uses a dual-system collaborative training strategy, enhancing System 1's action generation through diffusion modeling while retaining System 2's reasoning capabilities [16]
- Ablations indicate that System 1 performs best when sharing two transformer layers, and that the optimal operating-frequency ratio between the two systems is 1:4 [25]

Group 5: Future Prospects
- The authors suggest that dynamically adjusting the shared structure and the collaboration-frequency strategy could further improve the model's adaptability and robustness in practical applications [29]
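A schematic of the dual-frequency idea, written as plain Python, may make the System 1 / System 2 split easier to picture. This is a deliberate simplification and not the released FiS-VLA code: `slow_reasoning` stands in for the VLM pass, `fast_policy` for the execution module, the 1:4 ratio mirrors the ablation result quoted above, and `FAST_HZ` plus the 7-DoF action are placeholder assumptions.

```python
# Schematic dual-system control loop: slow reasoning refreshes an "intent"
# every few ticks, while the fast policy emits an action on every tick.
import time

SLOW_EVERY = 4          # 1:4 frequency ratio reported in the ablation
FAST_HZ = 100           # placeholder control rate, not the paper's exact number

def slow_reasoning(image, instruction):
    """Stand-in for the System-2 VLM pass: returns a latent conditioning signal."""
    return {"intent": f"plan for '{instruction}'", "stamp": time.time()}

def fast_policy(observation, intent):
    """Stand-in for the System-1 executor: maps obs + latest intent to an action."""
    return [0.0] * 7     # e.g. a 7-DoF arm command (placeholder)

def control_loop(instruction, steps=12):
    intent = None
    for t in range(steps):
        obs = {"image": None, "pointcloud": None}      # real sensor data goes here
        if t % SLOW_EVERY == 0:                        # slow path, low frequency
            intent = slow_reasoning(obs["image"], instruction)
        action = fast_policy(obs, intent)              # fast path, every tick
        # send `action` to the robot, then wait for the next control tick
        time.sleep(1.0 / FAST_HZ)
    return action

control_loop("place the cup on the shelf")
```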
A first: world model and action model fused into one, the fully autoregressive WorldVLA is here
机器之心· 2025-07-03 08:01
Core Viewpoint
- Alibaba's Damo Academy has introduced WorldVLA, a model that integrates a world model and an action model into a unified autoregressive framework, enhancing understanding and generation across text, images, and actions [1][4]

Research Overview
- Vision-language-action (VLA) models have become a significant focus in robot action modeling, typically built on large-scale pretrained multimodal language models (MLLMs) with added action output capabilities [4]
- Existing VLA models often lack a deep understanding of actions, treating them merely as outputs rather than analyzing them as inputs [5]

Model Description
- WorldVLA addresses the limitations of both VLA models and world models through a unified autoregressive mechanism for understanding and generating actions and images [5][10]
- It employs three independent encoders for images, text, and action data that share the same vocabulary, facilitating cross-modal tasks [12]

Mechanism and Strategy
- The world-model component generates visual representations from input actions, learning the physical dynamics of the environment, while the action-model component enhances visual understanding [7]
- An action attention-masking strategy mitigates error accumulation when generating multiple actions, significantly improving action-chunking performance (a rough sketch of such a mask follows below) [8][14]

Experimental Results
- On the LIBERO benchmark, WorldVLA achieved a 4% improvement in grasp success rate over traditional action models and a 10% reduction in Fréchet Video Distance (FVD) compared to traditional world models [8]
- The attention-mask strategy improved grasp success rates by 4% to 23% in action-chunking tasks [8]

Comparative Analysis
- WorldVLA outperformed other models across several metrics, demonstrating the effectiveness of integrating action and world modeling [18]
- Its ability to generate the next frame conditioned on actions and images showcases its advanced visual prediction capability [24]
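The action attention-masking idea can be sketched as a small mask-construction routine. The token layout and exact masking rule below are my assumptions in the spirit of the description above, not WorldVLA's implementation: starting from a causal mask, each action token is prevented from attending to earlier action tokens, so an error in one predicted action is less likely to contaminate the next.

```python
# Build a causal attention mask that additionally hides earlier action tokens
# from later action tokens (illustrative layout, not the paper's exact scheme).
import torch

def build_mask(token_types):
    """token_types: list of 'img' / 'txt' / 'act' in sequence order.
    Returns a boolean mask where True means attention is allowed."""
    n = len(token_types)
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))     # standard causal mask
    for i, ti in enumerate(token_types):
        if ti != "act":
            continue
        for j in range(i):                                     # earlier positions only
            if token_types[j] == "act":
                mask[i, j] = False                             # hide previous actions
    return mask

tokens = ["img"] * 3 + ["txt"] * 2 + ["act"] * 3
print(build_mask(tokens).int())
```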
What is the VLA so often mentioned in autonomous driving?
自动驾驶之心· 2025-06-18 13:37
Core Viewpoint
- The article discusses the Vision-Language-Action (VLA) model, which integrates visual perception, language understanding, and action decision-making into a unified framework for autonomous driving, enhancing system generalization and adaptability [2][4][12]

Summary by Sections

Introduction to VLA
- VLA stands for Vision-Language-Action and aims to unify environmental observation and control-command output in autonomous driving [2]
- The model represents a shift from traditional modular pipelines to an end-to-end system driven by large-scale data [2][4]

Technical Framework of VLA
- The VLA model consists of four key components (a bare-bones skeleton follows below):
  1. Visual Encoder: extracts features from images and point-cloud data [8]
  2. Language Encoder: uses pre-trained language models to understand navigation instructions and traffic rules [11]
  3. Cross-Modal Fusion Layer: aligns and fuses visual and language features into a unified understanding of the environment [11]
  4. Action Decoder: generates control commands from the fused multi-modal representation [8][11]

Advantages of VLA
- VLA enhances scene generalization and contextual reasoning, allowing quicker and more reasonable decisions in complex scenarios [12]
- The integration of language understanding allows more flexible driving strategies and improved human-vehicle interaction [12]

Industry Applications
- Companies including DeepMind and Yuanrong Qixing are applying VLA concepts in their autonomous-driving research, showcasing its potential in real-world applications [13]
- DeepMind's RT-2 model and Yuanrong Qixing's "end-to-end 2.0 version" highlight the advances in intelligent driving systems [13]

Challenges and Future Directions
- Despite its advantages, VLA faces challenges such as limited interpretability, high data-quality requirements, and significant computational resource demands [13][15]
- Solutions being explored include integrating interpretability modules, optimizing trajectory generation, and combining VLA with traditional control methods to enhance safety and robustness [15][16]
- With continued advances in large models and edge computing, VLA is expected to become a foundational technology for autonomous driving [16]
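To ground the four components listed under "Technical Framework of VLA", here is a bare-bones PyTorch skeleton. All module choices and dimensions are illustrative assumptions (a tiny CNN instead of a real backbone, an embedding table instead of a pre-trained language model); it only shows how visual features, language features, cross-modal fusion, and an action decoder chain together.

```python
# Minimal VLA pipeline skeleton: visual encoder -> language encoder ->
# cross-modal fusion -> action decoder (all components are toy stand-ins).
import torch
import torch.nn as nn

class TinyVLA(nn.Module):
    def __init__(self, d=256, n_actions=3):         # e.g. steer, throttle, brake
        super().__init__()
        self.visual_encoder = nn.Sequential(         # stands in for a CNN/ViT backbone
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8), nn.Flatten(), nn.Linear(32 * 64, d))
        self.language_encoder = nn.Embedding(1000, d)         # stands in for an LLM
        self.fusion = nn.MultiheadAttention(d, 4, batch_first=True)
        self.action_decoder = nn.Sequential(nn.Linear(d, d), nn.ReLU(),
                                            nn.Linear(d, n_actions))

    def forward(self, image, instruction_ids):
        vis = self.visual_encoder(image).unsqueeze(1)         # (B, 1, d)
        txt = self.language_encoder(instruction_ids)          # (B, T, d)
        fused, _ = self.fusion(query=vis, key=txt, value=txt) # cross-modal fusion
        return self.action_decoder(fused.squeeze(1))          # (B, n_actions)

model = TinyVLA()
image = torch.randn(2, 3, 128, 128)
instruction = torch.randint(0, 1000, (2, 6))                  # tokenized command
print(model(image, instruction).shape)                        # torch.Size([2, 3])
```

A real system would replace each stub with a pre-trained encoder and would typically predict trajectories or control sequences rather than a single action vector.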