Vision-Language-Action Models
The Value of Open Source to Robotics Far Exceeds Expectations | Tang Wenbin in Deep Conversation with a Hugging Face Co-founder
具身智能之心· 2025-10-21 00:03
Core Insights
- The article examines the gap between simulation and real-world performance in robotics and introduces RoboChallenge.ai as a standardized evaluation platform for embodied intelligence [2][42][51]
Group 1: Current Challenges in Robotics
- Many models perform well in simulation but fail in real-world scenarios, a major pain point in robotics research [2][42]
- A unified, open, and reproducible evaluation system is needed, since current benchmarks are primarily simulation-based [50][44]
Group 2: Introduction of RoboChallenge.ai
- RoboChallenge.ai is launched as an open, standardized platform for evaluating robotic models in real-world environments, allowing researchers to remotely test their models on physical robots [6][51]
- The platform lets users drive the remote robots from locally hosted models through an API, so no model upload is required; a minimal client sketch of this pattern follows below [8][53]
Group 3: Importance of Open Source in Robotics
- Open source is identified as a crucial driver of advances in AI and robotics, enabling collaboration and innovation across global teams [10][19]
- The article argues that open source may matter even more in robotics than in large language models (LLMs), because applying a model requires access to hardware [20][22]
Group 4: Future Directions and Community Involvement
- The next three to five years are expected to bring significant evolution in embodied intelligence research, with robots executing longer and more complex tasks [82]
- Community participation is encouraged, with the expectation that diverse contributions will broaden data availability and improve model robustness [66][68]
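The remote-evaluation pattern described above can be pictured as an observation-action exchange over HTTP: the model stays on the researcher's machine, and only observations and actions cross the network. The sketch below is a hypothetical client under stated assumptions; the endpoint paths, JSON fields, and MyLocalPolicy class are illustrative placeholders, not RoboChallenge.ai's documented API.

```python
# Hypothetical sketch of a remote real-robot evaluation loop.
# Endpoint paths, JSON fields, and MyLocalPolicy are illustrative assumptions,
# not RoboChallenge.ai's actual API.
import base64
import numpy as np
import requests

SERVER = "https://example-robot-bench.org/api"   # placeholder URL
SESSION_TOKEN = "YOUR_TOKEN"                     # placeholder credential

class MyLocalPolicy:
    """Stand-in for a locally hosted VLA model; it is never uploaded to the server."""
    def act(self, rgb: np.ndarray, instruction: str) -> list[float]:
        # A real policy would run the model here; we return a zero action.
        return [0.0] * 7  # e.g., 6-DoF end-effector delta + gripper

def run_episode(task_id: str, policy: MyLocalPolicy, max_steps: int = 200) -> bool:
    headers = {"Authorization": f"Bearer {SESSION_TOKEN}"}
    # Ask the server to reset the physical robot and start an episode (assumed endpoint).
    ep = requests.post(f"{SERVER}/tasks/{task_id}/episodes", headers=headers).json()
    for _ in range(max_steps):
        # Pull the latest observation captured by the remote robot's cameras.
        obs = requests.get(f"{SERVER}/episodes/{ep['id']}/observation", headers=headers).json()
        rgb = np.frombuffer(base64.b64decode(obs["rgb"]), dtype=np.uint8).reshape(obs["shape"])
        action = policy.act(rgb, obs["instruction"])
        # Push the action back; the server executes it on the real robot and reports status.
        step = requests.post(f"{SERVER}/episodes/{ep['id']}/action",
                             json={"action": action}, headers=headers).json()
        if step.get("done"):
            return bool(step.get("success", False))
    return False
```

The key design point the article emphasizes survives in the sketch: evaluation happens on shared physical hardware, yet the model weights never leave the researcher's side.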
Just In: Figure 03 Makes a Stunning Debut, with 100,000 Units Planned over Four Years and Human Nannies Facing Obsolescence
36Kr· 2025-10-10 10:50
Core Insights
- Figure 03 marks the official launch of the next-generation humanoid robot, signifying the beginning of the era of general-purpose robots [1][3]
- The robot is designed for Helix, home use, and global scalability, featuring significant upgrades in design and functionality [3][8]
Design and Features
- Figure 03 features a flexible fabric outer layer, replacing the mechanical shell, and integrates a wide-angle camera in each palm [3][8]
- The robot can perform various tasks such as watering plants, serving tea, and cleaning, showcasing unprecedented intelligence and adaptability [3][8]
- Each fingertip can sense a pressure of 3 grams, allowing it to detect even the weight of a paperclip [3][17][20]
Technological Advancements
- The robot is powered by Helix, an in-house developed visual-language-action model, enabling it to learn and operate autonomously in complex environments [10]
- The visual system has been optimized for high-frequency motion control, improving clarity and responsiveness, with the frame rate doubled and latency reduced to a quarter [11][12]
- Figure 03 supports 10 Gbps millimeter-wave data offloading, allowing for continuous learning and improvement across a fleet of robots [18]
Manufacturing and Scalability
- Figure aims to produce 100,000 units over the next four years, establishing the BotQ factory with an initial annual capacity of 12,000 units [8][22]
- The design allows seamless transitions between home and commercial applications, with enhanced speed and torque density for faster operation [21][22]
User Experience and Safety
- The robot features a soft, adaptable design to ensure safety and ease of use, with a 9% reduction in weight compared to its predecessor [19]
- It includes a wireless charging system and a robust battery management system certified to international standards [19][24]
- Users can customize the robot's appearance with removable, washable coverings [24]
Robots Teach Themselves New Skills by "Watching Videos": NovaFlow Extracts Action Flows from Generated Videos for Zero-Shot Manipulation
机器之心· 2025-10-09 02:24
Core Insights
- The article presents NovaFlow, a framework that enables robots to perform complex manipulation tasks without task-specific training data or demonstrations, leveraging large video generation models to extract common-sense knowledge from internet-scale video [2][4][23]
Group 1: NovaFlow Framework Overview
- NovaFlow decouples task understanding from low-level control, allowing robots to learn from generated videos rather than from human demonstrations or trial-and-error [4][23]
- The framework consists of two main components, the Actionable Flow Generator and the Flow Executor, which together translate natural language instructions into executable 3D object flows (see the pipeline sketch after this summary) [8][9]
Group 2: Actionable Flow Generation
- The Actionable Flow Generator turns user input (natural language plus RGB-D images) into a 3D action flow through a four-step process: video generation, 2D-to-3D lifting, 3D point tracking, and object segmentation [9][12][14]
- The generator uses state-of-the-art video generation models to create instructional videos, which are then processed into actionable 3D object flows [12][14]
Group 3: Action Flow Execution
- The Flow Executor converts the abstract 3D object flows into concrete robot action sequences, choosing different strategies depending on the type of object being manipulated [15][20]
- The framework has been tested on multiple robotic platforms, demonstrating effectiveness on rigid, articulated, and deformable objects [16][18]
Group 4: Experimental Results
- NovaFlow outperformed other zero-shot methods and even surpassed imitation learning approaches trained on multiple demonstrations, showing the value of extracting common-sense knowledge from generated videos [19][20]
- It achieved high success rates on rigid and articulated objects as well as on harder deformable-object tasks, indicating robustness and versatility [19][20]
Group 5: Challenges and Future Directions
- The current open-loop planning system remains a limitation, particularly in the physical execution phase; closed-loop feedback would improve robustness to real-world uncertainty [23]
- Future research will focus on systems that can dynamically adjust or replan actions based on real-time environmental feedback [23]
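The four-step generator and the executor described above read naturally as a pipeline. The sketch below is a structural paraphrase only: every helper function is a stub standing in for the pretrained models the framework composes (video generator, depth lifter, 3D tracker, segmenter), and the centroid-following executor is a simplification; this is not the authors' released code.

```python
# Structural sketch of a NovaFlow-style pipeline: language + RGB-D in, waypoints out.
# All helpers are stubs for the pretrained models the framework plugs in.
import numpy as np

def generate_video(instruction, rgb, num_frames=8):
    # Stub: a real system calls a large video-generation model conditioned on text + image.
    return np.repeat(rgb[None], num_frames, axis=0)

def lift_to_3d(video, depth):
    # Stub: lift each 2D frame to a 3D point map using (estimated) depth.
    h, w = depth.shape
    uv = np.stack(np.meshgrid(np.arange(w), np.arange(h)), -1).reshape(-1, 2)
    pts = np.concatenate([uv, depth.reshape(-1, 1)], axis=1).astype(np.float32)
    return np.repeat(pts[None], video.shape[0], axis=0)            # (T, H*W, 3)

def track_points_3d(frames_3d):
    # Stub: a real 3D tracker follows each point across frames; identity tracks here.
    return frames_3d

def segment_object(rgb, instruction, num_points):
    # Stub: a real segmenter keeps only the task-relevant object's points.
    return np.arange(min(100, num_points))

def actionable_flow_generator(instruction, rgb, depth):
    video = generate_video(instruction, rgb)                 # 1. text + image -> instructional video
    frames_3d = lift_to_3d(video, depth)                     # 2. 2D frames -> 3D point maps
    tracks = track_points_3d(frames_3d)                      # 3. 3D point tracking across frames
    idx = segment_object(rgb, instruction, tracks.shape[1])  # 4. isolate the target object
    return tracks[:, idx, :]                                 # actionable 3D object flow: (T, N, 3)

def flow_executor(object_flow):
    # Turn object motion into end-effector waypoints; here we simply follow the flow's centroid.
    return object_flow.mean(axis=1)                          # (T, 3) waypoint per time step

rgb = np.zeros((64, 64, 3), dtype=np.uint8)
depth = np.ones((64, 64), dtype=np.float32)
waypoints = flow_executor(actionable_flow_generator("open the drawer", rgb, depth))
print(waypoints.shape)  # (8, 3)
```

The separation mirrors the article's point: everything task-specific lives in the generated object flow, while the executor stays a generic object-motion follower.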
Yuanrong Qixing Releases an All-New Assisted-Driving Platform
Shen Zhen Shang Bao· 2025-08-27 07:05
Core Viewpoint
- Yuanrong Qixing launched its next-generation assisted-driving platform, DeepRoute IO 2.0, which features a self-developed VLA (Vision-Language-Action) model that significantly improves safety and comfort compared with traditional end-to-end models [1]
Group 1: Technology and Innovation
- The VLA model integrates visual perception, semantic understanding, and action decision-making, making it more adept at handling complex road conditions [1]
- DeepRoute IO 2.0 follows a "multi-modal + multi-chip + multi-vehicle" adaptation concept, supporting both LiDAR and pure-vision versions for customized deployment across mainstream passenger-car platforms [1]
- The VLA model addresses the "black box" issue of traditional models by linking and analyzing information to infer causal relationships, and it is natively integrated with a vast knowledge base, enhancing its generalization in dynamic real-world environments [1]
Group 2: Commercialization and Partnerships
- Yuanrong Qixing has established a solid foundation for mass production and commercialization, securing targeted collaborations covering more than 10 vehicle models [1]
VLA+RL or Pure Reinforcement Learning? The Development Path of RL as Seen in 200+ Papers
具身智能之心· 2025-08-18 00:07
Core Insights
- The article provides a comprehensive analysis of the intersection of reinforcement learning (RL) and visual intelligence, focusing on the evolution of strategies and key research themes in visual reinforcement learning [5][17][25]
Group 1: Key Themes in Visual Reinforcement Learning
- More than 200 representative studies are organized into four pillars: multimodal large language models, visual generation, unified model frameworks, and visual-language-action models [5][17]
- Each pillar is examined for algorithm design, reward engineering, and benchmark progress, highlighting trends and open challenges in the field [5][17][25]
Group 2: Reinforcement Learning Techniques
- Techniques such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO) are used to improve stability and efficiency in training; a minimal sketch of the GRPO objective follows this summary [15][16]
- Reward models, including those based on human feedback and verifiable rewards, are emphasized as central to guiding the training of visual reinforcement learning agents [10][12][21]
Group 3: Applications in Visual and Video Reasoning
- Reinforcement learning is applied to visual reasoning tasks including 2D and 3D perception, image reasoning, and video reasoning, improving task performance [18][19][20]
- Specific studies use reinforcement learning to strengthen capabilities on complex visual tasks such as object detection and spatial reasoning [18][19][20]
Group 4: Evaluation Metrics and Benchmarks
- New evaluation metrics tailored to large-model visual reinforcement learning are needed, combining traditional metrics with preference-based assessments [31][35]
- The article surveys benchmarks that support training and evaluation in the visual domain, emphasizing the role of human preference data in shaping reward models [40][41]
Group 5: Future Directions and Challenges
- Key challenges include balancing depth and efficiency in reasoning processes, and future research directions are suggested to address them [43][44]
- Developing adaptive strategies and hierarchical reinforcement learning is highlighted as a way to improve visual-language-action agents [43][44]
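Of the two optimizers named above, GRPO is the more compact to write down: it drops PPO's learned value function and instead normalizes each sampled response's reward against the mean and standard deviation of its own group, then reuses PPO's clipped policy-ratio surrogate. The sketch below is a generic NumPy illustration of that published formulation, not code from any particular paper.

```python
# Minimal sketch of GRPO-style group-relative advantages plus the clipped surrogate loss.
# Generic illustration of the published formulation, not tied to a specific codebase.
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """rewards: (G,) scalar rewards for G responses sampled from the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped objective shared with PPO (to be maximized)."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return np.minimum(unclipped, clipped).mean()

# Example: 4 responses to one prompt, scored by a (verifiable or learned) reward model.
rewards = np.array([1.0, 0.0, 0.5, 0.0])
adv = grpo_advantages(rewards)
loss = -clipped_surrogate(np.array([-1.1, -2.3, -1.8, -2.0]),   # new policy log-probs
                          np.array([-1.2, -2.2, -1.9, -2.1]),   # old policy log-probs
                          adv)
print(adv.round(3), round(float(loss), 4))
```

Because the baseline comes from the group itself, the same recipe works whether the reward is a verifiable check (e.g., task success) or a learned preference score, which is why the survey pairs it so often with reward-model design.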
Latest Survey on Visual Reinforcement Learning: A Field-Wide Review (NUS, Zhejiang University & CUHK)
自动驾驶之心· 2025-08-16 00:03
Core Insights
- The article discusses the integration of reinforcement learning with computer vision, marking a paradigm shift in how AI interacts with visual data [3][4]
- It highlights the potential for AI not only to understand but also to create and optimize visual content based on human preferences, turning AI from a passive observer into an active decision-maker [4]
Research Background and Overview
- The emergence of visual reinforcement learning (VRL) is driven by the successful application of reinforcement learning to large language models (LLMs) [7]
- Three core challenges are identified: stable policy optimization under complex reward signals, efficient processing of high-dimensional visual inputs, and scalable reward-function design for long-horizon decision-making [7][8]
Theoretical Foundations of Visual Reinforcement Learning
- The theoretical framework formalizes the problem as a Markov Decision Process (MDP), unifying the RL formulations for text and visual generation [15]
- Three main alignment paradigms are proposed: RL with human feedback (RLHF), Direct Preference Optimization (DPO), and Reinforcement Learning with Verifiable Rewards (RLVR); a minimal sketch of the DPO loss appears after this summary [16][18]
Core Applications of Visual Reinforcement Learning
- VRL research is grouped into four areas: multimodal large language models (MLLM), visual generation, unified models, and visual-language-action (VLA) models [31]
- Each area is further divided into specific tasks, with representative works analyzed for their contributions [31][32]
Evaluation Metrics and Benchmarking
- A layered evaluation framework is proposed, detailing benchmarks for each area to ensure reproducibility and comparability in VRL research [44][48]
- Effective metrics that align with human perception are needed to validate the performance of VRL systems [61]
Future Directions and Challenges
- Four key challenges are outlined: balancing depth and efficiency in reasoning, handling long-horizon RL in VLA tasks, designing reward models for visual generation, and improving data efficiency and generalization [50][52][54]
- Future research should integrate model-based planning, self-supervised visual pre-training, and adaptive curriculum learning to broaden VRL's practical applications [57]
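Of the three alignment paradigms listed, DPO is the simplest to make concrete: a preference pair becomes a single logistic loss on the policy's log-probability margins relative to a frozen reference model, with no explicit reward model or rollout loop. The sketch below is a generic PyTorch rendering of that published loss; the toy numbers stand in for model outputs and it is not code from the survey.

```python
# Generic PyTorch sketch of the Direct Preference Optimization (DPO) loss on one batch.
# logp_* are summed log-probabilities of the chosen/rejected responses under the
# trainable policy and a frozen reference model; beta is the usual temperature.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # How much more the policy prefers the chosen response than the reference does,
    # minus the same quantity for the rejected response.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy batch of two preference pairs (log-probs would normally come from the models).
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -9.0]),
                torch.tensor([-12.5, -9.6]), torch.tensor([-13.0, -9.4]))
print(loss.item())
```

The same structure carries over to visual generation once "response" means an image or video scored by human preference, which is why the survey treats DPO as one of VRL's core alignment tools.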
What Is the VLM Often Mentioned in Autonomous Driving, and How Does It Differ from VLA?
自动驾驶之心· 2025-08-08 16:04
Core Viewpoint
- The article discusses the significance of vision-language models (VLMs) in autonomous driving, highlighting their ability to integrate visual perception and natural language processing so vehicles can better understand and interact with complex road environments [4][19]
What is VLM?
- VLM stands for Vision-Language Model: a single AI system that understands both images and text, enabling deep comprehension of visual content and natural language interaction for applications such as image retrieval, writing assistance, and robotic navigation [6]
How to Make VLM Work Efficiently?
- A VLM turns raw road images into feature representations using visual encoders such as convolutional neural networks (CNNs) and Vision Transformers (ViTs); language encoders and decoders handle natural language input and output, learning semantic relationships between tokens [8]
Key Mechanism of VLM
- Aligning visual features with the language modules is crucial: cross-attention lets the language decoder focus on relevant image regions when generating text, keeping the generated language consistent with the actual scene (a minimal cross-attention sketch follows this summary) [9]
Training Process of VLM
- Training typically involves pre-training on large datasets followed by fine-tuning on datasets specific to autonomous-driving scenarios, so the model can accurately recognize and respond to traffic signs and road conditions [11]
Applications of VLM
- VLM supports intelligent functions including real-time scene alerts, interactive semantic Q&A, and recognition of road signs and text; it can generate natural-language prompts from visual inputs, enhancing driver awareness and decision-making [12]
Real-time Operation of VLM
- VLM runs in a "cloud-edge collaboration" architecture: large-scale pre-training happens in the cloud, while optimized lightweight models are deployed in vehicles for real-time processing, enabling quick safety alerts alongside more complex analyses [14]
Data Annotation and Quality Assurance
- Data annotation is critical for deployment, requiring detailed labeling of images under varied conditions to guarantee high-quality training data and reliable real-world performance [14]
Safety and Robustness
- Safety and robustness are paramount in autonomous driving: a VLM must quickly assess uncertainty and fall back to safe behavior when recognition errors occur, ensuring reliable operation under adverse conditions [15]
Differences Between VLA and VLM
- VLA (Vision-Language-Action) extends VLM with action decision-making: VLM focuses on understanding and expressing visual information, while VLA spans perception, cognition, and execution, which is essential for real-world applications such as autonomous driving [18]
Future Developments
- Continued progress in large language models (LLMs) and large vision models (LVMs) will strengthen VLM's multi-modal integration, knowledge updating, and human-machine collaboration, leading to safer and more comfortable autonomous driving [16][19]
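The "key mechanism" above, the language decoder attending to image regions through cross-attention, fits in a few lines of standard PyTorch. The sketch below is a generic illustration of that mechanism (text-token queries attending to a ViT-style patch-feature tensor); the dimensions are arbitrary and it is not any particular production VLM.

```python
# Generic sketch of the cross-attention step that lets text tokens "look at" image patches.
# Shapes and sizes are illustrative; this is not a specific production VLM.
import torch
import torch.nn as nn

class TextToImageCrossAttention(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, text_tokens: torch.Tensor, image_patches: torch.Tensor) -> torch.Tensor:
        # Queries come from the language decoder; keys/values come from the visual encoder,
        # so each generated word can be grounded in the relevant image regions.
        fused, _ = self.attn(query=text_tokens, key=image_patches, value=image_patches)
        return fused  # same shape as text_tokens: (batch, num_text_tokens, d_model)

batch, n_text, n_patches, d = 2, 12, 196, 256   # e.g., a 14x14 ViT patch grid
layer = TextToImageCrossAttention(d_model=d)
out = layer(torch.randn(batch, n_text, d), torch.randn(batch, n_patches, d))
print(out.shape)  # torch.Size([2, 12, 256])
```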
Simulating the Brain's Division of Labor: Fast-in-Slow VLA Unifies "Fast Action" and "Slow Reasoning"
具身智能之心· 2025-07-13 09:48
Core Viewpoint
- The article introduces Fast-in-Slow (FiS-VLA), a dual-system vision-language-action model that integrates high-frequency response and complex reasoning for robotic control, delivering significant gains in control frequency and task success rates [5][29]
Group 1: Model Overview
- FiS-VLA combines a fast execution module with a pre-trained vision-language model (VLM), reaching a control frequency of up to 117.7 Hz, significantly higher than existing mainstream solutions [5][25]
- The dual-system architecture is inspired by Kahneman's dual-system theory: System 1 handles rapid, intuitive decision-making, while System 2 performs slower, deeper reasoning [9][14]
Group 2: Architecture and Design
- The architecture includes a visual encoder, a lightweight 3D tokenizer, and a large language model (LLaMA2-7B), with the last few transformer layers repurposed as the execution module [13]
- The two systems take heterogeneous inputs: System 2 processes 2D images and language instructions, while System 1 consumes real-time sensory input including 2D images and 3D point-cloud data (a scheduling sketch of the two loops follows this summary) [15]
Group 3: Performance and Testing
- In simulation, FiS-VLA achieved an average success rate of 69% across tasks, outperforming models such as CogACT and π0 [18]
- Real-world tests on robotic platforms showed success rates of 68% and 74% on different task suites, with superior performance in high-precision control scenarios [20]
- The model generalizes robustly, with a smaller accuracy drop than baseline models when facing unseen objects and varying environmental conditions [23]
Group 4: Training and Optimization
- FiS-VLA uses a dual-system collaborative training strategy, enhancing System 1's action generation through diffusion modeling while retaining System 2's reasoning capabilities [16]
- Ablation studies indicate that System 1 performs best when sharing two transformer layers, and that the optimal operating-frequency ratio between the two systems is 1:4 [25]
Group 5: Future Prospects
- The authors suggest that dynamically adjusting the shared structure and the collaboration-frequency strategy could further improve the model's adaptability and robustness in practical applications [29]
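The 1:4 frequency ratio reported above can be pictured as a control loop in which System 2 refreshes a latent "intent" every few ticks while System 1 emits an action on every tick. The sketch below shows only that scheduling idea; the SlowReasoner and FastActor classes are placeholders, not the FiS-VLA implementation or its shared-layer mechanism.

```python
# Schematic of a dual-system control loop: slow reasoning at low frequency conditions
# fast action generation at high frequency. Placeholder classes, not FiS-VLA code.
import numpy as np

class SlowReasoner:
    """System 2 stand-in: digests image + instruction into a latent intent (runs rarely)."""
    def infer(self, rgb: np.ndarray, instruction: str) -> np.ndarray:
        return np.random.randn(64).astype(np.float32)    # latent conditioning vector

class FastActor:
    """System 1 stand-in: maps latest sensors + latent intent to an action (runs every tick)."""
    def act(self, rgb: np.ndarray, pointcloud: np.ndarray, intent: np.ndarray) -> np.ndarray:
        return np.zeros(7, dtype=np.float32)             # e.g., 6-DoF delta + gripper

def control_loop(steps: int = 20, slow_every: int = 4):
    slow, fast = SlowReasoner(), FastActor()
    intent = None
    for t in range(steps):
        rgb = np.zeros((224, 224, 3), dtype=np.uint8)    # placeholder camera frame
        pcd = np.zeros((1024, 3), dtype=np.float32)      # placeholder point cloud
        if t % slow_every == 0:                          # 1:4 slow-to-fast ratio
            intent = slow.infer(rgb, "put the cup on the shelf")
        action = fast.act(rgb, pcd, intent)              # high-frequency action output
        # the action would be sent to the robot controller here
    return action

control_loop()
```

In the actual model the "fast" path is not a separate network but a set of transformer layers shared with the VLM; the loop above only conveys why the fast path can run at a much higher rate than the reasoning path.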
A First: World Model and Action Model Fused, the Fully Autoregressive WorldVLA Arrives
机器之心· 2025-07-03 08:01
Core Viewpoint
- Alibaba's Damo Academy has introduced WorldVLA, a model that integrates a world model and an action model into a unified autoregressive framework, enhancing understanding and generation across text, images, and actions [1][4]
Research Overview
- Vision-language-action (VLA) models have become a significant focus in robotic action modeling, typically built on large-scale pretrained multimodal language models (MLLMs) with added action-output capabilities [4]
- Existing VLA models often lack a deep understanding of actions, treating them merely as outputs rather than analyzing them as inputs [5]
Model Description
- WorldVLA addresses the limitations of both VLA models and world models through a unified autoregressive mechanism for understanding and generating actions and images [5][10]
- It employs three independent tokenizers for images, text, and actions that share the same vocabulary, facilitating cross-modal tasks within a single model [12]
Mechanism and Strategy
- The world-model component generates visual predictions from input actions, learning the physical dynamics of the environment, while the action-model component enhances visual understanding [7]
- An action attention masking strategy mitigates error accumulation when generating multiple actions, significantly improving performance in action-chunking tasks (a sketch of such a mask follows this summary) [8][14]
Experimental Results
- On the LIBERO benchmark, WorldVLA improved grasp success rate by 4% over traditional action models and reduced Fréchet Video Distance (FVD) by 10% compared with traditional world models [8]
- The attention-mask strategy improved grasp success rates by 4% to 23% in action-chunking tasks [8]
Comparative Analysis
- WorldVLA outperformed other models across various metrics, demonstrating its effectiveness in integrating action and world modeling [18]
- Its ability to generate the next frame conditioned on actions and images showcases advanced visual-prediction capabilities [24]
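The action attention mask can be made concrete as a boolean mask over a [text | image | action] token sequence: attention stays causal overall, but each action token is blocked from attending to earlier action tokens in the chunk, so an error in an early action cannot propagate forward. The sketch below only builds such a mask; the token counts are arbitrary and it is not the Damo Academy implementation.

```python
# Sketch of an action-attention mask over a [text | image | action] token sequence:
# causal overall, but action tokens cannot attend to previously generated action tokens,
# so early-action errors do not compound through the chunk. Illustrative only.
import torch

def action_attention_mask(n_text: int, n_image: int, n_action: int) -> torch.Tensor:
    n = n_text + n_image + n_action
    allowed = torch.ones(n, n).tril().bool()   # standard causal mask (True = may attend)
    a0 = n_text + n_image                      # index of the first action token
    for i in range(a0, n):
        allowed[i, a0:i] = False   # action token i ignores all previously generated actions
        allowed[i, i] = True       # ...but still attends to itself and to all text/image tokens
    return allowed

mask = action_attention_mask(n_text=4, n_image=6, n_action=3)
print(mask[-1].int().tolist())   # last action token: sees text+image tokens, not the two prior actions
```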
What Is the VLA Often Mentioned in Autonomous Driving?
自动驾驶之心· 2025-06-18 13:37
Core Viewpoint
- The article explains the Vision-Language-Action (VLA) model, which integrates visual perception, language understanding, and action decision-making into a unified framework for autonomous driving, enhancing system generalization and adaptability [2][4][12]
Introduction to VLA
- VLA stands for Vision-Language-Action and aims to unify environmental observation and control-command output in autonomous driving [2]
- The model represents a shift from traditional modular pipelines to an end-to-end system driven by large-scale data [2][4]
Technical Framework of VLA
- The VLA model consists of four key components (a toy composition of the four appears after this summary):
  1. Visual Encoder: extracts features from images and point-cloud data [8]
  2. Language Encoder: uses pre-trained language models to understand navigation instructions and traffic rules [11]
  3. Cross-Modal Fusion Layer: aligns and integrates visual and language features into a unified environmental understanding [11]
  4. Action Decoder: generates control commands from the fused multi-modal representation [8][11]
Advantages of VLA
- VLA enhances scene generalization and contextual reasoning, allowing quicker and more reasonable decision-making in complex scenarios [12]
- The integration of language understanding allows more flexible driving strategies and improved human-vehicle interaction [12]
Industry Applications
- Companies including DeepMind and Yuanrong Qixing are applying VLA concepts in their autonomous-driving research, showcasing its potential in real-world applications [13]
- DeepMind's RT-2 model and Yuanrong Qixing's "end-to-end 2.0" system highlight the advances in intelligent driving systems [13]
Challenges and Future Directions
- Despite its advantages, VLA faces challenges such as limited interpretability, high data-quality requirements, and significant computational resource demands [13][15]
- Solutions being explored include integrating interpretability modules, optimizing trajectory generation, and combining VLA with traditional control methods to enhance safety and robustness [15][16]
- With continued advances in large models and edge computing, VLA is expected to become a foundational technology for autonomous driving [16]
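The four components listed above compose into a single forward pass: encode camera features, encode the instruction, fuse the two, and decode a control command. The sketch below is a toy PyTorch composition that mirrors that structure only; module sizes and the two-value (steering, acceleration) output head are illustrative assumptions, not a production driving stack.

```python
# Toy end-to-end composition of the four VLA components described above:
# visual encoder -> language encoder -> cross-modal fusion -> action decoder.
# Dimensions and the (steering, acceleration) head are illustrative, not a real stack.
import torch
import torch.nn as nn

class ToyDrivingVLA(nn.Module):
    def __init__(self, d: int = 256, vocab: int = 1000):
        super().__init__()
        self.visual_encoder = nn.Sequential(                   # image -> patch-like features
            nn.Conv2d(3, d, kernel_size=16, stride=16), nn.Flatten(2))
        self.language_encoder = nn.Embedding(vocab, d)          # instruction tokens -> features
        self.fusion = nn.MultiheadAttention(d, 8, batch_first=True)  # cross-modal fusion
        self.action_decoder = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 2))

    def forward(self, image: torch.Tensor, instruction_ids: torch.Tensor) -> torch.Tensor:
        vis = self.visual_encoder(image).transpose(1, 2)        # (B, num_patches, d)
        txt = self.language_encoder(instruction_ids)            # (B, num_tokens, d)
        fused, _ = self.fusion(query=txt, key=vis, value=vis)   # language queries attend to vision
        return self.action_decoder(fused.mean(dim=1))           # (B, 2): e.g., steering, accel

model = ToyDrivingVLA()
cmd = model(torch.randn(1, 3, 224, 224), torch.randint(0, 1000, (1, 8)))
print(cmd.shape)  # torch.Size([1, 2])
```

Real systems decode trajectories or control sequences rather than a single pair of scalars, but the module boundaries match the four-part framework the article describes.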