Vision-Language-Action Models
Qualcomm Unveils Robotics Chip Architecture, Betting on "Physical AI" | Live from CES
Xin Lang Ke Ji· 2026-01-05 19:58
Special coverage: CES 2026 (International Consumer Electronics Show). Qualcomm is building out a full robotics ecosystem and has partnered with robot makers including Figure AI, Booster, VinMotion, and Kuka Robotics. Figure AI, the closely watched US startup, will use the Dragonwing IQ10 to develop its next-generation humanoid robot, while Vietnam's VinMotion demonstrated its Motion 2 humanoid at the show running the previous-generation IQ9 chip.

The architecture supports end-to-end AI models such as vision-language-action (VLA) and vision-language models (VLM), enabling advanced perception, motion planning, and human-robot interaction. Qualcomm says this marks a major leap for robots from the prototype stage to real commercial deployment.

Qualcomm's automotive push is equally notable: its Snapdragon Cockpit Elite platform has become the de facto standard for premium electric vehicles. Built on the custom Oryon CPU architecture, it outperforms competing solutions from NVIDIA and Intel in power efficiency and connectivity, has been adopted by nearly every major automaker, including GM, BMW, Hyundai, and Ferrari, and backs an automotive revenue pipeline exceeding US$45 billion. Qualcomm is also in talks with Kuka Robotics about next-generation robotics solutions, underscoring an ambition that spans mobile, PC, automotive, and robotics.
Expert: Large-Scale Deployment of Embodied Intelligence Is Still at an Early Stage
Zhong Guo Xin Wen Wang· 2025-12-13 12:33
China News Service, Beijing, December 13 (reporter Liu Yuying): At the 2026 CAICT Deep Observation Report Conference held in Beijing on the 13th, Xu Zhiyuan, deputy chief engineer of the China Academy of Information and Communications Technology, said that embodied intelligence has achieved breakthroughs on both the cognitive-intelligence and physical-intelligence fronts, but large-scale deployment is still at an early stage.

He noted that the model route, data paradigm, and optimal robot form factor for embodied intelligence have yet to settle, that large-scale deployment remains at an early stage, and that the field's future direction is still being shaped by ongoing competition and rapid evolution.

"The industry still faces three core questions," Xu said. The first is the debate over the model route: whether the large-model paradigm applies to robots at all. Large models have been hugely successful in language, image, and video, but whether "the same paradigm can transfer directly to robot control" remains unproven, and the industry is exploring multiple approaches.

The second is the debate over the data and training paradigm. Data remains the core bottleneck holding back leaps in robot capability; mixed data, multimodal data, and world-model-generated data are all being explored.

The third is the debate over form factor: whether humanoid robots are a "real need." Tesla, Figure AI, and others are sticking to the fully humanoid route, while in China this year a wave of ...

Looking ahead, Xu believes that introducing a world model on top of VLA (vision-language-action) models, and drawing on its ability to understand, predict, and reason about the physical world, is a promising path toward further improving the capability of robot foundation models. (End)
Beyond ORION! CoT4AD: A VLA Model with Explicit Chain-of-Thought Reasoning (Latest from Peking University)
自动驾驶之心· 2025-12-02 00:03
Core Insights
- The article introduces CoT4AD, a new Vision-Language-Action (VLA) framework designed to enhance logical and causal reasoning capabilities in autonomous driving scenarios, addressing limitations in existing VLA models [1][3][10]

Background Review
- Autonomous driving is a key research area in AI and robotics, promising improvements in traffic safety and efficiency, and playing a crucial role in smart city and intelligent transportation system development [2]
- Traditional modular architectures in autonomous driving face challenges such as error accumulation and limited generalization, leading to the emergence of end-to-end paradigms that utilize unified learning frameworks [2][3]

CoT4AD Framework
- CoT4AD integrates chain-of-thought reasoning into end-to-end autonomous driving, allowing for explicit or implicit reasoning through a series of downstream tasks tailored for driving scenarios [3][10]
- The framework combines perception, language reasoning, future prediction, and trajectory planning, enabling the generation of explicit reasoning steps [6][10]

Experimental Results
- CoT4AD was evaluated on the nuScenes and Bench2Drive datasets, achieving state-of-the-art performance in both open-loop and closed-loop assessments, outperforming existing LLM-based and end-to-end methods [10][19]
- In the nuScenes dataset, CoT4AD achieved L2 distance errors of 0.12 m, 0.24 m, and 0.53 m at 1 s, 2 s, and 3 s respectively, with an average collision rate of 0.10% [17][18] (a sketch of these open-loop metrics follows after this summary)

Contributions of CoT4AD
- The model's design allows for robust multi-task processing and future trajectory prediction, leveraging a diffusion model integrated with chain-of-thought reasoning [10][12]
- CoT4AD demonstrates superior performance in complex driving scenarios, enhancing decision-making consistency and reliability across diverse environments [19][23]

Ablation Studies
- The effectiveness of various components, such as perception tokenizers and the chain-of-thought design, was validated through ablation studies, showing significant performance improvements when these elements were included [26][28]
- The model's ability to predict future scenarios was found to be crucial, with optimal performance achieved when predicting four future scenarios [29]

Conclusion
- CoT4AD represents a significant advancement in autonomous driving technology, demonstrating enhanced reasoning capabilities and superior performance compared to existing methods, while also highlighting areas for future research to improve computational efficiency [30][32]
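The L2 and collision numbers above are the standard nuScenes open-loop planning metrics. As a rough illustration only (this is not the paper's evaluation code; the array shapes, the 2 Hz sampling rate, and the choice of reporting error at each horizon rather than averaged up to it are all assumptions), a minimal NumPy sketch could look like this:

```python
import numpy as np

def open_loop_metrics(pred, gt, collision_mask, horizons_s=(1.0, 2.0, 3.0), hz=2.0):
    """L2 displacement error at given horizons plus an average collision rate.

    Assumed shapes (not taken from CoT4AD):
      pred, gt       -- (N, T, 2) predicted / ground-truth BEV waypoints at `hz` Hz
      collision_mask -- (N, T) bool, True where the predicted pose overlaps an obstacle
    """
    per_step_l2 = np.linalg.norm(pred - gt, axis=-1)                   # (N, T)
    l2_at = {f"{h:g}s": float(per_step_l2[:, int(h * hz) - 1].mean())  # error at the horizon step
             for h in horizons_s}
    collision_rate = float(collision_mask.mean())                      # fraction of colliding steps
    return l2_at, collision_rate
```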
The Value of Open Source to Robotics Far Exceeds Imagination | Tang Wenbin in a Deep Conversation with a Hugging Face Co-Founder
具身智能之心· 2025-10-21 00:03
Core Insights
- The article discusses the challenges in the field of robotics, particularly the gap between simulation and real-world application, and introduces RoboChallenge.ai as a solution to create a standardized evaluation platform for embodied intelligence [2][42][51]

Group 1: Current Challenges in Robotics
- Many models perform well in simulations but fail in real-world scenarios, highlighting a significant pain point in robotics research [2][42]
- The need for a unified, open, and reproducible evaluation system for robotics is emphasized, as current benchmarks are primarily based on simulations [50][44]

Group 2: Introduction of RoboChallenge.ai
- RoboChallenge.ai is launched as an open, standardized platform for evaluating robotic models in real-world environments, allowing researchers to remotely test their models on physical robots [6][51]
- The platform enables users to control local models through an API, facilitating remote testing without the need to upload models [8][53] (a hypothetical sketch of this pattern follows after this summary)

Group 3: Importance of Open Source in Robotics
- Open source is identified as a crucial driver for advancements in AI and robotics, enabling collaboration and innovation across global teams [10][19]
- The article argues that open source in robotics may be even more critical than in large language models (LLMs) due to the necessity of hardware accessibility for model application [20][22]

Group 4: Future Directions and Community Involvement
- The article anticipates that the next three to five years will see significant evolution in embodied intelligence research, with robots capable of executing longer and more complex tasks [82]
- Community participation is encouraged, with the expectation that diverse contributions will enhance data availability and model robustness [66][68]
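The "control local models through an API" bullet describes a pattern in which inference stays on the researcher's machine while observations and actions are exchanged with the remote robot. The sketch below is purely hypothetical: the endpoint paths, payload fields, and token header are invented for illustration and are not RoboChallenge.ai's actual API.

```python
# Hypothetical remote-evaluation loop: the policy runs locally, only
# observations and actions cross the network, and model weights never
# leave the researcher's machine. Endpoints and fields are placeholders.
import requests

BASE_URL = "https://robot-eval.example.invalid/api"   # placeholder, not a real endpoint
TOKEN = {"Authorization": "Bearer YOUR_TOKEN"}        # placeholder credential

def run_episode(policy, max_steps=200):
    """policy: any local callable mapping an observation dict to an action list."""
    result = requests.post(f"{BASE_URL}/episode/start", headers=TOKEN).json()
    obs = result["observation"]
    for _ in range(max_steps):
        action = policy(obs)                           # inference stays local
        result = requests.post(f"{BASE_URL}/episode/step",
                               headers=TOKEN, json={"action": action}).json()
        obs = result["observation"]
        if result.get("done"):
            break
    return result.get("score")
```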
Just In: Figure 03 Makes a Stunning Debut, 100,000 Units to Be Built in Four Years, and Human Nannies Face Mass Unemployment
36Ke· 2025-10-10 10:50
Core Insights
- Figure 03 marks the official launch of the next-generation humanoid robot, signifying the beginning of the era of general-purpose robots [1][3]
- The robot is designed for Helix, home use, and global scalability, featuring significant upgrades in design and functionality [3][8]

Design and Features
- Figure 03 features a flexible fabric outer layer, replacing the mechanical shell, and integrates a wide-angle camera in each palm [3][8]
- The robot can perform various tasks such as watering plants, serving tea, and cleaning, showcasing unprecedented intelligence and adaptability [3][8]
- Each fingertip can sense a pressure of 3 grams, allowing it to detect even the weight of a paperclip [3][17][20]

Technological Advancements
- The robot is powered by Helix, an in-house developed visual-language-action model, enabling it to learn and operate autonomously in complex environments [10]
- The visual system has been optimized for high-frequency motion control, improving clarity and responsiveness, with a frame rate doubled and latency reduced to a quarter [11][12]
- Figure 03 supports 10 Gbps millimeter-wave data offloading, allowing for continuous learning and improvement from a fleet of robots [18]

Manufacturing and Scalability
- Figure aims to produce 100,000 units over the next four years, establishing the BotQ factory with an initial annual capacity of 12,000 units [8][22]
- The design allows seamless transitions between home and commercial applications, with enhanced speed and torque density for faster operations [21][22]

User Experience and Safety
- The robot features a soft, adaptable design to ensure safety and ease of use, with a 9% reduction in weight compared to its predecessor [19]
- It includes a wireless charging system and a robust battery management system, certified by international standards [19][24]
- Users can customize the robot's appearance with removable, washable coverings [24]
Robots Teach Themselves New Skills by "Watching Videos": NovaFlow Extracts Action Flows from Generated Videos for Zero-Shot Manipulation
机器之心· 2025-10-09 02:24
Core Insights
- The article discusses the development of NovaFlow, a novel framework for enabling robots to perform complex manipulation tasks without requiring extensive training data or demonstrations, leveraging large video generation models to extract common-sense knowledge from vast amounts of internet video content [2][4][23]

Group 1: NovaFlow Framework Overview
- NovaFlow aims to decouple task understanding from low-level control, allowing robots to learn from generated videos rather than requiring human demonstrations or trial-and-error learning [4][23]
- The framework consists of two main components: the Actionable Flow Generator and the Flow Executor, which work together to translate natural language instructions into executable 3D object flows [8][9] (a schematic sketch of this split follows after this summary)

Group 2: Actionable Flow Generation
- The Actionable Flow Generator translates user input (natural language and RGB-D images) into a 3D action flow through a four-step process, including video generation, 2D to 3D enhancement, 3D point tracking, and object segmentation [9][12][14]
- The generator utilizes state-of-the-art video generation models to create instructional videos, which are then processed to extract actionable 3D object flows [12][14]

Group 3: Action Flow Execution
- The Flow Executor converts the abstract 3D object flows into specific robot action sequences, employing different strategies based on the type of object being manipulated [15][20]
- The framework has been tested on various robotic platforms, demonstrating its effectiveness in manipulating rigid, articulated, and deformable objects [16][18]

Group 4: Experimental Results
- NovaFlow outperformed other zero-shot methods and even surpassed traditional imitation learning approaches that required multiple demonstration data points, showcasing the potential of extracting common-sense knowledge from generated videos [19][20]
- The framework achieved high success rates in tasks involving rigid and articulated objects, as well as more complex tasks with deformable objects, indicating its robustness and versatility [19][20]

Group 5: Challenges and Future Directions
- Despite its successes, the research highlights limitations in the current open-loop planning system, particularly in the physical execution phase, suggesting a need for closed-loop feedback systems to enhance robustness against real-world uncertainties [23]
- Future research will focus on developing systems that can dynamically adjust or replan actions based on real-time environmental feedback, further advancing the capabilities of autonomous robots [23]
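To make the split between actionable-flow generation and flow execution concrete, here is a minimal, hedged sketch. It is not NovaFlow's code: the four sub-models are passed in as callables because the article describes them only at a high level, and only a rigid-object execution strategy (per-step rigid-transform fitting) is shown.

```python
import numpy as np

def generate_actionable_flow(instruction, rgb, depth,
                             generate_video, lift_to_3d, track_points, segment_object):
    """Language + RGB-D -> (T, K, 3) flow of tracked 3D keypoints on the target object."""
    video = generate_video(instruction, rgb)      # 1. synthesize an instructional video
    frames_3d = lift_to_3d(video, depth)          # 2. lift 2D frames to 3D using depth
    tracks = track_points(frames_3d)              # 3. (T, K_all, 3) keypoint tracks over time
    keep = segment_object(video, instruction)     # 4. boolean mask selecting the target object
    return tracks[:, keep, :]

def fit_rigid_transform(src, dst):
    """Least-squares rigid transform (Kabsch) mapping src onto dst, both (K, 3)."""
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    u, _, vt = np.linalg.svd(src_c.T @ dst_c)
    rot = vt.T @ u.T
    if np.linalg.det(rot) < 0:                    # correct an improper reflection
        vt[-1] *= -1
        rot = vt.T @ u.T
    return rot, dst.mean(0) - rot @ src.mean(0)

def execute_flow(flow, move_relative):
    """Rigid-object executor: follow the relative pose implied by consecutive flow steps."""
    for t in range(1, flow.shape[0]):
        rotation, translation = fit_rigid_transform(flow[t - 1], flow[t])
        move_relative(rotation, translation)      # assumed robot interface taking a delta pose
```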
Yuanrong Qixing (DeepRoute.ai) Launches an All-New Assisted Driving Platform
Shen Zhen Shang Bao· 2025-08-27 07:05
Core Viewpoint
- Yuanrong Qixing launched its next-generation assisted driving platform, DeepRoute IO 2.0, which features a self-developed VLA (Vision-Language-Action) model that significantly improves safety and comfort compared to traditional end-to-end models [1]

Group 1: Technology and Innovation
- The VLA model integrates visual perception, semantic understanding, and action decision-making, making it more adept at handling complex road conditions [1]
- DeepRoute IO 2.0 is designed with a "multi-modal + multi-chip + multi-vehicle" adaptation concept, supporting both LiDAR and pure vision versions for customized deployment across various mainstream passenger car platforms [1]
- The VLA model addresses the "black box" issue of traditional models by linking and analyzing information to infer causal relationships, and it is inherently integrated with a vast knowledge base, enhancing its generalization ability in dynamic real-world environments [1]

Group 2: Commercialization and Partnerships
- Yuanrong Qixing has established a solid foundation for mass production and commercialization, securing partnerships with over 10 vehicle models for targeted collaboration [1]
VLA + RL or Pure Reinforcement Learning? Tracing RL's Development Path Through 200+ Papers
具身智能之心· 2025-08-18 00:07
Core Insights
- The article provides a comprehensive analysis of the intersection of reinforcement learning (RL) and visual intelligence, focusing on the evolution of strategies and key research themes in visual reinforcement learning [5][17][25]

Group 1: Key Themes in Visual Reinforcement Learning
- The article categorizes over 200 representative studies into four main pillars: multimodal large language models, visual generation, unified model frameworks, and visual-language-action models [5][17]
- Each pillar is examined for algorithm design, reward engineering, and benchmark progress, highlighting trends and open challenges in the field [5][17][25]

Group 2: Reinforcement Learning Techniques
- Various reinforcement learning techniques are discussed, including Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), which are used to enhance stability and efficiency in training [15][16] (a minimal sketch of both follows after this summary)
- The article emphasizes the importance of reward models, such as those based on human feedback and verifiable rewards, in guiding the training of visual reinforcement learning agents [10][12][21]

Group 3: Applications in Visual and Video Reasoning
- The article outlines applications of reinforcement learning in visual reasoning tasks, including 2D and 3D perception, image reasoning, and video reasoning, showcasing how these methods improve task performance [18][19][20]
- Specific studies are highlighted that utilize reinforcement learning to enhance capabilities in complex visual tasks, such as object detection and spatial reasoning [18][19][20]

Group 4: Evaluation Metrics and Benchmarks
- The article discusses the need for new evaluation metrics tailored to large model visual reinforcement learning, combining traditional metrics with preference-based assessments [31][35]
- It provides an overview of various benchmarks that support training and evaluation in the visual domain, emphasizing the role of human preference data in shaping reward models [40][41]

Group 5: Future Directions and Challenges
- The article identifies key challenges in visual reinforcement learning, such as balancing depth and efficiency in reasoning processes, and suggests future research directions to address these issues [43][44]
- It highlights the importance of developing adaptive strategies and hierarchical reinforcement learning approaches to improve the performance of visual-language-action agents [43][44]
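Since the summary name-checks PPO and GRPO, a minimal sketch may help fix the two ideas: PPO clips the policy ratio around 1, while GRPO replaces a learned critic with advantages normalized within a group of responses to the same prompt. This is a generic illustration under simplifying assumptions (no KL penalty, per-response rather than per-token log-probabilities), not code from any of the surveyed papers:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """rewards: (G,) scalar rewards for G sampled responses to the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps: float = 0.2):
    """PPO-style clipped objective (to be maximized); all tensors have shape (G,)."""
    ratio = torch.exp(logp_new - logp_old)           # importance ratio pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return torch.min(unclipped, clipped).mean()

# usage: loss = -clipped_surrogate(logp_new, logp_old, grpo_advantages(rewards))
```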
The Latest Survey of Visual Reinforcement Learning: A Field-Wide Overview (NUS, Zhejiang University & CUHK)
自动驾驶之心· 2025-08-16 00:03
Core Insights
- The article discusses the integration of Reinforcement Learning with Computer Vision, marking a paradigm shift in how AI interacts with visual data [3][4]
- It highlights the potential for AI to not only understand but also create and optimize visual content based on human preferences, transforming AI from passive observers to active decision-makers [4]

Research Background and Overview
- The emergence of Visual Reinforcement Learning (VRL) is driven by the successful application of Reinforcement Learning in Large Language Models (LLMs) [7]
- The article identifies three core challenges in the field: stability in policy optimization under complex reward signals, efficient processing of high-dimensional visual inputs, and scalable reward function design for long-term decision-making [7][8]

Theoretical Foundations of Visual Reinforcement Learning
- The theoretical framework for VRL includes formalizing the problem using Markov Decision Processes (MDP), which unifies text and visual generation RL frameworks [15]
- Three main alignment paradigms are proposed: RL with human feedback (RLHF), Direct Preference Optimization (DPO), and Reinforcement Learning with Verifiable Rewards (RLVR) [16][18] (a minimal DPO sketch follows after this summary)

Core Applications of Visual Reinforcement Learning
- The article categorizes VRL research into four main areas: Multimodal Large Language Models (MLLM), Visual Generation, Unified Models, and Visual-Language-Action (VLA) Models [31]
- Each area is further divided into specific tasks, with representative works analyzed for their contributions [31][32]

Evaluation Metrics and Benchmarking
- A layered evaluation framework is proposed, detailing specific benchmarks for each area to ensure reproducibility and comparability in VRL research [44][48]
- The article emphasizes the need for effective metrics that align with human perception and can validate the performance of VRL systems [61]

Future Directions and Challenges
- The article outlines four key challenges for the future of VRL: balancing depth and efficiency in reasoning, addressing long-term RL in VLA tasks, designing reward models for visual generation, and improving data efficiency and generalization capabilities [50][52][54]
- It suggests that future research should focus on integrating model-based planning, self-supervised visual pre-training, and adaptive curriculum learning to enhance the practical applications of VRL [57]
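Of the three alignment paradigms listed under the theoretical foundations, DPO is the one with a closed-form loss, so it makes the cleanest one-function sketch; RLHF and RLVR instead feed a learned reward model or a programmatic verifier into a policy-gradient update such as the PPO sketch earlier. The code below is a generic illustration with assumed input shapes, not the survey's implementation:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss over a batch of preference pairs.

    Each argument is a (B,) tensor of summed log-probabilities of the chosen /
    rejected response under the trained policy or the frozen reference policy.
    """
    chosen_logratio = logp_chosen - ref_logp_chosen        # log pi/pi_ref for preferred y_w
    rejected_logratio = logp_rejected - ref_logp_rejected  # log pi/pi_ref for dispreferred y_l
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```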
What Is the VLM Often Mentioned in Autonomous Driving, and How Does It Differ from VLA?
自动驾驶之心· 2025-08-08 16:04
Core Viewpoint
- The article discusses the significance of Vision-Language Models (VLM) in the context of autonomous driving, highlighting their ability to integrate visual perception and natural language processing to enhance vehicle understanding and interaction with complex road environments [4][19]

Summary by Sections

What is VLM?
- VLM stands for Vision-Language Model, which combines the capabilities of understanding images and text within a single AI system. It enables deep comprehension of visual content and natural language interaction, enhancing applications like image retrieval, writing assistance, and robotic navigation [6]

How to Make VLM Work Efficiently?
- VLM processes raw road images into feature representations using visual encoders, such as Convolutional Neural Networks (CNN) and Vision Transformers (ViT). Language encoders and decoders handle natural language input and output, learning semantic relationships between tokens [8]

Key Mechanism of VLM
- The alignment of visual features and language modules is crucial for VLM. Cross-attention mechanisms allow the language decoder to focus on relevant image areas when generating text, ensuring high consistency between generated language and actual scenes [9] (a minimal cross-attention sketch follows after this summary)

Training Process of VLM
- The training process for VLM typically involves pre-training on large datasets followed by fine-tuning with specific datasets related to autonomous driving scenarios, ensuring the model can accurately recognize and respond to traffic signs and conditions [11]

Applications of VLM
- VLM supports various intelligent functions, including real-time scene alerts, interactive semantic Q&A, and recognition of road signs and text. It can generate natural language prompts based on visual inputs, enhancing driver awareness and decision-making [12]

Real-time Operation of VLM
- VLM operates in a "cloud-edge collaboration" architecture, where large-scale pre-training occurs in the cloud, and optimized lightweight models are deployed in vehicles for real-time processing. This setup allows for quick responses to safety alerts and complex analyses [14]

Data Annotation and Quality Assurance
- Data annotation is critical for VLM deployment, requiring detailed labeling of images under various conditions. This process ensures high-quality training data, which is essential for the model's performance in real-world scenarios [14]

Safety and Robustness
- Safety and robustness are paramount in autonomous driving. VLM must quickly assess uncertainties and implement fallback measures when recognition errors occur, ensuring reliable operation under adverse conditions [15]

Differences Between VLA and VLM
- VLA (Vision-Language-Action) extends VLM by integrating action decision-making capabilities. While VLM focuses on understanding and expressing visual information, VLA encompasses perception, cognition, and execution, making it essential for real-world applications like autonomous driving [18]

Future Developments
- The continuous evolution of large language models (LLM) and large vision models (LVM) will enhance VLM's capabilities in multi-modal integration, knowledge updates, and human-machine collaboration, leading to safer and more comfortable autonomous driving experiences [16][19]
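To illustrate the cross-attention mechanism described under "Key Mechanism of VLM", here is a minimal sketch in which text tokens (queries) attend over visual patch features (keys and values) so the language decoder can ground generated text in image regions. The dimensions and the single residual block are illustrative assumptions rather than any particular production VLM:

```python
import torch
import torch.nn as nn

class TextToImageCrossAttention(nn.Module):
    """One cross-attention block: language-decoder tokens attend to visual features."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_tokens: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # text_tokens:  (B, L_text, d_model) hidden states from the language decoder
        # visual_feats: (B, L_patches, d_model) features from a CNN / ViT visual encoder
        attended, _ = self.attn(query=text_tokens, key=visual_feats, value=visual_feats)
        return self.norm(text_tokens + attended)     # residual connection + layer norm

# usage with random tensors standing in for real encoder outputs
block = TextToImageCrossAttention()
fused = block(torch.randn(2, 16, 512), torch.randn(2, 196, 512))
```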