Vision-Language-Action Models
Microsoft Research releases Rho-alpha robot model, integrating vision, language, and touch
Sou Hu Cai Jing· 2026-02-06 21:19
Core Insights
- Microsoft Research has launched Rho-alpha, a new robotic model designed to help robots understand natural-language commands and perform complex physical tasks in less structured environments [1]
- Rho-alpha aims to advance the next generation of robotic systems, enabling them to perceive, reason, and act in dynamic real-world settings [1]
- The model is part of a trend toward "vision-language-action" models that enhance the autonomy of physical systems [1]

Group 1
- Rho-alpha integrates touch data, and research is under way to support additional sensory modalities such as force sensing [2]
- The model is designed to improve continuously during deployment by learning from user feedback during interactions with robots [2]
- Training relies heavily on synthetic data, using a multi-stage process that combines reinforcement learning and simulation [2]

Group 2
- A major challenge for foundation models is the lack of diverse real-world robotic data [4]
- Researchers are collaborating with Microsoft to enrich pre-training datasets with synthetic demonstrations, addressing cases where teleoperation is impractical [4]
- NVIDIA emphasizes the role of synthetic data in accelerating robotics development, highlighting its collaboration with Microsoft to generate high-fidelity synthetic datasets [4]

Group 3
- Microsoft has opened registration for the Rho-alpha early-access program and plans to release more updates on its robotics research in the coming months [4]
Qualcomm unveils robotics chip architecture, betting on "Physical AI" | Live from CES
Xin Lang Ke Ji· 2026-01-05 19:58
Group 1
- Qualcomm launched a new robotics technology architecture and the Dragonwing IQ10 series processors at CES 2026, marking its entry into the industrial and humanoid robotics market [3]
- The Dragonwing IQ10 processor targets autonomous mobile robots (AMRs) and full-sized humanoid robots, integrating edge computing, edge AI, mixed-criticality systems, and machine-learning operations to serve as a high-efficiency "robot brain" [3]
- Qualcomm aims to compete with Nvidia in the next-generation robotics market, leveraging 40 years of experience in mobile chip technology to build advantages in power efficiency and scalability [3]

Group 2
- Qualcomm is building a comprehensive robotics ecosystem and has partnered with several robotics manufacturers, including Figure AI, Booster, VinMotion, and Kuka Robotics [3]
- The architecture supports end-to-end AI models such as vision-language-action (VLA) and vision-language (VLM) models, enabling advanced perception, motion planning, and human-robot interaction [3]
- Qualcomm's Snapdragon Cockpit Elite platform has become a de facto standard for high-end electric vehicles, with a revenue pipeline exceeding $45 billion from its automotive business [4]
Experts say large-scale deployment of embodied intelligence remains at an early stage
Zhong Guo Xin Wen Wang· 2025-12-13 12:33
Core Insights
- Embodied intelligence has achieved breakthroughs in both cognitive and physical intelligence, but large-scale deployment remains at an early stage [1][2]
- The future direction of embodied intelligence is characterized by ongoing competition and rapid evolution [1]

Group 1: Key Issues in the Industry
- The first core issue is the debate over model pathways: whether large-model paradigms apply to robotics. While large models have succeeded in the language, image, and video domains, it remains unproven whether the same paradigm can transfer directly to robot control [1]
- The second core issue is the contention over data training paradigms. Data remains the critical bottleneck limiting the leap in robotic capabilities, with approaches such as mixed data, multimodal data, and world-model-generated data being explored [1]
- The third core issue is the debate over robot form factors: whether humanoid robots represent a "true demand." Companies like Tesla and Figure AI pursue a fully humanoid approach, while several Chinese companies introduced "wheel-arm composite robots" this year, emphasizing "engineering feasibility" for scalable commercial applications in the short term [1]

Group 2: Future Development Paths
- There is industry consensus that using large models to enhance robots' generalization capabilities is essential, but effectively applying large models in robotic systems still involves multiple technical pathways [2]
- Looking ahead, the introduction of world models built on vision-language-action (VLA) models is expected to significantly enhance large models' capabilities in robotics by leveraging their understanding, prediction, and reasoning about the physical world [2]
Surpassing ORION! CoT4AD: a VLA model with explicit chain-of-thought reasoning (latest from Peking University)
自动驾驶之心· 2025-12-02 00:03
Core Insights
- The article introduces CoT4AD, a new Vision-Language-Action (VLA) framework designed to enhance logical and causal reasoning in autonomous-driving scenarios, addressing limitations of existing VLA models [1][3][10]

Background Review
- Autonomous driving is a key research area in AI and robotics, promising improvements in traffic safety and efficiency and playing a crucial role in smart-city and intelligent-transportation development [2]
- Traditional modular architectures in autonomous driving suffer from error accumulation and limited generalization, motivating end-to-end paradigms built on unified learning frameworks [2][3]

CoT4AD Framework
- CoT4AD integrates chain-of-thought reasoning into end-to-end autonomous driving, performing explicit or implicit reasoning through a series of downstream tasks tailored for driving scenarios [3][10]
- The framework combines perception, language reasoning, future prediction, and trajectory planning, enabling the generation of explicit reasoning steps [6][10]

Experimental Results
- CoT4AD was evaluated on the nuScenes and Bench2Drive datasets, achieving state-of-the-art performance in both open-loop and closed-loop assessments and outperforming existing LLM-based and end-to-end methods [10][19]
- On nuScenes, CoT4AD achieved L2 distance errors of 0.12 m, 0.24 m, and 0.53 m at 1 s, 2 s, and 3 s respectively, with an average collision rate of 0.10% [17][18]

Contributions of CoT4AD
- The model's design supports robust multi-task processing and future trajectory prediction, leveraging a diffusion model integrated with chain-of-thought reasoning [10][12]
- CoT4AD demonstrates superior performance in complex driving scenarios, improving decision-making consistency and reliability across diverse environments [19][23]

Ablation Studies
- Ablation studies validated components such as the perception tokenizer and the chain-of-thought design, showing significant performance gains when these elements were included [26][28]
- The model's ability to predict future scenarios proved crucial, with optimal performance achieved when predicting four future scenarios [29]

Conclusion
- CoT4AD represents a significant advance in autonomous-driving technology, demonstrating enhanced reasoning and superior performance over existing methods, while highlighting computational efficiency as an area for future work [30][32]
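To make the perception → chain-of-thought → diffusion-planner flow described above concrete, here is a minimal dataflow sketch. It is not the paper's code: every class and function name (PerceptionTokenizer-style stubs, chain_of_thought, diffusion_planner) is hypothetical, and the learned components are replaced by placeholder numpy operations that only illustrate the stage boundaries.

```python
# Minimal dataflow sketch of a CoT-style VLA driving stack (hypothetical names).
# Stages mirror the digest: perception tokens -> language reasoning ->
# diffusion-based trajectory planning.
from dataclasses import dataclass
import numpy as np

@dataclass
class DrivingContext:
    camera_tokens: np.ndarray      # (N, D) perception tokens from images
    instruction: str               # e.g. "turn left at the intersection"

def perception_tokenizer(image: np.ndarray, dim: int = 64) -> np.ndarray:
    """Stand-in for a learned tokenizer: project image patches into N tokens."""
    patches = image.reshape(16, -1)
    proj = np.random.default_rng(0).standard_normal((patches.shape[1], dim))
    return patches @ proj

def chain_of_thought(ctx: DrivingContext, n_steps: int = 3) -> list[str]:
    """Stand-in for the LM: emit explicit intermediate reasoning steps."""
    return [f"step {i}: assess scene for '{ctx.instruction}'" for i in range(n_steps)]

def diffusion_planner(tokens: np.ndarray, horizon: int = 6) -> np.ndarray:
    """Stand-in for a diffusion policy: iteratively denoise a trajectory."""
    traj = np.random.default_rng(1).standard_normal((horizon, 2))
    goal = tokens.mean(axis=0)[:2]            # conditioning signal
    for _ in range(10):                       # fixed denoising schedule
        traj += 0.1 * (goal - traj)           # pull samples toward condition
    return traj                               # (horizon, 2) waypoints

image = np.zeros((64, 64))
ctx = DrivingContext(perception_tokenizer(image), "turn left at the intersection")
print(chain_of_thought(ctx))
print(diffusion_planner(ctx.camera_tokens).shape)  # (6, 2)
```

The point of the sketch is the interface between stages: the reasoning output is produced explicitly (and can be inspected), while the planner consumes only the conditioned token summary, matching the "explicit reasoning steps" design the summary attributes to CoT4AD.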
The value of open source for robotics goes far beyond imagination | Tang Wenbin in an in-depth conversation with a Hugging Face co-founder
具身智能之心· 2025-10-21 00:03
Core Insights
- The article discusses a key challenge in robotics, the gap between simulation and real-world application, and introduces RoboChallenge.ai as a standardized evaluation platform for embodied intelligence [2][42][51]

Group 1: Current Challenges in Robotics
- Many models perform well in simulation but fail in real-world scenarios, a significant pain point in robotics research [2][42]
- A unified, open, and reproducible evaluation system for robotics is needed, as current benchmarks are primarily simulation-based [50][44]

Group 2: Introduction of RoboChallenge.ai
- RoboChallenge.ai launches as an open, standardized platform for evaluating robotic models in real-world environments, allowing researchers to remotely test their models on physical robots [6][51]
- The platform lets users drive locally hosted models through an API, enabling remote testing without uploading the models themselves, as sketched below [8][53]

Group 3: Importance of Open Source in Robotics
- Open source is identified as a crucial driver of advances in AI and robotics, enabling collaboration and innovation across global teams [10][19]
- The article argues that open source may matter even more in robotics than in large language models (LLMs), because applying a model requires access to hardware [20][22]

Group 4: Future Directions and Community Involvement
- The article anticipates that the next three to five years will bring significant evolution in embodied-intelligence research, with robots executing longer and more complex tasks [82]
- Community participation is encouraged, with the expectation that diverse contributions will improve data availability and model robustness [66][68]
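As noted in Group 2, the model stays on the researcher's machine and only observations and actions cross the wire. The sketch below illustrates that interaction pattern only; the base URL, endpoint paths, and payload fields are all invented for illustration and are not RoboChallenge.ai's actual API.

```python
# Illustrative client loop for a "model stays local" evaluation API
# (hypothetical endpoints; the real RoboChallenge.ai API may differ).
import requests

BASE = "https://api.example-robochallenge.ai/v1"  # placeholder URL
SESSION = requests.Session()

def local_policy(observation: dict) -> list[float]:
    """Your locally hosted model; here a trivial stand-in action."""
    return [0.0] * 7  # e.g. 7-DoF joint deltas

def run_episode(task_id: str, max_steps: int = 200) -> None:
    # The server owns the physical robot; the client only exchanges
    # observations and actions, so model weights never leave the machine.
    ep = SESSION.post(f"{BASE}/episodes", json={"task": task_id}).json()
    for _ in range(max_steps):
        obs = SESSION.get(f"{BASE}/episodes/{ep['id']}/observation").json()
        action = local_policy(obs)
        step = SESSION.post(f"{BASE}/episodes/{ep['id']}/action",
                            json={"action": action}).json()
        if step.get("done"):
            break

# run_episode("table-top-pick-place")  # requires a live endpoint
```

The design choice worth noting is that evaluation is server-side and physical, so results are reproducible across labs without anyone shipping model weights or buying identical hardware.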
Figure 03 makes a stunning debut: 100,000 units in four years, and human housekeepers may be out of a job
36Kr· 2025-10-10 10:50
Core Insights
- Figure 03 marks the official launch of the company's next-generation humanoid robot, signaling the start of the general-purpose-robot era [1][3]
- The robot is designed for Helix, home use, and global scalability, with significant upgrades in design and functionality [3][8]

Design and Features
- Figure 03 features a flexible fabric outer layer in place of a mechanical shell and integrates a wide-angle camera in each palm [3][8]
- The robot can perform tasks such as watering plants, serving tea, and cleaning, showcasing unprecedented intelligence and adaptability [3][8]
- Each fingertip can sense a pressure of 3 grams, enough to detect even the weight of a paperclip [3][17][20]

Technological Advancements
- The robot is powered by Helix, an in-house vision-language-action model, enabling it to learn and operate autonomously in complex environments [10]
- The vision system has been optimized for high-frequency motion control, with the frame rate doubled and latency cut to a quarter [11][12]
- Figure 03 supports 10 Gbps millimeter-wave data offloading, allowing a fleet of robots to learn and improve continuously [18]

Manufacturing and Scalability
- Figure aims to produce 100,000 units over the next four years, establishing the BotQ factory with an initial annual capacity of 12,000 units [8][22]
- The design allows seamless transitions between home and commercial applications, with improved speed and torque density for faster operation [21][22]

User Experience and Safety
- The robot features a soft, adaptable design for safety and ease of use, with a 9% weight reduction over its predecessor [19]
- It includes wireless charging and a robust battery-management system certified to international standards [19][24]
- Users can customize the robot's appearance with removable, washable coverings [24]
Robots teach themselves new skills by "watching videos": NovaFlow extracts action flows from generated videos for zero-shot manipulation
机器之心· 2025-10-09 02:24
Core Insights
- The article describes NovaFlow, a novel framework that enables robots to perform complex manipulation tasks without extensive training data or demonstrations, using large video-generation models to extract common-sense knowledge from internet-scale video [2][4][23]

Group 1: NovaFlow Framework Overview
- NovaFlow decouples task understanding from low-level control, letting robots learn from generated videos rather than human demonstrations or trial-and-error [4][23]
- The framework has two main components, the Actionable Flow Generator and the Flow Executor, which together translate natural-language instructions into executable 3D object flows [8][9]

Group 2: Actionable Flow Generation
- The Actionable Flow Generator turns user input (natural language plus an RGB-D image) into a 3D action flow through a four-step process: video generation, 2D-to-3D lifting, 3D point tracking, and object segmentation [9][12][14]
- It uses state-of-the-art video-generation models to create instructional videos, which are then processed to extract actionable 3D object flows [12][14]

Group 3: Action Flow Execution
- The Flow Executor converts the abstract 3D object flows into concrete robot action sequences, applying different strategies depending on the type of object being manipulated [15][20]
- The framework has been tested on multiple robotic platforms, demonstrating effectiveness on rigid, articulated, and deformable objects [16][18]

Group 4: Experimental Results
- NovaFlow outperformed other zero-shot methods and even surpassed imitation-learning approaches that required multiple demonstrations, showcasing the value of extracting common-sense knowledge from generated videos [19][20]
- It achieved high success rates on rigid- and articulated-object tasks as well as more complex deformable-object tasks, indicating robustness and versatility [19][20]

Group 5: Challenges and Future Directions
- Despite these successes, the research notes limitations of the current open-loop planning system, particularly during physical execution, suggesting the need for closed-loop feedback to handle real-world uncertainty [23]
- Future work will focus on systems that dynamically adjust or replan actions based on real-time environmental feedback [23]
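To make the two-stage split concrete, here is a minimal dataflow sketch of a NovaFlow-style pipeline. Every component (video generator, 2D-to-3D lifter plus point tracker, flow executor) is stubbed with placeholder numpy code, and all function names are hypothetical rather than taken from the paper.

```python
# Dataflow sketch of a NovaFlow-style pipeline (hypothetical function names).
# Stage 1 distills a generated video into a 3D object flow; stage 2 turns
# that flow into robot motions. Learned components are numpy stubs.
import numpy as np

def generate_video(instruction: str, rgbd: np.ndarray) -> np.ndarray:
    return np.zeros((16, 64, 64, 3))         # stub: T frames of generated video

def lift_and_track(video: np.ndarray, n_points: int = 32) -> np.ndarray:
    # Stub for 2D->3D lifting + point tracking: (T, N, 3) object point flow.
    t = video.shape[0]
    return np.cumsum(np.full((t, n_points, 3), 0.01), axis=0)

def execute_flow(flow: np.ndarray) -> list[np.ndarray]:
    # Follow the flow's per-step centroid displacement with the end effector;
    # a real executor would branch on rigid / articulated / deformable objects.
    centroids = flow.mean(axis=1)             # (T, 3)
    return [b - a for a, b in zip(centroids, centroids[1:])]

rgbd = np.zeros((64, 64, 4))
flow = lift_and_track(generate_video("open the drawer", rgbd))
actions = execute_flow(flow)
print(len(actions), actions[0])               # 15 end-effector deltas
```

The sketch also shows why the approach is zero-shot: no stage is trained on robot demonstrations; the only task-specific input is the language instruction and one RGB-D view of the scene.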
Yuanrong Qixing launches all-new assisted-driving platform
Shen Zhen Shang Bao· 2025-08-27 07:05
Core Viewpoint
- Yuanrong Qixing launched its next-generation assisted-driving platform, DeepRoute IO 2.0, featuring a self-developed VLA (Vision-Language-Action) model that significantly improves safety and comfort over traditional end-to-end models [1]

Group 1: Technology and Innovation
- The VLA model integrates visual perception, semantic understanding, and action decision-making, making it better suited to handling complex road conditions [1]
- DeepRoute IO 2.0 follows a "multi-modal + multi-chip + multi-vehicle" adaptation concept, supporting both LiDAR and pure-vision versions for customized deployment across mainstream passenger-car platforms [1]
- The VLA model addresses the "black box" problem of traditional models by linking and analyzing information to infer causal relationships, and it is natively integrated with a large knowledge base, improving generalization in dynamic real-world environments [1]

Group 2: Commercialization and Partnerships
- Yuanrong Qixing has laid a solid foundation for mass production and commercialization, securing targeted collaborations covering more than 10 vehicle models [1]
VLA+RL or pure reinforcement? Tracing the development of reinforcement learning through 200+ papers
具身智能之心· 2025-08-18 00:07
Core Insights
- The article offers a comprehensive analysis of the intersection of reinforcement learning (RL) and visual intelligence, tracing the evolution of strategies and key research themes in visual reinforcement learning [5][17][25]

Group 1: Key Themes in Visual Reinforcement Learning
- Over 200 representative studies are organized into four pillars: multimodal large language models, visual generation, unified model frameworks, and vision-language-action models [5][17]
- Each pillar is examined for algorithm design, reward engineering, and benchmark progress, highlighting trends and open challenges in the field [5][17][25]

Group 2: Reinforcement Learning Techniques
- Techniques discussed include Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), used to improve stability and efficiency in training [15][16]
- Reward models, including those based on human feedback and verifiable rewards, are central to guiding the training of visual RL agents [10][12][21]

Group 3: Applications in Visual and Video Reasoning
- Applications of RL span 2D and 3D perception, image reasoning, and video reasoning, showing how these methods improve task performance [18][19][20]
- Specific studies use RL to strengthen complex visual tasks such as object detection and spatial reasoning [18][19][20]

Group 4: Evaluation Metrics and Benchmarks
- New evaluation metrics tailored to large-model visual RL are needed, combining traditional metrics with preference-based assessments [31][35]
- An overview of benchmarks supporting training and evaluation in the visual domain emphasizes the role of human-preference data in shaping reward models [40][41]

Group 5: Future Directions and Challenges
- Key challenges include balancing depth and efficiency in reasoning processes, with future research directions suggested to address them [43][44]
- Developing adaptive strategies and hierarchical RL approaches is highlighted as important for improving the performance of vision-language-action agents [43][44]
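As a concrete reference for the GRPO technique named above: it replaces PPO's learned value baseline with a group-relative one, sampling several responses per prompt and standardizing each reward within its group. A minimal sketch of the advantage computation in the standard formulation (variable names are illustrative):

```python
# Group-relative advantage as used in GRPO-style training: sample a group of
# rollouts per prompt and standardize each reward within its group, replacing
# PPO's learned value baseline.
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """rewards: (num_prompts, group_size) scalar rewards per sampled response."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# One prompt, four sampled responses scored by a reward model:
rewards = np.array([[0.2, 0.9, 0.4, 0.9]])
print(grpo_advantages(rewards))  # above-mean samples get positive advantage
```

Dropping the value network is what makes GRPO attractive for large vision-language models, where training a separate critic over high-dimensional inputs is costly.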
Latest survey on visual reinforcement learning: a field-wide review (NUS, Zhejiang University & CUHK)
自动驾驶之心· 2025-08-16 00:03
Core Insights
- The article examines the integration of reinforcement learning with computer vision, marking a paradigm shift in how AI interacts with visual data [3][4]
- It highlights AI's potential not only to understand but also to create and optimize visual content according to human preferences, turning AI from a passive observer into an active decision-maker [4]

Research Background and Overview
- The rise of visual reinforcement learning (VRL) is driven by the successful application of RL in large language models (LLMs) [7]
- Three core challenges are identified: stable policy optimization under complex reward signals, efficient processing of high-dimensional visual inputs, and scalable reward-function design for long-horizon decision-making [7][8]

Theoretical Foundations of Visual Reinforcement Learning
- The theoretical framework formalizes the problem as a Markov Decision Process (MDP), unifying RL formulations for text and visual generation [15]
- Three main alignment paradigms are proposed: RL from human feedback (RLHF), Direct Preference Optimization (DPO), and Reinforcement Learning with Verifiable Rewards (RLVR) [16][18]

Core Applications of Visual Reinforcement Learning
- VRL research is categorized into four areas: multimodal large language models (MLLM), visual generation, unified models, and vision-language-action (VLA) models [31]
- Each area is subdivided into specific tasks, with representative works analyzed for their contributions [31][32]

Evaluation Metrics and Benchmarking
- A layered evaluation framework is proposed, detailing benchmarks for each area to ensure reproducibility and comparability in VRL research [44][48]
- Effective metrics must align with human perception and validate the performance of VRL systems [61]

Future Directions and Challenges
- Four key challenges are outlined: balancing depth and efficiency in reasoning, long-horizon RL for VLA tasks, reward-model design for visual generation, and improving data efficiency and generalization [50][52][54]
- Future research should integrate model-based planning, self-supervised visual pre-training, and adaptive curriculum learning to broaden VRL's practical applications [57]
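For reference, the DPO paradigm listed above sidesteps an explicit reward model by optimizing a closed-form preference objective. In its standard formulation (stated here from the original DPO work, not quoted from this survey):

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta;\pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\right)\right]
$$

where $y_w$ and $y_l$ are the preferred and dispreferred outputs for prompt $x$ (in the visual setting, e.g. two generated images ranked by a human), $\pi_{\text{ref}}$ is a frozen reference policy, $\sigma$ is the logistic function, and $\beta$ controls how far the trained policy may drift from the reference.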