UniVLA
Sim2Real Cannot Solve Embodied Intelligence's Data Dilemma
自动驾驶之心· 2025-10-03 03:32
Source: 具身智能之心 (shared for academic purposes only). Physical Intelligence (PI) co-founder and embodied-intelligence pioneer Sergey Levine has long insisted that surrogate data is a "spork" (a fork-spoon hybrid that is worse than either a spoon or a fork) and that real interaction data is irreplaceable. Is this a limitation of today's policies, or an iron law of the data itself? Now Genie3 has arrived with its world model, capable of generating interactive dynamic environments from text and even driving online planning. Does this mean we stand on the eve of the end of the "simulation" vs. "reality" dichotomy? Will world models become the ultimate answer to the data problem, or are they merely sim in another form, still unable to escape the Sim-to-Real gap? For this technical roundtable, we invited four outstanding young scientists from China's Sim2Real community to explore these questions in depth, covering high-fidelity 3D asset construction, the physical bottlenecks of neural rendering, articulated-body structure optimization, and the decoupled design of VLA models: does the data road of embodied intelligence lead to simulation, to reality, or to the awakening "world model"? Academic leaders in autonomous driving and future academic leaders in embodied AI, Un ...
FlowVLA: Solving the "Physical Distortion" Problem of VLA Models and Upgrading Robot World Modeling
具身智能之心· 2025-08-29 00:03
Core Viewpoint
- The article discusses the limitations of traditional Vision-Language-Action (VLA) models and introduces FlowVLA, a new framework that addresses them with a Visual Chain of Thought (Visual CoT) principle, enabling the model to predict future frames through structured physical reasoning rather than mere pixel replication [5][8][36].

Group 1: Background and Current State
- VLA models pre-trained as world models show significant potential for general robotics, primarily through large autoregressive Transformers that learn environmental dynamics from vast amounts of video data [6][7].
- Existing models face critical flaws: task confusion leading to prediction failures, inefficient knowledge transfer between passive observation and active control, and entangled learning of dynamics and appearance [7].

Group 2: Contributions of FlowVLA
- FlowVLA introduces a learning framework that emphasizes structured physical reasoning by requiring the model to infer motion dynamics before predicting future frames [8][10].
- The model unifies appearance and motion reasoning within a single autoregressive Transformer, maintaining parameter efficiency and architectural simplicity [9][10].
- Experimental results validate FlowVLA's superior performance across various robotic manipulation benchmarks, demonstrating enhanced sample efficiency and bridging the gap between pre-training and policy fine-tuning [10][20].

Group 3: Research Content
- The Visual CoT reasoning process decomposes frame prediction into a causal chain of "current frame → optical flow → future frame," allowing the model to separate dynamics learning from appearance learning [12][14].
- The two-phase training paradigm consists of a pre-training phase focused on world-model learning and a fine-tuning phase for adapting to control tasks [15][16].

Group 4: Experimental Analysis
- FlowVLA outperforms existing methods on the LIBERO dataset across all task suites, particularly excelling in long-horizon tasks, showcasing its robust understanding of physical dynamics [20][21].
- On the SimplerEnv dataset, FlowVLA demonstrates strong adaptability to visual domain shifts, achieving significant performance improvements on tasks where other models struggle [22][23].
- The model's sample efficiency is validated: it requires only one-third of the training steps to reach peak performance compared to baseline models, with a 55% higher peak success rate in low-data scenarios [30][32].

Group 5: Key Component Validation
- Ablation studies on the LIBERO-10 benchmark highlight the importance of the Visual CoT structure, the flow loss, and the interleaved sequence format, confirming their critical roles in the model's performance [33][34].

Group 6: Comparison with Related Work
- FlowVLA distinguishes itself from traditional VLA models by prioritizing dynamic understanding and establishing a robust world model before adapting to control tasks, thus laying a solid foundation of physical knowledge [35].
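The interleaved sequence format validated in the ablations above can be illustrated with a minimal plain-Python sketch of the Visual CoT chain "current frame → optical flow → future frame". The chunk sizes and token IDs below are illustrative assumptions, not FlowVLA's actual tokenizer output.

```python
def build_visual_cot_sequence(frame_chunks, flow_chunks):
    """Interleave frame and flow token chunks as [f_0, w_0, f_1, w_1, ..., f_T]:
    the model must emit the flow tokens for step t before it may emit frame t+1."""
    if len(frame_chunks) != len(flow_chunks) + 1:
        raise ValueError("need exactly one flow chunk between adjacent frames")
    seq = []
    for frame, flow in zip(frame_chunks, flow_chunks):
        seq.extend(frame)  # appearance tokens for frame t
        seq.extend(flow)   # motion tokens: where the pixels of frame t move
    seq.extend(frame_chunks[-1])  # final frame has no flow after it
    return seq

# Three frames and two transitions, with 4 frame tokens and 2 flow tokens
# per chunk (dummy IDs).
frames = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
flows = [[100, 101], [102, 103]]
seq = build_visual_cot_sequence(frames, flows)
```

Because flow tokens precede each future frame in the sequence, a standard causal attention mask is enough to force the model to reason about motion before committing to appearance.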
Major Livestream! RoboTwin2.0: A Strong-Domain-Randomization Dual-Arm Manipulation Data Generator and Evaluation Benchmark
具身智能之心· 2025-07-15 13:49
Core Viewpoint
- The article discusses the challenges and advances in training dual-arm robots for complex tasks, emphasizing the need for efficient data collection and simulation methods to enhance their operational capabilities [2].

Group 1: Challenges in Dual-Arm Robot Training
- Dual-arm robots play a crucial role in collaborative assembly, tool use, and object handover in complex scenarios, but training general manipulation policies such as VLA models for them faces multiple bottlenecks [2].
- The cost and time required to scale up the collection of real demonstration data are high, making it difficult to cover a wide range of tasks, object shapes, and hardware variations [2].
- Existing simulation methods lack efficient, scalable expert-data generation techniques for new tasks, and their domain randomization designs are too superficial to capture the complexity of real environments [2].

Group 2: Advances and Solutions
- The article highlights UniVLA, which efficiently exploits multi-source heterogeneous data to construct a general, scalable action space for robots [5].
- The CVPR champion solution BridgeVLA reportedly improves real-robot performance by 32%, showcasing advances in robot navigation and motion control in real-world scenarios [4].
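The "strong domain randomization" idea above, i.e. resampling scene and physics parameters for every generated episode, can be sketched as follows. The parameter names and ranges are illustrative assumptions, not RoboTwin2.0's actual API.

```python
import random

def sample_domain_config(rng):
    """Draw one randomized scene/physics configuration for a simulated episode."""
    return {
        "light_intensity": rng.uniform(0.3, 1.5),    # relative brightness
        "table_texture_id": rng.randrange(200),      # index into a texture bank
        "camera_jitter_deg": rng.uniform(-5.0, 5.0), # extrinsic perturbation
        "object_mass_scale": rng.uniform(0.8, 1.2),  # physics randomization
        "distractor_count": rng.randrange(6),        # clutter objects, 0-5
    }

def generate_episode_configs(n, seed=0):
    """Seeded, so a whole dataset of randomized episodes is reproducible."""
    rng = random.Random(seed)
    return [sample_domain_config(rng) for _ in range(n)]

configs = generate_episode_configs(100)
```

Seeding the generator is the design choice that matters here: it lets an evaluation benchmark replay exactly the same randomized episodes across different policies.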
A New Breakthrough in Unified VLA Architecture: Autoregressive World Models Lead Embodied Intelligence
机器之心· 2025-07-10 04:26
Core Viewpoint
- The article discusses the development of a new unified Vision-Language-Action (VLA) model architecture called UniVLA, which enhances the integration of visual, language, and action signals for improved decision-making in embodied intelligence tasks [4][5][13].

Group 1: Model Architecture and Mechanism
- UniVLA is based on a fully discrete, autoregressive mechanism that natively models visual, language, and action signals, incorporating world-model training to learn temporal information and causal logic from large-scale videos [5][9][14].
- The framework transforms visual, language, and action signals into discrete tokens, creating interleaved multimodal temporal sequences for unified modeling [9][10].

Group 2: Performance and Benchmarking
- UniVLA has set new state-of-the-art (SOTA) records across major embodied intelligence benchmarks such as CALVIN, LIBERO, and SimplerEnv, demonstrating strong performance advantages [18][21].
- On the CALVIN benchmark, UniVLA achieved an average score of 95.5%, significantly outperforming previous models [19].

Group 3: Training Efficiency and Generalization
- The world-model post-training stage significantly enhances downstream decision-making performance without relying on extensive action data, using only vast amounts of video for efficient learning [14][15].
- The model supports unified training for various tasks, including visual understanding, video generation, and action prediction, showcasing its versatility and data scalability [10][24].

Group 4: Future Directions
- The article suggests exploring deeper integration of the UniVLA framework with multimodal reinforcement learning to enhance its perception, understanding, and decision-making in open-world scenarios [24].
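The "fully discrete, interleaved" design described above can be sketched in a few lines: each modality's tokens are offset into a disjoint ID range and concatenated chronologically, so a single autoregressive model can attend over all three. The vocabulary layout and special token IDs below are assumptions for illustration, not UniVLA's actual tokenizer.

```python
# Disjoint ID ranges per modality, plus sequence delimiters (hypothetical values).
VISION_BASE, LANG_BASE, ACTION_BASE = 0, 10_000, 20_000
BOS, EOS = 30_000, 30_001

def encode_step(vision_ids, lang_ids, action_ids):
    """Map each modality's local token IDs into its global range, then
    concatenate them in the order the model is trained to predict them."""
    return ([v + VISION_BASE for v in vision_ids]
            + [t + LANG_BASE for t in lang_ids]
            + [a + ACTION_BASE for a in action_ids])

def build_sequence(steps):
    """steps: list of (vision_ids, lang_ids, action_ids) tuples, one per timestep."""
    seq = [BOS]
    for vision_ids, lang_ids, action_ids in steps:
        seq += encode_step(vision_ids, lang_ids, action_ids)
    seq.append(EOS)
    return seq

# Two timesteps; the second has no new language tokens.
seq = build_sequence([([3, 7], [42], [5]), ([8, 1], [], [6])])
```

Because all modalities share one flat vocabulary, the same next-token objective covers visual understanding, video generation, and action prediction, which is what makes the unified training described above possible.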
UniVLA, Jointly Launched by Zhiyuan Robotics and the University of Hong Kong, Accepted at RSS | Investment Research Report
Zhong Guo Neng Yuan Wang· 2025-05-16 01:43
Market Performance
- On May 14, 2025, the CSI 300 index rose 1.21%, while the machinery sector declined 0.43%, ranking 29th among all primary industries [2][1]
- Within the sub-sectors, semiconductor equipment posted the largest gain at 0.79%, while engineering machinery saw the largest drop at 1.96% [2][1]
- The top three gainers among individual stocks were Heng Er Da (+20.00%), Zhong Ji Huan Ke (+19.97%), and Da Ye Co. (+12.98%); the top three losers were Magnetic Valley Technology (-8.20%), Xin Yu Ren (-7.46%), and De Ma Technology (-6.19%) [2][1]

Company Announcements
- New Era's shareholder Wang Chunxiang plans to reduce his stake by 0.15% through block trading or centralized bidding, having previously held 2.12% [3]
- Guangge Technology's major shareholder Beijing Jishi Chuangye Investment Fund reduced its stake by 0.27% from 5.00% between May 7 and May 13, 2025 [3]
- Fengxing Co.'s major shareholder Jiangxi Taihao Technology Development Co. has reduced its stake by 1.02% from 7.92% through centralized bidding [3]
- Zhuozhao Point Glue's shareholder Yinghao (Hainan) Venture Capital Co. has reduced its stake by 0.2914% from 1.2230% through centralized bidding [3]

Industry News
- Zhiyuan Robotics and the University of Hong Kong launched UniVLA, a new framework for universal policy learning in robotics with cross-domain, cross-scenario, and cross-task capabilities [6]
- UniVLA's core innovation is a task-centric latent action space that enables efficient learning from vast amounts of unlabeled video data, achieving state-of-the-art performance with significantly lower computational resources [6]
- The model demonstrated an average success-rate improvement of 18.5% across four evaluation metrics and achieved state-of-the-art results with only 10% of the data on specific tasks [6]
- The first practical quantum-resistant chip, "Mi Xin PQC01", was released by Zhengzhou Xinda Yimi Technology Co., featuring 100% domestic production and core technology [7][8]
- The chip supports dynamic switching between quantum-resistant and classical algorithms, is built on a 28nm process, and reduces power consumption by 60%, making it suitable for IoT and mobile devices [8]