UniVLA
"CV Top-Conference King" Li Hongyang Moves into Embodied Intelligence!
自动驾驶之心· 2025-12-15 00:04
Core Insights
- RoboX is focusing on embodied intelligence and has entered the robotics manipulation sector, led by researcher Li Hongyang of the University of Hong Kong and Shanghai AI Laboratory [3][4]
- The company has formed a research team of several dozen members, covering VLA, robotics, autonomous driving, and edge computing chips [4]
- Li Hongyang's research has significantly advanced autonomous driving technology, particularly through the UniAD framework, which integrates various tasks into a single end-to-end network [6][7]

Research Achievements
- The UniAD framework has outperformed state-of-the-art methods on the nuScenes dataset, demonstrating its effectiveness in autonomous driving [6]
- His earlier method, BEVFormer, was recognized as one of the top 100 AI papers of 2022 and became an industry benchmark for visual detection [7]
- The team has built "AgiBot World," a large-scale real-robot manipulation dataset spanning multiple industry scenarios [7]

Future Directions
- The upcoming paper "UniVLA: Learning to Act Anywhere with Task-centric Latent Actions" introduces a task-centric latent action framework that improves robot policy learning across environments [9][10]
- UniVLA reduces reliance on labeled data, achieving top performance with minimal data on multi-task benchmarks and supporting efficient transfer from internet videos to real robots [10]
- The company aims to establish a full-stack in-house research route, with a vision of improving few-shot generalization for humanoid robots across applications [10][11]

Industry Trends
- Embodied intelligence is gaining traction, with several academic experts transitioning into the field, indicating growing interest and investment in the sector [11]
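The task-centric latent action idea can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the names (`encode_transition`, the codebook size, the latent dimension) are all invented, and a real system would use trained networks for both the inverse-dynamics encoder and the codebook. The point it shows is that discrete "latent actions" can be mined from raw video frame pairs with no action labels.

```python
import numpy as np

rng = np.random.default_rng(0)

CODEBOOK_SIZE = 16   # number of discrete latent actions (illustrative)
LATENT_DIM = 8

# Stand-in for a learned codebook of latent action prototypes.
codebook = rng.normal(size=(CODEBOOK_SIZE, LATENT_DIM))

def encode_transition(frame_t: np.ndarray, frame_t1: np.ndarray) -> np.ndarray:
    """Toy inverse-dynamics encoder: summarizes the change between frames.
    A real system would use a trained network here."""
    diff = (frame_t1 - frame_t).ravel()
    # Crude projection of the frame difference down to the latent dimension.
    return diff[:LATENT_DIM]

def latent_action_token(frame_t, frame_t1) -> int:
    """Snap the transition to its nearest codebook entry (vector quantization),
    yielding a discrete latent-action token."""
    z = encode_transition(frame_t, frame_t1)
    dists = np.linalg.norm(codebook - z, axis=1)
    return int(np.argmin(dists))

# Unlabeled video: tokens can be extracted from any consecutive frame pair.
frames = rng.normal(size=(5, 4, 4))  # five tiny synthetic "frames"
tokens = [latent_action_token(frames[i], frames[i + 1]) for i in range(4)]
print(tokens)  # four discrete latent actions, one per transition
```

Once such tokens exist, a small amount of labeled data is enough to decode them into real robot actions, which is consistent with the "minimal labeled data" claim above.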
Sim2Real Can't Solve Embodied Intelligence's Data Dilemma
自动驾驶之心· 2025-10-03 03:32
Core Viewpoint
- The article covers the ongoing debate in embodied intelligence between relying on simulation efficiency and real-world data, and the potential of world models to redefine how data is used in the field [4][8]

Group 1: Understanding the Sim-to-Real Gap
- The "Sim-to-Real gap" refers to discrepancies between simulated environments and real-world scenarios, caused primarily by incomplete simulations that fail to reproduce visual and physical detail [8]
- The gap persists because simulation models do not capture the full complexity of the real world, leading to limited generalization and overfitting to specific scenarios [8][11]
- Proposed mitigations center on the data side: designing virtual-to-real data ratios and using AIGC to generate diverse datasets that balance volume and authenticity [11][12]

Group 2: Data Utilization in Embodied Intelligence
- Experts agree that while real data is ideal for training, the scarcity of high-quality real-world datasets currently forces a reliance on simulation data [20][21]
- Simulation data is crucial for foundational model iteration and testing, allowing safe and efficient algorithm validation before deployment on real machines [21][24]
- Well-built simulators can also scale reinforcement learning through large-scale parallel training, letting models learn from scenarios that are hard to capture in real life [24][26]

Group 3: World Models and Future Directions
- The article highlights world models as a key direction for future research in autonomous driving and embodied intelligence, given their potential for general visual understanding and long-horizon planning [30][32]
- Challenges remain in automating the generation of simulation data and ensuring the diversity and generalization of actions within simulations, both critical to advancing the field [28][29]
- Adding new modalities such as force and touch to world models is a promising direction, despite current limits on computational resources [30][31]

Group 4: Reaction to Boston Dynamics Technology
- Experts acknowledge the advanced capabilities of Boston Dynamics robots, particularly their smooth execution of complex tasks requiring sophisticated motion control [33][37]
- The discussion underscores the importance of hardware and data in embodied intelligence, with Boston Dynamics' approach serving as a benchmark for future development [37][39]
- The consensus is that the robots' seamless performance stems not only from hardware differences but also from superior motion control techniques that could inform future embodied-intelligence research [39][41]
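The "designing virtual and real data ratios" mitigation mentioned above can be sketched as a simple batch sampler. This is a hypothetical illustration, not code from the article: the function name and the 25% real fraction are invented, and real pipelines would weight sampling by task and quality rather than a single global ratio.

```python
import random

def mixed_batch(sim_data, real_data, batch_size, real_fraction=0.25, seed=0):
    """Draw a training batch with a fixed fraction of real samples
    (sampled with replacement, since real data is scarce)."""
    rng = random.Random(seed)
    n_real = int(batch_size * real_fraction)
    batch = [rng.choice(real_data) for _ in range(n_real)]
    batch += [rng.choice(sim_data) for _ in range(batch_size - n_real)]
    rng.shuffle(batch)  # avoid the model seeing real/sim blocks in order
    return batch

sim = [("sim", i) for i in range(1000)]   # cheap and plentiful
real = [("real", i) for i in range(20)]   # expensive and scarce
batch = mixed_batch(sim, real, batch_size=32)
print(sum(1 for src, _ in batch if src == "real"))  # 8 real samples per batch
```

Fixing the ratio lets the policy exploit abundant simulation data without drifting too far from the scarce real-world distribution.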
FlowVLA: Cracking the "Physical Distortion" Problem in VLA Models, Another Upgrade for Robot World Modeling
具身智能之心· 2025-08-29 00:03
Core Viewpoint
- The article discusses the limitations of traditional Vision-Language-Action (VLA) models and introduces FlowVLA, a framework that applies a Visual Chain of Thought (Visual CoT) principle so the model predicts future frames through structured physical reasoning rather than mere pixel replication [5][8][36]

Group 1: Background and Current State
- VLA models pre-trained as world models show significant potential for general robotics, chiefly via large autoregressive Transformers that learn environmental dynamics from vast video data [6][7]
- Existing models have critical flaws: task confusion that causes prediction failures, inefficient knowledge transfer between passive observation and active control, and entangled learning of dynamics and appearance [7]

Group 2: Contributions of FlowVLA
- FlowVLA introduces a learning framework built on structured physical reasoning, requiring the model to infer motion dynamics before predicting future frames [8][10]
- It unifies appearance and motion reasoning within a single autoregressive Transformer, preserving parameter efficiency and architectural simplicity [9][10]
- Experiments validate FlowVLA's superior performance across robotic manipulation benchmarks, with better sample efficiency and a narrower gap between pre-training and policy fine-tuning [10][20]

Group 3: Research Content
- The Visual CoT reasoning process decomposes frame prediction into a causal chain of "current frame → optical flow → future frame," letting the model disentangle dynamics from appearance [12][14]
- Training follows a two-phase paradigm: a pre-training phase for world-model learning and a fine-tuning phase for adapting to control tasks [15][16]
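The "current frame → optical flow → future frame" chain implies an interleaved training sequence. The sketch below is a guess at what that format could look like; the token names and helper function are invented for illustration and are not FlowVLA's code. It shows the key structural idea: motion (flow) tokens are placed between frame tokens, so an autoregressive model must emit flow before the next frame.

```python
def interleave_frames_and_flow(frame_tokens, flow_tokens):
    """frame_tokens: one token list per frame; flow_tokens: one token list per
    transition (so len(flow_tokens) == len(frame_tokens) - 1).
    Returns a single flat sequence [frame_0, flow_0, frame_1, flow_1, ...]."""
    assert len(flow_tokens) == len(frame_tokens) - 1
    seq = []
    for i, frame in enumerate(frame_tokens):
        seq.extend(frame)
        if i < len(flow_tokens):      # no flow after the final frame
            seq.extend(flow_tokens[i])
    return seq

# Toy tokens: two tokens per frame, one per flow field.
frames = [["f0a", "f0b"], ["f1a", "f1b"], ["f2a", "f2b"]]
flows = [["w0"], ["w1"]]
print(interleave_frames_and_flow(frames, flows))
# -> ['f0a', 'f0b', 'w0', 'f1a', 'f1b', 'w1', 'f2a', 'f2b']
```

Under next-token training on such sequences, predicting frame t+1 is conditioned on the flow the model itself just produced, which is one way to force dynamics to be reasoned about before appearance.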
Group 4: Experimental Analysis
- FlowVLA outperforms existing methods on the LIBERO dataset across all task sets, excelling particularly at long-horizon tasks and showing a robust understanding of physical dynamics [20][21]
- On the SimplerEnv dataset, FlowVLA adapts well to visual domain shifts, with large performance gains on tasks where other models struggle [22][23]
- Its sample efficiency is validated: it needs only one-third of the training steps of baseline models to reach peak performance, with a 55% higher peak success rate in low-data regimes [30][32]

Group 5: Key Component Validation
- Ablation studies on the LIBERO-10 benchmark confirm the critical roles of the Visual CoT structure, the flow loss, and the interleaved sequence format [33][34]

Group 6: Comparison with Related Work
- FlowVLA differs from traditional VLA models by prioritizing dynamic understanding and building a robust world model before adapting to control tasks, laying a solid foundation of physical knowledge [35]
Major Livestream! RoboTwin2.0: A Dual-Arm Manipulation Data Generator and Benchmark Suite with Strong Domain Randomization
具身智能之心· 2025-07-15 13:49
Core Viewpoint
- The article discusses the challenges and advances in training dual-arm robots for complex tasks, stressing the need for efficient data collection and simulation methods to improve their operational capabilities [2]

Group 1: Challenges in Dual-Arm Robot Training
- Dual-arm robots are crucial for collaborative assembly, tool use, and object handover in complex scenarios, but training them for general manipulation with VLA models faces multiple bottlenecks [2]
- Scaling the collection of real demonstration data is costly and slow, making it hard to cover a wide range of tasks, object geometries, and hardware variations [2]
- Existing simulation methods lack efficient, scalable expert-data generation for new tasks, and their domain randomization designs are too superficial to capture the complexity of real environments [2]

Group 2: Advancements and Solutions
- The article highlights UniVLA, which efficiently uses multi-source heterogeneous data to build a general, scalable action space for robots [5]
- The CVPR champion solution BridgeVLA reportedly improves real-machine performance by 32%, marking progress in robot navigation and motion control in real-world scenarios [4]
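Domain randomization, which the summary above says existing simulators implement too superficially, amounts to resampling physical and visual parameters for every generated episode. A minimal sketch with invented parameter names and ranges (not RoboTwin2.0's actual configuration):

```python
import random

def randomize_episode(rng: random.Random) -> dict:
    """Sample one episode's simulation parameters so the policy cannot
    overfit to a single simulated configuration."""
    return {
        "table_height_m": rng.uniform(0.70, 0.85),
        "object_mass_kg": rng.uniform(0.05, 0.50),
        "friction": rng.uniform(0.4, 1.2),
        "light_intensity": rng.uniform(0.3, 1.0),
        "camera_jitter_deg": rng.uniform(-5.0, 5.0),
    }

rng = random.Random(42)
configs = [randomize_episode(rng) for _ in range(3)]
print(len(configs))  # three independently randomized episode configs
```

"Strong" randomization, in the spirit of the title, would extend such ranges beyond appearance to contact dynamics, object geometry, and controller latency, which flat visual-only randomization misses.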
A New Breakthrough in Unified VLA Architecture: Autoregressive World Models Lead Embodied Intelligence
机器之心· 2025-07-10 04:26
Core Viewpoint
- The article discusses UniVLA, a new unified Vision-Language-Action (VLA) model architecture that tightly integrates visual, language, and action signals for better decision-making in embodied intelligence tasks [4][5][13]

Group 1: Model Architecture and Mechanism
- UniVLA is built on a fully discrete, autoregressive mechanism that natively models visual, language, and action signals, with world-model training that learns temporal information and causal logic from large-scale video [5][9][14]
- The framework converts visual, language, and action signals into discrete tokens and arranges them as interleaved multimodal temporal sequences for unified modeling [9][10]

Group 2: Performance and Benchmarking
- UniVLA sets new state-of-the-art (SOTA) results on major embodied intelligence benchmarks including CALVIN, LIBERO, and SimplerEnv, demonstrating strong performance advantages [18][21]
- On the CALVIN benchmark, UniVLA achieves an average score of 95.5%, significantly outperforming previous models [19]

Group 3: Training Efficiency and Generalization
- The world-model post-training stage substantially improves downstream decision-making without relying on extensive action data, learning efficiently from large amounts of video alone [14][15]
- The model supports unified training across visual understanding, video generation, and action prediction, demonstrating its versatility and data scalability [10][24]

Group 4: Future Directions
- The article suggests deeper integration of the UniVLA framework with multimodal reinforcement learning to strengthen its perception, understanding, and decision-making in open-world scenarios [24]
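The interleaved multimodal sequence described above can be sketched as follows. The special tokens and helper are invented for illustration and are not UniVLA's actual tokenization; the sketch only shows the idea of one autoregressive stream over discretized language, vision, and action.

```python
# Invented modality-marker tokens for illustration.
BOS, LANG, VIS, ACT = "<bos>", "<lang>", "<vis>", "<act>"

def build_sequence(instruction_tokens, steps):
    """steps: list of (vision_tokens, action_tokens) pairs over time.
    Returns one flat token stream: the instruction once, then an
    observe-then-act pattern per timestep, all in a single sequence a
    standard autoregressive Transformer can model."""
    seq = [BOS, LANG, *instruction_tokens]
    for vis, act in steps:
        seq += [VIS, *vis, ACT, *act]
    return seq

seq = build_sequence(
    ["pick", "cup"],
    [(["v1", "v2"], ["a1"]), (["v3", "v4"], ["a2"])],
)
print(seq)
# -> ['<bos>', '<lang>', 'pick', 'cup', '<vis>', 'v1', 'v2', '<act>', 'a1',
#     '<vis>', 'v3', 'v4', '<act>', 'a2']
```

One consequence of this layout is that world-model pre-training (predicting vision tokens from video alone) and policy learning (predicting action tokens) share the same objective and weights, which is consistent with the post-training efficiency claim above.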
UniVLA, Jointly Launched by Zhiyuan Robotics and the University of Hong Kong, Accepted to RSS | Investment Research Report
Zhong Guo Neng Yuan Wang· 2025-05-16 01:43
Market Performance
- On May 14, 2025, the CSI 300 index rose 1.21%, while the machinery sector fell 0.43%, ranking 29th among all primary industries [2][1]
- Among sub-sectors, semiconductor equipment gained the most at 0.79%, while engineering machinery fell the most at 1.96% [2][1]
- The top three individual gainers were Heng Er Da (+20.00%), Zhong Ji Huan Ke (+19.97%), and Da Ye Co. (+12.98%); the top three losers were Magnetic Valley Technology (-8.20%), Xin Yu Ren (-7.46%), and De Ma Technology (-6.19%) [2][1]

Company Announcements
- New Era's shareholder Wang Chunxiang plans to reduce his stake by 0.15% through block trading or centralized bidding, having previously held 2.12% [3]
- Guangge Technology's major shareholder Beijing Jishi Chuangye Investment Fund cut its stake by 0.27% from 5.00% between May 7 and May 13, 2025 [3]
- Fengxing Co.'s major shareholder Jiangxi Taihao Technology Development Co. reduced its stake by 1.02% from 7.92% through centralized bidding [3]
- Zhuozhao Point Glue's shareholder Yinghao (Hainan) Venture Capital Co. reduced its stake by 0.2914% from 1.2230% through centralized bidding [3]

Industry News
- Zhiyuan Robotics and the University of Hong Kong launched UniVLA, a framework for general robot policy learning with cross-domain, cross-scenario, and cross-task capabilities [6]
- UniVLA's core innovation is a task-centric latent action space that enables efficient learning from vast amounts of unlabeled video, reaching state-of-the-art performance at significantly lower computational cost [6]
- The model showed an average success-rate improvement of 18.5% across four evaluation benchmarks and achieved state-of-the-art results with only 10% of the data on specific tasks [6]
- Zhengzhou Xinda Yimi Technology Co. released "Mi Xin PQC01," the first practical quantum-resistant chip, featuring fully domestic production and core technology [7][8]
- The chip supports dynamic switching between quantum-resistant and classical algorithms, is built on a 28nm process, and cuts power consumption by 60%, making it suitable for IoT and mobile devices [8]