BridgeVLA Model
Interview with Huang Yan of Zhongke Fifth Epoch: Being a Technical Pragmatist Amid the Embodied Intelligence Frenzy
机器之心· 2026-03-27 04:09
Core Viewpoint
- The article highlights the rapid advancement and heavy investment in embodied intelligence, emphasizing how Huang Yan and his team at Zhongke Fifth Epoch take an innovative, technology-driven approach to real-world industrial challenges [1][2][3].

Group 1: Industry Trends and Innovations
- The embodied intelligence sector has seen unprecedented enthusiasm, with nearly 15 billion yuan in funding raised within two months [1].
- Huang Yan, a key figure in embodied intelligence, combines academic research with practical applications, focusing on solving data utilization problems in industrial settings [2][3].
- The team has developed a full-stack architecture that targets efficiency bottlenecks in data utilization, diverging from the industry's focus on data volume and computational power [3][10].

Group 2: Technical Developments
- Huang Yan's research began with multimodal technologies and led to advances in reinforcement learning algorithms that improve the efficiency of vision-language models [5][8].
- The introduction of the FAM series, an ultra-few-shot large model, is a pioneering effort to overcome data scarcity in the industry [14][22].
- The BridgeVLA model, which aligns input and output in a unified 2D image space, has shown remarkable efficiency in learning 3D manipulation [18][19].

Group 3: Practical Applications and Results
- The FAM model can be deployed reliably with as few as 3 to 5 real-robot demonstrations, achieving a success rate of nearly 97% on basic tasks [22].
- The EC-Flow framework allows robots to learn from unannotated human operation videos, significantly improving success rates on complex tasks [39][43].
- The team's data synthesis approach has yielded substantial gains in task success rates, strengthening the robots' capabilities in real-world scenarios [47].

Group 4: Market Position and Future Outlook
- Zhongke Fifth Epoch has secured significant funding, reflecting investor confidence in its practical approach to industrial pain points [52][53].
- The company aims to empower a range of industries with its embodied intelligence solutions, working toward the vision of deploying millions of robots to serve humanity [59].
- Its emphasis on practical applications and real-world adaptability positions Zhongke Fifth Epoch as a leader in the embodied intelligence market, ready for the challenges of 2026 and beyond [61][62].
A New Paradigm for 3D VLA! CVPR Championship Solution BridgeVLA Boosts Real-Robot Performance by 32%
具身智能之心· 2025-06-26 14:19
Core Viewpoint
- The article presents the BridgeVLA model, developed by the Institute of Automation, Chinese Academy of Sciences, which efficiently projects 3D input into 2D images for action prediction, achieving high performance and data efficiency in 3D robotic manipulation learning [4][6].

Group 1: Model Performance
- BridgeVLA reaches a 96.8% task success rate with only 3 trajectories in the basic setting and outperforms baseline models across various generalization settings, with a 32% overall performance improvement [6][25].
- On simulation benchmarks, BridgeVLA surpasses mainstream 3D robotic manipulation baselines: an 88.2% success rate on RLBench, a 7.3% improvement on COLOSSEUM, and a 50% success rate on GemBench [20][25].

Group 2: Model Design and Training
- BridgeVLA's training consists of two phases: 2D heatmap pre-training to strengthen spatial perception, followed by 3D action fine-tuning to learn specific manipulation policies [15][17].
- In the heatmap pre-training phase, the model predicts a probability heatmap of target object locations conditioned on textual instructions, enhancing its spatial awareness [16][25].

Group 3: Generalization and Data Efficiency
- BridgeVLA generalizes well to disturbances such as unseen objects, lighting conditions, and object types, drawing on the rich visual and linguistic prior knowledge embedded in the pre-trained multimodal model [20][25].
- The model's data efficiency is notable: with only 3 trajectories it achieves nearly the same performance as with 10, making it practical for deployment on real robotic systems [25][26].
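The heatmap-based 3D-to-2D bridging described above can be illustrated with a toy sketch. The following is NOT BridgeVLA's actual implementation: the projection scheme (three axis-aligned orthographic views), the grid resolution, and the `gaussian_heatmap` stand-in for the 2D backbone's prediction are all simplifying assumptions made for illustration. It only shows the general idea of projecting a 3D scene onto 2D images, predicting per-view target heatmaps, and lifting the heatmap peaks back to a 3D translation target.

```python
import numpy as np

GRID, LO, HI = 64, -1.0, 1.0  # assumed workspace: a [-1, 1]^3 cube on a 64x64 grid


def project_to_views(points, grid=GRID, lo=LO, hi=HI):
    """Orthographically project a 3D point cloud onto three axis-aligned
    2D occupancy images (top: x-y, front: x-z, side: y-z).
    A hypothetical simplification of the 3D-to-2D input bridging."""
    idx = np.clip(((points - lo) / (hi - lo) * grid).astype(int), 0, grid - 1)
    views = {}
    for name, (a, b) in {"top": (0, 1), "front": (0, 2), "side": (1, 2)}.items():
        img = np.zeros((grid, grid))
        img[idx[:, a], idx[:, b]] = 1.0  # mark occupied cells
        views[name] = img
    return views


def gaussian_heatmap(center, grid=GRID, sigma=2.0):
    """Stand-in for the 2D backbone: a Gaussian bump at the (row, col)
    cell where the target object projects in that view."""
    ys, xs = np.mgrid[0:grid, 0:grid]
    return np.exp(-((ys - center[0]) ** 2 + (xs - center[1]) ** 2) / (2 * sigma ** 2))


def heatmaps_to_3d(heatmaps, grid=GRID, lo=LO, hi=HI):
    """Lift the per-view argmax peaks back to a 3D position by averaging
    the redundant coordinate estimates across the three views."""
    peaks = {k: np.unravel_index(np.argmax(v), v.shape) for k, v in heatmaps.items()}
    # top gives (x, y), front gives (x, z), side gives (y, z)
    x = (peaks["top"][0] + peaks["front"][0]) / 2
    y = (peaks["top"][1] + peaks["side"][0]) / 2
    z = (peaks["front"][1] + peaks["side"][1]) / 2
    # convert cell indices back to metric coordinates (cell centers)
    return lo + (np.array([x, y, z]) + 0.5) / grid * (hi - lo)
```

Because the action target lives in the same 2D image space as the input views, the 2D backbone can be pre-trained purely on heatmap prediction before any robot data is seen, which is one intuition for the few-trajectory data efficiency reported above.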