Core Viewpoint
- GR-3, developed by ByteDance, is a large-scale vision-language-action (VLA) model designed to advance general-purpose robot policies, demonstrating strong generalization, efficient fine-tuning, and reliable execution of complex tasks [2][7].

Group 1: Performance and Advantages
- GR-3 generates action sequences for a dual-arm mobile robot from natural language instructions and environmental observations, outperforming current state-of-the-art baselines such as π0 [2][7] (a minimal sketch of this instruction-plus-observation-to-action interface follows the summary).
- The model has 4 billion parameters in total, balancing performance and efficiency through an optimized action-generation module [10][12].

Group 2: Core Capabilities and Innovations
- GR-3 targets three long-standing pain points of traditional robots: limited understanding of novel objects and instructions, slow learning of new tasks, and unreliable execution of complex tasks [7].
- It features a dual-path design that combines a data-driven training recipe with architectural optimization, enabling it to both understand abstract instructions and perform precise manipulation [7][12].
- Key innovations include stronger generalization, efficient adaptation from only a small amount of human demonstration data, and stable performance on long-horizon, intricate tasks [12][14].

Group 3: Training Methodology
- Training follows a "trinity" strategy that progressively integrates robot trajectories, vision-language data, and human demonstrations [15][19] (a data-mixing sketch also follows the summary).
- Joint training with large-scale internet vision-language datasets improved the model's recognition of novel objects by roughly 40% [19][23].

Group 4: Hardware Integration
- The ByteMini robot, designed to pair with GR-3, features flexible 7-degree-of-freedom arms and a stable omnidirectional base, extending its operational capabilities across varied environments [25][26].
- The system can autonomously generate task combinations and control environmental variables, supporting reliable task execution [21][25].

Group 5: Experimental Validation
- GR-3 was evaluated on three challenging tasks, showing strong adaptability to new environments and abstract instructions, with a 77.1% success rate when following novel directives [30][38].
- On a long-horizon task, GR-3 maintained an 89% success rate across multi-step actions, significantly outperforming previous models [42].
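The summary above describes GR-3's core interface: given camera observations, proprioceptive state, and a natural language instruction, the model outputs a chunk of future actions for a dual-arm mobile robot. The sketch below illustrates that input/output contract only; the class name, action-dimension layout, chunk length, and placeholder "model" are illustrative assumptions, not ByteDance's actual GR-3 implementation.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Observation:
    # RGB images from the robot's cameras (e.g. head and wrist views; assumed).
    images: list[np.ndarray]
    # Proprioceptive state: two 7-DoF arms, two grippers, and a mobile base
    # pose. The exact layout is an assumption for illustration.
    proprio: np.ndarray


class VLAPolicySketch:
    """Maps (observation, natural-language instruction) -> a chunk of actions."""

    def __init__(self, action_dim: int = 2 * 7 + 2 + 3, chunk_len: int = 16):
        self.action_dim = action_dim  # arms + grippers + base (assumed layout)
        self.chunk_len = chunk_len    # number of future steps predicted at once

    def predict_action_chunk(self, obs: Observation, instruction: str) -> np.ndarray:
        # A real VLA model would encode the images and the instruction with a
        # vision-language backbone and decode actions with a learned action
        # head; here a zero array stands in as a placeholder.
        assert obs.proprio.ndim == 1, "expected a flat proprioceptive vector"
        return np.zeros((self.chunk_len, self.action_dim), dtype=np.float32)


if __name__ == "__main__":
    obs = Observation(
        images=[np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(3)],
        proprio=np.zeros(2 * 7 + 2 + 3, dtype=np.float32),
    )
    policy = VLAPolicySketch()
    actions = policy.predict_action_chunk(obs, "pick up the cup and place it on the shelf")
    print(actions.shape)  # (16, 19)
```

The "trinity" training strategy in Group 3 mixes robot trajectories, internet vision-language data, and human demonstrations. A minimal sketch of such a data-source mixture is shown below, using hypothetical sampling weights; GR-3's actual recipe and ratios are not specified in the summary.

```python
import random

# Hypothetical sampling weights over the three data sources (assumptions).
MIXTURE = {
    "robot_trajectories": 0.5,
    "vision_language_web_data": 0.3,
    "human_demonstrations": 0.2,
}


def sample_batch_sources(batch_size: int, rng: random.Random) -> list[str]:
    # Choose a data source for each sample in the batch according to the weights.
    sources, weights = zip(*MIXTURE.items())
    return rng.choices(list(sources), weights=list(weights), k=batch_size)


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_batch_sources(8, rng))
```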
Surpassing π0 across a wide range of tasks! ByteDance releases large-scale VLA model GR-3, advancing general-purpose robot policies
具身智能之心·2025-07-22 04:10