Workflow
Still racing to build end-to-end models? Embodied-R1 takes a different path: SOTA performance through "pointing" + reinforcement learning!
具身智能之心·2025-09-02 00:03

Core Insights
- The article discusses the development of Embodied-R1, a new model designed to bridge the "seeing-to-doing gap" in robotics, a long-standing challenge in the field [2][32]
- The model introduces a novel intermediate representation called "pointing," which translates complex operational instructions into visual points, enhancing the robot's ability to understand and execute tasks [3][10]

Group 1: Challenges in Robotics
- The "seeing-to-doing gap" is primarily caused by data scarcity and morphological heterogeneity, which hinder effective knowledge transfer in robotics [2]
- Existing vision-language-action (VLA) models struggle in new environments, often losing zero-shot operational capabilities [2][10]

Group 2: Embodied-R1 Model Overview
- Embodied-R1 is a 3-billion-parameter model that uses "pointing" as an intuitive intermediate representation, defining four key capabilities: REG (representational understanding), RRG (spatial region pointing), OFG (functional part pointing), and VTG (visual trajectory generation); see the illustrative sketch after this summary [10][12]
- The model demonstrates superior performance across 11 spatial reasoning and pointing benchmarks, achieving a 56.2% success rate in the SimplerEnv simulation and 87.5% across eight real-world tasks without fine-tuning [10][27]

Group 3: Training Methodology
- The model employs a two-phase training curriculum, focusing first on spatial reasoning and then on embodied pointing capabilities, using a dataset of 200,000 samples [15][16]
- Reinforcement fine-tuning (RFT) is introduced to address the "multi-solution dilemma" in pointing tasks, allowing the model to develop a generalized understanding rather than memorizing specific answers; a reward sketch illustrating this idea follows the summary [17][19]

Group 4: Performance Metrics
- Embodied-R1 outperforms other models across benchmarks, achieving state-of-the-art (SOTA) results on the REG, RRG, OFG, and VTG tasks [29][30]
- The model's trajectory generation quality is the best among all compared models, which is crucial for reliable robot execution [29]

Group 5: Robustness and Adaptability
- The model exhibits strong robustness to visual disturbances, maintaining performance under challenging conditions such as poor lighting and background changes [31]
- This adaptability is attributed to the "pointing" representation, which improves the robustness of the robot's policy [31]

Group 6: Conclusion
- The introduction of Embodied-R1 marks a significant advancement toward closing the long-standing "seeing-to-doing gap" in robotics, offering a promising path to more powerful and generalizable embodied AI systems [32]
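To make the four pointing-style outputs mentioned in Group 2 concrete, here is a minimal Python sketch of how such predictions could be represented as points on the input image. The class name, field names, and coordinate conventions are assumptions for illustration only; they are not taken from the Embodied-R1 paper or codebase.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# (x, y) pixel coordinates in the input image
Point = Tuple[float, float]

@dataclass
class PointingPrediction:
    """Hypothetical container for the four pointing-style outputs."""
    # REG: grounding of the object referred to in the instruction
    reg_point: Point
    # RRG: a point inside the spatial target region (e.g., "to the left of the cup")
    rrg_point: Point
    # OFG: a point on the functional part to contact (e.g., a mug handle)
    ofg_point: Point
    # VTG: ordered 2D waypoints forming a visual trajectory
    vtg_trajectory: List[Point] = field(default_factory=list)

    def as_targets(self) -> dict:
        """Flatten into plain coordinates a downstream controller could consume."""
        return {
            "object": self.reg_point,
            "region": self.rrg_point,
            "contact": self.ofg_point,
            "trajectory": self.vtg_trajectory,
        }
```

The point of this layout is that every output is expressed in image space, so the same representation can be reused across robots regardless of their embodiment.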
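The "multi-solution dilemma" in Group 3 arises because many different points can be equally correct answers to one pointing instruction. Below is a minimal sketch of one way a verifiable RFT reward could handle this: any predicted point inside the annotated target region earns full reward, so the model is never pushed to memorize a single canonical answer. The function name, mask format, and reward scheme are illustrative assumptions, not the exact reward used by Embodied-R1.

```python
import numpy as np

def point_in_region_reward(pred_point, region_mask):
    """Binary reward: 1.0 if the predicted (x, y) point lies inside the
    annotated target region, else 0.0. Any valid point scores equally."""
    x, y = pred_point
    h, w = region_mask.shape
    xi, yi = int(round(x)), int(round(y))
    if not (0 <= xi < w and 0 <= yi < h):
        return 0.0  # point falls outside the image
    return float(region_mask[yi, xi] > 0)

# Usage: a 10x10 mask whose right half is the valid region
mask = np.zeros((10, 10), dtype=np.uint8)
mask[:, 5:] = 1
print(point_in_region_reward((7.2, 3.9), mask))  # 1.0 -- any point in the region is rewarded
print(point_in_region_reward((2.0, 3.0), mask))  # 0.0 -- point outside the region
```

Because the reward checks membership in a region rather than distance to one labeled point, optimizing it encourages a generalized notion of "where to point" rather than rote reproduction of training annotations.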