Universal Robot Control

82% success rate on open-world tasks! Midea cracks the challenge of generalized robot control
量子位· 2025-07-15 06:28
Core Viewpoint
- The article discusses ChatVLA-2, a vision-language-action model with embodied reasoning capabilities developed jointly by the Midea AI Research Institute and East China Normal University. The model combines a dynamic mixture-of-experts architecture with a dual-stage training process to strengthen both its reasoning and its action-execution abilities [1][4].

Model Structure
- ChatVLA-2 adopts a mixture-of-experts (MoE) architecture that dynamically selects expert modules, letting some experts focus on task-specific features while mutually beneficial features are shared across tasks; this adaptive routing keeps computational resources efficiently allocated [7] (a minimal routing sketch follows this summary).

Training Strategy
- The training process consists of two phases (a schematic training loop is also sketched after this summary):
  - The first phase activates open-world understanding and reasoning by co-training vision-language data with robot action data, avoiding bias toward any single skill [13].
  - The second phase refines the model's reasoning-following ability by freezing the vision-language model and training only the action experts, markedly improving the model's understanding of, and response to, unseen reasoning scenarios [14][15].

Experimental Results
- In experiments, ChatVLA-2 showed superior capability on mathematical-reasoning and spatial-reasoning tasks:
  - In the mathematical matching game it achieved a reasoning score of 6.0/6, a success rate of 11/13, an OCR score of 3.58/4, and a math-reasoning score of 1.73/2, with an overall success rate of 82.7% in open-world scenarios [19].
  - In the toy placement task it achieved a target-recognition score of 0.94 and a manipulation success rate of 81.4%, outperforming comparable methods in unfamiliar environments [21].

Conclusion
- ChatVLA-2 marks a significant advance in robotics: by effectively translating reasoning outcomes into actions, it offers a new approach to universal robot control and paves the way for future research on complex scenarios and multimodal interaction [21].
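The dynamic mixture-of-experts routing described under Model Structure can be illustrated with a small PyTorch layer. This is a minimal sketch assuming token-level top-k gating; the class name `MoELayer`, the expert shapes, and all hyperparameters are illustrative assumptions, not ChatVLA-2's actual implementation.

```python
# Minimal sketch of a dynamic mixture-of-experts layer with token-level
# top-k gating. Names, sizes, and the routing scheme are assumptions for
# illustration, not ChatVLA-2's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoELayer(nn.Module):
    def __init__(self, dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Gating network scores every expert for each input token.
        self.gate = nn.Linear(dim, num_experts)
        # Each expert is a small feed-forward block; some experts can
        # specialize on task-specific features while others carry features
        # shared across tasks.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        scores = F.softmax(self.gate(x), dim=-1)             # (batch, tokens, experts)
        topk_w, topk_idx = scores.topk(self.top_k, dim=-1)   # keep the top-k experts
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)   # renormalize kept weights
        # For clarity every expert runs on all tokens; efficient MoE kernels
        # dispatch only the tokens routed to each expert.
        expert_out = torch.stack([e(x) for e in self.experts], dim=-2)  # (b, t, E, d)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = topk_idx[..., slot]                        # (b, t) chosen expert ids
            w = topk_w[..., slot].unsqueeze(-1)              # (b, t, 1) mixing weight
            gathered = torch.gather(
                expert_out, -2, idx[..., None, None].expand(-1, -1, 1, x.size(-1))
            ).squeeze(-2)                                    # (b, t, d)
            out = out + w * gathered
        return out


if __name__ == "__main__":
    layer = MoELayer(dim=256)
    tokens = torch.randn(2, 16, 256)
    print(layer(tokens).shape)  # torch.Size([2, 16, 256])
```

For a sketch, every expert processes every token and the results are mixed by the gate's top-k weights; production MoE implementations instead dispatch each token only to its selected experts to save compute.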
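The dual-stage schedule under Training Strategy can likewise be sketched as two short loops: joint vision-language and action co-training, then action-expert-only fine-tuning with the backbone frozen. The attributes `model.vlm` and `model.action_experts` and the loss methods `vl_loss`/`action_loss` are hypothetical placeholders; only the freeze-then-refine pattern comes from the summary.

```python
# Minimal sketch of a dual-stage training schedule: co-train vision-language
# and robot-action data, then freeze the VLM and train only the action experts.
# `model.vlm`, `model.action_experts`, `vl_loss`, and `action_loss` are
# hypothetical placeholders, not ChatVLA-2's real interfaces.
import torch


def stage_one(model, vl_loader, action_loader, steps: int, lr: float = 1e-4):
    """Phase 1: co-train on vision-language and action batches so open-world
    understanding is activated without biasing the model toward one skill."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for step, (vl_batch, act_batch) in enumerate(zip(vl_loader, action_loader)):
        if step >= steps:
            break
        loss = model.vl_loss(vl_batch) + model.action_loss(act_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


def stage_two(model, action_loader, steps: int, lr: float = 1e-5):
    """Phase 2: freeze the vision-language backbone and update only the
    action experts, refining how reasoning outputs are turned into actions."""
    for p in model.vlm.parameters():
        p.requires_grad_(False)  # reasoning backbone stays fixed
    trainable = [p for p in model.action_experts.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=lr)
    for step, act_batch in enumerate(action_loader):
        if step >= steps:
            break
        loss = model.action_loss(act_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```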