Core Viewpoint
- The article introduces the GM-100 benchmark, a suite of 100 diverse tasks designed to evaluate robotic manipulation capabilities and to address the limited coverage of existing datasets and task designs in robotics [1][4].

Group 1: Background and Motivation
- The rapid development of robot learning has produced many datasets and task designs, but most focus on common tasks, leaving complex and rare tasks under-covered [3][5].
- Existing datasets such as Open X-Embodiment and Agibot concentrate on common actions like "pick and grasp," which biases trained models and limits their applicability in real-world scenarios [3][5].

Group 2: The GM-100 Benchmark
- GM-100 consists of 100 carefully designed tasks covering diverse interaction scenarios and long-tail behaviors, aiming to provide a comprehensive assessment of robotic agents' capabilities [4][11].
- The tasks are derived from systematic analysis and from insights into human action understanding, ensuring they are executable yet challenging enough to differentiate models [2][4].

Group 3: Task Design and Data Collection
- The task design process analyzed prior research to eliminate redundancy and categorize tasks, revealing a significant bias toward common activities [5][9].
- A diverse candidate set was generated with large language models, with human experts making the final selection to ensure the tasks are high-quality and feasible under current hardware constraints [10][11].
- Data for GM-100 was collected via teleoperation, yielding a medium-sized dataset of over 13,000 trajectories [13][16].
Group 4: Evaluation Metrics and Results
- Baseline models were evaluated on the GM-100 tasks using several metrics, including Success Rate (SR), Partial Success Rate (PSR), and action prediction error, for a comprehensive performance assessment [22].
- The overall success rate was low, highlighting the inherent difficulty of the tasks and the limitations of the training data [22].
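The metrics above can be illustrated with a minimal sketch. The episode record fields (`success`, `subgoals_done`, `subgoals_total`) and the averaging scheme are assumptions for illustration only, not GM-100's actual evaluation code; the article does not specify how partial credit is scored.

```python
# Hypothetical sketch of Success Rate (SR) and Partial Success Rate (PSR).
# Field names and the subgoal-based scoring scheme are illustrative
# assumptions, not taken from the GM-100 evaluation protocol.

def success_rate(episodes):
    """SR: fraction of episodes that fully succeeded."""
    return sum(1 for e in episodes if e["success"]) / len(episodes)

def partial_success_rate(episodes):
    """PSR: mean fraction of subgoals achieved per episode."""
    return sum(e["subgoals_done"] / e["subgoals_total"] for e in episodes) / len(episodes)

# Three toy evaluation episodes for a single task.
episodes = [
    {"success": True,  "subgoals_done": 3, "subgoals_total": 3},
    {"success": False, "subgoals_done": 1, "subgoals_total": 3},
    {"success": False, "subgoals_done": 0, "subgoals_total": 3},
]

print(success_rate(episodes))          # 1 of 3 episodes fully succeeded
print(partial_success_rate(episodes))  # mean of [1.0, 1/3, 0.0]
```

PSR rewards partial task progress, which is why a benchmark with a low overall SR still benefits from reporting it: two models with near-zero SR can differ substantially in how far they get through each task.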
Can Your Model Really Compete? A Long-Tail Scenario Benchmark for Manipulation Tasks Is Here
具身智能之心·2026-01-20 00:33