Core Viewpoint
- The article discusses the limitations of existing robot learning datasets and task designs, emphasizing the need for a more systematic approach to evaluating and enhancing robot capabilities through the introduction of the GM-100 benchmark [2][4].

Group 1: Background and Issues
- The rapid development of robot learning has produced numerous datasets and task designs, but many focus on common tasks and lack coverage of complex and rare ones [3][5].
- Existing evaluations often rely on a handful of common tasks, making it difficult to compare different research outcomes fairly [3][5].

Group 2: GM-100 Benchmark
- The GM-100 benchmark consists of 100 carefully designed tasks covering diverse interaction scenarios and long-tail behaviors, aiming to provide a varied and challenging task set for evaluating robot capabilities [4][11].
- The tasks were developed through systematic analysis and expansion of existing designs, incorporating insights from human action understanding [4][9].

Group 3: Task Design and Data Collection
- The design of GM-100 tasks was grounded in the rationality of human actions, ensuring a wide range of interaction scenarios and the inclusion of rare but important actions [9][10].
- A medium-sized dataset of over 13,000 trajectories was collected via teleoperation across two different robot platforms, ensuring diverse data for evaluation [11][13][16].

Group 4: Evaluation Metrics
- Models are evaluated on GM-100 tasks with several metrics, including Success Rate (SR), Partial Success Rate (PSR), and action prediction error, to provide a comprehensive assessment of robot performance [22]; a minimal sketch of these metrics appears after this summary.
- The overall success rate across the benchmark is low, highlighting the inherent difficulty of the tasks and the limitations imposed by current data constraints [22].
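For concreteness, the Group 4 metrics might be computed roughly as follows. This is a minimal sketch, not the paper's implementation: the function names are made up here, PSR is assumed to be the mean per-episode fraction of completed subgoals, and the action prediction error is assumed to be an L2 error averaged over timesteps; the article does not spell out these definitions.

```python
import numpy as np

def success_rate(successes):
    """Fraction of episodes that fully succeed (SR)."""
    return float(np.mean(successes))

def partial_success_rate(completed_subgoals, total_subgoals):
    """Average per-episode fraction of completed subgoals (PSR).

    Assumption: PSR = mean over episodes of (completed / total)
    subgoals; the exact GM-100 definition may differ.
    """
    ratios = [c / t for c, t in zip(completed_subgoals, total_subgoals)]
    return float(np.mean(ratios))

def action_prediction_error(pred_actions, gt_actions):
    """Mean L2 error between predicted and ground-truth actions.

    Assumption: the article does not specify the norm; L2 over
    action dimensions, averaged over timesteps, is one common choice.
    """
    pred, gt = np.asarray(pred_actions), np.asarray(gt_actions)
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Hypothetical usage over three episodes:
print(success_rate([1, 0, 0]))                     # SR  = 0.33...
print(partial_success_rate([3, 1, 0], [4, 4, 4]))  # PSR = 0.33...
```

PSR rewards partial task progress, which matters on a benchmark whose overall SR is low: it separates models that fail immediately from those that complete most subgoals before failing.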
Can your model really compete? SJTU releases GM-100, covering nearly a hundred scenarios: long-tail scenario evaluation for manipulation tasks is here
具身智能之心·2026-01-19 09:30