A more comprehensive real-robot evaluation for embodied intelligence is here! The CVPR 2026 ManipArena Challenge invites you to climb the leaderboard
机器之心·2026-03-18 07:39

Core Insights

- The embodied intelligence sector has seen explosive growth over the past year, producing many impressive robot demonstrations. However, the industry faces a critical question: how to determine whether an embodied intelligence model has genuinely improved its generalization capabilities, or has merely been optimized for specific tasks and scenarios [1][2].

Group 1: Industry Challenges

- The lack of a unified, high-standard evaluation system for real-world performance has become a core pain point for the embodied intelligence industry, hindering model iteration efficiency and potentially misallocating research resources [1].
- Establishing a scientific, quantifiable, reproducible, and high-fidelity real-world evaluation standard is an urgent industry consensus at this pivotal moment for scaling embodied intelligence [2].

Group 2: ManipArena Initiative

- Sun Yat-sen University, in collaboration with several partner institutions, launched the ManipArena official competition at the CVPR 2026 Embodied AI Workshop to address these evaluation challenges [3].
- ManipArena offers 20 real-world tasks (5 preliminary and 15 final), with a framework designed to diagnose model generalization through controlled environments and layered out-of-distribution (OOD) assessments [5][8].

Group 3: Evaluation Framework

- ManipArena's layered OOD assessment enables precise diagnosis of generalization bottlenecks, moving beyond traditional single-score evaluations toward a more nuanced picture of model capabilities [10][11].
- Each task is tested 10 times, with difficulty levels stratified so that results reflect performance across both in-domain and OOD scenarios [11][12].
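As a hypothetical illustration of such a layered protocol, per-task results could be aggregated per difficulty layer rather than pooled into one number. The layer names, trial split, and aggregation below are assumptions for illustration, not ManipArena's published rules:

```python
from dataclasses import dataclass

@dataclass
class Trial:
    layer: str      # assumed layer labels: "in_domain", "ood_visual", "ood_compound"
    success: bool

def layered_scores(trials: list[Trial]) -> dict[str, float]:
    """Success rate per difficulty layer, instead of one pooled score."""
    layers: dict[str, list[bool]] = {}
    for t in trials:
        layers.setdefault(t.layer, []).append(t.success)
    return {name: sum(r) / len(r) for name, r in layers.items()}

# Example: 10 trials of one task, stratified across layers (split is invented)
trials = (
    [Trial("in_domain", True)] * 4
    + [Trial("ood_visual", s) for s in (True, True, False)]
    + [Trial("ood_compound", s) for s in (True, False, False)]
)
print(layered_scores(trials))  # in_domain: 1.0, ood_visual: ~0.67, ood_compound: ~0.33
```

Reporting per-layer success rates like this is what lets an evaluation distinguish a model that fails on visual shifts from one that fails only on compound shifts, which a single pooled score would conflate.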
Group 4: Initial Findings

- Preliminary results indicate that current mainstream vision-language-action (VLA) models exhibit significant generalization weaknesses, particularly under compound out-of-distribution tests [13][14].
- The evaluation data reveal that object-shape similarity matters more to current models than semantic category membership, underscoring how fragile their generalization is [15].

Group 5: Controlled Environment and Diversity

- ManipArena employs a green-screen controlled environment to eliminate visual distractors, ensuring that performance differences reflect true policy capability [16].
- The platform applies three levels of systematic diversity parameters to keep the distribution uniform across all dimensions, preventing models from exploiting frequency biases as shortcuts [19][20].

Group 6: Task Complexity and Scoring

- ManipArena's tasks are deliberately challenging, with no simple grab-and-go tests; reasoning is the core consideration [25].
- Scoring is based on a sub-task partial-credit system, giving a more detailed picture of where models succeed or fail within a task pipeline [46].

Group 7: Model Performance Insights

- Initial tests of models including π₀.₅-Single, π₀.₅-OneModel, and DreamZero reveal distinct capability boundaries: π₀.₅-OneModel leads in overall score but shows signs of procedural-knowledge forgetting on specific tasks [48][51].
- The results indicate that VLA models excel at precision control and semantic understanding, while world models show advantages in spatial generalization and coarse-grained planning [52].

Group 8: Future Implications

- ManipArena serves not only as a competition but also as a high-standard open research platform, encouraging researchers to publish high-level academic papers based on its authoritative evaluation results [52].
- The initiative aims to empower the continuous iteration of vision-language-action models and world models, accelerating the industry's transition to large-scale real-world deployment [52].
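The sub-task partial-credit scoring mentioned above can be sketched as follows. The sub-task names and weights are invented for illustration; ManipArena's actual rubric may differ:

```python
def partial_score(completed: set[str], rubric: dict[str, float]) -> float:
    """Sum the weights of completed sub-tasks; full completion scores 1.0.

    `rubric` maps each sub-task to a weight (hypothetical values below).
    """
    total = sum(rubric.values())
    earned = sum(w for step, w in rubric.items() if step in completed)
    return earned / total

# Hypothetical pick-and-place pipeline broken into scored stages
rubric = {"locate": 0.2, "grasp": 0.3, "transport": 0.2, "place": 0.3}

print(partial_score({"locate", "grasp"}, rubric))  # 0.5
print(partial_score(set(rubric), rubric))          # 1.0
```

The point of partial credit is diagnostic: a model that reliably locates and grasps but fails to place scores 0.5 rather than 0.0, so the failure stage inside the pipeline stays visible in the result.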
