Visual-Language-Action Model (VLA)
A more comprehensive real-robot benchmark for embodied intelligence arrives: the CVPR 2026 ManipArena challenge invites you to climb the leaderboard
机器之心 · 2026-03-18 07:39
Core Insights
- The embodied intelligence sector has seen explosive growth over the past year, producing a stream of impressive robot demonstrations. The industry, however, faces a critical question: how to tell whether an embodied intelligence model has genuinely improved its generalization or has merely been tuned for specific tasks and scenarios [1][2].

Group 1: Industry Challenges
- The lack of a unified, high-standard evaluation system for real-world performance has become a core pain point for the embodied intelligence industry, slowing model iteration and risking a misallocation of research resources [1].
- At this pivotal moment for scaling embodied intelligence, establishing a scientific, quantifiable, reproducible, and high-fidelity metric for real-world performance has become an urgent industry consensus [2].

Group 2: ManipArena Initiative
- Sun Yat-sen University, together with several partner institutions, launched ManipArena as an official competition at the CVPR 2026 Embodied AI Workshop to address these evaluation challenges [3].
- ManipArena offers 20 real-world tasks (5 preliminary, 15 final) within a framework designed to diagnose model generalization through controlled environments and layered out-of-distribution (OOD) assessment [5][8].

Group 3: Evaluation Framework
- ManipArena's layered OOD assessment allows precise diagnosis of generalization bottlenecks, moving beyond traditional single-score evaluation to a more nuanced picture of model capability [10][11].
- Each task is tested 10 times, with difficulty levels stratified so that performance is reported across in-domain and OOD scenarios rather than as a single aggregate (a minimal evaluator sketch follows this summary) [11][12].

Group 4: Initial Findings
- Preliminary results indicate that current mainstream visual-language-action (VLA) models show significant generalization weaknesses, particularly under compound out-of-distribution tests [13][14].
- The evaluation data also suggest that object-shape similarity matters more to current models than semantic category membership, underscoring how fragile their generalization is [15].

Group 5: Controlled Environment and Diversity
- ManipArena uses a green-screen controlled environment to eliminate visual confounds, so that performance differences reflect genuine policy capability [16].
- The platform enumerates three levels of systematic diversity parameters and keeps every dimension uniformly distributed, preventing models from exploiting frequency biases as shortcuts (see the sampling sketch below) [19][20].

Group 6: Task Complexity and Scoring
- The tasks are deliberately challenging, with no simple grab-and-go tests; reasoning is the core consideration [25].
- Scoring uses a sub-task partial-credit system, giving a detailed picture of where models succeed or fail within each task pipeline (also illustrated in the evaluator sketch below) [46].

Group 7: Model Performance Insights
- Initial tests of models including π₀.₅-Single, π₀.₅-OneModel, and DreamZero reveal distinct performance boundaries; π₀.₅-OneModel leads on score but shows signs of forgetting procedural knowledge on specific tasks [48][51].
- The results indicate that VLA models excel at precision control and semantic understanding, while world models hold advantages in spatial generalization and coarse-grained planning [52].

Group 8: Future Implications
- ManipArena is positioned not only as a competition but as a high-standard open research platform, encouraging researchers to publish rigorous academic work grounded in authoritative evaluation results [52].
- The initiative aims to support the continuous iteration of visual-language-action models and world models and to accelerate the industry's transition to large-scale real-world deployment [52].
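To make the layered evaluation concrete, here is a minimal sketch of how a stratified, partial-credit scorer could work. All names (`TrialResult`, `evaluate_task`, the level labels) are hypothetical illustrations, not ManipArena's actual API; the only details taken from the article are the 10-trial budget, the in-domain vs. layered-OOD stratification, and per-sub-task partial credit. The trial budget is assumed, for illustration, to apply per difficulty level.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

# Hypothetical difficulty strata; the article describes in-domain trials
# plus layered OOD shifts, including compound ones.
LEVELS = ["in_domain", "ood_single", "ood_compound"]

@dataclass
class TrialResult:
    # The "sub-task partial scoring" idea: credit for reaching
    # intermediate pipeline stages instead of a binary success flag.
    subtasks_done: int
    subtasks_total: int

    @property
    def partial_score(self) -> float:
        return self.subtasks_done / self.subtasks_total

def evaluate_task(run_trial: Callable[[str, int], TrialResult],
                  trials_per_level: int = 10) -> dict[str, float]:
    """Run the fixed trial budget at every difficulty level and report
    a per-level mean partial score, so a generalization gap shows up as
    the spread between in-domain and OOD rows, not one blended number."""
    report = {}
    for level in LEVELS:
        scores = [run_trial(level, seed).partial_score
                  for seed in range(trials_per_level)]
        report[level] = mean(scores)
    return report
```

Under this scheme, a policy that nails in-domain trials but degrades under compound shift would surface as something like `{"in_domain": 0.95, "ood_single": 0.60, "ood_compound": 0.20}`, which is exactly the kind of diagnosis a single success rate hides.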
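The "uniform distribution across all dimensions" point can likewise be illustrated with a small sampler. The dimensions and values below are invented for illustration; the article states only that diversity parameters are varied at three levels so that no configuration is over-represented.

```python
import itertools
import random

# Hypothetical diversity dimensions, each at three levels, mirroring the
# article's "three levels of systematic diversity parameters".
DIVERSITY = {
    "object_shape": ["cylinder", "box", "irregular"],
    "placement":    ["left", "center", "right"],
    "distractors":  ["none", "few", "many"],
}

def balanced_episodes(n_episodes: int, seed: int = 0) -> list[dict]:
    """Cycle through the full Cartesian product of parameter values in a
    shuffled order, so every combination appears equally often (up to
    rounding). i.i.d. sampling could skew frequencies and hand a model
    the very shortcut the benchmark is trying to close off."""
    grid = [dict(zip(DIVERSITY, combo))
            for combo in itertools.product(*DIVERSITY.values())]
    rng = random.Random(seed)
    rng.shuffle(grid)
    return [grid[i % len(grid)] for i in range(n_episodes)]

# 27 combinations here, so 54 episodes cover each exactly twice.
episodes = balanced_episodes(54)
```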
Say goodbye to robot "blackouts": a KAIST and UC Berkeley team gives VLA models memory, doubling real-world success rates
机器人大讲堂 · 2026-02-16 15:31
Core Insights
- The article discusses the limitations of existing Visual-Language-Action (VLA) models in robotics, particularly their lack of historical memory, which hampers their ability to perform complex tasks that require context [1][4].
- A new framework called HAMLET enhances VLA models with a lightweight memory system, yielding a significant increase in task success rates [3][17].

Group 1: Current Limitations of VLA Models
- Current VLA models such as GR00T N1.5 and CogACT act on only the current visual frame and the text instruction, so they perform poorly on tasks that require context [4].
- For example, in a task where a robot must cover a block with a cup, the absence of historical memory leaves GR00T N1.5 at a success rate of only 37.5%, with the robot needlessly repeating actions [4][14].
- Simply stacking historical frames is not a viable fix: it slows inference by 35% and raises peak memory usage 3.6-fold [4].

Group 2: HAMLET Framework
- HAMLET closes the historical-memory gap with two core components: moment tokens and a lightweight memory module (sketched below) [5][9].
- Moment tokens compress and store the scene information of each time step, letting the model focus on the dynamic changes relevant to the task [6][8].
- The memory module, a two-layer Transformer, filters and integrates these moment tokens so the model can make decisions informed by historical context [9][11].

Group 3: Performance Improvements
- Extensive experiments show that HAMLET markedly improves success on long-horizon tasks, with an average success-rate gain of 47.2% over baseline models [12][14].
- On specific tasks, HAMLET lifted success from 12.5% to 66.7% on "Pick-and-Place Twice" and from 37.5% to 83.3% on "Swap Cubes" [14].
- HAMLET also stays efficient: inference is only about 7% slower and memory usage roughly doubles (a 1× increase), versus the drastic slowdown of naive frame stacking [15].

Group 4: Cross-Task Transferability
- HAMLET's memory module transfers across tasks, improving success rates even when applied to different datasets, which suggests a generalizable capability for processing historical information [16].

Conclusion
- HAMLET resolves the core historical-memory problem of VLA models without extensive retraining or architectural overhaul, a significant step toward more capable and versatile robotic systems [17].
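The article's description of HAMLET — per-timestep moment tokens compressed from the observation, then filtered and integrated by a lightweight two-layer Transformer memory before reaching the action head — can be sketched roughly as below. This is not the authors' released code: every name, tensor shape, and layer size is an assumption made for illustration, and gradient handling across the rolling buffer is left for an inference-time setting.

```python
import torch
import torch.nn as nn

class MomentMemory(nn.Module):
    """Rough sketch of a HAMLET-style memory: each step's visual features
    are compressed into a few learned "moment tokens", appended to a
    rolling buffer, and a small two-layer Transformer reads the buffer
    into a history context for the action head."""

    def __init__(self, feat_dim=512, n_moment_tokens=4, max_steps=32):
        super().__init__()
        # Learned queries that cross-attend to the frame features and
        # compress them into n_moment_tokens vectors per timestep.
        self.queries = nn.Parameter(torch.randn(n_moment_tokens, feat_dim))
        self.compress = nn.MultiheadAttention(feat_dim, num_heads=8,
                                              batch_first=True)
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                           batch_first=True)
        # The "lightweight" two-layer Transformer over the token buffer.
        self.memory = nn.TransformerEncoder(layer, num_layers=2)
        self.max_tokens = n_moment_tokens * max_steps
        self.buffer = None  # (B, steps * n_moment_tokens, feat_dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, n_patches, feat_dim), current-frame features.
        b = frame_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        moment, _ = self.compress(q, frame_feats, frame_feats)
        self.buffer = (moment if self.buffer is None
                       else torch.cat([self.buffer, moment], dim=1))
        # Bound the buffer so per-step cost stays flat over long episodes,
        # unlike raw frame stacking, whose cost grows with history length.
        self.buffer = self.buffer[:, -self.max_tokens:]
        # The readout is consumed downstream together with the live
        # observation, conditioning the VLA's action head on history.
        return self.memory(self.buffer)
```

Usage per control step: call `mem(frame_feats)` and feed the returned history context to the policy alongside the current observation. The design point this sketch tries to capture is the one the article emphasizes: compressing each frame to a handful of tokens keeps the memory's size and latency nearly constant, where stacking full frames was reported to cost 35% in speed and 3.6× in peak memory.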