Generalization Capability Evaluation
REALM: A real2sim Validation Benchmark for Robot Manipulation Tasks
具身智能之心 · 2025-12-27 10:03
**Core Background and Issues**

- Vision-Language-Action (VLA) models enable robots to understand natural-language commands and perform manipulation tasks, but evaluating their generalization remains a key challenge: real-world assessments are costly and hard to reproduce. Existing simulation benchmarks have significant flaws, including a limited range of disturbance types and a lack of high-fidelity visuals, producing a disconnect between simulated and real-world performance known as the "sim-to-real gap" [2].
- To address this, a research team from Czech Technical University and the University of Amsterdam developed REALM, a high-fidelity simulation environment and benchmark designed to establish a strong correlation between simulated and real-world performance, enabling large-scale, low-cost evaluation of VLA generalization. Its core contributions are a high-fidelity, control-aligned simulation environment; a multi-dimensional disturbance evaluation scheme; and an empirically validated real-sim performance correlation [2].

**Related Work and Differentiating Advantages**

- Existing generalization benchmarks for robotic manipulation rely heavily on simulation but have notable limitations. GemBench and VLABench support only a limited number of disturbance types, particularly behavioral disturbances. SIMPLER achieves partial control alignment but covers few skills and objects and supports only a single viewpoint. REALM covers six visual, eight semantic, and seven behavioral disturbances; supports seven skills, ten scenes, and over 3,500 objects; and provides high-fidelity visuals, control alignment, and multi-view support, making it the most comprehensive generalization benchmark to date [3][4].

**Benchmark Design Core Elements**

1. **Skills and Task Set**: The benchmark is built around seven core manipulation skills: picking, placing, pushing, rotating, stacking, opening, and closing. It includes two task sets, where skills are defined as general capabilities independent of objects and scenes, while tasks are specific instances of a skill applied to particular objects and scenes; a modular framework supports expansion [5].
2. **Disturbance Design**: To probe generalization, 15 disturbance types are designed across the three main categories. REALM-base covers eight tasks built on the picking and placing skills, while REALM-articulated targets tasks involving articulated objects such as cabinet doors [6][8].
3. **Evaluation Metrics and Control Alignment**: A tiered progression metric replaces the binary success rate by decomposing each skill into ordered discrete states, giving a finer-grained picture of model performance. Control alignment is achieved by redesigning the robot controller and fine-tuning 14 physical parameters, significantly improving the consistency between simulated and real trajectories [9].

**Real-Sim Alignment and Validation**

- The validation confirms that simulation can effectively substitute for real-world evaluation. Testing covered three VLA models, seven tasks, and five disturbance types across nearly 800 trajectory sets, using the Pearson correlation coefficient, p-values, and Mean Maximum Rank Violation (MMRV) as key metrics. Results show a strong linear correlation between simulated and real-world task progression, with low MMRV values and p < 0.001 in all settings, demonstrating that simulation reliably predicts real-world performance [11].

**Key Experimental Results and Findings**

1. **Visual Generalization**: Pure visual disturbances significantly degrade performance, with average RMSD exceeding 0.12. Blur and lighting changes have minimal effect, likely owing to the visual diversity of the DROID training data. However, viewpoint changes and scene disturbances have the largest impact, indicating that while models adapt to some visual variation, their robustness remains insufficient [14].
2. **Semantic Generalization**: Despite building on large-scale pre-trained VLMs, models face substantial challenges under semantic disturbances, with performance lagging significantly. The most impactful disturbances involve world knowledge and human needs, while spatial-relationship understanding performs unexpectedly well [17].
3. **Behavioral Generalization**: Behavioral disturbances require the model to adjust its motion strategy and pose the greatest challenge. Models generalize well across different skills on the same object but poorly across different objects, especially unseen ones, indicating limited behavioral adaptability [18].
4. **Robustness and Task Completion**: The π0-FAST model achieved the highest average task progression across all disturbances, leading in success rate on 9 of 10 tasks. In contrast, GR00T performed significantly worse, with less interpretable disturbance effects. All models took 20-30 seconds on average to complete even simple tasks, with high variance, indicating difficulty completing tasks efficiently and consistently in unknown environments [19].