Investigating the Root Cause of Embodied Robots' Limited Generalization: Augmentation Strategies Remain Effective
具身智能之心· 2025-08-12 00:03
Research Background and Core Issues
- Large-scale robot datasets and high-capacity models perform strongly across many tasks, but generalization remains limited in scenarios outside the training data distribution [2]
- Shortcut learning, in which models rely on task-irrelevant features rather than true causal relationships, is a key factor limiting generalization [2]

Dataset Diversity and Fragmentation Analysis
- The OXE dataset exhibits significantly lower visual and textual diversity than visual/multimodal datasets, even with the recent DROID dataset aimed at increasing diversity [4]
- OXE is clearly fragmented: its sub-datasets are sharply separated with little overlap, effectively splitting it into many small, isolated datasets [8]
- The limited diversity stems from inherent constraints of the robot data collection process [6]

Theoretical Connection Between Dataset Characteristics and Shortcut Learning
- A mathematical framework analyzes how training on many separate sub-datasets induces spurious correlations that facilitate shortcut learning [15]
- The distance between task-irrelevant features across sub-datasets strongly influences shortcut learning, with models tending to rely on visual cues rather than textual instructions [16]

Experimental Validation
- Experiments indicate that increasing diversity within sub-datasets and reducing differences between them effectively weakens shortcut dependencies [18]
- Introducing a "bridge" target that breaks spurious correlations significantly improved out-of-distribution (OOD) success rates [28]

Mitigating Shortcut Learning Through Data Augmentation
- Targeted data augmentation strategies can increase sub-dataset diversity and reduce distribution differences, thereby alleviating shortcut learning [29]
- Perspective augmentation creates shared visual contexts between sub-datasets, breaking spurious correlations tied to specific tasks [30]
- The results confirm that carefully selected data augmentation strategies enhance the generalization capabilities of robot policies [34]
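The shortcut-learning failure mode described above can be illustrated with a toy sketch (not the paper's code; all dataset names and actions below are hypothetical): when each sub-dataset pairs one task with one distinctive, task-irrelevant background, a policy can fit the training data from the background alone and then fail as soon as the pairing changes.

```python
# Toy illustration of shortcut learning in fragmented sub-datasets.
# Each sub-dataset pairs one instruction with one distinctive background,
# so a degenerate policy keyed only on the background fits training data
# perfectly yet ignores the instruction entirely.

# (background, instruction, correct action) triples; names are hypothetical.
train = [
    ("lab_table", "pick cup",    "grasp"),
    ("lab_table", "pick cup",    "grasp"),
    ("kitchen",   "open drawer", "pull"),
    ("kitchen",   "open drawer", "pull"),
]

# A "shortcut" policy that memorizes action from the background feature only.
shortcut = {}
for bg, _, action in train:
    shortcut[bg] = action

# In-distribution: the shortcut looks perfect.
assert all(shortcut[bg] == action for bg, _, action in train)

# Out-of-distribution: "pick cup" now appears in the kitchen, but the
# policy keys on the background and outputs the wrong action.
ood_background, ood_instruction = "kitchen", "pick cup"
predicted = shortcut[ood_background]
print(predicted)  # "pull", even though the instruction asks to pick the cup
```

Breaking the one-to-one background/task pairing (the paper's "bridge" targets and perspective augmentation both do this) removes the spurious feature the shortcut depends on.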
From Coordinate Chaos to Spatiotemporal Alignment: Noah's Ark Lab and Fudan University Jointly Propose 4D-VLA, Improving Robot Pretraining Efficiency and Robustness
具身智能之心· 2025-07-06 11:54
Core Insights
- The article introduces 4D-VLA, a pretraining method that integrates 3D spatial information and historical frames to improve model performance in complex scenarios, addressing the limitations of traditional single-frame RGB plus text inputs [4][10][18]

Group 1: Limitations of Existing Paradigms
- Mainstream methods such as OpenVLA rely solely on single-frame RGB images and text instructions, yielding chaotic, high-variance target distributions that slow model convergence [7][8]
- Incomplete input information causes coordinate-system chaos and state chaos, which severely degrade pretraining efficiency [5][9]

Group 2: Proposed Solutions
- 4D-VLA uses depth maps and camera extrinsics to project each pixel into world coordinates, embedding 3D positional encodings that align visual tokens with the robot's coordinate frame and reduce coordinate-system ambiguity [10][18]
- A controlled experiment quantifies the impact of coordinate chaos on VLA models, demonstrating that adding 3D information significantly improves robustness and convergence speed [11][17]

Group 3: Experimental Setup and Results
- The DROID dataset, comprising 76,000 human demonstration trajectories across diverse tasks, serves as the pretraining corpus, while the LIBERO simulation suite is used for downstream evaluation [29][30]
- 4D-VLA outperforms existing methods across tasks, achieving an average success rate of 88.6% over the evaluation settings and demonstrating superior spatial awareness and generalization [33][39]

Group 4: Real-World Evaluation
- In real-world tests, 4D-VLA demonstrated improved precision and robustness on tasks involving spatial generalization, robustness to distractors, precise placement, and structured instruction execution [44][49]
- The model maintained high success rates even under unseen camera angles, indicating effective adaptation to new environments and conditions [57][58]
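The geometric step at the heart of Group 2, lifting each pixel into world coordinates from its depth value, camera intrinsics, and extrinsics, can be sketched as a standard pinhole back-projection. This is a minimal illustration with assumed parameter values, not 4D-VLA's actual implementation:

```python
# Pinhole back-projection sketch: pixel (u, v) with depth -> world-frame point.
# fx, fy, cx, cy are camera intrinsics; R (3x3) and t (3,) are extrinsics
# mapping camera coordinates into the world (robot) frame.

def backproject(u, v, depth, fx, fy, cx, cy, R, t):
    # Pixel -> camera-frame coordinates via the inverse intrinsics.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    p_cam = (x, y, depth)
    # Camera -> world: rotate by R, then translate by t.
    return tuple(
        sum(R[i][j] * p_cam[j] for j in range(3)) + t[i]
        for i in range(3)
    )

# Identity extrinsics: the world frame coincides with the camera frame,
# and the principal-point pixel at 2 m depth lies on the optical axis.
R = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
t = [0.0, 0.0, 0.0]
print(backproject(320, 240, 2.0, 600.0, 600.0, 320.0, 240.0, R, t))
# -> (0.0, 0.0, 2.0)
```

Expressing every frame's pixels in one shared world frame is what lets the model fuse historical observations consistently instead of reasoning in per-frame camera coordinates.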