A Roundup of VLA Foundation Models and Large-Scale Training Work
具身智能之心· 2025-10-08 02:49
Core Insights
- The article summarizes several research papers on Vision-Language-Action (VLA) models and their training strategies, highlighting advances in embodied intelligence and robotics [2][3][5][7][9][11][13][15][17][19].

Group 1: Training Strategies and Model Improvements
- "Training strategies for efficient embodied reasoning" applies Chain-of-Thought (CoT) reasoning to improve the performance and generalization of VLA models, achieving a threefold increase in reasoning speed over standard methods [3].
- "CAST: Counterfactual labels improve instruction following in vision-language-action models" introduces a method for generating counterfactual labels that markedly improves instruction following in VLA models, raising navigation-task success rates by 27% [5].
- "RoboBrain: A unified brain model for robotic manipulation" presents a new dataset, ShareRobot, that strengthens robots' planning and trajectory-prediction capabilities, yielding state-of-the-art performance across a variety of tasks [7].

Group 2: Dataset Development and Evaluation
- The DROID dataset is introduced as a large-scale, diverse robot-manipulation dataset containing 76,000 demonstration trajectories collected over 350 hours; training on it improves the performance and generalization of learned policies [9].
- "ViSA-Flow" proposes a framework for learning from large-scale video data, achieving state-of-the-art robot skill learning, particularly in low-data regimes [11].
- The CORTEXBENCH benchmark evaluates pre-trained visual representations for embodied AI, finding that no single representation excels across all tasks, but that task-specific adaptation can bring substantial performance gains [13].

Group 3: Generalist Robot Policies and Learning Frameworks
- "Effective tuning strategies for generalist robot manipulation policies" identifies the key factors that determine how well Generalist Manipulation Policies (GMPs) fine-tune, establishing a new benchmark for future research [15].
- The CACTI framework targets scalable multi-task learning in robotic systems, demonstrating effective training across varied kitchen tasks in both real and simulated environments [17].
- "R3M: A universal visual representation for robot manipulation" shows that pre-trained visual representations enable data-efficient learning in real-world environments, improving task success rates by over 20% compared with training from scratch; a minimal sketch of this recipe follows this list [19].
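The R3M result rests on a simple recipe: freeze a visual encoder pretrained on large, diverse data and train only a small policy head on a handful of demonstrations. Below is a minimal behavior-cloning sketch of that recipe; torchvision's ImageNet ResNet-50 stands in for the actual R3M encoder, and `PolicyHead` and `bc_step` are illustrative names, not code from the paper.

```python
# Minimal sketch: behavior cloning on top of a frozen pretrained visual encoder.
# An ImageNet ResNet-50 stands in for the R3M representation (assumption).
import torch
import torch.nn as nn
from torchvision import models

class PolicyHead(nn.Module):
    """Small MLP mapping pooled image features to robot actions."""
    def __init__(self, feat_dim: int = 2048, action_dim: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),  # e.g. 6-DoF end-effector delta + gripper
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.mlp(feats)

encoder = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
encoder.fc = nn.Identity()        # expose the 2048-d pooled feature
encoder.requires_grad_(False)     # frozen: only the head is trained
encoder.eval()

policy = PolicyHead()
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

def bc_step(images: torch.Tensor, expert_actions: torch.Tensor) -> float:
    """One behavior-cloning step on (image, action) pairs from demonstrations."""
    with torch.no_grad():
        feats = encoder(images)   # (B, 2048), no gradients through the encoder
    loss = nn.functional.mse_loss(policy(feats), expert_actions)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Because the encoder is frozen, demonstration frames can be featurized once and cached, which is a large part of why this style of transfer is data- and compute-efficient relative to training a visual backbone from scratch.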
Probing the Root Cause of Embodied Robots' Limited Generalization: Augmentation Strategies Remain Effective
具身智能之心· 2025-08-12 00:03
Research Background and Core Issues
- Large-scale robot datasets and high-capacity models have demonstrated strong capabilities across many tasks, but generalization remains limited in scenarios outside the training data distribution [2]
- Shortcut learning, in which models rely on task-irrelevant features rather than true causal relationships, is a key factor limiting generalization [2]

Dataset Diversity and Fragmentation Analysis
- The OXE dataset exhibits significantly lower visual and textual diversity than visual/multimodal datasets, even after adding the recent DROID dataset, which aims to increase diversity [4]
- OXE is also fragmented: its sub-datasets are clearly separated with little overlap, effectively splitting the collection into many small datasets [8]
- This limited diversity stems from inherent constraints of the robot data-collection process [6]

Theoretical Connection Between Dataset Characteristics and Shortcut Learning
- A mathematical framework is established to analyze how composing multiple sub-datasets induces spurious correlations that facilitate shortcut learning [15]
- The distance between task-irrelevant features across sub-datasets strongly influences shortcut learning, with models tending to rely on visual cues rather than textual instructions; the toy example after this entry illustrates the effect [16]

Experimental Validation
- Experiments show that increasing diversity within sub-datasets and reducing differences between them effectively weakens shortcut dependence [18]
- Introducing a "bridge" target significantly improved out-of-distribution (OOD) success rates by breaking the spurious correlations [28]

Mitigating Shortcut Learning Through Data Augmentation
- Targeted data-augmentation strategies can increase sub-dataset diversity and reduce distribution differences, thereby alleviating shortcut learning [29]
- Perspective augmentation creates shared visual contexts across sub-datasets, breaking spurious correlations tied to specific tasks [30]
- The results confirm that carefully chosen data-augmentation strategies enhance the generalization of robot policies [34]
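A toy construction makes the fragmentation-to-shortcut mechanism concrete. In the sketch below (an assumed, deliberately simplified setup, not the paper's actual framework), each sub-dataset ties a task-irrelevant "background" feature to exactly one task, so a linear classifier can follow instructions in-distribution by reading the background alone; out of distribution the shortcut misleads it, and a small "bridge" sub-dataset that decouples background from task restores OOD accuracy, mirroring the bridge-target experiment.

```python
# Toy illustration of shortcut learning from fragmented sub-datasets (assumed,
# simplified setup; not the paper's actual mathematical framework).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500

def sub_dataset(background: float, task_label: int):
    """Feature 0: background appearance (task-irrelevant, unique per sub-dataset).
    Feature 1: noisy instruction signal (the true causal feature)."""
    bg = background + rng.normal(0.0, 0.1, n)
    instr = task_label + rng.normal(0.0, 0.5, n)
    X = np.stack([bg, instr], axis=1)
    y = np.full(n, task_label)
    return X, y

# Fragmented training set: background 0 <-> task 0, background 5 <-> task 1.
Xa, ya = sub_dataset(background=0.0, task_label=0)
Xb, yb = sub_dataset(background=5.0, task_label=1)
clf = LogisticRegression().fit(np.vstack([Xa, Xb]), np.hstack([ya, yb]))

# OOD test: a task-0 instruction issued in front of the task-1 background.
X_ood, y_ood = sub_dataset(background=5.0, task_label=0)
print("in-distribution acc:", clf.score(np.vstack([Xa, Xb]), np.hstack([ya, yb])))
print("OOD acc:", clf.score(X_ood, y_ood))  # near 0: the background shortcut wins

# "Bridge" fix: a small sub-dataset pairing task 0 with background 5 breaks the
# background<->task correlation, forcing the model to read the instruction.
Xc, yc = sub_dataset(background=5.0, task_label=0)
clf2 = LogisticRegression().fit(np.vstack([Xa, Xb, Xc[:100]]),
                                np.hstack([ya, yb, yc[:100]]))
print("OOD acc with bridge:", clf2.score(X_ood, y_ood))  # recovers substantially
```

The same logic explains why the paper's augmentations help: perspective augmentation injects shared visual contexts across sub-datasets, playing the role that the bridge data plays in this toy example.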
From Coordinate Chaos to Spatiotemporal Alignment: Noah's Ark Lab and Fudan University Jointly Propose 4D-VLA, Improving Robot Pretraining Efficiency and Robustness
具身智能之心· 2025-07-06 11:54
Core Insights
- The article introduces 4D-VLA, a new pretraining method that integrates 3D spatial information and historical frames to improve model performance in complex scenarios, addressing the limitations of traditional single-frame RGB-plus-text inputs [4][10][18].

Group 1: Limitations of Existing Paradigms
- Mainstream methods such as OpenVLA rely solely on single-frame RGB images and text instructions, which yields chaotic, high-variance target distributions and slows model convergence [7][8].
- Because the input is incomplete, the models suffer from coordinate-system chaos and state chaos, both of which severely degrade pretraining efficiency [5][9].

Group 2: Proposed Solutions
- 4D-VLA uses depth maps and camera extrinsics to project each pixel into world coordinates, then embeds 3D positional encodings that align visual tokens with the robot's coordinate frame, reducing coordinate-system ambiguity; a sketch of this lifting step follows this entry [10][18].
- A controlled experiment quantifies the impact of coordinate chaos on VLA models, showing that introducing 3D information significantly improves robustness and convergence speed [11][17].

Group 3: Experimental Setup and Results
- The DROID dataset, comprising 76,000 human demonstration trajectories across diverse tasks, serves as the pretraining corpus, while the LIBERO simulation suite is used for downstream evaluation [29][30].
- 4D-VLA outperforms existing methods across tasks, achieving an average success rate of 88.6% over the evaluation settings and demonstrating superior spatial awareness and generalization [33][39].

Group 4: Real-World Evaluation
- In real-world tests, 4D-VLA showed greater precision and robustness on tasks involving spatial generalization, distractor robustness, precise placement, and structured instruction execution [44][49].
- The model maintained high success rates even under unseen camera angles, indicating that it adapts effectively to new environments and conditions [57][58].
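The lifting step in Group 2 is standard pinhole-camera back-projection. Below is a minimal NumPy sketch, assuming a metric depth map, intrinsics `K`, and a camera-to-world extrinsic `T_wc`; the function and variable names are illustrative, not taken from the 4D-VLA code.

```python
# Minimal sketch: back-project pixels into world coordinates using a depth map,
# camera intrinsics K, and extrinsics T_wc (camera-to-world). Illustrative only.
import numpy as np

def pixels_to_world(depth: np.ndarray, K: np.ndarray, T_wc: np.ndarray) -> np.ndarray:
    """depth: (H, W) metric depth; K: (3, 3) intrinsics; T_wc: (4, 4) camera->world.
    Returns (H, W, 3) per-pixel world coordinates."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))      # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)    # homogeneous pixel coords
    rays = pix @ np.linalg.inv(K).T                     # camera-frame rays
    pts_cam = rays * depth[..., None]                   # scale rays by depth
    ones = np.ones((H, W, 1))
    pts_world = np.concatenate([pts_cam, ones], axis=-1) @ T_wc.T
    return pts_world[..., :3]
```

In 4D-VLA these world-frame coordinates then feed a 3D positional encoding added to the visual tokens, so that tokens captured from different camera poses share a single robot-aligned coordinate frame.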