Vision-Language-Action Models (VLAs)
Embodied intelligence reaches its ImageNet moment: RoboChallenge opens the first large-scale real-robot benchmark suite
机器之心· 2025-10-15 10:44
Core Insights
- RoboChallenge is the world's first large-scale, multi-task benchmark platform for robots operating in real physical environments, aimed at providing reliable and comparable evaluation standards for vision-language-action models (VLAs) [1][4][7]
- The platform addresses the lack of unified, open, and reproducible benchmarking methods in robotics, enabling researchers to validate and compare robot algorithms in a standardized environment [4][7]

Group 1: Platform Features
- RoboChallenge integrates multiple mainstream robots (UR5, Franka Panda, Aloha, ARX-5) for remote evaluation, providing a large-scale, standardized, and reproducible testing environment [7][14]
- The platform exposes a standardized API, so users can run evaluations without submitting Docker images or model files, improving accessibility (a hypothetical client loop against such an interface is sketched after this summary) [19]
- It uses a dual asynchronous control mechanism to precisely synchronize action commands with image acquisition, improving testing efficiency [19]

Group 2: Evaluation Methodology
- The benchmarking methodology focuses on controlling human factors, ensuring visual consistency, validating model robustness, and designing protocols for different evaluation objectives [16]
- RoboChallenge introduces a "visual inputs reproduction" method to ensure a consistent initial state for each trial, improving the reliability of evaluations (see the consistency-check sketch after this summary) [16]
- The Table30 benchmark set comprises 30 carefully designed everyday tasks, significantly more than typical industry evaluations, providing a reliable measure of algorithm performance across varied scenarios [18][23]

Group 3: Community Engagement
- RoboChallenge operates on a fully open principle, offering free evaluation services to researchers worldwide and ensuring transparency by publicly sharing task demonstration data and intermediate results [27]
- The platform encourages community collaboration through challenges, workshops, and data sharing, promoting joint work on core problems in embodied intelligence [27]

Group 4: Future Directions
- RoboChallenge plans to expand by incorporating mobile robots and dexterous manipulators, strengthening cross-scenario task testing [29]
- Future evaluations will extend beyond vision-action coordination to multimodal perception and human-robot collaboration, with more challenging benchmarks planned [29]
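The summary above mentions a standardized API that lets users run evaluations without submitting Docker images or model files, together with dual asynchronous handling of commands and images, but gives no interface details. The sketch below shows what a minimal remote-evaluation client loop against such an interface could look like; the base URL, endpoint paths, JSON fields, and the `policy.predict` call are all hypothetical placeholders, not RoboChallenge's actual API.

```python
import time
import requests  # assumed HTTP transport; the real platform may use a different protocol

BASE_URL = "https://example-eval-server/api/v1"   # hypothetical endpoint
CONTROL_PERIOD_S = 0.1                            # hypothetical 10 Hz control rate


def run_episode(policy, task_id: str, max_steps: int = 500) -> bool:
    """Drive one remote evaluation episode: poll observations, submit actions."""
    session = requests.Session()
    session.post(f"{BASE_URL}/tasks/{task_id}/start")  # hypothetical route

    for _ in range(max_steps):
        # Observation polling and action submission are separate requests,
        # mirroring the idea of decoupled image and command streams.
        obs = session.get(f"{BASE_URL}/tasks/{task_id}/observation").json()
        action = policy.predict(obs["images"], obs["instruction"])  # user-side VLA inference
        reply = session.post(
            f"{BASE_URL}/tasks/{task_id}/action",
            json={"action": [float(a) for a in action]},
        ).json()
        if reply.get("done"):
            return bool(reply.get("success"))
        time.sleep(CONTROL_PERIOD_S)
    return False
```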
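The "visual inputs reproduction" step is described only as a way to restore a consistent initial state for each trial. One minimal (assumed) realization is to compare the live camera frame against a stored reference frame before starting an episode; the mean-absolute-difference metric and threshold below are illustrative choices, not the platform's documented criterion.

```python
import numpy as np


def initial_state_matches(reference: np.ndarray, live: np.ndarray,
                          max_mean_abs_diff: float = 8.0) -> bool:
    """Return True if the live frame is close enough to the stored reference frame.

    Both frames are HxWx3 uint8 images from the same fixed camera; the
    threshold (in 0-255 intensity units) is an illustrative assumption.
    """
    if reference.shape != live.shape:
        raise ValueError("reference and live frames must share a resolution")
    diff = np.abs(reference.astype(np.float32) - live.astype(np.float32))
    return float(diff.mean()) <= max_mean_abs_diff
```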
XRoboToolkit: a low-latency, scalable, high-quality data collection framework
具身智能之心· 2025-08-07 00:03
Core Insights
- The article presents XRoboToolkit, a cross-platform framework for robot teleoperation, addressing the growing demand for large-scale, high-quality robot demonstration datasets driven by the rapid advance of vision-language-action models (VLAs) [3]

Limitations of Existing Teleoperation Solutions
- Current teleoperation frameworks suffer from various shortcomings, including limited scalability, complex setup, and poor data quality [4][5]

XRoboToolkit's Core Design
- The framework uses a three-layer architecture for cross-platform integration, comprising XR-side components, robot-side components, and a service layer for real-time teleoperation and stereo vision [4][5]

Data Streaming and Transmission
- XRoboToolkit employs an asynchronous, callback-driven architecture for real-time data transmission from XR hardware to the client, supporting a range of tracking data formats (a minimal callback-and-queue sketch follows this summary) [7][9]

Robot Control Module
- The inverse kinematics (IK) solver is formulated as a quadratic program (QP) to generate smooth motion, particularly near kinematic singularities, improving stability (see the IK step sketch after this summary) [8][10]

XR Unity Application and Stereo Vision Feedback
- The framework has been validated across multiple platforms, showing an average latency of 82 ms (standard deviation 6.32 ms), significantly lower than Open-TeleVision's 121.5 ms [11][13]
- Data quality was verified through the collection of 100 data points, achieving a 100% success rate over 30 minutes of continuous operation [11][14]

Application Interface and Features
- The application interface includes five panels covering network status, tracking configuration, remote vision, data collection, and system diagnostics, and supports a range of devices [16]
- Stereo vision is optimized for depth perception, with the PICO 4 Ultra scoring best on visual quality metrics [16]
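The asynchronous, callback-driven streaming layer is described above only at a high level. A common way to structure it is to have the XR runtime's pose callback push timestamped packets into a bounded queue that a background sender thread drains, so the callback never blocks on the network; the class, field, and callback names below are illustrative, not XRoboToolkit's actual API.

```python
import queue
import threading
import time
from dataclasses import dataclass, field


@dataclass
class PosePacket:
    """Illustrative tracking payload: timestamp plus named device poses."""
    timestamp: float
    poses: dict = field(default_factory=dict)  # e.g. {"right_controller": (x, y, z, qx, qy, qz, qw)}


class PoseStreamer:
    """Decouples the XR runtime callback from network transmission via a queue."""

    def __init__(self, send_fn, max_queue: int = 256):
        self._send_fn = send_fn                      # injected transport, e.g. a socket sender
        self._queue = queue.Queue(maxsize=max_queue)
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def on_pose_update(self, poses: dict) -> None:
        """Callback invoked by the XR runtime; must return quickly, so it only enqueues."""
        try:
            self._queue.put_nowait(PosePacket(timestamp=time.time(), poses=poses))
        except queue.Full:
            pass  # drop the newest packet under backpressure; dropping the oldest is another option

    def _drain(self) -> None:
        while True:
            packet = self._queue.get()
            self._send_fn(packet)                    # transmit to the robot-side client
```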
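The robot control module is described as a QP-based IK solver that stays smooth near singularities, without a concrete formulation. The sketch below implements only the unconstrained core of such a solver: a damped-least-squares step, which is the closed-form solution of min_dq ||J dq - dx||^2 + λ||dq||^2. Joint-step limits are handled here by clipping as a simplification, whereas a full QP would encode them as constraints.

```python
import numpy as np


def ik_step(jacobian: np.ndarray, task_error: np.ndarray,
            damping: float = 1e-2, dq_limit: float = 0.1) -> np.ndarray:
    """One velocity-level IK update.

    Solves min_dq ||J dq - dx||^2 + damping * ||dq||^2 in closed form; the
    damping term keeps dq bounded near kinematic singularities. Joint-step
    limits are enforced by clipping as a stand-in for QP inequality constraints.
    """
    J = jacobian                       # (6, n) end-effector Jacobian
    n = J.shape[1]
    H = J.T @ J + damping * np.eye(n)  # regularized normal equations
    g = J.T @ task_error               # task_error (dx): desired end-effector displacement this step
    dq = np.linalg.solve(H, g)
    return np.clip(dq, -dq_limit, dq_limit)
```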
Being-H0: a VLA model that learns dexterous manipulation from large-scale human videos
具身智能之心· 2025-07-23 08:45
Core Insights
- The article reviews progress in vision-language-action models (VLAs) and the challenges the robotics field faces in complex dexterous manipulation, which stem largely from data limitations [3][4]

Group 1: Research Background and Motivation
- Large language models and multimodal models have made significant progress, but robotics still lacks a transformative "ChatGPT" moment [3]
- Existing VLAs struggle with dexterous tasks because they rely on synthetic data or limited teleoperation demonstrations, and fine-manipulation data is especially costly to collect due to hardware expenses [3]
- Human videos contain rich real-world manipulation data, but learning from them raises challenges in data heterogeneity, hand motion quantization, cross-modal reasoning, and transfer to robot control [3]

Group 2: Core Methodology
- The article introduces Physical Instruction Tuning, a paradigm with three phases (pre-training, physical space alignment, and post-training) for transferring human hand-movement knowledge to robotic manipulation [4]

Group 3: Pre-training Phase
- The pre-training phase treats human hands as ideal manipulators, with robotic hands viewed as simplified versions, and trains a foundational VLA on large-scale human videos [6]
- Inputs comprise visual information, language instructions, and parameterized hand motions, and training optimizes the mapping from vision and language to motion [6][8]

Group 4: Physical Space Alignment
- Physical space alignment compensates for differing camera parameters and coordinate systems via weak-perspective-projection alignment and motion distribution balancing (a minimal least-squares fit is sketched after this summary) [10][12]
- The model adapts to a specific robot by projecting the robot's proprioceptive state into the model's embedding space and generating executable actions through learnable query tokens [13]

Group 5: Key Technologies
- Motion tokenization and cross-modal fusion are central, with an emphasis on retaining fine motion precision while discretizing continuous movements [14][17]
- Hand motion is decomposed into wrist and finger components, each tokenized separately, with reconstruction accuracy enforced by a combination of loss functions (see the quantizer sketch after this summary) [18]

Group 6: Dataset and Experimental Results
- The UniHand dataset, comprising over 440,000 task trajectories and 1.3 billion frames, supports large-scale pre-training and spans diverse tasks and data sources [21]
- Experimental results show that Being-H0 outperforms baseline models on hand motion generation and translation tasks, with better spatial accuracy and semantic alignment [22][25]

Group 7: Long-Sequence Motion Generation
- The model generates long motion sequences (2-10 seconds) using soft-format decoding, which helps maintain trajectory stability [26]

Group 8: Real-Robot Experiments
- On practical grasp-and-place tasks, Being-H0 achieves significantly higher success rates than baseline models, reaching 65% and 60% on unseen-toy and cluttered-scene tasks, respectively [28]
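Weak-perspective-projection alignment is summarized above without equations. A minimal version fits a single scale and a 2D translation that map 3D hand keypoints onto detected 2D keypoints, which sidesteps dataset-specific camera intrinsics; the ordinary-least-squares fit below is an illustrative reconstruction of that idea, not the paper's exact procedure.

```python
import numpy as np


def fit_weak_perspective(keypoints_3d: np.ndarray, keypoints_2d: np.ndarray):
    """Fit scale s and translation t so that s * X[:, :2] + t ≈ x_2d.

    keypoints_3d: (K, 3) hand joints in a camera-aligned frame.
    keypoints_2d: (K, 2) detected joints in normalized image coordinates.
    Returns (s, t) from an ordinary least-squares fit.
    """
    X = keypoints_3d[:, :2]
    x = keypoints_2d
    X_c = X - X.mean(axis=0)           # center both point sets
    x_c = x - x.mean(axis=0)
    s = float((X_c * x_c).sum() / (X_c ** 2).sum())   # shared scale, 1-D least squares
    t = x.mean(axis=0) - s * X.mean(axis=0)           # translation from the centroids
    return s, t
```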
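The summary notes that hand motion is split into wrist and finger components and that each stream is tokenized separately while preserving fine precision. The uniform per-dimension quantizer below illustrates the tokenize/detokenize round trip under that split; the dimensionalities, value range, and bin count are assumptions for illustration, and the paper presumably uses a learned codebook rather than uniform bins.

```python
import numpy as np

WRIST_DIM, FINGER_DIM = 6, 45          # illustrative: 6-DoF wrist pose + 45-D finger articulation
NUM_BINS = 256                         # illustrative vocabulary size per dimension


def tokenize(motion: np.ndarray, low: float = -1.0, high: float = 1.0) -> np.ndarray:
    """Uniformly quantize a normalized motion vector into integer tokens."""
    clipped = np.clip(motion, low, high)
    return np.round((clipped - low) / (high - low) * (NUM_BINS - 1)).astype(np.int64)


def detokenize(tokens: np.ndarray, low: float = -1.0, high: float = 1.0) -> np.ndarray:
    """Map integer tokens back to continuous values (bin centers)."""
    return tokens.astype(np.float64) / (NUM_BINS - 1) * (high - low) + low


def split_and_tokenize(hand_motion: np.ndarray):
    """Tokenize the wrist and finger streams separately, as the summary describes."""
    wrist = hand_motion[:WRIST_DIM]
    fingers = hand_motion[WRIST_DIM:WRIST_DIM + FINGER_DIM]
    return tokenize(wrist), tokenize(fingers)
```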