Vision-Language-Action Models (VLAs)
NVIDIA's latest | Build your SOTA model at zero extra cost! The lightweight VLA era is here
具身智能之心· 2025-10-28 04:00
Core Insights
- The article presents VLA-0, a novel approach to robot control that uses a vision-language-action model (VLA) without modifying the existing structure of the vision-language model (VLM) [1][2][3].
- VLA-0 demonstrates that a simple design can achieve top-tier performance, challenging the notion that complexity equates to better functionality in VLA development [14][21].
Summary by Sections
Introduction to VLA-0
- VLA-0 breaks the conventional belief that more complex models yield better results by proposing a "zero modification" approach, allowing the VLM to predict actions in text form without altering its architecture [1][2].
Current Challenges in VLA Development
- Existing VLA models often sacrifice the inherent advantages of VLMs in exchange for added action capabilities, leading to increased complexity and reduced language comprehension [2][3].
Key Design Features of VLA-0
- VLA-0 retains the original VLM structure and instead optimizes the input-output format and training recipe, allowing it to predict actions effectively [3][4].
- The input design includes a system prompt, multi-modal observations, and natural-language task instructions, so the VLM can understand and process tasks without additional encoding [4][5].
Action Decoding Mechanism
- VLA-0 converts continuous actions into text that the VLM can generate, improving action resolution and avoiding vocabulary conflicts (a minimal encode/decode sketch follows this summary) [5][6].
- The training strategy employs masked action augmentation so the model relies on visual and task information rather than mere text-sequence continuation [7][8].
Experimental Results
- VLA-0 outperforms more complex models in both simulated and real-world scenarios, achieving an average success rate of 94.7% in simulation and surpassing all comparable models [10][11].
- In real-world tests, VLA-0 achieved a 60% success rate, significantly higher than SmolVLA's 47.5%, demonstrating its effectiveness in practical applications [11][13].
Conclusions and Future Directions
- The findings suggest that simpler designs can deliver superior performance in VLA development, emphasizing the importance of leveraging existing VLM capabilities [14][15].
- Future work may include large-scale pre-training, inference-speed optimization, and the integration of 3D perception to enhance adaptability and precision in complex environments [18][19][20].
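To make the action-decoding idea concrete, here is a minimal sketch (not the paper's actual code) of how continuous actions can be written out as plain digits for an unmodified VLM to generate and then parsed back into continuous values. The bin count, per-dimension action bounds, and whitespace formatting are illustrative assumptions.

```python
# Sketch of "actions as plain text": each continuous action dimension is mapped to an
# integer on a fixed grid and emitted as ordinary digits, so an unmodified VLM can
# predict it; decoding simply parses the digits back. Grid size and bounds are assumed.
import numpy as np

N_BINS = 1000                      # assumed resolution of the text grid
ACTION_LOW = np.array([-1.0] * 7)  # assumed per-dimension action bounds
ACTION_HIGH = np.array([1.0] * 7)

def action_to_text(action: np.ndarray) -> str:
    """Quantize a continuous action vector to integers and render it as text."""
    normed = (action - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)      # -> [0, 1]
    bins = np.clip(np.round(normed * (N_BINS - 1)), 0, N_BINS - 1)   # -> {0..999}
    return " ".join(str(int(b)) for b in bins)

def text_to_action(text: str) -> np.ndarray:
    """Parse the generated digits back into a continuous action vector."""
    bins = np.array([int(tok) for tok in text.split()], dtype=np.float64)
    return ACTION_LOW + bins / (N_BINS - 1) * (ACTION_HIGH - ACTION_LOW)

a = np.array([0.12, -0.40, 0.55, 0.0, 0.3, -0.9, 1.0])
txt = action_to_text(a)            # "559 300 774 500 649 50 999"
recovered = text_to_action(txt)    # equals `a` up to quantization error
```

Because the generated tokens are ordinary digits already in the VLM's vocabulary, no new embeddings or action heads are required, which is the point of the "zero modification" design.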
Embodied intelligence meets its ImageNet moment: RoboChallenge opens the first large-scale real-robot benchmark suite
机器之心· 2025-10-15 10:44
Core Insights
- RoboChallenge is the world's first large-scale, multi-task benchmark testing platform for robots operating in real physical environments, aimed at providing reliable and comparable evaluation standards for vision-language-action models (VLAs) [1][4][7].
- The platform addresses the lack of unified, open, and reproducible benchmark testing methods in robotics, enabling researchers to validate and compare robot algorithms in a standardized environment [4][7].
Group 1: Platform Features
- RoboChallenge integrates multiple mainstream robots (UR5, Franka Panda, Aloha, ARX-5) to facilitate remote evaluation, providing a large-scale, standardized, and reproducible testing environment [7][14].
- The platform employs a standardized API interface, allowing users to run evaluations without submitting Docker images or model files, thus enhancing accessibility (a hedged sketch of such a remote-evaluation loop follows this summary) [19].
- It features a dual asynchronous control mechanism for precise synchronization of action commands and image acquisition, improving testing efficiency [19].
Group 2: Evaluation Methodology
- The benchmark methodology focuses on controlling human factors, ensuring visual consistency, validating model robustness, and designing protocols for different evaluation objectives [16].
- RoboChallenge introduces a "visual inputs reproduction" method to ensure consistent initial states for each test, enhancing the reliability of evaluations [16].
- The Table30 benchmark set includes 30 carefully designed everyday tasks, significantly more than typical industry evaluations, providing a reliable measure of algorithm performance across varied scenarios [18][23].
Group 3: Community Engagement
- RoboChallenge operates on a fully open principle, offering free evaluation services to researchers worldwide and ensuring transparency by publicly sharing task demonstration data and intermediate results [27].
- The platform encourages community collaboration through challenges, workshops, and data sharing, promoting joint efforts on core problems in embodied intelligence [27].
Group 4: Future Directions
- RoboChallenge aims to expand its capabilities by incorporating mobile robots and dexterous manipulators, enhancing cross-scenario task coverage [29].
- Future evaluations will extend beyond vision-action coordination to multi-modal perception and human-robot collaboration, with more challenging benchmarks planned [29].
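The standardized-API claim is easiest to picture as a client-side loop that keeps the policy on the user's machine and only exchanges observations and actions with the remote robots. The sketch below is purely illustrative: the base URL, endpoint paths, and JSON field names are hypothetical placeholders, not RoboChallenge's published interface.

```python
# Hedged, illustrative client loop for remote real-robot evaluation over HTTP.
# All endpoints and field names are hypothetical; the policy never leaves the client.
import time
import requests

BASE_URL = "https://eval.example.org/api/v1"   # hypothetical benchmark server
SESSION = requests.Session()

def my_policy(observation: dict) -> list[float]:
    """User-side VLA policy; runs locally, never uploaded to the benchmark."""
    # ... run the local model on observation["images"], observation["proprio"] ...
    return [0.0] * 7                            # placeholder action

def run_episode(task_id: str, max_steps: int = 300) -> dict:
    episode = SESSION.post(f"{BASE_URL}/episodes", json={"task": task_id}).json()
    for _ in range(max_steps):
        # Camera frames and robot state are fetched from the remote robot.
        obs = SESSION.get(f"{BASE_URL}/episodes/{episode['id']}/observation").json()
        if obs.get("done"):
            break
        action = my_policy(obs)
        # Actions are posted back; real-time execution happens on the robot side.
        SESSION.post(f"{BASE_URL}/episodes/{episode['id']}/action", json={"action": action})
        time.sleep(0.05)                        # assumed client-side pacing
    return SESSION.get(f"{BASE_URL}/episodes/{episode['id']}/result").json()
```

Keeping evaluation behind an API like this is what removes the need to ship Docker images or model weights to the benchmark operator.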
XRoboToolkit: a low-latency, scalable, high-quality data collection framework
具身智能之心· 2025-08-07 00:03
Core Insights
- The article discusses the development of XRoboToolkit, a cross-platform framework for robot teleoperation, addressing the growing demand for large-scale, high-quality robot demonstration datasets driven by the rapid advancement of vision-language-action models (VLAs) [3].
Limitations of Existing Teleoperation Solutions
- Current teleoperation frameworks have various shortcomings, including limited scalability, complex setup processes, and poor data quality [4][5].
XRoboToolkit's Core Design
- The framework features a three-layer architecture for cross-platform integration, comprising XR-side components, robot-side components, and a service layer for real-time teleoperation and stereo vision [4][5].
Data Streaming and Transmission
- XRoboToolkit employs an asynchronous callback-driven architecture for real-time data transmission from XR hardware to the client, supporting a range of tracking data formats [7][9].
Robot Control Module
- The inverse kinematics (IK) solver is based on quadratic programming (QP) to generate smooth movements, particularly near kinematic singularities, enhancing stability (a minimal QP-style IK sketch follows this summary) [8][10].
XR Unity Application and Stereo Vision Feedback
- The framework has been validated across multiple platforms, demonstrating an average latency of 82 ms (standard deviation 6.32 ms), significantly lower than Open-TeleVision's 121.5 ms [11][13].
- Data quality was verified by collecting 100 data points, achieving a 100% success rate over a 30-minute continuous operation [11][14].
Application Interface and Features
- The application interface includes five panels for network status, tracking configuration, remote vision, data collection, and system diagnostics, supporting a variety of devices [16].
- Stereo vision capabilities are optimized for depth perception, with the PICO 4 Ultra outperforming other devices on visual quality metrics [16].
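The QP-based IK claim can be illustrated with the unconstrained core of such a solver: a damped least-squares step, which is a small quadratic program with a closed-form minimizer. This is a sketch of the general technique under assumed cost weights, not XRoboToolkit's actual solver; joint-limit or velocity constraints would require a full QP solver such as OSQP instead of the closed-form solve below.

```python
# Differential IK as a tiny QP: minimize 0.5*||J dq - dx||^2 + 0.5*damping*||dq||^2.
# The damping term keeps dq bounded and smooth near kinematic singularities.
import numpy as np

def damped_ik_step(J: np.ndarray, dx: np.ndarray, damping: float = 1e-2) -> np.ndarray:
    """One differential-IK step: joint-velocity update dq for a task-space target dx.

    J:  (m, n) manipulator Jacobian at the current configuration.
    dx: (m,)   desired task-space displacement for this control tick.
    """
    n = J.shape[1]
    H = J.T @ J + damping * np.eye(n)   # QP Hessian (positive definite thanks to damping)
    g = J.T @ dx                        # linear term from the tracking objective
    return np.linalg.solve(H, g)        # dq = argmin 0.5*dq'H dq - g'dq

# Example: a near-singular 2-DoF planar Jacobian still yields a small, smooth step.
J = np.array([[1.0, 1.0],
              [0.0, 1e-4]])            # second row nearly degenerate
dq = damped_ik_step(J, np.array([0.01, 0.0]))
```

Without the damping term the same solve would blow up as the Jacobian loses rank, which is exactly the instability the QP formulation is meant to avoid.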
Being-H0: a VLA model that learns dexterous manipulation from large-scale human videos
具身智能之心· 2025-07-23 08:45
Core Insights
- The article discusses advances in vision-language-action models (VLAs) and the challenges the robotics field faces in complex dexterous manipulation tasks due to data limitations [3][4].
Group 1: Research Background and Motivation
- Large language models and multimodal models have made significant progress, but robotics still lacks a transformative moment akin to "ChatGPT" [3].
- Existing VLAs struggle with dexterous tasks because they rely on synthetic data or limited teleoperation demonstrations, which are especially scarce for fine manipulation due to high hardware costs [3].
- Human videos contain rich real-world manipulation data, but learning from them raises challenges in data heterogeneity, hand-motion quantization, cross-modal reasoning, and transfer to robot control [3].
Group 2: Core Methodology
- The article introduces Physical Instruction Tuning, a paradigm with three phases (pre-training, physical space alignment, and post-training) that transfers human hand-movement knowledge to robotic manipulation [4].
Group 3: Pre-training Phase
- The pre-training phase treats human hands as ideal manipulators and robotic hands as simplified versions, training a foundational VLA on large-scale human videos [6].
- Inputs include visual information, language instructions, and parameterized hand movements; training optimizes the mapping from vision and language to motion [6][8].
Group 4: Physical Space Alignment
- Physical space alignment addresses interference from differing camera parameters and coordinate systems through weak-perspective projection alignment and motion-distribution balancing [10][12].
- The model adapts to specific robots by projecting the robot's proprioceptive state into the model's embedding space and generating executable actions through learnable query tokens [13].
Group 5: Key Technologies
- The article discusses motion tokenization and cross-modal fusion, emphasizing the need to retain fine motion precision while discretizing continuous movements [14][17].
- Hand movements are decomposed into wrist and finger components, each tokenized separately, with reconstruction accuracy enforced through a combination of loss functions (a hedged tokenization sketch follows this summary) [18].
Group 6: Dataset and Experimental Results
- The UniHand dataset, comprising over 440,000 task trajectories and 1.3 billion frames, supports large-scale pre-training and covers diverse tasks and data sources [21].
- Experimental results show that Being-H0 outperforms baseline models on hand-motion generation and translation tasks, with better spatial accuracy and semantic alignment [22][25].
Group 7: Long-Sequence Motion Generation
- The model effectively generates long motion sequences (2-10 seconds) using soft-format decoding, which helps maintain trajectory stability [26].
Group 8: Real Robot Operation Experiments
- In practical tasks such as grasping and placing, Being-H0 achieves significantly higher success rates than baseline models, reaching 65% and 60% on unseen-toy and cluttered-scene tasks, respectively [28].
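A hedged sketch of the "decompose, then tokenize separately" idea: split a hand pose into a wrist part and a finger-articulation part and give each its own discrete codebook. The dimensions and the simple uniform quantizer below are assumptions for illustration; the paper itself uses learned tokenizers trained with reconstruction losses rather than fixed uniform bins.

```python
# Illustrative wrist/finger decomposition with separate quantizers (not Being-H0's code).
import numpy as np

WRIST_DIM, FINGER_DIM = 6, 48        # assumed: 3 rot + 3 trans; 48 finger pose params
WRIST_BINS, FINGER_BINS = 256, 256   # assumed per-part codebook sizes

def tokenize(part: np.ndarray, low: float, high: float, n_bins: int) -> np.ndarray:
    """Uniformly quantize one part of the pose into integer tokens."""
    normed = np.clip((part - low) / (high - low), 0.0, 1.0)
    return np.round(normed * (n_bins - 1)).astype(np.int64)

def detokenize(tokens: np.ndarray, low: float, high: float, n_bins: int) -> np.ndarray:
    """Map integer tokens back to continuous values (reconstruction)."""
    return low + tokens / (n_bins - 1) * (high - low)

pose = np.random.uniform(-1.0, 1.0, WRIST_DIM + FINGER_DIM)   # stand-in hand pose
wrist, fingers = pose[:WRIST_DIM], pose[WRIST_DIM:]

wrist_tokens = tokenize(wrist, -1.0, 1.0, WRIST_BINS)
finger_tokens = tokenize(fingers, -1.0, 1.0, FINGER_BINS)

# Reconstruction error shrinks as the codebooks grow; a learned tokenizer instead spends
# capacity on the fine finger motions that matter most for dexterous manipulation.
wrist_err = np.abs(detokenize(wrist_tokens, -1.0, 1.0, WRIST_BINS) - wrist).max()
finger_err = np.abs(detokenize(finger_tokens, -1.0, 1.0, FINGER_BINS) - fingers).max()
```

Splitting the pose lets each part be quantized at a resolution suited to its dynamics, which is the motivation the summary gives for tokenizing wrist and fingers separately.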