Hundred-Million-Scale Short-Video Data Breaks Through the Embodied-Intelligence Scaling Law! Being-H0 Proposes a New VLA Training Paradigm
量子位· 2025-07-24 07:28
Core Viewpoint
- The article discusses advances in embodied intelligence, focusing on the Being-H0 model, which uses human hand-movement data to strengthen robot action capabilities and to address the data-scarcity problem in vision-language-action (VLA) models [1][30].

Group 1: Data Scarcity and Solutions
- A lack of real-world data is holding back VLA models: existing datasets fall short of the required scale of over one hundred million training samples by roughly three orders of magnitude [2].
- The research team from Peking University and BeingBeyond addressed this by building a large-scale dataset from human operation videos, reaching a size in the hundreds of millions of samples [3][17].

Group 2: Being-H0 Model and Innovations
- Being-H0 is the first large-scale VLA model pre-trained on human hand data from video, using a novel "physical instruction tuning" framework to map human hand movements into robot action spaces [5][10].
- The model is built on the premise that human hand movements serve as the most complete execution template for diverse robotic end-effectors, letting robots benefit from human motion knowledge [6][10].

Group 3: Training Framework
- The physical instruction tuning framework has three key components: pre-training on millions of human operation videos, physical space alignment to remove data-source heterogeneity, and post-training for effective skill transfer to real robots (a minimal code sketch follows this summary) [12][13][14].
- The framework addresses the heterogeneity between 2D multimodal data and 3D robot action spaces, improving the model's ability to learn and generate actions [12].

Group 4: UniHand Dataset
- The UniHand dataset, comprising over 150 million human hand-motion samples, was systematically constructed to meet the training-data needs of the physical instruction tuning framework [20][21].
- Even with just 2.5 million samples from this dataset, the model showed significant gains in hand-motion prediction and real-robot tasks [21].

Group 5: Experimental Validation
- Comprehensive real-robot experiments validated the effectiveness of Being-H0, showing it outperformed both its base model InternVL3 and NVIDIA's GR00T N1.5 across a range of tasks [22][24].
- The experiments confirmed that the data-construction strategy substantially improves the model's ability to learn human action knowledge from video, raising task success rates [24].

Group 6: Future Directions
- The BeingBeyond team is focused on core technologies for embodied intelligence, dexterous manipulation, and full-body motion control, aiming to bring robots into everyday life [30].
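To make the three-stage recipe above concrete, here is a minimal Python sketch of how pre-training on human videos, physical space alignment, and post-training on robot demonstrations could fit together. All class and function names, shapes, and numbers are illustrative assumptions, not the Being-H0 codebase.

```python
# A minimal sketch of the three-stage "physical instruction tuning" recipe
# described above. Every name and number here is an illustrative assumption,
# not the authors' implementation.
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class HandMotionSample:
    """One supervised frame: visual features, instruction, and a hand pose."""
    image_feat: np.ndarray     # placeholder visual embedding of the video frame
    instruction: str           # natural-language task description
    wrist_pose: np.ndarray     # 6-DoF wrist pose (xyz translation + rotation)
    finger_angles: np.ndarray  # finger joint angles


def align_physical_space(sample: HandMotionSample,
                         scale: float,
                         offset: np.ndarray) -> HandMotionSample:
    """Stage 2: map hand poses recorded by heterogeneous cameras/coordinate
    frames into one shared reference frame before mixing data sources."""
    wrist = sample.wrist_pose.copy()
    wrist[:3] = wrist[:3] * scale + offset  # crude weak-perspective-style rescale
    return HandMotionSample(sample.image_feat, sample.instruction,
                            wrist, sample.finger_angles)


def pretrain_on_human_videos(samples: List[HandMotionSample]) -> dict:
    """Stage 1: learn the (vision, language) -> hand-motion mapping.
    Real training is replaced by a summary statistic standing in for weights."""
    wrists = np.stack([s.wrist_pose for s in samples])
    return {"wrist_prior": wrists.mean(axis=0)}


def posttrain_on_robot_demos(foundation: dict,
                             robot_actions: List[np.ndarray]) -> dict:
    """Stage 3: adapt the pre-trained model to a specific robot's action space
    using a much smaller set of robot demonstrations."""
    foundation["action_scale"] = float(np.mean([a.std() for a in robot_actions]))
    return foundation


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    human = [HandMotionSample(rng.normal(size=128), "pick up the cup",
                              rng.normal(size=6), rng.uniform(0, 1.5, size=15))
             for _ in range(100)]
    human = [align_physical_space(s, scale=0.9, offset=np.zeros(3)) for s in human]
    model = pretrain_on_human_videos(human)
    model = posttrain_on_robot_demos(model, [rng.normal(size=(50, 7)) for _ in range(10)])
    print(sorted(model.keys()))
```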
Being-H0: A VLA Model That Learns Dexterous Manipulation from Large-Scale Human Videos
具身智能之心· 2025-07-23 08:45
Core Insights
- The article discusses advances in vision-language-action models (VLAs) and the challenges facing robotics, particularly in complex dexterous-manipulation tasks constrained by data limitations [3][4].

Group 1: Research Background and Motivation
- Large language models and multimodal models have made major progress, but robotics still lacks a transformative "ChatGPT moment" [3].
- Existing VLAs struggle with dexterous tasks because they rely on synthetic data or limited teleoperation demonstrations, which are especially scarce for fine manipulation given high hardware costs [3].
- Human videos contain rich real-world manipulation data, but learning from them raises challenges such as data heterogeneity, hand-motion quantization, cross-modal reasoning, and transfer to robot control [3].

Group 2: Core Methodology
- The article introduces Physical Instruction Tuning, a paradigm with three phases (pre-training, physical space alignment, and post-training) that transfers human hand-movement knowledge to robotic manipulation [4].

Group 3: Pre-training Phase
- The pre-training phase treats the human hand as an ideal manipulator and robotic hands as simplified versions of it, training a foundational VLA on large-scale human videos [6].
- Inputs include visual information, language instructions, and parameterized hand movements; the objective optimizes the mapping from vision and language to motion [6][8].

Group 4: Physical Space Alignment
- Physical space alignment removes interference from differing camera parameters and coordinate systems through weak-perspective projection alignment and motion-distribution balancing [10][12].
- The model adapts to specific robots by projecting the robot's proprioceptive state into the model's embedding space and generating executable actions through learnable query tokens [13].

Group 5: Key Technologies
- The article covers motion tokenization and cross-modal fusion, stressing the need to retain fine motion precision while discretizing continuous movements [14][17].
- Hand movements are decomposed into wrist and finger motion, each tokenized separately, with reconstruction accuracy enforced by a combination of loss functions (see the tokenization sketch after this summary) [18].

Group 6: Dataset and Experimental Results
- The UniHand dataset, comprising over 440,000 task trajectories and 1.3 billion frames, supports large-scale pre-training with diverse tasks and data sources [21].
- Experiments show that Being-H0 outperforms baseline models on hand-motion generation and translation tasks, with better spatial accuracy and semantic alignment [22][25].

Group 7: Long Sequence Motion Generation
- The model generates long motion sequences (2-10 seconds) effectively using soft-format decoding, which helps keep trajectories stable [26].

Group 8: Real Robot Operation Experiments
- On practical tasks such as grasping and placing, Being-H0 achieves markedly higher success rates than baseline models, reaching 65% and 60% success on unseen-toy and cluttered-scene tasks, respectively [28].
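The wrist/finger decomposition and separate tokenization described in Group 5 can be illustrated with a toy quantizer. The paper uses a learned motion tokenizer; the uniform binning below is only a stand-in, and every range and bin count is an assumed example.

```python
# A minimal sketch of separate wrist/finger motion tokenization, using plain
# uniform quantization as a stand-in for a learned tokenizer. Ranges and bin
# counts are illustrative assumptions.
import numpy as np


def tokenize(values: np.ndarray, low: float, high: float, n_bins: int) -> np.ndarray:
    """Map continuous motion values into discrete token ids."""
    clipped = np.clip(values, low, high)
    return np.round((clipped - low) / (high - low) * (n_bins - 1)).astype(int)


def detokenize(tokens: np.ndarray, low: float, high: float, n_bins: int) -> np.ndarray:
    """Recover approximate continuous values from token ids."""
    return tokens / (n_bins - 1) * (high - low) + low


# Wrist translation (metres) and finger joint angles (radians) get separate
# codebooks, mirroring the wrist/finger decomposition described above.
wrist = np.array([0.12, -0.05, 0.30])
fingers = np.random.uniform(0.0, 1.5, size=15)

wrist_tokens = tokenize(wrist, low=-1.0, high=1.0, n_bins=1024)
finger_tokens = tokenize(fingers, low=0.0, high=2.0, n_bins=256)

# Reconstruction error is what the combined loss keeps small; with uniform
# bins it is bounded by half a bin width.
wrist_err = np.abs(detokenize(wrist_tokens, -1.0, 1.0, 1024) - wrist).max()
finger_err = np.abs(detokenize(finger_tokens, 0.0, 2.0, 256) - fingers).max()
print(f"max wrist error: {wrist_err:.4f} m, max finger error: {finger_err:.4f} rad")
```

With a learned codebook the interface stays the same: continuous motion in, discrete tokens out, with a reconstruction loss keeping the round trip accurate enough for fine manipulation.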
5,700 QA Pairs Put AI's Spatial Awareness to the Test! A New Spatial Intelligence Benchmark from Zhejiang University, UESTC, and CUHK
量子位· 2025-06-02 04:13
Core Insights
- The article discusses the limitations of current Vision-Language Models (VLMs) in spatial reasoning and multi-perspective understanding, highlighting the need for AI systems that can collaborate effectively with humans [1][3][20].

Group 1: ViewSpatial-Bench Development
- Research teams from Zhejiang University, the University of Electronic Science and Technology of China, and The Chinese University of Hong Kong developed a new benchmark, ViewSpatial-Bench, to evaluate VLMs' spatial-reasoning capabilities across multiple perspectives [4][33].
- ViewSpatial-Bench includes 5 task types and over 5,700 question-answer pairs, assessing models from both camera and human perspectives (a minimal scoring sketch follows this summary) [5][7].
- The benchmark targets VLMs' fragmented understanding of spatial information, which often causes performance problems in multi-perspective tasks [2][20].

Group 2: Model Performance Evaluation
- Evaluation of leading models, including GPT-4o and Gemini 2.0, revealed that their understanding of spatial relationships is still inadequate, with low overall accuracy [19][20].
- The results show a significant performance gap between camera-perspective and human-perspective tasks, suggesting current VLMs lack a unified spatial cognitive framework [22][23].
- The Multi-View Spatial Model (MVSM) was introduced to strengthen cross-perspective spatial understanding, achieving a 46.24% absolute improvement over its backbone model [27][28].

Group 3: Future Directions
- The findings reveal a structural imbalance in the perspective distribution of training data, pointing to future work on data construction and model optimization [26].
- MVSM and ViewSpatial-Bench together offer a feasible path toward human-like spatial cognition in AI systems, which is crucial for the next generation of robots and multimodal assistants [34].
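A ViewSpatial-Bench-style evaluation essentially scores accuracy separately per perspective so the camera-vs-human gap can be reported. The sketch below assumes a simple item layout and a stub answer_question function; neither reflects the benchmark's real interface.

```python
# A minimal sketch of per-perspective accuracy scoring, in the spirit of
# ViewSpatial-Bench. The data layout and the model stub are assumptions.
from collections import defaultdict
from typing import Callable, Dict, List


def evaluate(qa_pairs: List[dict],
             answer_question: Callable[[str, str], str]) -> Dict[str, float]:
    """Score accuracy separately for camera-perspective and human-perspective items."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in qa_pairs:
        perspective = item["perspective"]  # "camera" or "human"
        prediction = answer_question(item["image"], item["question"])
        total[perspective] += 1
        correct[perspective] += int(prediction.strip().lower() == item["answer"].strip().lower())
    return {p: correct[p] / total[p] for p in total}


# Tiny illustrative run with a dummy model that always answers "left".
sample_qa = [
    {"image": "img_001.jpg",
     "question": "From the camera view, is the cup left or right of the plate?",
     "answer": "left", "perspective": "camera"},
    {"image": "img_002.jpg",
     "question": "From the person's view, is the door left or right of them?",
     "answer": "right", "perspective": "human"},
]
print(evaluate(sample_qa, lambda image, question: "left"))
# -> {'camera': 1.0, 'human': 0.0}, i.e. the kind of perspective gap reported above
```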
Comprehensive Evaluation of Multimodal Models' Video OCR Capabilities: Gemini Reaches Only 73.7% Accuracy
量子位· 2025-05-30 07:10
Core Viewpoint
- The article discusses the challenges and progress of Multimodal Large Language Models (MLLMs) in optical character recognition (OCR) on dynamic video content, highlighting the need for better evaluation frameworks and model capabilities in this area [1][2][5].

Group 1: Model Capabilities and Challenges
- MLLMs show excellent OCR capabilities on static images but face significant challenges in dynamic video scenarios [1][2].
- The MME-VideoOCR framework aims to systematically evaluate and improve MLLMs' perception, understanding, and reasoning in video OCR [3].
- Current MLLM performance is limited by factors such as motion blur, lighting changes, and complex temporal associations in video, all of which complicate text recognition [5][21].

Group 2: Data and Task Design
- MME-VideoOCR defines a detailed task system with 10 major task categories and 25 individual tasks, targeting high-level capabilities such as temporal understanding and complex reasoning [6][15].
- A high-quality, large-scale dataset was built, including 1,464 curated video clips and 2,000 manually annotated question-answer pairs, to ensure evaluation accuracy [4][12].

Group 3: Evaluation Findings
- An in-depth evaluation of 18 mainstream MLLMs found that even the best-performing model, Gemini-2.5 Pro, reached only 73.7% accuracy, leaving substantial room for improvement in video OCR [7][20].
- The gap between closed-source and open-source models is significant, with many open-source models scoring below 60% accuracy [20].

Group 4: Key Limitations
- MLLMs struggle with tasks that require long-range temporal integration and dynamic text understanding, exposing weaknesses in temporal reasoning [21].
- Models tend to over-rely on prior language knowledge rather than effectively using visual information for video text comprehension [22].

Group 5: Optimization Strategies
- Higher-resolution visual inputs and more complete temporal frame coverage are crucial for improving MLLM performance on dynamic video (a rough trade-off sketch follows this summary) [23].
- However, larger visual inputs can make it harder to focus on the target information, calling for better information extraction and processing [23].
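The trade-off noted in Group 5 (more frames and higher resolution improve coverage of on-screen text but inflate the visual input) can be made concrete with a small back-of-the-envelope sketch. The frame counts, resolutions, and patch size below are assumed values, not MME-VideoOCR settings.

```python
# A minimal sketch of the frame-sampling vs. input-size trade-off: uniform
# sampling improves temporal coverage of on-screen text, but each extra frame
# or resolution step adds vision tokens the MLLM must attend over. All numbers
# are illustrative assumptions.
from typing import List, Tuple


def sample_frame_indices(total_frames: int, n_samples: int) -> List[int]:
    """Pick n_samples frame indices spread uniformly across the video."""
    if n_samples >= total_frames:
        return list(range(total_frames))
    step = total_frames / n_samples
    return [int(i * step + step / 2) for i in range(n_samples)]


def visual_token_budget(n_frames: int, resolution: Tuple[int, int],
                        patch: int = 14) -> int:
    """Rough count of vision tokens for a patch-based vision encoder."""
    h, w = resolution
    return n_frames * (h // patch) * (w // patch)


# Doubling either the frame count or the resolution quickly inflates the input.
for frames, res in [(8, (448, 448)), (32, (448, 448)), (32, (896, 896))]:
    idx = sample_frame_indices(total_frames=3000, n_samples=frames)
    print(frames, "frames at", res, "->", visual_token_budget(len(idx), res), "vision tokens")
```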