GR00T N1.5
Hundred-Million-Scale Short-Video Data Breaks Through the Embodied-Intelligence Scaling Law! Being-H0 Proposes a New VLA Training Paradigm
量子位· 2025-07-24 07:28
Core Viewpoint
- The article discusses advances in embodied intelligence, focusing on the Being-H0 model, which uses human hand-movement data to enhance robot action capabilities and address the data-scarcity problem in vision-language-action (VLA) models [1][30].

Group 1: Data Scarcity and Solutions
- A lack of real-world data is holding back VLA models: existing datasets fall roughly three orders of magnitude short of the required scale of over one hundred million training samples [2].
- The research team from Peking University and BeingBeyond addressed this by building a large-scale dataset from human operation videos, reaching a size in the hundreds of millions of samples [3][17].

Group 2: Being-H0 Model and Innovations
- Being-H0 is the first large-scale pre-trained VLA model built on human hand data from video, using a novel "physical instruction tuning" framework to map human hand movements into robot action spaces (see the sketch at the end of this summary) [5][10].
- The model rests on the premise that the human hand is the most complete execution template for various robotic end-effectors, allowing robots to benefit from human motion knowledge [6][10].

Group 3: Training Framework
- The physical instruction tuning framework has three key components: pre-training on millions of human operation videos, physical space alignment to eliminate data-source heterogeneity, and post-training for effective skill transfer to real robots [12][13][14].
- The framework addresses the heterogeneity between 2D multimodal data and 3D robot action spaces, improving the model's ability to learn and generate actions [12].

Group 4: UniHand Dataset
- The UniHand dataset, comprising over 150 million human hand-motion samples, was systematically constructed to meet the training-data needs of the physical instruction tuning framework [20][21].
- Even with just 2.5 million samples from this dataset, the model showed significant performance gains in hand-motion prediction and real-robot tasks [21].

Group 5: Experimental Validation
- Comprehensive real-robot experiments validated the effectiveness of Being-H0, which outperformed both its base model InternVL3 and NVIDIA's GR00T N1.5 across a range of tasks [22][24].
- The experiments confirmed that the data-construction strategy substantially improves the model's ability to learn human action knowledge from video, leading to higher task success rates [24].

Group 6: Future Directions
- The BeingBeyond team is focused on advancing core technologies in embodied intelligence, dexterous manipulation, and full-body motion control, aiming to integrate robots into everyday life [30].
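The summary gives no implementation details, so the following is only a minimal sketch of what "mapping human hand movements into a robot action space" can involve in practice. The dimensions, function names, and the crude gripper-aperture heuristic are assumptions for illustration, not the Being-H0 team's actual retargeting or alignment procedure.

```python
import numpy as np

# Hypothetical shapes, loosely following the description of
# "parameterized hand movements": a wrist pose plus finger joint angles.
WRIST_DIM = 6      # 3D translation + 3D axis-angle rotation (assumed)
FINGER_DIM = 45    # e.g. 15 finger joints x 3 rotation params (assumed)

def retarget_hand_to_robot(wrist_pose: np.ndarray,
                           finger_angles: np.ndarray,
                           cam_to_robot: np.ndarray) -> dict:
    """Map one human-hand sample into a simplified robot action space.

    wrist_pose    : (6,)  wrist translation + axis-angle rotation in camera frame
    finger_angles : (45,) flattened finger joint parameters
    cam_to_robot  : (4,4) homogeneous transform from camera to robot base frame
    """
    # 1) Express the wrist position in the robot's base frame
    #    (one ingredient of "physical space alignment").
    wrist_pos_cam = np.append(wrist_pose[:3], 1.0)
    wrist_pos_robot = (cam_to_robot @ wrist_pos_cam)[:3]

    # 2) Collapse finger articulation into a coarse gripper aperture,
    #    treating the robot hand as a "simplified version" of the human hand.
    aperture = float(np.clip(1.0 - np.mean(np.abs(finger_angles)), 0.0, 1.0))

    return {
        "ee_position": wrist_pos_robot,   # target end-effector position
        "ee_rotation": wrist_pose[3:],    # reuse wrist orientation as-is
        "gripper_open": aperture,         # 1.0 = fully open, 0.0 = closed
    }

# Example usage with dummy values
action = retarget_hand_to_robot(
    wrist_pose=np.array([0.1, 0.0, 0.5, 0.0, 0.0, 0.0]),
    finger_angles=np.zeros(FINGER_DIM),
    cam_to_robot=np.eye(4),
)
print(action)
```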
Being-H0: A VLA Model That Learns Dexterous Manipulation from Large-Scale Human Videos
具身智能之心· 2025-07-23 08:45
Core Insights
- The article discusses advances in vision-language-action (VLA) models and the challenges the robotics field faces in complex dexterous-manipulation tasks due to data limitations [3][4].

Group 1: Research Background and Motivation
- Large language models and multimodal models have made significant progress, but robotics still lacks a transformative "ChatGPT moment" [3].
- Existing VLAs struggle with dexterous tasks because they rely on synthetic data or limited teleoperation demonstrations, which are especially scarce for fine manipulation due to high hardware costs [3].
- Human videos contain rich real-world manipulation data, but learning from them raises challenges such as data heterogeneity, hand-motion quantization, cross-modal reasoning, and transfer to robot control [3].

Group 2: Core Methodology
- The article introduces Physical Instruction Tuning, a paradigm with three phases: pre-training, physical space alignment, and post-training, to transfer human hand-movement knowledge to robotic manipulation [4].

Group 3: Pre-training Phase
- The pre-training phase treats the human hand as the ideal manipulator, with robotic hands viewed as simplified versions, and trains a foundational VLA on large-scale human videos [6].
- The input includes visual information, language instructions, and parameterized hand movements, optimizing the mapping from vision and language to motion [6][8].

Group 4: Physical Space Alignment
- Physical space alignment removes the interference caused by differing camera parameters and coordinate systems through weak-perspective projection alignment and motion distribution balancing [10][12].
- The model adapts to specific robots by projecting the robot's proprioceptive state into the model's embedding space and generating executable actions through learnable query tokens [13].

Group 5: Key Technologies
- The article discusses motion tokenization and cross-modal fusion, emphasizing the need to retain fine motion precision while discretizing continuous movements [14][17].
- Hand movements are decomposed into wrist and finger components, each tokenized separately, with reconstruction accuracy enforced through a combination of loss functions (see the sketch at the end of this summary) [18].

Group 6: Dataset and Experimental Results
- The UniHand dataset, comprising over 440,000 task trajectories and 1.3 billion frames, supports large-scale pre-training and covers diverse tasks and data sources [21].
- Experimental results show that Being-H0 outperforms baseline models in hand-motion generation and translation tasks, with better spatial accuracy and semantic alignment [22][25].

Group 7: Long-Sequence Motion Generation
- The model effectively generates long motion sequences (2-10 seconds) using soft-format decoding, which helps maintain trajectory stability [26].

Group 8: Real Robot Operation Experiments
- In practical pick-and-place tasks, Being-H0 achieves significantly higher success rates than baseline models, reaching 65% on unseen-toy tasks and 60% in cluttered scenes [28].
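As a concrete reading of "decompose hand movements into wrist and finger components, tokenize each separately, and enforce reconstruction accuracy," here is a toy vector-quantized motion tokenizer in PyTorch. The layer sizes, codebook size, assumed 6-D wrist / 45-D finger dimensions, and the single MSE reconstruction term are illustrative assumptions; the paper's actual tokenizer and loss combination are more elaborate.

```python
import torch
import torch.nn as nn

class MotionTokenizer(nn.Module):
    """Toy VQ tokenizer for one motion component (wrist or fingers)."""
    def __init__(self, motion_dim: int, codebook_size: int = 512, latent_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(motion_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, motion_dim))
        self.codebook = nn.Embedding(codebook_size, latent_dim)

    def forward(self, motion: torch.Tensor):
        z = self.encoder(motion)                        # (B, latent_dim)
        # Nearest codebook entry gives one discrete token id per frame
        dists = torch.cdist(z, self.codebook.weight)    # (B, codebook_size)
        tokens = dists.argmin(dim=-1)                   # (B,)
        z_q = self.codebook(tokens)
        # Straight-through estimator so gradients still reach the encoder
        z_q = z + (z_q - z).detach()
        recon = self.decoder(z_q)
        return tokens, recon

# Separate tokenizers for the two components, as the summary describes
wrist_tok = MotionTokenizer(motion_dim=6)     # wrist pose (assumed 6-D)
finger_tok = MotionTokenizer(motion_dim=45)   # finger joints (assumed 45-D)

wrist = torch.randn(8, 6)
fingers = torch.randn(8, 45)
w_tokens, w_recon = wrist_tok(wrist)
f_tokens, f_recon = finger_tok(fingers)

# A reconstruction loss preserves fine motion precision despite discretization;
# only an MSE term is shown here, whereas the paper combines several losses.
loss = nn.functional.mse_loss(w_recon, wrist) + nn.functional.mse_loss(f_recon, fingers)
print(w_tokens.shape, f_tokens.shape, loss.item())
```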
Should You Buy Nvidia Stock Before May 28? Wall Street Has a Crystal-Clear Answer for Investors.
The Motley Fool· 2025-05-25 08:15
Core Viewpoint
- Nvidia's stock has shown volatility due to external factors like tariffs and export restrictions, but it has rebounded as capital spending forecasts from hyperscale cloud companies improved and regulatory changes occurred [1][2].

Company Overview
- Nvidia specializes in accelerated computing, particularly in artificial intelligence (AI), holding over 90% market share in data center GPUs [5].
- The company enhances its GPU offerings with complementary hardware, allowing it to build complete data center solutions, which CEO Jensen Huang claims results in the lowest total cost of ownership [6].
- Nvidia has developed the CUDA software platform over two decades, and it supports a wide range of AI applications [7].

Future Prospects
- Nvidia is positioned to lead the next phase of AI development, focused on self-driving cars and autonomous robots, with platforms like Nvidia Drive and Nvidia Isaac [9][10].
- The recent introduction of the GR00T N1.5 model is expected to strengthen Nvidia's position in the AI ecosystem, while opening NVLink technology to custom chipmakers may create new revenue opportunities [11].

Financial Expectations
- Nvidia is set to report its first-quarter fiscal 2026 results, with initial guidance suggesting 53% revenue growth and 49% non-GAAP earnings growth, although analysts have recently lowered their estimates to a 44% earnings increase [12].
- Historical performance indicates that even exceeding earnings expectations may not guarantee a positive market reaction, as seen in previous quarters [13].
- The options market anticipates a price swing of about 6 points, indicating expected volatility around the earnings report [13].

Investor Guidance
- Investors are advised to monitor the upcoming earnings call for insights on export restrictions, market deals, and semiconductor tariffs [14].
- Long-term investors may consider establishing a small position in Nvidia, while those seeking quick profits should be cautious due to market uncertainties [15].
NVIDIA Lets Robots "Learn by Dreaming," Using Dreams to Achieve True Zero-Shot Generalization
量子位· 2025-05-21 10:39
Core Viewpoint
- NVIDIA's DreamGen project enables robots to learn new skills through simulated "dreams," significantly improving task success rates without relying heavily on real-world data [2][6][31].

Group 1: DreamGen Project Overview
- DreamGen uses AI video world models to generate neural trajectories, allowing robots to learn 22 new tasks from minimal real-world video input [6][14].
- The success rate on complex tasks in real-robot tests increased from 21% to 45.5%, demonstrating effective zero-shot generalization [7][25].
- The project is part of NVIDIA's broader GR00T-Dreams initiative, aimed at advancing physical AI capabilities [31].

Group 2: Learning Process and Methodology
- The learning process involves four main steps: fine-tuning the world model, generating virtual data, extracting virtual actions, and training policies (see the sketch at the end of this summary) [17][18][20][22].
- The approach can generate new actions from a single teleoperation data point, achieving zero-shot behavior and environment generalization [23][25].
- Experimental results show that the success rate for learning new actions from a single demonstration improved from 11.2% to 43.2% [25].

Group 3: Performance and Validation
- In simulation, the volume of neural trajectories reached 333 times that of the human demonstration data, with performance improving logarithmically with trajectory count [26].
- Real-world testing on platforms such as Fourier GR1 and Franka Emika confirmed significant improvements in task success rates, validating the effectiveness of DreamGen [28].

Group 4: Future Implications
- DreamGen Bench was developed to evaluate the quality of generated data based on instruction adherence and physical realism [29].
- The GR00T-Dreams initiative aims to cut the development time for robot behavior learning from three months to roughly 36 hours, improving the efficiency of AI training [32][34].
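To show the four-step recipe (fine-tune a video world model, generate "dream" videos, extract pseudo-actions, train the policy) in one place, here is a schematic Python sketch. Every function and class name is a hypothetical stand-in rather than the actual GR00T-Dreams/DreamGen API, and the stubs only illustrate the data flow; the real system uses learned video and action-labeling models.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class NeuralTrajectory:
    frames: List[str]    # generated ("dreamed") video frames
    actions: List[str]   # pseudo-actions recovered from those frames

def finetune_world_model(world_model: str, demo_videos: List[str]) -> str:
    """Step 1: adapt a pretrained video world model to the target robot
    using a small number of real demonstrations."""
    print(f"fine-tuning {world_model} on {len(demo_videos)} demo(s)")
    return world_model

def generate_dream_videos(world_model: str, prompts: List[str], n_per_prompt: int = 4) -> List[str]:
    """Step 2: roll the world model forward to 'dream' new task executions,
    including behaviors and environments never demonstrated."""
    return [f"{world_model}:dream({p})#{i}" for p in prompts for i in range(n_per_prompt)]

def extract_pseudo_actions(dream_videos: List[str]) -> List[NeuralTrajectory]:
    """Step 3: label each dreamed video with actions (e.g. via an
    inverse-dynamics or latent-action model) to form neural trajectories."""
    return [NeuralTrajectory(frames=[v], actions=["placeholder_action"]) for v in dream_videos]

def train_policy(policy: str, trajectories: List[NeuralTrajectory]) -> str:
    """Step 4: train the robot policy on the synthetic trajectories,
    optionally mixed with the original real demonstrations."""
    print(f"training {policy} on {len(trajectories)} neural trajectories")
    return policy

# End-to-end run of the four-step recipe with dummy placeholders
world_model = finetune_world_model("video_world_model", ["one_real_demo.mp4"])
dreams = generate_dream_videos(world_model, ["pick up the cup", "open the drawer"])
trajectories = extract_pseudo_actions(dreams)
policy = train_policy("robot_policy", trajectories)
```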