GR00T N1.5

Hundred-million-scale short-video data breaks through the embodied intelligence Scaling Law! Being-H0 proposes a new VLA training paradigm
量子位· 2025-07-24 07:28
Core Viewpoint
- The article discusses advances in embodied intelligence, focusing on the Being-H0 model, which uses human hand movement data to improve robot action capabilities and address the data scarcity problem in vision-language-action (VLA) models [1][30].

Group 1: Data Scarcity and Solutions
- The lack of real-world data is holding back VLA models; existing datasets fall roughly three orders of magnitude short of the required scale of over one hundred million training samples [2].
- The research team from Peking University and BeingBeyond addressed this by building a large-scale dataset from videos of human manipulation, reaching a size in the hundreds of millions of samples [3][17].

Group 2: Being-H0 Model and Innovations
- Being-H0 is the first large-scale VLA model pre-trained on human hand data from video, using a novel "physical instruction tuning" framework to map human hand movements into robot action spaces [5][10].
- The model rests on the premise that human hand movement is the most complete execution template for the various robotic end-effectors, allowing robots to benefit from human motion knowledge [6][10].

Group 3: Training Framework
- The physical instruction tuning framework has three key components: pre-training on millions of human manipulation videos, physical space alignment to remove heterogeneity between data sources, and post-training for effective skill transfer to real robots [12][13][14]; a minimal pipeline sketch is given after this summary.
- The framework addresses the mismatch between 2D multimodal data and 3D robot action spaces, improving the model's ability to learn and generate actions [12].

Group 4: UniHand Dataset
- The UniHand dataset, comprising over 150 million human hand-motion samples, was systematically constructed to meet the training data needs of the physical instruction tuning framework [20][21].
- Even with only 2.5 million samples from this dataset, the model showed significant gains in hand-motion prediction and real-robot tasks [21].

Group 5: Experimental Validation
- Comprehensive real-robot experiments validated the effectiveness of Being-H0, which outperformed both its base model InternVL3 and NVIDIA's GR00T N1.5 across a range of tasks [22][24].
- The experiments confirmed that the data construction strategy substantially improves the model's ability to learn human action knowledge from video, leading to higher task success rates [24].

Group 6: Future Directions
- The BeingBeyond team is focused on advancing core technologies in embodied intelligence, dexterous manipulation, and whole-body motion control, with the goal of integrating robots into everyday life [30].
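As a rough illustration of the three-stage recipe summarized in Group 3, here is a minimal, hypothetical PyTorch sketch. The module names, token vocabulary, dimensions, and losses are assumptions made for illustration only, not the Being-H0 implementation; physical space alignment (stage 2) is treated here as a data-side preprocessing step and is only noted in a comment.

```python
# Hypothetical sketch of a three-stage "physical instruction tuning" recipe:
# stage 1 pre-trains on human videos (vision + language -> hand-motion tokens),
# stage 2 (not shown) normalizes data into a shared physical space,
# stage 3 post-trains the same backbone to emit executable robot actions.
import torch
import torch.nn as nn

VOCAB = 2048      # assumed size of a shared text + motion token vocabulary
D_MODEL = 256     # assumed hidden width

class TinyVLA(nn.Module):
    """Toy decoder mapping a multimodal token stream to motion or action outputs."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.backbone = nn.GRU(D_MODEL, D_MODEL, batch_first=True)
        self.motion_head = nn.Linear(D_MODEL, VOCAB)   # stage 1: hand-motion tokens
        self.action_head = nn.Linear(D_MODEL, 7)       # stage 3: e.g. 7-DoF robot action

    def forward(self, tokens):
        hidden, _ = self.backbone(self.embed(tokens))
        return hidden

def pretrain_step(model, tokens, motion_targets):
    """Stage 1: learn vision + language -> human hand-motion tokens."""
    logits = model.motion_head(model(tokens))
    return nn.functional.cross_entropy(logits.flatten(0, 1), motion_targets.flatten())

def posttrain_step(model, tokens, robot_actions):
    """Stage 3: adapt the same backbone to regress executable robot actions."""
    pred = model.action_head(model(tokens)[:, -1])      # last hidden state -> action
    return nn.functional.mse_loss(pred, robot_actions)

model = TinyVLA()
tokens = torch.randint(0, VOCAB, (2, 16))               # fake multimodal token stream
print(pretrain_step(model, tokens, torch.randint(0, VOCAB, (2, 16))).item())
print(posttrain_step(model, tokens, torch.randn(2, 7)).item())
```

In practice the two stages would share one backbone checkpoint, which is the point of the framework: the action head is a small adapter on top of knowledge learned from human videos.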
Being-H0: a VLA model that learns dexterous manipulation from large-scale human videos
具身智能之心· 2025-07-23 08:45
Core Insights
- The article discusses advances in vision-language-action models (VLAs) and the challenges the robotics field faces in complex dexterous manipulation tasks due to data limitations [3][4].

Group 1: Research Background and Motivation
- Large language models and multimodal models have made significant progress, but robotics still lacks a transformative "ChatGPT moment" [3].
- Existing VLAs struggle with dexterous tasks because they rely on synthetic data or a limited number of teleoperation demonstrations, which are scarce for fine manipulation because of high hardware costs [3].
- Human videos contain rich real-world manipulation data, but learning from them raises challenges such as data heterogeneity, hand-motion quantization, cross-modal reasoning, and transfer to robot control [3].

Group 2: Core Methodology
- The article introduces physical instruction tuning, a paradigm with three phases: pre-training, physical space alignment, and post-training, designed to transfer human hand-movement knowledge to robotic manipulation [4].

Group 3: Pre-training Phase
- The pre-training phase treats the human hand as the ideal manipulator, viewing robotic hands as simplified versions of it, and trains a foundational VLA on large-scale human videos [6].
- The input consists of visual information, language instructions, and parameterized hand movements, and training optimizes the mapping from vision and language to motion [6][8].

Group 4: Physical Space Alignment
- Physical space alignment counteracts interference from differing camera parameters and coordinate systems through weak-perspective projection alignment and motion distribution balancing [10][12].
- The model adapts to a specific robot by projecting the robot's proprioceptive state into the model's embedding space and generating executable actions through learnable query tokens [13].

Group 5: Key Technologies
- The article discusses motion tokenization and cross-modal fusion, emphasizing the need to retain fine motion precision while discretizing continuous movements [14][17].
- Hand movements are decomposed into wrist and finger motions, each tokenized separately, with reconstruction accuracy ensured by a combination of loss functions [18]; a minimal tokenization sketch is given after this summary.

Group 6: Dataset and Experimental Results
- The UniHand dataset, comprising over 440,000 task trajectories and 1.3 billion frames, supports large-scale pre-training and covers diverse tasks and data sources [21].
- Experimental results show that Being-H0 outperforms baseline models on hand motion generation and translation tasks, with better spatial accuracy and semantic alignment [22][25].

Group 7: Long Sequence Motion Generation
- The model generates long motion sequences (2-10 seconds) effectively using soft-format decoding, which helps maintain trajectory stability [26].

Group 8: Real Robot Operation Experiments
- In practical grasp-and-place tasks, Being-H0 achieves significantly higher success rates than baseline models, reaching 65% and 60% success on unseen-toy and cluttered-scene tasks, respectively [28].
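As a rough illustration of the separate wrist/finger tokenization summarized in Group 5, here is a minimal sketch. The dimensional split (6-D wrist pose, 45-D finger angles), the codebook sizes, the loss weighting, and the plain nearest-neighbour quantizer are all assumptions for illustration; the real system would learn its tokenizers rather than use random codebooks.

```python
# Minimal sketch of component-wise hand-motion tokenization: split a
# parameterized hand pose into wrist and finger parts, quantize each against
# its own codebook, and score reconstruction with a combined loss.
import torch

def quantize(x, codebook):
    """Nearest-neighbour vector quantization: return discrete codes and reconstruction."""
    dists = torch.cdist(x, codebook)       # (N, K) distances to codebook entries
    codes = dists.argmin(dim=-1)           # discrete motion tokens
    return codes, codebook[codes]

# Assumed split of a parameterized hand pose: 6-D wrist pose + 45-D finger angles.
wrist = torch.randn(32, 6)
fingers = torch.randn(32, 45)

wrist_book = torch.randn(512, 6)           # separate codebook per component
finger_book = torch.randn(512, 45)

wrist_codes, wrist_rec = quantize(wrist, wrist_book)
finger_codes, finger_rec = quantize(fingers, finger_book)

# Combined reconstruction objective over both components (weights assumed).
loss = torch.nn.functional.mse_loss(wrist_rec, wrist) + \
       0.5 * torch.nn.functional.mse_loss(finger_rec, fingers)
print(wrist_codes[:5].tolist(), finger_codes[:5].tolist(), float(loss))
```

Separating the two streams is what lets coarse wrist trajectories and fine finger articulation be discretized at different resolutions without one dominating the codebook.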
Should You Buy Nvidia Stock Before May 28? Wall Street Has a Crystal-Clear Answer for Investors.
The Motley Fool· 2025-05-25 08:15
Core Viewpoint
- Nvidia's stock has shown volatility due to external factors like tariffs and export restrictions, but it has rebounded as capital spending forecasts from hyperscale cloud companies improved and regulatory changes occurred [1][2].

Company Overview
- Nvidia specializes in accelerated computing, particularly in artificial intelligence (AI), holding over 90% market share in data center GPUs [5].
- The company enhances its GPU offerings with complementary hardware, allowing it to build complete data center solutions, which CEO Jensen Huang claims results in the lowest total cost of ownership [6].
- Nvidia has developed the CUDA software platform over two decades, and it supports a wide range of AI applications [7].

Future Prospects
- Nvidia is positioned to lead the next phase of AI development, focused on self-driving cars and autonomous robots, with platforms like Nvidia Drive and Nvidia Isaac [9][10].
- The recent introduction of the GR00T N1.5 model is expected to strengthen Nvidia's position in the AI ecosystem, while opening NVLink technology to custom chipmakers may create new revenue opportunities [11].

Financial Expectations
- Nvidia is set to report its first-quarter fiscal 2026 results, with initial guidance calling for 53% revenue growth and 49% non-GAAP earnings growth, although analysts have recently lowered their estimates to a 44% earnings increase [12].
- Historical performance indicates that even exceeding earnings expectations may not guarantee a positive market reaction, as seen in previous quarters [13].
- The options market anticipates a price swing of 6 points, indicating expected volatility around the earnings report [13].

Investor Guidance
- Investors are advised to monitor the upcoming earnings call for insights on export restrictions, market deals, and semiconductor tariffs [14].
- Long-term investors may consider establishing a small position in Nvidia, while those seeking quick profits should be cautious due to market uncertainties [15].
Nvidia teaches robots to "learn by dreaming," using dreams to achieve true from-zero generalization
量子位· 2025-05-21 10:39
Lu Yu, from Aofeisi
QbitAI | WeChat official account QbitAI

"Do androids dream of electric sheep?" is one of science fiction's most famous questions.

Nvidia now offers an answer: yes, and robots can even learn new skills from those dreams.

The smooth manipulations shown below were achieved without any real-world data used as training support: given only a text instruction, the robot completes the corresponding task.

This is DreamGen, the latest project from the NVIDIA GEAR Lab.

The "learning in dreams" it refers to cleverly uses an AI video world model to generate neural trajectories: with only a small amount of real-world video, robots learn to perform 22 new tasks. In real-robot tests, the success rate on complex tasks rose markedly from 21% to 45.5%, achieving, for the first time, genuine from-zero generalization.

Nvidia CEO Jensen Huang also formally announced it as part of GR00T-Dreams during his Computex 2025 keynote. Below, we break down how DreamGen is built (a rough code sketch of the overall pipeline is given at the end of this excerpt).

Learning in dreams

Traditional robots have shown great potential for executing complex real-world tasks, but they rely heavily on large-scale, manually collected teleoperation data, which is expensive and time-consuming. Purely simulated synthetic data also falls short, because the gap between simulation and the real physical world makes it hard to transfer learned skills directly to reality.

So the research team asked: why not let robots learn in their dreams? This idea did not come out of nowhere, ...
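To make the "learning in dreams" idea concrete, here is a hedged, high-level sketch of a DreamGen-style pipeline as described above: adapt a video world model with a small amount of real robot video, generate synthetic "dream" rollouts from text prompts, recover pseudo-actions from those rollouts, and train a policy on the result. Every function below is a stand-in; the names, the four-step decomposition, and the use of an inverse-dynamics model for action labeling are assumptions for illustration, not NVIDIA's implementation.

```python
# High-level, stubbed sketch of a "learn from dreams" pipeline:
# 1) fine-tune a video world model on a few real clips,
# 2) dream rollouts from new text instructions,
# 3) label the dreamed frames with pseudo-actions,
# 4) train a policy on the resulting synthetic trajectories.
from dataclasses import dataclass
from typing import List

@dataclass
class DreamTrajectory:
    frames: List[str]    # generated ("dreamed") video frames
    actions: List[str]   # pseudo-actions recovered from those frames

def finetune_world_model(world_model, real_clips):
    """Step 1: adapt a pretrained video world model on a handful of real demos."""
    # Stand-in: a real pipeline would update the model weights here.
    return world_model

def dream(world_model, prompt, horizon=16):
    """Step 2: roll the world model forward from a text instruction only."""
    return [f"{prompt}/frame_{t}" for t in range(horizon)]

def label_actions(frames):
    """Step 3: recover pseudo-actions (e.g. via an assumed inverse-dynamics model)."""
    return [f"pseudo_action_for({f})" for f in frames]

def train_policy(trajectories):
    """Step 4: treat the dreamed trajectories as ordinary robot training data."""
    return {"trajectories_used": len(trajectories)}

world_model = finetune_world_model(world_model=object(), real_clips=["real_clip_0"])

dreams = []
for prompt in ["pick up the cup", "open the drawer"]:   # new, unseen instructions
    frames = dream(world_model, prompt)
    dreams.append(DreamTrajectory(frames=frames, actions=label_actions(frames)))

policy = train_policy(dreams)
print(policy)   # {'trajectories_used': 2}
```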