具身智能之心
The Depth Anything of panoramic vision is here! 2M images power full-scene 360° spatial intelligence
具身智能之心· 2025-12-30 01:11
Core Insights
- The article discusses the launch of Depth Any Panoramas (DAP), a foundational model for panoramic depth estimation that addresses the challenges of data scarcity and model generalization in spatial intelligence [1][19].

Data and Model Development
- DAP is trained on an unprecedented scale of 2 million (2M) panoramic images, significantly surpassing previous datasets such as Stanford2D3D and Matterport3D, which contained only tens of thousands of images [6][7].
- The model uses a three-stage pseudo-labeling pipeline to refine depth labels derived from unlabeled panoramic images, ultimately producing a robust training dataset [10][11].

Performance and Benchmarking
- DAP demonstrates superior performance across benchmarks, achieving significant reductions in absolute relative error (AbsRel) and root mean square error (RMSE) on both indoor and outdoor datasets [14][17].
- In zero-shot testing, DAP outperformed existing models, showing strong generalization and effective depth prediction in complex environments [13][16].

Technological Innovations
- The model incorporates a distance-adaptive range mask head, allowing it to adjust depth perception to different application scenarios [16].
- DAP employs multi-dimensional geometric optimization to ensure sharp edges and accurate geometric structure in depth maps, addressing common issues such as depth holes and structural distortion [16].

Industry Implications
- The introduction of DAP marks a milestone in panoramic depth estimation, enabling advances in autonomous driving, robotics, and VR/AR content creation by providing a low-cost method of depth acquisition [19][20].
- The project has been open-sourced, broadening access to the technology and fostering further innovation in spatial intelligence [20].
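The two headline metrics above, AbsRel and RMSE, can be computed directly from a predicted depth map and its ground truth. A minimal sketch (not DAP's evaluation code; the array shapes and the validity mask are assumptions):

```python
import numpy as np

def depth_metrics(pred, gt, mask=None):
    """Compute AbsRel and RMSE over valid ground-truth pixels."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if mask is None:
        mask = gt > 0  # ignore pixels with no ground-truth depth
    p, g = pred[mask], gt[mask]
    abs_rel = np.mean(np.abs(p - g) / g)   # mean of |pred - gt| / gt
    rmse = np.sqrt(np.mean((p - g) ** 2))  # root mean square error
    return abs_rel, rmse

# Toy example: a prediction that is off by 10% everywhere.
gt = np.full((4, 4), 2.0)
pred = gt * 1.1
abs_rel, rmse = depth_metrics(pred, gt)
# abs_rel ≈ 0.1, rmse ≈ 0.2
```

Lower is better for both metrics; AbsRel normalizes the error by the true depth, so it weights near and far regions of a panorama comparably, while RMSE penalizes large absolute errors.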
A 10,000-word deep dive: what pain points remain in VLA architectures and models?
具身智能之心· 2025-12-30 01:11
Click the card below to follow the "具身智能之心" official account. Editor: 具身智能之心. This article is shared for academic purposes only; if there is any infringement, contact us for removal. >> Click to join the 具身智能之心 technical discussion group. For more material, join China's first full-stack embodied-AI learning community: the 具身智能之心 Knowledge Planet (click here), which has everything you need.

★ The previous roundtable on VLA models and real-robot deployment was unanimously well received across the industry. The platform team has been transcribing the conversation, and today we share the first part, on "VLA architectures and models".

Zhang Qiang: Thank you to the host for the introduction. Hello everyone, I'm Zhang Qiang from the Beijing Humanoid Robot Center. My research background is humanoid robots, which I have been working on since around 2021, including work on the Fourier, GR-1, and Embodied robots as well as on our current Tiangong platform. My main research directions are motion control, VLA, and humanoid-robot world models and embodied foundation models; I hope you will follow our work. I'm very glad to accept 具身智能之心's invitation and to discuss these topics with the other guests today, thank you!

For the full content, join our embodied-AI community: the 具身智能之心 Knowledge Planet.

Host: Great, then let's officially begin. Welcome, everyone, to the 具身智能之心 round ...
From an NVIDIA lead: a year-end review of embodied AI robotics
具身智能之心· 2025-12-29 12:50
Core Insights
- The robotics field is still in its early stages, as highlighted by Jim Fan, NVIDIA's robotics lead, who points to the lack of standardized evaluation metrics and the gap between hardware advances and software reliability [1][8][11].

Group 1: Hardware and Software Disparity
- Current advances in robotics hardware, such as Optimus and e-Atlas, outpace software development, leaving hardware capabilities underutilized [14][15].
- Robots require extensive operational teams: they do not self-repair and face frequent issues such as overheating and motor failures [16][17].
- Hardware reliability is crucial, since errors can have irreversible consequences that erode the field's patience and scalability [18][19].

Group 2: Benchmarking Challenges
- The lack of consensus on benchmarking is a significant issue: with no standardized hardware platforms or task definitions, everyone can claim state-of-the-art (SOTA) results [20][21].
- The field must treat reproducibility and scientific rigor as first-class concerns rather than afterthoughts [23].

Group 3: VLA Model Insights
- The Vision-Language-Action (VLA) model is currently the dominant paradigm in robotics, but its reliance on pre-trained Vision-Language Models (VLMs) is problematic because VLM pre-training is misaligned with physical-world tasks [25][49].
- VLA performance does not scale linearly with VLM parameter count, since the pre-training objectives do not match the requirements of physical interaction [26][51].
- Future VLA models should integrate physics-driven world models to better understand and interact with the physical environment [50].

Group 4: Data Importance
- Data plays a critical role in shaping model capabilities, underscoring the need for diverse data sources and collection methods [31][43].
- The emergence of new hardware and data-collection efforts, such as Generalist and Egocentric-10K, demonstrates the growing importance of data in robotics [36][42].
- Data-collection strategy remains an open question, with multiple approaches still being explored [43].

Group 5: Industry Trends
- The robotics industry is projected to grow from $91 billion today to $25 trillion by 2050, indicating strong future potential [57].
- Major tech companies, with the exceptions of Microsoft and Anthropic, are investing increasingly in robotics software and hardware, reflecting the sector's attractiveness [59].
Why has the π series had such a big impact on the industry?
具身智能之心· 2025-12-29 00:04
Core Viewpoint
- The article discusses advances in the π series within the VLA (Vision-Language-Action) field, highlighting its role in transforming robot learning paradigms and industry applications through continuous technological breakthroughs [2].

Group 1: Technological Advancements
- The π0 model introduces Flow Matching for continuous action-trajectory prediction, overcoming the precision limits of traditional discrete actions and providing a foundation for millimeter-level operations in precision manufacturing and autonomous-driving scenarios [3].
- The π0.5 model features heterogeneous-task co-training and hierarchical reasoning, achieving a 94% success rate when generalizing complex tasks to unfamiliar environments, while reducing data costs by 90% through training on human video [3].
- The π0.6 model employs RECAP reinforcement learning for zero-shot generalization and efficient fine-tuning, surpassing human efficiency and precision in real-world applications and facilitating flexible production [3].

Group 2: Industry Impact
- Since 2025, the π series has served as a core reference for numerous VLA models across the industry, enabling general-purpose robots to move from the laboratory to real-world applications in industrial manufacturing and home services [3].
- Companies are building their own demo machines on top of the π series, for tasks such as folding clothes and unpacking boxes, indicating the technology's practical reach [3].

Group 3: Learning and Development Challenges
- Many beginners struggle to curate data and train VLA models based on the π series; some spend up to six months without achieving satisfactory results [5].
- The article emphasizes the need for guided learning that gives individuals practical experience and project work for job applications [6][11].
Group 4: Educational Offerings
- The company offers a comprehensive course covering hardware, data collection, VLA algorithms, evaluation, simulation, and deployment of mainstream VLA models, plus a range of real-robot experiments [13][14].
- Course participants receive an SO-100 robotic arm, enhancing hands-on learning opportunities [16].

Group 5: Target Audience
- The course is aimed at individuals seeking practical experience in the VLA field, including students and professionals transitioning from traditional computer vision, robotics, or autonomous driving [24].
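The Flow Matching technique credited to π0 above trains a network to regress a velocity field that transports Gaussian noise to expert action trajectories along a simple path. A minimal NumPy sketch of the linear path and its regression target, plus a toy check that Euler-integrating the true velocity recovers the action (an illustration of the idea only, not the π0 implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def fm_training_pair(action, t, eps):
    """Linear probability path x_t = (1 - t) * eps + t * action.

    A flow-matching loss would be MSE(v_theta(x_t, t), target_v):
    the network sees (x_t, t) and regresses the path velocity.
    """
    x_t = (1.0 - t) * eps + t * action
    target_v = action - eps  # velocity is constant along the linear path
    return x_t, target_v

# Toy check: integrating the target velocity from noise at t=0
# recovers the expert action exactly at t=1.
action = np.array([1.5])   # a 1-D stand-in for an action trajectory
eps = rng.normal(size=1)   # noise sample (the integration start point)
x = eps.copy()
n_steps = 50
for k in range(n_steps):
    t = k / n_steps
    _x_t, v = fm_training_pair(action, t, eps)
    x = x + v / n_steps    # Euler step: dx = v * dt
# x now equals `action` up to floating-point error
```

Because the regression target is a continuous velocity rather than a discrete action token, the learned policy can output smooth, high-precision trajectories, which is the property the summary credits for millimeter-level operation.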
Amazon team trains humanoid robot locomotion in 15 minutes on a single GPU!
具身智能之心· 2025-12-29 00:04
Author: Younggyo Seo et al. Editor: 具身智能之心

In humanoid robot control, reinforcement learning (RL) has already achieved sim-to-real transfer, but high-dimensional action spaces and the need for strong domain randomization make training cycles long, severely limiting iteration speed.

The fast reinforcement-learning recipe proposed by Amazon's FAR lab team centers on tuned off-policy RL algorithms (FastSAC, FastTD3). Combining three pillars, algorithm tuning, minimalist reward design, and massively parallel simulation, it is the first to train a robust humanoid locomotion policy on a single GPU in 15 minutes, while also supporting rapid deployment of whole-body motion-tracking tasks, thoroughly reshaping the sim-to-real iteration paradigm for humanoid robots.

Paper title: Learning Sim-to-Real Humanoid Locomotion in 15 Minutes
FastSAC-Humanoid — Project Page: https://youngg ...
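One of the three pillars named above, massively parallel simulation, amounts to stepping thousands of environment instances with a single batched array operation so that one GPU can gather experience quickly. A toy vectorized-environment sketch in NumPy (a 1-D point mass stands in for humanoid physics; the class, reward, and numbers are illustrative assumptions, not the paper's code):

```python
import numpy as np

class VecPointMassEnv:
    """N independent 1-D point masses; all envs advance in one array op."""

    def __init__(self, num_envs, dt=0.02, target_vel=1.0):
        self.num_envs, self.dt, self.target_vel = num_envs, dt, target_vel
        self.vel = np.zeros(num_envs)

    def step(self, actions):
        # Batched dynamics: every environment steps simultaneously.
        self.vel += self.dt * np.clip(actions, -10.0, 10.0)
        # Minimalist reward: just track a target velocity
        # (cf. the "minimalist reward design" pillar).
        reward = -np.abs(self.vel - self.target_vel)
        return self.vel.copy(), reward

env = VecPointMassEnv(num_envs=4096)
obs, rew = env.step(np.ones(4096))  # one call steps all 4096 envs
# obs.shape == (4096,), rew.shape == (4096,)
```

Off-policy learners such as SAC and TD3 pair well with this pattern because every transition from every parallel environment lands in a shared replay buffer and can be reused across many gradient updates.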
"Human-centric" embodied data collection is becoming the preferred route, and the industry landscape is taking shape
具身智能之心· 2025-12-29 00:04
With so many robot algorithms, why is it so hard to reach real-world scenarios?

Since the start of this year, teams at home and abroad have produced a large volume of work on robot manipulation tasks. From Physical Intelligence to China's embodied-AI unicorns and universities, groups keep refreshing benchmark numbers and improving generalization. Many VLA and RL frameworks are maturing as well, with GitHub repositories routinely above 2k stars and research teams actively maintaining and using them.

But there is an obvious question: why is VLA so rarely applied in real scenarios? Some robots look acceptable at trade shows, yet change the scene slightly and they act as if "blinded", flailing about. Especially for everyday tasks such as folding clothes or unpacking boxes, rigid and inelegant motions are the norm.

The root cause is insufficient model generalization. Anyone who has done imitation learning knows that a model lacking generalization is hard to truly deploy, especially in the open-ended scenarios of embodied robotics, where there may be N possible actions and the model must be fed large amounts of data.

Within the industry, some data collection is expensive, slow, hard to scale, and highly customized. This raises a crucial question: how can large-scale, high-quality data be acquired efficiently, so that models can generalize and understand the behaviors and operations that a task requires?

Four data routes have emerged, differentiated by cost and scale

The industry has converged on several approaches to acquiring embodied data; the different routes trade off data quality, ...
From long-horizon reasoning to precise manipulation: LoLA cracks multi-step robot task execution
具身智能之心· 2025-12-29 00:04
Author: Xiaofan Wang et al. Editor: 具身智能之心

In research on robot manipulation and Vision-Language-Action (VLA) models, humans can easily complete complex long-horizon tasks (such as making a pizza) thanks to a coherent understanding of past information and multi-step action planning. Existing methods, however, mostly focus on short-horizon tasks; in long-horizon settings they face missing temporal context, state drift, and excessive resource consumption.

The LoLA framework, proposed by a joint team from the Chinese Academy of Sciences, the University of Chinese Academy of Sciences, and Microsoft Research, takes "long-horizon latent action learning" as its core. Through a three-layer architecture of multimodal encoding, state-aware fusion, and action generation, it achieves, for the first time, efficient execution of long-horizon language-guided robot manipulation, providing a new solution for deploying general-purpose robots in real scenarios.

Paper title: LoLA: Long Horizon Latent Action Learning for General Robot Manipulation
Core highlights: long-horizon multimodal fused encoding, a state-aware latent-representation module, cross-platform generalization, and validation in both simulation and the real world
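The three-layer pipeline named above (multimodal encoding, state-aware fusion, action generation) can be pictured as a composition of maps from an observation history plus a language instruction to a chunk of future actions. A shape-level NumPy sketch (fixed random projections stand in for the learned networks; every dimension and name here is a made-up illustration, not LoLA's architecture code):

```python
import numpy as np

rng = np.random.default_rng(0)

D_OBS, D_LANG, D_LAT, D_ACT, HORIZON = 64, 32, 16, 7, 8

# Stand-ins for the learned networks: fixed random linear maps.
W_enc = rng.normal(size=(D_OBS + D_LANG, D_LAT)) * 0.1   # multimodal encoder
W_fuse = rng.normal(size=(2 * D_LAT, D_LAT)) * 0.1       # state-aware fusion
W_dec = rng.normal(size=(D_LAT, HORIZON * D_ACT)) * 0.1  # action generator

def policy(obs_history, lang_emb):
    # 1) Multimodal encoding: each timestep's observation + the
    #    language embedding are mapped to a latent vector.
    steps = np.concatenate(
        [obs_history, np.repeat(lang_emb[None], len(obs_history), 0)], axis=1)
    latents = np.tanh(steps @ W_enc)                     # (T, D_LAT)
    # 2) State-aware fusion: pool the history into a long-horizon
    #    context and fuse it with the current latent state.
    summary = latents.mean(axis=0)
    fused = np.tanh(np.concatenate([summary, latents[-1]]) @ W_fuse)
    # 3) Action generation: decode a chunk of future actions.
    return (fused @ W_dec).reshape(HORIZON, D_ACT)

acts = policy(rng.normal(size=(20, D_OBS)), rng.normal(size=D_LANG))
# acts.shape == (8, 7): an 8-step chunk of 7-DoF actions
```

The point of the sketch is the data flow: history is compressed into a compact latent before action decoding, which is what keeps long-horizon context affordable compared with feeding raw frames to the action head.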
Why has the π series had such a big impact on the industry?
具身智能之心· 2025-12-28 03:42
Core Viewpoint
- The article discusses advances in the π series within the VLA (Vision-Language-Action) field, highlighting its role in transforming robot learning paradigms and industry applications through continuous technological breakthroughs [2].

Summary by Sections

Technological Advancements
- In 2024 the π series introduced Flow Matching for continuous action-trajectory prediction, overcoming the precision limits of traditional discrete actions and laying a foundation for millimeter-level operations in precision manufacturing and autonomous driving [3].
- In April 2025, π0.5 added heterogeneous-task co-training and hierarchical reasoning, achieving a 94% success rate when generalizing complex tasks to unfamiliar environments while cutting data costs by 90% through training on human video, easing the industry's data-scarcity problem [3].
- In November 2025, π0.6 adopted RECAP reinforcement learning for zero-shot generalization and efficient fine-tuning, surpassing human efficiency and accuracy in real-world applications, reaching a 100% task-completion rate with reconfiguration in minutes and facilitating flexible production [3].

Industry Impact
- The π series is guiding general-purpose robots from the laboratory into real-world applications in industrial manufacturing and home services, and has been a core reference for numerous VLA models across the industry since 2025 [3].
- Companies are building their own demo machines on top of the π series, for tasks such as folding clothes and unpacking boxes, a significant industry response to advances in physical intelligence [3].

Learning and Development Challenges
- Many learners struggle to use the π series effectively for data work and VLA model training; some spend up to six months without achieving satisfactory results [5].
- There is demand for guided learning, with many students seeking mentorship to strengthen their project portfolios for job applications [6][11].

Educational Initiatives
- The company "具身智能之心" has reproduced the π0, π0.5, ACT, and GR00T methods to address learners' lack of real robots and project guidance [7].
- A new course, "VLA Small Class Course for Practical and Job-Oriented Learning", developed with VLA experts, helps students learn and apply VLA technologies effectively [8][13].
- The curriculum spans hardware, data collection, VLA algorithms, simulation, and a range of real-robot experiments, aiming at a robust learning experience [13][14].

Course Details
- The course includes a complimentary SO-100 robotic arm, enhancing hands-on learning opportunities [17].
- It targets individuals seeking practical experience in the VLA field, including students and professionals transitioning from traditional domains [25].
- The course begins on December 30, 2025, with a structured schedule for subsequent sessions [26][28].
REALM: a real2sim validation benchmark for robot manipulation tasks
具身智能之心· 2025-12-27 10:03
Author: Jai Bardhan et al. Editor: 具身智能之心

Core background and problem

Vision-Language-Action (VLA) models let robots understand natural-language instructions and perform manipulation tasks, but evaluating their generalization has remained a key challenge: real-world evaluation is expensive and poorly reproducible, while existing simulation benchmarks have clear flaws, including limited perturbation types, a lack of high-fidelity visuals, and poor alignment with real robot control. The result is a disconnect between simulated and real-world performance (the "real-sim gap").

To address this, research teams from the Czech Technical University and the University of Amsterdam built REALM: a high-fidelity simulation environment and benchmark whose core goal is a strong correlation between simulated and real-world performance, enabling large-scale, low-cost evaluation of VLA generalization. Its core contributions are threefold: a high-fidelity simulation environment with aligned visuals and control, a systematic evaluation protocol covering multi-dimensional perturbations, and an empirically validated real-sim performance correlation.

Related work and differentiating advantages

Most existing robot-manipulation generalization benchmarks rely on simulation but have notable limitations: GemBench, ...
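REALM's central claim is that a model's success rate in simulation correlates strongly with its success rate on the real robot; that correlation is what justifies using the benchmark as a cheap proxy for real-world evaluation. A small sketch of how such a check can be computed (the five models and their success rates are made-up illustrations, not REALM's reported numbers):

```python
import numpy as np

# Hypothetical per-model success rates (fractions), sim vs. real.
sim_success  = np.array([0.82, 0.61, 0.45, 0.70, 0.30])
real_success = np.array([0.78, 0.55, 0.40, 0.66, 0.35])

def pearson(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

r = pearson(sim_success, real_success)
# r close to 1 means the simulator ranks models like the real world
```

When model rankings rather than absolute rates are what matters, a rank correlation such as Spearman's (Pearson on the rank-transformed arrays) is the more robust choice.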