具身智能之心
BAAI Evaluation: Using Data to Decode Embodied Intelligence in Robot Soccer Matches
具身智能之心· 2025-09-12 00:05
Source: BAAI Embodied Intelligence. The Beijing Academy of Artificial Intelligence (BAAI) embodied intelligence team is dedicated to pushing human society toward a more intelligent, efficient, and humane direction, driving technological innovation and industrial upgrading while offering new perspectives and solutions to real-world problems. Editor: BAAI Embodied Intelligence

At the 2025 World Humanoid Robot Games (WHRG), many teams showcased the latest results of deeply integrating embodied intelligence algorithms with robot hardware: the degrees of freedom, stability, and control of robot bodies have clearly improved, while embodied intelligence algorithms provide perception, reasoning, planning, and decision-making abilities that let them take on more complex tasks in dynamic environments. Precisely because humanoid robots have grown into complex systems spanning both the physical body and intelligent models, how to evaluate their overall capability scientifically and systematically has become a key bottleneck for the industry. Traditional outcome-oriented evaluation, such as a simple win/loss record or task-completion rate, can no longer adequately reflect how embodied intelligence performs when supporting a robot body in complex, dynamic, and strongly adversarial environments. Taking a soccer match as an example, the various phenomena that emerge in it ...
HKU Team Debuts a New Paradigm for Embodied Representation, Building a Task-Adaptive Perception Framework
具身智能之心· 2025-09-12 00:05
Editor: 机器之心 (Machine Heart)

The co-first authors of this work are Sun Li and Wu Jiefeng, PhD students in the InfoBodied AI Lab at the University of Hong Kong, with collaborators Liu Ruizhe and Chen Feng. The corresponding author is Yang Yanchao, assistant professor at HKU's Institute of Data Science and Department of Electrical and Electronic Engineering. In recent years the InfoBodied AI Lab has published multiple representative results at top venues such as CVPR, ICML, NeurIPS, and ICLR, and it collaborates widely with well-known universities and research institutes in China and abroad.

Motivation and research background

In embodied intelligence, policy learning usually relies on a scene representation. However, the representation-extraction process in most existing multi-task manipulation methods is task-agnostic: whether the embodied agent is asked to "close the drawer" or "stack blocks", features are always extracted in the same way (with the same neural-network parameters). Imagine a robot in a kitchen that must both precisely grasp a fragile egg and carry a heavy pot. Traditional methods make the robot look at these very different task scenes with the same "eyes", which leaves the scene representation containing a large ...
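As a rough illustration of the task-adaptive idea described above (a generic sketch, not the paper's actual architecture), the snippet below conditions a visual feature extractor on a task embedding via FiLM-style modulation, so the same image yields different representations under different tasks. All module names, sizes, and the conditioning scheme are assumptions.

```python
import torch
import torch.nn as nn

class TaskConditionedEncoder(nn.Module):
    """Hypothetical task-adaptive scene encoder: a task embedding modulates
    visual features instead of one fixed, task-agnostic extractor."""
    def __init__(self, feat_dim=256, task_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(          # stand-in visual backbone
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # FiLM-style conditioning: task embedding -> per-channel scale and shift
        self.film = nn.Linear(task_dim, 2 * feat_dim)

    def forward(self, image, task_emb):
        f = self.backbone(image)                 # (B, feat_dim)
        gamma, beta = self.film(task_emb).chunk(2, dim=-1)
        return gamma * f + beta                  # task-adaptive representation

enc = TaskConditionedEncoder()
img = torch.randn(1, 3, 128, 128)
close_drawer = torch.randn(1, 64)                # hypothetical task embeddings
stack_blocks = torch.randn(1, 64)
# Same image, different tasks, different scene representations.
print(enc(img, close_drawer).shape, enc(img, stack_blocks).shape)
```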
Robots Enter Factories and Mines: This Robot Vocational Skills Competition at the Bund Is Meaningful!
具身智能之心· 2025-09-12 00:05
Core Viewpoint
- The AI Science and Technology Competition showcased the practical applications of robotics in industrial inspection and emergency rescue, highlighting the advancements in embodied intelligence and its potential to enhance human capabilities in hazardous environments [2][9].

Group 1: Event Overview
- The AI Science and Technology Competition featured a "Robot Vocational Skills Performance Competition" held on September 10, organized by Ant Group, with participation from four embodied intelligence manufacturers [2].
- The competition included various challenging tasks simulating real industrial and rescue scenarios, demonstrating the robots' capabilities and earning applause from the audience [2][3].

Group 2: Robot Performances
- The first robot, Qiteng, successfully navigated a "dangerous terrain crossing" task, showcasing its rapid response and strong algorithmic foundation, which is crucial for exploration in remote areas [3][6].
- The team of Shuangying Aviation and Qiuzhi Technology presented a robotic dog that excelled in industrial inspection tasks, performing six complex actions with high precision, and later successfully "rescued" a simulated baby in a rescue scenario [5][9].
- The final robot, Zhongke Huiling, tackled a simulated mining explosion task, achieving millimeter-level precision in inserting explosives and demonstrating effective real-time correction and collaboration capabilities [7][10].

Group 3: Expert Insights
- Experts emphasized that industrial inspection and emergency rescue are the most valuable application scenarios for robots, with current robotic capabilities being mature but still facing challenges in fine manipulation [6][9].
- The competition highlighted the importance of practical applications of technology, with a focus on real-world problems and scenarios, aiming to drive industry collaboration and innovation [9].

Group 4: Competition Impact
- The competition attracted over 8,000 teams and nearly 20,000 participants from around 20 countries and regions, providing a platform for innovators and companies to showcase their advancements in AI hardware and applications [9].
- The event underscored the commitment to advancing robotics from mere demonstrations to practical industrial applications, aligning technology development with human needs [9].
What Exactly Do We Mean When We Talk About the Embodied "Brain" and "Cerebellum"?
具身智能之心· 2025-09-11 05:53
Core Viewpoint
- The exploration towards Artificial General Intelligence (AGI) highlights embodied intelligence as a key direction, focusing on the interaction and adaptation of intelligent agents within physical environments [1][3].

Industry Analysis
- In the past two years, numerous star teams in the field of embodied intelligence have emerged, establishing valuable companies such as Xinghaitu, Galaxy General, and Zhujidongli, driving advancements in embodied brain and cerebellum technologies [3].
- Major domestic companies like Huawei, JD.com, Tencent, Ant Group, and Xiaomi are actively investing and collaborating to build an ecosystem for embodied intelligence, while international firms like Tesla and investment institutions in the U.S. are focusing on foundational models and humanoid robot prototypes [5].

Technological Evolution
- The development of embodied intelligence has progressed through several stages:
  - The first stage focused on grasp pose detection, which struggled with complex tasks due to a lack of context modeling [6].
  - The second stage involved behavior cloning, allowing robots to imitate human tasks but revealing weaknesses in generalization and performance in multi-target scenarios [6].
  - The third stage introduced Diffusion Policy methods, enhancing stability and generalization through sequence modeling [6][7] (a simplified sketch of this idea appears after this summary).
  - The fourth stage, emerging in 2025, explores the integration of VLA models with reinforcement learning and tactile sensing, addressing limitations in feedback and future prediction capabilities [9][11][12].

Product and Market Development
- The evolution of embodied intelligence technologies has led to the emergence of various products, including humanoid robots, robotic arms, and quadrupedal robots, serving industries such as manufacturing, home services, and healthcare [14].
- The demand for engineering and system capabilities is increasing as the industry shifts from research to deployment, necessitating higher engineering skills [17].

Educational Initiatives
- A comprehensive curriculum has been developed to assist learners in mastering the full spectrum of embodied intelligence algorithms, covering topics from basic tasks to advanced models like VLA and its integrations [14][20].
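For the third-stage Diffusion Policy methods mentioned above, here is a heavily simplified, hypothetical sketch of the inference loop (assumed shapes, network, and noise schedule; not the cited works' code): an action sequence is produced by iteratively denoising a random trajectory conditioned on the current observation.

```python
import torch
import torch.nn as nn

class NoiseNet(nn.Module):
    """Hypothetical noise-prediction network: predicts the noise in a noisy
    action sequence, conditioned on an observation embedding and a timestep."""
    def __init__(self, horizon=16, action_dim=7, obs_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(horizon * action_dim + obs_dim + 1, 256),
            nn.ReLU(),
            nn.Linear(256, horizon * action_dim),
        )
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, noisy_actions, obs, t):
        x = torch.cat([noisy_actions.flatten(1), obs, t], dim=-1)
        return self.net(x).view(-1, self.horizon, self.action_dim)

@torch.no_grad()
def sample_actions(model, obs, steps=10):
    """Simplified DDPM-style denoising: start from Gaussian noise and
    repeatedly subtract a fraction of the predicted noise."""
    a = torch.randn(obs.shape[0], model.horizon, model.action_dim)
    for k in reversed(range(steps)):
        t = torch.full((obs.shape[0], 1), float(k) / steps)
        eps = model(a, obs, t)
        a = a - eps / steps            # crude update; real schedules differ
    return a

model = NoiseNet()
obs = torch.randn(1, 64)                # stand-in observation embedding
action_seq = sample_actions(model, obs)  # (1, 16, 7) denoised action sequence
```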
Cook Finally Squeezes the Toothpaste: the 5,999-Yuan iPhone 17 Gets a High-Refresh Display, and the New Earbuds Add Heart-Rate Monitoring and Live Translation
具身智能之心· 2025-09-11 02:07
Core Viewpoint
- The article discusses Apple's recent September launch event, highlighting the launch of the iPhone 17 series, AirPods Pro 3, and Apple Watch Series 11, emphasizing design, performance upgrades, and new features across the product lineup [2][14][100].

iPhone 17 Series
- The iPhone 17 series includes four models with prices ranging from 5999 yuan to 9999 yuan, with the standard model now featuring an adaptive 120Hz ProMotion display [14][24].
- The A19 chip in the iPhone 17 offers a 20% performance improvement over the A18, with a 3nm process and enhanced AI capabilities [22][23].
- The camera system features a 48MP dual-camera setup and an upgraded 18MP Center Stage front camera, enhancing photo and video capabilities [25][28].
- Battery life is extended, with the iPhone 17 capable of 30 hours of video playback and quick charging options [36].

iPhone 17 Air
- The iPhone 17 Air is the thinnest iPhone yet, measuring 5.6mm and weighing 165g, featuring a 6.5-inch 120Hz display [39][44].
- It is powered by the A19 Pro chip, with a peak performance three times that of the A18 Pro, and includes advanced wireless connectivity with WiFi-7 and Bluetooth 6 support [46][49].
- The camera system mirrors that of the iPhone 17, and it exclusively uses eSIM technology [58].

iPhone 17 Pro/Pro Max
- The Pro models feature enhanced materials for better heat dissipation and a more robust design, with the Pro Max offering up to 39 hours of video playback [71][75].
- The camera capabilities are significantly upgraded, with up to 8x optical zoom and support for ProRAW and ProRes video formats [81][84].

AirPods Pro 3
- The new AirPods Pro 3 feature double the active noise cancellation of the previous generation and are designed for fitness enthusiasts with heart rate monitoring capabilities [89][90].
- They also support real-time translation and have a battery life of 6-10 hours depending on the mode [98].

Apple Watch Series 11
- The Series 11 is the thinnest and most comfortable Apple Watch yet, starting at 2999 yuan, and now supports 5G connectivity [101][105].
- New health features include high blood pressure notifications and sleep quality scoring, with a battery life of 24 hours [110][120].
- The lightweight SE 3 model also supports 5G and includes new health monitoring features [122][128].

Conclusion
- The article concludes with a reflection on the significance of these product launches and their potential impact on the market, inviting readers to share their thoughts on which product they find most appealing [135].
Latest from Westlake University! ARFM: Combining the Strengths of VLA Imitation Learning and Reinforcement Learning
具身智能之心· 2025-09-11 02:07
Core Viewpoint
- The article discusses the limitations of current vision-language-action (VLA) models in complex tasks and introduces the Adaptive Reinforcement Flow Matching (ARFM) method to enhance their performance by integrating reinforcement learning (RL) capabilities with flow matching advantages [1][2][4].

Summary by Sections

Current Status of VLA Models
- VLA models based on flow matching have shown excellent performance in general robotic manipulation tasks, validated by large-scale pre-trained systems like RT-1 and PaLM-E, but they struggle with action precision in complex downstream tasks due to reliance on imitation learning [4][5].

Existing Solutions and Limitations
- Previous attempts to fine-tune VLA models using offline RL methods, such as ReinboT, have been limited in effectiveness due to the indirect guidance of action prediction, highlighting the need for more effective offline RL fine-tuning methods [4][5].

Main Contributions
- The ARFM method is introduced as a novel offline RL post-training approach specifically designed for VLA flow models, addressing the challenges of data quality extraction and improving the efficiency of offline RL fine-tuning [6][7].

Methodological Innovation
- ARFM incorporates an adaptive scaling factor in the loss function to balance the advantages of RL while controlling gradient variance, leading to improved generalization, robustness against disturbances, and few-shot learning capabilities [6][8] (a rough sketch of this weighting idea appears after this summary).

Experimental Validation
- Extensive experiments on the LIBERO simulation benchmark and the UR5 robotic arm platform demonstrate that ARFM outperforms existing methods in various aspects, including generalization ability, robustness to dynamic disturbances, and efficiency in few-shot learning [6][8][29].

Core Algorithm Design
- The ARFM framework is built around an energy-weighted loss to integrate RL signals and an adaptive mechanism to ensure training stability, effectively overcoming the limitations of traditional imitation learning and existing offline RL fine-tuning methods [8][11].

Experimental Setup
- The experiments utilized the LIBERO benchmark platform, which includes four core task suites, and real-world scenarios with the UR5 robotic arm, focusing on various manipulation tasks under different conditions [29][30].

Key Experimental Results
- ARFM demonstrated superior performance in multi-task learning, action perturbation robustness, few-shot learning efficiency, and continual learning capabilities compared to baseline models, confirming its practical value in real-world robotic applications [32][35][38].

Conclusion
- The ARFM method effectively balances the retention of RL advantage signals and the control of flow loss gradient variance, leading to enhanced performance of VLA flow models across various tasks and conditions, showcasing its applicability in real-world scenarios [49][47].
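The sketch below is a rough, hypothetical illustration of the energy-weighted idea described in this summary, not ARFM's actual formulation: a per-sample flow-matching regression loss is reweighted by an exponential of a scaled advantage estimate, and the scaling factor acts as the knob trading off RL signal against gradient variance. Function name, shapes, and the softmax normalization are assumptions.

```python
import torch

def energy_weighted_flow_matching_loss(pred_velocity, target_velocity,
                                        advantages, alpha=1.0):
    """Per-sample flow-matching loss reweighted by an exponential of the
    scaled advantage (softmax-normalized over the batch).

    pred_velocity, target_velocity: (B, horizon, action_dim)
    advantages: (B,) offline RL advantage estimates
    alpha: scaling factor balancing RL signal vs. gradient variance
    """
    per_sample = ((pred_velocity - target_velocity) ** 2).mean(dim=(1, 2))   # (B,)
    weights = torch.softmax(alpha * advantages, dim=0) * advantages.shape[0]
    return (weights.detach() * per_sample).mean()

# Usage (shapes only): higher-advantage demonstrations contribute more to
# the flow-matching regression; alpha -> 0 recovers plain imitation learning.
pred = torch.randn(32, 16, 7, requires_grad=True)
target = torch.randn(32, 16, 7)
adv = torch.randn(32)
loss = energy_weighted_flow_matching_loss(pred, target, adv, alpha=2.0)
loss.backward()
```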
SJTU Releases U-Arm: Breaking the Cost Barrier with Ultra-Low-Cost, General-Purpose Robotic-Arm Teleoperation
具身智能之心· 2025-09-11 02:07
Author: Yanwen Zou et al. Editor: 具身智能之心

Research background and core needs

In dual-arm policy learning, large-scale, high-quality real-world manipulation data has long been the bottleneck: compared with simulation or pure human data, real robot-arm data is the most directly applicable for training robust policies. The main way to obtain such data is still human demonstration, which in turn requires a reliable teleoperation interface. Existing demonstration interfaces fall into two broad categories. It is precisely to resolve the tension between high compatibility and low cost that U-ARM was created: the goal is an open-source, ultra-low-cost, easily adaptable leader-follower teleoperation system that lets researchers quickly build data-collection pipelines for all kinds of commercial robot arms.

Pain points of existing solutions and U-ARM's positioning

To show U-ARM's value more clearly, one can first compare the core characteristics of existing mainstream teleoperation devices (as in Table 1): | Device | Price (USD) | Motion Sickness Free | Easy Bimanual Operation | Lo ...
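A minimal sketch of the leader-follower teleoperation loop that a system like U-ARM relies on, shown here as a generic illustration rather than U-ARM's actual code: the device interfaces (LeaderArm, FollowerArm), calibration values, and control rate are all assumptions.

```python
import time

class LeaderArm:
    """Hypothetical low-cost leader device: returns joint angles in radians."""
    def read_joints(self):
        return [0.0] * 6  # stub: replace with the encoder readout

class FollowerArm:
    """Hypothetical commercial follower arm: accepts joint-position commands."""
    def command_joints(self, q):
        pass  # stub: replace with the vendor SDK call

def retarget(q_leader, offsets, signs, limits):
    """Map leader joints to follower joints with per-joint calibration."""
    q = [s * (qi + o) for qi, o, s in zip(q_leader, offsets, signs)]
    return [min(max(qi, lo), hi) for qi, (lo, hi) in zip(q, limits)]

leader, follower = LeaderArm(), FollowerArm()
offsets = [0.0] * 6                       # assumed calibration offsets
signs = [1, -1, 1, 1, -1, 1]              # assumed axis-direction corrections
limits = [(-3.14, 3.14)] * 6              # assumed follower joint limits

episode = []                              # logged (timestamp, command) pairs
for _ in range(1000):                     # roughly 20 s at 50 Hz
    q_cmd = retarget(leader.read_joints(), offsets, signs, limits)
    follower.command_joints(q_cmd)
    episode.append((time.time(), q_cmd))  # data-collection pipeline record
    time.sleep(0.02)
```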
Before π0.5 Was Open-Sourced, a Powerful End-to-End Unified Foundation Model Was Also Open-Sourced in China, with Strong Generalization and Long-Horizon Manipulation
具身智能之心· 2025-09-11 02:07
Core Viewpoint
- The article discusses the release of π0.5 and WALL-OSS, highlighting their advancements in embodied intelligence and the significance of these models in the robotics industry, particularly in enhancing task execution in complex environments [1][3][5].

Group 1: Model Capabilities
- π0.5 demonstrates enhanced generalization capabilities through heterogeneous task collaborative training, enabling robots to perform long-horizon, fine-grained operations in new household environments [3][5].
- WALL-OSS achieves embodied perception through large-scale multimodal pre-training, allowing seamless integration of instruction reasoning, sub-goal decomposition, and fine-grained action synthesis within a single differentiable framework [8][18].
- The model exhibits high success rates in complex long-horizon manipulation tasks, showcasing robust instruction-following abilities and understanding of complex scenarios, surpassing existing baseline models [8][18][28].

Group 2: Training and Data
- The training process for WALL-OSS involves discrete, continuous, and joint phases, requiring only RTX 4090-level computational power for training and inference deployment [14][15].
- A multi-source dataset centered on embodied tasks was constructed, addressing the lack of large-scale, aligned VLA supervision and current visual language models' spatial understanding gaps [20][22].
- The dataset includes thousands of hours of data, covering both short-horizon operation tasks and long-horizon reasoning tasks, ensuring comprehensive training for the model [20][22][24].

Group 3: Experimental Analysis
- Experimental analysis on embodied visual question answering and six robotic operation tasks focused on language instruction understanding, reasoning, and generalization, as well as planning and execution of long-horizon, multi-stage tasks [25][31].
- WALL-OSS significantly outperformed its original baseline model in object grounding, scene captioning, and action planning tasks, demonstrating its enhanced scene understanding capabilities [27][28].
- The model's ability to follow novel instructions without task-specific fine-tuning was validated, achieving 85% average task progress on known object instructions and 61% on novel object instructions [29][31].

Group 4: Industry Impact
- The advancements in WALL-OSS and π0.5 are positioned to address existing limitations in visual language models and embodied understanding, paving the way for more capable and versatile robotic systems [5][8][20].
- The company, established in December 2023, focuses on developing a general embodied intelligence model using real-world data, aiming to create robots with fine operational capabilities [39].
- The recent completion of a nearly 1 billion yuan A+ round of financing indicates strong investor confidence in the company's direction and potential impact on the industry [39].
After My Advisor Pointed Me to VLA as a Research Direction...
具身智能之心· 2025-09-10 11:00
VLA research background and introduction

VLA (Vision-Language-Action) models are a new paradigm in embodied intelligence: given a language instruction and visual signals, they directly generate actions a robot can execute. This paradigm breaks the earlier limitation of training models only on single tasks and moves robot models toward greater generality and broader scene generalization. The importance of VLA models in academia and industry lies mainly in how they effectively integrate visual information, language instructions, and action decisions, markedly improving a robot's ability to understand and adapt to complex environments.

VLA breaks the single-task limitation of traditional methods, enabling robots to make autonomous decisions in diverse scenes and to handle unseen environments flexibly, with broad applications in manufacturing, logistics, and home services. VLA has also become a research hotspot, driving frontier projects such as pi0, RT-2, OpenVLA, QUAR-VLA, and HumanVLA, and fostering collaboration between academia and industry. Its adaptability shows in how it can be deployed on platforms ranging from robotic arms to quadruped and humanoid robots, offering broad potential and practical value for all kinds of intelligent robots and becoming a key driving force in the field.

From an industry perspective, embodied intelligence is booming both in China and abroad: teams such as Unitree, Zhiyuan, Xinghaitu, Galaxy General, and Zhujidongli are moving from the lab to commercialization, and tech giants such as Huawei, JD.com, and Tencent are also ...
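To make the VLA idea above concrete, here is a minimal toy sketch (an assumed architecture, not any specific model such as pi0, RT-2, or OpenVLA): a vision encoder and a language embedding produce tokens, a shared transformer fuses them, and an action head maps the fused representation to a short chunk of robot actions. All sizes and layer choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyVLA(nn.Module):
    """Toy Vision-Language-Action model: image + instruction -> action chunk."""
    def __init__(self, d=256, action_dim=7, horizon=8):
        super().__init__()
        self.vision = nn.Sequential(nn.Conv2d(3, d, 16, stride=16), nn.Flatten(2))
        self.text = nn.Embedding(32000, d)            # stand-in token embedding
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.action_head = nn.Linear(d, action_dim * horizon)
        self.action_dim, self.horizon = action_dim, horizon

    def forward(self, image, instruction_ids):
        vis_tokens = self.vision(image).transpose(1, 2)    # (B, N_img, d)
        txt_tokens = self.text(instruction_ids)            # (B, N_txt, d)
        fused = self.fusion(torch.cat([vis_tokens, txt_tokens], dim=1))
        pooled = fused.mean(dim=1)                          # simple pooling
        return self.action_head(pooled).view(-1, self.horizon, self.action_dim)

model = TinyVLA()
img = torch.randn(1, 3, 224, 224)
instr = torch.randint(0, 32000, (1, 12))    # e.g. a tokenized "pick up the cup"
actions = model(img, instr)                  # (1, 8, 7): 8 steps of 7-DoF actions
```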
FastVGGT from Cao Liujuan's Team at Xiamen University: A 4x Speedup That Breaks the VGGT Inference Bottleneck and Reduces Accumulated Error!
具身智能之心· 2025-09-10 06:18
Core Viewpoint
- The article introduces FastVGGT, a training-free acceleration method that optimizes the VGGT model by addressing the redundancy in global attention mechanisms, achieving up to 4 times faster inference while maintaining reconstruction accuracy and mitigating cumulative error issues in 3D visual tasks [26].

Group 1: Main Contributions
- FastVGGT enables VGGT to process 1000 input images in a single forward pass on a single GPU with 80GB VRAM, an improvement from 300 images previously [5].
- The method achieves a 4x speedup in inference time for 1000-image tasks while effectively reducing cumulative error [5][18].
- FastVGGT maintains high reconstruction quality, with improvements in metrics such as Chamfer Distance (CD) from 0.471 to 0.425 [18].

Group 2: Bottleneck Analysis
- The analysis identifies that the global attention mechanism in VGGT has significant redundancy, leading to unnecessary computations [6][7].
- Cumulative error is exacerbated in long sequences because the global attention mechanism amplifies minor errors over time [6].

Group 3: Methodology
- Token merging strategies are introduced to reduce the redundancy in VGGT's attention calculations, including reference frame constraints, key token retention, and region-based sampling [9][11] (a hypothetical sketch of this merge/unmerge step appears after this summary).
- The token merging process reduces the number of tokens involved in attention calculations, while token unmerging ensures the integrity of dense 3D reconstruction outputs [15].

Group 4: Experimental Results
- FastVGGT demonstrated a significant reduction in inference time and improved reconstruction quality across various datasets, including ScanNet-50, 7Scenes, and NRGBD [22].
- In point cloud reconstruction tasks, FastVGGT achieved a 4x speedup in inference time while maintaining reconstruction accuracy [18][22].
- The method also showed improvements in absolute trajectory error (ATE) and relative pose error (RPE) metrics, indicating enhanced performance in long-sequence inference [24].
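As a rough, hypothetical illustration of the training-free token-merging idea described above (not FastVGGT's actual implementation), the sketch below merges similar tokens before a global-attention call and scatters the outputs back to every original token afterwards, so downstream dense heads are unaffected. The merge criterion, keep ratio, and sampling scheme are assumptions.

```python
import torch

def merge_tokens(x, keep_ratio=0.25):
    """Greedy token merging: keep a subset of tokens and assign every dropped
    token to its most similar kept token. x: (N, d). Returns the kept tokens
    and an index map used later for unmerging."""
    n = x.shape[0]
    k = max(1, int(n * keep_ratio))
    kept_idx = torch.linspace(0, n - 1, k).long()            # stand-in for region-based sampling
    kept = x[kept_idx]
    sim = torch.nn.functional.normalize(x, dim=-1) @ \
          torch.nn.functional.normalize(kept, dim=-1).T      # (N, k) cosine similarity
    assign = sim.argmax(dim=-1)                               # each token -> nearest kept token
    return kept, assign

def unmerge_tokens(y_kept, assign):
    """Broadcast the attention outputs of kept tokens back to all original
    tokens, so dense prediction heads still see one output per input token."""
    return y_kept[assign]

# Usage with a generic attention function `attn(tokens) -> tokens`:
# kept, assign = merge_tokens(tokens, keep_ratio=0.25)   # fewer tokens enter global attention
# y = attn(kept)                                          # quadratic cost drops sharply
# out = unmerge_tokens(y, assign)                         # restore per-token outputs
```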