具身智能之心
SJTU Releases U-Arm: Breaking the Cost Barrier for Ultra-Low-Cost Universal Robotic-Arm Teleoperation
具身智能之心· 2025-09-11 02:07
Research Background and Core Needs
- The bottleneck in dual-arm policy learning is the lack of large-scale, high-quality real-world manipulation data, which transfers to robust policy training more directly than simulated or purely human data. The main way to obtain such data remains human demonstration, which requires a reliable teleoperation interface [4].

Existing Demonstration Interfaces
- Existing demonstration interfaces fall into two categories. U-ARM aims to resolve the conflict between high compatibility and low cost by providing an open-source, ultra-low-cost, easily adaptable leader-follower teleoperation system, enabling researchers to quickly set up data-collection pipelines for a range of commercial robotic arms [5].

Pain Points of Existing Solutions and U-ARM's Positioning
- Mainstream teleoperation devices suffer from kinematic singularities, workspace limitations, and insufficient precision, requiring complex post-processing. Effective leader-follower systems exist at higher price points, such as ALOHA (over $50,000) and GELLO ($270); U-ARM fills the gap between ultra-low cost and high compatibility at $50.5 per 6-DoF arm and $56.8 per 7-DoF arm, while remaining comfortable to use (no motion sickness) and easy to operate bimanually [6][9].

U-ARM System Design
- U-ARM's hardware follows the standardized joint configurations that most commercial 6/7-DoF robotic arms share. It offers three configurations to match different commercial arms, balancing compatibility and cost [10][14].

Mechanical Structure and Motor Modification
- Components are 3D-printed in PLA with a minimum wall thickness of 4 mm for durability. A dual-axis joint fixation alleviates the high radial loads that plague low-cost 3D-printed arms. The motors are modified by removing the internal gearbox to reduce back-drive resistance, yielding smoother motion while maintaining stability [13][16].

Algorithm Design
- Encoder calibration plus filtering and interpolation keep the motion smooth and free of jitter during operation, which is crucial for the accuracy of the teleoperation system [16][17].

Experimental Validation and Results Analysis
- Experiments covered simulation adaptation and real-world comparison to validate U-ARM's adaptability and efficiency. Tested against a Joycon baseline on typical desktop tasks, U-ARM cut task completion time by 39% [18][24].

Efficiency and Success Rate
- U-ARM averaged 17.7 s per task with a 75.8% success rate, versus 29.04 s and 83% for the Joycon; the 39% speedup ((29.04 − 17.7) / 29.04 ≈ 0.39) is attributed to U-ARM's more natural, rapid movements, at the cost of a slight loss of precision in fine manipulation [24].
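The filtering-and-interpolation step described above can be sketched roughly as follows. The exponential moving average, the constants, and the function names are illustrative assumptions, not U-ARM's actual implementation: the idea is simply to smooth noisy leader-arm encoder readings and then interpolate them up to the follower arm's control rate.

```python
# Hypothetical sketch of encoder filtering + interpolation for
# teleoperation. All names and constants are illustrative.

def ema_filter(samples, alpha=0.3):
    """Exponential moving average to suppress encoder jitter."""
    smoothed = []
    state = samples[0]
    for x in samples:
        state = alpha * x + (1 - alpha) * state
        smoothed.append(state)
    return smoothed

def interpolate(a, b, steps):
    """Linearly interpolate joint angles between two filtered samples."""
    return [a + (b - a) * i / steps for i in range(1, steps + 1)]

# Noisy readings (degrees) from one joint of the leader arm.
raw = [10.0, 10.4, 9.8, 10.2, 10.1]
smooth = ema_filter(raw)
# Upsample the last two filtered samples into 4 intermediate commands.
commands = interpolate(smooth[-2], smooth[-1], 4)
assert len(commands) == 4
```

The trade-off controlled by `alpha` is the usual one: a smaller value suppresses more jitter but adds latency between the operator's motion and the follower arm.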
Before π0.5's Open-Source Release, a Powerful End-to-End Unified Foundation Model Was Open-Sourced in China Too! Strong Generalization and Long-Horizon Manipulation
具身智能之心· 2025-09-11 02:07
Core Viewpoint
- The article discusses the release of π0.5 and WALL-OSS, highlighting their advancements in embodied intelligence and the significance of these models for the robotics industry, particularly in improving task execution in complex environments [1][3][5].

Group 1: Model Capabilities
- π0.5 demonstrates enhanced generalization through heterogeneous-task co-training, enabling robots to perform long-horizon, fine-grained manipulation in new household environments [3][5].
- WALL-OSS achieves embodied perception through large-scale multimodal pre-training, allowing seamless integration of instruction reasoning, sub-goal decomposition, and fine-grained action synthesis within a single differentiable framework [8][18].
- The model achieves high success rates on complex long-horizon manipulation tasks, showing robust instruction following and understanding of complex scenes, surpassing existing baseline models [8][18][28].

Group 2: Training and Data
- WALL-OSS training proceeds in discrete, continuous, and joint phases, and requires only RTX 4090-level compute for both training and inference deployment [14][15].
- A multi-source dataset centered on embodied tasks was constructed, addressing the lack of large-scale, aligned VLA supervision and the spatial-understanding gaps of current vision-language models [20][22].
- The dataset spans thousands of hours, covering both short-horizon manipulation and long-horizon reasoning tasks to train the model comprehensively [20][22][24].

Group 3: Experimental Analysis
- Experiments on embodied visual question answering and six robotic manipulation tasks focused on language-instruction understanding, reasoning, and generalization, as well as planning and execution of long-horizon, multi-stage tasks [25][31].
- WALL-OSS significantly outperformed its original baseline model on object grounding, scene captioning, and action planning, demonstrating enhanced scene understanding [27][28].
- Its ability to follow novel instructions without task-specific fine-tuning was validated: 85% average task progress on known-object instructions and 61% on novel-object instructions [29][31].

Group 4: Industry Impact
- The advancements in WALL-OSS and π0.5 address existing limitations of vision-language models and embodied understanding, paving the way for more capable and versatile robotic systems [5][8][20].
- The company, founded in December 2023, is developing a general embodied-intelligence model from real-world data, aiming to build robots with fine manipulation capabilities [39].
- Its recently completed A+ round of nearly 1 billion yuan signals strong investor confidence in the company's direction and potential industry impact [39].
After My Advisor Pointed Me to VLA as a Research Direction......
具身智能之心· 2025-09-10 11:00
Group 1
- The VLA (Vision-Language-Action) model represents a new paradigm in embodied intelligence, enabling robots to generate executable actions from language instructions and visual signals, enhancing their understanding and adaptability in complex environments [1][3].
- The VLA model breaks the limitations of traditional single-task training, allowing robots to make autonomous decisions in diverse scenarios, with applications in manufacturing, logistics, and home services [3][5].
- The VLA model has become a research hotspot, driving cutting-edge projects such as pi0, RT-2, OpenVLA, QUAR-VLA, and HumanVLA and fostering collaboration between academia and industry [3][5].

Group 2
- The embodied-intelligence sector is growing rapidly: teams like Unitree, Zhiyuan, Xinghaitu, and Yinhe General are moving from laboratories to commercialization, while tech giants like Huawei, JD.com, and Tencent are actively investing in the field [5].
- The VLA research course aims to equip students with comprehensive academic-research skills, including theoretical foundations, experimental design, and paper writing, with a focus on independent research capability [13][15].
- The curriculum emphasizes identifying research opportunities and innovation points, guiding students to develop their own research ideas and complete preliminary experiments [14][15].

Group 3
- The course covers the technical evolution of the VLA paradigm, from early grasp-pose detection to recent advances like Diffusion Policy and multimodal foundation models, focusing on end-to-end mapping from visual input and language instructions to robot actions [8][9].
- Core challenges in embodied intelligence, such as cross-domain generalization and long-horizon planning, are analyzed, along with strategies for combining large-language-model reasoning with robot control systems [9].
- The course aims to help students master the latest research methods and technical frameworks in embodied intelligence, addressing current limitations and advancing toward true general robotic intelligence [9][15].
Cao Liujuan's Team at Xiamen University Releases FastVGGT: 4x Speedup, Breaking VGGT's Inference Bottleneck While Reducing Cumulative Error!
具身智能之心· 2025-09-10 06:18
Core Viewpoint
- The article introduces FastVGGT, a training-free acceleration method that optimizes the VGGT model by addressing redundancy in its global attention mechanism, achieving up to 4x faster inference while maintaining reconstruction accuracy and mitigating cumulative error in 3D visual tasks [26].

Group 1: Main Contributions
- FastVGGT enables VGGT to process 1,000 input images in a single forward pass on a single GPU with 80 GB VRAM, up from 300 images previously [5].
- The method achieves a 4x inference speedup on 1,000-image tasks while effectively reducing cumulative error [5][18].
- FastVGGT maintains high reconstruction quality, with metrics such as Chamfer Distance (CD) improving from 0.471 to 0.425 [18].

Group 2: Bottleneck Analysis
- The analysis identifies significant redundancy in VGGT's global attention mechanism, leading to unnecessary computation [6][7].
- Cumulative error is exacerbated on long sequences because the global attention mechanism amplifies minor errors over time [6].

Group 3: Methodology
- Token-merging strategies optimize the redundancy in VGGT's attention calculations, including reference-frame constraints, key-token retention, and region-based sampling [9][11].
- Token merging reduces the number of tokens involved in attention calculations, while token unmerging preserves the integrity of dense 3D reconstruction outputs [15].

Group 4: Experimental Results
- FastVGGT showed a significant reduction in inference time and improved reconstruction quality across datasets including ScanNet-50, 7Scenes, and NRGBD [22].
- In point-cloud reconstruction tasks, it achieved a 4x inference speedup while maintaining reconstruction accuracy [18][22].
- It also improved absolute trajectory error (ATE) and relative pose error (RPE), indicating enhanced long-sequence inference [24].
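The merge/unmerge idea in Group 3 can be illustrated with a minimal sketch. The nearest-key assignment, the uniform key selection, and the `merge_tokens`/`unmerge_tokens` names are simplifying assumptions for illustration, not FastVGGT's actual code (which additionally applies reference-frame constraints and region-based sampling):

```python
# Hypothetical sketch of training-free token merging: each token is
# assigned to its most similar "key" token, attention would run only on
# the merged set, and unmerging broadcasts results back so dense
# outputs keep their original shape.
import numpy as np

def merge_tokens(tokens, key_idx):
    """Average each token into its nearest key token (cosine similarity)."""
    keys = tokens[key_idx]                              # (K, D)
    norm_t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    norm_k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    assign = (norm_t @ norm_k.T).argmax(axis=1)         # (N,) nearest key
    merged = np.stack([tokens[assign == k].mean(axis=0)
                       for k in range(len(key_idx))])   # (K, D)
    return merged, assign

def unmerge_tokens(merged, assign):
    """Broadcast merged outputs back to the original token count."""
    return merged[assign]

tokens = np.random.default_rng(0).normal(size=(100, 16))
merged, assign = merge_tokens(tokens, key_idx=np.arange(0, 100, 4))
dense = unmerge_tokens(merged, assign)
assert merged.shape == (25, 16) and dense.shape == (100, 16)
```

Running attention over 25 merged tokens instead of 100 is where the quadratic-cost savings would come from; the unmerge step keeps downstream dense-prediction heads unchanged.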
Shanghai Jiao Tong University Releases U-Arm: Breaking the Cost Barrier with an Ultra-Low-Cost Universal Robotic-Arm Teleoperation System
具身智能之心· 2025-09-10 03:31
Author: Yanwen Zou et al.  Editor: 具身智能之心

This article is shared for academic purposes only; in case of infringement, contact us for removal.

Research Background and Core Needs

In dual-arm policy learning, large-scale, high-quality real-world manipulation data has long been the bottleneck: compared with simulated or purely human data, real robot-arm data is the most directly applicable for training robust policies. The main way to obtain such data is still human demonstration, which requires a reliable teleoperation interface.

Existing demonstration interfaces fall into two main categories:

U-ARM was created precisely to resolve the conflict between "high compatibility" and "low cost": the goal is an open-source, ultra-low-cost, easily adaptable leader-follower teleoperation system that lets researchers quickly build data-collection pipelines for all kinds of commercial robotic arms.

Pain Points of Existing Solutions and U-ARM's Positioning

To show U-ARM's value more clearly, first compare the core characteristics of existing mainstream teleoperation devices (see Table 1):

End-effector trajectory recording devices (e.g., DexCap, UMI, OpenTelevision): lightweight and easy to use, but the collected data often exhibits kinematic singularities, falls outside the arm's workspace, or has insufficient ...
Registration Open | 2025 Wuxi International AI Innovation and Application Competition: 660,000 RMB in Prizes Focused on the Embodied Intelligence Track
具身智能之心· 2025-09-10 00:03
Official competition site: https://cvmart.net/cv_landing/list/wuxi2025

The 2025 Wuxi International AI Innovation and Application Competition officially opened on August 25. It is open worldwide across two tracks, an algorithm track and an embodied-intelligence innovative-application track, inviting algorithm developers, innovation teams, research institutes, and companies to take part in this embodied-AI competition and turn AI innovation into real applications.

The competition focuses on embodied intelligence. Algorithm-track participants will compete by developing algorithms on the 极市 platform and the DISCOVERSE embodied simulation platform.

The innovative-application track is open to embodied-intelligence application companies, embodied-ecosystem companies, smart-terminal companies, entrepreneurial teams with innovative ideas, research-institute teams, and individuals, who will innovate and build on embodied intelligence to propose and implement solutions of genuine novelty and practical value.

Register now to compete alongside top global talent and embodied-intelligence ecosystem companies, and help open a new era of embodied intelligence!

Competition Problems

- Robot raw-material recognition: tests an algorithm's visual recognition and operating efficiency in the simulation environment.
- Building-block assembly challenge: tests precise control and spatial-planning algorithms; the final also includes a real robotic-arm stage, putting virtual algorithms to the test in the real world.

[Algorithm Track]
Robot raw-material recognition: the raw-material recognition algorithm aims to ...
Three Months to Map Out the Entire Embodied Intelligence Technical Roadmap......
具身智能之心· 2025-09-10 00:03
In the exploration toward artificial general intelligence (AGI), embodied intelligence has gradually become one of the key directions. Unlike traditional preset action sequences, embodied intelligence emphasizes an agent's interaction with and adaptation to the physical environment, focusing on how to give agents the ability to perceive the environment, understand tasks, execute actions, and learn from feedback in the physical world.

The two most important parts of embodied intelligence, the "brain" and the "cerebellum," form an embodied robot's core modules. By analogy with a human, the brain handles thinking and perception (semantic understanding and task planning), while the cerebellum handles execution (high-precision motor control).

Industry Analysis in China and Abroad

Over the past two years, many star embodied-intelligence teams have spun out to found highly valuable companies. Teams such as 星海图, 银河通用, and 逐际动力 have moved from the laboratory into commercial and industrial practice, steadily advancing embodied hardware and "brain/cerebellum" technology.

Abroad, Tesla and Figure AI continue to push industrial and logistics robot applications, while US investors actively back companies such as Wayve and Apptronik in autonomous driving and warehouse robotics. Overall, Chinese companies drive embodied-intelligence deployment through supply-chain investment and integrated platforms, while foreign tech giants focus on foundation models, simulation environments, and humanoid prototypes; the two sides are accelerating into a critical stage of competition in this field.

The Technical Evolution of Embodied Intelligence

Among established Chinese tech giants, Huawei launched its "Global Embodied Intelligence Industry Innovation Center" at the end of 2024, partnering with companies such as 乐聚机器人 and 大族机器人 ...
Lithography Giant ASML Takes a 10.8-Billion-RMB Stake to Become the Top Shareholder of a Large-Model Company
具身智能之心· 2025-09-10 00:03
Editor: 量子位 (QbitAI)

Lithography giant ASML is now investing in large models too.

Just now, Dutch semiconductor-equipment giant ASML officially became the largest shareholder of French AI star Mistral AI, putting down 1.3 billion euros (about 10.8 billion RMB) in hard cash.

ASML is leading Mistral AI's Series C round, which totals 1.7 billion euros (about 14.2 billion RMB) and pushes the valuation of the two-and-a-half-year-old company to 10 billion euros (about 83.5 billion RMB), making it Europe's most valuable AI company.

More interestingly, ASML is not just writing a check; it has also demanded a seat on the board.

The world's only maker of EUV lithography machines is now formally and deeply bound to "Europe's OpenAI."

Mistral's meteoric rise

According to people familiar with the matter, the negotiations were kept quite low-key, with both sides signing NDAs. Bank of America, as ASML's financial adviser, played an important role throughout the process. Just a few weeks earlier, Bloomberg had reported that Mistral AI's valuation could reach 14 billion US dollars (about 11.9 billion euros, or 100 billion RMB ...
CoRL 2025 | SafeBimanual: Diffusion-Based Safe Trajectory Optimization for Bimanual Manipulation
具身智能之心· 2025-09-10 00:03
Author: Haoyuan Deng et al.  Editor: 具身智能之心

1. Preface

Bimanual manipulation is an indispensable capability for robots in household service, manufacturing, and medical scenarios. Compared with single-arm manipulation, dual-arm robots can coordinate to complete more complex tasks such as cooking, assembly, and object transport.

2. Introduction

Current diffusion-based policy generation methods can model and generate actions stably in high-dimensional action spaces, but their core problem is a lack of safety awareness. Existing methods focus only on completing tasks efficiently, without introducing explicit physical safety constraints into trajectory generation, which leads to a range of dangerous behaviors:

Figure 1: taxonomy of unsafe modes in bimanual manipulation

- Ignoring physical constraints: existing diffusion policies sample actions directly from the denoising distribution, without constraints on the two arms' spatial positions and dynamic relationship, easily producing crossed arm trajectories, misaligned dual-arm motions, and inconsistent end-effectors.
- Dangerous interactions: common risk modes include the two grippers colliding with each other, tearing rigid objects during manipulation, and the grippers ...
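The kind of safety-aware trajectory generation discussed above can be illustrated with a minimal, hypothetical sketch: after each (here omitted) denoising step, the trajectory is nudged down the gradient of a safety cost that penalizes the two end-effectors coming closer than a clearance threshold. The cost term, the clearance value, and the guidance schedule are illustrative assumptions, not SafeBimanual's actual formulation.

```python
# Hypothetical cost-guided trajectory update: push dual-arm waypoints
# apart when the two end-effectors violate a clearance threshold.
import numpy as np

def collision_cost_grad(traj, clearance=0.10):
    """Gradient of sum(max(0, clearance - dist)^2) over waypoints.
    traj has shape (T, 6): [x,y,z of left EE, x,y,z of right EE]."""
    left, right = traj[:, :3], traj[:, 3:]
    diff = left - right                                  # (T, 3)
    dist = np.linalg.norm(diff, axis=1, keepdims=True)   # (T, 1)
    viol = np.maximum(0.0, clearance - dist)
    g = -2.0 * viol * diff / np.maximum(dist, 1e-8)      # d(cost)/d(left)
    grad = np.zeros_like(traj)
    grad[:, :3], grad[:, 3:] = g, -g
    return grad

def guided_step(traj, step_size=0.05):
    """One guidance update moving colliding waypoints apart."""
    return traj - step_size * collision_cost_grad(traj)

# Two waypoints: in the second, the grippers are only 2 cm apart.
traj = np.array([[0.0, 0.0, 0.0, 0.30, 0.0, 0.0],
                 [0.0, 0.0, 0.0, 0.02, 0.0, 0.0]])
for _ in range(50):
    traj = guided_step(traj)
gap = np.linalg.norm(traj[1, :3] - traj[1, 3:])
assert gap > 0.02   # guidance increased the clearance
```

Waypoints already outside the clearance (the first row) have zero gradient and are left untouched, which is the appealing property of this style of guidance: the task-driven trajectory is modified only where a safety constraint is violated.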
π0.5 Is Open-Sourced!!!
具身智能之心· 2025-09-09 06:45
Editor: 具身智能之心

π0.5 is open-sourced!!!

The π0.5 model is an upgraded version of π0 that gains stronger open-world generalization through knowledge-insulation training. The project homepage was updated today with the 0.5 release information.

Project link: https://github.com/Physical-Intelligence/openpi

| Model | Use | Description | Checkpoint path |
| --- | --- | --- | --- |
| π0-ALOHA-pen-uncap | Inference | π0 model fine-tuned on public ALOHA data: can uncap a pen | gs://openpi-assets/checkpoints/pi0_aloha_pen_uncap |
| ... | ... | ... | ... |