具身智能之心

The 具身智能之心 humanoid robot discussion group is now open~
具身智能之心· 2025-08-31 02:33
The 具身智能之心 humanoid robot discussion group is here! Everyone working on humanoid locomotion control, VLA models, data collection, hardware, and related directions is welcome to join. Add the assistant on WeChat (AIDriver005) and include a note with your nickname + humanoid + join group. Note: requests without the note will not be approved~ ...
Live Stream! The "Embodied Data Dilemma": Where Simulation Technology, Real Data, and World Models Collide
具身智能之心· 2025-08-29 16:03
Core Viewpoint
- The article discusses the intersection of simulation technology, real data, and world models in the context of embodied intelligence, highlighting the ongoing debate about the importance of simulation versus real data and the potential breakthroughs in world modeling [3][11].

Group 1: Roundtable Discussion
- The roundtable focuses on the "data dilemma" in embodied intelligence, featuring four young scientists who explore the boundaries between simulation and real interaction, as well as technological advancements in world models like Genie [3][11].
- Sergey Levine's assertion that real data is irreplaceable is examined, questioning whether this is a strategic choice or an inevitable path in AI evolution [11].

Group 2: Key Participants
- Li Hongyang, an assistant professor at the University of Hong Kong, leads OpenDriveLab and has made significant contributions to end-to-end autonomous driving solutions, including the award-winning UniAD [4].
- Zhao Hao, an assistant professor at Tsinghua University, specializes in computer vision related to robotics and has co-founded over ten startups since 2009 [5].
- Gu Jiayuan, an assistant professor at ShanghaiTech University, focuses on generalizable robotic decision-making models and has received multiple awards for his research [6][7].
- Mu Yao, an assistant professor at Shanghai Jiao Tong University, has published extensively in top conferences and has received numerous academic honors [7].
HA-VLN: A Vision-and-Language Navigation Benchmark and Leaderboard with Dynamic Multi-Human Interactions
具身智能之心· 2025-08-29 16:03
Author丨Yifei Dong et al.  Editor丨Vision-and-Language Navigation

Motivation: Most traditional VLN systems ignore human dynamics and partial observability, yet real-world navigation often involves dynamic human activity such as moving crowds and personal-space requirements. The Human-Aware Vision-and-Language Navigation (HA-VLN) task is therefore proposed: while following natural-language instructions, the agent must cope with dynamic human activity, anticipate human motion, respect personal space, and adjust its path to avoid collisions.

HAPS 2.0 Dataset
Motivation: Existing simulators either ignore human behavior or model humans as static obstacles. The HA-VLN simulator addresses this long-standing challenge in socially aware navigation by placing multiple dynamically moving humans in both discrete and continuous 3D environments, with high-fidelity motion, multi-human interaction, and real-world complexity such as group gatherings, spontaneous movement, and personal-space constraints.
Overview: The HA-VLN simulator is built on the HAPS 2.0 dataset, which provides 486 motion sequences covering indoor and outdoor activities. It offers two complementary modules ...
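To make the personal-space and collision-avoidance requirement concrete, here is a minimal, hypothetical sketch of a waypoint cost that penalizes intruding on predicted human positions. It is not from the HA-VLN codebase; the 0.5 m radius and all function names are assumptions for illustration only.

```python
# Hypothetical illustration of a personal-space-aware waypoint cost.
# `predicted_humans` would come from a human-motion forecaster; the
# 0.5 m personal-space radius is an assumed value, not HA-VLN's setting.
import math

PERSONAL_SPACE_M = 0.5

def waypoint_cost(waypoint, goal, predicted_humans, intrusion_penalty=100.0):
    """waypoint, goal: (x, y); predicted_humans: list of (x, y) at the same timestep."""
    cost = math.dist(waypoint, goal)  # base cost: distance to the goal
    for human in predicted_humans:
        d = math.dist(waypoint, human)
        if d < PERSONAL_SPACE_M:
            cost += intrusion_penalty * (PERSONAL_SPACE_M - d)  # penalize entering personal space
    return cost

# Example: prefer the waypoint that keeps clear of a predicted pedestrian.
candidates = [(1.0, 0.0), (1.0, 0.6)]
humans = [(1.0, 0.1)]
best = min(candidates, key=lambda w: waypoint_cost(w, (3.0, 0.0), humans))
```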
ReconVLA: A Reconstructive VLA Approach to Robot Perception
具身智能之心· 2025-08-29 16:03
Core Viewpoint
- The article discusses the rapid development of Vision-Language-Action (VLA) models and introduces a new model called ReconVLA, which aims to enhance the precision of robotic actions by improving visual attention and focus on target objects [2][3][27].

Summary by Sections
Introduction
- Existing VLA models struggle with visual attention in complex scenes, leading to errors in object manipulation. Traditional methods to improve visual localization have not significantly enhanced attention distribution [6].
Model Overview
- ReconVLA introduces a reconstructive approach to visual localization, where the model first reconstructs the gaze region before predicting actions. This implicit supervision forces the model to focus on the correct object, improving action precision [8][11][14].
Methodology
- The framework consists of two branches: visual reconstruction and action prediction. The model uses a frozen visual tokenizer to encode the gaze region and employs a diffusion transformer for denoising and reconstruction [13][16].
- A large-scale dataset with over 100,000 trajectories and 2 million samples was created to pre-train the model, enhancing its visual generalization and implicit grounding capabilities [19].
Performance Results
- In simulations, ReconVLA achieved a nearly 95% success rate on long-horizon tasks, outperforming existing methods. The model demonstrated strong transferability to unseen objects, maintaining over 40% success rates even with novel items [9][26].
- The model's performance on real-world tasks, such as stacking bowls and placing fruits, showed significant improvements over previous models, achieving up to 90% success on specific tasks [25].
Contributions
- ReconVLA is the first model to use a gaze-region reconstruction paradigm, significantly enhancing visual attention and action-prediction accuracy. Extensive pre-training on diverse datasets establishes a solid foundation for its performance across tasks [14][27].
Conclusion
- The study highlights the limitations of current VLA models in visual focus and presents ReconVLA as a solution that effectively directs attention to key objects, paving the way for more reliable multi-modal robotic control [27].
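The two-branch idea (reconstruct the gaze region, then predict the action) can be sketched roughly as below. This is a simplified stand-in under stated assumptions, not the authors' implementation: the frozen visual tokenizer is replaced by a frozen linear layer and the diffusion transformer by a plain reconstruction head, just to show how a reconstruction loss and an action head can share one multimodal backbone.

```python
# Minimal sketch of a ReconVLA-style two-branch model (assumed names and shapes).
import torch
import torch.nn as nn

class ReconVLASketch(nn.Module):
    def __init__(self, d_model=512, action_dim=7):
        super().__init__()
        # Shared multimodal backbone (stand-in for the full VLA transformer).
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        # Frozen "tokenizer" producing reconstruction targets for the gaze region.
        self.gaze_tokenizer = nn.Linear(3 * 16 * 16, d_model)
        for p in self.gaze_tokenizer.parameters():
            p.requires_grad = False
        # Reconstruction head (stand-in for the diffusion transformer decoder).
        self.recon_head = nn.Linear(d_model, d_model)
        # Action head predicting, e.g., end-effector deltas + gripper state.
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, obs_tokens, lang_tokens, gaze_patches):
        # obs_tokens: (B, N_obs, d); lang_tokens: (B, N_lang, d)
        # gaze_patches: (B, N_gaze, 3*16*16) crops around the target object.
        h = self.backbone(torch.cat([obs_tokens, lang_tokens], dim=1))
        with torch.no_grad():
            target = self.gaze_tokenizer(gaze_patches)       # frozen targets
        recon = self.recon_head(h[:, : target.shape[1]])      # reconstruction branch
        recon_loss = nn.functional.mse_loss(recon, target)    # implicit grounding signal
        action = self.action_head(h.mean(dim=1))              # action branch
        return action, recon_loss

# Example usage with random tensors.
model = ReconVLASketch()
obs, lang = torch.randn(2, 32, 512), torch.randn(2, 8, 512)
gaze = torch.randn(2, 16, 3 * 16 * 16)
action, recon_loss = model(obs, lang, gaze)
```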
New Work from the OpenHelix Team! Long-VLA: Probing the Long-Horizon Bottleneck of End-to-End VLA Models and an Effective Solution
具身智能之心· 2025-08-29 05:02
Editor丨具身智能之心  This article is shared for academic purposes only; please contact us for removal in case of infringement.

We propose Long-VLA, the first end-to-end VLA model designed specifically for long-horizon tasks. Its core innovation is a phase-aware input mask: each sub-task is divided into a "movement phase" and an "interaction phase," and the visual inputs are adjusted dynamically per phase, so the model attends to global spatial cues while moving and to fine-grained local perception while interacting. In this way, Long-VLA retains the advantages of a unified architecture and end-to-end learning while effectively resolving the skill-chaining problem. Experiments show that Long-VLA significantly outperforms existing methods both in simulation and on real robot platforms, establishing a new performance benchmark and marking a breakthrough for research on long-horizon robot tasks.

Title: Long-VLA: Unleashing Long-Horizon Capability of Vision Language Action Model for Robot Manipulation
Link: https://arxiv.org/abs/ ...
Quadruped Robot Dog + Single Arm: Start Your Embodied-Learning Journey at Low Cost
具身智能之心· 2025-08-29 04:00
Core Viewpoint
- Xdog is a low-cost, multifunctional quadruped robotic dog and robotic arm development platform designed for embodied-AI developers, featuring a comprehensive curriculum for research and learning in robotics [1][2].

Group 1: Hardware Overview
- Xdog integrates a robotic dog and a robotic arm, with advanced functionalities such as voice control, sim2real, real2sim, target recognition and tracking, autonomous grasping, and reinforcement-learning gait control [2][5].
- The robotic dog measures 25 cm x 20 cm x 30 cm and weighs 7.0 kg, with a maximum speed of 7.2 km/h and a maximum rotation speed of 450 degrees per second [3][11].
- The main control chip is the Allwinner H616, featuring a quad-core 1.6 GHz CPU, 4 GB RAM, and 32 GB storage [4][5].

Group 2: Technical Specifications
- The robotic dog has a battery capacity of 93.24 Wh, providing approximately 120 minutes of operation and about 6 hours of standby time [5][11].
- The robotic arm can reach a maximum height of 0.85 m and has a grasping range of 0.4 m around its base [7].
- The depth camera uses active dual infrared and structured light, with a depth output resolution of 1280 × 800 @ 30 fps and a working distance of 0.2 m - 10 m [14].

Group 3: Software and Functionality
- The system supports various control methods including voice control, keyboard control, visual control, and reinforcement learning for autonomous movement [15][17].
- Development is based on ROS1 with Python as the primary programming language; a GPU of at least a 2080 Ti is recommended for inference [16][24].
- The platform supports advanced functionality such as collaborative control of the arm and dog for target following, as well as autonomous grasping [19][20].

Group 4: Educational Curriculum
- The curriculum includes hands-on training in ROS project creation, Mujoco simulation, and reinforcement-learning principles, among other topics [22][23].
- Courses cover the setup and usage of the Xdog system, including network configuration, camera parameter adjustment, and advanced algorithms for object recognition and tracking [22][23].
- The teaching team consists of experienced instructors responsible for project management, technical support, and algorithm training [22].

Group 5: Delivery and Support
- Delivery is completed within three weeks after payment, with a one-year warranty for after-sales service [25][26].
- The product includes the hardware and accompanying courses; no returns or exchanges are allowed for non-quality issues [26].
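Since the platform is ROS1-based with Python as the primary language, basic control of the dog typically comes down to publishing velocity commands. The sketch below assumes a standard geometry_msgs/Twist interface on a "/cmd_vel" topic; Xdog's actual control topic and message type may differ.

```python
#!/usr/bin/env python
# Minimal ROS1 sketch: stream velocity commands to a quadruped base.
# The "/cmd_vel" topic name is an assumption for illustration only.
import rospy
from geometry_msgs.msg import Twist

def walk_forward(duration_s=2.0, speed_mps=0.3):
    rospy.init_node("xdog_walk_demo", anonymous=True)
    pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)
    rate = rospy.Rate(20)  # stream commands at 20 Hz
    cmd = Twist()
    cmd.linear.x = speed_mps
    end_time = rospy.Time.now() + rospy.Duration(duration_s)
    while not rospy.is_shutdown() and rospy.Time.now() < end_time:
        pub.publish(cmd)
        rate.sleep()
    pub.publish(Twist())  # publish a zero command to stop

if __name__ == "__main__":
    walk_forward()
```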
Long-VLA: Jointly Built by Westlake University and Alibaba DAMO Academy, the World's First End-to-End VLA Model Supporting Long-Horizon Manipulation
具身智能之心· 2025-08-29 04:00
Core Viewpoint
- Long-VLA is the first end-to-end VLA model specifically designed for long-horizon tasks in robot manipulation, addressing the skill-chaining problem by introducing phase-aware input masks to dynamically adjust visual modalities during different task phases [2][4][14].

Technical Introduction
- Existing technologies for long-horizon tasks can be categorized into three types: end-to-end unified models, task-decomposition methods, and input-adaptive modular methods, each with limitations in handling long and complex tasks [3][4].
- Long-VLA combines the advantages of task decomposition within a unified architecture and dynamically adjusts perception modalities through input-level masking, effectively addressing the skill-chaining issue [4][6].

Model Description
- Long-VLA's core design includes three key components: task-phase division, an input-level adaptation strategy, and unified end-to-end training. Tasks are divided into "movement phases" and "interaction phases," with a newly annotated L-CALVIN dataset to support this division [6][8].
- The input adaptation strategy employs a binary masking mechanism to dynamically adjust attention inputs, enhancing task continuity and mitigating phase distribution differences [6][8].

Experimental Results
- In the optimized CALVIN environment, Long-VLA significantly outperformed baseline models on long-horizon tasks, remaining stable across ten consecutive sub-tasks [8][10].
- In real-world sorting and cleaning tasks, Long-VLA showed superior performance under varying conditions, confirming its robustness and generalization capabilities [10][12].
- Long-VLA achieved an improvement in average task length over baseline methods, with notable gains on performance metrics [13].

Conclusion
- This research strikes a balance between end-to-end training and long-horizon adaptability, laying the groundwork for further exploration of long-horizon robot task execution [14].
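The phase-aware binary masking can be illustrated with the sketch below. It assumes the two visual streams are a global (static) camera and a wrist camera, which is an assumption for illustration; the paper's exact camera setup and masking implementation may differ.

```python
# Minimal sketch of phase-aware input masking over two visual token streams
# (assumed tensor shapes and stream names, not the authors' code).
import torch

def phase_aware_mask(global_tokens, wrist_tokens, phase):
    """global_tokens, wrist_tokens: (B, N, d); phase: "movement" or "interaction".

    Returns the concatenated tokens and a boolean keep-mask (True = attend),
    so one unified policy sees only the phase-relevant visual input.
    """
    keep_global = phase == "movement"      # moving: rely on global spatial cues
    keep_wrist = phase == "interaction"    # interacting: rely on local fine-grained view
    tokens = torch.cat([
        global_tokens * float(keep_global),
        wrist_tokens * float(keep_wrist),
    ], dim=1)
    keep = torch.cat([
        torch.full(global_tokens.shape[:2], keep_global, dtype=torch.bool),
        torch.full(wrist_tokens.shape[:2], keep_wrist, dtype=torch.bool),
    ], dim=1)
    return tokens, keep

# Example: the same policy network receives different visual evidence per phase.
g, w = torch.randn(1, 64, 256), torch.randn(1, 64, 256)
move_tokens, move_mask = phase_aware_mask(g, w, "movement")
interact_tokens, interact_mask = phase_aware_mask(g, w, "interaction")
```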
Live Tonight | 星海图 X Hugging Face! How Can the Open-Source Ecosystem Lead the Future of Embodied Intelligence?
具身智能之心· 2025-08-29 00:05
Core Viewpoint
- The article emphasizes the importance of open-source ecosystems in accelerating the development and implementation of embodied intelligence, highlighting collaborations among various industry players and developers [1].

Group 1
- The collaboration between 星海图 and Hugging Face aims to foster a vibrant developer community and explore open-source models and datasets [1][2].
- A live discussion featuring Thomas Wolf, co-founder of Hugging Face, and Zhao Xing, chief scientist of 星海图, will take place to discuss the future of embodied intelligence and the open-source ecosystem [3][6].
- The live event is scheduled for August 29 at 19:00 [4][10].
How Does Traditional SLAM-Based Localization and Navigation Differ from Embodied Goal-Driven Navigation?
具身智能之心· 2025-08-29 00:03
Goal-driven navigation gives robots the ability to complete navigation goals autonomously.

Embodied navigation, a core area of embodied intelligence, rests on three technical pillars: language understanding, environment perception, and path planning. Goal-Oriented Navigation, which grants the robot autonomous decision-making capability, is the most representative direction within embodied navigation. It requires an agent, placed in an unfamiliar 3D environment, to complete environment exploration and path planning on its own given only a goal description (e.g., coordinates, an image, or natural language).

Unlike traditional vision-and-language navigation (VLN), which relies on explicit instructions, a goal-driven navigation system must make the leap from "understand the instruction and walk the right route" to "understand the world and find the route by itself": when a human issues the instruction "go to the kitchen and fetch a cola," the robot must autonomously perform semantic parsing (recognizing the spatial features of a kitchen and the visual attributes of a cola), environment modeling (building a spatial topology of the home scene), and dynamic decision-making (avoiding moving people or pets). Behind this lie breakthroughs at the intersection of computer vision, reinforcement learning, and 3D semantic understanding.

Goal-driven navigation has already been industrialized in several vertical domains. In last-mile delivery, it is combined with social navigation algorithms so robots can handle dynamic environments and human interaction: Meituan's autonomous delivery vehicles execute deliveries in complex urban environments via dynamic path replanning, and Starship Technologies' campus delivery robots have been deployed at universities and in communities across Europe and the US. In medical, hotel, and catering scenarios, 嘉 ...
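The "parse the goal, model the environment, replan around dynamic obstacles" loop described above can be boiled down to a toy example. The grid world and BFS planner below are purely illustrative stand-ins for real semantic mapping, learned perception, and 3D scene understanding.

```python
# Self-contained toy: goal-driven navigation on a grid with replanning
# around a moving obstacle (a wandering "pet"). Illustrative only.
from collections import deque

def bfs_path(grid, start, goal):
    """Shortest path on a 0/1 occupancy grid (1 = blocked)."""
    rows, cols = len(grid), len(grid[0])
    prev, queue = {start: None}, deque([start])
    while queue:
        cur = queue.popleft()
        if cur == goal:
            path = []
            while cur is not None:
                path.append(cur)
                cur = prev[cur]
            return path[::-1]
        r, c = cur
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0 and nxt not in prev:
                prev[nxt] = cur
                queue.append(nxt)
    return None

grid = [[0] * 5 for _ in range(5)]
pos, goal = (0, 0), (4, 4)          # goal given only as a target description (coordinates)
moving_obstacle = (2, 2)
for step in range(20):
    if pos == goal:
        break
    observed = [row[:] for row in grid]
    observed[moving_obstacle[0]][moving_obstacle[1]] = 1   # latest observation of the obstacle
    path = bfs_path(observed, pos, goal)                   # replan on every step
    pos = path[1]                                          # take one step, then re-observe
    moving_obstacle = (2, (moving_obstacle[1] + 1) % 5)    # the obstacle wanders
print("reached goal:", pos == goal)
```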