具身智能之心

Recruiting for Paper Tutoring in VLA / Reinforcement Learning / VLN!
具身智能之心· 2025-08-14 12:00
1v1 paper tutoring from 具身智能之心 is here! Three slots are currently open in the VLA, reinforcement learning, and sim2real directions, aimed mainly at A- and B-tier venues. Target conferences: CVPR, ICCV, ECCV, ICLR, CoRL, ICML, ICRA, etc. The tutors are active in the embodied-AI research community and come with their own ideas. Interested students can add WeChat oooops-life to inquire, or scan the QR code directly, with the note "embodied paper tutoring inquiry". ...
VLA / VLA+Tactile / VLA+RL / Embodied World Models, and More! China's First Hands-On Course on Embodied "Brain + Cerebellum" Algorithms
具身智能之心· 2025-08-14 06:00
In the pursuit of artificial general intelligence (AGI), embodied intelligence has gradually become one of the key directions. Unlike traditional preset action sequences, embodied intelligence emphasizes an agent's interaction with and adaptation to the physical environment, focusing on giving agents the ability to perceive the environment, understand tasks, execute actions, and learn from feedback in the physical world.

The two most important parts of embodied intelligence, the "brain" and the "cerebellum", form the core modules of an embodied robot. By analogy with a human, the brain handles thinking and perception (semantic understanding and task planning), while the cerebellum handles execution (high-precision motion control).

Industry Landscape at Home and Abroad
Over the past two years, many star embodied-AI teams have spun out to found highly valuable companies. Teams such as Galaxea (星海图), Galbot (银河通用), and LimX Dynamics (逐际动力) have moved from the lab into commercial and industrial settings, steadily advancing embodied hardware and brain/cerebellum technology.

Among China's established tech giants, Huawei launched its "Global Embodied Intelligence Industry Innovation Center" at the end of 2024, partnering with companies such as 乐聚机器人 and 大族机器人 to jointly develop key technologies including the embodied brain and cerebellum. Since May 2025, JD.com has made successive investments in 智元机器人, 千寻智能, LimX Dynamics, and other companies to strengthen efficiency and service capabilities in logistics and home-service scenarios. Tencent, Ant Group, Xiaomi, and other tech giants are also actively building out the embodied-intelligence ecosystem through strategic investments and partnerships.

Internationally, Tesla and Figure AI continue to push industrial and logistics robot applications ...
Learning to See and Act: Task-Aware View Planning in Robotic Manipulation
具身智能之心· 2025-08-14 00:03
Research Background and Motivation
- Existing vision-language-action (VLA) models for multi-task robotic manipulation rely on fixed viewpoints and shared visual encoders, which limit 3D perception and cause task interference, hurting robustness and generalization [2][3]
- Fixed viewpoints are particularly problematic in cluttered scenes, where occlusion can lead to incomplete scene understanding and inaccurate action prediction [2]
- The limitations of shared encoders are most evident across tasks with large visual and semantic differences, restricting model generalization and scalability [2]

Core Method: TAVP Framework
- The Task-Aware View Planning (TAVP) framework integrates active view planning with task-specific representation learning, built around the TaskMoE module and the MVEP strategy [3]

TaskMoE: Task-Aware Mixture-of-Experts Module
- Designed to improve multi-task accuracy and generalization through two key innovations (a minimal routing sketch follows this summary) [5]

MVEP: Multi-View Exploration Policy
- Selects K viewpoints that maximize the capture of information relevant to the manipulation target, improving action-prediction accuracy [6]

Training Strategy
- The training process consists of three phases:
  1. Phase 1: train TAVP's fixed-viewpoint variant using three default viewpoints [7]
  2. Phase 2: optimize MVEP on top of the fixed-viewpoint model using the PPO algorithm [8]
  3. Phase 3: fine-tune the entire TAVP model except MVEP, using the same loss functions as in Phase 1 [8]

Key Results
- TAVP outperforms fixed-viewpoint dense models (RVT2, ARP, ARP+) in success rate across all tasks, with a 56% improvement on challenging tasks and an average success rate rising from 84.9% to 86.7% [13][14]

Ablation Study
- Removing TaskMoE drops the average success rate from 86.67% to 85.56%, underscoring its role in multi-task representation learning [15][18]

Sensitivity Analysis
- Increasing the number of viewpoints (K) significantly improves success rates, especially on occlusion-prone tasks [16][17]

Efficiency and Generalization Analysis
- TAVP achieves a higher average success rate (86.67%) than ARP+ (84.90%), at the cost of roughly 10.7% additional inference latency [20]
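The summary names TaskMoE only at a high level. As a purely illustrative aid, here is a minimal PyTorch sketch of task-aware mixture-of-experts routing, in which a gate conditioned on a task embedding weights a set of expert networks. All class, parameter, and dimension names are our assumptions; the paper's actual architecture (expert count, sparse top-k routing, placement in the encoder) may differ.

```python
# Hypothetical sketch of task-aware MoE routing; module names, sizes, and
# the dense (softmax over all experts) routing are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskAwareMoE(nn.Module):
    def __init__(self, feat_dim: int, task_dim: int, num_experts: int = 4):
        super().__init__()
        # One small feed-forward expert per slot; real systems use larger FFNs
        # and sparse top-k routing instead of the dense mix used here.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(feat_dim, feat_dim), nn.GELU(), nn.Linear(feat_dim, feat_dim)
            )
            for _ in range(num_experts)
        )
        # The gate reads the task embedding, so routing depends on the task,
        # not just on the visual features -- this is what makes it task-aware.
        self.gate = nn.Linear(task_dim, num_experts)

    def forward(self, x: torch.Tensor, task_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, feat_dim) visual features; task_emb: (batch, task_dim)
        weights = F.softmax(self.gate(task_emb), dim=-1)           # (B, E)
        expert_out = torch.stack([e(x) for e in self.experts], 1)  # (B, E, D)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)     # (B, D)

# Toy usage: two tasks route the same kind of observation differently.
moe = TaskAwareMoE(feat_dim=32, task_dim=8)
obs = torch.randn(2, 32)
tasks = torch.randn(2, 8)
out = moe(obs, tasks)  # shape: (2, 32)
```

Conditioning the gate on the task embedding rather than on the visual features alone means visually similar observations from different tasks can still be sent to different experts, which is the mechanism the ablation credits for the 86.67% vs. 85.56% gap.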
Nvidia Launches a Reasoning "Brain" for Robots! The Upgraded Cosmos World Model Is Here
具身智能之心· 2025-08-14 00:03
Core Viewpoint
- Nvidia is significantly advancing its robotics development infrastructure, focusing on the integration of AI and computer graphics to enhance robot capabilities and reduce training costs [17][20][21]

Group 1: Product and Technology Updates
- Nvidia introduced the upgraded Cosmos world model at the SIGGRAPH conference; it is designed to generate synthetic data that obeys real-world physics [2][3]
- The upgrade emphasizes planning capability and generation speed, with enhancements across software and hardware, including the new Omniverse libraries and RTX PRO Blackwell servers [4][8]
- The new Cosmos Reason model has 7 billion parameters and reasoning capabilities, helping robots plan tasks [6][10]
- Cosmos Transfer-2 and its lightweight version accelerate the conversion of virtual scenes into training data, significantly reducing the time this process requires [12][13]

Group 2: Integration of AI and Graphics
- Nvidia's vice president of AI research highlighted the powerful, industry-rare synergy between simulation capability and AI system development [5]
- Together, Cosmos and Omniverse aim to create a realistic, scalable "virtual parallel universe" in which robots can safely experiment and evolve [22][23]
- Building this virtual environment requires integrating real-time rendering, computer vision, and physics simulation [23]

Group 3: Market Strategy and Collaborations
- Nvidia is positioning itself strategically in robotics, treating the convergence of computer graphics and AI as a transformative force in the industry [20][21]
- The company is collaborating with Chinese firms, including Alibaba Cloud and several robotics companies, to expand its influence in the domestic market [26][27]
- Nvidia's approach mirrors its earlier playbook of supplying compute to emerging AI companies, suggesting a similar trajectory in robotics [25][26]
Want to Work on Embodied AI? My Senior Labmate Pointed Me Here...
具身智能之心· 2025-08-14 00:03
Core Insights
- The article emphasizes the value of a responsive community that addresses members' needs and provides support for technical and job-seeking challenges in the field of embodied intelligence [1][3][17]

Group 1: Community and Support
- The community has successfully created a closed loop across industry, academia, job seeking, and Q&A exchange, facilitating timely solutions to the problems members face [3][17]
- Members have received job offers from leading companies in the embodied intelligence sector, showcasing the community's effectiveness in supporting career advancement [1][3]
- The community offers a platform for sharing specific challenges and solutions, such as data collection and model deployment, enhancing practical application in projects [1][3]

Group 2: Educational Resources
- The community has compiled over 30 technical roadmaps for newcomers, significantly reducing the time needed for research and learning [4][17]
- It provides access to numerous open-source projects, datasets, and mainstream simulation platforms relevant to embodied intelligence, aiding both beginners and advanced practitioners [17][20]
- Members can engage in roundtable discussions and live sessions with industry experts, gaining insight into the latest developments and challenges in the field [4][20]

Group 3: Job Opportunities and Networking
- The community has established a job-referral mechanism with multiple leading companies, ensuring members receive timely job recommendations [11][20]
- Members are encouraged to connect with peers and industry leaders, fostering a collaborative environment for knowledge sharing and professional growth [20][45]
- The community actively supports members in preparing job applications and interviews, enhancing their employability in a competitive market [20][45]
Keep the Accuracy, Gain the Speed! Spec-VLA: The First Speculative Decoding Framework Designed for VLA Inference Acceleration
具身智能之心· 2025-08-14 00:03
Core Viewpoint
- The article introduces the Spec-VLA framework, which uses speculative decoding to accelerate inference for Vision-Language-Action (VLA) models, achieving significant speedups without fine-tuning the VLA verification model [2][6]

Group 1: Spec-VLA Framework
- Spec-VLA is the first speculative decoding framework designed specifically to accelerate VLA inference [2]
- The framework achieves a 42% speedup over the OpenVLA baseline while training only the draft model [6]
- The proposed mechanism increases acceptance length by 44% while maintaining the task success rate [2]

Group 2: Technical Details
- The article highlights the challenges posed by the large parameter scale and autoregressive decoding of vision-language models (VLMs) [2]
- Speculative decoding (SD) lets a large language model (LLM) emit multiple tokens per forward pass, effectively speeding up inference (see the sketch after this summary) [2]
- The framework adopts a relaxed acceptance mechanism based on the relative distances between the values that action tokens represent in VLA models [2]

Group 3: Live Broadcast Insights
- The live broadcast covers speculative decoding as an acceleration method for LLMs, an introduction to VLA models, and implementation details of the Spec-VLA framework [7]
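To make the mechanism concrete, below is a minimal sketch of one speculative-decoding step with a relaxed acceptance test for discretized action tokens. The distance-based rule (accept a draft token whose action bin lies within `tol` of the verifier's choice) is our reading of the idea, not Spec-VLA's actual criterion, and both callables are hypothetical stand-ins for the draft and verifier models.

```python
# Illustrative single step of speculative decoding with a relaxed acceptance
# test for discretized action tokens. The distance-based rule and both
# callables are assumptions in the spirit of Spec-VLA, not its actual code.
from typing import Callable, List

def speculative_decode_step(
    draft_next: Callable[[List[int]], List[int]],  # small draft model
    verify: Callable[[List[int]], List[int]],      # large verifier, one pass
    prefix: List[int],
    k: int = 4,       # number of draft tokens proposed per step
    tol: int = 1,     # relaxed acceptance radius, in action bins
) -> List[int]:
    """Propose k draft tokens, verify them in one forward pass of the big
    model, and accept a draft token if its action bin lies within `tol`
    of the verifier's own choice at that position."""
    draft = draft_next(prefix)[:k]
    # The verifier scores the whole draft at once and returns, per position,
    # the token it would have emitted itself.
    target = verify(prefix + draft)
    accepted: List[int] = []
    for d, t in zip(draft, target):
        if abs(d - t) <= tol:      # relaxed: adjacent bins encode nearly
            accepted.append(d)     # the same continuous action value
        else:
            accepted.append(t)     # take the verifier's token and stop
            break
    return prefix + accepted
```

With `tol=0` this reduces to standard speculative decoding, where only exact matches survive; relaxing `tol` lets near-miss action bins through, lengthening accepted runs, which is the intuition behind the reported 44% gain in acceptance length.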
An End-to-End Model! GraphCoT-VLA: A VLA Model for Manipulation Tasks with Ambiguous Instructions
具身智能之心· 2025-08-13 00:04
Author: Helong Huang et al.  Editor: 具身智能之心
This article is shared for academic purposes only; contact us for removal in case of infringement.

Preface & Motivation
VLA models have become a key paradigm in robotic manipulation. However, existing VLA models show clear limitations when handling ambiguous language instructions and unknown environment states. Moreover, their perception is largely confined to static 2D observations, lacking the ability to model the 3D interaction between the robot and its environment. To address these challenges, this work proposes GraphCoT-VLA, an efficient end-to-end model. To strengthen the model's understanding of ambiguous instructions and improve task planning, a structured Chain-of-Thought reasoning module is designed that integrates high-level task understanding and planning, feedback from failed tasks, and low-level imaginative reasoning about future object positions and robot actions. In addition, a real-time-updatable 3D pose-object graph captures the spatial configuration of the robot's joints and the topological relations among objects in 3D space, letting the model better understand and handle their interactions (a toy sketch of such a graph follows this summary). A dropout hybrid reasoning strategy is further integrated for efficient control output. In ...
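As a concrete illustration of the pose-object graph idea described above, here is a toy sketch in which nodes store 3D positions of robot joints and scene objects, and edges connect spatially close pairs. The field names, the proximity rule, and the update logic are all our assumptions for illustration; the paper's actual graph construction is more involved.

```python
# Toy sketch of a real-time-updatable 3D pose-object graph. Field names,
# the proximity rule, and the update logic are illustrative assumptions;
# GraphCoT-VLA's actual graph construction is more involved.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class Node:
    name: str
    kind: str        # "joint" (robot) or "object" (scene)
    position: Vec3   # 3D position in the world frame

@dataclass
class PoseObjectGraph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    # Edges store (node_a, node_b, euclidean_distance).
    edges: List[Tuple[str, str, float]] = field(default_factory=list)

    def update(self, name: str, kind: str, position: Vec3,
               radius: float = 0.5) -> None:
        """Insert or refresh a node, then rebuild its proximity edges so the
        graph tracks the scene as the robot and objects move."""
        self.nodes[name] = Node(name, kind, position)
        # Drop stale edges touching this node, then reconnect near neighbors.
        self.edges = [e for e in self.edges if name not in e[:2]]
        for other in self.nodes.values():
            if other.name == name:
                continue
            dist = sum((p - q) ** 2
                       for p, q in zip(position, other.position)) ** 0.5
            if dist <= radius:
                self.edges.append((name, other.name, dist))

# Toy usage: the gripper and a mug end up linked once they are close.
graph = PoseObjectGraph()
graph.update("gripper", "joint", (0.40, 0.10, 0.30))
graph.update("mug", "object", (0.45, 0.12, 0.28))
print(graph.edges)  # one short edge between 'mug' and 'gripper'
```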
Nearly 2,000 Members! Look How Much This Embodied AI Community Has Quietly Been Doing...
具身智能之心· 2025-08-13 00:04
Making learning fun is a remarkable thing; pushing an industry forward is even greater! A month ago, chatting with friends, we said our vision is to bring AI and embodied-intelligence education to every student who needs it.

The 具身智能之心 Knowledge Planet has by now closed the loop across industry, academia, job seeking, and Q&A exchange. The small team running it reflects every day on what kind of community people actually need: not flashy showmanship, not style over substance, not a place where nobody talks, and certainly not one where nobody lands a job.

So we prepared the most cutting-edge academic content, expert-level roundtables, open-source code solutions, and the most timely job information...

Inside the planet we have organized 30+ technical roadmaps; whether you need benchmarks, surveys, or beginner learning paths, they can greatly shorten your search time. We have also invited dozens of guests from the embodied-AI field, all active leaders in industry and research (you will often see them at top conferences and in interviews). Feel free to ask questions at any time; they will answer them.

Beyond that, there are many roundtable forums and live streams, covering everything from hardware and data to algorithms, gradually sharing what is really happening in the embodied industry and what problems remain!

The planet has also established internal-referral channels with multiple embodied-AI companies; feel free to @ us at any time, and we will get your resume into the hands of your target company right away.

For beginners, we have compiled many introductory ...
VLA or VTLA? This Company Is Upending the Future of Robotics with "Superhuman Tactile" Technology!
具身智能之心· 2025-08-13 00:04
Over the past few days we toured WRC25 and saw the products and capabilities of the various embodied-robot companies. Honestly, compared with last year, hardware and technology have improved substantially. We also saw several companies that skipped WAIC25. The overall takeaway: today's robot bodies can already meet the needs of some scenarios; it is the perception "brain" that now lags behind the hardware.

We saw many related technologies on site, especially VLA models. VLA, the new generation of end-to-end vision-language-action models, is a focus for companies and research institutions alike. But the demos also exposed an obvious problem: while vision provides rich environmental information, it cannot precisely sense an object's material, hardness, or friction during physical interaction (grasping, manipulating objects). In scenarios such as industrial assembly, surgery, and home service, robots must perform high-precision tasks, and accidentally applying too much force can have serious consequences.

Tactile sensors are thus crucial, yet many problems remain unsolved: resolution is low, real-time performance cannot be guaranteed, units break shortly after purchase, and quality is poor. At the show, however, we found one tactile-sensor hardware company that has achieved the best balance of resolution, real-time performance, durability, and cost: Daimon Robotics (戴盟机器人).

Just in the past few days, Daimon Robotics announced the completion of a hundred-million-RMB-scale angel++ round, led by 招商局创投 with 东方嘉富 and 架桥资本 participating. This round ...
How Does AI Learn, Step by Step, to "Read" Spatiotemporal Structure? A Survey of the Five Levels on the Road to the 4D World
具身智能之心· 2025-08-13 00:04
Editor: 机器之心

4D spatial intelligence reconstruction is a core challenge in computer vision: its goal is to recover the dynamic evolution of 3D space from visual data. By integrating static scene structure with spatiotemporal dynamics, it builds spatial representations that carry a temporal dimension, with key value in virtual reality, digital twins, and intelligent interaction.

Current research unfolds along two technical dimensions: at the foundational reconstruction level, the focus is on precisely extracting low-level visual elements such as depth estimation, camera localization, and dynamic point clouds; at the higher understanding level, the aim is to parse the spatiotemporal relations and physical constraints among scene components.

arXiv: https://arxiv.org/abs/2507.21045
Project Page: https://github.com/yukangcao/Awesome-4D-Spatial-Intelligence

The authors propose a new analytical perspective that divides existing methods into five progressive levels according to the depth at which they construct spatial intelligence. This multi-dimensional spatial modeling capability is becoming the infrastructure of next-generation AI, whether for building embodied intelligence's environmental ...