On the Ground at CVPR: Crowds Swarm Chinese Exhibitors' Booths, Tencent Stands Out with 40+ Accepted Papers
具身智能之心· 2025-06-18 10:41
Core Insights
- The article highlights the significant participation of Chinese companies in CVPR 2025, showcasing their technological advancements and commitment to AI development [4][9][46]
- Key trends identified include a focus on multimodal and 3D generation technologies, with Gaussian Splatting emerging as a prominent technique [8][15][17]

Group 1: Event Overview
- CVPR 2025 has gained increased attention and social engagement, with a record number of Chinese enterprises participating [2][4]
- The conference is recognized as a leading event in the field of computer vision, with its accepted papers indicating cutting-edge technological trends [12][13]

Group 2: Research Trends
- Multimodal and 3D generation are highlighted as popular research directions, with Gaussian Splatting being a frequently mentioned keyword in accepted papers [8][15][17]
- A total of 2,878 papers were analyzed, revealing high-frequency terms such as "Multimodal" (75 occurrences) and "Diffusion Model" (153 occurrences) [16] (an illustrative counting sketch follows this summary)

Group 3: Chinese Companies' Participation
- Chinese companies, particularly Tencent, have shown deep involvement, with Tencent alone having over 40 accepted papers across various research areas [33][34]
- The participation of Chinese firms in sponsorships and workshops indicates their commitment to the conference and the broader AI landscape [36][38]

Group 4: Technological Advancements
- Tencent's investment in AI research is substantial, with R&D spending exceeding 70.686 billion RMB in 2024, reflecting a strong commitment to technological innovation [46]
- The company has also made significant strides in patents, with over 85,000 applications filed globally [46]

Group 5: Talent Attraction
- The presence of Chinese companies at top conferences serves to attract talent, emphasizing the importance of technical recognition over salary for top-tier professionals [47]
- Tencent's diverse application scenarios, including WeChat and gaming, provide a robust ecosystem that supports ongoing technological development [49][50]
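The article does not specify how the keyword statistics were produced; as a purely illustrative sketch, a list of accepted-paper titles can be scanned for each term of interest. The `titles` values below are hypothetical placeholders, not actual CVPR 2025 titles.

```python
from collections import Counter
import re

def keyword_frequencies(titles, keywords):
    """Count how many paper titles mention each keyword (case-insensitive)."""
    counts = Counter()
    for title in titles:
        for kw in keywords:
            if re.search(re.escape(kw), title, flags=re.IGNORECASE):
                counts[kw] += 1
    return counts

# Hypothetical usage: `titles` would hold the 2,878 accepted-paper titles.
titles = [
    "Gaussian Splatting for Dynamic Scenes",
    "A Multimodal Diffusion Model for 3D Generation",
]
print(keyword_frequencies(titles, ["Multimodal", "Diffusion Model", "Gaussian Splatting"]))
```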
What Data Collection Approaches Do Industry and Academia Use for Embodied Intelligence?
具身智能之心· 2025-06-18 10:41
Editor: 具身智能之心

What data collection approaches exist for embodied intelligence?

In the embodied domain, robot locomotion control is mostly trained with RL, while robot-arm manipulation tasks generally use imitation learning. Data collection is therefore the core step and directly determines the performance of the downstream model. Below are the main collection approaches and their trade-offs. All content comes from the 具身智能之心知识星球 community, where nearly 200 embodied-intelligence companies and institutions exchange ideas.

Collection approaches

1) Teleoperation: depends on the physical robot and is relatively costly, but pre- and post-processing are simple and data quality is the highest (a minimal recording-loop sketch follows this list).

2) Open-scene collection: does not depend on the robot embodiment but requires some pre- and post-processing. Collection cost is low, it is not limited to environments the arm can reach, and data collected once can later be mapped to many embodiments. However, there is a gap between the collected data and real deployment, and sensor information may be incomplete.

3) Synthetic data: does not depend on the embodiment and is cheap to collect, but the associated pre-processing is cumbersome: a simulation environment resembling the real scene must be built, and sim2real and real2sim issues must be handled.

4) Internet data: …
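To make the teleoperation option concrete, here is a minimal, hypothetical sketch of a recording loop that logs synchronized camera frames, proprioception, and operator commands at a fixed rate. The `robot`, `camera`, and `teleop_device` interfaces are placeholders for illustration, not any specific SDK.

```python
import time

def record_teleop_episode(robot, camera, teleop_device, hz=30):
    """Sketch of a teleoperated data-collection loop (hypothetical interfaces).

    Logs one synchronized step per control tick until the operator signals
    the end of the episode; the returned list is the raw episode buffer.
    """
    episode = []
    period = 1.0 / hz
    while not teleop_device.done():
        t0 = time.time()
        step = {
            "timestamp": t0,
            "image": camera.read(),              # RGB frame
            "proprio": robot.joint_positions(),  # joint angles / gripper state
            "action": teleop_device.command(),   # operator's commanded action
        }
        robot.apply(step["action"])              # execute while recording
        episode.append(step)
        time.sleep(max(0.0, period - (time.time() - t0)))
    return episode
```

Because the operator commands and robot states are already aligned per tick, post-processing can often be limited to trimming idle segments, which is one reason teleoperated data tends to be the highest quality.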
ForceVLA: Enhancing VLA Models for Contact-Rich Manipulation with a Force-Aware MoE
具身智能之心· 2025-06-18 10:41
Research Background and Problem Statement
- The article discusses the limitations of existing Vision-Language-Action (VLA) models in robot manipulation, particularly in tasks requiring fine control of force under visual occlusion or dynamic uncertainty. These models rely heavily on visual and language cues while neglecting force sensing, which is crucial for precise physical interaction [4]

Core Innovations
- ForceVLA Framework: a novel end-to-end manipulation framework that incorporates external force sensing as a primary modality within the VLA system. It introduces the Force-aware Mixture of Experts (FVLMoE) module, which dynamically integrates pre-trained visual-language embeddings with real-time 6-axis force feedback during action decoding, enabling robots to adapt to subtle contact dynamics [6][8]
- FVLMoE Module: allows context-aware routing across modality-specific experts, enhancing the physical interaction capabilities of VLA systems by dynamically processing and integrating force, visual, and language features [7][8]
- ForceVLA-Data Dataset: a new dataset created to support the training and evaluation of force-aware manipulation policies, containing synchronized visual, proprioceptive, and force-torque signals across five contact-rich tasks. The dataset will be open-sourced to promote community research [9]

Methodology Details
- Overall Architecture: built on the π₀ framework, ForceVLA integrates visual, language, proprioceptive, and 6-axis force feedback to generate actions through a conditional flow matching model. Visual inputs and task instructions are encoded into context embeddings, which are combined with proprioceptive and force cues to predict action trajectories [11]
- FVLMoE Module Design: the module processes force features as an independent input after visual-language processing, using a sparse mixture-of-experts layer to dynamically select the most suitable expert for each token, enhancing the integration of multimodal features [12][14] (see the illustrative sketch after this summary)

Experimental Results
- Performance Evaluation: evaluated on five contact-rich tasks with task success rate as the primary metric, ForceVLA achieved an average success rate of 60.5%, significantly outperforming the π₀-base model without force feedback, which reached 37.3% [25]
- Ablation Studies: the adaptive fusion achieved by the FVLMoE module led to an 80% success rate, validating the importance of integrating force after visual-language encoding [23][26]
- Multi-task Evaluation: ForceVLA exhibited strong multi-task capability, achieving an average success rate of 67.5% across tasks and a 100% success rate on the plug insertion task, showcasing its ability to leverage multimodal cues in shared policies [27]
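The paper's implementation is not reproduced in this summary. The following PyTorch sketch only illustrates, under assumed shapes and module names, how a force-aware mixture-of-experts layer could fuse vision-language tokens with a projected 6-axis force reading; it uses a dense (soft) gate for brevity, whereas the paper describes sparse per-token expert selection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ForceAwareMoE(nn.Module):
    """Illustrative force-aware MoE fusion layer (not the authors' code).

    Projects a 6-axis force/torque reading into the token space of the
    vision-language embeddings, then mixes every token through a small
    set of experts weighted by a learned gate.
    """

    def __init__(self, d_model=512, n_experts=4, force_dim=6):
        super().__init__()
        self.force_proj = nn.Linear(force_dim, d_model)   # force -> one extra token
        self.gate = nn.Linear(d_model, n_experts)         # per-token expert logits
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, vl_tokens, force):
        # vl_tokens: (B, T, d_model) fused vision-language embeddings
        # force:     (B, 6) latest force/torque reading
        force_token = self.force_proj(force).unsqueeze(1)      # (B, 1, d)
        tokens = torch.cat([vl_tokens, force_token], dim=1)    # (B, T+1, d)

        gate_probs = F.softmax(self.gate(tokens), dim=-1)      # (B, T+1, E)
        expert_out = torch.stack([e(tokens) for e in self.experts], dim=-2)
        # Dense (soft) mixture over experts; the paper describes sparse routing.
        fused = (gate_probs.unsqueeze(-1) * expert_out).sum(dim=-2)
        return tokens + fused                                   # residual connection


# Tiny smoke test with made-up shapes.
if __name__ == "__main__":
    moe = ForceAwareMoE()
    out = moe(torch.randn(2, 32, 512), torch.randn(2, 6))
    print(out.shape)  # torch.Size([2, 33, 512])
```

In a full VLA stack, the fused tokens returned here would condition the flow-matching action decoder; this sketch stops at the fusion step.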
Still Not Sure What Direction to Publish In? Others Have Already Submitted to CCF-A Venues...
具身智能之心· 2025-06-18 03:03
Group 1
- The core viewpoint of the article is the launch of a mentoring program for students aiming to publish papers at top conferences such as CVPR and ICRA, building on last year's successful outcomes [1]
- The mentoring directions include multimodal large models, VLA, robot navigation, robot grasping, embodied generalization, embodied synthetic data, end-to-end embodied intelligence, and 3DGS [2]
- The mentors have published papers at top conferences such as CVPR, ICCV, ECCV, ICLR, RSS, ICML, and ICRA, indicating their rich supervision experience [3]

Group 2
- Students are required to submit a resume and must come from a domestic top-100 university or an international university ranked within the QS top 200 [4][5]
From Yangge Dancing to Half Marathons: How Far Are Robots from Their "iPhone Moment"?
具身智能之心· 2025-06-17 12:53
Author: 机器之心 | Editor: 机器之心

From yangge dancing and handkerchief spinning on the Spring Festival Gala stage to steadily finishing a half marathon: over the past six months, a string of eye-catching performances has pulled public perception of robots from imagination into reality.

But as leading figures from the AI, automotive, and internet industries pile into embodied intelligence, none of them can avoid a few hard questions: What technical bottlenecks remain? How exactly should it be deployed? Which scenarios should it land in first? Which real user needs should it address? What mass-production cost can it achieve?

Before the industry's "iPhone moment" truly arrives, no one can give precise answers to all of these questions.

Turning innovative technology into commercially valuable products certainly requires long-term exploration. How to shorten that exploration cycle and reduce its cost as much as possible is what players on the embodied-intelligence track now care about most.

Since last year, computing and development platforms for embodied robots have become a new arena that platform companies at home and abroad are racing to enter. NVIDIA launched Jetson Thor, with Qualcomm and Intel following close behind. In China, D-Robotics (地瓜机器人), spun out of Horizon Robotics, this month officially released the RDK S100 integrated compute-and-control developer kit it unveiled last year. Every company has the same goal: to win over every robot developer and manufacturer. Among the many …
We Are Planning a 10,000-Member Embodied Intelligence Community!
具身智能之心· 2025-06-17 12:53
Over the past few days we finished discussing our next phase of work with the team: what kind of embodied-intelligence community do we actually want to build? One answer fits our thinking well: a place that brings the industry's people together, responds quickly when problems arise, and influences the field as a whole.

Our goal is to build an embodied community of 10,000 members within three years, and we warmly welcome outstanding people to join us (members already include a Huawei "Genius Youth" recruit and several leading researchers at the frontier of embodied intelligence). We have built complete bridges and pipelines with multiple embodied-intelligence companies spanning academia, products, and recruiting, and our internal teaching-and-research side has largely formed a closed loop (courses + hardware + Q&A). The community also carries many of the latest industry viewpoints and technical write-ups. What do current robot embodiments look like, and where do they fall short? How can the success rate and efficiency of data collection be improved? What makes sim2real actually work? These are the questions we keep tracking.

We have long been thinking about how to help newcomers consolidate the tech stack quickly, so the community provides a set of supporting materials and a complete beginner roadmap. For those already doing related research, we also offer many valuable industry frameworks and project plans. There is a steady stream of job and position sharing as well; you are welcome to build a complete embodied ecosystem with us.

具身智能之心知识星球: the community was created to provide a platform for technical exchange on embodied intelligence, covering both academic and engineering problems. Its members come from well-known university labs at home and abroad, embodied-related …
Toward General Embodied Intelligence: A Survey and Development Roadmap for Embodied AI
具身智能之心· 2025-06-17 12:53
Core Insights
- The article discusses the development of Embodied Artificial General Intelligence (AGI), defining it as an AI system capable of completing diverse, open-ended real-world tasks with human-level proficiency, emphasizing human interaction and task execution abilities [3][6]

Development Roadmap
- A five-level roadmap (L1 to L5) is proposed to measure and guide the development of embodied AGI, based on four core dimensions: modalities, humanoid cognitive abilities, real-time responsiveness, and generalization capability [4][6]

Current State and Challenges
- Current embodied AI capabilities sit between levels L1 and L2, facing challenges across all four dimensions: modalities, humanoid cognition, real-time response, and generalization capability [6][7]
- Existing embodied AI models primarily support visual and language inputs, with outputs limited to the action space [8]

Core Capabilities for Advanced Levels
- Four core capabilities are defined for achieving the higher levels of embodied AGI (L3-L5):
  - Full modal capability: the ability to process multi-modal inputs beyond visual and textual [18]
  - Humanoid cognitive behavior: including self-awareness, social understanding, procedural memory, and memory reorganization [19]
  - Real-time interaction: current models struggle with real-time responses due to parameter-scale limitations [19]
  - Open task generalization: current models have not internalized physical laws, which is essential for cross-task reasoning [20]

Proposed Framework for L3+ Robots
- A framework for L3+ robots is suggested, focusing on multi-modal streaming processing and dynamic response to environmental changes [20] (see the sketch after this summary)
- The design principles include a multi-modal encoder-decoder structure and a training paradigm that promotes deep cross-modal alignment [20]

Future Challenges
- The development of embodied AGI will face not only technical barriers but also ethical, safety, and social-impact challenges, particularly in human-machine collaboration [20]
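As a purely illustrative reading of the "multi-modal streaming processing" principle (not the survey's proposed framework), the skeleton below sketches an agent that pushes whichever modality arrives each tick into a rolling latent buffer and decodes actions at a fixed control rate. All encoders, dimensions, and interfaces are assumptions chosen for the sketch.

```python
import torch
import torch.nn as nn

class StreamingMultimodalAgent(nn.Module):
    """Illustrative skeleton of a streaming multi-modal encoder-decoder.

    Per-modality encoders write into a shared rolling latent buffer as
    observations stream in; an action head reads the buffer at the
    control rate. Feature sizes below are placeholders.
    """

    def __init__(self, d_model=256, action_dim=7, buffer_len=64):
        super().__init__()
        self.vision_enc = nn.Linear(1024, d_model)    # stand-in for a ViT feature
        self.audio_enc = nn.Linear(128, d_model)      # stand-in for an audio feature
        self.text_enc = nn.Embedding(32000, d_model)  # stand-in for a text tokenizer
        self.fuser = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.action_head = nn.Linear(d_model, action_dim)
        self.buffer_len = buffer_len
        self.register_buffer("latent_buffer", torch.zeros(1, buffer_len, d_model))

    @torch.no_grad()
    def observe(self, vision=None, audio=None, text_ids=None):
        """Push whichever modality arrived this tick into the rolling buffer."""
        new_tokens = []
        if vision is not None:
            new_tokens.append(self.vision_enc(vision).unsqueeze(1))
        if audio is not None:
            new_tokens.append(self.audio_enc(audio).unsqueeze(1))
        if text_ids is not None:
            new_tokens.append(self.text_enc(text_ids))
        if new_tokens:
            tokens = torch.cat(new_tokens, dim=1)
            self.latent_buffer = torch.cat(
                [self.latent_buffer, tokens], dim=1)[:, -self.buffer_len:]

    @torch.no_grad()
    def act(self):
        """Decode an action from the current buffer at the control rate."""
        fused = self.fuser(self.latent_buffer)
        return self.action_head(fused[:, -1])  # action from the latest state
```

Decoupling `observe` from `act` is one simple way to let perception run asynchronously from the control loop, which is the gap the survey's real-time interaction dimension points at.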