具身智能之心
China's Largest Embodied Intelligence Community Is Enrolling for the New Semester!
具身智能之心· 2025-09-02 00:03
Inside the community we share a large amount of content on VLA, reinforcement learning, world models, VLN, data collection, teleoperation, simulation, and more, covering all of today's mainstream methodologies and organized into technical roadmaps, well suited to both beginners and those looking to advance.

Our members come mainly from leading embodied-intelligence companies, the embodied labs of internet firms, top university labs, and traditional robotics companies, forming a complementary mix of industry and academia. If you genuinely want systematic improvement and exchange with more peers in the field, you are welcome to join. Big back-to-school discount: scan the QR code on WeChat to join.

The 具身智能之心知识星球 (Embodied Intelligence Knowledge Planet) is the embodied community we have long maintained. It combines videos, articles, learning roadmaps, Q&A, and job-hunting exchange into one comprehensive embodied community of nearly 2,000 members, and we aim to grow to nearly 10,000 within the next two years: a hub for exchange and technical sharing that many beginners and advancing learners visit regularly.

The community regularly answers practical questions for members: How do I use a given device? How do I collect data effectively? How do I deploy VLA models? Is the capture background too complex, or is the data simply dirty? Quick answers make it easy to apply solutions to your own projects. A community that can solve problems exactly when people need help most is undoubtedly valuable. 具身智能之心知识星球 (China's first full-stack embodied technology community) has already built out industry, academic, job-hunting, and Q&A exchange, among other ...
Shanghai Jiao Tong University: A Comprehensive Survey of Sensing Intelligence, Social Intelligence, and Motion Intelligence in Embodied Navigation
具身智能之心· 2025-09-02 00:03
Core Insights
- The article presents the TOFRA framework, which decomposes the embodied navigation process into five key stages: Transition, Observation, Fusion, Reward-policy construction, and Action execution, providing a structured analysis for embodied navigation research [2][14]
- It systematically integrates research findings from computer vision, classical robotics, and bionics in the context of embodied navigation, highlighting the complementary nature of these fields in sensing intelligence, social intelligence, and motion intelligence [2][3]
- The article identifies four core challenges in the field of embodied navigation: adaptive spatiotemporal scale, joint optimization, system integrity, and data task generalization, guiding future research directions [2][3]

Group 1: Research Background
- Embodied Artificial Intelligence (EAI) emphasizes self-perception and interaction with humans or the environment as a pathway to Artificial General Intelligence (AGI) [2]
- The core feature of embodied navigation is its egocentric perception and distributed computing capabilities, contrasting with traditional navigation methods that rely on predefined maps or external localization [2][3]

Group 2: Intelligence Types
- Sensing Intelligence: Achieved through multimodal self-centered perception, allowing for spatial cognition without complete reliance on pre-built global maps [3][4]
- Social Intelligence: Enables understanding of high-level semantic instructions from humans, supporting complex task execution beyond predefined waypoints [10][11]
- Motion Intelligence: Involves the ability to perform flexible and adaptive physical interactions in complex environments, not limited to fixed paths [10][11]

Group 3: TOFRA Framework
- Transition (T): Involves predicting the next state using internal sensors and various methods, including dynamics modeling and end-to-end neural networks [14][20]
- Observation (O): Focuses on how robots perceive the environment through external sensors, forming an understanding of the external world [27][28]
- Fusion (F): Combines internal state predictions with external perceptions to achieve optimal state estimation using classical Bayesian methods and neural networks [45][48] (a minimal fusion sketch follows this summary)

Group 4: Action Execution
- Action execution involves the robot utilizing motion skills to complete the action sequences generated by the policy, including basic skills and complex skill combinations [60][61]
- The article discusses the evolution of action execution from basic motion skills to complex combinations and morphological cooperation, highlighting the advancements in motion intelligence [60][68]

Group 5: Application Scenarios
- The TOFRA framework is applied to three typical navigation scenarios: embodied autonomous driving, indoor navigation, and complex terrain navigation, detailing how to integrate the framework's stages for efficient navigation systems [74][75][76]
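The Fusion stage above is, in its classical form, a Bayesian state estimator. Below is a minimal one-dimensional Kalman-filter sketch in Python showing how an internal motion prediction (Transition) is fused with an external sensor reading (Observation); the noise values and variable names are illustrative assumptions, not taken from the survey.

```python
# Minimal 1-D Kalman fusion sketch: combine an internal motion prediction
# with an external sensor reading. All noise values are illustrative.

def predict(x, p, u, q):
    """Propagate state x by control/odometry u; process noise q inflates p."""
    return x + u, p + q

def update(x, p, z, r):
    """Fuse the prediction (x, p) with a measurement z of variance r."""
    k = p / (p + r)            # Kalman gain: how much to trust the sensor
    x_new = x + k * (z - x)    # correct the prediction toward the measurement
    p_new = (1 - k) * p        # fused estimate is more certain than either input
    return x_new, p_new

# The robot believes it advanced 1.0 m (noisy odometry, q = 0.04);
# an external range sensor reads 1.08 m with variance r = 0.01.
x, p = predict(0.0, 0.0, u=1.0, q=0.04)
x, p = update(x, p, z=1.08, r=0.01)
print(f"fused position: {x:.3f} m, variance: {p:.4f}")  # 1.064 m, 0.0080
```

The same predict/update pattern generalizes to the multi-dimensional, neural-augmented fusion the survey covers; only the models for the two steps change.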
具身智能之心 Partner Recruitment: Multiple Openings in Embodied Data Collection, Algorithms, Simulation, and Hardware
具身智能之心· 2025-09-01 10:00
Course Instructor Recruitment
Recruitment of course instructors for 具身智能之心 has begun! If your field is large models / multimodal large models, Diffusion, VLA, VLA+RL, sim2real, end-to-end learning, embodied interaction, vision-language navigation, reinforcement learning, robot motion planning, robot frameworks, grasp point prediction and pose estimation, navigation and mapping, tactile perception, large-model deployment and quantization-aware inference, robot simulation, or a related direction, we welcome you to join us.
Main responsibilities: developing embodied-intelligence video courses and answering questions in the study groups.
Compensation is generous (add us on WeChat, see below, for details); beyond cash incentives, we share industry-wide embodied-intelligence resources and job openings.

Research Mentors
Compensation is above industry level: publish papers and earn extra income at the same time!

Robot Hardware Development Partners
If you are working on robotic-arm grasping systems, bipedal robots, quadruped robots, wheeled robots, large-model deployment, or other software/hardware development, and want to advance embodied-intelligence education with us, please get in touch. We will offer partner status so we can build a larger embodied-education platform together and push the industry forward.

Contact Us
Recruitment of research mentors in embodied intelligence has begun! If your field is diffusion policy, VLA, VLA + reinforcement learning, sim2real, reinforcement learning, embodied simulation, embodied perception, embodied interaction, vision-language navigation, object navigation, tactile perception, large models / multimodal large models, large-model quantization, robotic-arm grasping, pose estimation, large-model deployment ...
Latest from Mu Yao's Team: Discrete Diffusion VLA Brings Discrete Diffusion to VLA, Supporting Precise Action Modeling and Consistent Training
具身智能之心· 2025-09-01 10:00
For more in-depth content, join China's first full-stack embodied-intelligence learning community, 具身智能之心知识星球 (click here), which has everything you need.

When a robot sees the instruction "put the spoon on the towel," how does it precisely locate the objects in the image, parse the language, and generate coherent actions? Vision-Language-Action (VLA) models are the core technology for this problem, but current approaches face a dilemma: autoregressive models generate actions token by token, like reciting a text, which is slow and cannot revise mistakes; continuous diffusion models can handle complex actions but require modules bolted on outside the main model, making them hard to train and poorly compatible.

Discrete Diffusion VLA breaks this deadlock. It introduces discrete diffusion into VLA action decoding for the first time, unifying the vision, language, and action modalities in a single Transformer: no extra diffusion module needs to be trained, actions are decoded in parallel like assembling a puzzle, and errors can be corrected with an "easy pieces first, hard pieces later" strategy.

On LIBERO tasks with a Franka Panda arm, its success rate reaches 96.3%; the Google robot achieves a 71.2% visual matching rate under scene changes; and the WidowX robot scores a 49.3% overall success rate in real-to-sim transfer, comprehensively ...
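To make the "parallel decoding with easy-first refinement" idea concrete, here is a minimal Python sketch of confidence-ordered iterative unmasking, the decoding pattern discrete diffusion models typically use. The toy model, vocabulary size, and commit schedule are hypothetical stand-ins, not the paper's actual architecture.

```python
import torch

MASK = -1  # sentinel for a not-yet-decoded action token

def decode_actions(model, context, num_tokens=8, num_rounds=4):
    """Confidence-ordered parallel decoding: each round, predict all masked
    positions at once, then commit only the most confident ones ("easy pieces
    first"); harder positions stay masked and are re-predicted next round."""
    tokens = torch.full((num_tokens,), MASK, dtype=torch.long)
    for r in range(num_rounds):
        logits = model(context, tokens)              # (num_tokens, vocab)
        conf, pred = logits.softmax(-1).max(-1)      # per-position confidence
        masked = tokens == MASK
        # commit a growing share of masked slots; fill everything last round
        k = int(masked.sum()) if r == num_rounds - 1 else \
            max(1, int(masked.sum()) * (r + 1) // num_rounds)
        conf = conf.masked_fill(~masked, -1.0)       # only fill masked slots
        commit = conf.topk(k).indices
        tokens[commit] = pred[commit]
    return tokens

def toy_model(context, tokens, vocab=16):
    # stand-in for the unified VLA Transformer: random logits
    return torch.randn(tokens.shape[0], vocab)

print(decode_actions(toy_model, context=None))
```

In full discrete diffusion, committed tokens can also be re-masked and revised in later rounds, which is what enables the error correction the article highlights.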
RLinf Open-Sourced: The First Large-Scale Reinforcement Learning Framework Unifying Rendering, Training, and Inference for Embodied Intelligence
具身智能之心· 2025-09-01 04:02
Lead: Tsinghua University, Beijing Zhongguancun Academy, and 无问芯穹 (Infinigence AI), together with Peking University, Berkeley, and other institutions, have open-sourced RLinf, the first large-scale reinforcement learning framework for embodied intelligence that unifies rendering, training, and inference.

Code: https://github.com/RLinf/RLinf
Hugging Face: https://huggingface.co/RLinf
Documentation: https://rlinf.readthedocs.io/en/latest/

Artificial intelligence is making the leap from "perception" to "action," and embodied intelligence built on large models is widely regarded as the next stage of AI, a shared focus of academia and industry. In the large-model field, since the release of the o1/R1 family of reasoning models, the center of gravity of training has shifted from data-driven pre-training and post-training to reward-driven reinforcement learning (RL). OpenAI predicts that the compute required for RL may even exceed that of pre-training. At the same time, RL infrastructure that can exploit large-scale compute efficiently has grown ever more important, and a number of excellent frameworks have recently emerged, greatly advancing the field.

Existing frameworks, however, still offer limited support for embodied intelligence. Compared with reasoning models, which are purely "brain" models, ...
New Survey: A Review of Multimodal Fusion and VLM Methods for Embodied Robotics
具身智能之心· 2025-09-01 04:02
Core Insights
- The article discusses the transformative impact of Multimodal Fusion and Vision-Language Models (VLMs) on robot vision, enabling robots to evolve from simple mechanical executors to intelligent partners capable of understanding and interacting with complex environments [3][4][5].

Multimodal Fusion in Robot Vision
- Multimodal fusion integrates various data types such as RGB images, depth information, LiDAR point clouds, language, and tactile data, significantly enhancing robots' perception and understanding of their surroundings [3][4][9].
- The main fusion strategies have evolved from early explicit concatenation to implicit collaboration within unified architectures, improving feature extraction and task prediction [10][11] (an early-fusion sketch follows this summary).

Applications of Multimodal Fusion
- Semantic scene understanding is crucial for robots to recognize objects and their relationships; multimodal fusion greatly improves accuracy and robustness in complex environments [9][10].
- 3D object detection is vital for autonomous systems, combining data from cameras, LiDAR, and radar to enhance environmental understanding [16][19].
- Embodied navigation allows robots to explore and act in real environments, focusing on goal-oriented, instruction-following, and dialogue-based navigation methods [24][26][27][28].

Vision-Language Models (VLMs)
- VLMs have advanced significantly, enabling robots to understand spatial layouts, object properties, and semantic information while executing tasks [46][47].
- The evolution of VLMs has shifted from basic models to more sophisticated systems capable of multimodal understanding and interaction, enhancing their applicability in various tasks [53][54].

Future Directions
- The article identifies key challenges in deploying VLMs on robotic platforms, including sensor heterogeneity, semantic discrepancies, and the need for real-time performance optimization [58].
- Future research may focus on structured spatial modeling, improving system interpretability, and developing cognitive VLM architectures for long-term learning capabilities [58][59].
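As a concrete illustration of the "explicit concatenation" end of that spectrum, here is a minimal PyTorch sketch of early fusion, where RGB and depth are stacked channel-wise before a shared encoder; the architecture and dimensions are illustrative assumptions, not taken from the survey.

```python
import torch
import torch.nn as nn

# Early ("explicit concatenation") fusion: RGB and depth are stacked
# channel-wise at the input, and a single encoder sees both modalities.
# Architecture and dimensions are illustrative.

class EarlyFusionNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(4, 32, 3, stride=2, padding=1),  # 4 = 3 RGB + 1 depth
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.head = nn.Linear(32, num_classes)

    def forward(self, rgb, depth):
        x = torch.cat([rgb, depth], dim=1)  # fuse before any processing
        return self.head(self.enc(x))

logits = EarlyFusionNet()(torch.randn(2, 3, 64, 64), torch.randn(2, 1, 64, 64))
print(logits.shape)  # torch.Size([2, 10])
```

The "implicit collaboration" strategies the survey describes replace this fixed concatenation with learned cross-modal interactions inside a unified architecture.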
Time's 2025 AI 100 List Is Out: Liang Wenfeng, Wang Xingxing, and Others Make the Cut as Chinese Influence Surges
具身智能之心· 2025-09-01 04:02
Core Viewpoint
- The article highlights the influential figures in the AI field recognized by Time magazine in its 2025 list, emphasizing the growing representation of Chinese individuals and their contributions to AI technology [2][5].

Group 1: Leaders
- Ren Zhengfei, founder of Huawei, has driven long-term investments in AI, launching the Ascend series of AI chips and the MindSpore deep learning framework, establishing a competitive edge in the AI ecosystem [8].
- Liang Wenfeng, CEO of DeepSeek, has led the company to prominence in AI technology, releasing the R1 model that competes with OpenAI's latest offerings and showcasing China's capabilities in AI with minimal computational resources [11].
- Jensen Huang (Huang Renxun), co-founder and CEO of NVIDIA, transformed the company into a leading AI computing firm, with its CUDA platform and high-performance GPUs essential for advances in deep learning [14].
- C.C. Wei (Wei Zhejia), chairman and CEO of TSMC, has positioned the company as a key player in AI chip manufacturing, ensuring the production of powerful AI processors through strategic decisions [17].

Group 2: Innovators
- James Peng (Peng Jun), CEO of Pony.ai, has been pivotal in the commercialization of autonomous driving, achieving large-scale Robotaxi operations in major Chinese cities by 2025 [25].
- Edwin Chen, founder and CEO of Surge AI, has built a successful data-labeling company, generating over $1 billion in revenue by 2024, with a valuation exceeding $25 billion during fundraising [28].

Group 3: Shapers
- Fei-Fei Li (Li Feifei), Stanford professor and CEO of World Labs, is a key figure in human-centered AI research, having created the ImageNet project, which revolutionized computer vision [31][32].
- Xue Lan, Tsinghua University professor, has contributed significantly to AI governance and public policy, influencing the development of ethical standards and regulations in AI [35][36].

Group 4: Other AI Figures
- Elon Musk, founder of xAI, has been influential in developing autonomous driving technologies and brain-machine interfaces [40].
- Sam Altman, CEO of OpenAI, has led the company in releasing groundbreaking AI products, significantly advancing generative AI technology [42].
- Andy Jassy, president and CEO of Amazon, has laid the groundwork for AI advances through AWS and is actively promoting generative-AI innovations [51].
Andrew Ng's Latest Letter: It's Time to Pay Attention to Parallel Agents
具身智能之心· 2025-09-01 04:02
Editor: 量子位 (QbitAI)

Many hands make light work, and the same goes for agents. In his latest Andrew's Letters installment, Andrew Ng argues that parallel agents are becoming a new direction for improving AI capability.

In the letter he sketches scenarios like these:
- Multiple agents fetch and analyze web pages in parallel to produce deep research reports faster.
- Multiple agents work on different parts of a codebase simultaneously, speeding up programming tasks.
- Multiple agents work in the background in parallel while a supervising agent gives feedback to the user, enabling parallel asynchronous control.

In these scenarios, multiple agents collaborate like an efficient team handling different tasks at once: fast and efficient. The steadily falling token cost of large language models also makes running many agents in parallel economically feasible.

But as one commenter asked: how do you coordinate multiple agents?

This offers a new perspective on improving AI capability: not just more data and compute, but many agents working in parallel and in concert.

Parallel agents are the future
In the past, when we ...
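As a minimal illustration of the pattern described in the scenarios above, here is a Python asyncio sketch in which several worker "agents" run concurrently while a supervisor collects their results; the agent functions are hypothetical stand-ins, not from Ng's letter.

```python
import asyncio

# Minimal parallel-agent sketch: worker "agents" run concurrently and a
# supervisor gathers their results. The workers here are stand-ins for
# LLM calls (e.g., fetching and summarizing one web page each).

async def worker_agent(name: str, task: str) -> str:
    await asyncio.sleep(0.1)          # placeholder for a real LLM/API call
    return f"{name}: finished '{task}'"

async def supervisor(tasks: list[str]) -> list[str]:
    # launch all workers at once; total latency ~ one task, not the sum
    results = await asyncio.gather(
        *(worker_agent(f"agent-{i}", t) for i, t in enumerate(tasks))
    )
    for r in results:                 # supervisor reports back to the user
        print(r)
    return results

asyncio.run(supervisor(["analyze page A", "analyze page B", "analyze page C"]))
```

The coordination question the commenter raises lives in the supervisor: deciding how to split work, merge results, and handle stragglers is where most of the design effort goes.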
Course Countdown: Master Embodied "Brain + Cerebellum" Algorithms in 3 Months
具身智能之心· 2025-08-31 02:33
Core Insights
- The exploration toward Artificial General Intelligence (AGI) highlights embodied intelligence as a key direction, focusing on how intelligent agents interact with and adapt to physical environments [1]
- The development of embodied intelligence is marked by the evolution of technology from low-level perception to high-level task understanding and generalization [6][9]

Industry Analysis
- In the past two years, numerous star teams have emerged in embodied intelligence, establishing valuable companies such as Xinghaitu, Galaxy General, and Zhujidongli and transitioning from laboratories to commercial and industrial applications [3]
- Major domestic companies such as Huawei, JD.com, Tencent, Ant Group, and Xiaomi are actively investing and collaborating to build a comprehensive embodied-intelligence ecosystem, while international players such as Tesla and investment firms focus on foundational models and humanoid robot prototypes [5]

Technological Evolution
- Embodied-intelligence technology has progressed through several stages (a behavior-cloning sketch follows this summary):
  - The first stage focused on grasp pose detection, which struggled with complex tasks due to a lack of context modeling [6]
  - The second stage introduced behavior cloning, allowing robots to learn from expert demonstrations but facing challenges in generalization and in multi-target scenarios [6]
  - The third stage introduced Diffusion Policy methods, enhancing stability and generalization through sequence modeling [7]
  - The fourth stage, emerging in 2025, explores integrating VLA models with reinforcement learning and tactile sensing to overcome current limitations [8]

Product Development and Market Growth
- Advances in embodied intelligence have led to products including humanoid robots, robotic arms, and quadruped robots, serving industries such as manufacturing, home services, and healthcare [9]
- As the industry shifts from research to deployment, demand for engineering and systems capability is increasing, requiring stronger engineering skills [12]

Educational Initiatives
- A comprehensive curriculum has been developed to help learners master the full spectrum of embodied-intelligence algorithms, covering topics from basic tasks to advanced models such as VLA and its integrations [9][12]
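The behavior-cloning stage above reduces to supervised learning on expert demonstrations. Here is a minimal PyTorch sketch of that idea; the dimensions and random stand-in data are illustrative assumptions, not course material.

```python
import torch
import torch.nn as nn

# Behavior cloning in its simplest form: regress expert actions from
# observations with a supervised loss. Dimensions and data are stand-ins.

obs_dim, act_dim = 32, 7                      # e.g., state features -> 7-DoF arm command
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

obs = torch.randn(256, obs_dim)               # stand-in expert observations
expert_actions = torch.randn(256, act_dim)    # stand-in expert actions

for step in range(100):
    loss = nn.functional.mse_loss(policy(obs), expert_actions)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Limitation noted in the text: the clone only imitates the demonstrations,
# so it generalizes poorly to states the expert never visited.
```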
New Survey: A Review of Multimodal Fusion and VLM Methods for Embodied Robotics
具身智能之心· 2025-08-31 02:33
Core Viewpoint
- The article surveys advances in multimodal fusion and vision-language models (VLMs) for robot vision, emphasizing their role in enhancing robots' perception and understanding in complex environments [4][5][56].

Multimodal Fusion in Robot Vision Tasks
- Semantic scene understanding is a critical task for visual systems; multimodal fusion significantly improves accuracy and robustness by integrating additional information such as depth and language [9][11].
- Current mainstream fusion strategies include early fusion, mid-level fusion, and late fusion, evolving from simple concatenation to more sophisticated interactions within a unified architecture [10][12][16] (a late-fusion sketch follows this summary).

Applications of Multimodal Fusion
- In autonomous driving, 3D object detection is crucial for accurately identifying and locating pedestrians, vehicles, and obstacles, with multimodal fusion enhancing environmental understanding [15][18].
- Designing multimodal fusion means deciding when to fuse, what to fuse, and how to fuse, with the chosen strategy affecting both performance and computational efficiency [16][17].

Embodied Navigation
- Embodied navigation allows robots to explore and act in real environments, with an emphasis on autonomous decision-making and dynamic adaptation [23][25][26].
- Three representative method families are goal-directed navigation, instruction-following navigation, and dialogue-based navigation, showing the evolution from perception-driven to interactive understanding [25][26][27].

Visual Localization and SLAM
- Visual localization determines a robot's position, which is challenging in dynamic environments; recent methods leverage multimodal fusion to improve performance [28][30].
- SLAM (Simultaneous Localization and Mapping) has evolved from geometry-driven to semantics-driven approaches, integrating data from multiple sensors for greater adaptability [30][34].

Vision-Language Models (VLMs)
- VLMs have progressed significantly in semantic understanding, 3D object detection, embodied navigation, and robot manipulation, with a variety of fusion methods being explored [56][57].
- Key innovations include large-scale pre-training, instruction fine-tuning, and structural optimization, strengthening cross-modal reasoning and task execution [52][53][54].

Future Directions
- Future research should focus on structured spatial modeling, improving system interpretability and ethical adaptability, and developing cognitive VLM architectures with long-term learning capabilities [57][58].
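To contrast with the early-fusion sketch shown earlier in this digest, here is a minimal late-fusion (decision-level) variant, where each modality gets its own network and only the output predictions are combined; again, the architecture is an illustrative assumption rather than a method from the survey.

```python
import torch
import torch.nn as nn

# Late (decision-level) fusion: each modality is processed by its own
# network end-to-end, and only the per-modality predictions are combined.
# Contrast with early fusion, which concatenates raw inputs up front.

def make_branch(in_ch: int, num_classes: int) -> nn.Module:
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, num_classes),
    )

class LateFusionNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.rgb_branch = make_branch(3, num_classes)
        self.depth_branch = make_branch(1, num_classes)

    def forward(self, rgb, depth):
        # average the two decision vectors; a learned weighting or gating
        # module is a common refinement
        return 0.5 * (self.rgb_branch(rgb) + self.depth_branch(depth))

logits = LateFusionNet()(torch.randn(2, 3, 64, 64), torch.randn(2, 1, 64, 64))
print(logits.shape)  # torch.Size([2, 10])
```

Mid-level fusion, the third strategy the survey names, sits between the two: the branches exchange intermediate features rather than raw inputs or final decisions.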