具身智能之心
Large models: which research directions are still worth publishing on?
具身智能之心· 2025-07-05 02:25
Core Insights
- The article emphasizes the rapid development of large language models (LLMs) and multimodal models, focusing on enhancing model efficiency, expanding knowledge capabilities, and improving reasoning performance as key research areas in artificial intelligence [1][2]
Course Objectives
- The course aims to systematically explore cutting-edge optimization methods for large models, addressing challenges in parameter-efficient computation, dynamic knowledge expansion, and complex reasoning [1][2]
Enrollment Details
- The course will accept 6 to 8 participants per session [3]
Target Audience
- The course is designed for master's and doctoral students in the field of large models, individuals seeking to enhance their resumes for graduate studies abroad, and professionals in artificial intelligence looking to deepen their understanding of algorithm theory and research skills [4]
Course Outcomes
- Participants will gain insights into classic and cutting-edge papers, coding implementations, and methods for writing and submitting research papers, thereby developing a clearer understanding of the subject matter [3][4]
Enrollment Requirements
- Basic requirements include familiarity with deep learning/machine learning, basic knowledge of large model algorithms, proficiency in Python, and experience with PyTorch [5]
Course Structure
- The course spans 12 weeks of online group research, followed by 2 weeks of paper guidance, and includes a maintenance period of 10 weeks for paper development [10]
Learning Requirements
- Participants are expected to engage actively in discussions, complete assignments on time, and maintain academic integrity throughout the course [12]
Course Outline
- The curriculum covers various topics, including model pruning, quantization, dynamic knowledge expansion, and advanced reasoning paradigms, with a focus on practical applications and coding (a minimal pruning sketch follows this summary) [16][18]
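To make one curriculum topic concrete, here is a minimal sketch of unstructured magnitude pruning in PyTorch. The toy model, sparsity level, and layer choice are illustrative assumptions, not course material.

```python
import torch
import torch.nn as nn

def magnitude_prune(model: nn.Module, sparsity: float = 0.5) -> None:
    """Zero out the smallest-magnitude weights in every Linear layer (unstructured pruning)."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            weights = module.weight.data
            k = int(weights.numel() * sparsity)        # number of weights to prune
            if k == 0:
                continue
            threshold = weights.abs().flatten().kthvalue(k).values
            mask = weights.abs() > threshold           # keep only large-magnitude weights
            module.weight.data *= mask

# Usage on a toy two-layer MLP (a stand-in for a transformer block's MLP)
model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
magnitude_prune(model, sparsity=0.5)
```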
What is really at the core of image goal navigation?
具身智能之心· 2025-07-04 12:07
Research Background and Core Issues
- Image goal navigation requires two key capabilities: core navigation skills and computing direction information by comparing the current visual observation with the target image [2]
- The research asks whether this task can be efficiently solved through end-to-end reinforcement learning (RL) training of complete agents [2]
Core Research Content and Methods
- The study explores various architectural designs and their impact on task performance, emphasizing implicit correspondence computation between images [3][4]
- Key architectures discussed include Late Fusion, ChannelCat, SpaceToDepth + ChannelCat, and Cross-attention (a sketch contrasting late and early fusion follows this summary) [4]
Main Findings
- Early patch-level fusion methods (such as ChannelCat and Cross-attention) are more important than late fusion (Late Fusion) for supporting implicit correspondence computation [8]
- The performance of different architectures varies significantly across simulator settings, particularly the "Sliding" setting [8][10]
Performance Metrics
- Success rate (SR) and Success weighted by Path Length (SPL) are used to evaluate the models [7]
- For example, with Sliding=True, ChannelCat (ResNet9) reaches an SR of 83.6%, while Late Fusion only reaches 13.8% [8]
Transferability of Abilities
- Some learned capabilities transfer to more realistic environments, especially when the perception module's weights are carried over [10]
- Training with Sliding=True and then fine-tuning with Sliding=False improved SR from 31.7% to 38.5% [10]
Relationship Between Navigation and Relative Pose Estimation
- Navigation performance correlates with relative pose estimation accuracy, indicating the importance of direction information extraction in image goal navigation [12]
Conclusion
- Architectural designs that support early local fusion (such as Cross-attention and ChannelCat) are crucial for implicit correspondence computation [15]
- The simulator's Sliding setting significantly affects performance, but transferring perception module weights helps retain some capabilities in real-world scenarios [15]
- Navigation performance is related to relative pose estimation ability, confirming the central role of direction information extraction in image goal navigation [15]
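To make the fusion distinction concrete, below is a minimal PyTorch sketch of the two styles the summary contrasts: Late Fusion encodes observation and goal separately and merges feature vectors at the end, while ChannelCat stacks the two images along the channel axis before encoding so early layers can compute patch-level correspondences. The network shapes and sizes are illustrative assumptions, not the paper's exact models.

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """Encode observation and goal separately, then fuse the two feature vectors."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
                                     nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                     nn.Linear(64, feat_dim))
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, obs, goal):
        return self.fuse(torch.cat([self.encoder(obs), self.encoder(goal)], dim=-1))

class ChannelCat(nn.Module):
    """Stack observation and goal along the channel axis before any encoding,
    so the CNN can compute patch-level correspondences early."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(6, 32, 8, stride=4), nn.ReLU(),
                                     nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                     nn.Linear(64, feat_dim))

    def forward(self, obs, goal):
        return self.encoder(torch.cat([obs, goal], dim=1))  # (B, 6, H, W)

obs = torch.randn(2, 3, 128, 128)   # current RGB observation
goal = torch.randn(2, 3, 128, 128)  # target image
print(LateFusion()(obs, goal).shape, ChannelCat()(obs, goal).shape)
```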
ArtGS: precise articulated-object manipulation with 3DGS, SOTA performance validated in both simulation and the real world!
具身智能之心· 2025-07-04 09:48
Group 1
- The core challenge addressed is articulated-object manipulation, which involves complex kinematic constraints and the limited physical reasoning capabilities of existing methods [3][4]
- The proposed ArtGS framework integrates 3D Gaussian Splatting (3DGS) with visual-physical modeling to enhance understanding of and interaction with articulated objects, ensuring physically consistent motion constraints [3][4][20]
- ArtGS consists of three key modules: static Gaussian reconstruction, VLM-based skeletal inference, and dynamic 3D Gaussian joint modeling [4]
Group 2
- Static 3D Gaussian reconstruction uses 3D Gaussian splatting to build high-fidelity 3D scenes from multi-view RGB-D images, representing the scene as a collection of 3D Gaussians [5]
- VLM-based skeletal inference employs a fine-tuned vision-language model (VLM) to estimate joint parameters, generating target views to assist visual question answering [6][8]
- Dynamic 3D Gaussian joint modeling implements impedance control for interaction with the environment and optimizes joint parameters through differentiable rendering (a toy joint-parameter optimization sketch follows this summary) [10]
Group 3
- Experimental validation shows that ArtGS significantly outperforms baseline methods in joint parameter estimation, with lower angular error (AE) and origin error (OE) [12]
- In simulation, ArtGS achieves manipulation success rates from 62.4% to 90.3%, substantially higher than methods such as TD3 and Where2Act [14]
- Real-world experiments show a 10/10 success rate for drawer operations and 9/10 for cabinet operations, indicating the effectiveness of the optimized version of ArtGS [14][17]
Group 4
- Ablation studies reveal that even when initial axis estimation errors exceed 20°, ArtGS can still raise operation success rates through 3DGS optimization [19]
- ArtGS exhibits cross-embodiment adaptability, accurately reconstructing various robotic arms and excelling particularly at gripper rendering details [19][20]
- The core contribution of ArtGS lies in turning 3DGS into a visual-physical model of articulated objects, ensuring spatiotemporal consistency of differentiable operation trajectories [20]
Group 5
- Future directions for ArtGS include handling more complex scenarios and improving the modeling and manipulation of multi-joint, highly dynamic objects [21]
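The joint-parameter optimization step can be pictured with a toy example: parameterize a revolute joint by an axis and an angle, transform the movable part's Gaussian centers with it, and fit the angle by gradient descent against observations. The rendering loss is replaced here by a simple point-matching loss and all values are invented; this is a sketch of the idea, not ArtGS's implementation.

```python
import torch

def revolute_transform(points, axis_origin, axis_dir, angle):
    """Rotate 3D points about an arbitrary axis (Rodrigues' formula) — the rigid motion
    a revolute joint such as a cabinet door induces on its Gaussian centers."""
    k = axis_dir / axis_dir.norm()
    p = points - axis_origin
    cos, sin = torch.cos(angle), torch.sin(angle)
    rotated = (p * cos + torch.cross(k.expand_as(p), p, dim=-1) * sin
               + k * (p @ k).unsqueeze(-1) * (1 - cos))
    return rotated + axis_origin

# Toy optimization: recover the joint angle that best explains the observed moved points.
torch.manual_seed(0)
centers = torch.randn(100, 3)                       # Gaussian centers on the movable part
true_origin = torch.tensor([0.5, 0.0, 0.0])
true_axis = torch.tensor([0.0, 0.0, 1.0])
observed = revolute_transform(centers, true_origin, true_axis, torch.tensor(0.6))

angle = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([angle], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    pred = revolute_transform(centers, true_origin, true_axis, angle)
    loss = ((pred - observed) ** 2).mean()           # stands in for a rendering loss
    loss.backward()
    opt.step()
print(f"recovered angle: {angle.item():.3f} rad (true 0.600)")
```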
VLN-R1: HKU's reinforcement-learning-driven embodied navigation method for continuous environments
具身智能之心· 2025-07-04 09:48
Core Viewpoint
- The article presents the VLN-R1 framework, which uses large vision-language models (LVLMs) for continuous navigation in realistic environments, addressing the limitations of previous discrete navigation methods [5][15]
Research Background
- The VLN-R1 framework processes first-person video streams to generate continuous navigation actions, making navigation tasks more realistic [5]
- The VLN-Ego dataset is constructed with the Habitat simulator, providing rich visual and language information for training LVLMs [5][6]
- Vision-and-language navigation (VLN) is emphasized as a core challenge in embodied AI, requiring real-time decision-making from natural language instructions [5]
Methodology
- The VLN-Ego dataset includes natural language navigation instructions, historical frames, and future action sequences, designed to balance local details and overall context [6]
- Training consists of two phases: supervised fine-tuning (SFT) to align action predictions with expert demonstrations, followed by reinforcement fine-tuning (RFT) to optimize model performance (a sketch of the two phases follows this summary) [7][9]
Experimental Results
- In the R2R task, VLN-R1 with the 7B model achieved a success rate (SR) of 30.2%, significantly outperforming traditional models despite using no depth maps or navigation maps [11]
- The model demonstrated strong cross-domain adaptability, outperforming fully supervised models on the RxR task with only 10K samples used for RFT [12]
- Predicting future actions proved crucial for performance, with the best results obtained when predicting six future actions [14]
Conclusion and Future Work
- VLN-R1 integrates LVLMs and reinforcement fine-tuning, achieving state-of-the-art performance in simulated environments and showing that small models can match larger ones [15]
- Future research will validate the model's generalization in real-world settings and explore applications to other embodied AI tasks [15]
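The two training phases can be sketched as follows: SFT uses cross-entropy against expert future-action sequences, and RFT optimizes a reward over sampled sequences. The discrete action set, feature dimension, and the plain REINFORCE objective standing in for the actual RFT algorithm are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Hypothetical discrete action space for continuous-environment VLN
ACTIONS = ["FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP"]
HORIZON = 6  # the summary reports that predicting six future actions works best

class ActionHead(nn.Module):
    """Maps a pooled LVLM feature to a sequence of future actions.
    The feature dimension and head shape are illustrative assumptions."""
    def __init__(self, feat_dim: int = 1024):
        super().__init__()
        self.head = nn.Linear(feat_dim, HORIZON * len(ACTIONS))

    def forward(self, feat):                        # feat: (B, feat_dim)
        return self.head(feat).view(-1, HORIZON, len(ACTIONS))

def sft_loss(logits, expert_actions):
    """Supervised fine-tuning: cross-entropy against expert future-action sequences."""
    return nn.functional.cross_entropy(
        logits.reshape(-1, len(ACTIONS)), expert_actions.reshape(-1))

def rft_loss(logits, rewards):
    """Reinforcement fine-tuning sketch: REINFORCE on sampled action sequences,
    standing in for whatever policy-optimization objective is used in practice."""
    dist = torch.distributions.Categorical(logits=logits)
    sampled = dist.sample()                         # (B, HORIZON)
    log_prob = dist.log_prob(sampled).sum(-1)       # (B,)
    advantage = rewards - rewards.mean()            # simple baseline
    return -(advantage * log_prob).mean()

feat = torch.randn(4, 1024)                         # pooled LVLM feature (placeholder)
expert = torch.randint(0, len(ACTIONS), (4, HORIZON))
head = ActionHead()
print(sft_loss(head(feat), expert).item(), rft_loss(head(feat), torch.rand(4)).item())
```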
What exactly is the difference between traditional navigation and embodied goal navigation?
具身智能之心· 2025-07-04 09:48
Core Viewpoint
- The article discusses the evolution of robot navigation from traditional mapping and localization to large-model-based navigation, which includes vision-and-language navigation (VLN) and goal navigation; VLN focuses on following instructions, while goal navigation emphasizes understanding the environment to find paths independently [1][4]
Group 1: Vision-and-Language Navigation (VLN)
- VLN is fundamentally an instruction-following task that involves understanding language commands, perceiving the environment, and planning movement strategies; a VLN robot system consists of a visual-language encoder, an environmental history representation, and action strategy modules (a skeleton of these modules follows this summary) [2]
- The key challenge in VLN is how to effectively compress information from visual and language inputs; current trends favor large-scale pre-trained vision-language models and LLMs for instruction decomposition and task segmentation [2][3]
- Learning the policy network has shifted from extracting patterns from labeled datasets to distilling effective planning information from LLMs, which has become a recent research focus [3]
Group 2: Goal Navigation
- Goal navigation extends VLN by requiring agents to autonomously explore and plan paths in unfamiliar 3D environments from a target description alone, such as coordinates or images [4]
- Unlike traditional VLN, which relies on explicit instructions, goal-driven navigation systems must move from "understanding commands" to "finding paths" by autonomously parsing semantics, modeling the environment, and making dynamic decisions [6]
Group 3: Commercial Applications and Demand
- Goal-driven navigation has been industrialized in several verticals, such as last-mile delivery, where it is combined with social navigation algorithms to handle dynamic environments and human interaction; examples include Meituan's delivery robots and Starship Technologies' campus delivery robots [8]
- In healthcare, hospitality, and food service, companies such as 嘉楠科技, 云迹科技, and Aethon have deployed service robots for autonomous delivery, improving service response efficiency [8]
- The rise of humanoid robots has increased attention to the adaptability of navigation technology, with companies like Unitree and Tesla showcasing advanced navigation capabilities [9]
Group 4: Knowledge and Learning Challenges
- Both VLN and goal navigation require knowledge across natural language processing, computer vision, reinforcement learning, and graph neural networks, making the learning path challenging for newcomers [10]
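As a rough picture of those three modules, here is a skeleton agent with a visual encoder, a recurrent environmental-history representation, and an action policy head. Every dimension, the embedding used as a stand-in text encoder, and the discrete action set are illustrative assumptions rather than any particular system.

```python
import torch
import torch.nn as nn

class VLNAgent(nn.Module):
    """Skeleton of the three modules a VLN system is described as having:
    a visual-language encoder, an environmental history representation, and an
    action strategy module. All sizes and the action set are assumptions."""
    def __init__(self, feat_dim: int = 512, n_actions: int = 4):
        super().__init__()
        self.vision_encoder = nn.Sequential(nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
                                            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                            nn.Linear(32, feat_dim))
        self.language_encoder = nn.Embedding(10000, feat_dim)        # stand-in for an LLM/text encoder
        self.history = nn.GRU(feat_dim, feat_dim, batch_first=True)  # environmental history representation
        self.policy = nn.Linear(2 * feat_dim, n_actions)             # action strategy module

    def forward(self, frames, instruction_tokens):
        # frames: (B, T, 3, H, W); instruction_tokens: (B, L)
        B, T = frames.shape[:2]
        vis = self.vision_encoder(frames.flatten(0, 1)).view(B, T, -1)
        _, hidden = self.history(vis)                                 # summarize what the agent has seen
        lang = self.language_encoder(instruction_tokens).mean(dim=1)  # pooled instruction feature
        return self.policy(torch.cat([hidden[-1], lang], dim=-1))     # logits over discrete actions

agent = VLNAgent()
logits = agent(torch.randn(2, 5, 3, 96, 96), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # (2, 4)
```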
Latest survey: learning embodied intelligence from physical simulators and world models
具身智能之心· 2025-07-04 09:48
Author: Xiaoxiao Long et al. Editor: 具身智能之心. This article is shared for academic purposes only; please contact us for removal in case of infringement.
Motivation and Background
This survey focuses on frontier progress in embodied intelligence for robotics, arguing that the key to achieving strong embodied intelligence lies in integrating physical simulators with world models. Physical simulators provide controllable, high-fidelity environments for training and evaluating robotic agents, while world models give robots internal representations of the environment to support prediction, planning, and decision-making (a minimal world-model sketch follows below).
The survey systematically reviews recent advances in both areas, analyzes their complementary roles in improving robot autonomy, adaptability, and generalization, and examines how external simulation and internal modeling interact to bridge the gap between simulated training and real-world deployment. The authors also maintain a repository of up-to-date literature and open-source projects at https://github.com/NJU3DV-LoongGroup/Embodied-World-Models-Survey, aiming to provide a comprehensive perspective on the development of embodied AI systems and to clarify future challenges.
Introduction
As artificial intelligence and robotics advance, the interaction between intelligent agents and the physical world has become a ...
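To give one concrete picture of the world-model role described above (an internal representation used for prediction, planning, and decision-making), here is a minimal latent-dynamics sketch in PyTorch. All dimensions, module names, and the reward-scored imagination rollout are assumptions chosen for illustration, not a model from the survey.

```python
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    """Minimal sketch of the world-model idea: encode an observation into a latent state,
    predict how that state evolves under an action, and use the rollout to score
    candidate action sequences. Dimensions are illustrative assumptions."""
    def __init__(self, obs_dim: int = 64, latent_dim: int = 32, action_dim: int = 4):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, latent_dim)                 # internal representation
        self.dynamics = nn.Sequential(nn.Linear(latent_dim + action_dim, 128),
                                      nn.ReLU(), nn.Linear(128, latent_dim))
        self.reward = nn.Linear(latent_dim, 1)                        # value of imagined states

    def imagine(self, obs, actions):
        """Roll the model forward in imagination and return the summed predicted reward."""
        z = self.encoder(obs)
        total = 0.0
        for a in actions.unbind(dim=1):                               # actions: (B, T, action_dim)
            z = self.dynamics(torch.cat([z, a], dim=-1))
            total = total + self.reward(z)
        return total

wm = LatentWorldModel()
obs = torch.randn(8, 64)
candidate_plans = torch.randn(8, 5, 4)         # 8 candidate 5-step action sequences
print(wm.imagine(obs, candidate_plans).shape)  # (8, 1) — pick the best-scoring plan
```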
Xiaomi hiring (experienced & campus) | Embodied AI Algorithm Researcher for Autonomous Driving and Robotics (VLA direction)
具身智能之心· 2025-07-03 13:36
Job Description
We are looking for an outstanding researcher/scientist to join our frontier exploration team and help define and build the "brain" of the next generation of autonomous driving and robotics. You will work on breakthrough research into an Embodied Foundation Model that deeply integrates vision-language-action (VLA) capabilities and offers strong spatial perception and spatial reasoning.
Core responsibilities include
Frontier algorithm research and development: design and implement leading embodied multimodal large models. Your research will go beyond existing VLA frameworks to explore how to build a World Model capable of understanding the complex three-dimensional world and planning long-horizon, multi-step tasks.
Core model capability breakthroughs: lead breakthroughs in the following key capabilities:
Multimodal scene understanding: fuse vision, language, radar, and other multi-source information to achieve deep understanding and spatial perception of dynamic, open environments.
Learning and adaptation mechanisms: investigate reinforcement learning (RL), imitation learning (IL), and self-supervised learning so the model can continually learn and evolve from massive data and from interaction with the environment.
Technical vision and roadmap: lead the construction of a generalizable, efficient embodied foundation model that underpins the technology roadmap for the next 1-3 years, and explore its unified application to autonomous driving and general-purpose robotics.
Complex semantic reasoning and decision-making: enable the model to understand vague, abstract human instructions and, combined with ...
Carnegie Mellon University! Human2LocoMan: learning versatile quadrupedal manipulation through human pretraining
具身智能之心· 2025-07-03 13:36
Core Insights
- The article presents Human2LocoMan, a novel framework for enhancing quadrupedal robot manipulation through human pretraining, addressing the challenge of autonomous, versatile manipulation in complex environments [4][38]
- The framework uses a Modular Cross-embodiment Transformer (MXT) architecture to enable effective data collection and transfer learning from human demonstrations to robot policies [8][38]
Group 1: Framework and Methodology
- Human2LocoMan collects human data via extended reality (XR), mapping human actions to robot movements and thereby enhancing the robot's operational capabilities [7][10]
- A unified reference frame aligns actions between humans and the LocoMan robot, bridging the significant differences in dynamics and control between the two embodiments [12][10]
- The MXT architecture shares a common Transformer trunk while keeping embodiment-specific input/output modules, enabling transfer learning across platforms (an illustrative sketch follows this summary) [16][8]
Group 2: Experimental Results
- Experiments show an average success-rate improvement of 41.9% overall and 79.7% in out-of-distribution (OOD) scenarios compared with baseline methods [4][8]
- Pretraining on human data yields a 38.6% overall success-rate increase and an 82.7% improvement in OOD scenarios, demonstrating the value of human data for robot performance [8][38]
- Data collection is efficient: over 50 robot trajectories and 200 human trajectories were collected within 30 minutes [26][38]
Group 3: Comparative Analysis
- MXT outperforms state-of-the-art (SOTA) imitation learning methods across tasks, with higher success rates and task scores, particularly when data is limited [30][34]
- The modular design generalizes better and overfits less than alternatives such as HPT, which suffers from severe overfitting [36][39]
- The framework maintains high performance on long-horizon tasks, indicating robustness and effectiveness in real-world applications [36][38]
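A rough sketch of the shared-trunk idea: each embodiment keeps its own tokenizer and action head while a single Transformer trunk is shared, so weights pretrained on human data can be transferred to the robot. The embodiment names, feature sizes, and the last-token action readout are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MXTSketch(nn.Module):
    """Illustrative sketch of the modular cross-embodiment idea: each embodiment
    (human via XR, LocoMan robot) gets its own tokenizer and action head, while a
    single Transformer trunk is shared and transferred between them."""
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.tokenizers = nn.ModuleDict({
            "human":   nn.Linear(42, d_model),   # e.g. XR hand/head pose features (assumed size)
            "locoman": nn.Linear(30, d_model),   # e.g. robot proprioception features (assumed size)
        })
        self.trunk = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.heads = nn.ModuleDict({
            "human":   nn.Linear(d_model, 20),   # human action targets
            "locoman": nn.Linear(d_model, 12),   # robot action targets
        })

    def forward(self, obs_tokens, embodiment: str):
        x = self.tokenizers[embodiment](obs_tokens)  # (B, T, d_model)
        x = self.trunk(x)
        return self.heads[embodiment](x[:, -1])      # act from the last token

model = MXTSketch()
print(model(torch.randn(8, 10, 42), "human").shape)    # pretraining on human data
# After pretraining, the shared trunk is kept and only the LocoMan tokenizer/head
# need to adapt, which is the transfer step the reported gains come from.
print(model(torch.randn(8, 10, 30), "locoman").shape)
```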
Embodied intelligence: the time has come to deliver results...
具身智能之心· 2025-07-03 08:22
Core Viewpoint
- Embodied intelligence has emerged as a significant technological keyword in recent years, transitioning from obscurity to frenzy and now to a more rational phase, with companies focusing on practical applications rather than mere demonstrations [2]
Group 1: Technological Developments
- The upgrade in sensory capabilities and multimodal integration is crucial for the development of embodied technology, with tactile perception becoming a key focus, especially in dexterous manipulation [2]
- The integration of multimodal sensor fusion technology allows robots to process various types of information simultaneously, enhancing environmental perception accuracy and comprehensiveness [2]
- Large model-driven algorithms are improving robots' understanding of the world, particularly in humanoid robotics, by enhancing perception capabilities and promoting autonomous learning and decision-making [2]
Group 2: Industry Needs and Challenges
- There is an urgent need for lightweight model designs that support low computational power, multimodal capabilities, and cross-platform functionality to facilitate industry implementation [2]
- The construction of simulation environments and data ecosystems is vital for embodied intelligence, providing efficient training platforms through the simulation of physical-world phenomena [2]
- Aligning simulation data with real-world applications remains a significant challenge for researchers [2]
Group 3: Community and Resources
- The "Embodied Intelligence Heart" (具身智能之心) knowledge community serves as a platform for technical exchange among nearly 200 companies and research institutions in the field [3][8]
- The community offers a wealth of resources, including open-source projects, datasets, and learning pathways for various aspects of embodied intelligence [8][15][17]
- Members of the community can access job postings, industry reports, and educational materials to enhance their knowledge and career prospects in embodied intelligence [8][17][19]
The world's first VLA survey for autonomous driving is released: a complete breakdown of VLA models for self-driving
具身智能之心· 2025-07-03 08:22
Today, 自动驾驶之心 shares the latest work from research teams at McGill University, Tsinghua University, Xiaomi, and the University of Wisconsin-Madison: a survey of vision-language-action models for autonomous driving. (Originally published by 自动驾驶之心; author: Sicong Jiang et al.)
When the three capabilities of vision, language, and action are fused in a single model, where will the future of autonomous driving head? Recently, a joint team from McGill University, Tsinghua University, Xiaomi, and the University of Wisconsin-Madison released the world's first comprehensive survey of Vision-Language-Action (VLA) models for autonomous driving. The paper, titled "A Survey on Vision-Languag ...