具身智能之心
Data, Algorithms, and the Robot Platform: Beginners Can Hardly Skip Any of Them...
具身智能之心· 2025-06-28 07:48
Hardware: well-funded labs have the budget to buy robot platforms in the 200k-300k RMB range, while students with limited budgets rely on 3D-printing their own robotic arms, buying cost-effective hardware platforms, or even working entirely in simulation, which constrains their research.

Our embodied-intelligence community shares extensively across these three big modules (data collection schemes, robot platforms, simulation, and algorithms) and also recommends several cost-effective robotic arm platforms to support research.

The community's goal is to build a gathering place of 10,000 members within three years, and we warmly welcome outstanding students to join us (many researchers at the frontier of embodied intelligence already have). We have built a complete bridge with multiple embodied-AI companies spanning academia, products, and recruiting, and internally our teaching and research offerings have essentially closed the loop (courses + hardware + Q&A). The community also carries many of the latest industry viewpoints and technical write-ups. What do today's robot platforms look like, and where do they fall short? How can the success rate and yield of data collection be improved? How can sim2real be done more effectively? These are questions we follow continuously.

Getting started in embodied intelligence is inseparable from three elements, data + algorithms + the robot platform, and frankly many students only understand the algorithms, and often only half-understand those! Data collection especially takes experience: between teleoperation and retargeting schemes, many people never manage to collect genuinely valid data. The robot platform is even further out of reach for many students, so cost-effective platforms and simulation are the first step for most newcomers.

Data: teleoperated collection depends on a physical robot and is relatively costly, but preprocessing ...
Embodied AI Autumn Recruitment Is About to Start. Where Can Job Seekers Band Together?
具身智能之心· 2025-06-28 07:48
Core Viewpoint
- The article emphasizes the rapid advancements in AI technologies, particularly in autonomous driving and embodied intelligence, which have significantly influenced the industry and investment landscape [1].

Group 1: AutoRobo Knowledge Community
- The AutoRobo Knowledge Community is a platform for job seekers in autonomous driving, embodied intelligence, and robotics, currently hosting nearly 1,000 members from various companies [2].
- The community provides resources such as interview questions, industry reports, salary negotiation tips, and resume optimization services to assist members in their job search [2][3].

Group 2: Recruitment Information
- The community regularly shares job openings in algorithms, development, and product roles, covering campus recruitment, social recruitment, and internships [3][4].

Group 3: Interview Preparation
- A compilation of 100 interview questions on autonomous driving and embodied intelligence is available, covering essential topics for job seekers [6].
- Specific areas of focus include sensor fusion, lane detection algorithms, and multi-modal 3D object detection, among others [7][12].

Group 4: Industry Reports
- The community offers access to industry reports covering the current state, development trends, and market opportunities of the autonomous driving and embodied intelligence sectors [13][14].
- Reports include analyses of successful and failed interview experiences, which serve as valuable learning material for candidates [15].

Group 5: Salary Negotiation and Professional Development
- The community shares salary negotiation techniques and foundational books on robotics, autonomous driving, and AI to strengthen members' professional knowledge [17][18].
How Should You Approach Your First Paper in the Embodied AI Field?
具身智能之心· 2025-06-27 09:41
Core Viewpoint
- The article promotes a comprehensive tutoring service for students facing challenges in research paper writing, particularly in cutting-edge fields such as multimodal large models, embodied intelligence, and robotics [2][3][4].

Group 1: Tutoring Services Offered
- The service includes one-on-one customized guidance in advanced research areas, including multimodal large models, visual-language navigation, and robot navigation [3][4].
- The tutoring team consists of PhD researchers from institutions such as CMU, Stanford, and MIT, with experience reviewing for top-tier conferences [4].
- The tutoring covers the entire research paper lifecycle, from topic selection through experimental design, coding, and writing to submission strategy [4].

Group 2: Target Audience and Benefits
- The service targets students struggling with research topics, data modeling, and advisor feedback, offering a path to stronger academic performance [2][5].
- The first 50 students to consult can be matched with a dedicated tutor for free, receiving in-depth analysis and tailored advice on conference and journal submissions [5].
- The focus is not only on publishing papers but also on the practical application and value of research outcomes in industrial and academic contexts [4].
ICCV 2025: An Incomplete Roundup (Embodied AI / Autonomous Driving / 3D Vision / LLM / CV, etc.)
具身智能之心· 2025-06-27 09:41
- [Video + Analysis] DriveArena: A Controllable Generative Simulation Platform for Autonomous Driving
- Boost 3D Reconstruction using Diffusion-based Intrinsic Estimation
- Epona: Autoregressive Diffusion World Model for Autonomous Driving
- SynthDrive: Scalable Real2Sim2Real Sensor Simulation Pipeline for High-Fidelity Asset Generation and Driving Data Synthesis
- StableDepth: Scene-Consistent and Scale-Invariant Monocular Depth
- CoopTrack: Exploring End-to-End Learning for Efficient Cooperative Sequential Perception
- U-ViLAR: Uncertai ...
Awards Announced at RSS 2025, a Top Robotics Conference!
具身智能之心· 2025-06-27 08:36
Author: 机器之心 | Editor: 机器之心

Congratulations to the winners. RSS (Robotics: Science and Systems) is a top academic conference in robotics. Held annually since 2005, it aims to advance scientific research and technological applications in the field. This year's conference took place June 21-25 in Los Angeles, USA, and several awards have now been announced: the Outstanding Demo Paper Award, Outstanding Systems Paper Award, Outstanding Student Paper Award, and Outstanding Paper Award.

Awards page: https://roboticsconference.org/program/awards/

Outstanding Demo Paper Award
- Paper title: Demonstrating MuJoCo Playground
- Paper link: https://www.roboticsproceedings.org/rss21/p020.pdf
- Project page: https://playground.mujoco.org/
- Institutions: UC Berkeley, Google ...
Step-by-Step Embodied Intelligence in Practice: From Zero to Reinforcement Learning and Sim2Real
具身智能之心· 2025-06-27 08:36
Core Viewpoint
- The article discusses the unprecedented turning point in AI development, highlighting the rise of embodied intelligence and its potential to revolutionize industries including manufacturing, healthcare, and space exploration [1].

Group 1: Embodied Intelligence
- Embodied intelligence is defined as AI systems that not only possess a "brain" but also the capability to perceive and interact with the physical world [1].
- Major tech companies like Tesla, Boston Dynamics, OpenAI, and Google are actively investing in this transformative field [1].

Group 2: Technical Challenges
- Achieving true embodied intelligence presents significant technical challenges, requiring advanced algorithms and a deep understanding of physical simulation, robot control, and perception fusion [2].

Group 3: MuJoCo's Role
- MuJoCo (Multi-Joint dynamics with Contact) is identified as a critical technology for embodied intelligence, serving as a high-fidelity training environment for robot learning [4].
- It allows researchers to run millions of trials in a virtual environment, significantly speeding up learning and reducing the costs associated with physical hardware [6].

Group 4: MuJoCo's Advantages
- MuJoCo features advanced contact dynamics algorithms, supports parallel computation, and provides a variety of sensor models, making it a standard tool in both academia and industry [6][7].
- Major tech companies use MuJoCo for their robot research, underscoring its importance in the field [7].

Group 5: Practical Training
- A comprehensive MuJoCo development course is offered, combining practical applications with theoretical foundations and covering topics from physical simulation to deep reinforcement learning [8][9].
- The course is structured into six modules, each with specific learning objectives and practical projects, ensuring a solid grasp of embodied intelligence technologies [10][12].

Group 6: Project Examples
- The course includes projects such as intelligent robotic arm control, vision-guided grasping systems, and multi-robot collaboration, allowing participants to apply their knowledge in real-world scenarios [14][21].

Group 7: Target Audience and Outcomes
- The course suits individuals with programming or algorithm backgrounds looking to enter embodied robotics, as well as graduate and undergraduate students focused on robotics and reinforcement learning [27].
- Upon completion, participants will have a complete embodied-intelligence skill set spanning technical, engineering, and innovative capabilities [28].
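For readers new to MuJoCo, the basic simulation loop is compact. The sketch below uses the official `mujoco` Python bindings; the single-pendulum MJCF model and the constant torque are illustrative assumptions standing in for a real robot model and a trained policy, not course material.

```python
import mujoco

# A hypothetical minimal MJCF model: one pole on a hinge with a motor.
PENDULUM_XML = """
<mujoco>
  <option timestep="0.002"/>
  <worldbody>
    <body name="pole" pos="0 0 1">
      <joint name="hinge" type="hinge" axis="0 1 0"/>
      <geom type="capsule" fromto="0 0 0 0 0 -0.5" size="0.02" mass="1"/>
    </body>
  </worldbody>
  <actuator>
    <motor joint="hinge" gear="1"/>
  </actuator>
</mujoco>
"""

model = mujoco.MjModel.from_xml_string(PENDULUM_XML)
data = mujoco.MjData(model)

# Run 1000 control steps; an RL policy would replace the constant torque.
for step in range(1000):
    data.ctrl[0] = 0.1           # apply a small torque at the hinge
    mujoco.mj_step(model, data)  # advance the physics by one timestep

print("joint angle (rad):", data.qpos[0], "joint velocity:", data.qvel[0])
```

This load-step-read loop is the core of every MuJoCo workflow; training pipelines wrap it in an environment interface and run many copies in parallel.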
A New Survey from Tsinghua University! Multi-Sensor Fusion Perception in Embodied AI: Background, Methods, and Challenges
具身智能之心· 2025-06-27 08:36
Core Insights
- The article emphasizes the significance of embodied AI and multi-sensor fusion perception (MSFP) as a critical pathway toward artificial general intelligence (AGI), enabled by real-time environmental perception and autonomous decision-making [3][4].

Group 1: Importance of Embodied AI and Multi-Sensor Fusion
- Embodied AI is a form of intelligence that operates through physical entities, enabling autonomous decision-making and action in dynamic environments, with applications in autonomous driving and robotic swarm intelligence [3].
- Multi-sensor fusion is essential for robust perception and accurate decision-making in embodied AI systems, integrating data from sensors such as cameras, LiDAR, and radar to achieve comprehensive environmental awareness [3][4].

Group 2: Limitations of Current Research
- Existing AI-based MSFP methods have succeeded in fields like autonomous driving but face inherent challenges in embodied-AI applications, such as the heterogeneity of cross-modal data and temporal asynchrony between sensors [4][7].
- Current reviews often focus on a single task or research area, limiting their usefulness to researchers in related fields [7][8].

Group 3: Structure and Contributions of the Survey
- The survey organizes MSFP research from multiple technical perspectives, covering perception tasks, sensor data types, popular datasets, and evaluation standards [8].
- It reviews point-level, voxel-level, region-level, and multi-level fusion methods, along with collaborative perception among multiple embodied agents and infrastructure [8][21].

Group 4: Sensor Data and Datasets
- Sensor types discussed include camera data, LiDAR, and radar, each with distinct advantages and challenges for environmental perception [10][12].
- Datasets used in MSFP research, such as KITTI, nuScenes, and Waymo Open, are presented with their modalities, scenarios, and frame counts [12][13][14].

Group 5: Perception Tasks
- Key perception tasks include object detection, semantic segmentation, depth estimation, and occupancy prediction, each contributing to overall scene understanding [16][17].

Group 6: Multi-Modal Fusion Methods
- Multi-modal fusion methods are categorized into point-level, voxel-level, region-level, and multi-level fusion, each with specific techniques for improving perception robustness [21][22][23][24][28].

Group 7: Multi-Agent Fusion Methods
- Collaborative perception techniques integrate data from multiple agents and infrastructure, addressing challenges such as occlusion and sensor failures [35][36].

Group 8: Time-Series Fusion
- Time-series fusion is a key component of MSFP systems, enhancing perception continuity across time and space through query-based fusion methods [38][39].

Group 9: Multi-Modal Large Language Model (LLM) Fusion
- The integration of multi-modal data with LLMs is explored, covering advances in tasks like image description and cross-modal retrieval, along with new datasets designed to strengthen embodied-AI capabilities [47][50].
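As a concrete illustration of the point-level fusion family the survey categorizes, the sketch below decorates LiDAR points with image features in the spirit of PointPainting-style methods: project each point into the image with the camera intrinsics and extrinsics, then append the pixel's features to the point. The matrices, array shapes, and the `decorate_points` helper are illustrative assumptions, not code from the survey.

```python
import numpy as np

def decorate_points(points_lidar, image_feats, K, T_cam_from_lidar):
    """points_lidar: (N, 3) xyz in the LiDAR frame.
    image_feats: (H, W, C) per-pixel features (raw RGB or a CNN feature map).
    K: (3, 3) camera intrinsics. T_cam_from_lidar: (4, 4) extrinsics.
    Returns an (M, 3 + C) array of points decorated with image features."""
    H, W, _ = image_feats.shape
    # Transform the points into the camera frame (homogeneous coordinates).
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]
    in_front = pts_cam[:, 2] > 0          # keep points in front of the camera
    pts_cam = pts_cam[in_front]
    # Perspective projection to pixel coordinates.
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)   # inside the image bounds
    # Concatenate geometry with the sampled image features.
    return np.hstack([points_lidar[in_front][valid],
                      image_feats[v[valid], u[valid]]])
```

Voxel-level and region-level variants follow the same pattern but aggregate the image features over voxels or region proposals instead of individual points.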
A Step-by-Step Guide! ALOHA: A Classic Work Combining Low-Cost Bimanual Robots with Imitation Learning
具身智能之心· 2025-06-27 08:36
Core Viewpoint
- The article discusses the ALOHA system, a low-cost open-source hardware system for bimanual teleoperation, emphasizing its ability to perform precise manipulation tasks with affordable components and modern learning algorithms [4][5][8].

Group 1: ALOHA System Overview
- ALOHA costs less than $20,000 and enables precise manipulation using two low-cost robotic arms and 3D-printed components [7][8].
- The system uses end-to-end imitation learning, training on real demonstrations collected through a custom teleoperation interface [8][10].

Group 2: Challenges in Imitation Learning
- Imitation learning suffers from compounding errors: small prediction errors accumulate and drive the robot far from the expert's behavior [9][12].
- Modeling complex physical interactions is difficult, so learning policies directly from demonstrations is more effective than modeling the entire environment [9][12].

Group 3: Action Chunking with Transformers (ACT)
- The ACT algorithm mitigates compounding errors by predicting sequences of actions rather than single steps, improving performance on high-complexity tasks [12][13].
- The algorithm achieves an 80-90% success rate on tasks given only 10 minutes of demonstration data [12].

Group 4: Hardware Specifications
- ALOHA is built on the principles of low cost, versatility, user-friendliness, repairability, and ease of construction, using ViperX 6-DoF robotic arms [17][18].
- The system handles a range of tasks, including precise, contact-rich, and dynamic operations [20][22].

Group 5: Data Collection and Training
- Human demonstrations train the policy, with the leader robot's joint positions capturing the operator's intent and force feedback [23][25].
- Training uses a conditional variational autoencoder (CVAE) to model the human data and learn from noisy demonstrations [33][55].

Group 6: Experimental Results
- Experiments show that action chunking and temporal ensembling significantly improve ACT's performance [52][54].
- High-frequency control proves necessary: a 50 Hz control rate enables more precise and agile task execution [56].
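The interplay of action chunking and temporal ensembling is easiest to see in code. Below is a minimal sketch of the ensembling step as the article describes it: each past chunk that covers the current timestep contributes its prediction for that step, weighted exponentially with the oldest prediction weighted highest. The `temporal_ensemble` helper, the weighting constant `m`, and the dummy policy are illustrative assumptions, not the ACT reference implementation.

```python
import numpy as np

def temporal_ensemble(chunks, t, k, m=0.01):
    """chunks[s] is the (k, action_dim) action chunk predicted at step s.
    Returns the ensembled action to execute at the current step t."""
    steps = [s for s in range(max(0, t - k + 1), t + 1) if s in chunks]
    # Each overlapping chunk's action for step t (chunk at s covers s..s+k-1).
    preds = np.array([chunks[s][t - s] for s in steps])
    # Exponential weights w_i = exp(-m * i), with i = 0 for the oldest chunk,
    # so older predictions are weighted highest.
    w = np.exp(-m * np.arange(len(steps)))
    w /= w.sum()
    return (preds * w[:, None]).sum(axis=0)

# Usage with a dummy policy that predicts k = 4 actions of dimension 7 per step:
rng = np.random.default_rng(0)
chunks = {}
for t in range(10):
    chunks[t] = rng.normal(size=(4, 7))   # stand-in for policy(observation)
    action = temporal_ensemble(chunks, t, k=4)
```

Because every chunk covering step t votes on the action, a single bad prediction is smoothed out rather than compounding into a drift from the expert trajectory.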
A New Paradigm for 3D VLA! BridgeVLA, the CVPR Championship Solution, Boosts Real-Robot Performance by 32%
具身智能之心· 2025-06-26 14:19
Core Viewpoint
- The article discusses BridgeVLA, a model developed by the Institute of Automation, Chinese Academy of Sciences, which efficiently projects 3D inputs into 2D images for action prediction, achieving high performance and data efficiency in 3D robot manipulation learning [4][6].

Group 1: Model Performance
- BridgeVLA achieves a 96.8% task success rate with only 3 trajectories per task in the basic setting, and outperforms baseline models across a range of generalization settings with a 32% performance improvement [6][25].
- On simulation benchmarks, BridgeVLA surpasses mainstream 3D manipulation baselines: an 88.2% success rate on RLBench, a 7.3% improvement on COLOSSEUM, and a 50% success rate on GemBench [20][25].

Group 2: Model Design and Training
- Training proceeds in two phases: 2D heatmap pre-training to strengthen spatial perception, followed by 3D action fine-tuning to learn concrete manipulation policies [15][17].
- During heatmap pre-training, the model predicts a probability heatmap of target object locations from textual instructions, building its spatial awareness [16][25].

Group 3: Generalization and Data Efficiency
- BridgeVLA generalizes well to disturbances such as unseen objects, lighting conditions, and object types, thanks to the rich visual and linguistic priors in the pre-trained multimodal backbone [20][25].
- Its data efficiency is striking: with only 3 trajectories it nearly matches its own performance with 10, making it practical for deployment on real robot systems [25][26].
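To make the project-then-predict idea concrete, here is a rough sketch of one possible top-down variant: render the point cloud orthographically into a 2D image, let a 2D backbone predict a heatmap over that image, and map the heatmap peak back to workspace coordinates. The view choice, grid resolution, and both helper functions are illustrative assumptions, not BridgeVLA's actual configuration.

```python
import numpy as np

def project_topdown(points, ws_min, ws_max, H=224, W=224):
    """Orthographically project an (N, 3) point cloud into an (H, W) top-down image.
    ws_min/ws_max are (3,) workspace bounds."""
    img = np.zeros((H, W), dtype=np.float32)
    # Normalize x, y into [0, 1] over the workspace, then rasterize.
    uv = (points[:, :2] - ws_min[:2]) / (ws_max[:2] - ws_min[:2])
    u = np.clip((uv[:, 0] * W).astype(int), 0, W - 1)
    v = np.clip((uv[:, 1] * H).astype(int), 0, H - 1)
    img[v, u] = 1.0   # a real input would carry color/height channels
    return img

def heatmap_peak_to_xy(heatmap, ws_min, ws_max):
    """Map the peak of a predicted (H, W) heatmap back to workspace x/y."""
    H, W = heatmap.shape
    v, u = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    x = ws_min[0] + (u + 0.5) / W * (ws_max[0] - ws_min[0])
    y = ws_min[1] + (v + 0.5) / H * (ws_max[1] - ws_min[1])
    return x, y
```

The appeal of this design is that the heatmap prediction lives entirely in 2D, where large pre-trained vision-language backbones are strongest, while the projection and back-projection carry the spatial grounding to and from 3D.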
What Exactly Is Goal Navigation, This Year's Hot Topic? What Are the Routes from Goal Search to Goal Reaching?
具身智能之心· 2025-06-26 14:19
Core Viewpoint
- Goal-oriented navigation enables robots to complete navigation tasks autonomously from a goal description alone, marking a significant shift from traditional visual-language navigation systems [2][3].

Group 1: Technology Overview
- Embodied navigation is a core area of embodied intelligence, resting on three technical pillars: language understanding, environmental perception, and path planning [2].
- Goal-oriented navigation requires robots to explore and plan paths in unfamiliar 3D environments given only a goal description such as coordinates, an image, or natural language [2].
- The technology has been industrialized in verticals including delivery, healthcare, and hospitality, improving service efficiency [3].

Group 2: Technological Evolution
- The evolution of goal-oriented navigation falls into three generations:
  - First generation: end-to-end methods built on reinforcement learning and imitation learning, achieving breakthroughs in point navigation and closed-set image-goal navigation [5].
  - Second generation: modular methods that explicitly construct semantic maps, decomposing the task into exploration and goal localization [5].
  - Third generation: integration of large language models (LLMs) and vision-language models (VLMs) for knowledge-based reasoning and open-vocabulary goal matching [7].

Group 3: Challenges and Learning Path
- Goal-oriented navigation demands knowledge from multiple fields, which makes it challenging for newcomers to enter the area [9].
- A new course addresses these challenges, focusing on fast onboarding, building a research framework, and combining theory with practice [10][11][12].

Group 4: Course Structure
- The course covers the theoretical foundations and technical lineage of goal-oriented navigation, including task definitions and evaluation benchmarks [15].
- It also covers the Habitat simulation ecosystem, end-to-end navigation methodologies, modular navigation architectures, and LLM/VLM-driven navigation systems [16][18][20][22].
- A capstone project reproduces the VLFM algorithm and deploys it in real-world scenarios [24].
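Since the capstone centers on VLFM, which scores exploration frontiers with a vision-language model, the sketch below shows the classical frontier-detection step such systems build on: on an occupancy grid, frontiers are free cells bordering unknown space, and a VLM score would then rank which frontier to pursue for the language goal. The grid encoding (0 = free, 1 = occupied, -1 = unknown) and the toy map are illustrative assumptions.

```python
import numpy as np

def find_frontiers(grid):
    """Return (row, col) indices of free cells adjacent to unknown cells."""
    free = grid == 0
    unknown = grid == -1
    # A cell's 4-neighborhood touches unknown space if shifting the unknown
    # mask by one cell in any cardinal direction lands on it.
    near_unknown = np.zeros_like(unknown)
    near_unknown[1:, :] |= unknown[:-1, :]
    near_unknown[:-1, :] |= unknown[1:, :]
    near_unknown[:, 1:] |= unknown[:, :-1]
    near_unknown[:, :-1] |= unknown[:, 1:]
    return np.argwhere(free & near_unknown)

# Toy map: a partially explored room with unknown space on the right.
grid = np.full((5, 8), -1)
grid[:, :4] = 0          # explored free space
grid[2, 3] = 1           # an obstacle on the boundary
print(find_frontiers(grid))
```

In a full system the navigation stack repeats this loop continuously: update the map from observations, detect frontiers, score them against the goal description, and plan a path to the best-scoring frontier until the goal object is detected.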