Embodied AI

Kitchen-R: A Mobile Manipulation Benchmark for Joint Evaluation of High-Level Task Planning and Low-Level Control
具身智能之心· 2025-08-25 00:04
Editor: 具身智能之心

Foreword & Motivation

1) Why benchmarks matter - Benchmarks are widely used to measure model progress in natural language processing (e.g., GLUE) and computer vision (e.g., Visual Genome). In robotics, simulator-based benchmarks (e.g., Behavior-1K) are equally common, serving both evaluation and training, and they must simulate low-level actions accurately enough for results to transfer to real robots.

2) The fragmentation of existing benchmarks - In recent years, large language models (LLMs) and vision-language models (VLMs) have been widely applied to robot task planning and instruction following, yet existing benchmarks have clear shortcomings:

3) The core value of Kitchen-R - Benchmarks are the central tool for measuring progress in robotics and embodied AI, but current ones are markedly fragmented: high-level language instruction-following benchmarks often assume perfect low-level execution, while low-level robot control benchmarks rely only on simple single-step instructions. This split makes it impossible to comprehensively evaluate integrated systems in which both task planning and physical execution matter. To fill this gap, the authors propose the Kitchen-R benchmark - a simulated kitchen environment ...
Shoucheng Holdings Partners with Wanxun Technology to Launch China's First "Robot+" Automated Charging Experience Station in Chengdu
Ge Long Hui· 2025-08-20 04:15
Recently, Shoucheng Holdings (0697.HK) and its portfolio company Wanxun Technology launched, in the underground car park of the ICD mall in Chengdu, China's first "robot+" automated-charging pop-up experience station fully open to the public, along with a city-wide "hands-free charging" interactive campaign. The station overcomes key technical bottlenecks to deliver contactless charging regardless of vehicle model, marking the transition of automated charging technology from closed testing to large-scale commercial deployment.

The launch is not just a technology showcase but a concrete instance of Shoucheng Holdings' integrated "capital + scenario + technology" model. Leveraging its parking-scenario resources, the company is scaling a highly adaptable robotic charging solution, raising the technological value and user experience of parking services while laying the groundwork for service-robot infrastructure in future smart cities.

The company stated that it will continue to draw on its industrial resources and advantages in robotics to push large-scale deployment of innovative robotic technology in parking, campus, and similar scenarios, reinforcing its position as a leading intelligent-infrastructure service provider in China.

The project reportedly uses a "bionic flexible arm + embodied AI" design: fluid-driven, soft-material bionic muscles mimic human arm motion, combined with a multi-sensor fusion system to ensure stable operation under complex lighting and temperature conditions. The solution is safe, highly adaptable, and markedly cheaper to deploy and maintain, supporting rapid rollout across a range of commercial scenarios.

The experience campaign is still under way, and the charging station will remain open to the public long-term; new-energy-vehicle owners are welcome to try it ...
After ChatGPT Peaks, World Models Are AI's New Battleground: China Is Already One Step Ahead!
老徐抓AI趋势· 2025-07-31 01:03
Core Viewpoint
- The article discusses the transition from large language models (LLMs) to "world models" as the next competitive focus in AI, highlighting the limitations of LLMs and the potential of world models to reshape AI's future and drive economic growth [2][5][28]

Summary by Sections

AI's Evolution
- AI development is categorized into three stages: perceptual AI, generative AI, and embodied AI, with each stage representing significant technological advancement [5][18]

Stage One: Perceptual AI
- The breakthrough in perceptual AI occurred in 2012 when Geoffrey Hinton's team surpassed human image recognition accuracy, but its capabilities were limited to recognition, without reasoning or cross-domain learning [7][9]

Stage Two: Generative AI
- The introduction of the Transformer architecture in 2017 marked a qualitative leap, enabling AI to train on vast amounts of text data and significantly expanding its knowledge base [12][13]
- However, this growth is nearing a limit, with predictions that usable internet data for training will peak around 2028 [15]

Stage Three: Embodied AI
- The next phase involves embodied AI, where AI learns through interaction with the real world rather than just textual data, necessitating the development of world models [16][18]

What Is a World Model?
- A world model is a high-precision simulator that adheres to physical laws, allowing AI to learn through trial and error in a virtual environment and significantly reducing the data-collection costs of real-world training [19][20]

Challenges of World Models
- Unlike simple video generation, world models must ensure consistency with physical laws to be effective for training AI, addressing issues like physical inconsistencies in generated scenarios [20][22]

Breakthroughs by SenseTime
- SenseTime's "KAIWU" world model lets users describe scenarios in natural language and generates videos that comply with physical laws, transforming training for autonomous driving and robotics [22][24]

Implications of World Models
- The shift to world models will change data production methods, improve training efficiency, and transform industries such as autonomous driving, robotics, manufacturing, healthcare, and education [28]

Future Outlook
- The emergence of world models is anticipated to accelerate economic growth, with the potential for a "ChatGPT moment" in the next 1-2 years, driven by unprecedented investment and innovation in the AI sector [28][29]
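The trial-and-error idea behind world models can be illustrated with a toy sketch (entirely hypothetical, unrelated to KAIWU or any real system): an agent fits a cheap approximation of the environment's dynamics from a little real experience, then "plans" by querying that learned model instead of acting in the real world.

```python
import random

# Toy illustration of the world-model idea (hypothetical example, not any
# real system): learn a dynamics model from a few real transitions, then do
# cheap trial-and-error inside the learned model instead of the real world.

def real_step(x, a):
    """Ground-truth environment: state decays by friction, then the action is added."""
    return 0.9 * x + a

def fit_world_model(transitions):
    """Fit x' = k*x + a by least squares over observed (x, a, x') triples."""
    num = sum(x * (x2 - a) for x, a, x2 in transitions)
    den = sum(x * x for x, a, x2 in transitions)
    k = num / den
    return lambda x, a: k * x + a

# Collect a small amount of real experience (the expensive part).
random.seed(0)
data, x = [], 1.0
for _ in range(20):
    a = random.uniform(-1, 1)
    x2 = real_step(x, a)
    data.append((x, a, x2))
    x = x2

model = fit_world_model(data)

def plan(model, x, goal=0.0):
    """Pick the action whose *predicted* next state is closest to the goal,
    evaluating candidates only in the learned model, never the real world."""
    actions = [i / 10 - 1.0 for i in range(21)]  # candidate actions in [-1, 1]
    return min(actions, key=lambda a: abs(model(x, a) - goal))

best = plan(model, x=2.0)
print(best)  # the most negative available action, pushing x=2.0 toward 0
```

The point of the sketch is the separation of roles: real interaction is used only to fit the model, while all of the agent's trial-and-error happens in simulation, which is exactly the data-cost argument the article makes for world models.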
Communications Industry: OpenAI Releases ChatGPT Agent and Previews GPT-5; NVIDIA's Edge-Side Thor About to Ship
Shanxi Securities· 2025-07-25 10:36
Investment Rating
- The report maintains an "Outperform the Market" rating for the communications industry [1]

Core Insights
- OpenAI has launched the new ChatGPT Agent, significantly enhancing its ability to perform complex, long-duration tasks, which is expected to drive demand for GPU computing and cloud servers [2][3][16]
- OpenAI's latest reasoning model has achieved gold-medal level at the International Mathematical Olympiad (IMO), indicating a major advance in reasoning capability and foreshadowing the upcoming release of GPT-5 [4][17]
- NVIDIA's Jetson Thor is set to be released, marking a breakthrough in physical and embodied AI, with computational power expected to propel industry growth [5][18]
- The computing sector is surging, with leading companies in the IDC supply chain reaching new highs, driven by improved earnings confidence and long-term demand expectations [8][20]

Summary by Sections

Industry Dynamics
- OpenAI's ChatGPT Agent can automate a wide range of tasks and is available to Pro, Plus, and Team subscribers, with usage limits based on subscription tier [3][16]
- OpenAI's new reasoning model scored 44.4 on the Humanity's Last Exam (HLE) assessment, the highest publicly reported score in the industry [3][16]
- NVIDIA's Jetson Thor products, T4000 and T5000, are set to launch, boasting high computational capability and compatibility with existing AI platforms [5][18]

Market Performance
- The overall market rose during the week of July 14-18, 2025, with the Shenwan Communication Index up 7.56% [9][21]
- The top-performing subsectors were optical modules (+27.45%), liquid cooling (+10.16%), and IDC (+10.01%) [9][21]
- Notable individual performers included Xinyisheng (+39.01%) and Zhongji Xuchuang (+24.33%) [9][21]

Recommendations
- Companies to watch include Zhongji Xuchuang, Dongshan Precision, Guangku Technology, and others in the overseas computing chain, as well as Ruixinwei and Tianzhun Technology in the edge AI sector [21]
What Exactly Is at the Core of Image Goal Navigation?
具身智能之心· 2025-07-04 12:07
Research Background and Core Issues
- Image goal navigation requires two key capabilities: core navigation skills, and computing direction information by comparing the current visual observation with the target image [2]
- The research asks whether this task can be solved efficiently by end-to-end training of complete agents with reinforcement learning (RL) [2]

Core Research Content and Methods
- The study explores various architectural designs and their impact on task performance, emphasizing implicit correspondence computation between images [3][4]
- Key architectures discussed include Late Fusion, ChannelCat, SpaceToDepth + ChannelCat, and Cross-attention [4]

Main Findings
- Early patch-level fusion methods (like ChannelCat and Cross-attention) are more critical than late fusion (Late Fusion) for supporting implicit correspondence computation [8]
- Performance varies significantly across simulator settings, particularly the "Sliding" setting [8][10]

Performance Metrics
- Success rate (SR) and success weighted by path length (SPL) are used to evaluate the models [7]
- For example, with Sliding=True, ChannelCat (ResNet9) achieved an SR of 83.6%, while Late Fusion reached only 13.8% [8]

Transferability of Abilities
- Some learned capabilities transfer to more realistic environments, especially when the perception module's weights are carried over [10]
- Training with Sliding=True and then fine-tuning in a Sliding=False environment improved SR from 31.7% to 38.5% [10]

Relationship Between Navigation and Relative Pose Estimation
- Navigation performance correlates with relative pose estimation accuracy, indicating the importance of direction-information extraction in image goal navigation [12]

Conclusion
- Architectural designs that support early local fusion (like Cross-attention and ChannelCat) are crucial for implicit correspondence computation [15]
- The simulator's Sliding setting significantly affects performance, but transferring perception-module weights can help retain some capability in real-world scenarios [15]
- Navigation performance is related to relative pose estimation ability, confirming the core role of direction-information extraction in image goal navigation [15]
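The SR and SPL figures above follow the standard embodied-navigation definitions; a minimal sketch of how the two metrics are computed (these are the commonly used formulas, not code from the paper, and the episode records are made-up examples):

```python
def success_rate(episodes):
    """SR: fraction of episodes in which the agent reached the goal."""
    return sum(e["success"] for e in episodes) / len(episodes)

def spl(episodes):
    """Success weighted by Path Length: mean over episodes of
    success * shortest_path / max(agent_path, shortest_path).
    A successful agent that takes a detour gets proportionally less credit."""
    total = 0.0
    for e in episodes:
        if e["success"]:
            total += e["shortest_path"] / max(e["agent_path"], e["shortest_path"])
    return total / len(episodes)

episodes = [
    {"success": True,  "shortest_path": 5.0, "agent_path": 5.0},   # optimal path
    {"success": True,  "shortest_path": 5.0, "agent_path": 10.0},  # 2x detour
    {"success": False, "shortest_path": 5.0, "agent_path": 3.0},   # failure
]
print(success_rate(episodes))       # 2/3
print(round(spl(episodes), 3))      # (1.0 + 0.5 + 0) / 3 = 0.5
```

This also shows why SR and SPL can diverge in the tables above: an agent can reach the goal often (high SR) while wandering enough that its SPL stays much lower.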
Media Mid-Year Strategy Report: Focus on Leading Names Backed by Solid Fundamentals with New Business Progress and Delivery - 20250704
Guotou Securities· 2025-07-04 08:52
Core Insights
- The report emphasizes the importance of solid fundamentals and the advancement of new business models in leading companies within the media sector, particularly in the context of AI technology and its impact on content creation and distribution [1][2]
- It highlights the need for a narrative shift in the media industry as it adapts to the AI era, focusing on how AI can reshape content forms and business models [1][2]

Media Industry Historical Review
- The media internet era began around 2005, marked by innovations in content forms such as online literature, gaming, and digital music, which laid the groundwork for future developments [10][11]
- The peak of the traffic dividend in 2018 led to a significant policy-driven cleanup in the media internet industry, transitioning from a focus on content to a more diversified approach to distribution and monetization [17][21]

Game Sector Analysis
- The gaming sector has had a stable competitive landscape since 2017, with major players like Tencent and NetEase dominating the market, particularly in mobile gaming [30][31]
- The industry has evolved significantly, shifting toward mobile games and new business models, including live streaming and esports [30][34]

Investment Recommendations
- The report suggests focusing on leading companies with strong fundamentals and new business initiatives in the second half of 2025, particularly in the gaming and film sectors [2]
- Specific companies to watch include Wanda Film, Bona Film Group, and several others in the gaming and publishing sectors, indicating a strategic interest in firms with merger-and-acquisition potential [2]

Media Sector Performance
- The media sector rose 12.77% in the first half of 2025, ranking fourth in growth among sectors [18]
- The film industry continues to thrive, with box office revenues reaching new heights and a growing number of cinema screens [26][28]
With H2 CCF-A/B Conference Windows Narrowing, Is There Still Time to Publish an Embodied-AI Paper?
具身智能之心· 2025-06-29 09:51
Core Viewpoint
- The article emphasizes the importance of timely submission of research papers to key conferences, particularly for researchers in autonomous driving and embodied AI, and highlights the challenge of ensuring high-quality submissions under time constraints [1]

Group 1: Pain Points Addressed
- The program targets students who lack guidance from mentors, have fragmented knowledge, and need a clear understanding of the research process [3][4]
- It aims to help students establish research thinking, familiarize themselves with research workflows, and master both classic and cutting-edge algorithms [3]

Group 2: Phases of Guidance
- Topic Selection Phase: mentors assist students in brainstorming ideas or provide direct suggestions based on their needs [5]
- Experiment Phase: mentors guide students through experimental design, model building, parameter tuning, and validating the feasibility of their ideas [7][12]
- Writing Phase: mentors support students in crafting compelling research papers that stand out to reviewers [9][13]

Group 3: Course Structure and Duration
- The total guidance period ranges from 3 to 18 months depending on the target publication's tier, with specific core guidance and maintenance periods for each category [22][26]
- For CCF A / SCI Q1, the core guidance consists of 9 sessions; for CCF B / SCI Q2 and CCF C / SCI Q3, 7 sessions each [22]

Group 4: Additional Support and Resources
- The program includes personalized communication with mentors through dedicated groups for idea discussions and course-related queries [24]
- Students receive comprehensive training on paper submission methods, literature review techniques, and experimental design methodologies [23][28]
Latest Tsinghua University Survey! Multi-Sensor Fusion Perception in Embodied AI: Background, Methods, Challenges
具身智能之心· 2025-06-27 08:36
Core Insights
- The article emphasizes the significance of embodied AI and multi-sensor fusion perception (MSFP) as a critical pathway to achieving general artificial intelligence (AGI) through real-time environmental perception and autonomous decision-making [3][4]

Group 1: Importance of Embodied AI and Multi-Sensor Fusion
- Embodied AI represents a form of intelligence that operates through physical entities, enabling autonomous decision-making and action in dynamic environments, with applications in autonomous driving and robotic swarm intelligence [3]
- Multi-sensor fusion is essential for robust perception and accurate decision-making in embodied AI systems, integrating data from sensors like cameras, LiDAR, and radar to achieve comprehensive environmental awareness [3][4]

Group 2: Limitations of Current Research
- Existing AI-based MSFP methods have shown success in fields like autonomous driving but face inherent challenges in embodied AI applications, such as the heterogeneity of cross-modal data and temporal asynchrony between different sensors [4][7]
- Current reviews often focus on single tasks or research areas, limiting their applicability to researchers in related fields [7][8]

Group 3: Structure and Contributions of the Research
- The article organizes MSFP research from various technical perspectives, covering different perception tasks, sensor data types, popular datasets, and evaluation standards [8]
- It reviews point-level, voxel-level, region-level, and multi-level fusion methods, as well as collaborative perception among multiple embodied agents and infrastructure [8][21]

Group 4: Sensor Data and Datasets
- Various sensor types are discussed, including camera data, LiDAR, and radar, each with unique advantages and challenges in environmental perception [10][12]
- The article presents several datasets used in MSFP research, such as KITTI, nuScenes, and Waymo Open, detailing their modalities, scenarios, and the number of frames [12][13][14]

Group 5: Perception Tasks
- Key perception tasks include object detection, semantic segmentation, depth estimation, and occupancy prediction, each contributing to the overall understanding of the environment [16][17]

Group 6: Multi-Modal Fusion Methods
- The article categorizes multi-modal fusion methods into point-level, voxel-level, region-level, and multi-level fusion, each with specific techniques to enhance perception robustness [21][22][23][24][28]

Group 7: Multi-Agent Fusion Methods
- Collaborative perception techniques are highlighted as essential for integrating data from multiple agents and infrastructure, addressing challenges like occlusion and sensor failures [35][36]

Group 8: Time Series Fusion
- Time series fusion is identified as a key component of MSFP systems, enhancing perception continuity across time and space through various query-based fusion methods [38][39]

Group 9: Multi-Modal Large Language Model (LLM) Fusion
- The integration of multi-modal data with LLMs is explored, showcasing advances in tasks like image description and cross-modal retrieval, with new datasets designed to enhance embodied AI capabilities [47][50]
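As a concrete illustration of the voxel-level fusion category above, the usual first step is discretizing a point cloud into a voxel grid, after which per-voxel features are computed and fused with other modalities. A minimal sketch of that voxelization step (an illustrative assumption, not code from the survey):

```python
from collections import defaultdict

def voxelize(points, voxel_size):
    """Group 3-D points into voxels by integer-dividing each coordinate by
    voxel_size. Returns {voxel_index: [points]} - the first step of
    voxel-level fusion, before per-voxel features (mean point, point
    count, learned embeddings, ...) are computed and fused."""
    grid = defaultdict(list)
    for x, y, z in points:
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        grid[key].append((x, y, z))
    return grid

# Three toy LiDAR points: two fall in the origin voxel, one in a neighbor.
points = [(0.1, 0.2, 0.0), (0.3, 0.1, 0.4), (1.2, 0.0, 0.0)]
grid = voxelize(points, voxel_size=1.0)
print(len(grid))             # 2 occupied voxels
print(len(grid[(0, 0, 0)]))  # 2 points in the origin voxel
```

The voxel size trades resolution for cost: smaller voxels preserve fine geometry but multiply the number of cells downstream fusion layers must process.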
Latest Tsinghua University Survey! How Is Multi-Sensor Fusion in Intelligent Driving Developing Today?
自动驾驶之心· 2025-06-26 12:56
Group 1: Importance of Embodied AI and Multi-Sensor Fusion Perception
- Embodied AI is a crucial direction in AI development, enabling autonomous decision-making and action through real-time perception in dynamic environments, with applications in autonomous driving and robotics [2][3]
- Multi-sensor fusion perception (MSFP) is essential for robust perception and accurate decision-making in embodied AI, integrating data from sensors like cameras, LiDAR, and radar to achieve comprehensive environmental awareness [2][3]

Group 2: Limitations of Current Research
- Existing AI-based MSFP methods have shown success in fields like autonomous driving but face inherent challenges in embodied AI, such as the heterogeneity of cross-modal data and temporal asynchrony between different sensors [3][4]
- Current reviews on MSFP often focus on single tasks or research areas, limiting their applicability to researchers in related fields [4]

Group 3: Overview of MSFP Research
- The paper discusses the background of MSFP, including various perception tasks, sensor data types, popular datasets, and evaluation standards [5]
- It reviews multi-modal fusion methods at different levels, including point-level, voxel-level, region-level, and multi-level fusion [5]

Group 4: Sensor Data and Datasets
- Various sensor data types are critical for perception tasks, including camera, LiDAR, and radar data, each with unique advantages and limitations [7][10]
- The paper presents several datasets used in MSFP research, such as KITTI, nuScenes, and Waymo Open, detailing their characteristics and the types of data they provide [12][13][14]

Group 5: Perception Tasks
- Key perception tasks include object detection, semantic segmentation, depth estimation, and occupancy prediction, each contributing to the overall understanding of the environment [16][17]

Group 6: Multi-Modal Fusion Methods
- Multi-modal fusion methods are categorized into point-level, voxel-level, region-level, and multi-level fusion, each with specific techniques to enhance perception robustness [20][21][22][27]

Group 7: Multi-Agent Fusion Methods
- Collaborative perception techniques integrate data from multiple agents and infrastructure, addressing challenges like occlusion and sensor failures in complex environments [32][34]

Group 8: Time Series Fusion
- Time series fusion is a key component of MSFP systems, enhancing perception continuity across time and space, with methods categorized into dense, sparse, and hybrid queries [40][41]

Group 9: Multi-Modal Large Language Model (MM-LLM) Fusion
- MM-LLM fusion combines visual and textual data for complex tasks, with various methods designed to enhance the integration of perception, reasoning, and planning capabilities [53][54][57][59]
Expert Interview Roundup: Humanoid Robot Training Is Spawning Inference-Specific Chips
阿尔法工场研究院· 2025-06-18 11:24
Group 1: Electronic Components Sector
- The electronic components sector has risen strongly, gaining over 5%, indicating strong market expectations for the sector [1]
- Demand for high-performance, miniaturized, and integrated electronic components continues to rise, driven by the upgrade trend in terminal products like 5G smartphones and smart wearables [1]
- The number and performance requirements of components in 5G smartphones are significantly higher than in 4G smartphones, particularly for core parts like RF front-ends, filters, and IC substrates, driving growth in the PCB and upstream materials market [1]
- The government has introduced multiple policies to support the electronic components industry, including tax incentives and special subsidies, aimed at self-sufficiency and breakthroughs in key technologies [1]
- Under the dual pressures of international trade friction and supply-chain security, domestic manufacturers are gaining greater market space and policy benefits, making domestic substitution a key industry theme [1]
- Companies like Huadian Co., Shengnan Circuit, and Zhongjing Electronics are well positioned in high-density HDI boards and other niche markets, showing good growth potential [1]

Group 2: Computing Power and Optical Networks
- In 2024, over 90% of new resources will come from large or super-large projects, with high-power intelligent computing centers accounting for 40%, indicating a shift of core capacity toward the "East Data, West Computing" model [2]
- Dongshan Precision plans to invest nearly 6 billion RMB to fully acquire Solstice Optoelectronics, which specializes in 10G to 800G optical modules serving data centers and 5G base stations [2]
- Hollow-core optical fiber is becoming a key area for next-generation communication infrastructure thanks to its ultra-low latency and high bandwidth, despite facing standards and cost barriers [2]

Group 3: Memory Prices and A-Share Storage Industry Impact
- Major DRAM manufacturers Samsung, SK Hynix, and Micron have announced a halt to DDR4 memory chip production, marking the end of the DDR4 product lifecycle [3]
- Their collective exit has sharply contracted supply, with DDR4 prices surging 53% in May, the largest increase since 2017 [3]
- This supply-side-driven price increase represents a structural opportunity that catalyzes the storage industry and the domestic-substitution process [3]
- As global suppliers exit, Chinese manufacturers are poised to rapidly gain share in the mid-to-low-end DDR4/LPDDR4 segments [3]
- Micron will retain DDR4 shipments only for long-term automotive and industrial clients, allowing PC and consumer orders to shift to domestic manufacturers [3]

Group 4: AI and Robotics
- The surge in token generation has driven computing demand from GB-level to TB-level, creating strong demand for inference-specific chips like NVIDIA Blackwell [4]
- The convergence of "information robots" and "embodied AI" is shifting humanoid robot training from the physical world to Omniverse simulation training and Thor deployment [4]