具身智能之心
The Ultimate Spatial Intelligence Challenge: MMSI-Video-Bench Is Here
具身智能之心· 2026-01-06 00:32
Core Insights
- The article discusses the launch of MMSI-Video-Bench, a comprehensive benchmark for evaluating spatial intelligence in multimodal large language models (MLLMs), emphasizing the need for models to understand and interact with complex real-world environments [1][5][25].

Group 1: Benchmark Features
- MMSI-Video-Bench takes a systematic approach to assessing models' spatial perception, focusing on spatial construction and motion understanding [5][6].
- The benchmark evaluates high-level decision-making based on spatiotemporal information, including memory updating and multi-view integration [6][7].
- It comprises five main task types and 13 subcategories, covering planning and prediction capabilities [9].

Group 2: Model Performance
- Even the best-performing model, Gemini 3 Pro, achieved only 38% accuracy, a gap of nearly 60 percentage points from human-level performance [10][14].
- The evaluation exposed deficiencies in models' spatial construction, motion understanding, planning, and prediction capabilities [14][16].
- Detailed error analysis identified five main error types affecting model performance, including detailed grounding errors and geometric reasoning errors [16][20].

Group 3: Data Sources and Evaluation
- The video data for MMSI-Video-Bench is sourced from 25 public datasets and one self-built dataset, covering diverse real-world scenarios [11].
- The benchmark supports targeted assessment of specific capabilities in indoor scene perception, robotics, and grounding [11].

Group 4: Future Directions
- The article suggests that introducing 3D spatial cues could enhance model understanding and reasoning capabilities [21][26].
- It emphasizes the ongoing challenge of designing models that can effectively use spatial cues, noting that current failures stem from fundamental reasoning limitations rather than a lack of explicit reasoning steps [26].
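A benchmark organized into task types and subcategories, like the one described above, is typically scored by aggregating per-question correctness into per-subcategory and overall accuracy. A minimal sketch (the subcategory names below are illustrative stand-ins, not the actual MMSI-Video-Bench taxonomy):

```python
from collections import defaultdict

def score_benchmark(results):
    """Aggregate (subcategory, is_correct) pairs into overall and
    per-subcategory accuracy."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for subcat, ok in results:
        totals[subcat] += 1
        correct[subcat] += int(ok)
    per_cat = {c: correct[c] / totals[c] for c in totals}
    overall = sum(correct.values()) / sum(totals.values())
    return overall, per_cat

# Toy run with two illustrative subcategories.
overall, per_cat = score_benchmark([
    ("spatial_construction", True),
    ("spatial_construction", False),
    ("motion_understanding", True),
    ("motion_understanding", True),
])
```

Reporting both granularities is what lets the paper attribute the overall 38% score to specific weaknesses such as spatial construction or motion understanding.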
A Survey of Nearly 300 Works! The Evolution of Manipulation Tasks Through the Lens of High-Level Planning and Low-Level Control
具身智能之心· 2026-01-06 00:32
Core Insights
- The article discusses transformative advances in robotic manipulation driven by rapid progress in visual, language, and multimodal learning, emphasizing the role of large foundation models in enhancing robots' perception and semantic representation capabilities [1][2].

Group 1: High-Level Planning
- High-level planning clarifies action intentions, organizes sequences, and allocates environmental attention, providing structured guidance for low-level execution [4].
- Its core components are task decomposition and decision guidance, integrating multimodal information to answer "what to do" and "in what order" [4].
- Task planning based on large language models (LLMs) maps natural language to task steps, with methods such as SayCan and Grounded Decoding enhancing execution-skill selection and planning [5].
- Multimodal large language models (MLLMs) break the limits of text-only input by integrating visual and language reasoning; models such as PaLM-E and VILA perform strongly on embodied tasks [8].
- Code generation converts plans into executable programs, improving the precision of language-based planning through methods such as Code as Policies and Demo2Code [9].
- Motion planning uses LLMs and VLMs to generate continuous motion targets, linking high-level reasoning with low-level trajectory optimization [10].
- Affordance learning focuses on establishing intrinsic associations between perception and action across geometric, visual, semantic, and multimodal dimensions [11].
- 3D scene representation transforms environmental perception into structured action proposals, bridging perception and action through techniques such as Gaussian splatting [12].

Group 2: Low-Level Learning Control
- Low-level control translates high-level plans into precise physical actions, addressing the "how to do it" aspect of robotic manipulation [14].
- Learning strategies for skill acquisition fall into three main types, including pre-training and model-free reinforcement learning [16].
- Input modeling defines how robots perceive the world, emphasizing the integration of multimodal signals through reinforcement learning and imitation learning [18].
- Visual-action models use both 2D and 3D visual inputs to improve action generation, while vision-language-action models integrate semantic, spatial, and temporal information [19].
- Additional modalities such as tactile and auditory signals improve robustness in contact-rich manipulation scenarios [20].

Group 3: Challenges and Future Directions
- Despite major technological advances, robotic manipulation faces four core challenges: the lack of universal architectures, data and simulation bottlenecks, insufficient multimodal physical interaction, and safety and collaboration issues [23][27][28][29].
- Future research directions include developing a "robotic brain" with flexible modal interfaces, establishing autonomous data-collection mechanisms, enriching multimodal physical interaction, and ensuring safety in human-robot collaboration [30].
- The review calls for a unified framework integrating high-level planning and low-level control, focused on overcoming data-efficiency, physical-interaction, and safety-collaboration bottlenecks to move robotic manipulation from the laboratory to real-world applications [31].
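The SayCan-style skill selection mentioned above combines a language model's estimate of how useful a skill is for the instruction with an affordance model's estimate of whether the skill can succeed in the current state, picking the skill that maximizes the product. A minimal sketch (the scores below are hardcoded stand-ins for real model outputs):

```python
def select_skill(llm_scores, affordance_scores):
    """SayCan-style selection: choose the skill maximizing
    P(skill useful for instruction) * P(skill succeeds in current state)."""
    combined = {s: llm_scores[s] * affordance_scores[s] for s in llm_scores}
    return max(combined, key=combined.get), combined

# Toy example: "pick up the cup" when the cup is out of reach.
llm_scores = {"grasp_cup": 0.9, "move_to_cup": 0.5, "open_drawer": 0.1}
affordance = {"grasp_cup": 0.2, "move_to_cup": 0.95, "open_drawer": 0.8}
best, scores = select_skill(llm_scores, affordance)
```

Even though the language model rates `grasp_cup` highest, the low affordance score (the cup is unreachable) steers the policy toward `move_to_cup` first, which is the grounding behavior SayCan is built around.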
Multiple Embodied-AI Companies Are Advancing Toward IPOs......
具身智能之心· 2026-01-05 09:28
The community recently shared updates on several companies in the IPO process, all facing a major test, with many more queuing up to prepare. The summary below is compiled from publicly available information; corrections via the backend are welcome.

1) Unitree Robotics (宇树科技)
On November 15, 2025, Unitree completed its IPO listing-tutoring process; the announcement shows Unitree intends to apply for a domestic (A-share) IPO, per information on the China Securities Regulatory Commission website.

2) Galaxy General (银河通用)
In December 2025, multiple media outlets reported that Galaxy General had completed its shareholding reform and begun preparing for a Hong Kong listing.

3) AgiBot (智元机器人)
AgiBot completed its shareholding reform in March 2025 and announced in November that it plans a Hong Kong IPO. In July 2025, through its shareholding platform, it planned to acquire a 63.62% stake in the STAR Market-listed company 上纬新材, which the market read as a key step toward a "backdoor listing." Although AgiBot responded that "this action is solely an acquisition of controlling interest and does not constitute a backdoor listing as defined by the Measures for the Administration of Major Asset Restructuring," the industry still viewed the move as a significant step toward accelerating its listing.

4) Leju Robotics (乐聚机器人)
On October 30, 2025, Leju Intelligence (Shenzhen) Co., Ltd. completed its listing-tutoring filing with the Shenzhen Securities Regulatory Bureau, with Orient Securities as the tutoring broker.

5) DeepRobotics (云深处)
On December 23, 2025, ...
Rumors Claim a Leading Embodied-AI Company's IPO "Green Channel" Was Halted; the Company Officially Responds......
具身智能之心· 2026-01-05 03:30
Reprinted from The Paper (澎湃新闻); shared for academic purposes only.

Background: Unitree Robotics (宇树科技) submitted its listing-tutoring filing materials on July 8, 2025, with CITIC Securities as the tutoring institution.

Recently, a media outlet published a report on Unitree's listing claiming that a so-called "green channel" had been halted; the report was widely reposted by media outlets, online platforms, and self-media accounts.

Yesterday, Unitree formally responded to the relevant media: the report's account of the company's listing work is inconsistent with the facts, and the company has not applied for any "green channel." The report has misled the public and seriously infringed the company's legitimate rights; the company has reported the matter to the competent authorities, urged the parties involved to retract the false report, and solemnly stated that it reserves the right to pursue legal accountability. Unitree added that its listing work is proceeding normally, that progress will be disclosed in accordance with laws and regulations, and thanked the public for its concern and support.

Earlier on the 4th, reports had claimed that Unitree's A-share "green channel" was halted while the regular listing process continued, allegedly because "the bubble in the robotics sector is too large" and regulators hoped to cool it down. Subsequently, a screenshot attributed to Unitree founder and chairman Wang Xingxing (王兴兴) ...
Full-Time / Part-Time / Internship! 具身智能之心 Is Hiring Operations, Editing, and Sales Staff
具身智能之心· 2026-01-05 03:30
具身智能之心 is a leading technical content platform in the embodied-AI field, producing extensive coverage of frontier technology, courses, industry overviews, financing, products, and policy.

The platform is in a growth phase and, driven by business needs, is recruiting editors, operations, and sales staff from among its followers to keep creating value for the field. Full-time and internship positions are available (all roles except the editing internship are on-site).

Operations
Run the WeChat public account, Xiaohongshu, and community groups, growing follower engagement and reach. We hope you have solid operations skills and an understanding of how self-media platforms work.

Editing
Create and edit content for the public account. We hope you have a solid technical foundation and content-creation experience on platforms such as Zhihu or WeChat.

Sales
Promote and sell the platform's courses, hardware, and other products. We hope you have a sales background and an understanding of embodied-AI user needs and the market.

If you are interested in growing with us, add Feng Ge on WeChat: oooops-life ...
Latest from He Wang's (王鹤) Team! Solving VLA Models' Lack of Precise Geometric Information
具身智能之心· 2026-01-05 01:03
Core Insights
- The article discusses the development of the StereoVLA model, which enhances Vision-Language-Action (VLA) models by integrating stereo vision to address spatial perception challenges in robotic manipulation [1][4][16].

Group 1: Challenges in Existing VLA Models
- Current VLA models rely primarily on single-view RGB images, which lack precise spatial geometric information, making them inadequate for high-precision manipulation tasks [1][4].
- Three core challenges are identified: the limits of single-modal vision, difficulties in integrating geometric and semantic information, and the complexity of multi-camera setups [4][5][6].

Group 2: StereoVLA Technical Architecture
- StereoVLA features a three-layer technical architecture (feature extraction, auxiliary training, and data support), which together enhance geometric perception and semantic understanding [8][10].
- The feature extraction module efficiently integrates geometric cues from stereo vision with semantic information from single-view images, improving the model's performance [12].

Group 3: Performance Validation
- StereoVLA delivers significant improvements over existing baseline models on three key tasks: general manipulation, bar-shaped object grasping, and small-object manipulation [13][14].
- In comparative tests across camera configurations, StereoVLA showed superior robustness to camera pose variations, with success rates of 79.3%, 71.9%, and 61.3% under the small, medium, and large settings, respectively [14].

Group 4: Key Findings from Ablation Studies
- Ablation studies confirmed the necessity of the key design features: removing semantic features caused a significant drop in success rates, validating the importance of geometric-semantic integration [15][18].
- The model's depth-estimation strategy, which focuses on interaction areas, improved success rates by 18% over uniform sampling across the entire image [18].

Group 5: Limitations and Future Directions
- While StereoVLA marks a significant advance in integrating stereo vision with VLA models, areas for optimization remain, such as handling long-term dependencies and improving feature-extraction quality [16][18].
- Future work may extend the model to humanoid robots and explore additional stereo-vision foundation models to further improve geometric feature quality [18].
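Geometric-semantic integration of the kind described above is often realized by concatenating the two per-token feature streams and projecting back to the model dimension. This numpy sketch illustrates that pattern only; the dimensions, names, and random weights are assumptions, not StereoVLA's actual design:

```python
import numpy as np

def fuse_features(geo_feat, sem_feat, w_proj):
    """Concatenate per-token geometric and semantic features,
    then project back to the model dimension."""
    fused = np.concatenate([geo_feat, sem_feat], axis=-1)  # (tokens, d_geo + d_sem)
    return fused @ w_proj                                  # (tokens, d_model)

rng = np.random.default_rng(0)
tokens, d_geo, d_sem, d_model = 16, 32, 64, 48
geo = rng.standard_normal((tokens, d_geo))   # stand-in for stereo-derived depth cues
sem = rng.standard_normal((tokens, d_sem))   # stand-in for single-view semantic features
w = rng.standard_normal((d_geo + d_sem, d_model))
out = fuse_features(geo, sem, w)
```

The ablation result above (success rates collapsing without semantic features) corresponds to zeroing out one of the two streams feeding this fusion step.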
AAAI 2026 | XPeng and Peking University Propose a Visual Token Pruning Method Tailored for VLA Models
具身智能之心· 2026-01-05 01:03
Core Viewpoint
- The article discusses FastDriveVLA, a new framework for efficient visual token pruning in end-to-end autonomous driving systems, which significantly reduces computational cost and improves inference efficiency [1][8].

Group 1: Research Background and Problem
- End-to-end autonomous driving shows great potential to transform future transportation systems, learning the entire driving process within a unified framework and reducing errors in information transfer between modules [7].
- Existing VLA models convert visual inputs into large numbers of visual tokens, incurring significant computational overhead and inference latency that hinder real-world deployment [7][8].
- Prior token-reduction research has limitations in driving scenarios: new designs often require retraining the entire model, and pruning strategies based on attention or similarity can retain irrelevant information [7][8].

Group 2: Methodology and Innovations
- FastDriveVLA introduces a novel reconstruction-based visual token pruning framework tailored to end-to-end autonomous driving [8].
- The team hypothesized that foreground-related visual tokens are more valuable than background-related ones, leading to the nuScenes-FG dataset of 241,000 images with foreground annotations [2][13].
- ReconPruner, a lightweight plug-and-play pruner, identifies and selects meaningful foreground visual tokens using a masked-image-modeling approach for pixel reconstruction [16][19].

Group 3: Experimental Results
- FastDriveVLA achieved state-of-the-art (SOTA) performance on open-loop planning benchmarks on the nuScenes dataset, with significant efficiency gains [2][20].
- Reducing visual tokens from 3,249 to 812 cut FastDriveVLA's FLOPs by roughly 7.5×, prefill time by 3.7×, and decode time by 1.3× [26][27].
- The framework outperformed existing methods across pruning ratios; at a 50% pruning rate it maintained balanced performance across all metrics [25][28].

Group 4: Efficiency Analysis
- Efficiency was analyzed in terms of FLOPs and CUDA latency, showing a significant reduction in compute while maintaining high performance [26][27].
- At a 25% pruning rate, FastDriveVLA achieved the best performance across all evaluation metrics, indicating that focusing on foreground-related visual tokens is crucial for autonomous-driving performance [28].
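At selection time, pruning by a per-token relevance score reduces to keeping the top-k tokens; the reported reduction from 3,249 to 812 tokens corresponds to a 25% keep ratio. A minimal numpy sketch of that selection step (the random scores stand in for ReconPruner's learned foreground scorer, and the token dimension is a toy value):

```python
import numpy as np

def prune_tokens(tokens, scores, keep_ratio):
    """Keep the top-k visual tokens by relevance score,
    preserving their original sequence order."""
    k = max(1, int(len(tokens) * keep_ratio))
    keep_idx = np.sort(np.argsort(scores)[-k:])  # top-k indices, in sequence order
    return tokens[keep_idx], keep_idx

rng = np.random.default_rng(0)
tokens = rng.standard_normal((3249, 8))  # 3,249 visual tokens, toy feature dim 8
scores = rng.random(3249)                # stand-in for foreground relevance scores
kept, idx = prune_tokens(tokens, scores, keep_ratio=0.25)
```

Keeping only a quarter of the tokens shortens every subsequent transformer layer's sequence, which is where the reported FLOPs and prefill-time savings come from.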
RoboMIND 2.0: A Large-Scale Dual-Arm Mobile Manipulation Dataset for Generalizable Embodied Intelligence
具身智能之心· 2026-01-05 01:03
Recently, teams from 北京人形机器人 and Peking University released RoboMIND 2.0, a large-scale dual-arm mobile manipulation dataset for generalizable embodied intelligence. By integrating 310K trajectories from 6 heterogeneous robot platforms, multimodal perception (including touch), high-fidelity digital-twin assets, and a standardized annotation system, it fills gaps in existing datasets along the dimensions of dual-arm coordination, mobile manipulation, and cross-embodiment generalization. The accompanying MIND-2 fast-slow dual-system framework (high-level VLM planning + low-level VLA execution), trained via offline reinforcement learning on both successful and failed trajectories, significantly outperforms traditional imitation learning and existing VLA models on long-horizon complex tasks and multi-robot collaboration scenarios, providing both data support and an algorithmic paradigm for advancing generalizable robot manipulation.

Bottlenecks and Pain Points in Robot Manipulation
In robot manipulation, data-driven imitation learning has become the core path beyond the limits of traditional control, but existing datasets and algorithms still face multiple bottlenecks that seriously constrain generalizable deployment in real-world scenarios:

1. Datasets are single-dimensional and lack comprehensive diversity
Most existing datasets focus on diversity along a single dimension (e.g., a single robot embodiment, task type, or environment), which cannot support cross-scenario, cross-hardware generalization. For example, most mainstream datasets center on single-arm fixed-base manipulation data and lack large-scale dual-arm collaborative and mobile manipulation ...
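The fast-slow dual-system pattern described above (a slow high-level planner periodically setting subgoals, a fast low-level executor acting every step) can be sketched as a two-rate control loop. Everything here, including the function names and re-planning rate, is an illustrative stand-in rather than MIND-2's actual interface:

```python
def run_dual_system(plan_fn, act_fn, observe_fn, steps, replan_every=10):
    """Slow loop: re-plan every `replan_every` steps with the high-level model.
    Fast loop: emit a low-level action at every step, conditioned on the plan."""
    log = []
    plan = None
    for t in range(steps):
        if t % replan_every == 0:             # slow system: VLM-style planner
            plan = plan_fn(observe_fn(t))
        action = act_fn(plan, observe_fn(t))  # fast system: VLA-style executor
        log.append((t, plan, action))
    return log

# Toy stand-ins for the two systems.
log = run_dual_system(
    plan_fn=lambda obs: f"subgoal@{obs}",
    act_fn=lambda plan, obs: f"act({plan},{obs})",
    observe_fn=lambda t: t,
    steps=25,
    replan_every=10,
)
```

Decoupling the two rates is what lets a heavyweight planner handle long-horizon structure while a lightweight policy keeps up with real-time control.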
5,000 Units Delivered in Half a Year! This Company Opens the Embodied-AI Field's First Financing of 2026
具身智能之心· 2026-01-05 01:03
Core Viewpoint
- The article highlights the rapid growth and technological advances of Zhishen Technology, an embodied-intelligence startup that has completed multiple financing rounds and reached significant product-delivery milestones within a short time frame [3][30].

Financing and Investment
- Zhishen Technology announced the completion of several financing rounds totaling several hundred million yuan, with investments from industry-related capital, indicating strong recognition from the industrial ecosystem [3][4].
- The financing round focuses on supporting industrial upgrades and technological innovation to strengthen overall competitiveness [4].

Product Development and Production
- In just two years, Zhishen Technology completed product design and core-component R&D and established its own factory, reaching mass production of 6,000 units within six months of starting production in June 2025, with revenue exceeding 100 million yuan [4][30].
- The company's two main models are the Steel Coin L1 (15 kg, designed for security and inspection) and the Copper Hammer M1 (35 kg, suited to heavy-duty scenarios) [13][15].

Technological Innovation
- Zhishen Technology has achieved breakthroughs in key technologies, including high-power joint modules and bionic quadrupeds, covering the full chain from core components to complete machines [8][19].
- A mass-production quality-assurance system ensures each product can withstand extreme conditions and long-term use [12].

Market Position and Competitive Advantage
- The company's competitive edge lies in delivering clear industry value through a closed loop spanning components, hardware, software algorithms, and solutions [18].
- Its MATRiX simulation platform addresses high costs and inefficiencies in deployment, significantly improving development and implementation efficiency for partners [21].

Industry Impact and Future Outlook
- 2025 was a particularly busy year for Zhishen Technology, with multiple product launches and competition appearances, including winning the IROS 2025 quadruped robot challenge [25][27].
- The company is actively applying its robots in scenarios such as power inspection and firefighting, and is exploring new business models in the entertainment sector [33][34].
- The recent financing reflects market recognition of Zhishen Technology's production capacity, technological depth, and ecosystem strategy, positioning it as a key player in advancing human-machine coexistence [36].
Latest Work from He Wang's (王鹤) Team! Addressing VLA Models' Reliance on Single-View Images and Lack of Precise Geometric Information
具身智能之心· 2026-01-04 08:58
Core Viewpoint
- The article discusses the development of the StereoVLA model, which integrates stereo vision into Vision-Language-Action (VLA) models to enhance spatial perception and improve robotic manipulation capabilities.

Group 1: Challenges in Existing VLA Models
- Existing VLA models face three core challenges in spatial perception: the limits of single-modal vision, difficulties in integrating geometric and semantic information, and the constraints of current sensor technologies [4][5][6].

Group 2: Technical Architecture of StereoVLA
- StereoVLA is built on a three-layer technical architecture (feature extraction, auxiliary training, and data support), enabling deep integration of geometric perception and semantic understanding [8][10].
- The feature extraction module efficiently combines geometric cues from stereo vision with semantic information from single-view images, enhancing the model's performance [12].

Group 3: Performance Validation
- StereoVLA shows significant improvements on three key tasks compared to baseline models, achieving near-perfect success rates in specific object-manipulation scenarios [13].
- Across camera configurations, StereoVLA shows superior robustness to camera pose variations, outperforming other setups in various scenarios [14][17].

Group 4: Key Findings from Ablation Studies
- Ablation studies confirm the necessity of the key design features: removing semantic features leads to a significant drop in success rates, highlighting the importance of geometric-semantic integration [15][18].

Group 5: Limitations and Future Directions
- While StereoVLA represents a breakthrough in integrating stereo vision with VLA models, areas for optimization remain, including better long-term dependency capture and adaptation to multi-robot scenarios [16][18].