具身智能之心
Morgan Stanley predicts 25 humanoid robot companies will dominate the industry, with Unitree (宇树) and Zhiyuan (智元) absent from the list
具身智能之心· 2025-12-12 07:59
Core Insights
- Morgan Stanley predicts that 25 humanoid robot companies will dominate the industry, with 7 Chinese companies listed [2][3]
- The Chinese companies include Baidu, Alibaba, Horizon Robotics, Joyson Electronics, iFlytek, Desay SV, and Hesai Technology, spanning sectors such as AI, automotive, and electronics manufacturing [3][4]
- The report weights component and module suppliers over traditional humanoid robot manufacturers, highlighting the critical role of companies providing AI chips, visual sensors, precision actuators, and power management chips [3][4]

Company and Industry Summary
- The 7 Chinese companies identified are significant players in their respective fields, with a focus on AI, automotive intelligence, speech recognition, and electronics manufacturing [3]
- The absence of companies such as Unitree and Zhiyuan raised questions about the report's rigor, but Morgan Stanley justified the omission by its focus on the foundational components essential to the humanoid robot industry [4]
- Nearly 150 humanoid robot startups have emerged in the Chinese market, indicating growing interest and investment in the sector, regardless of potential market bubbles [4]
GLaD: Distilling 3D geometric priors into VLA models pushes task success rates above 94%
具身智能之心· 2025-12-12 01:22
Author丨Minghao Guo et al. Editor丨具身智能之心. This article is shared for academic purposes only; contact us for removal in case of infringement.

1. Research Background and Core Motivation
Vision-Language-Action (VLA) models are a key technology in embodied intelligence, letting robots generate control actions directly from visual observations and natural-language instructions. Most existing VLA models rely on 2D visual encoders such as CLIP and SigLIP; these encoders excel at capturing semantic correspondences between images and text but cannot encode 3D spatial information (e.g., depth, object pose, spatial relations).

This deficiency leads to misplaced attention in manipulation tasks. As shown in Figure 1, on the tasks "move the tablecloth from the corner of the table to its edge" and "pick up the black bowl between the plate and the ramekin and place it on the plate," conventional VLA models attend to irrelevant regions, fail to localize the task-relevant objects precisely, and thus complete the manipulation less accurately.

To address this, the research team proposes the GLaD framework. Its core idea is to inject 3D geometric priors into a VLA model via knowledge distillation, giving the model both semantic understanding and spatial reasoning without relying on extra depth sensors or 3D annotations (a minimal distillation sketch follows below). ...
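The article does not include GLaD's training code, so the following is only a minimal sketch of feature-level knowledge distillation as described above: a frozen, geometry-aware teacher encoder supervises the VLA's 2D vision encoder through a cosine-similarity loss that is added to the usual action-prediction loss. All module and variable names (Geometry3DTeacher, VisionEncoder2D, lambda_distill, etc.) are hypothetical placeholders, not GLaD's actual API.

```python
# Minimal sketch of distilling 3D geometric priors into a VLA vision encoder.
# Assumptions: `student` and `teacher` are placeholder encoder modules;
# the real GLaD architecture and losses may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistilledVisionEncoder(nn.Module):
    def __init__(self, student: nn.Module, teacher: nn.Module, dim_s: int, dim_t: int):
        super().__init__()
        self.student = student                      # 2D encoder (CLIP/SigLIP-style), trainable
        self.teacher = teacher.eval()               # frozen, geometry-aware teacher
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        self.proj = nn.Linear(dim_s, dim_t)         # map student features into teacher space

    def forward(self, images: torch.Tensor):
        feat_s = self.student(images)               # (B, dim_s) semantic features
        with torch.no_grad():
            feat_t = self.teacher(images)           # (B, dim_t) geometry-aware features
        # Cosine distillation loss: pull student features toward the 3D prior.
        loss_distill = 1.0 - F.cosine_similarity(self.proj(feat_s), feat_t, dim=-1).mean()
        return feat_s, loss_distill

# During VLA training the distillation term is simply added to the action loss:
#   loss = loss_action + lambda_distill * loss_distill
```

The design choice illustrated here is that distillation only adds a loss term, so the base VLA pipeline and its data requirements stay unchanged, which matches the article's claim that no extra depth sensors or 3D labels are needed.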
Rejection ≠ failure! These high-impact papers were all rejected by top conferences
具身智能之心· 2025-12-12 01:22
Core Insights
- Waymo has released a detailed blog post on its AI strategy centered on its foundation model, emphasizing distillation methods that produce high-efficiency models for onboard operation [1][2]
- Jeff Dean highlighted the significance of knowledge distillation, comparing it to the creation of the Gemini Flash model and underscoring distillation's role in AI model efficiency [1][2]

Historical Context of Rejected Papers
- Many foundational AI technologies, such as optimizers for large models and computer vision techniques, were initially rejected by top conferences, a recurring pattern of breakthroughs being overlooked [6]
- Notable figures in AI, including Geoffrey Hinton and Yann LeCun, have had pioneering work rejected that was later recognized as transformative [6]

Case Studies of Rejected Innovations
- LSTM, a milestone for sequence modeling, was rejected by NIPS in 1996 but later became crucial to speech recognition and machine translation, illustrating how delayed recognition can be [7][10]
- SIFT, long the dominant algorithm in computer vision, was rejected by ICCV and CVPR for perceived complexity, yet proved vital for real-world image processing [11][13]
- Dropout, a key regularization method for deep neural networks, was initially rejected for its radical approach but later became essential for training deep networks effectively [17][19]
- Word2Vec was rejected at ICLR yet became a cornerstone of NLP thanks to its efficiency and practicality, eventually receiving recognition for its impact [20][24]
- YOLO transformed object detection by prioritizing speed over precision; it was rejected for its perceived shortcomings but later became a widely adopted framework in industry [28][30]

Reflection on Peer Review Limitations
- The peer review system often fails to recognize disruptive innovations, producing a systematic cognitive lag in evaluating groundbreaking research [40][41]
- Equating mathematical complexity with research contribution can block the acceptance of simpler yet effective methods [41]
- These historical examples show that a work's real impact is determined not by initial review outcomes but by its long-term relevance and problem-solving value [43][47]
NeurIPS'25! AutoSeg3D: Online 3D segmentation of anything on a single RTX 4090
具身智能之心· 2025-12-12 01:22
Editor丨具身智能之心. This article is shared for academic purposes only; contact us for removal in case of infringement.

Foreword
In the era of large models everyone is chasing scaling, and for embodied or autonomous-driving tasks people seem to assume training a model requires at least 8 GPUs. Taking this opportunity, I want to recommend a direction where a single RTX 4090 is enough to publish at a top venue: point cloud instance segmentation for embodied scenes, the topic of this paper. This is not a recommendation of a low-cost way to pad publications; I asked my student to work on this direction because I believed it was a technology that could genuinely be deployed, and unsurprisingly this paper is already being transferred into production at two companies. For embodied intelligence, VLA and the various so-called world models are fancy, but many less fancy-sounding directions can both produce papers and actually land, and I hope more low-level techniques get researched and optimized to support real industrialization. Interns are also welcome at the 无界-AutoLab joint laboratory (Shanghai) to co-create interesting technical directions :) -- Dylan老师

Paper Summary
(1) The authors observe that existing online VFM-assisted methods usually first use VFMs such as SAM to predict 2D ...
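The summary is cut off above. The pipeline it starts to describe, where VFMs such as SAM predict 2D masks that are then associated with 3D points, is commonly implemented by projecting the point cloud into the image with the camera intrinsics and extrinsics and reading off the mask label at each projected pixel. The sketch below illustrates only that generic lifting step under that assumption; it is not AutoSeg3D's code, and all function and variable names are placeholders.

```python
# Generic sketch: assign 2D instance-mask labels (e.g., from SAM) to 3D points
# by projecting the points into the camera image. Not AutoSeg3D's actual code.
import numpy as np

def lift_masks_to_points(points_w, mask, K, T_cw):
    """points_w: (N, 3) world-frame points; mask: (H, W) integer instance ids;
    K: (3, 3) camera intrinsics; T_cw: (4, 4) world-to-camera transform."""
    N = points_w.shape[0]
    pts_h = np.concatenate([points_w, np.ones((N, 1))], axis=1)   # homogeneous coords
    pts_c = (T_cw @ pts_h.T).T[:, :3]                             # camera-frame points
    labels = np.full(N, -1, dtype=np.int64)                       # -1 = unlabeled
    in_front = pts_c[:, 2] > 1e-6                                 # keep points in front of camera
    uvw = (K @ pts_c[in_front].T).T
    uv = (uvw[:, :2] / uvw[:, 2:3]).round().astype(int)           # pixel coordinates
    H, W = mask.shape
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    idx = np.flatnonzero(in_front)[valid]
    labels[idx] = mask[uv[valid, 1], uv[valid, 0]]                # read mask id per point
    return labels
```

In an online setting this labeling step is typically repeated per frame and the per-frame labels are merged across views, which is where methods in this line of work differ most.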
AAAI 2026 Oral | Can robots learn a trade by watching people? A single demonstration is enough to learn a new task!
具身智能之心· 2025-12-12 01:22
Core Insights
- The article discusses a novel approach to robot learning from human demonstration, emphasizing fine-grained action alignment between human and robot movements [3][4][8]
- The proposed method, Human2Robot, uses a new dataset (H&R) and a two-stage framework to enhance robot learning, enabling one-shot generalization to new tasks [3][4][9]

Summary by Sections
Introduction
- Existing methods rely on coarse alignment of human-robot video pairs, which often misses the fine-grained actions needed for task generalization [3][8]
Methodology
- A new dataset, H&R, consisting of 2,600 synchronized human and robot action videos, is introduced to support better learning [9]
- The Human2Robot framework consists of two stages: a Video Prediction Model (VPM) and an Action Decoder (a minimal two-stage inference sketch follows this summary) [12][16]
Video Prediction Model (VPM)
- The VPM generates robot action videos conditioned on human demonstrations, letting the model learn detailed action dynamics [13][14]
- The model captures key information about the robot's shape and human hand movements through a Spatial UNet and a Spatial-Temporal UNet [15]
Action Decoder
- The Action Decoder translates the generated video features into concrete robot movements, enabling real-time task execution without requiring continuous video input [16][20]
Experimental Results
- Human2Robot outperforms existing baselines, improving success rates by 10-20% across various tasks and demonstrating the value of detailed human-video conditioning [20][27]
- With KNN retrieval added, Human2Robot still performs well even without a direct demonstration as input, indicating robust task execution [20][27]
Generalization Capability
- Human2Robot generalizes strongly across tasks, including new positions and object instances, thanks to the clear action correspondences established by the H&R dataset [27]
Ablation Studies
- Experiments show that relying solely on human video input leads to poor performance, confirming that the video generation step is necessary for reliable action mapping [25][26]
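The summary describes the two-stage design only at a high level, so the following is a minimal sketch of how such a pipeline could be wired at inference time: a video prediction model produces robot-view features from a human demonstration, and an action decoder maps those features to joint actions. The class names, layer choices, and tensor shapes are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a Human2Robot-style two-stage inference pipeline.
# Both modules are placeholders; the real architecture (Spatial UNet,
# Spatial-Temporal UNet, etc.) is far more involved.
import torch
import torch.nn as nn

class VideoPredictionModel(nn.Module):
    """Stage 1: predict robot-view features from a human demonstration clip."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(2), nn.LazyLinear(feat_dim), nn.ReLU())

    def forward(self, human_video: torch.Tensor) -> torch.Tensor:
        # human_video: (B, T, C, H, W) -> per-frame features (B, T, feat_dim)
        return self.encoder(human_video)

class ActionDecoder(nn.Module):
    """Stage 2: map predicted visual features to robot actions."""
    def __init__(self, feat_dim: int = 256, action_dim: int = 7):
        super().__init__()
        self.head = nn.Linear(feat_dim, action_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, feat_dim) -> actions (B, T, action_dim), e.g. joint targets
        return self.head(feats)

# One-shot inference from a single human demonstration:
vpm, decoder = VideoPredictionModel(), ActionDecoder()
demo = torch.randn(1, 16, 3, 128, 128)      # one human demo clip (placeholder data)
actions = decoder(vpm(demo))                # (1, 16, 7) action sequence
```

The split mirrors the article's point that the decoder consumes generated features rather than raw video, which is what allows execution without a continuous video stream at run time.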
具身智能之心's job-seeking and referral service is now officially open!
具身智能之心· 2025-12-11 09:33
具身智能之心's job referral service is now officially open! It covers nearly 50 mainstream embodied-intelligence companies, for campus recruitment, experienced hires, and internships alike. Get reliable, well-paid positions first; resumes welcome~

We will answer these questions one by one:
- Salary structures at each company √
- Technical roadmaps and promotion paths √
- The outlook for the industry √
- Whether a job actually suits your personality √
- ......
...
Can the results of π0 and π0.5 be reproduced with only an SO-100?
具身智能之心· 2025-12-11 09:33
Core Viewpoint
- The article discusses the challenges beginners face when implementing VLA (Vision-Language-Action) models, emphasizing the need for hands-on experience and effective training methods to deploy them successfully in real-world applications [2][4]

Group 1: Challenges in VLA Implementation
- Many students report difficulty getting good results from open-source models such as GR00T and PI0, despite low training loss in simulation [2][4]
- The transition from simulation to the real world (sim2real) poses significant challenges, particularly in data collection and model training [6][7]
- Beginners often struggle with the details of data collection, model training, and deployment, leading to frustration and slow progress [4][10]

Group 2: VLA Model Components
- Data collection for VLA relies mainly on imitation learning and reinforcement learning, with an emphasis on acquiring high-quality data [6]
- Training VLA models typically requires simulation debugging and fine-tuning, especially when real-world data is limited [7]
- Deploying VLA models calls for optimization techniques such as model compression to run efficiently on edge devices (a small quantization sketch follows this summary) [9]

Group 3: Educational Initiatives
- The article introduces a practical course aimed at helping students learn VLA effectively, covering hardware, data collection, algorithms, and real-world experiments [10][12]
- The course targets people entering the field of embodied intelligence and provides hands-on experience and project support [22][25]
- The course starts on December 30, 2025, and includes a comprehensive curriculum to build participants' VLA skills [23][26]
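Model compression for edge deployment is mentioned only in passing above. As one concrete example, not something the article or course specifies, post-training dynamic quantization in PyTorch converts a policy's linear layers to int8 weights with a single call; the tiny MLP below is a stand-in, not any particular VLA model.

```python
# Example of simple post-training dynamic quantization for edge deployment.
# The small MLP policy is a placeholder; real VLA policies are far larger.
import torch
import torch.nn as nn

policy = nn.Sequential(
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 7),                 # e.g. a 7-DoF action output
)

# Quantize Linear layers to int8 weights; activations stay in float at runtime.
quantized_policy = torch.quantization.quantize_dynamic(
    policy, {nn.Linear}, dtype=torch.qint8
)

obs_feat = torch.randn(1, 512)         # placeholder observation features
action = quantized_policy(obs_feat)    # smaller, faster inference on CPU/edge hardware
```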
A world first! Psi-SynEngine, a real-world embodied data collection engine for dexterous hands, is here
具身智能之心· 2025-12-11 04:02
The following article comes from 灵初智能 (PsiBot), author PsiBot. 灵初智能 advances reinforcement-learning-based robot skill training, scenario-driven data generation and collection, and the development and deployment of end-to-end solutions, aiming to build an industry-leading general-purpose manipulation agent. Author丨PsiBot Editor丨机器之心

灵初智能 has released Psi-SynEngine, the world's first embodied-native human data collection solution. Fully developed in-house, it comprises a portable exoskeleton tactile-glove data collection kit, a large-scale in-the-wild data collection pipeline, and a cross-embodiment data transfer model built on world models and reinforcement learning; the collected human data has already been applied in real scenarios such as logistics. 灵初智能 has also released Psi-SynNet-v0, a large-scale real-world multimodal dataset covering vision, language, touch, and action. This breakthrough marks the official launch of 灵初智能's fully self-developed real-world embodied data engine.

Compared with large models and autonomous driving, data has always been the pain point of the entire embodied intelligence field. The industry's existing data collection solutions: ... 灵初智能's Psi-SynEngine fundamentally breaks through ...
Even scenarios too? The company building the brain for 智元机器人 raised 300 million yuan in four months
具身智能之心· 2025-12-11 04:02
What exactly does this newcomer do, founded less than half a year ago yet raising 200 million yuan in its angel round?

Recently 星源智机器人, the company supplying brain products for 智元机器人, closed an angel+ round of more than 100 million yuan. The proceeds will fund the development of RoboBrain Pro, the professional edition of its embodied brain, the expansion of vertical-industry solutions, and the recruitment of senior talent.

星源智 was registered in Beijing Yizhuang in the second half of the year and was incubated by 北京智源研究院 (Beijing Academy of Artificial Intelligence), with the goal of "letting robots understand the physical world and act autonomously." CEO 刘东 was previously general manager of JD's intelligent driving unit, where he led the nationwide rollout of JD's unmanned delivery vehicles; co-founder 穆亚东 is a Peking University researcher and 智源 scholar who has published more than 30 top-conference papers on embodied intelligence over the past five years.

On the product side, 星源智 offers a "general-purpose brain" plus a "compute bomb."
- Cross-embodiment RoboBrain: the same AI system can be plugged into robotic arms, AGVs, and humanoid robots without retraining. 智元's new-generation industrial interactive embodied robot "精灵G2" already ships with 星源智's embodied brain product (as a partner, 智元 invested as early as the company's angel round).
- 2070 TOPS edge compute platform: 星源智's T5 compute platform is built on the NVIDIA Jetson Thor processor, featuring powerful Tr ...
From video generation to robot manipulation: VideoVLA opens a new paradigm for general-purpose robots
具身智能之心· 2025-12-11 04:02
Author丨Yichao Shen et al. Editor丨具身智能之心. This article is shared for academic purposes only; contact us for removal in case of infringement.

In robot manipulation, Vision-Language-Action (VLA) models can already execute tasks driven by language instructions, but their generalization remains a bottleneck when handling unfamiliar objects or transferring skills across robots. Most existing approaches rely on pretrained vision-language understanding models and struggle to move beyond the scenarios covered by their training data.

The VideoVLA framework, proposed jointly by Xi'an Jiaotong University, Microsoft Research Asia, and other institutions, innovatively turns a large-scale video generation model into an end-to-end VLA system. Through a dual-objective strategy of "action prediction + visual imagination," it achieves, for the first time, robust robot generalization in unseen scenes, offering a new technical path for general-purpose robot manipulation (a sketch of the dual-objective loss follows below).

Paper title: VideoVLA: Video Generators Can Be Generalizable Robot Manipulators
Core contributions: the first work to adapt a video generation model into a general-purpose robot manipulation system; by jointly predicting action sequences and future visual outcomes, it unlocks cross-object, cross-skill ...
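The article only names the dual-objective strategy, so here is a minimal sketch of what a joint "action prediction + visual imagination" loss could look like: one head regresses the action, another predicts the next frame, and the two losses are summed. The backbone, heads, and weighting are illustrative assumptions, not VideoVLA's published implementation.

```python
# Minimal sketch of a dual-objective ("action prediction + visual imagination") loss.
# The backbone and heads are placeholders, not VideoVLA's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualObjectivePolicy(nn.Module):
    def __init__(self, feat_dim: int = 256, action_dim: int = 7, frame_dim: int = 3 * 64 * 64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(1), nn.LazyLinear(feat_dim), nn.ReLU())
        self.action_head = nn.Linear(feat_dim, action_dim)   # predict the next action
        self.frame_head = nn.Linear(feat_dim, frame_dim)     # "imagine" the next frame

    def forward(self, obs: torch.Tensor):
        feat = self.backbone(obs)
        return self.action_head(feat), self.frame_head(feat)

def dual_objective_loss(model, obs, action_gt, next_frame_gt, lam: float = 0.5):
    action_pred, frame_pred = model(obs)
    loss_action = F.mse_loss(action_pred, action_gt)                  # action prediction
    loss_imagine = F.mse_loss(frame_pred, next_frame_gt.flatten(1))   # visual imagination
    return loss_action + lam * loss_imagine

model = DualObjectivePolicy()
obs = torch.randn(2, 3, 64, 64)             # placeholder observations
loss = dual_objective_loss(model, obs, torch.randn(2, 7), torch.randn(2, 3, 64, 64))
```

The intent of the second term is the one the article emphasizes: forcing the model to predict future visual outcomes acts as a dense auxiliary signal that encourages generalization beyond the scenes seen during training.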