Is 3D Vision Over-Engineered? ByteDance's Depth Anything 3 Arrives, With Praise From Saining Xie
具身智能之心· 2025-11-17 00:47
Edited by 机器之心

Now, a plain Transformer trained with a depth-ray representation is all it takes. The work makes the case that most of today's 3D vision research is over-engineered.

On Friday, the hottest topic in the AI community was a new paper on 3D modeling. After more than a year of exploration, a team from ByteDance released Depth Anything 3 (DA3), which extends monocular depth estimation to arbitrary-view scenarios and gives computers spatial perception rivaling that of humans.

Paper: https://arxiv.org/abs/2511.10647
Project page: https://depth-anything-3.github.io

In pursuit of minimal modeling, the DA3 work rests on two key insights. The resulting method improves pose estimation by 44% over the current state of the art (SOTA) and geometry estimation by 25%. Could 3D vision really be this simple? Saining Xie, assistant professor of computer science at New York University and a well-known AI researcher, remarked that the paper reads a bit like a movie: ...
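The excerpt stops short of explaining the depth-ray representation it credits, so here is a minimal sketch of the general idea under stated assumptions: a network predicts, per pixel, a camera ray and a depth along it, and the 3D geometry follows from a single multiply-add. All tensor names and shapes below are illustrative, not the DA3 codebase.

```python
# Minimal sketch: lift per-pixel ray + depth predictions to a 3D point map.
import torch

def rays_and_depth_to_points(
    ray_origins: torch.Tensor,     # (H, W, 3) per-pixel ray origins
    ray_dirs: torch.Tensor,        # (H, W, 3) per-pixel unit ray directions
    depth: torch.Tensor,           # (H, W) predicted depth along each ray
) -> torch.Tensor:
    """Each 3D point is origin + depth * direction."""
    return ray_origins + depth.unsqueeze(-1) * ray_dirs  # (H, W, 3)

H, W = 480, 640
origins = torch.zeros(H, W, 3)                     # pinhole camera at the origin
dirs = torch.nn.functional.normalize(torch.randn(H, W, 3), dim=-1)
depth = torch.rand(H, W) * 10.0                    # depths in [0, 10) meters
points = rays_and_depth_to_points(origins, dirs, depth)
print(points.shape)                                # torch.Size([480, 640, 3])
```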
Outperforming GPT and Google: Beijing Humanoid Robot Innovation Center Open-Sources the World's Strongest Embodied VLM
具身智能之心· 2025-11-17 00:47
By 咖啡不加糖; edited by 焉知机器人

On November 14, 2025, the Beijing Embodied Intelligence Robot Innovation Center officially released Pelican-VL 1.0, an embodied vision-language model (VLM). The center not only claims performance beyond comparable GPT-5 models and Google's Gemini series, but also positions it as the "world's largest open-source embodied multimodal model," showcasing China's technical strength in embodied intelligence.

Embodied intelligence, simply put, is the technology that lets robots perceive the world, make decisions, and execute actions the way humans do. A vision-language model (VLM) serves as the robot's "eyes" and "central brain": it converts the images the robot sees into language it can reason over, then plans concrete action steps.

Figure: Pelican-VL 1.0 (the name means "pelican") is available for download on both Hugging Face and ModelScope.

Billed as a "vision-language brain," Pelican-VL 1.0 gives embodied intelligence research a strong push through its open-source release.

1. The Beijing Humanoid Robot Innovation Center and Pelican-VL ...
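As a usage illustration of the "eyes and central brain" role described above, the sketch below queries an open VLM with an image and a planning prompt through the Hugging Face transformers API. The model id is a hypothetical placeholder (the article gives no checkpoint name), and the exact processor and generation interface may differ in the official release.

```python
# Hedged sketch of querying an open embodied VLM; check the official
# Pelican-VL release pages for real checkpoint names and loading code.
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image

MODEL_ID = "X-Humanoid/Pelican-VL-7B"  # hypothetical id, not confirmed by the article

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID, device_map="auto")

image = Image.open("tabletop_scene.jpg")
prompt = "Describe the objects on the table and plan steps to stack the cups."

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```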
Four Megvii Prodigies' Embodied AI Startup Raises Nearly 1 Billion Yuan, With Alibaba's Exclusive Investment in the Spotlight
具身智能之心· 2025-11-17 00:47
Edited by 量子位

The venture-funding drama on the embodied intelligence track keeps heating up. Embodied AI company Dexmal (原力灵机) has completed financing totaling nearly 1 billion yuan. More strikingly, Alibaba appears in the latest round as the sole investor.

Hundreds of millions of yuan pour into an embodied AI dark horse

According to the latest announcement, embodied AI newcomer Dexmal has closed an A+ round worth several hundred million yuan. The round draws particular attention because Alibaba sits on the shareholder list, and as the "exclusive investor" at that. Flip back one page: in early September, Dexmal had just completed its A round, led by NIO Capital, with Hongtai Fund, Lenovo Capital, Wuxi Venture Capital (锡创投), and Zhengjing Fund (正景基金) participating; existing shareholder Legend Capital (君联资本) over-subscribed its follow-on, with Qiming Venture Partners and Ubiquant Ventures also following on. Counting this round, Dexmal has raised nearly 1 billion yuan in just over two months. The company says the funds will go mainly to robot software and hardware R&D and to real-world deployment. Keep in mind that Dexmal was only founded this March; just 20 days later it announced a 200-million-yuan angel round backed by Legend Capital, Ubiquant Ventures, Qiming ...
Microsoft & HKUST Compare Multiple Transfer Techniques: How Does a VLA Effectively Inherit the Rich Visual-Semantic Priors of a VLM?
具身智能之心· 2025-11-15 16:03
By Chuheng Zhang et al.; edited by 具身智能之心. This post is for academic sharing only; contact us to remove it in case of infringement.

In embodied intelligence, training a vision-language-action model (VLA) initialized from a large vision-language model (VLM) has become the dominant paradigm. Yet the core question remains unanswered: how does a VLA effectively inherit the rich visual-semantic priors in the VLM? The GrinningFace benchmark, proposed jointly by Microsoft Research, the Hong Kong University of Science and Technology, and collaborators, takes an emoji tabletop-manipulation task as its entry point. Through experiments in both simulation and on a real robot, it systematically compares multiple transfer techniques, revealing the key role VLM priors play in VLA generalization and giving concrete guidance for efficient knowledge transfer.

Why does VLA knowledge transfer need a dedicated benchmark?

Although current VLA training generally relies on VLM initialization, it suffers from three core pain points that traditional benchmarks cannot precisely diagnose:

| Core pain point | Concrete manifestation |
| --- | --- |
| Unclear prior-transfer effect | The VLM's visual-semantic knowledge is entangled with the VLA's robot action skills, making it impossible to ... |
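The excerpt does not enumerate the transfer techniques GrinningFace compares, but two baselines any such comparison is likely to include are full fine-tuning and freezing the VLM backbone while training only a new action head. The PyTorch sketch below illustrates these two strategies; class and layer names are assumptions, not the benchmark's codebase.

```python
# Illustrative VLM->VLA transfer strategies: freeze the backbone vs. full fine-tune.
import torch
import torch.nn as nn

class VLAPolicy(nn.Module):
    def __init__(self, vlm_backbone: nn.Module, hidden_dim: int = 768, action_dim: int = 7):
        super().__init__()
        self.backbone = vlm_backbone          # pretrained VLM (vision + language)
        self.action_head = nn.Sequential(     # new, randomly initialized head
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, action_dim) # e.g. 6-DoF pose + gripper
        )

    def forward(self, obs_tokens: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(obs_tokens)     # (batch, hidden_dim) pooled features
        return self.action_head(feats)

def configure_transfer(policy: VLAPolicy, strategy: str) -> list:
    """Return the parameter groups to optimize under a given transfer strategy."""
    if strategy == "freeze_backbone":
        for p in policy.backbone.parameters():
            p.requires_grad = False           # keep VLM priors intact
        return list(policy.action_head.parameters())
    elif strategy == "full_finetune":
        return list(policy.parameters())      # risks forgetting VLM priors
    raise ValueError(f"unknown strategy: {strategy}")
```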
Fei-Fei Li and LeCun's World-Model Dispute
具身智能之心· 2025-11-15 16:03
Core Viewpoint
- The article discusses the competition among three major players in the AI industry, Fei-Fei Li, LeCun, and Google, regarding the development of world models, highlighting their distinct technological approaches and implications for artificial general intelligence (AGI) [2][22][39].

Group 1: Fei-Fei Li's Marble
- Fei-Fei Li's company, World Labs, has launched its first commercial world model, Marble, which is considered to have significant commercial potential due to its ability to generate persistent, downloadable 3D environments [5][21].
- Marble features a native AI world editor called Chisel, allowing users to create and modify worlds with simple prompts, which is particularly beneficial for VR and game developers [7][9].
- However, some experts argue that Marble resembles a 3D rendering model rather than a true world model, as it focuses on visual representation without incorporating the underlying physical laws necessary for robotic training [10][20].

Group 2: LeCun's JEPA
- LeCun's approach to world models, exemplified by JEPA, emphasizes control theory and cognitive science rather than 3D graphics, focusing on abstract representations that enable robots to predict changes in the environment [22][25].
- JEPA is designed to train robots by capturing essential world states without generating visually appealing images, making it more suitable for robotic training [27][29].
- This model contrasts sharply with Marble, as it prioritizes understanding the structure of the world over visual fidelity [39].

Group 3: Google's Genie 3
- Google DeepMind's Genie 3, launched in August, generates interactive video environments based on prompts, showcasing improvements in long-term consistency and event triggering [31][34].
- Despite its advancements, Genie 3 remains fundamentally a video-logic model, lacking the deep understanding of physical laws that LeCun's JEPA provides [35][36].
- The visual quality and resolution of Genie 3 are also limited compared to Marble, which offers high-precision, exportable 3D assets [38].

Group 4: Comparative Analysis
- The three world models, Marble, Genie 3, and JEPA, represent different paradigms: Marble focuses on visual representation, Genie 3 on dynamic video generation, and JEPA on understanding the underlying structure of the world [39].
- This creates a "world model pyramid," where models become increasingly abstract and aligned with AI's cognitive processes as one moves up the hierarchy [47][48].
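To make the JEPA-versus-generation contrast concrete, below is a minimal, illustrative JEPA-style objective: an encoder maps observations to latents, and a predictor forecasts the latent of the next observation instead of its pixels. The dimensions, the stop-gradient on the target, and the MSE loss are simplifying assumptions, not Meta's published recipe.

```python
# Sketch of a joint-embedding predictive objective: predict in latent space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JEPA(nn.Module):
    def __init__(self, obs_dim: int = 512, latent_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.GELU(),
                                     nn.Linear(256, latent_dim))
        self.predictor = nn.Sequential(nn.Linear(latent_dim, 256), nn.GELU(),
                                       nn.Linear(256, latent_dim))

    def loss(self, obs_now: torch.Tensor, obs_next: torch.Tensor) -> torch.Tensor:
        z_now = self.encoder(obs_now)
        # Target latent is computed without gradients, so the model cannot
        # drag the target toward its own prediction.
        with torch.no_grad():
            z_next = self.encoder(obs_next)
        z_pred = self.predictor(z_now)
        return F.mse_loss(z_pred, z_next)  # no pixel generation anywhere

model = JEPA()
obs_now, obs_next = torch.randn(32, 512), torch.randn(32, 512)
print(model.loss(obs_now, obs_next).item())
```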
Our Autonomous Driving, Embodied AI, and Large-Model Community Has Reached 7,500 Members!
具身智能之心· 2025-11-15 16:03
Edited by 具身智能之心. This post is for academic sharing only; contact us to remove it in case of infringement.

Our community across all platforms has reached 7,500 members! Over two-plus years we have built a respectable track record. Many of you may not yet know our media matrix: our team has incubated four IPs, 自动驾驶之心 (Autonomous Driving Heart), 具身智能之心 (Embodied AI Heart), 大模型之心Tech (LLM Heart Tech), and 3D视觉之心 (3D Vision Heart), each with its own paid community and private channels. We hope to grow to nearly 10,000 members within the next two years, building a hub for discussion and technical sharing that both beginners and advanced learners visit regularly.

[Truncated topic index from the community page; recoverable entries include: roundup of well-known university autonomous-driving teams in China (https://t.zsxq.com/hlVJZ), algorithm advancement, planning & control (https://t.zsxg.com/USyyN), BackBone roundup (https://t.zsxq.com/melQb), autonomous-driving simulation, the autonomous driving field ...]
An Ultra-Large-Parameter Embodied VLM Goes Open Source: The DPPO Training Paradigm, a Price-Performance Ceiling!
具身智能之心· 2025-11-15 16:03
Edited by 机器之心

Recently, a domestic open-source embodied VLM reached the top of the field, and since the start of 2025, embodied AI R&D seems to have hit an explosive phase. On November 13, the Beijing Humanoid Robot Innovation Center officially open-sourced its embodied VLM, Pelican-VL 1.0. According to the announcement, the model comes in 7B and 72B parameter scales and is billed as the "largest open-source embodied multimodal brain model."

Official materials say its core strength lies in deeply integrating massive data with an adaptive learning mechanism: it was trained on a cluster of 1,000+ A800 GPUs, with a single checkpoint run consuming over 50,000 A800 GPU-hours, and the team distilled hundreds of millions of tokens of high-quality metadata from raw data as the training foundation. Performance improves 20.3% over the baseline and exceeds comparable open-source models by 10.6%. In tests, its average performance surpasses closed-source series such as GPT-5 and Google Gemini, making it currently the strongest-performing open-source embodied multimodal model.

DPPO mimics human metacognitive learning ...
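The excerpt cuts off just as it introduces DPPO, describing it only as mimicking human metacognitive learning. Under that description, a plausible minimal sketch is a "deliberate practice" loop: evaluate the model, flag its weak cases, and train specifically on them. Every name and threshold below is an illustrative assumption, not the Pelican-VL training code.

```python
# Hedged sketch of a metacognitive evaluate -> diagnose -> practice cycle.
from typing import Callable, List, Tuple

def deliberate_practice_round(
    model,
    eval_set: List[Tuple[str, str]],              # (prompt, reference) pairs
    score_fn: Callable[[str, str], float],        # 1.0 = perfect answer
    train_step: Callable[[object, list], None],   # one fine-tuning step
    weakness_threshold: float = 0.5,              # illustrative cutoff
) -> float:
    """Run one practice cycle; return the mean evaluation score."""
    scores, weak_cases = [], []
    for prompt, reference in eval_set:
        answer = model.generate(prompt)           # assumed generate() interface
        s = score_fn(answer, reference)
        scores.append(s)
        if s < weakness_threshold:                # metacognition: flag weak spots
            weak_cases.append((prompt, reference))
    if weak_cases:
        train_step(model, weak_cases)             # practice only the weaknesses
    return sum(scores) / len(scores)
```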
Peking University and Collaborators Use a "Hierarchical Cerebellum + Simulation Avatar" to Put the G1 to Work Zero-Shot
具身智能之心· 2025-11-14 16:03
Edited by 量子位

Recently, a research team from Peking University and BeingBeyond proposed the DemoHLM framework, offering a new approach to humanoid-robot loco-manipulation: from a single human demonstration in simulation, it automatically generates massive amounts of training data, enabling a real humanoid robot to perform generalized manipulation across multi-task scenarios. This directly addresses the core pain points of traditional methods: reliance on hard-coding, the high cost of real-world data, and poor cross-scene generalization.

DemoHLM's core innovation is the dual engine of "hierarchical control + single-demonstration data generation," which preserves whole-body motion stability while enabling generalizable learning at extremely low data cost.

Hierarchical control architecture: balancing flexibility and stability

DemoHLM adopts a layered design, a low-level whole-body controller plus a high-level manipulation policy, decoupling "motion control" from "task decision-making".

In addition, the team fitted the robot with a 2-DoF active neck and an RGBD camera (Intel RealSense D435). A proportional controller keeps the visual tracking stable, mimicking how humans adjust their gaze during manipulation and avoiding perception failures caused by object occlusion; a minimal sketch of such a controller follows below.

Core challenge: human ...
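Here is the promised sketch of a proportional gaze controller in the spirit of the 2-DoF neck setup described above: it nudges neck yaw and pitch so the tracked object stays centered in the camera image. Gains, image resolution, and sign conventions are assumptions for illustration, not BeingBeyond's implementation.

```python
# Proportional (P) control of a 2-DoF neck toward centering a tracked object.
def neck_p_control(
    target_px: tuple[float, float],      # object center in pixels (u, v)
    image_size: tuple[int, int] = (640, 480),
    k_yaw: float = 0.002,                # rad per pixel of horizontal error
    k_pitch: float = 0.002,              # rad per pixel of vertical error
) -> tuple[float, float]:
    """Return (d_yaw, d_pitch) joint increments for one control tick."""
    cx, cy = image_size[0] / 2.0, image_size[1] / 2.0
    err_u = target_px[0] - cx            # positive: object is right of center
    err_v = target_px[1] - cy            # positive: object is below center
    d_yaw = -k_yaw * err_u               # turning right moves the object left in image
    d_pitch = k_pitch * err_v            # sign depends on the joint convention
    return d_yaw, d_pitch

# Example: object detected at (500, 300) in a 640x480 frame
print(neck_p_control((500.0, 300.0)))    # small corrective increments
```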
SemanticVLA: Semantically Aligned Pruning and Enhancement for Efficient Robotic Manipulation
具身智能之心· 2025-11-14 16:03
Core Insights
- The article discusses significant advancements in visual-language-action models for robotic operations, highlighting the challenges posed by dynamic and cluttered environments, which hinder the deployment of existing models [2][4].

Research Background
- Visual-language-action models have made notable progress in robotic operations through pre-trained visual language models that enable end-to-end mapping from language to action. However, two main bottlenecks limit their deployment in real-world scenarios: low computational efficiency and weak task-grounding capabilities [2].

Key Innovations
- Introduction of a semantic-guided dual-visual pruner that addresses visual redundancy through instruction-aware token filtering and geometric-aware aggregation, while maintaining semantic alignment [3].

Main Work

Overall Framework Design
- The framework processes real-time visual observations, robot state (e.g., joint angles, end-effector pose), and natural language instructions to predict future action sequences. It employs two parallel paths for visual input processing, culminating in an end-to-end pipeline for action mapping [4].

Visual Perception Redundancy
- A generic visual encoder processes all pixels uniformly, letting background interference and environmental noise inflate computational costs and dilute attention on critical task cues [5].

Semantic Complementary Layered Fusion
- A semantic complementary layered fusion mechanism integrates dense patch features with sparse semantic tokens, enhancing the alignment of instruction semantics with spatial structures [5].

Semantic Conditioned Action Coupler
- The design reconstructs the mapping from vision to action, improving the efficiency and interpretability of action decoding by representing actions as semantically coherent types [5].

Experimental Results

Efficiency Advantages
- The model cuts training cost by a factor of 3.0 and inference latency by a factor of 2.7, and compresses visual tokens by 8-16x, significantly improving throughput [14].

Real-World Performance
- In long-horizon tasks, the model's success rate reaches 77.8%, surpassing the OpenVLA-OFT model by 22.2% and demonstrating strong generalization [14].

Ablation Studies
- The dual-pruning combination of the SD-Pruner raises success rates by 2.1%-5.2%, achieving the best performance-efficiency balance at an 8x sparsification ratio [16].
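To illustrate the instruction-aware token filtering named under Key Innovations, the sketch below scores visual patch tokens against a pooled instruction embedding and keeps only the top fraction. It shows the general technique at the 8x compression ratio the results mention; it is not SemanticVLA's SD-Pruner implementation.

```python
# Illustrative instruction-aware pruning of visual tokens by cosine similarity.
import torch
import torch.nn.functional as F

def prune_visual_tokens(
    visual_tokens: torch.Tensor,      # (num_tokens, dim) patch features
    instruction_emb: torch.Tensor,    # (dim,) pooled instruction embedding
    keep_ratio: float = 0.125,        # 8x compression, per the reported range
) -> torch.Tensor:
    """Keep the visual tokens most relevant to the language instruction."""
    scores = F.cosine_similarity(visual_tokens, instruction_emb.unsqueeze(0), dim=-1)
    k = max(1, int(visual_tokens.shape[0] * keep_ratio))
    top_idx = scores.topk(k).indices.sort().values   # preserve spatial order
    return visual_tokens[top_idx]

tokens = torch.randn(256, 768)        # e.g. a 16x16 patch grid from a ViT encoder
instr = torch.randn(768)
print(prune_visual_tokens(tokens, instr).shape)      # torch.Size([32, 768])
```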
Lei Jun's College Bunkmate Founds an Embodied-Intelligence Robotics Startup
具身智能之心· 2025-11-14 16:03
Core Insights
- The article discusses the career transition of Cui Baoqiu, a former Xiaomi executive, who is now venturing into the field of embodied intelligence and robotics after leaving Xiaomi [2][6][12].

Group 1: Career Transition and Vision
- Cui Baoqiu, known as a "technical guru" at Xiaomi, is now focusing on creating household service robots, marking a shift from his previous role of building platforms for AI [2][4][5].
- His vision has evolved from "connecting everything" to "transforming the physical world," aiming to create an AI that can think, move, and interact with humans [4][7].
- Prior to his current venture, he served as the Chief Technical Advisor for a RISC-V chip company, indicating a strategic move toward foundational technology [8][10].

Group 2: Background and Achievements at Xiaomi
- Cui joined Xiaomi in 2012 at the invitation of Lei Jun and played a crucial role in establishing Xiaomi's AI and cloud platform team [14][29].
- He was instrumental in promoting Xiaomi's "AIoT" strategy, which initially focused on connecting devices like smart speakers and cameras [7][29].
- Under his leadership, Xiaomi launched significant AI products, including the AI assistant "Xiao Ai," which reflects the culmination of his earlier predictions about AI capabilities [30][32].

Group 3: Industry Trends and Implications
- The article highlights a broader trend in the tech industry where former executives from major companies are now focusing on building physical embodiments for AI, as software alone is insufficient to unlock AI's full potential [42][44].
- This shift toward embodied intelligence is seen as the next phase in the AI evolution, with many former tech leaders entering the robotics space [42][47].
- The competition in this sector is intensifying, with significant investments flowing into startups focused on general-purpose robotics and embodied intelligence [45][48].