World Models

腾讯研究院 AI Digest 20250711
腾讯研究院 · 2025-07-10 14:48
Group 1
- Musk released Grok 4, highlighting its superior performance across benchmarks, particularly surpassing competitors on "Humanity's Last Exam" [1]
- Grok 4's training approach has shifted toward "first-principles" thinking, with the model learning to use tools to solve problems during the training phase [1]
- Grok faces controversy over the "MechaHitler" incident: its unfiltered approach attracts users but also raises concerns about AI alignment [1]

Group 2
- Microsoft open-sourced Phi-4-mini-flash-reasoning, built on the novel SambaY architecture, achieving a 10x increase in reasoning efficiency and a 2-3x reduction in latency [2]
- The SambaY architecture enables efficient memory sharing across layers without explicit positional encoding, significantly enhancing long-context processing [2]
- The new model suits resource-constrained devices, runs on a single GPU, and excels at advanced mathematical reasoning and long text generation, making it well suited to education and research [2]

Group 3
- Perplexity officially launched the AI browser Comet, centered on "agentic search" and competing with Google Chrome [3]
- Comet's three main value propositions: personalized understanding of how the user thinks, powerful yet user-friendly content comprehension, and efficiency gains that reduce tab switching [3]
- Comet offers rich functionality: it can act on the web on the user's behalf, intelligently process content, manage email and calendars, and search personal data; it currently supports Mac and Windows [3]

Group 4
- OpenAI completed its acquisition of io, with former Apple designer Jony Ive and his LoveFrom team joining to take on deep design and creative responsibilities [4][5]
- Ive is expected to help OpenAI develop new intelligent hardware, turning initial ideas into feasible designs [5]
- io, co-founded by Ive and several experts, includes hardware and software engineers and scientists, and will work closely with OpenAI's R&D team [5]

Group 5
- Google released new medical AI models: the multimodal MedGemma 27B and the lightweight encoder MedSigLIP, expanding the HAI-DEF medical model collection [6]
- The MedGemma series includes 4B and 27B versions supporting image and text input with text output; the 4B version scored 64.4% on medical Q&A tests, the 27B version 87.7% [6]
- MedSigLIP, at only 400 million parameters, is a medical image encoder tuned on diverse medical imaging modalities, suitable for image classification, zero-shot classification, and semantic retrieval, and provides visual understanding for MedGemma [6]

Group 6
- Tencent launched a co-creation campaign for its 2026 "Year of the Horse" zodiac penguin; requests surged 300% within hours and token usage doubled, forcing an urgent server expansion [7]
- The campaign invites users to design the 2026 "Horse Goose" figurine with the Hunyuan 3D AI creation engine, generating designs from text input, image uploads, or sketch submissions [7]
- Outstanding works may be co-branded with Tencent for mass production and sold in official merchandise stores; the campaign closes on July 27, 2025 [7]

Group 7
- OpenAI plans to release an "open-weight model" at roughly the o3-mini level as early as next week, allowing companies to deploy it themselves; it would be the company's first weight release since 2019 [8]
- OpenAI is developing a Chromium-based AI browser that will process web content inside the native ChatGPT interface, letting AI agents execute tasks directly and challenging Google Chrome [8]
- OpenAI is expanding from model development into browsers and other user interfaces, signaling its ambition for technological leadership and ecosystem control [8]

Group 8
- Hugging Face and Pollen Robotics jointly launched the open-source robot Reachy Mini, starting at $299 and designed for human-robot interaction and AI experimentation [10]
- Reachy Mini comes in a basic version ($299) and a wireless version ($449), supports Python programming, and includes multimodal interaction hardware such as cameras, microphones, and speakers [10]
- The robot stands 28 cm tall, weighs 1.5 kg, ships with 15 preset behaviors, and is fully open source and extensible; the basic version is expected to ship by late summer 2025 and the wireless version in batches from fall 2025 [10]

Group 9
- Meta released a 40-page report positioning the "mental world model" alongside the physical world model as a key component of embodied intelligence [11]
- The mental world model focuses on human goals, intentions, emotional states, social relationships, and communication styles, enabling AI to understand human psychological states and engage in social interaction [11]
- Meta proposed a dual-system architecture integrating "observational learning" (System A) and "action learning" (System B), where the former provides abstract knowledge and the latter explores actions, for more efficient agent learning [11]

Group 10
- Top AI products such as Cursor, Perplexity, and Lovable have adopted an "anti-framework" approach, building directly on basic AI units rather than on frameworks [12]
- In the fast-moving AI field, frameworks have become innovation barriers, leading to excessive abstraction, bloat, and slow iteration, while basic units offer composability and specialization [12]
- The basic-unit method (e.g., Memory, Thread, Tools) lets developers assemble AI products like building blocks, reducing cognitive load and improving performance and flexibility, a better fit for AI's rapid iteration [12]
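The "basic unit" idea from Group 10 can be sketched in a few lines. This is an illustrative toy, not any product's actual API: the `Memory`, `Thread`, and tool names here are hypothetical stand-ins for the kind of small, independently replaceable primitives the article describes.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Memory:
    """Append-only store of past exchanges (a minimal 'basic unit')."""
    items: list = field(default_factory=list)

    def add(self, role: str, text: str) -> None:
        self.items.append((role, text))

    def recall(self, n: int = 5) -> list:
        return self.items[-n:]

@dataclass
class Thread:
    """One conversation, wiring a Memory and a tool dict to a model call."""
    memory: Memory
    tools: dict  # tool name -> callable

    def run(self, user_msg: str, model: Callable) -> str:
        self.memory.add("user", user_msg)
        reply = model(self.memory.recall(), self.tools)
        self.memory.add("assistant", reply)
        return reply

# A stub "model" that calls a tool directly, standing in for an LLM call.
def echo_model(context: list, tools: dict) -> str:
    last_user_text = context[-1][1]
    return tools["upper"](last_user_text)

thread = Thread(memory=Memory(), tools={"upper": str.upper})
print(thread.run("hello", echo_model))  # -> HELLO
```

Because each unit is independent, swapping the memory store or the tool set does not touch the rest of the stack, which is the composability argument the article makes against monolithic frameworks.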
An Overview of Embodied Data-Collection Schemes! Teleoperation and Motion Capture: Methods, Difficulties, and Challenges (a 20,000-word deep dive)
自动驾驶之心· 2025-07-10 12:40
The following article is from the 具身智能之心 account (author: 具身智能之心), whose tagline is "interacting with the world, one step further."

Following the earlier panel on the still-unsettled question of embodied robot bodies, several guests, not yet done talking, decided to hold another roundtable focused on the "steering wheel" of embodied intelligence: teleoperation.

Teleoperation itself is not a new concept; it already worked very well one or two decades ago. So as teleoperation returns to the spotlight, what upgrades has it brought, or is it preparing to bring?

We also hope this roundtable gives students who are studying, or planning to study, teleoperation a high-level view of the field and some inspiration for their future research.

In this episode we dig into: what teleoperation is; hands-on experiences with various teleoperation setups; whether teleoperation exists only to collect data; the difficulties of motion capture; the epoch-making significance of ALOHA; visions for teleoperation's endgame; and what it would mean if robots had an operating system. Join us for a roundtable that is both spark-filled and thought-provoking!

The full video has been uploaded to 具身智能之心知识星球, the first full-stack embodied-intelligence technical community in China; interested readers are welcome to join the discussion.

Roundtable guest: 赵仲夏 (Zhao Zhongxia), algorithm director at 格灵深瞳 (Deep Glint), visiting scholar at Peking University and BAAI (Xiaohongshu id: 夏染). Roundtable guest: 王文灏 (Wang Wenhao), teleoperation lead at 智元机器人 (AgiBot). Roundtable guest: Tsinghua …
A New Breakthrough in Unified VLA Architectures: Autoregressive World Models Leading Embodied Intelligence
机器之心· 2025-07-10 04:26
This article is from 王宇琪 (Wang Yuqi), a PhD student at the Institute of Automation, Chinese Academy of Sciences, whose research covers world models and perception and decision-making for autonomous driving, with multiple papers at top venues including CVPR, NeurIPS, ICCV, ECCV, and ICLR. 王鑫龙's team at the Beijing Academy of Artificial Intelligence (BAAI) works on native multimodal large models and leads the Emu series. 张兆翔's team at the Institute of Automation, Chinese Academy of Sciences, covers world models, visual generation and reconstruction, autonomous driving, and embodied intelligence.

From Sora to Genie 2, from language-driven video generation to interactive simulation of the world, world models are fast becoming the key foundation connecting perception, understanding, and decision-making. With the rapid progress of vision-language-action (VLA) models in embodied intelligence, the boundaries between modalities are being redrawn. However, existing methods are mostly language-centric and often overlook the rich temporal dynamics and causal structure carried by visual information.

Paper title: Unified Vision-Language-Action Model
Website: https://robertwyq.github.io/univla.github.io/
Paper link: https://arxiv.org/abs/2506.19850
Code link: https://github.com/baaivision/UniVLA

To this end, …
Meta Releases a 40-Page Report: The Next Step for Embodied Intelligence Is the "Mental World Model", Able to Listen, See, Understand, and Empathize
量子位· 2025-07-10 03:19
Core Insights
- Meta is investing heavily in talent acquisition, reportedly spending $100 million to recruit personnel [1]
- The company released a comprehensive 40-page report on embodied intelligence, introducing a "mental world model" alongside traditional physical world models [2][3]

Group 1: World Models
- The report emphasizes both physical and mental world models; the latter focuses on psychological regularities such as intentions, emotions, and social relationships [3][4]
- The physical world model covers object properties, spatial relationships, dynamic changes in the environment, and causal relationships grounded in physical laws [8]
- The mental world model covers goals, intentions, emotional states, social dynamics, and communication styles, which are crucial for understanding human behavior [8][10][15]

Group 2: Implications for AI
- For intelligent agents to collaborate effectively with humans, they must learn and understand human psychological states [15][17]
- The report outlines a dual learning system combining observational learning (System A) and action-based learning (System B) to enhance AI capabilities [23][28]
- Integrating these systems aims to improve the efficiency of AI learning and its ability to adapt to dynamic environments [28][29]

Group 3: Future Directions
- Despite current performance limitations, mental world models hold significant potential for multi-agent collaboration [30]
- A mental world model can facilitate consensus among agents, allowing them to align goals and coordinate actions under uncertainty [32]
- This advancement is a critical step toward more empathetic, context-aware human-machine interaction [33][34]
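The dual-system split described above can be made concrete with a toy sketch. Everything here is my own illustration, not Meta's design: a "System A" that passively tallies observed state transitions, and a "System B" that queries that model to pick an action reaching a goal.

```python
from collections import defaultdict

class SystemA:
    """Observational learner: tallies observed (state, action) -> next-state counts."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, state, action, next_state):
        self.counts[(state, action)][next_state] += 1

    def predict(self, state, action):
        """Return the most frequently observed outcome, or None if unseen."""
        outcomes = self.counts.get((state, action))
        return max(outcomes, key=outcomes.get) if outcomes else None

class SystemB:
    """Action learner: uses System A's model to choose an action that reaches a goal."""
    def __init__(self, model: SystemA):
        self.model = model

    def act(self, state, goal, actions):
        for action in actions:
            if self.model.predict(state, action) == goal:
                return action
        return actions[0]  # nothing known to work: fall back to exploring

world = SystemA()
world.observe("door_closed", "push", "door_open")
world.observe("door_closed", "wait", "door_closed")
agent = SystemB(world)
print(agent.act("door_closed", "door_open", ["wait", "push"]))  # -> push
```

The point of the split, as the report argues, is that passive observation (System A) supplies cheap abstract knowledge, so the acting system (System B) needs far fewer of its own trial-and-error interactions.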
Half a Year in the Making! Our Small-Class Course on End-to-End and VLA Autonomous Driving Is Here (One-Stage/Two-Stage/Diffusion Models/VLA, etc.)
自动驾驶之心· 2025-07-09 12:02
Unlike traditional modular pipelines, end-to-end systems model directly from sensor input to vehicle planning/control, avoiding the error accumulation between modules. BEV perception broke down the walls between modules, taking a technical leap under a unified bird's-eye view. Then UniAD unified the perception and planning tasks, running all modules inside a single model for the first time, and the end-to-end era arrived.

As academia and industry turned their attention to end-to-end driving, many questions surfaced. Is UniAD the final answer for end-to-end? Clearly not: a whole series of new algorithms has sprung up.

Too many tech stacks? Hard to get started? Last year we launched "the first industrial-grade end-to-end algorithm and hands-on tutorial." This year many students told us the technology moves too fast and last year's material no longer fits the current landscape. End-to-end driving has branched into several technical directions and now requires knowledge of multimodal large models, BEV perception, reinforcement learning, vision Transformers, diffusion models, and more. Learning end-to-end autonomous driving is a one-stop opportunity to strengthen knowledge across many fields, but the path is often painful: mastering several fields at once is hard enough, and each field's papers are numerous and fragmented, so beginners often give up before they get oriented. Distilling a framework from scattered papers and grasping each field's trends is a common challenge for newcomers. Learning goal-driven navigation also has to be grounded in real tasks …
Cold Water on "World Models" Too? Eric Xing (邢波) and Co-authors Expose Five Major Flaws and Propose a New Paradigm
机器之心· 2025-07-09 07:10
机器之心 report; editors: 泽南, +0

Today's world models deserve criticism.

We know that large language models (LLMs) produce output by predicting the next word of a conversation, and the resulting conversational, reasoning, and even creative abilities approach human-level intelligence. Yet models like ChatGPT still fall visibly short of true AGI. If we could perfectly simulate every possible future in an environment, could we then create powerful AI? Consider humans: unlike ChatGPT, human ability is organized into concrete skills and deeper, more complex capabilities.

[Figure caption: A case of simulated reasoning: a person (perhaps a self-interested one) helps someone who is crying by mentally simulating several possible outcomes.]

Humans can perform a wide range of complex tasks, all on the same cognitive architecture of the human brain. Could a single AI system also accomplish all of them?

Paper: Critiques of World Models
Paper link: https://arxiv.org/abs/2507.05169

The researchers identify five key aspects of building and training world models: 1) identifying and preparing training data that contains information about the target world; 2) adopting a general representation space for latent world states, whose meaning can be richer than the directly observed data; 3) designing architectures that can reason effectively over those representations; 4) choosing objective functions that correctly guide training; …
Embodied Intelligence Paper Express | Reinforcement Learning, VLA, VLN, World Models, and More
具身智能之心· 2025-07-08 12:54
How does reinforcement learning improve VLA generalization? Tsinghua University, the Shanghai Qi Zhi Institute (上海期智研究院), and the Beijing Zhongguancun Academy significantly improved the generalization of vision-language-action (VLA) models via reinforcement-learning fine-tuning with PPO:

1) Task-execution success rate up 42.6% in out-of-distribution (OOD) scenarios
2) Semantic-understanding success rate up from 61.5% to 75.0% on unseen objects
3) Success rate under dynamic distractors up from 28.6% to 74.5% (Tab. 3)

Paper title: What Can RL Bring to VLA Generalization? An Empirical Study
Paper link: https://arxiv.org/pdf/2505.19789

Main contributions:
1. A rigorous and challenging benchmark for evaluating how VLA fine-tuning methods affect generalization along the visual, semantic, and execution dimensions.
2. Identifying PPO as a better RL algorithm for VLA fine-tuning than GRPO and DPO, and discussing the key challenges in adapting these RL algorithms from the LLM/VLM paradigm to VLA's unique requirements.
3. An efficient PPO-based VLA fine-tuning scheme that uses a shared actor-critic backbone, VL…
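The paper's full recipe is not reproduced here, but the objective that PPO fine-tuning optimizes is standard and worth recalling. Below is the clipped surrogate loss for a single sample in plain Python; the function name and scalar interface are mine, and real VLA fine-tuning applies this per action token over batches with a learned value baseline.

```python
import math

def ppo_clip_loss(logp_new: float, logp_old: float,
                  advantage: float, eps: float = 0.2) -> float:
    """PPO clipped surrogate loss for one action sample.

    ratio = pi_new(a|s) / pi_old(a|s); the clip keeps the policy
    update within a trust region of [1 - eps, 1 + eps].
    """
    ratio = math.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + eps), 1 - eps) * advantage
    # PPO maximizes the pessimistic (minimum) bound; we return a loss to minimize.
    return -min(unclipped, clipped)

# Unchanged policy, positive advantage: loss is just -advantage.
print(ppo_clip_loss(0.0, 0.0, 1.0))  # -> -1.0
# Policy moved far toward the action: gain is clipped at 1 + eps = 1.2.
print(ppo_clip_loss(1.0, 0.0, 1.0))  # -> -1.2 (approximately)
```

The clipping is what makes PPO comparatively stable for this setting: however large the policy shift, a single sample's contribution to the gradient is bounded, which matters when the advantage estimates from sparse task rewards are noisy.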
A 20,000-Word Survey: Future Frame Synthesis for Video, from Deterministic to Generative Methods
自动驾驶之心· 2025-07-08 12:45
Author: hzwer 黄哲威 | Editor: 自动驾驶之心
Original link: https://zhuanlan.zhihu.com/p/1918322086205718663

We are now preparing the camera-ready version; if you have insights or additional references, comments are welcome.

This started out last year as a practice submission to the IJCAI survey track. The first draft was only seven pages and, after some mishaps, was desk-rejected. After rounds of revision for a journal submission it grew past twenty pages and can finally be published. Hopefully it reads a bit better than what automated deep research would generate.

Paper link: https://arxiv.org/abs/2401.14718

Abstract: Future Frame Synthesis (FFS) aims to generate future frame sequences based on existing content. By emphasizing the synthesis aspect, it extends the scope of video frame prediction. This survey comprehensively reviews existing FFS research, covering commonly used benchmark datasets and representative algorithms. We discuss the field's key challenges and trace the evolution of FFS in …
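As a concrete anchor for the "deterministic" end of the spectrum the survey covers, here is the simplest possible deterministic predictor: linear extrapolation of pixel intensities from the last two frames. This toy (and its function name) is my own illustration, not a method from the survey; learned deterministic and generative FFS models replace this with neural predictors.

```python
def extrapolate_next_frame(prev: list, curr: list) -> list:
    """Predict frame t+1 as curr + (curr - prev), i.e. 2*curr - prev,
    per pixel, clamped to the valid intensity range [0, 255]."""
    return [
        [max(0, min(255, 2 * c - p)) for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]

# Two tiny 2x2 grayscale frames with uniform brightening motion.
f0 = [[10, 10], [10, 10]]
f1 = [[20, 20], [20, 20]]
print(extrapolate_next_frame(f0, f1))  # -> [[30, 30], [30, 30]]
```

Even this baseline exposes the core difficulty the survey discusses: deterministic extrapolation commits to one future, while real videos have many plausible continuations, which is what motivates the move to generative approaches.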
Exclusive Interview with the CEO of 云深处 (Deep Robotics), One of "Hangzhou's Six Little Dragons": Humanoid Robots Doing Household Chores Are Still 10 Years Away
36氪· 2025-07-08 09:18
The following article is from 智能涌现 (authors: 苏建勋, 富充), a 36氪 account covering the industrial revolution emerging in the new AI era.

Give a robot a "world model," and it no longer needs so much data.

Interview | 苏建勋 杨轩
Text | 富充
Editor | 苏建勋
Source | 智能涌现 (ID: AIEmergence)
Cover image | company official

In the bustling embodied-intelligence industry, Hangzhou-based 云深处科技 (Deep Robotics) is surprisingly low-key. Even after the media listed it early this year as one of "Hangzhou's Six Little Dragons," founder 朱秋国 (Zhu Qiuguo) still rarely appears in public. His personal video channel contains nothing about himself; its most-liked clip shows the company's quadruped robot "绝影" (Jueying) climbing stairs and sprinting over obstacles.

"This is impressive: three robot dogs running different motions, autonomously deciding which strategy to use," one viewer commented excitedly. Zhu did not reply.

True to the company's name, Zhu believes that climbing the technology curve demands quiet persistence before reaching the sudden clarity of "远上寒山石径斜,白云深处有人家" (a stone path winds up the cold mountain; deep among the white clouds, there are homes). A colleague told us: "Zhu likes to put the company and its products in front and step back himself."

Yet even with a heart set on staying hidden, the eight-year-old Deep Robotics has been pushed to center stage by today's wave of embodied intelligence. 智能涌现 has learned that Deep Robotics announced the completion of a new funding round of nearly 500 million RMB. This round …
The Intuition Catcher (感觉捕手)
36氪 · 2025-07-08 09:04
In junior high my mind was muddled: I often tuned out in class and stumbled along in my studies.

On one physics exam, the final hard problem was about buoyancy. I could not remember the buoyancy formula at all, so I started deriving it by hand, writing out my own solution through a kind of felt simulation.

When the papers were handed back, my answer turned out to be correct, but my score was zero. My deskmate had only written down the formula, done nothing more, and received 5 points.

The physicist Fermi said there are only two ways to calculate: "The first, which I prefer, is to have a clear physical image; the second requires a rigorous mathematical formalism."

Perhaps the method I used back then was simulating a primitive physical image and process in my head.

Later in high school I improved somewhat, was no longer so absent-minded, and drew more pleasure from physics.

I especially loved mechanics, because a handful of formulas could crack complicated, wildly imaginative problems, solved, as usual, mostly by that "feel"-first approach. I would simulate the forces on an object (and their decomposition), simulate the motion, simulate how all the elements combined into a working system, and only then compute with formulas.

In 1945 Einstein wrote a letter to the mathematician Jacques Hadamard, who was then studying the thought processes of scientists and mathematicians. It contains the following passage:

"The words or the language, as they are written or spoken, do not seem to play any role in my mechanism of thought. The psychical entities which seem to serve as elements in thought are certain signs and more or less …