机器之心
Want robots to make dumplings at the Spring Festival Gala? Alibaba DAMO Academy: not so fast, optimize the "brain" first
机器之心· 2026-02-10 03:46
In 2026, with so many robots appearing at the Spring Festival Gala, can they make dumplings for the audience? Many people are surely curious. According to recent rehearsal reports, however, this is unlikely; the robots will more probably be designed to carry dumplings out on trays. Industry insiders know that, without hard-coded programming or teleoperation, having a robot make dumplings is far more complex than locomotion or navigation. It also involves the dumpling wrapper, a deformable object that amounts to a Turing test for robots; without a sufficiently intelligent "brain", it simply cannot be done. This is why, over the past year, more and more research effort and funding have poured into the "brain". Editor | Zhang Qian. A recent project from Alibaba DAMO Academy, RynnBrain, targets the same direction. Unlike studies that showcase folding clothes or making breakfast, it asks questions at a lower level: if a robot doing housework is interrupted to pick up a delivery at the door, can it come back and keep washing the dishes? If a robot is asked to complete a task that requires many tools, will the plan it drafts include tools it does not actually have on hand? Among the grand narratives around robotics these questions may seem unremarkable, and even the relevant benchmarks are missing, yet they are thresholds robots must cross to leave the lab. In building RynnBrain, the DAMO Academy embodied-intelligence team chose to start from the foundations, training spatiotemporal memory and physical-space reasoning directly into the model, and achieved solid ...
This Spring Festival, AI stopped chatting and started paying for me
机器之心· 2026-02-09 05:12
Core Viewpoint
- The article discusses the competitive landscape of AI applications during the Chinese New Year, highlighting how major tech companies are leveraging AI to attract user attention and enhance consumer experiences through innovative strategies.

Group 1: AI Competition and Strategies
- Tencent initiated the AI Spring Festival battle by launching a 1 billion cash red envelope campaign and introducing a new AI social feature called "Yuanbao" [1]
- Baidu followed with a 500 million red envelope initiative, collaborating with Beijing TV for the Spring Festival Gala [1]
- ByteDance brought its Volcano Engine to the backstage of the CCTV Spring Festival Gala, intensifying the competition [1]
- Alibaba's Qianwen APP entered the fray with a 3 billion "Spring Festival Treat Plan," integrating its ecosystem to offer a comprehensive consumer experience [1]

Group 2: User Engagement and Order Volume
- On the first day of the Qianwen APP event, over 10 million AI orders were completed within 9 hours, with more than 30 million "help me buy" requests received [2]
- The overwhelming user engagement led to server congestion, prompting the Qianwen team to ask users for patience [5]

Group 3: New Consumption Habits and AI Capabilities
- The 3 billion investment not only distributed cash but also tested new consumer habits, encouraging users to interact with the Qianwen APP [6]
- Qianwen demonstrated impressive cross-application coordination, efficiently managing travel planning by integrating services from Fliggy and Gaode [10]
- In the family consumption sector, Qianwen acted as a "universal shopping guide," quickly filtering products from Taobao and Tmall based on user needs [13]

Group 4: Differentiation in AI Development
- The article notes the distinct strategies among tech companies, with Qianwen focusing on integrating AI into shopping and ticketing, thus targeting higher-value consumer scenarios [15]
- The differences in AI development paths between China and the U.S. are highlighted: U.S. AI focuses primarily on high-value B2B markets, while China's AI is more consumer-oriented [20][21]

Group 5: Future of AI Applications
- The article posits that 2023 and 2024 were pivotal years for AI technology, with 2025 marking a year of application exploration, as evidenced by Qianwen's efforts to push the boundaries of AI applications [26]
- Qianwen's approach combines AI capabilities with a robust ecosystem, enabling seamless integration of AI into everyday life [28]
- User interaction with digital platforms is shifting towards a more streamlined experience, emphasizing efficiency and convenience [29]
CVPR 2026 workshop call for papers | The 6th AdvML@CV: safety of multimodal large-model agents
机器之心· 2026-02-09 05:12
Core Viewpoint
- The article announces the 6th AdvML@CV workshop focusing on the safety and robustness of vision-language agents, scheduled during the CVPR 2026 conference in Denver, Colorado from June 3 to June 7, 2026 [2][3]

Group 1: Workshop Themes
- The workshop will address the safety and robustness of vision-language agents, which have seen significant advancements due to multimodal foundation models [4][5]
- Vision-language agents are becoming integral in fields such as autonomous driving and intelligent robotics, but their increased autonomy introduces complex security risks, including adversarial prompts, instruction injection, and jailbreak manipulations [5]

Group 2: Call for Papers
- The workshop invites submissions on a range of topics, including attacks on and defenses of vision-language agents, as well as datasets and benchmarks for evaluating these agents [6][7]
- Specific areas of interest include adversarial/jailbreak attacks, improving agent robustness, and aligning vision-language agents [10]

Group 3: Submission Guidelines
- Long papers should be at most 8 pages (excluding references); extended abstracts should be at most 4 pages (including references) [10]
- All submissions must be anonymous and follow the CVPR 2026 Author Kit template [10]

Group 4: Important Dates
- Abstract and paper submission deadline: March 5, 2026; author notifications: March 17, 2026; camera-ready submissions due: April 1, 2026 [10]
Understand behavior first, then train the agent: CMU open-sources the first Agentic Search log dataset, taking the agent apart for you
机器之心· 2026-02-09 01:18
Core Insights
- The article discusses the lack of systematic characterization and analysis of how intelligent agents formulate queries, rewrite them, and utilize retrieved information in the context of Agentic Search driven by large language models [2][7]

Group 1: Research Contributions
- The CMU team organized over 14 million Agentic Search requests and approximately 4 million sessions from six months of real traffic, releasing the first open-source Agentic Search behavior log dataset [7][8]
- A three-layer analytical framework was proposed, consisting of session intent (Declarative / Procedural / Reasoning), trajectory actions (Specialization / Generalization / Exploration / Repetition), and the Context-driven Term Adoption Rate (CTAR) to measure the adoption of retrieved information [2][8]

Group 2: Data and Platform Overview
- The DeepResearchGym (DRGym) platform was established for research purposes, providing a unified search API based on dense retrieval over fixed web corpus snapshots [12]
- The dataset includes logs from 25 countries and nearly 600 IP addresses, ensuring diverse usage and anonymity through data cleaning and anonymization [13][14]

Group 3: Session Analysis Methodology
- A joint semantic and temporal sessionization strategy was employed to analyze behavior patterns, yielding approximately 4 million sessions characterized by high-frequency, iterative queries [16][19]
- The analysis revealed that queries spread across a dispersed semantic space, with low overlap with common Agentic Benchmark tasks [18]

Group 4: Intent and Trajectory Dynamics
- Multi-turn sessions were categorized into three types of session intent, Declarative, Procedural, and Reasoning, each with distinct session lengths and retrieval configurations [22][25]
- Four types of trajectory moves were identified, Specialization, Generalization, Exploration, and Repetition, with a notable "drill-down bias" observed in agents' behavior [27][32]

Group 5: CTAR Insights
- The CTAR metric indicated that over half of the new terms in queries could be traced back to previously retrieved documents, highlighting agents' reliance on historical context [34][35]
- Different trajectory moves exhibited significant variation in CTAR, with Specialization and Exploration showing higher term-adoption rates than Repetition [36][37]

Group 6: System Design Implications
- Repeated actions can signal stagnation in the agent's search process, suggesting system interventions that trigger exploration or generalization strategies [41]
- The retrieval budget should adapt to task intent and trajectory state, allowing more effective document coverage and query refinement [42]
- Incorporating CTAR and similar metrics into system monitoring can help assess whether agents are effectively utilizing retrieved information [43]

Group 7: Overall Contributions
- The research provides the first open-source dataset of Agentic Search behavior logs, establishing a reproducible data foundation for future studies [46]
- It introduces an analytical framework for understanding Agentic Search processes, offering tools for behavior modeling and strategy comparison [47]
- It translates empirical observations into quantifiable design recommendations for improving agentic search systems [48]
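The CTAR idea summarized above, the share of a query's new terms that can be traced back to earlier retrieved documents, can be sketched in a few lines of Python. This is an illustrative reading rather than the paper's exact definition: the `ctar` function name, whitespace tokenization, and the pairing of each result list with the following query are all assumptions.

```python
def ctar(queries, retrieved_docs):
    """Illustrative Context-driven Term Adoption Rate sketch.

    queries: list of query strings issued in one session, in order.
    retrieved_docs: retrieved_docs[i] is the list of document texts
    returned for queries[i], available when queries[i+1] is formed.
    Returns the fraction of new query terms (terms absent from all
    earlier queries) that appear in earlier retrieved documents.
    """
    seen_query_terms = set(queries[0].lower().split())
    seen_doc_terms = set()
    adopted = new_total = 0
    for q, docs in zip(queries[1:], retrieved_docs):
        # Documents retrieved before this query become available context.
        seen_doc_terms |= {t for d in docs for t in d.lower().split()}
        terms = set(q.lower().split())
        new_terms = terms - seen_query_terms
        adopted += len(new_terms & seen_doc_terms)
        new_total += len(new_terms)
        seen_query_terms |= terms
    return adopted / new_total if new_total else 0.0
```

A high value under this reading would indicate the agent is lifting vocabulary from retrieved context rather than inventing terms, which matches the drill-down behavior reported for Specialization moves.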
NVIDIA's world model evolves again, one model drives all robots! The GPT moment for robotics has truly arrived
机器之心· 2026-02-09 01:18
Where is the biggest obstacle to embodied intelligence entering general-purpose domains? We believe the core problem is cross-embodiment transfer. Of course, the key to embodied intelligence executing general complex tasks is a complete world model. Yet most world models do not actually possess the strong generalization and transfer abilities we imagine. Simply put, the world models used on robots or smart vehicles are mostly designed and trained on one fixed hardware platform; few generalize well, and cross-embodiment transfer succeeds almost by luck. Bluntly, what most robots learn today is not "how the world works" but "how to move on this machine". We need a world model that truly understands physics and causality, one that knows how the world will change and what consequences an action brings, in order to transfer and generalize across different bodies and environments. On this problem NVIDIA, the king of compute and a long-time builder of world models of all kinds, has struck again, constructing a brand-new world model in which everything is zero-shot. Recently, NVIDIA's GEAR lab proposed DreamZero, a World Action Model (WAM) built on a pretrained video-diffusion backbone. This 14-billion-parameter model lets robots complete previously unseen tasks from a simple text prompt alone. Lab lead Jim Fan called it robotics' ...
Your childhood Koromon "walks into" reality? A Huawei "Genius Youth" founds a startup, and the world's first real-time interactive video model fusing virtual and real is here
机器之心· 2026-02-09 01:18
Core Viewpoint
- The article discusses the emergence of Xmax AI's real-time interactive video model X1, which allows users to seamlessly integrate virtual characters into their real-world environment, marking a significant advancement in AI video generation and interaction [3][10][26]

Group 1: Technology and Innovation
- Xmax AI has developed the X1 model, which enables real-time interaction with virtual characters using just a smartphone camera, eliminating the need for complex prompts or lengthy rendering times [4][10]
- The global AI video generation market is projected to grow from $614.8 million in 2024 to $2.5629 billion by 2032, indicating strong demand and competition in the sector [8]
- Xmax AI's approach focuses on making AI video generation accessible to the general public by lowering interaction barriers and enhancing real-world integration [10][26]

Group 2: Features of the X1 Model
- The X1 model offers four core functionalities: dimensional interaction, world filters, touch animations, and expression capture, allowing users to interact with virtual characters in a natural and engaging manner [10][11][14][16]
- Dimensional interaction allows users to summon characters into their environment using a reference image, while world filters enable real-time transformation of video styles based on uploaded images [11][14]
- Touch animations bring static images to life, letting users control movements through touch, and expression capture generates dynamic emojis from real-time facial recognition [15][16]

Group 3: Technical Challenges and Solutions
- Xmax AI faces significant technical challenges, including achieving ultra-low latency for real-time interactions, understanding user intent, and addressing data scarcity for training models [19][20]
- The company built an end-to-end streaming re-rendering video model architecture to meet the demand for real-time responsiveness, reducing latency to milliseconds [24]
- To tackle intent understanding, Xmax AI developed a unified interaction model that comprehensively interprets user gestures and actions [24]

Group 4: Team and Expertise
- The founding team comprises individuals with strong technical backgrounds, including experience at leading AI companies and academic institutions, enhancing their capability to address complex engineering challenges [22][23]
- The team has built a robust technical foundation combining algorithmic knowledge with practical engineering skills, positioning it well to innovate in AI video generation [22][24]

Group 5: Future Vision
- Xmax AI aims to redefine user interaction with AI-generated content, envisioning a future where virtual characters seamlessly integrate into daily life as virtual companions or pets [26][28]
- The company's slogan, "Play the World through AI," encapsulates its mission to make the virtual world more interactive and accessible, allowing users to engage with digital content in a tangible way [28]
Deep thinking on diffusion language models
机器之心· 2026-02-08 10:37
The following article is from 精博士小酒馆, by Wang Yunhe. While writing this, the first thing that came to mind was a question a leader asked me years ago: what is the next step after the Transformer? My reply at the time was that the Transformer is a paradigm reached through long accumulation, quantitative change becoming qualitative; early vision work already had similar ideas, such as non-local blocks, and convolution keeps playing a complementary role alongside attention. Diffusion itself is not exactly the Transformer's next step, but in terms of modeling approach it may have the potential to seriously challenge autoregression. I have followed diffusion language models (dLLM) for a long time, but limited energy and compute left no chance for deep exploration. Exploring diffusion architectures from the text side is relatively tractable today, and many problems there must be solved before multimodal versions can work, so we will first focus on the algorithmic foundations of dLLMs. Starting in the second half of last year we gradually explored several directions; inspired by an internal expert, I wrote a piece of insight material before New Year's Day. A few days ago, in a talk at AAAI, I highlighted several of the team's works, including next-block diffusion training, the hierarchical structure of diffusion in diffusion, and diffusion agents. Related P ...
A new model "drifting" paradigm: Kaiming He's latest work frees generative models from iterative inference
机器之心· 2026-02-08 10:37
Training a generative model is a complicated affair. At the underlying level, generative modeling is a process of gradual fitting. Unlike common discriminative models, which map a single sample to its label, generative models map one distribution to another. Start with the most familiar case: diffusion models, including flow-based counterparts, typically characterize the mapping from noise to data via differential equations (stochastic differential equations, SDEs, or ordinary differential equations, ODEs). But training diffusion models is time-consuming and laborious, because their core computation is an iterative process. To improve generation efficiency, much work aims to reduce the number of diffusion steps. One representative line is distillation, compressing a pretrained multi-step model into a single-step model. Another line trains single-step diffusion models from scratch. For example: variational autoencoders (VAEs) are trained by optimizing the evidence lower bound (ELBO), an objective composed of a reconstruction loss and a KL-divergence term. With a Gaussian prior, the classic VAE is itself a one-step generative model. In today's mainstream applications, however, VAEs typically use priors learned by diffusion or autoregressive models, in which case the VAE acts more as a tokenizer. Normalizing flows (NFs) learn a mapping from data to noise and are trained by maximizing the log-likelihood of samples. These methods ...
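For reference, the ELBO objective mentioned above for VAEs has the standard form, with encoder $q_\phi$, decoder $p_\theta$, and prior $p(z)$:

```latex
\mathcal{L}_{\mathrm{ELBO}}(\theta,\phi;x)
  = \underbrace{\mathbb{E}_{q_\phi(z\mid x)}\!\left[\log p_\theta(x\mid z)\right]}_{\text{reconstruction}}
  \;-\;
  \underbrace{D_{\mathrm{KL}}\!\left(q_\phi(z\mid x)\,\Vert\,p(z)\right)}_{\text{KL term}}
  \;\le\; \log p_\theta(x)
```

Maximizing this bound trades off reconstruction fidelity against keeping the encoder close to the prior; sampling then needs only one decoder pass, which is why the classic Gaussian-prior VAE counts as a one-step generator.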
Topping the Hugging Face paper leaderboard: LLMs rewrite the rules of data preparation
机器之心· 2026-02-08 10:37
Table schemas are inconsistent across systems, alignment logic is complex, and manual mapping is slow and labor-intensive. Massive data lacks labels and semantic descriptions, leaving analysts unable to understand or use it. Behind this lies the classic problem of data preparation: it consumes nearly 80% of data teams' time and effort, yet remains the most stubborn bottleneck on the road to intelligence. Traditional approaches rely mainly on static rules and domain-specific models, with three fundamental limitations: heavy dependence on manual work and expert knowledge, limited awareness of task semantics, and poor generalization across tasks and data modalities. Now, a joint survey topping the Hugging Face trending list argues that large language models (LLMs) are fundamentally changing this landscape, driving a paradigm shift in data preparation from "rule-driven" to "semantics-driven". In enterprise systems, data teams commonly face a dilemma: models iterate rapidly, but the aging pipelines of data preparation grow ever heavier. Cleaning, alignment, labeling... these tasks remain mired in manual rules and expert experience. Is your team struggling with this too? The researchers note that introducing LLMs is pushing this process from "rule-driven" to "semantics-driven". Rather than merely executing preset logic, the model attempts to understand the meaning behind the data and, on that basis, performs detection, repair, alignment, and completion. In this survey, the authors take an application-ready perspective ...
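The "rule-driven to semantics-driven" shift described above can be illustrated with a minimal schema-alignment sketch: instead of a hand-written mapping table, an LLM call decides which target field a source column means. Everything here is hypothetical; the `align_columns` function, the `ask_llm` callable, and the prompt wording are illustrative assumptions, not APIs from the survey.

```python
def align_columns(source_cols, target_schema, ask_llm):
    """Hypothetical semantics-driven schema alignment sketch.

    source_cols: column names from the source system.
    target_schema: field names of the target schema.
    ask_llm: a callable (prompt string -> answer string) standing in
    for any LLM backend; injected so the sketch stays backend-neutral.
    Returns {source_col: matched_target_field_or_None}.
    """
    mapping = {}
    for col in source_cols:
        prompt = (
            f"Source column: {col!r}\n"
            f"Target fields: {', '.join(target_schema)}\n"
            "Answer with the single best-matching target field, or NONE."
        )
        answer = ask_llm(prompt).strip()
        # Validate the model's answer against the schema instead of
        # trusting it blindly; unmatched columns map to None.
        mapping[col] = answer if answer in target_schema else None
    return mapping
```

The point of the sketch is the division of labor: the rules ("is this answer a real field?") shrink to a validation check, while the semantic judgment ("what does `cust_no` mean?") moves into the model call.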
A new benchmark for generative science intelligence: IntelliFold 2 newly released and open-sourced, leading comprehensively on key metrics
机器之心· 2026-02-08 10:37
In the new wave of "Generative Science" driven by GenAI, biological foundation models have long been a hot area of wide interest. Nature's language of life (sequences, structures) shares sequential characteristics with human symbolic language, yet encodes strict physical constraints and the logic of biological evolution that humans have long been unable to fully decode; given its critical importance to human production and daily life, biological foundation modeling has become a "jewel in the crown" of the field. The key value of biological foundation models lies in their ability to use GenAI architectures such as the Transformer to fully open up latent spaces over massive data, mining a "grammar of life" that humans can hardly perceive or summarize yet which is enormously useful. DeepMind's AlphaFold series is undoubtedly the pioneering breakthrough here. After the release of AlphaFold 3, its phenomenal progress and huge industrial potential were plain to see, making it the undisputed industry benchmark. Subsequently, around structure prediction and the closely related de novo design applications, a batch of representative results pursuing breakthroughs with large GenAI models emerged worldwide (Chai Discovery, Boltz, OpenFold, etc.); star teams, large funding rounds, and even big-tech acquisitions (EvolutionaryScale) kept the market heat rising. Yet as of ...