机器之心
Face robot lands the cover of Science Robotics: using AI to teach a biomimetic facial robot to "speak"
机器之心· 2026-01-15 04:31
Hu Yuhang (online handle "U 航") holds a PhD from Columbia University and is the founder of 首形科技. He has long focused on research into autonomous learning for robots, with results published in top international journals including Nature Machine Intelligence and Science Robotics. His work aims to give robots a "self-model": an internal representation of their own physical structure and motion that lets a robot better understand itself and adapt to changing bodies, environments, and tasks. In biomimetic human-robot interaction, he has proposed a unified emotion understanding and expression system that fuses speech, vision, and motion, giving robots more natural interaction abilities. Through self-supervised learning, his methods let robots continually improve the quality of human-robot interaction without human intervention, moving steadily toward agents with lifelong-learning ability.

Paper: https://www.science.org/doi/10.1126/scirobotics.adx3017

On January 15, 2026, a breakthrough study from Columbia University's School of Engineering was published in Science Robotics and featured on the journal's cover. The study demonstrates a new robotic technology: a humanoid robot with a biomimetic facial structure that, through deep learning, produces realistic lip movements synchronized with speech and song. It can open its mouth precisely in time with human speech ...
解锁任意步数文生图,港大&Adobe全新Self-E框架学会自我评估
机器之心· 2026-01-15 03:52
Core Viewpoint
- The article introduces Self-E, a novel text-to-image generation framework that eliminates the need for pre-trained teacher models and allows any-step generation while maintaining high quality and semantic clarity [2][28].

Group 1: Introduction and Background
- Traditional diffusion models and flow matching have improved text-to-image generation but require numerous iterations, limiting real-time application [2].
- Existing methods often rely on knowledge distillation, which incurs additional training costs and leaves a gap between "from scratch" training and "few-step, high-quality" generation [2][28].

Group 2: Self-E Framework
- Self-E represents a paradigm shift by focusing on "landing evaluation" rather than "trajectory matching," letting the model learn the quality of the final output rather than just the correctness of each step [7][28].
- The model operates in two modes, learning from real data and self-evaluating its own generated samples, creating a self-feedback loop [12][13].

Group 3: Training Mechanism
- Self-E employs two complementary training signals, one from data and one from self-evaluation, enabling the model to learn local structures and assess its outputs simultaneously [14][19].
- Training involves a long-distance jump to a landing point, where the model uses its current local estimates to generate feedback on how to improve the output [17][19].

Group 4: Inference and Performance
- During inference, Self-E maintains semantic and structural quality with very few steps, and quality continues to improve as the step count increases [22][23].
- On the GenEval benchmark, Self-E outperforms other methods across all step counts, with a particularly large advantage in the few-step range: a notable improvement of +0.12 in the 2-step setting over the best existing methods [24][25].

Group 5: Broader Implications
- Self-E's approach aligns pre-training and feedback learning, creating a closed-loop system similar to reinforcement learning that enhances the model's ability to generate high-quality outputs with fewer steps [26][29].
- The framework allows dynamic step selection based on the application context, making it suitable for both real-time feedback and high-quality offline rendering [28].
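The "long jump plus self-evaluation" loop described above can be illustrated with a deliberately tiny toy. This is not Self-E's actual training objective, only a sketch of the control flow: a model generates in one long jump and then improves from its own evaluation of where it landed. The names `TARGET`, `landing_point`, and `self_evaluate` are all invented for this sketch.

```python
import random

# Toy setup: the "data distribution" is a point mass at TARGET, and the
# "model" is a single scalar parameter: where a one-jump generation lands.
TARGET = 3.0

def landing_point(theta, noise):
    """One long-distance jump: map a noise sample straight to an output."""
    return theta + 0.1 * noise          # small residual noise in the landing

def self_evaluate(x):
    """Self-evaluation signal: squared distance of the landing from the data."""
    return (x - TARGET) ** 2

theta = 0.0                              # model starts far from the data
lr = 0.2
random.seed(0)
for step in range(200):
    noise = random.gauss(0.0, 1.0)
    x = landing_point(theta, noise)      # generate in a single step
    grad = 2.0 * (x - TARGET)            # feedback from self-evaluation
    theta -= lr * grad                   # improve future landings

print(round(theta, 2))                   # ends up close to TARGET
```

The point of the sketch is the closed loop: the same model both produces the landing and scores it, so no pre-trained teacher is needed.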
Hands-on with Quark's "Thousand Questions Highlighting" shortcut commands: 7 "evil master" prompts worth saving
机器之心· 2026-01-15 03:52
Core Viewpoint
- The article discusses the challenges of using AI to understand complex information effectively, emphasizing that well-structured prompts are key to better AI responses [6][8].

Group 1: AI Interaction Challenges
- Many users struggle to communicate effectively with AI, leading to frustration and doubts about AI's intelligence [5][6].
- The quality of AI responses often depends on the clarity and structure of the user's prompts, suggesting that refined instructions can significantly improve outcomes [6][10].

Group 2: Quark AI Browser Features
- The Quark AI Browser has introduced a feature called "Thousand Questions Highlighting," which lets users create custom shortcut commands for frequently used prompts, streamlining interaction [8][10].
- Users can set up specific commands for tasks such as translation and content optimization, making it easier to get precise results without repetitive input [11][12].

Group 3: Practical Applications of AI Prompts
- The article highlights several effective prompt categories, such as "Evil Master" prompts that push the AI to ask for the information it needs before attempting a task [15][16].
- A "Human Language Translator" prompt is suggested for simplifying complex academic papers, giving users clear explanations [25][27].
- A "Citation Source Finder" prompt helps quickly identify relevant research sources, significantly reducing time spent on literature review [30][33].

Group 4: Content Creation Enhancements
- Content creators can use tailored prompts for different platforms, matching tone and style to each audience and thereby improving engagement [35][39].
- Specific prompts for platforms such as Xiaohongshu and Weibo demonstrate how to adapt content for different social-media environments [39][42].

Group 5: Future of AI Browsers
- The Quark AI Browser aims to evolve into a comprehensive application, integrating multiple AI models and supporting multimodal inputs to enhance user experience and functionality [45][46].
- Its capabilities are designed to create a seamless workflow, letting users complete tasks more efficiently and effectively [48][51].
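The shortcut-command idea, reusable named prompt templates filled in with the highlighted text, can be sketched in a few lines. The command names and template wording below are invented examples, not Quark's actual commands:

```python
# Minimal sketch of highlight-to-prompt shortcuts: each command maps a
# name to a prompt template; the user's selected text is substituted in.
SHORTCUTS = {
    "translate":   "Translate the following text into English, keeping technical terms: {text}",
    "plain":       "Explain the following passage in plain language for a non-expert: {text}",
    "find_source": "List the kinds of sources that could support this claim: {text}",
}

def build_prompt(command: str, selected_text: str) -> str:
    """Expand a saved shortcut into a full prompt for the model."""
    template = SHORTCUTS[command]
    return template.format(text=selected_text.strip())

prompt = build_prompt("plain", "  The attention mechanism is permutation-equivariant.  ")
print(prompt)
```

The value of the pattern is that a carefully written instruction is authored once and then triggered by a short name, instead of being retyped for every query.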
Confirmed: Tsinghua Yao Class alumnus Lijie Chen joins OpenAI full-time, retaining his Berkeley faculty position
机器之心· 2026-01-15 03:52
Machine Heart editorial team

According to verification by 机器之心, Tsinghua University "Yao Class" alumnus and UC Berkeley assistant professor Lijie Chen (陈立杰) has officially joined OpenAI.

People familiar with the matter say Chen has joined OpenAI full-time to conduct research. At the same time, his current status at Berkeley is On Leave: he retains his faculty position at the university and has not resigned.

Chen is a leading young scholar in theoretical computer science. He completed his undergraduate degree in Tsinghua's Yao Class and his PhD at MIT, and has an outstanding academic record in areas including computational complexity theory. As of now, neither his personal homepage nor his LinkedIn profile has been updated.

From IOI gold medal to Berkeley assistant professor

Chen attended Hangzhou Foreign Language School. He was a standout, widely known competitor in informatics olympiad (OI) circles. In 2011 he won a gold medal at the National Olympiad in Informatics (NOI); in 2013 he represented the Chinese team at the 25th International Olympiad in Informatics (IOI), where he not only won a gold medal but also achieved the top score worldwide.

After entering Tsinghua's Yao Class, Chen gradually shifted his focus from competitive programming to theoretical computer science research. In 2016 he received Tsinghua's Special Scholarship for undergraduates. At the scholarship defense, Chen declared a grand ambition: "In my lifetime ...
In the agent era, why is a multimodal data lake a must-have?
机器之心· 2026-01-15 00:53
Core Viewpoint
- The year 2025 is anticipated to be remembered as the dawn of the AI industrial era, with many companies racing to invest in AI applications and agent development, but the true competition lies beyond application-level advances [1][4].

Group 1: AI Infrastructure and Data Management
- The foundation for AI applications is robust data infrastructure, which is crucial for building genuine competitive advantages [3][8].
- Companies need the capability to handle multimodal data: the real benefits of the AI era lie not in merely possessing state-of-the-art models but in the ability to continuously manage and nurture them [9][18].
- The industry is entering the "second half" of AI, where the focus shifts to how AI should be used and how real progress should be measured, requiring a change of mindset [4][5].

Group 2: Multimodal Data Lakes
- Building multimodal data lakes is becoming a prerequisite for competing in the agent race, since it turns previously dormant unstructured data into usable competitive assets [14][21].
- IDC predicts that by 2025 over 80% of enterprise data will be unstructured, highlighting the need to awaken this data to build competitive strength in the agent era [16][19].
- The transition from traditional data lakes to multimodal data lakes is critical: it lets companies manage and utilize diverse data types effectively, driving business intelligence and operational efficiency [12][22].

Group 3: Data Infrastructure Evolution
- The evolution of data infrastructure proceeds in three stages: overcoming computing bottlenecks, integrating models into data pipelines, and implementing comprehensive data governance [30][31][33].
- The first stage breaks through computing limitations by adopting heterogeneous architectures that support both CPU and GPU, so data can be processed quickly and efficiently [30].
- The second stage integrates pre-trained large models into data workflows, automatically converting multimodal data into formats usable by AI applications [31][32].
- The final stage aims for unified data governance, improving the management and activation of data assets while ensuring compliance and security [33][34].

Group 4: Strategic Recommendations for Companies
- Companies should transform their data infrastructure from a "storage center" into a "value center," ensuring data can be quickly accessed and understood by AI models [38][39].
- The focus should stay on practical business applications, avoiding the trap of computational power that does not translate into business value [40][41].
- A modular, open data infrastructure is essential for adapting to future uncertainty, allowing smooth upgrades as technologies evolve [43][44][45].

Group 5: Industry Applications and Impact
- Multimodal data lakes have already delivered significant gains across industries, such as a 20-fold performance increase in a smart-driving company's model training and a 90% efficiency boost in content production at a leading media company [51][59].
- These examples illustrate the necessity of multimodal data strategies for unlocking intelligent transformation across diverse sectors [52][56].
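The second stage above, models embedded directly in the data pipeline, amounts to routing each raw file to a modality-specific processor and emitting a governed, queryable record. A minimal sketch follows; the processor functions are invented placeholders standing in for real captioning or embedding models:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Record:
    """A governed data-lake entry: raw file reference plus model-derived metadata."""
    path: str
    modality: str
    features: dict = field(default_factory=dict)

# Placeholder "models in the pipeline": real systems would call captioning
# or embedding models here instead of formatting strings.
def process_image(path: str) -> dict:
    return {"caption": f"caption({path})"}

def process_text(path: str) -> dict:
    return {"summary": f"summary({path})"}

PROCESSORS: dict[str, Callable[[str], dict]] = {
    "jpg": process_image, "png": process_image, "txt": process_text,
}

def ingest(path: str) -> Record:
    """Route a raw file to its modality-specific processor."""
    ext = path.rsplit(".", 1)[-1].lower()
    modality = "image" if ext in ("jpg", "png") else "text"
    processor = PROCESSORS.get(ext, process_text)
    return Record(path=path, modality=modality, features=processor(path))

lake = [ingest(p) for p in ["cam/frame_001.jpg", "docs/policy.txt"]]
print([(r.modality, sorted(r.features)) for r in lake])
```

The design point is that the lake stores structured, model-enriched records rather than opaque blobs, which is what makes the data retrievable by downstream agents.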
Are large models growing brains? Research finds LLM middle layers spontaneously mirror the human brain's evolved organization
机器之心· 2026-01-15 00:53
Core Insights
- The article discusses the emergence of a "Synergistic Core" structure in large language models (LLMs), similar to the human brain's organization [1][2][17].
- The research indicates that this structure is not inherent to the Transformer architecture but develops through the learning process [18][19].

Model Analysis
- Researchers used the Partial Information Decomposition (PID) framework to analyze models including Gemma, Llama, Qwen, and DeepSeek, revealing strong synergistic processing in the middle layers, while lower and upper layers exhibited redundancy [5][6][8].
- The study involved cognitive tasks across six categories, with models generating responses whose activation values were then analyzed [9][10].

Experimental Methodology
- The Integrated Information Decomposition (ID) framework was applied to quantify interactions between attention heads, yielding a Synergy-Redundancy Rank that indicates whether components aggregate signals independently or integrate them deeply [12][13].

Findings on Spatial Distribution
- The experiments revealed a consistent "inverted-U" curve in the distribution of synergy across different model architectures, indicating a common organizational pattern [14].
- This pattern suggests that synergistic processing may be a computational necessity for advanced intelligence, paralleling the human brain's structure [17].

Core Structure Characteristics
- The "Redundant Periphery" consists of early and late layers with low synergy, handling basic tasks, while the "Synergistic Core" in the middle layers shows high synergy, crucial for advanced semantic integration and reasoning [21][23].
- The Synergistic Core is identified as a hallmark of the model's capabilities, exhibiting high global efficiency for rapid information integration [23].

Validation of Synergistic Core
- Ablation experiments showed that removing high-synergy nodes caused significant performance drops, confirming the Synergistic Core as a driving force behind model intelligence [25].
- Fine-tuning experiments showed that training focused on the Synergistic Core yielded larger performance improvements than training on redundant nodes [27].

Implications for AI and Neuroscience
- Identifying the Synergistic Core can aid in designing more efficient compression algorithms and targeted parameter updates that accelerate training [29].
- The findings suggest convergent organizational patterns in large models and biological brains, offering insights into the nature of general intelligence [29].
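The ablation logic described above, remove the highest-synergy components first and watch performance fall, can be sketched as a toy. The inverted-U synergy profile and the performance function below are invented illustrations of the reported pattern, not the paper's measurements:

```python
# Toy ablation sketch: 12 "layers" carry synergy scores following an
# inverted-U profile (middle layers highest). Each layer contributes to
# "performance" in proportion to its synergy, so ablating the synergistic
# core should hurt more than ablating the redundant periphery.
N_LAYERS = 12
synergy = [1.0 - abs(i - (N_LAYERS - 1) / 2) / (N_LAYERS / 2) for i in range(N_LAYERS)]

def performance(removed: set[int]) -> float:
    """Toy performance: total synergy of the layers still present."""
    return sum(s for i, s in enumerate(synergy) if i not in removed)

k = 3
by_synergy = sorted(range(N_LAYERS), key=lambda i: synergy[i], reverse=True)
drop_core      = performance(set(by_synergy[:k]))    # ablate synergistic core
drop_periphery = performance(set(by_synergy[-k:]))   # ablate redundant periphery

print(drop_core < drop_periphery)   # core ablation hurts more
```

In the toy, as in the reported experiments, ranking components by synergy before ablating is what separates the two outcomes; random ablation would blur the contrast.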
A new end-to-end autonomous-driving SOTA | KnowVal: an intelligent driving system that understands law and ethics and has values
机器之心· 2026-01-14 07:18
Core Viewpoint
- The article discusses KnowVal, an advanced autonomous-driving system that integrates perception and knowledge retrieval to enhance visual-language reasoning, aiming for higher levels of automated driving [4][21].

Group 1: System Overview
- KnowVal is a novel autonomous-driving system that combines perception modules with knowledge-retrieval modules to achieve visual-language reasoning [4].
- The system constructs a comprehensive driving knowledge graph covering traffic regulations, defensive-driving principles, and ethical considerations, supported by an efficient retrieval mechanism based on large language models [4][15].
- KnowVal integrates a world model and a value model within its planner to ensure value-aligned decision-making [4][11].

Group 2: Technical Framework
- The system employs an open 3D perception and knowledge-retrieval framework, extending the traditional visual-language paradigm to support basic visual-language reasoning [7][9].
- It combines specialized autonomous-driving perception with open-world 3D perception to extract both common and rare instance features, ensuring effective feature transfer throughout the system [9].
- Knowledge-graph retrieval processes perception outputs in natural language to access relevant knowledge entries, ranked by relevance [10][15].

Group 3: Value Model and Trajectory Planning
- KnowVal's trajectory planning is based on a world-prediction model and a value model: it iteratively generates candidate trajectories and evaluates them against retrieved knowledge for value assessment [11][19].
- A large-scale driving value-preference dataset of 160,000 trajectory-knowledge pairs, annotated with value scores ranging from -1 to 1, was created to train the value model [19].

Group 4: Experimental Results
- Tested against the baselines GenAD, HENet++, and SimLingo, KnowVal achieved the lowest collision rate on the nuScenes dataset and the highest driving score and success rate on the Bench2Drive benchmark [21].
- The results indicate that KnowVal outperforms existing end-to-end and visual-language-action models, demonstrating effectiveness in real-world driving scenarios [21][22].

Group 5: Qualitative Analysis
- Qualitative examples illustrate KnowVal's adherence to legal and ethical driving behaviors, such as slowing down in wet conditions and obeying lane-change regulations in tunnels [23][25].
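The planner loop in Group 3, generate candidate trajectories, score each against retrieved knowledge on a [-1, 1] scale, and pick the best, can be sketched as follows. The rules and scores here are invented stand-ins for the learned value model and retrieved knowledge entries:

```python
# Toy sketch of value-aligned trajectory selection. Each "knowledge entry"
# is a scoring rule; a real system would use a learned value model trained
# on trajectory-knowledge preference pairs instead of hand-written rules.
KNOWLEDGE = {
    "wet_road_slow_down":       lambda traj: 0.8 if traj["speed"] <= 40 else -0.6,
    "no_lane_change_in_tunnel": lambda traj: -1.0 if traj["lane_change"] else 0.5,
}

def value_score(traj: dict) -> float:
    """Average knowledge-conditioned score, clamped to [-1, 1]."""
    scores = [rule(traj) for rule in KNOWLEDGE.values()]
    avg = sum(scores) / len(scores)
    return max(-1.0, min(1.0, avg))

candidates = [
    {"name": "A", "speed": 60, "lane_change": True},
    {"name": "B", "speed": 38, "lane_change": False},
]
best = max(candidates, key=value_score)
print(best["name"])   # the slower, lane-keeping trajectory wins
```

The essential structure is that the planner proposes, the value model disposes: candidate generation and value assessment are separate stages, so the same knowledge base can re-rank any set of proposals.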
Sebastian Raschka's 2026 predictions: Transformers still reign, but diffusion models are quietly on the rise
机器之心· 2026-01-14 07:18
Core Insights
- The article surveys the evolving landscape of large language models (LLMs) as of 2026, highlighting a shift of attention from the dominance of the Transformer architecture toward efficiency and hybrid architectures [1][4][5].

Group 1: Transformer Architecture and Efficiency
- The Transformer architecture is expected to remain the foundation of the AI ecosystem for at least the next few years, supported by mature toolchains and optimization strategies [4].
- Recent developments point toward hybrid architectures and efficiency improvements rather than a complete overhaul of existing models [5].
- The industry increasingly focuses on mixed architectures and efficiency, as demonstrated by models like DeepSeek V3 and R1, which use mixture of experts (MoE) and multi-head latent attention (MLA) to reduce inference costs while maintaining large parameter counts [7].

Group 2: Linear and Sparse Attention Mechanisms
- The standard Transformer attention mechanism has O(N^2) complexity, so computational cost grows quadratically with context length [9].
- New models like Qwen3-Next and Kimi Linear adopt hybrid strategies that combine efficient linear layers with full attention layers to balance long-distance dependencies against inference speed [14].

Group 3: Diffusion Language Models
- Diffusion language models (DLMs) are gaining attention for generating tokens quickly and cheaply through parallel generation, in contrast to the serial generation of autoregressive models [12].
- Despite these advantages, DLMs face challenges in integrating tool calls within response chains because of their simultaneous-generation nature [15].
- Research indicates that DLMs may outperform autoregressive models when high-quality data is scarce, since they can benefit from multiple training epochs without overfitting [24][25].

Group 4: Data Scarcity and Learning Efficiency
- The "Crossover" concept suggests that while autoregressive models learn faster with ample data, DLMs excel when data is limited, achieving significant benchmark accuracy with relatively small datasets [27].
- DLMs demonstrate that more training epochs need not degrade downstream task performance, offering a potential answer to an era of data scarcity [28].
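The O(N^2) vs. linear-attention contrast in Group 2 can be made concrete with an operation count. Standard attention materializes an N x N score matrix, while feature-map linear attention reorders the computation so the N x N matrix never appears; the cost functions below count multiply-accumulates for each ordering and are a complexity illustration, not any specific production kernel:

```python
# Standard attention: scores = Q @ K^T (N x N), then scores @ V  -> O(N^2 * d).
# Linear attention:   S = phi(K)^T @ V (d x d), then phi(Q) @ S  -> O(N * d^2).
def standard_attention_cost(n: int, d: int) -> int:
    """Multiply-accumulates for Q K^T plus (Q K^T) V."""
    return n * n * d + n * n * d

def linear_attention_cost(n: int, d: int) -> int:
    """Multiply-accumulates for K^T V plus Q (K^T V)."""
    return n * d * d + n * d * d

d = 64
for n in (1_000, 10_000, 100_000):
    ratio = standard_attention_cost(n, d) / linear_attention_cost(n, d)
    print(f"N={n:>7}: standard/linear cost ratio = {ratio:.2f}x")
```

The ratio works out to N/d, which is why the gap widens as context length grows and why hybrid models reserve full attention for only a fraction of their layers.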
After Unitree, Zivariable is the only startup backed by three major tech companies: an embodied model is not DeepSeek stuffed into a robot
机器之心· 2026-01-14 07:18
Core Viewpoint
- The article discusses the evolution of embodied intelligence, arguing that the next battleground will be the robot "brain," which is crucial for autonomous operation in the physical world [1][4].

Group 1: Investment and Development
- Zivariable has recently raised $1 billion in funding from ByteDance and Sequoia, indicating strong investor interest in its approach to robotic intelligence [1].
- Zivariable focuses on developing a foundational model for physical intelligence that operates independently of existing AI models, aiming for a paradigm shift in how robots interact with the physical world [7][12].

Group 2: Challenges in Embodied Intelligence
- The complexity of physical tasks requires robots to have a brain backed by a physical-world foundational model, which is distinct from merely repurposing existing AI models [1][4].
- Current AI models struggle to understand subtle physical differences that only become apparent through real-world interaction, highlighting the need for a model that can process long action sequences and understand causality over time [6][7].

Group 3: Model Development Approach
- Zivariable advocates an end-to-end architecture that allows holistic understanding of physical interactions, in contrast to the modular approach, which often loses critical details [9][10].
- The company stresses the importance of a general-purpose model that learns the common structures of the physical world, much as language models have done [11].

Group 4: Unique Characteristics of Zivariable
- Zivariable is committed to in-house research, particularly on foundational models, believing that the next phase of competition in embodied intelligence will center on the ability to build data loops and evolve models [15][16].
- The company has developed two core models, WALL-A and WALL-OSS, which integrate multiple aspects of embodied intelligence and have been deployed in real-world scenarios [16][13].

Group 5: The Path Forward
- Building a physical-world foundational model is likened to retracing the developmental path of human infants, since it involves learning complex physical interactions that are not easily put into words [22].
- Zivariable's journey is described as long and difficult but ultimately rewarding, aiming to redefine what robots can do in the physical world [23].
Just 10 days? The code for Anthropic's newest agent, Cowork, was all written by Claude
机器之心· 2026-01-14 05:37
Editors: 冷猫, +0

Recently, Anthropic released a new agent tool, Cowork, which is said to let ordinary users handle non-technical tasks as easily as developers use Claude Code.

Even more striking: Cowork was built in just a week and a half.

According to the official introduction, Cowork's capabilities include, but are not limited to: automatically organizing your Downloads folder, generating spreadsheets from screenshots, drafting reports from scattered notes, and even connecting to existing tools such as Google Calendar to generate documents or presentations directly.

According to Claude Code creator Boris Cherny, all of Cowork's code was written by Claude Code. It is hard to imagine better advertising for Claude Code: while other AI companies are still building their ecosystems through acquisitions, Anthropic has begun letting AI build AI.

Cowork is a simplified version of Claude Code designed for ordinary users. Currently a research preview, it is available only to Claude Max subscribers on the macOS desktop client. Users simply authorize access to specific folders, and the AI can then read, edit, or create files autonomously via natural-language instructions. It can make plans, execute tasks in parallel, update progress in real time, and invite the user to participate and give guidance. ...