机器之心
A 17-Year-Old High Schooler Solves an Open Math Problem with AI, Drawing Praise from Terence Tao and Jeff Dean
机器之心· 2026-01-25 04:01
Editor | Yang Wen

Your childhood and mine were apparently not the same. At 17, I was stuck in a classroom grinding through math worksheets; meanwhile, a high schooler named Enrique Barschkis used his breaks between classes to solve Erdős Problem #347, which had stumped mathematicians for years. The achievement not only sparked heated discussion on X, but also drew high praise from Google Chief Scientist Jeff Dean.

What is Erdős Problem #347? First posed by Erdős and Graham in 1980, it asks: does there exist an integer sequence in which the ratio of consecutive terms tends to 2, such that for every cofinite subsequence, the set of its finite subset sums has density 1 in the natural numbers? The problem goes to the heart of the theory of complete sequences in number theory; its difficulty lies in guaranteeing, under a strict growth-rate constraint, that almost all sufficiently large positive integers can be written as sums of terms from the sequence.

Last October, the renowned mathematician and Fields Medalist Terence Tao, posting in the discussion section of the Erdős Problems website, used ChatGPT to search the literature and turned up an old paper by Burr and Erdős. However, the mathematician 沃特 soon noticed that the result in that paper relied on a condition on the ratio of consecutive terms slightly different from the one this problem requires. Tao then proposed a clever construction: split the sequence into ...
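The problem statement above can be put in symbols. The following is a hedged formal paraphrase of the summary's wording, not the official statement on the Erdős Problems site:

```latex
% Formal paraphrase of Erdős Problem #347 (Erdős--Graham, 1980),
% reconstructed from the summary above; notation is my own.
\textbf{Question.} Does there exist an increasing sequence of positive
integers $a_1 < a_2 < \cdots$ with
\[
  \lim_{n \to \infty} \frac{a_{n+1}}{a_n} = 2,
\]
such that for every cofinite subsequence $A' \subseteq (a_n)$, the set of
finite subset sums
\[
  S(A') = \Bigl\{\, \sum_{a \in F} a \;:\; F \subseteq A',\ F \text{ finite} \,\Bigr\}
\]
has natural density $1$ in $\mathbb{N}$?
\]
```

Here "cofinite subsequence" means the sequence with finitely many terms removed, matching the article's remark that almost all sufficiently large integers must remain representable even after deleting finitely many terms.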
A First from China: A Multimodal Tactile Sensor Fused with a Language Model Pushes Robot Touch Toward Human-Level Perception
机器之心· 2026-01-25 04:01
The paper's first authors are Li Shoujie (Tsinghua PhD, now a postdoc at Nanyang Technological University), Tsinghua PhD student Wu Tong, and AI master's student Xu Jianle. Corresponding authors include Associate Professor Ding Wenbo of Tsinghua Shenzhen International Graduate School, Professor Xie Zhaoqian of Dalian University of Technology, Assistant Professor Wu Changsheng of the National University of Singapore, and Professor Yu Xinge of City University of Hong Kong.

As robotics crosses from "executing preset programs" to "embodied intelligent interaction," tactile sensing, the core modality for understanding object properties and performing fine manipulation, grows ever more important. Yet current systems still fall far short of humans in sensing dimensionality, resolution, and signal interpretation, leaving robots in a state of "feeling without understanding."

Against this backdrop, the team of Ding Wenbo at Tsinghua Shenzhen International Graduate School, together with Xspark AI and several research institutions at home and abroad, drew inspiration from pigeons' remarkable multispectral vision and non-imaging sensing mechanisms to develop a biomimetic multimodal tactile sensor, SuperTac.

The system integrates multispectral imaging, triboelectric sensing, and inertial measurement, and by building an 8.5B-parameter tactile language model, DOVE, it takes tactile signals from low-level perception to high-level semantic reasoning.

The work appears as the cover article of the inaugural issue of Nature Sensors, the first paper in that journal with a Chinese institution as first affiliation, marking a key step toward "human-level" robot tactile perception. Paper title: Biomimetic m ...
Google's Batch of Unprofitable AI Gadgets Teaches the Tech World a Lesson
机器之心· 2026-01-25 02:35
Core Viewpoint
- Google's Arts & Culture projects embody the philosophy of "uselessness" leading to significant value, as they prioritize cultural exploration over direct commercial benefits [2][51]

Group 1: Project Overview
- Google Arts & Culture features various artistic experiments that do not generate direct revenue, such as the Botanic Atlas, which maps over 30,000 plant species [9]
- The Learning Light project offers virtual tutorials on lighting principles in art and design [11]
- Art Palette allows users to upload images and find matching artworks from over 1,500 cultural institutions based on color analysis [13][24]

Group 2: Interactive Experiences
- "Don't Touch the Art" is a humorous game that turns the museum's no-touch policy into a playful experience, enhancing art appreciation [29][31]
- "One Sound, Two Frames" challenges users to match AI-generated music with corresponding artworks, blurring the lines between visual and auditory art [36][40]
- "Musical Canvas" enables users to create art while generating matching music, making art creation accessible to those without musical training [43][46]

Group 3: Cultural Significance
- These projects collectively aim to bridge the gap between art and the public, allowing for more engaging interaction with culture [50]
- The existence of these initiatives reflects a commitment to cultural exploration without the pressure of profitability, highlighting the importance of art in society [51][54]
Rejecting Reward Hacking: HKUST and Kuaishou's Kling Propose an Efficient New RL Post-Training Paradigm for Diffusion Models
机器之心· 2026-01-25 02:35
Core Insights
- The article discusses the challenges of using Reinforcement Learning (RL) to fine-tune diffusion models like Stable Diffusion, particularly the issue of "Reward Hacking," which can degrade image quality [2][5]
- A new framework called GARDO (Gated and Adaptive Regularization with Diversity-aware Optimization) is introduced, which aims to prevent Reward Hacking while enhancing sample exploration and generation diversity [2][12]

Background and Motivation
- RL has shown promising results in visual tasks, but defining an ideal reward function is challenging, often leading to the use of proxy rewards that invite Reward Hacking [4][5]
- The article highlights the pitfalls of RL post-training, including low sample efficiency and exploration hindered by static reference models [9][10]

GARDO Framework
- GARDO addresses Reward Hacking through three core mechanisms:
  1. a Gated KL Mechanism, which applies KL regularization only when the model generates samples in unreliable reward regions [14][15]
  2. an Adaptive Regularization Target, which updates the reference model to prevent optimization stagnation [17]
  3. Diversity-Aware Advantage Shaping, which encourages diversity in generated samples to avoid mode collapse [18][19]

Experimental Results
- GARDO has been tested on various base models (SD3.5-Medium, Flux.1-dev) and demonstrated significant advantages over baseline methods like Flow-GRPO [20][21]
- The framework effectively prevents Reward Hacking while maintaining high image quality and sample efficiency, achieving better performance with fewer training steps [22][23]

Emergent Behavior
- GARDO can generate a higher number of objects in challenging tasks, indicating its potential to unlock new capabilities in visual generation [24][25]

Conclusion
- The work argues that precise control matters more than rigid constraints when applying RL to visual generation, making GARDO a valuable framework for researchers and developers fine-tuning diffusion models with RL [27]
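The summary names GARDO's gated-KL idea but gives no equations. Below is a minimal sketch of what such a gate could look like, assuming a per-sample KL estimate and a reward-based trust threshold; all names and the gating rule are illustrative assumptions, not GARDO's actual formulation:

```python
import torch

def gated_kl_loss(logp_policy, logp_ref, proxy_reward,
                  gate_threshold, kl_coef=0.1):
    """Sketch of a gated KL penalty: regularize toward the frozen
    reference model only for samples whose proxy reward is suspiciously
    high (a region where the reward model is presumed unreliable).

    Argument names and the thresholding rule are illustrative
    assumptions, not GARDO's published formulation.
    """
    # Simple per-sample KL estimate between policy and reference.
    kl = logp_policy - logp_ref
    # Gate: penalize only where the proxy reward exceeds the trust
    # threshold (suspected reward-hacking region); elsewhere the
    # policy explores freely.
    gate = (proxy_reward > gate_threshold).float()
    return kl_coef * (gate * kl).mean()
```

The point of the gate is that a globally applied KL penalty suppresses exploration everywhere, while a gated one only activates where the proxy reward is likely being exploited.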
Beyond Prompts: Unveiling "Neural Network Reprogrammability"
机器之心· 2026-01-24 04:09
Core Viewpoint
- The article discusses the evolution of model adaptation techniques in the context of large pre-trained models, emphasizing a shift from parameter-centric adaptation to reprogrammability-centric adaptation, which allows efficient task adaptation without modifying model parameters [5][9]

Group 1: Transition in Model Training Paradigms
- The adaptation paradigm has fundamentally shifted from traditional parameter adjustment to a focus on model reprogrammability, enabling the reuse of pre-trained models across various tasks with minimal computational overhead [5][9]
- The new approach emphasizes modifying the task presentation rather than the model itself, allowing a single frozen model to handle multiple tasks by changing the interaction method [9]

Group 2: Efficiency Advantages of Reprogrammability
- Empirical data shows that reprogrammability-centric adaptation (RCA) significantly outperforms parameter-centric adaptation (PCA) in parameter efficiency, requiring 2-3 orders of magnitude fewer parameters for task adaptation [11][12]
- RCA enables adaptation in resource-constrained environments and supports simultaneous adaptation to multiple tasks without catastrophic forgetting, making it increasingly relevant as pre-trained models grow in scale and complexity [12]

Group 3: Terminology and Framework
- The article identifies a terminological confusion in the research community, where similar adaptation methods are referred to differently across fields, such as "prompt tuning" in NLP and "model reprogramming" in the machine learning literature [14]
- Despite the different names, these methods fundamentally leverage the same property of neural networks, reprogrammability, leading to the proposal of a unified framework that connects these disparate research areas [14][17]

Group 4: Mathematical Expression of Reprogrammability
- The article provides a mathematical framework for neural network reprogrammability, defining how a fixed pre-trained model can be adapted to new tasks through configurable transformations without changing the model's parameters [25][34]

Group 5: Case Studies of Reprogrammability
- The article illustrates three methods of reprogramming using a vision-language model, highlighting how each achieves the same goal of reusing a frozen model for new tasks through different computational paths [27][30]
- Input manipulation and output alignment are key components of these methods, allowing effective task adaptation without additional training parameters [30][32]
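As a concrete illustration of the "input manipulation + output alignment" pattern described above: a frozen classifier is reused for a new task by learning only an additive input perturbation and a source-to-target label mapping. Everything here (the random linear stand-in "model," the label grouping) is a toy assumption, not the paper's actual case study:

```python
import numpy as np

rng = np.random.default_rng(0)

# A frozen "pre-trained" model: a fixed random linear classifier over
# 10 source classes. Stand-in for a real frozen network (assumption).
W = rng.standard_normal((10, 64))

def frozen_model(x):
    return W @ x  # logits over the 10 source classes; W never changes

# Reprogramming: adapt to a 3-class target task WITHOUT touching W.
delta = np.zeros(64)            # learnable input transformation
label_map = {0: [0, 1, 2],      # output alignment: several source
             1: [3, 4, 5],      # classes are grouped onto each
             2: [6, 7, 8]}      # target class

def reprogrammed_predict(x):
    logits = frozen_model(x + delta)                 # input manipulation
    scores = [logits[idx].sum() for idx in label_map.values()]
    return int(np.argmax(scores))                    # output alignment
```

In a real reprogramming setup only `delta` (and possibly the label mapping) would be optimized on target-task data, which is why the adapted parameter count is orders of magnitude below full fine-tuning.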
Valued at $3.5 Billion: LeCun's Startup Announces Its Core Direction, Stirring a "Revolt" Against the Next-Token Paradigm
机器之心· 2026-01-24 04:09
Machine Heart Editorial Department

Ever since Turing Award laureate Yann LeCun left Meta to found AMI Labs (Advanced Machine Intelligence), the new company has drawn intense industry attention. This week it finally confirmed its core direction: developing so-called "world models" to build intelligent systems that can understand the real world.

Official site: https://amilabs.xyz/

LeCun has long been skeptical of the current direction of large language models, arguing that generative models that merely predict the next token cannot truly understand the real world. He proposes a different path, the world model: a new AI architecture that accurately reflects real-world dynamics. Such a system should possess four key capabilities:

- understanding the real world;
- persistent memory;
- the ability to reason and plan;
- controllability and safety.

Behind this vision lies a direct challenge to a core limitation of today's large-model route.

Notably, LeCun has also begun to exert broader influence along another technical route: Silicon Valley startup Logical Intelligence recently appointed him founding chair of its technical research committee.

Real-world data comes mainly from cameras and sensors of all kinds; it is continuous, high-dimensional, and noisy. Over the past few years ...
Challenging Claude Code? With OpenAI's Codex Release Month Approaching, a First Look Inside the Agent Loop
机器之心· 2026-01-24 04:09
Core Insights
- OpenAI CEO Sam Altman announced an upcoming series of exciting releases related to Codex, particularly emphasizing cybersecurity [1]
- OpenAI released a technical blog titled "Unrolling the Codex Agent Loop," which details the core architecture of Codex CLI and its functionality [3][4]

Group 1: Codex Overview
- Codex CLI is a cross-platform local software agent developed by OpenAI that can generate high-quality software changes [7]
- OpenAI has accumulated significant experience in building world-class software agents since the initial release of the CLI in April [8]

Group 2: Agent Loop Mechanism
- The agent loop is the core logic of Codex CLI, coordinating interactions between users, models, and tools to execute software tasks [10]
- The agent loop consists of several steps: input acquisition, inference, decoding, decision-making, execution, and retry, until a final response is generated [16][17]

Group 3: Model Inference and API Interaction
- Codex CLI runs model inference by sending HTTP requests to the Responses API, which drives the agent loop [22][23]
- The Responses API endpoints used by Codex CLI are configurable, allowing integration with various implementations [24][25]

Group 4: Prompt Construction
- The initial prompt for the Responses API is constructed from user inputs and various roles, including system, developer, user, and assistant [28][30]
- Codex appends user messages to the input after constructing the initial prompt, starting the dialogue [33]

Group 5: Performance Considerations
- The JSON payload sent to the Responses API can grow quadratically over a conversation, but Codex currently avoids the previous_response_id parameter to stay stateless [51]
- Prompt caching is crucial for efficiency, allowing Codex to reuse previous inference results and reduce computational cost [52][53]

Group 6: Context Management
- Codex compresses the dialogue once the token count exceeds a certain threshold, replacing the input with a smaller representation so the conversation can continue [58][59]
- The Responses API has evolved to support a compact endpoint for more efficient dialogue compression [58]

Group 7: Future Directions
- The blog introduces the Codex agent loop and discusses practical considerations for developers building agent loops on top of the Responses API [61]
- Future articles will delve deeper into the CLI's architecture, exploring tool-invocation implementations and the Codex sandbox model [63]
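The loop described above (input acquisition → inference → decision-making → execution → retry) can be sketched generically. This is a toy shape under stated assumptions, not Codex CLI's actual implementation (which is not in Python); the message format and field names are invented for illustration:

```python
def agent_loop(model, tools, user_input, max_turns=20):
    """Minimal sketch of an agent loop: gather input, run inference,
    then either execute a requested tool call and loop, or return the
    final response. All interfaces here are illustrative assumptions,
    not Codex CLI's real message schema.
    """
    history = [{"role": "user", "content": user_input}]  # input acquisition
    for _ in range(max_turns):
        reply = model(history)                 # inference + decoding
        if reply.get("tool_call"):             # decision-making
            name, args = reply["tool_call"]
            result = tools[name](*args)        # execution
            history.append({"role": "tool", "content": result})
            continue                           # retry with new context
        return reply["content"]                # final response
    raise RuntimeError("agent loop exceeded max_turns")
```

Note that `history` only ever grows, which is why (per Group 5 above) the payload sent to the API can grow quadratically over a long conversation and why prompt caching and compaction matter.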
Book the January 28 Livestream: The Science and the Chaos of Embodied AI Evaluation
机器之心· 2026-01-24 03:02
Over the past year we have seen stunning robot demos almost every week: robots folding laundry, making coffee, performing all kinds of dances. But behind the boom, one question keeps being raised: how do we actually tell whether an embodied model has genuinely improved?

Embodied evaluation is the "measuring stick" of the embodied-intelligence industry, a necessary step on the road from the lab to industrialization. Yet once systems leave the lab and face the complexity, variability, and uncertainty of the real world, those supposedly near-perfect success rates often "shrink" in an instant. "Easy to top the leaderboard, hard to deploy" has become the sword of Damocles hanging over the commercialization of embodied intelligence.

The livestream opens at 19:00 on the evening of Wednesday, January 28 (2026/01/28, 19:00-20:00). The roundtable is hosted by Zhao Yunfeng, founder and CEO of Machine Heart (机器之心), joined by four invited industry and academic experts for an in-depth discussion of the real state and core challenges of embodied-intelligence evaluation.

Host: Zhao Yunfeng, founder and CEO of Machine Heart

Roundtable guests (in pinyin order):
- Fan Haoqiang (范浩强), co-founder of Dexmal
- Li Yonglu (李永露), associate professor at Shanghai Jiao Tong University and full-time mentor at the Shanghai Innovation Institute (上海创智学院)
- Shen Yujun (沈宇军), chief scientist at 蚂蚁灵波科技
- Zhao Hang (赵行), co-founder of 星海图 and assistant professor at Tsinghua University
Blockbuster Paper from LeCun and Saining Xie's Team: RAE Now Scales to Text-to-Image Generation, and Beats VAE
机器之心· 2026-01-24 01:53
Core Insights
- The article discusses the emergence of Representation Autoencoders (RAE) as a significant advancement in text-to-image diffusion models, challenging the dominance of Variational Autoencoders (VAE) [1][4][33]
- The research, led by notable scholars, demonstrates that RAE can outperform VAE in several respects, including training stability and convergence speed, while also pointing toward a unified multimodal model [2][4][33]

Group 1: RAE vs. VAE
- RAE shows superior performance in both pre-training and fine-tuning compared to VAE, particularly on high-quality data, where VAE suffers catastrophic overfitting after just 64 epochs [4][25][28]
- The RAE architecture uses a pre-trained, frozen visual representation encoder, providing a high-fidelity semantic starting point, in contrast to the lower-dimensional latents of a traditional VAE [6][11]

Group 2: Data Composition and Training Strategies
- The study highlights that merely increasing data volume is insufficient for RAE to excel at text-to-image tasks; dataset composition is crucial, particularly the inclusion of targeted text-rendering data [9][10]
- RAE's architecture allows significant design simplifications as model size increases, showing that complex structures become redundant in larger models [17][21]

Group 3: Performance Metrics and Efficiency
- RAE converges roughly four times faster than VAE, with significant improvements in evaluation metrics across model sizes [23][25]
- RAE's robustness is evident in that it maintains stable generation quality even after extensive fine-tuning, unlike VAE, which quickly memorizes training samples [28][29]

Group 4: Future Implications
- RAE's success suggests a potential shift in the text-to-image technology stack toward unified semantic modeling that integrates understanding and generation in the same representation space [29][34]
- This advancement could lead to more efficient and effective multimodal models, improving the ability to generate images that align closely with textual prompts [36]
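The RAE design described above, a frozen pre-trained representation encoder with only a decoder trained on top, can be sketched as follows. All shapes and layer choices are illustrative assumptions, not the paper's actual architecture:

```python
import torch
import torch.nn as nn

class RAESketch(nn.Module):
    """Toy sketch of a Representation Autoencoder: the encoder is a
    frozen pre-trained representation model; only the decoder learns
    to reconstruct pixels from those representations. Dimensions and
    layer choices are illustrative assumptions.
    """
    def __init__(self, encoder, rep_dim=768, img_dim=3 * 32 * 32):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():   # freeze: the encoder is
            p.requires_grad = False           # never updated
        self.decoder = nn.Sequential(         # trained reconstruction head
            nn.Linear(rep_dim, 1024), nn.GELU(), nn.Linear(1024, img_dim)
        )

    def forward(self, x):
        with torch.no_grad():
            z = self.encoder(x)   # high-dimensional semantic latent
        return self.decoder(z)    # pixels; a diffusion model would then
                                  # be trained in this z-space
```

The contrast with a VAE is that the latent space here is inherited from a semantic encoder rather than learned jointly with reconstruction, which is the property the article credits for faster convergence.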
Audio-Visual Omni-Modal Future Prediction: FutureOmni Delivers the First Report Card
机器之心· 2026-01-24 01:53
Fudan University, the Shanghai Innovation Institute (上海创智学院), and the National University of Singapore have jointly released FutureOmni, the first omni-modal benchmark for future prediction: models must predict future events from audio-visual cues, performing cross-modal causal and temporal reasoning. It contains 919 videos and 1,034 multiple-choice QA pairs; evaluation of 13 omni-modal models and 7 video-only models shows that current systems struggle markedly at predicting future events, with a best accuracy of only 64.8%.

In daily life, humans not only understand "what happened" but, more importantly, can predict "what will happen next." Seeing gathering dark clouds and hearing approaching thunder, we close the windows and bring in the laundry; seeing a teacher frown and repeatedly stress a point (audio), we know a question may be coming; seeing a player jump and hearing the crowd gasp, we can anticipate a spectacular dunk.

Yet while existing multimodal large language models (MLLMs) show strong all-around perception, their ability to predict future events from audio-visual cues remains largely unexplored. Existing audio-visual benchmarks focus mainly on retrospective understanding ("what happened in the video") rather than forward-looking prediction ("what happens next").

Now this gap is finally being filled. The FutureOmni release from Fudan University, the Shanghai Innovation Institute, and the National University of Singapore not only redefines the evaluation of multimodal models' "future prediction" ...