机器之心
ICLR 2026 | UIUC: One Line of Code Fully Solves Overthinking in LLM Reasoning!
机器之心· 2026-02-08 03:10
Core Insights
- The article introduces the Self-Aligned Reward (SAR) method, which aims to improve reasoning efficiency and accuracy in large language models by addressing the "overthinking" phenomenon observed in existing reinforcement learning frameworks [3][25].
Group 1: Introduction of DeepSeek-R1 and RLVR
- On January 20, 2025, DeepSeek released the reasoning model DeepSeek-R1, sparking significant interest in reinforcement learning methods for large models [2].
- Researchers found that simple "correct/incorrect" feedback on tasks with clear answers, an approach known as reinforcement learning with verifiable rewards (RLVR), allowed models to learn complex reasoning strategies and improved their reasoning capabilities [2].
Group 2: Limitations of RLVR
- Despite its success, RLVR has limitations, particularly the "overthinking" phenomenon, where models generate unnecessarily long and repetitive reasoning for simple questions [3].
- This reduces reasoning efficiency and increases cost, making it a critical challenge for current RLVR methods [3][4].
Group 3: Proposed Solutions and SAR
- Researchers trace overthinking to the coarse-grained reward signals in RLVR, which do not differentiate between intermediate reasoning steps [4].
- A common mitigation is to impose explicit constraints on reasoning length, such as penalizing the total number of tokens generated, but this often compromises overall accuracy [5].
- To address these challenges, researchers from the University of Illinois at Urbana-Champaign and Amazon AWS proposed the Self-Aligned Reward (SAR), which uses internal signals from the large language model itself to provide feedback on the usefulness of its reasoning [6][25].
Group 4: Characteristics of SAR
- SAR is continuous and fine-grained, allowing a more nuanced assessment of output quality than binary feedback [9].
- It avoids complex evaluation frameworks and independent reward models, which reduces implementation and training costs [10].
- SAR works directly with semantic information in the reasoning process, reflecting the effectiveness and relevance of the reasoning content [10].
Group 5: Experimental Results
- Experiments across four foundation models and seven datasets show that SAR can be integrated into mainstream reinforcement learning algorithms such as PPO and GRPO [18].
- Adding SAR led to an average accuracy improvement of approximately 4% and a reduction in output length of at least 30% compared to baselines using only RLVR [18][23].
- SAR performed stably and well across various tasks, including logical reasoning, indicating strong cross-task generalization [18].
Group 6: Conclusion and Future Implications
- The study presents SAR as a simple yet effective solution to the overthinking problem in reinforcement-learned reasoning models, improving both accuracy and computational efficiency [25].
- SAR reflects a new approach in large-model reinforcement learning: turning internal model information into continuous feedback signals for sustainable training [25].
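The difference between a binary verifiable reward, an explicit token-length penalty, and a continuous internal signal can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's formulation: the summary does not say which internal signal SAR uses, so the perplexity comparison below (how much more predictable an answer becomes once the question is given) is a hypothetical stand-in, and every function name, weight, and log-probability value is invented for illustration.

```python
import math


def verifiable_reward(predicted: str, reference: str) -> float:
    """Binary correct/incorrect signal: 1.0 on an exact answer match, else 0.0."""
    return 1.0 if predicted.strip() == reference.strip() else 0.0


def length_penalized_reward(correct: float, num_tokens: int,
                            penalty_per_token: float = 0.001) -> float:
    """Explicit length constraint: subtract a fixed cost per generated token."""
    return correct - penalty_per_token * num_tokens


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity of a sequence from its per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def internal_signal_reward(logp_given_question: list[float],
                           logp_standalone: list[float]) -> float:
    """ASSUMPTION (hypothetical signal, not confirmed by the summary):
    relative drop in answer perplexity when the question is in context,
    clipped to [-1, 1]. Higher means the answer depends more on the question."""
    ppl_given = perplexity(logp_given_question)
    ppl_alone = perplexity(logp_standalone)
    relative_drop = (ppl_alone - ppl_given) / ppl_alone
    return max(-1.0, min(1.0, relative_drop))


def combined_reward(correct: float, internal: float, weight: float = 0.2) -> float:
    """Add the continuous signal on top of the binary one (weight is illustrative)."""
    return correct + weight * internal


if __name__ == "__main__":
    correct = verifiable_reward("42", "42")
    # Made-up log-probabilities for a three-token answer.
    internal = internal_signal_reward([-0.2, -0.1, -0.3], [-1.5, -1.2, -1.8])
    print("length-penalized:", length_penalized_reward(correct, num_tokens=400))
    print("binary + internal signal:", combined_reward(correct, internal))
```

The length penalty charges every token the same amount regardless of whether it helps, which is one way to see why it can hurt accuracy; a continuous signal derived from the model's own probabilities can instead distinguish among answers that the binary reward scores identically.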
Mystery Model "Pony Alpha" Sparks Heated Discussion Overseas: Which Chinese Large Model Could It Be?
机器之心· 2026-02-08 03:10
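By the 机器之心 editorial team
Over the past couple of days, the overseas internet has been asking one question: who is behind the mysterious model that ranks first in search on OpenRouter, the global model-serving platform? The anonymous model is called "Pony Alpha". According to OpenRouter, it is a new-generation general-purpose large model that excels at coding, logical reasoning, and role-play, is optimized for agent workflows, and has very high tool-calling accuracy. For now, the model is free to use.
Whenever a capable anonymous model like this appears, the internet's urge to get to the bottom of it is ignited at once. Some dig through model parameters, some compare output styles, and some analyze replies word by word, trying to reconstruct the training data behind it from the faintest clues.
The well-known X blogger "karminski - 牙医" guesses that Pony Alpha is a Chinese model, either DeepSeek-V4 or a new GLM model. Asked to guess as one AI about another, ChatGPT named Anthropic's Claude Sonnet 5. Others think it is Grok 4.2 from xAI, which has just been merged into SpaceX. Many guess Opus 5.3 or Codex 4.6. In short, opinions differ, and everyone is waiting for an official announcement.
Pony Alpha details and example cases: According to OpenRouter ...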
Stepping Off the Screen: How Can Multimodal Smart Hardware Carry the Latest AI?
机器之心· 2026-02-08 01:30
Group 1
- The advancement of multimodal models is accelerating the penetration of artificial intelligence into real-world scenarios, with multimodal smart hardware evolving to adapt to a wider range of applications [1][4]
- The global multimodal AI market is expected to reach $10.89 billion by 2030, with a compound annual growth rate of 36.8%, driven primarily by hardware devices [1][4]
- AI smartphones are currently one of the most focused areas in smart hardware, with companies aiming to integrate AI deeply into operating systems to enhance new interaction methods [1][4][5]
Group 2
- The humanoid robot market is projected to exceed 1 billion units by 2050, with an estimated market size of $5 trillion, primarily serving industrial and commercial applications [1][5]
- Tesla plans to mass-produce its Optimus Gen 3 humanoid robot by 2026, targeting a production goal of 1 million units by 2030 [1][5]
- Smart glasses are becoming a key medium for different manufacturers to compete for interaction sovereignty, with significant funding flowing into the sector [1][5][6]
Group 3
- Recent innovations in smart hardware include lightweight wearable devices like rings and pins, as well as card recording devices aimed at office scenarios, enhancing user experience in personal life and workplace collaboration [1][6]
Waymo Teams Up with DeepMind on a World Model: Built on Genie 3, It Lets Autonomous Driving "Imagine" Rare Scenarios
机器之心· 2026-02-07 07:00
Core Insights
- Waymo has launched the Waymo World Model, a new standard in large-scale, hyper-realistic autonomous driving simulation, built on DeepMind's Genie 3 [1][4]
- The model can generate highly realistic and interactive 3D environments tailored for the strict requirements of autonomous driving [4][8]
- Waymo Driver has completed nearly 200 million miles of fully autonomous driving, enhancing road safety through extensive virtual world training [4][28]
Group 1: Model Capabilities
- Waymo World Model leverages Genie 3's extensive world knowledge to simulate rare events that are difficult to replicate in real life, such as tornadoes and encounters with elephants [4][9]
- The model supports high-fidelity, multi-sensor data generation, including camera images and LiDAR point clouds, providing a comprehensive training and testing environment for autonomous systems [4][8]
- The simulation allows for real-time adjustments through simple language prompts, driving inputs, or scene layouts, enhancing the model's adaptability [4][11][16]
Group 2: Simulation Control Mechanisms
- The model features three main control mechanisms: driving behavior control, scene layout control, and language control, enabling the simulation of various driving scenarios [11][13][16]
- Driving behavior control allows for the simulation of counterfactual events, assessing how the Waymo Driver would respond under specific conditions [11]
- Scene layout control enables customization of road layouts and traffic signals, while language control provides flexibility in adjusting time of day and weather conditions [13][16]
Group 3: Realism and Accuracy
- Waymo World Model can convert real-world videos into multi-modal simulations, achieving high levels of realism and factual accuracy [22]
- The model's efficient variants allow for long-duration simulations while maintaining high fidelity, supporting large-scale testing [24]
- By simulating rare scenarios, Waymo Driver prepares for complex driving situations, setting a higher safety benchmark for autonomous systems [28]
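Real-Robot Reinforcement Learning for Humanoids! ICLR 2026: BIGAI (通研院) Proposes a New Paradigm for Humanoid Robot Pretraining and Real-Robot Fine-Tuning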
机器之心· 2026-02-07 07:00
Core Insights
- The article discusses the advancements in humanoid robots, particularly their ability to perform complex tasks like dancing and running, while emphasizing the importance of continuous reinforcement learning in real-world environments [2][3]
- The LIFT framework proposed by researchers aims to bridge the gap between large-scale pretraining and efficient fine-tuning for humanoid control, addressing the limitations of existing methods [9][12]
Group 1: Background and Motivation
- Current humanoid robots primarily rely on on-policy algorithms like PPO for pretraining, which are not effective for continuous learning due to safety and economic concerns [7]
- The main challenge is to achieve large-scale pretraining speed without sacrificing sample efficiency and safety during the fine-tuning phase [9]
Group 2: LIFT Framework
- LIFT utilizes off-policy reinforcement learning algorithms like SAC for large-scale pretraining, which allows for better sample efficiency when data is limited [12][15]
- The framework incorporates a physics-informed world model to enhance prediction performance and fine-tuning efficiency [12][18]
Group 3: Experimental Results
- LIFT demonstrated significant advantages over baseline methods like PPO and SAC in terms of convergence time and sample efficiency during pretraining and fine-tuning [20][24]
- The framework allows for zero-shot deployment of pre-trained policies to real-world robots, showcasing its effectiveness in real-time applications [20][22]
Group 4: Challenges and Future Directions
- The article highlights several bottlenecks that need to be addressed for scaling reinforcement learning in real-world applications, including observation and state estimation, safety mechanisms, and system throughput [41]
Mining Activation Functions Like Crypto? DeepMind Builds a "Compute Mine" and Brute-Force Searches for the Next-Generation ReLU
机器之心· 2026-02-07 04:09
Core Insights
- The article discusses the evolution of activation functions in neural networks, highlighting the transition from traditional functions like Sigmoid and ReLU to newer ones like GELU and Swish, emphasizing the impact on model performance [1][2].
Group 1: DeepMind's Innovation
- Google DeepMind is revolutionizing the search for activation functions through a new method called AlphaEvolve, which explores an infinite space of Python functions rather than relying on predefined search spaces [2][4].
- The research paper titled "Finding Generalizable Activation Functions" showcases how DeepMind's approach led to the discovery of new activation functions, including GELUSine and GELU-Sinc-Perturbation, which outperform traditional functions in certain tasks [4][30].
Group 2: Methodology
- AlphaEvolve utilizes a large language model (LLM) to generate and modify code, allowing for a more flexible and expansive search for activation functions [8][11].
- The process involves a "micro-laboratory" strategy, where synthetic data is used to optimize for out-of-distribution (OOD) generalization capabilities, avoiding the high costs of searching on large datasets like ImageNet [14][18].
Group 3: Performance of New Functions
- The newly discovered functions demonstrated superior performance in algorithmic reasoning tasks, with GELU-Sinc-Perturbation achieving a score of 0.887 on the CLRS-30 benchmark, surpassing ReLU and GELU [34].
- In visual tasks, GELUSine and GELU-Sinc-Perturbation maintained competitive accuracy on ImageNet, achieving approximately 74.5% Top-1 accuracy, comparable to GELU [34][35].
Group 4: Insights on Function Design
- The research indicates that the best-performing functions often follow a general formula combining a standard activation function with a periodic term, suggesting that incorporating periodic structures can enhance model generalization [25][35].
- The study highlights the importance of understanding the inductive biases introduced by activation functions, suggesting that periodic elements can help capture complex data structures beyond linear relationships [40][42].
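The "standard activation plus a periodic term" pattern described above can be sketched in plain Python. This is only an illustration of the general shape, under stated assumptions: the summary does not give the exact formulas for GELUSine or GELU-Sinc-Perturbation, so the function `gelu_plus_periodic` and its `amplitude` and `frequency` parameters are hypothetical and should not be read as the functions found in the paper.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_plus_periodic(x: float, amplitude: float = 0.1,
                       frequency: float = 1.0) -> float:
    """HYPOTHETICAL example of the pattern: a standard activation (GELU)
    plus a small periodic perturbation (a sine term). The amplitude and
    frequency defaults are illustrative, not values from the paper."""
    return gelu(x) + amplitude * math.sin(frequency * x)


if __name__ == "__main__":
    # Compare the two functions on a few sample inputs.
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(f"x={x:+.1f}  gelu={gelu(x):+.4f}  "
              f"gelu+periodic={gelu_plus_periodic(x):+.4f}")
```

With a small amplitude, the function stays close to GELU while adding a bounded oscillation on top of it, which is the kind of periodic structure the summary associates with better generalization.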
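Weijie Su (苏炜杰) Wins the 2026 COPSS Presidents' Award, the "Nobel Prize of Statistics", Becoming the First Chinese Recipient in 14 Years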
机器之心· 2026-02-07 04:09
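By the 机器之心 editorial team
After a gap of 14 years, the COPSS Presidents' Award, often called the "Nobel Prize of statistics", once again has a recipient of Chinese descent. The 2026 award goes to Weijie Su (苏炜杰), a Peking University alumnus and currently an associate professor at the University of Pennsylvania.
The award committee cited him for establishing rigorous statistical foundations for multiple applications of large language models; for breakthrough advances in privacy-preserving data analysis that were successfully applied to the 2020 US Census; for designing a peer-review mechanism for top AI conferences that was formally deployed at ICML 2026; for foundational research in convex optimization; and for broad and far-reaching contributions to the mathematical theory of deep learning and to high-dimensional statistical inference.
As the highest honor in international statistics and data science, the COPSS Presidents' Award holds a status comparable to the Fields Medal in mathematics and is given each year to a single statistician under the age of 40. It is jointly selected by five leading statistical societies (the Institute of Mathematical Statistics, IMS; the American Statistical Association, ASA; the Statistical Society of Canada, SSC; and the Eastern and Western North American Regions of the International Biometric Society, ENAR and WNAR) and recognizes scholars who have made outstanding contributions to statistical theory, methods, or applications.
Historically, nearly all recipients have gone on to become leading figures who defined the field. Statistics is a discipline in which Chinese scholars are especially strong, and several have won the award, including the recently returned ...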
A Fresh Perspective on World Models: From Video Generation Toward General-Purpose World Simulators
机器之心· 2026-02-07 04:09
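In recent years, video generation and world models have become among the hottest topics in artificial intelligence. From Sora to Kling (可灵), video generation models have shown increasingly strong "world consistency" in motion continuity, object interaction, and some physical priors, prompting serious discussion of whether video generation can move from "realistic short clips" to a "general-purpose world simulator" usable for reasoning, planning, and control.
At the same time, this line of research is quickly becoming intertwined with frontier settings such as embodied AI and autonomous driving, and is seen as an important path toward artificial general intelligence (AGI).
Beneath the research boom, however, core questions such as "what counts as a true world model" and "how to judge a video model's world-simulation ability" remain caught in multi-sided debate. Definitions and taxonomies of world models keep multiplying, and the overlap among theoretical dimensions often confuses researchers and holds back standardization.
To establish a more systematic and clearer view, the Kuaishou Kling team and Professor Chen Yingcong's (陈颖聪) group at the Hong Kong University of Science and Technology (Guangzhou), with PhD students Wang Luozhou (王罗州) and Chen Zhifei (陈知非) as co-first authors, jointly published a systematic survey that examines video world models in depth from a new perspective.
The survey aims to bridge contemporary "stateless" video architectures and classical "state-centric" world models ...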
A 7% Rise in GDP Is Only the Start: What "Real Growth" from AI Does the "Queen of the Bull Market" See?
机器之心· 2026-02-07 02:30
Group 1
- Cathie Wood predicts a global GDP growth rate of 7%, which she considers conservative, driven by the integration of five major technological platforms including AI and robotics [5][6]
- The current GDP measurement system is seen as lagging, failing to account for significant contributions from technology-driven outputs, which Wood refers to as "invisible labor" [6][7]
- Wood emphasizes that the core measure of progress should focus on productivity growth driven by technology, rather than traditional GDP metrics [7][8]
Group 2
- Wood argues that technological advancements lead to a beneficial deflationary cycle, contrary to Keynesian economics, which suggests growth leads to inflation [8][9]
- Alexander Wissner-Gross suggests that per capita productivity is a more accurate measure of future progress than GDP, questioning the relevance of traditional indices like the S&P 500 [9][10]
- Wood identifies GNI (Gross National Income) as a more accurate indicator of real wealth, as it reflects income flows better than GDP, especially during periods of technological upheaval [12][13]
Group 3
- Wood connects the essence of economic activity to energy transformation, highlighting the importance of advancements in nuclear, solar, and battery technologies as future energy pillars [13]
The Fight Over Ads in AI Reaches the Super Bowl: The Internet Watches Altman Lose His Cool
机器之心· 2026-02-06 03:57
Core Viewpoint
- The article discusses the contrasting advertising strategies of OpenAI and Anthropic, highlighting Anthropic's decision to keep its AI assistant Claude ad-free as a response to OpenAI's introduction of ads in ChatGPT [2][12][24].
Group 1: Anthropic's Advertising Strategy
- Anthropic aired a Super Bowl ad that humorously critiques OpenAI's decision to introduce ads in ChatGPT, emphasizing that Claude will remain ad-free [2][6][12].
- The company believes that integrating ads into AI conversations is incompatible with Claude's role as a serious assistant, especially in sensitive or complex discussions [11][14].
- Anthropic's stance is that users should not have to question whether the AI is genuinely helping them or steering conversations towards monetization [13][15].
Group 2: OpenAI's Response
- OpenAI's CEO Sam Altman acknowledged the humor in Anthropic's ad but criticized it as a dishonest tactic, asserting that OpenAI's advertising principles differ significantly from what Anthropic portrayed [18][20].
- Altman highlighted that OpenAI aims to provide free access to AI for everyone, with a significant user base in Texas alone surpassing the total users of Claude [19][24].
- He accused Anthropic of catering to wealthier clients and attempting to control AI usage, contrasting OpenAI's commitment to broad access and democratic decision-making [20][21].
Group 3: Financial Context and Business Models
- OpenAI is facing substantial financial pressures, with projected losses of approximately $9 billion this year against expected revenues of $13 billion, while only about 5% of its 800 million active users are paying subscribers [24].
- In contrast, Anthropic, while also not yet profitable, is expected to achieve profitability faster and relies on enterprise contracts and paid subscriptions rather than large-scale data center investments [24].
- Anthropic's Claude Code and Cowork have reportedly generated at least $1 billion in revenue, indicating a different approach to monetization compared to OpenAI [24].