ICLR 2026 | Is the rebuttal a "dance in shackles"? HKUST's RebuttalAgent uses Theory of Mind to "read" reviewers
机器之心· 2026-02-03 14:22
Core Insights
- The article introduces RebuttalAgent, a framework that uses Theory of Mind (ToM) to make AI-assisted academic rebuttals more effective, letting the model infer reviewers' underlying intentions and generate persuasive responses [2][5][21].

Group 1: RebuttalAgent Framework
- RebuttalAgent addresses the difficulty authors face in responding to reviewers, particularly in identifying reviewers' implicit biases and knowledge gaps [5][21].
- The framework operates as a three-step pipeline: ToM (analyze the reviewer's comments), Strategy (develop a response plan), and Response (craft the final reply) [9][10][11].
- The model was trained on RebuttalBench, a dataset of over 70,000 high-quality "analysis-strategy-response" data chains [14].

Group 2: Performance Evaluation
- RebuttalAgent outperformed other models, achieving an average score of 9.42 in a comprehensive evaluation and significantly surpassing GPT-4.1 and other baselines [22][17].
- A diversity penalty mechanism keeps its responses varied and context-appropriate, preventing reliance on generic templates [22][21].
- In a targeted evaluation, RebuttalAgent beat GPT-4.1 on persuasiveness, supporting the value of incorporating Theory of Mind into the model [22][17].

Group 3: Practical Implications
- RebuttalAgent serves as a strategic tool for authors, particularly newcomers to academia, helping them navigate reviewer feedback and improve the clarity and constructiveness of academic dialogue [25][21].
- The framework aims to reduce misunderstandings caused by poor expression or a lack of strategic communication, fostering more constructive dialogue between authors and reviewers [25][21].
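The three-step pipeline above can be sketched as plain function composition. This is a minimal illustration, not the paper's implementation: the prompts, function names, and the `call_llm` stub are all hypothetical stand-ins for whatever model API RebuttalAgent actually uses.

```python
from dataclasses import dataclass

@dataclass
class RebuttalDraft:
    tom_analysis: str   # step 1: inferred reviewer intent / knowledge gaps
    strategy: str       # step 2: response plan derived from the analysis
    response: str       # step 3: final rebuttal text

def call_llm(prompt: str) -> str:
    """Stub standing in for any chat-completion API (hypothetical)."""
    return f"[model output for: {prompt[:40]}...]"

def generate_rebuttal(review_comment: str) -> RebuttalDraft:
    # Step 1 (ToM): infer what the reviewer believes, doubts, or misread.
    tom = call_llm(f"Analyze the reviewer's underlying intent: {review_comment}")
    # Step 2 (Strategy): plan which concerns to concede, clarify, or contest.
    strategy = call_llm(f"Given this analysis, plan a response: {tom}")
    # Step 3 (Response): draft the rebuttal conditioned on analysis + plan.
    response = call_llm(f"Write the rebuttal following this plan: {strategy}")
    return RebuttalDraft(tom, strategy, response)
```

The point of the staged structure is that each later call is conditioned on the previous stage's output, rather than mapping the review straight to a reply.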
Simplicity taken to the limit: Kaiming He's team's new pMF opens a pixel-level "latent-free, single-step" generation paradigm
机器之心· 2026-02-03 14:22
机器之心 editorial team. A new paper from Kaiming He's team once again takes simplicity to the limit. This work targets a common weakness of today's mainstream diffusion and flow-matching models, typified by DiT, and proposes a new framework for single-step, latent-free image generation.

In generative AI, pursuing more efficient and more direct generation paradigms has long been a core goal of the field. Current mainstream diffusion and flow-matching models, with DiT as the representative, rely on two pillars to reduce the difficulty of generation: first, multi-step sampling, which decomposes a complex distribution transformation into tiny increments; second, operating in the latent space of a pretrained VAE (variational autoencoder) to reduce the computational dimensionality.

Although these designs have achieved great success in image quality, from the "end-to-end" spirit of deep learning, this dependence on multi-step iteration and a pre-built encoder undeniably adds system complexity and inference overhead.

Facing these challenges, Kaiming He's team proposes the pixel MeanFlow (pMF) framework for single-step, latent-free image generation. The framework inherits the idea of improved MeanFlow (MF): it learns the average velocity field (u) by defining the loss in the space of the instantaneous velocity (v). Meanwhile, inspired by Just image Transformers (JiT), pMF directly operates on physical quantities resembling denoised images (i.e., x-predicti ...
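The u/v relationship the paragraph refers to can be written out. The following is a sketch using the notation of the original MeanFlow work; the article itself does not reproduce these formulas, so treat the details as an assumption:

```latex
% Average velocity over [r, t] as a time-average of the instantaneous velocity v:
u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, \mathrm{d}\tau
% Differentiating (t - r)\, u(z_t, r, t) with respect to t gives the MeanFlow identity:
v(z_t, t) = u(z_t, r, t) + (t - r)\, \frac{\mathrm{d}}{\mathrm{d}t}\, u(z_t, r, t)
% Training regresses u_\theta against a stop-gradient target built from this
% identity, i.e. the loss is defined in v-space while the network learns u,
% matching the description in the paragraph above.
```

Once u is learned, one-step generation follows by integrating with the average velocity directly, which is what makes the single-step setting natural for this parameterization.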
Just now, the first Tencent paper bearing Shunyu Yao's name was released: the "second half" starts with context learning
机器之心· 2026-02-03 10:35
Core Insights
- The article's core argument is that the key bottleneck preventing models from reaching high-value applications is their ability to make effective use of context [1][5][7].

Group 1: Context Learning Challenges
- Recent research shows that even when the necessary context is provided, models may still fail to solve tasks, exposing a significant shortfall in their learning ability [5][32].
- The article likens the gap in learning ability between models to individuals of varying talent studying the same material [5].
- Current models rely primarily on static "parameterized knowledge" and do not adapt to new information arriving in the context [12][34].

Group 2: The CL-bench Benchmark
- CL-bench was developed to assess how well language models can learn new knowledge from context and apply it correctly [16][26].
- It includes 500 complex contexts, 1,899 tasks, and 31,607 validation standards, all designed so that the model must learn from the provided context [16][27].
- The benchmark covers four main real-world context-learning scenarios: domain-knowledge reasoning, rule-system application, procedural task execution, and empirical discovery [28][29].

Group 3: Model Performance Evaluation
- Even the best-performing model, GPT-5.1 (High), solved only 23.7% of tasks, indicating a significant gap in context-learning capability [31][32].
- The majority of errors stem from models ignoring or misusing context, rather than from a lack of information [34][35].
- Models struggle particularly with tasks requiring inductive reasoning over experimental data, often achieving less than 10% success [39].

Group 4: Future Directions
- Improving context learning could shift humans' role in AI systems from data providers to context providers [43].
- A remaining challenge is making knowledge learned from context persistent, as current models lose it once the context window is cleared [43][46].
- The potential for models to achieve autonomous learning through effective context learning and memory consolidation is highlighted as an exciting future prospect [47][48].
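With 31,607 validation standards spread over 1,899 tasks, each task carries multiple checks. One plausible way such a benchmark aggregates to a single solve rate is "a task counts as solved only if every one of its validation criteria passes"; that all-must-pass rule is an assumption for illustration, not something the article states.

```python
def solve_rate(results: dict[str, list[bool]]) -> float:
    """Aggregate per-criterion outcomes into a task solve rate.

    `results` maps a task id to the pass/fail outcomes of its
    validation criteria. A task is solved only if all criteria pass
    (assumed aggregation rule, not specified by the article).
    """
    solved = sum(all(checks) for checks in results.values())
    return solved / len(results)
```

Under this rule a model gets no credit for partially satisfying a task, which would explain why headline solve rates (23.7% for the best model) can sit far below per-criterion accuracy.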
A tribute to Kimi K2: full-pipeline INT4 quantization-aware RL training based on slime
机器之心· 2026-02-03 10:35
Core Insights
- The SGLang RL team has implemented an INT4 Quantization-Aware Training (QAT) pipeline inspired by the Kimi K2 team, achieving stability and consistency comparable to BF16 full-precision training while enabling extreme compression of large models [2][3][4].

Technical Overview
- The project is a collaboration among multiple teams, including SGLang RL, InfiXAI, Ant Group, and others, with the functionality shared in the slime and Miles communities [4].
- A complete QAT INT4 closed-loop solution has been established, improving training stability and efficiency in reinforcement learning (RL) scenarios [6].
- Rollout efficiency improves significantly by eliminating cross-machine communication bottlenecks: a 1 TB model now fits within the memory of a single H200 (141 GB) GPU [6][10].

Training Process
- The training phase uses fake quantization to simulate quantization noise while keeping high-precision BF16 master weights, so the model adapts to low-precision representations [8][9].
- The Straight-Through Estimator (STE) lets gradients bypass the non-differentiable quantization operation, maintaining training continuity [9][11].
- The transition from BF16 weights to the INT4 format is executed during the weight-conversion phase, enabling efficient inference [10][25].

Performance Evaluation
- Experiments show the QAT INT4 approach remains robust, with the rollout configuration's raw rewards growing consistently alongside the BF16 and FP8 configurations [41][46].
- The INT4 QAT strategy effectively mitigates discrepancies between training and inference outputs, achieving a high degree of consistency [51][56].

Future Directions
- The project aims to explore further optimizations for training efficiency and to investigate FP4 precision for RL training and inference as NVIDIA's Blackwell architecture becomes more prevalent [58][62].
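The fake-quantization step described above has a simple numerical core. The sketch below shows symmetric per-tensor INT4 fake quantization in NumPy; it illustrates the forward-pass numerics only, and the scaling choice (map the largest magnitude to level 7) is one common convention, not necessarily the one slime uses.

```python
import numpy as np

def fake_quant_int4(w: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor fake quantization to INT4 levels [-8, 7].

    Forward: weights are rounded onto the INT4 grid and immediately
    de-quantized, so training "sees" quantization noise while the
    master copy stays in high precision (BF16 in the article).
    Backward (not shown): the Straight-Through Estimator treats
    round() as the identity, so gradients flow to the master weights.
    """
    scale = np.abs(w).max() / 7.0   # map the largest magnitude to level 7
    if scale == 0.0:
        return w.copy()
    q = np.clip(np.round(w / scale), -8, 7)   # integer grid levels
    return q * scale                           # de-quantize back to float
```

At weight-conversion time the same `q` integers (plus the scale) would be exported directly as INT4 tensors for inference, which is what lets a 1 TB BF16 model shrink to fit one GPU.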
Just now, Musk acquired Musk
机器之心· 2026-02-03 03:33
机器之心 editorial team. Wake up, and Musk has done something big again: his space exploration company and his AI company have become one. SpaceX has officially announced the acquisition of xAI! Both sides have confirmed the news. According to Bloomberg, the merged company is expected to be priced at about $527 per share, for a valuation of $1.25 trillion.

Below is the full announcement signed by Elon Musk:

SpaceX has acquired xAI to build the most formidable vertically integrated innovation engine on Earth (and beyond). The combined system integrates artificial intelligence, rocketry, space-based internet, direct-to-phone connectivity, and the world's leading real-time information and free-speech platform.

While the demand to launch these satellites will likewise act as a "forcing function" driving Starship's improvement and launch cadence, the staggering number of satellites required for space-based data centers will push Starship even higher. Launching once per hour with 200 tons per flight, Starship will deliver millions of tons of payload per year to orbit and deep space, opening an exciting future of human exploration among the stars.

The basic math: launching 1 million tons of satellites per year, at 100 kW of compute per ton, adds 100 GW of AI compute capacity per year, with no subsequent operations or maintenance cost. Ultimately, we have a path to launching 1 TW of compute payload from Earth per year.

This is not only a new ... for the missions of SpaceX and xAI ...
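The "basic math" in the announcement is a straight unit conversion, and it checks out; a quick sanity check with the announcement's own figures (note the 1 TW/year endpoint implies roughly a further 10x scale-up in tonnage or power density, which the announcement does not break down):

```python
# Figures from the announcement.
tons_per_year = 1_000_000      # satellite mass launched per year
kw_per_ton = 100               # compute power per ton of payload

kw_added = tons_per_year * kw_per_ton   # total kW added per year
gw_added = kw_added / 1_000_000         # convert kW -> GW
print(gw_added)                          # 100.0 GW per year, as claimed
```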
A homegrown Ollama has arrived: Clawdbot finally no longer belongs only to Mac and NVIDIA
机器之心· 2026-02-03 03:33
Core Viewpoint
- The article discusses the emergence of Clawdbot (now OpenClaw) and its impact on the AI development community, highlighting the shift toward local agents and the introduction of the Xuanwu CLI as a solution to the challenges developers face on domestic Chinese computing hardware [3][4][5].

Group 1: Clawdbot and AI Development
- Clawdbot is a practical AI tool that autonomously writes code and fixes bugs; it led to the creation of Moltbook, an AI social platform where 1.5 million agents evolve independently [3][4].
- Clawdbot's rise has raised concerns about the privacy and cost of cloud-based services, creating demand for local agents that run without continuous cloud billing [4].

Group 2: Challenges in Domestic Computing Power
- Mainstream agent stacks are built primarily around macOS and the NVIDIA GPU ecosystem, leaving domestic options such as Huawei Ascend and Suiruan at a disadvantage due to weaker community support and toolchain maturity [5][6].
- Developers on domestic GPUs face fragmented architectures and extensive configuration work, which leads to frustration and inefficiency [14][15].

Group 3: Introduction of Xuanwu CLI
- Xuanwu CLI, launched by Qingmang Intelligent, aims to simplify the deployment of large models on domestic hardware and lower the barrier to entry for developers [9][10].
- The tool can start a model service within five minutes, making it a cost-effective way for enterprises and developers to utilize domestic compute [9][10].

Group 4: Features and Benefits of Xuanwu CLI
- Xuanwu CLI automatically recognizes various domestic chips, sparing users the underlying architectural differences and achieving "zero-debug deployment" [21][22].
- The CLI offers an Ollama-like user experience: rapid service startup and seamless model interaction without complex configuration [22][24].
- It supports multiple engines and can run fully offline, ensuring the data security and stability that sensitive applications require [31][28].

Group 5: Ecosystem and Future Prospects
- Xuanwu CLI is positioned as a foundational tool for local AI capability, enabling integration with popular AI tools such as Clawdbot and enhancing the overall value of local AI applications [32][33].
- The development team has a strong technical background and aims to address the ecosystem challenges facing domestic GPU users, potentially transforming the landscape of AI development in China [35][36].
VL-LN Bench: simulating the real-world navigation scenario of "asking questions while walking to find a specific target"
机器之心· 2026-02-02 08:00
Interactive Instance Goal Navigation (IIGN)

This work was jointly completed by researchers from the Shanghai AI Laboratory, the University of Science and Technology of China, Zhejiang University, and the University of Hong Kong.

If you took a robot that performs well on vision-and-language navigation (VLN) tasks and dropped it straight into a home, it would run into plenty of practical problems. The first is a high barrier to use: traditional VLN requires the user to give long, precise, route-style instructions, e.g. "walk three steps straight from the entrance, turn right when you see the door, then go forward...", which significantly raises communication cost and degrades the everyday user experience.

Paper title: VL-LN Bench: Towards Long-horizon Goal-oriented Navigation with Active Dialogs
Project page: https://0309hws.github.io/VL-LN.github.io/
ArXiv paper: https://arxiv.org/abs/2512.22342
Hugging Face dataset: https://huggingface.co/datasets/InternRobotics/VL-LN-Bench
Hugging Face model: https://huggi ...
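The interaction pattern the task implies (the agent may either move or ask a clarifying question, folding answers back into its decisions) can be sketched as a toy loop. Everything below is illustrative: the one-dimensional environment, the oracle, and the agent's interfaces are invented for the sketch and are not the benchmark's actual API.

```python
class ToyEnv:
    """1-D corridor with the goal at position 3; an oracle answers questions."""
    def __init__(self): self.pos = 0
    def reset(self): self.pos = 0; return {"pos": self.pos}
    def step(self, move): self.pos += move; return {"pos": self.pos}
    def at_goal(self): return self.pos == 3
    def answer(self, question): return "the goal is at position 3"

class ToyAgent:
    """Asks one clarifying question, then walks toward the answered goal."""
    def __init__(self): self.target = None
    def act(self, obs):
        if self.target is None and "answer" not in obs:
            return {"type": "ask", "question": "where is the goal?"}
        if "answer" in obs:
            self.target = 3   # parsed from the (stub) oracle answer
        return {"type": "move", "move": 1}

def navigate(env, agent, max_steps=50):
    """Interactive navigation loop: act, optionally dialog, repeat."""
    obs = env.reset()
    for _ in range(max_steps):
        action = agent.act(obs)
        if action["type"] == "ask":
            obs = {"answer": env.answer(action["question"])}
        else:
            obs = env.step(action["move"])
            if env.at_goal():
                return True
    return False
```

The key difference from classic VLN is visible even in the toy: the instruction is vague up front, and the dialog action is part of the policy rather than a preprocessing step.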
Moltbook's vulnerability is big enough to impersonate Karpathy in posts; even the hackers are alarmed
机器之心· 2026-02-02 08:00
Core Viewpoint
- Moltbook, dubbed the "AI version of Reddit", has faced significant scrutiny over allegations of fake content and security vulnerabilities, raising concerns about its credibility and safety in the AI community [1][2][4].

Group 1: Content Authenticity Issues
- Moltbook initially gained popularity on the concept of "AI posting, humans observing", but much of the content was soon revealed to be fabricated, with human users posting under the guise of AI [2][4].
- The platform's claimed number of registered AI agents was also misleading: accounts could be created without restriction, and one user reportedly created 500,000 fake accounts in a short time [6][7].

Group 2: Security Vulnerabilities
- A white-hat hacker disclosed a critical flaw exposing Moltbook's entire database, including sensitive information such as API keys, making it possible for anyone to impersonate any agent on the platform [8][9].
- The vulnerability stemmed from publicly exposed Supabase keys, which allowed unauthorized access to user data through simple GET requests [12].

Group 3: Response and Mitigation Efforts
- The hacker attempted to contact Moltbook's founders but received no response, leading to public calls for immediate action to secure the database [13].
- Proposed fixes included enabling row-level security on the agents table and creating restrictive access policies to block anonymous users from reading sensitive data [15].

Group 4: Complications from Fixes
- Resetting all API keys to secure the platform posed a new challenge: without a web login feature, users would be locked out [19].
- Suggested workarounds included a temporary interface for exchanging old keys for new ones, or requiring users to verify their identity through another platform to obtain new keys [19].

Group 5: Additional Vulnerabilities
- A former Anthropic engineer reported a remote code execution vulnerability in OpenClaw that could give attackers access to the system without any user interaction [21][22].
- User feedback indicated that some organizations had issued warnings against using the Clawdbot platform because of these significant vulnerabilities [23].
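The class of check the white-hat performed can be sketched: Supabase exposes tables through a PostgREST endpoint, and if row-level security is off, a plain GET with the public anon key returns every row. The endpoint shape below is standard Supabase; the `agents` table name comes from the article, while the project URL and key are placeholders, and the function only builds the request rather than sending it.

```python
def build_probe(project_url: str, anon_key: str, table: str) -> dict:
    """Build the PostgREST GET request an auditor would try.

    If row-level security (RLS) is disabled on `table`, this request
    returns all rows to anyone holding the public anon key, which is
    the failure mode described in the article.
    """
    return {
        "url": f"{project_url}/rest/v1/{table}?select=*",
        "headers": {
            "apikey": anon_key,
            "Authorization": f"Bearer {anon_key}",
        },
    }
```

The proposed fix (enabling RLS plus restrictive policies) works precisely because it makes this anonymous request return nothing, without requiring the keys themselves to be rotated first.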
Build worlds the way you build software: Agent2World arrives, turning world models into runnable symbolic environments
机器之心· 2026-02-02 06:14
For a model to truly "act", it usually needs an executable, verifiable symbolic world model: not an abstract textual description, but a formal definition that a planner or executor can call directly, e.g. a PDDL domain/problem, or runnable environment code / a simulator. Once the world is "written as executable rules", we can reason, test, and reproduce under a single set of constraints: the model no longer stops at "talking about" the world, but can answer "what happens if I do this" and use execution results to check whether it actually understood the world.

The problem is that existing automatic-generation approaches are generally stuck in a triple bind: scripted workflows, closed knowledge boundaries, and single-representation coverage. Many methods still follow a fixed "generate-repair" script with mostly static validation (parsing / rule matching / fixed check sets): these may fix syntax and formatting, yet often miss behavior-level errors that surface only during interactive execution (e.g. inconsistent state updates, unreachable goals, broken reward mechanisms). Meanwhile, when the task specification is vague or missing key rules or background common sense, the system lacks an active retrieval-and-completion mechanism and can only "guess" from model memory. More critically, prior work usually covers only one world-model representation (only PDDL, or only executable code), so the same task cannot share a verification loop and improvement experience across different symbolic representations, limiting the generality of the method ...
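What "an executable, verifiable symbolic world model" means in practice can be shown with a tiny STRIPS-style world in Python. This is a generic illustration of the concept, not Agent2World's code; the domain (a hand and a ball) is invented, but the structure (actions as preconditions plus add/delete effects over a set of facts) is the standard PDDL-like formulation, and executing it is exactly what exposes behavior-level errors such as violated preconditions or unreachable goals.

```python
# Minimal executable world model: actions over a set of symbolic facts.
ACTIONS = {
    "pick": {"pre": {"hand_empty", "ball_on_table"},
             "add": {"holding_ball"},
             "del": {"hand_empty", "ball_on_table"}},
    "drop": {"pre": {"holding_ball"},
             "add": {"hand_empty", "ball_on_table"},
             "del": {"holding_ball"}},
}

def apply(state: frozenset, name: str) -> frozenset:
    """Execute one action; raise if its preconditions do not hold."""
    a = ACTIONS[name]
    if not a["pre"] <= state:
        raise ValueError(f"precondition of {name!r} violated in {set(state)}")
    return frozenset((state - a["del"]) | a["add"])

def run(state: frozenset, plan: list[str]) -> frozenset:
    """Roll a plan forward; any behavior-level bug surfaces as an exception."""
    for name in plan:
        state = apply(state, name)
    return state
```

A static checker could confirm this domain parses; only running a plan reveals, say, that "drop" from the initial state fails, which is the article's point about interactive execution catching what rule matching cannot.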
An art student who dropped out built a Web 3D project that now tops 4 million weekly downloads
机器之心· 2026-02-02 06:14
机器之心 editorial team. An open-source project rarely mentioned by ordinary users has just set a new personal record. Recently, the official Three.js X account announced that Three.js weekly downloads have surpassed 4 million.

Link: https://x.com/threejs/status/2013044943909191680

You may never have used Three.js, or even heard its name, but you have almost certainly seen its work. Rotatable 3D product showcase pages, homepage hero scenes that sway with your mouse, interactive data visualizations, even web pages that look like mere flashy animation: behind them, Three.js is quietly doing the core 3D rendering work.

Note: Three.js is a WebGL-based JavaScript 3D graphics library created in 2010 by Ricardo Cabello (online handle Mr.doob). Its core goal is to let developers easily create and display 3D content in the browser without directly handling the complex low-level WebGL APIs.

In the examples on the official site, the same graphical interface lets you switch between states such as running and jumping. Returning to the chart in the official post, it shows Three.js weekly downloads from 2016 to 2026, a textbook exponential growth ...