机器之心

GPT-5 has too many problems: Altman and team respond to everything, blaming the botched charts on being "too tired"
机器之心· 2025-08-09 06:02
Report by the 机器之心 editorial team. The higher the expectations beforehand, the deeper the disappointment afterward: that is probably how most industry observers felt after GPT-5's loudly telegraphed, high-profile launch. Of course, OpenAI may well have believed from internal testing that GPT-5 was its most capable model to date, but out in the real world that does not seem to hold. One X user found GPT-5 unable to solve what is arguably elementary-school-level math and quipped, asking which school had conferred the "PhD-level" intelligence the company advertised. And it is not just math: ever since GPT-5's release, social media has been flooded with cases of GPT-5 "fumbling" logic and coding tasks. Between the pre-launch hype, the amateurish chart errors during the livestream, and users' disappointment after trying the model, GPT-5 drew far more complaints and skepticism than the expected flowers and applause. OpenAI co-founder and CEO Sam Altman apparently could no longer sit still, conceding that the GPT-5 rollout did indeed have some problems. Shortly after the release, in an AMA on Reddit's r/ChatGPT, Sam Altman and core members of the GPT-5 team answered users' questions, from the embarrassing "chart crime" at the launch event to complaints about GPT ...
ARPO: Agentic Reinforced Policy Optimization, letting agents take one more exploratory step at critical moments
机器之心· 2025-08-09 06:02
Core Viewpoint
- The article introduces a novel method called Agentic Reinforced Policy Optimization (ARPO), designed to enhance the performance of large language models (LLMs) in multi-round interactions by addressing the challenges of uncertainty and exploration during tool usage [3][41].

Group 1: Research Motivation and Background
- The emergence of Agentic Reinforcement Learning (RL) is driven by the need for LLMs to engage in dynamic multi-round interactions with external tools, moving from static problem-solving to a more interactive agent-environment reasoning paradigm [8].
- Existing Agentic RL methods often underestimate the value of multi-round interactions due to sparse rewards and overuse of tools, leading to a lack of fine-grained exploration of tool usage [8][41].
- The study identifies a significant increase in entropy (uncertainty) after tool calls, indicating an opportunity for exploration that current methods do not fully leverage [14][16].

Group 2: ARPO Methodology
- ARPO introduces an entropy-driven adaptive rollout strategy that enhances exploration during high-entropy tool-usage phases, allowing for more diverse reasoning paths [11][20].
- The method includes four key steps, sketched in code after this summary: initialization of the global rollout, monitoring entropy changes, adaptive branching based on entropy, and termination conditions for the rollout process [24][27].
- ARPO incorporates advantage attribution estimation to help the model better internalize the value differences of tool usage at each step [28][30].

Group 3: Experimental Results
- ARPO outperforms existing sample-level RL methods, achieving better performance with only half the tool-call budget across 13 challenging benchmarks, demonstrating its efficiency in training multi-round reasoning agents [21][41].
- The method shows consistent improvements in metrics such as Pass@3 and Pass@5, particularly in dynamic multi-round tasks [37][39].
- In comparative tests, ARPO achieves higher accuracy than GRPO and DAPO across tasks including deep search and knowledge-intensive reasoning [41][42].

Group 4: Future Directions
- Future research may explore the application of ARPO in multi-modal tasks, expanding its capabilities beyond text-based reasoning to images and videos [42].
- There is potential for integrating a broader range of external tools to enhance complex-task performance through optimized tool-usage strategies [42].
- The scalability and real-time deployment of ARPO in larger models and dynamic environments could further improve its practical value and cost-effectiveness [42].
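To make the four-step rollout concrete, here is a minimal Python sketch of the entropy-driven branching logic. It is an illustration of the idea rather than the authors' implementation: the decoding interface (`path.next_token_probs`, `path.wants_tool`, `path.call_tool`, `path.fork`) and the threshold `entropy_delta` are assumptions, and only the entropy computation follows its standard definition.

```python
import math

def token_entropy(probs):
    # Shannon entropy of a next-token distribution (standard definition).
    return -sum(p * math.log(p) for p in probs if p > 0)

def arpo_rollout(start_path, tools, max_paths=8, entropy_delta=0.5):
    # Step 1: initialize a single global rollout.
    active, finished = [start_path], []
    while active:
        path = active.pop()
        baseline = token_entropy(path.next_token_probs())
        while not path.done():
            if path.wants_tool():
                path.call_tool(tools)
                # Step 2: monitor the entropy change right after the tool call.
                spike = token_entropy(path.next_token_probs()) - baseline
                # Step 3: branch adaptively where uncertainty spikes, so the
                # high-entropy step gets extra partial rollouts to explore.
                if spike > entropy_delta and len(active) + len(finished) < max_paths:
                    active.append(path.fork())
            path.decode_one_token()
        # Step 4: a path terminates when generation completes or the
        # overall path budget is exhausted.
        finished.append(path)
    return finished
```

The finished paths would then feed the advantage attribution estimation the summary mentions, which credits shared prefixes and branched segments differently so the model internalizes step-level value differences in tool use.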
ICCV 2025 | A new backdoor attack takes direct aim at Scaffold federated learning: NTU and 0G Labs reveal a security vulnerability in decentralized training
机器之心· 2025-08-09 03:59
Core Viewpoint
- The article introduces BadSFL, a novel backdoor attack method specifically designed for the Scaffold Federated Learning (SFL) framework, highlighting its effectiveness, stealth, and persistence compared to existing methods [2][39].

Group 1: Background on Federated Learning and Scaffold
- Federated Learning (FL) allows distributed model training while protecting client data privacy, but its effectiveness depends heavily on how training data is distributed across clients [6][10].
- In non-IID scenarios, where data distributions vary significantly among clients, traditional methods like FedAvg struggle, leading to poor model convergence [7][10].
- Scaffold was proposed to address these challenges by using control variates to correct client updates, improving convergence in non-IID settings (see the sketch after this summary) [7][12].

Group 2: Security Vulnerabilities in Scaffold
- Despite its advantages, Scaffold introduces new security vulnerabilities, particularly against malicious clients that can exploit the model-update mechanism to inject backdoor behaviors [8][9].
- Scaffold's reliance on control variates creates a new attack surface, allowing attackers to manipulate these variates and steer benign clients' updates toward malicious objectives [9][16].

Group 3: BadSFL Attack Methodology
- BadSFL subtly alters the control variates to steer benign clients' local gradient updates in a "poisoned" direction, enhancing the persistence of the backdoor [2][9].
- The attack uses a GAN-based data-poisoning strategy to enrich the attacker's dataset, maintaining high accuracy on both clean and backdoor samples while remaining covert [2][11].
- BadSFL demonstrates superior persistence, maintaining attack effectiveness for over 60 rounds, three times longer than existing benchmark methods [2][32].

Group 4: Experimental Results
- Experiments on MNIST, CIFAR-10, and CIFAR-100 show that BadSFL outperforms four other known backdoor attacks in effectiveness and persistence [32][33].
- In the first 10 rounds of training, BadSFL achieved over 80% accuracy on backdoor tasks while maintaining around 60% accuracy on the primary task [34].
- Even after the attacker stops uploading malicious updates, BadSFL retains backdoor functionality significantly longer than the benchmark methods, demonstrating its robustness [37][38].
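For context on the mechanism being attacked, below is a minimal numpy sketch of the standard Scaffold client update (Karimireddy et al., 2020) whose control variates BadSFL manipulates. The function names and hyperparameters are illustrative; the attack itself is only indicated in a comment, since its exact construction is specific to the paper.

```python
import numpy as np

def scaffold_client_update(x_global, c_global, c_local, grad_fn,
                           lr=0.1, local_steps=10):
    # Local SGD with drift correction: each step is adjusted by
    # (c_global - c_local), which is what stabilizes non-IID training.
    y = x_global.copy()
    for _ in range(local_steps):
        y -= lr * (grad_fn(y) - c_local + c_global)
    # "Option II" control-variate refresh from the Scaffold paper.
    c_local_new = c_local - c_global + (x_global - y) / (local_steps * lr)
    # The client returns both deltas; the server averages them into the
    # global model and the global control variate. BadSFL's key point is
    # that this control-variate delta is itself attacker-controlled.
    return y - x_global, c_local_new - c_local
```

Per the summary above, a malicious client can return a crafted control-variate delta so that the (c_global - c_local) correction applied by benign clients drifts their updates toward the backdoor objective while the global model still appears to converge normally.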
Users tear into GPT-5 and cry "give us back GPT-4o"; Altman relents
机器之心· 2025-08-09 03:59
Report by the 机器之心 editorial team. 4o is back: can you see it yet? After a long wait, GPT-5 finally arrived, but people do not seem happy with it. For those with access to GPT-5, the page now looks like this: all the previous models are gone. The reason is that, as part of the GPT-5 launch, OpenAI removed the model picker from ChatGPT. That dropdown used to gather OpenAI's confusingly named lineup of models and let users switch among them by need, for example choosing GPT-4o for complex tasks or the more efficient o4-mini for lighter work, or hopping between generations, say from last year's GPT-4o to the newer GPT-4.1. It used to look like this. With the new release, however, OpenAI made GPT-5 the default model in ChatGPT and automatically assigns users different sub-versions depending on the task. For people who had grown comfortable with the old models, the change is genuinely painful, and many want these "old friends" back as soon as possible, GPT-4o above all. To vent their frustration, many users turned to memes, at once funny and resigned. Source: https://x.com/pengkeshen281/ ...
Shanghai AI Lab, Zhejiang University EagleLab, and others propose RRVF: learning visual reasoning from images alone by exploiting the "asymmetry of verification"
机器之心· 2025-08-09 03:59
Core Insights
- The article discusses the concept of the "asymmetry of verification," which posits that verifying the quality of a solution is often easier than creating one from scratch, a principle reshaping the future of AI [3][4].
- The RRVF (Reasoning-Rendering-Visual-Feedback) framework exemplifies how to leverage this principle to tackle complex visual reasoning challenges [4][19].

Summary by Sections

Research Background
- The research was conducted by a team from Shanghai AI Lab, Zhejiang University EagleLab, and Shanghai Chuangzhi Academy, focusing on multimodal large models and reasoning [2].

Verification Asymmetry
- The principle of verification asymmetry suggests that tasks with objective ground truth and fast verification can be solved efficiently by AI through iterative guess-and-check [3].

RRVF Framework
- RRVF operates without expensive image-text paired data, allowing the model to self-validate in a closed loop (sketched below) [9][11].
- The framework consists of three main components: iterative visual reasoning, visual feedback, and a visual judge, which together drive the model's learning [11][12][13].

Experimental Results
- RRVF outperformed traditional supervised fine-tuning (SFT), reaching a code execution rate of 97.83% without ever seeing reference code [21].
- The 7B model trained with RRVF outperformed the 72B model that provided its feedback, evidence of a self-improvement effect [22].
- RRVF maintained high performance on unseen datasets, indicating strong generalization [23].

Implications for AI Development
- The findings suggest that the future bottleneck in AI development may lie in designing efficient verification environments rather than solely in scaling model size [23].
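A minimal sketch of the closed loop described above, under stated assumptions: the model emits code for a target image, the code is rendered, and a judge model scores visual similarity. Everything here (`render`, `visual_judge`, the 0.95 cutoff) is an illustrative stand-in, not the paper's exact setup.

```python
def rrvf_episode(model, target_image, render, visual_judge, max_turns=4):
    # The model never sees ground-truth code; its only supervision is
    # whether its own code, once rendered, looks like the target image.
    code, feedback = None, None
    for _ in range(max_turns):
        # Iterative visual reasoning: regenerate conditioned on feedback.
        code = model.generate(target_image, prev_code=code, feedback=feedback)
        try:
            rendered = render(code)  # execute the code to produce an image
        except Exception as err:
            # Execution failure is itself cheap, objective feedback.
            feedback = f"render failed: {err}"
            continue
        # Visual judge: scoring a rendering is far cheaper than authoring code.
        score = visual_judge(target_image, rendered)
        if score >= 0.95:
            break
        feedback = f"similarity {score:.2f}: revise the code to close the gap"
    return code
```

In an RL setting the similarity score would double as the reward signal, which is exactly where the verification asymmetry pays off.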
OpenAI board chair: "per-token pricing" is dead wrong; the market will ultimately choose "paying for outcomes"
机器之心· 2025-08-09 01:30
This article is from the 机器之心 PRO member newsletter; follow "机器之心PRO会员" at the end of the article for more in-depth briefings. Bret Taylor, founder of agent startup Sierra, former co-CEO of Salesforce, and chair of OpenAI's board, recently sat down with Lenny's Podcast. Starting from his blunt conclusion that "building a foundation model is a dead end for startups unless you can burn billions of dollars the way Musk does," he discussed how founders should position themselves in the current AI wave and find the opportunities that are genuinely theirs.

Contents
01. Are foundation models a dead end for startups, and are "long-tail agent companies" the real opportunity? Why does Bret Taylor call "applied AI" the viable path for founders? How will "long-tail agent companies" displace traditional SaaS? ...
02. Why has Python, the most developer-friendly language, become a bottleneck for AI? What is the new paradigm of AI programming? ...

01 Are foundation models a dead end for startups, and are "long-tail agent companies" the real opportunity?
1. Bret Taylor's career runs through the past two decades of technology waves. In the recent interview, he revisited his own successes and failures and, building on that, analyzed the opportunities of the current AI era from a founder's standpoint: the real blue ocean lies in agents that deliver business outcomes, the core of the business model will be "paying for outcomes," and the traditional market ...
At the jam-packed World Robot Conference, 自变量 showed off genuinely general-purpose embodied intelligence
机器之心· 2025-08-08 10:18
Report by 机器之心, edited by 泽南. It tidies up around the house, makes scented sachets, and even poses with finger hearts and peace signs. Embodied intelligence has evolved this far; genuinely meeting users' needs seems within reach. This morning, the 2025 World Robot Conference (WRC) officially opened. Recent AI breakthroughs had exhibitors rolling out new technology, and weaving between the packed booths we saw a crowd of robots powered by "embodied intelligence," many of them making their debut. They seemed capable of everything from tidying housework, industrial logistics, and manufacturing assembly to dance performances, in all manner of forms, rather like the recent "war of a hundred models" among large language models. Amid all this, one company went the opposite way and delivered genuinely general-purpose intelligence: one brain, many uses. 自变量机器人, one of China's leading startups in the field, has set a new standard for embodied intelligence. One brain, many uses, across many scenarios: at its WRC booth, 自变量's general-purpose wheeled dual-arm robot 小量 was making scented sachets as little gifts for visitors. It runs the company's self-developed general embodied foundation model WALL-A and learned to make sachets autonomously within just a few days, and it is not fussy about its environment: however complex the sound and lighting on the show floor and however the crowds move, nothing disturbs its "fine handiwork." On command, the robot can pick out different sachets to match a visitor's preference and, working within a space of less than 10 cm, coordinate its two arms with precision to handle the complex deformation of soft materials and complete the sachet ...
A 4-0 sweep of Grok 4: o3 takes the crown as results of the first large-model tournament come in
机器之心· 2025-08-08 10:18
Core Viewpoint
- The first Kaggle AI Chess Championship concluded with o3 defeating Grok 4 decisively, showcasing the advancement of AI chess models and their competitive capabilities [2][4][15].

Group 1: Championship Results
- o3 won the championship by sweeping Grok 4 with a score of 4-0 [4][15].
- Gemini 2.5 Pro secured third place by defeating o4-mini 3.5-0.5 [4][17].

Group 2: Performance Analysis
- Grok 4, initially a strong contender, made critical mistakes in the final, leading to its unexpected defeat [6][7][8].
- In the first game, Grok 4 lost a piece early on, setting a negative tone for the rest of the match [8][10].
- In the second game, a risky opening from Grok 4 led to a major blunder that o3 converted with ease [10][12].
- In the third game, Grok 4 failed to hold its position and collapsed despite a promising start [12][13].
- The final game was the closest, but o3's superior endgame technique ultimately secured the victory [13][15].

Group 3: Insights on Competitors
- Gemini 2.5 Pro's play was inconsistent, marred by several amateur-level mistakes [17][19].
- Despite the chaotic matches, Gemini still took third place, suggesting room for future improvement [24].
A new inference paradigm for diffusion LLMs: breaking the generation-length limit with dynamic, adaptive adjustment
机器之心· 2025-08-08 10:18
With the release of diffusion large language models (DLLMs) such as Gemini-Diffusion and Seed-Diffusion, the field has become a hot direction in both industry and academia. Current DLLMs, however, must run inference with a preset, fixed generation length, and each task needs dedicated tuning to reach its best results. To resolve this fundamental limitation, MMLab at The Chinese University of Hong Kong, Shanghai AI Laboratory, and collaborators propose DAEDAL, which gives a DLLM the ability to adjust its answer length on its own according to the problem at hand. This closes a key gap between DLLMs and autoregressive LLMs and lays a foundation for more flexible, efficient, and capable diffusion language models.

Paper title: Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models
Paper: https://arxiv.org/abs/2508.00819
Code: https://github.com/Li-Jinsong/DAEDAL

DAEDAL is a training-free denoising strategy: starting from a single, very short initial length, it lets the model adjust the length during generation as it sees fit, expanding dynamically, and it matches existing denoising strategies whose generation lengths are carefully tuned per benchmark ...
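To make "training-free, variable-length denoising" concrete, here is a minimal sketch of the general idea under stated assumptions. All helper functions, the EOS-confidence signal, and the thresholds are illustrative stand-ins, not DAEDAL's actual procedure; see the paper and repository above for the real algorithm.

```python
MASK = "[MASK]"

def variable_length_denoise(model, prompt, init_len=32, max_len=1024,
                            expand_chunk=32, eos_conf=0.5, unmask_conf=0.9):
    # Start from a deliberately short all-mask canvas instead of a
    # task-tuned fixed length.
    canvas = [MASK] * init_len
    while MASK in canvas:
        probs = model.predict(prompt, canvas)  # per-position distributions
        # If the model is not confident that an end-of-sequence token fits
        # before the canvas runs out, treat the canvas as too short and
        # grow it with fresh mask tokens mid-denoising.
        if eos_probability(probs, canvas) < eos_conf and len(canvas) < max_len:
            canvas += [MASK] * expand_chunk
            continue
        # Otherwise take a standard confidence-based denoising step:
        # commit the masked positions the model is most sure about.
        canvas = unmask_confident_positions(canvas, probs, unmask_conf)
    return canvas
```

The point of the sketch is the control flow: length becomes a decision the model makes during denoising rather than a hyperparameter fixed before it.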
From Debugger to Developer: NoCode-bench, a new benchmark for the low-code era, recommended by the SWE-Bench authors
机器之心· 2025-08-08 07:53
Core Insights
- The article introduces NoCode-bench, a new benchmark for evaluating how well large language models (LLMs) handle natural-language-driven feature-addition tasks in software development [3][27].
- Current LLMs succeed on only about 20% of these tasks, exposing significant gaps in AI's ability to handle real-world development scenarios [3][26].

Group 1: Benchmark Development
- NoCode-bench was built to address the limitations of existing benchmarks like SWE-bench, which focus primarily on bug fixing rather than feature addition [6][27].
- The benchmark emphasizes understanding documentation changes in order to implement new features, reflecting a more realistic development workflow (a sketch of what an instance and its evaluation might look like follows below) [6][27].
- Construction followed a rigorous five-phase process, from selecting well-maintained open-source projects to filtering instances against developer-verified release notes [8][10][16].

Group 2: Challenges Identified
- NoCode-bench tasks pose three main challenges:
1. More complex input: documentation changes are nearly twice as long as bug reports, demanding stronger long-text comprehension [12].
2. Harder change localization: tasks often span multiple files and code blocks, requiring strong cross-file editing [13].
3. Larger edits: nearly 20% of tasks require modifying more than 200 lines of code, raising the risk of errors [14].

Group 3: Model Performance Evaluation
- A comprehensive evaluation of six leading LLMs, including Claude-4-Sonnet and GPT-4o, produced disappointing results, with the best model succeeding on only 15.79% of tasks [18][26].
- Failure analysis identified three main causes: weak cross-file editing, insufficient understanding of codebase structure, and inadequate tool-invocation ability [20][21][22].

Group 4: Future Directions
- The results indicate that current LLMs are not ready for the complexity of document-driven feature development, pointing to the need for further advances [24][27].
- The findings offer a roadmap for future AI software engineers, centered on better cross-file editing, codebase comprehension, and tool interaction [27].
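For illustration only, here is one plausible shape for a NoCode-bench-style instance and its pass/fail check. The field names, helper functions, and the two-test-suite criterion are assumptions inferred from the summary above, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class FeatureAdditionTask:
    repo: str                  # e.g. "owner/project" (hypothetical)
    base_commit: str           # code state before the feature existed
    doc_change: str            # documentation / release-note diff describing the feature
    feature_tests: list = field(default_factory=list)     # must newly pass
    regression_tests: list = field(default_factory=list)  # must keep passing

def is_solved(task, model_patch, checkout, apply_patch, run_tests):
    # The model sees only the documentation change and the repository;
    # success means the developer-written tests for the new feature pass
    # without breaking anything that already worked.
    repo_dir = checkout(task.repo, task.base_commit)
    apply_patch(repo_dir, model_patch)
    return (run_tests(repo_dir, task.feature_tests) and
            run_tests(repo_dir, task.regression_tests))
```

Under this framing, the three challenges above map directly onto the code: `doc_change` is the long input, the patch may touch many files, and the test oracle penalizes any regression introduced by a large edit.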