机器之心
Open-source RL framework Verlog arrives: built for LLM agents, 400 turns is no problem
机器之心· 2025-10-08 04:13
Core Insights
- The article discusses the challenges intelligent agents face in maintaining clear reasoning and robust decision-making over long-horizon tasks, particularly when a task extends to hundreds of steps [2][3]
- It introduces Verlog, a multi-turn reinforcement learning framework designed to handle long-horizon tasks effectively, overcoming the limitations of traditional frameworks [3][20]

Group 1: Framework Overview
- Verlog is built on VeRL and BALROG, incorporating specialized optimization techniques to keep training stable and efficient on tasks that extend beyond 400 steps [3][20]
- The framework has been validated in complex environments such as BabyAI, BabaIsAI, and Crafter, demonstrating strong performance on tasks with widely varying episode lengths [3][19]

Group 2: Methodology
- The base model for Verlog is the Qwen-2.5 Instruct variant, which allows seamless integration with BALROG and lets the benchmark's prompts be reused with minimal modification [6][7]
- A memory mechanism retains only the latest n + 1 rounds of interaction, a choice tuned for the 3B-parameter Qwen model [9][10]

Group 3: Algorithmic Innovations
- The Dual Discounting GAE algorithm decouples token-level from step-level discounting, encouraging agents to complete tasks in fewer environment steps [11][20]
- The recursive GAE computation improves training stability, allowing effective learning even under sparse rewards [12][14]

Group 4: Experimental Results
- Verlog was tested on three challenging benchmarks, Crafter, BabyAI, and BabaIsAI, showcasing its ability to adapt to long-duration tasks with sparse rewards [16][19]
- Training Qwen2.5-7B-Instruct in the Crafter environment used 8 H100 GPUs for roughly 36 hours, while Qwen2.5-3B-Instruct for BabyAI and BabaIsAI was trained on 4 A40 GPUs for about 24 hours [19]

Group 5: Future Directions
- Verlog aims to serve as a flexible research platform to advance long-horizon LLM-agent reinforcement learning [21][20]
- The framework addresses key engineering challenges such as managing long interaction histories, ensuring training stability under sparse rewards, and handling variable trajectory lengths [20][23]
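The dual-discounting recursion described above can be sketched in a few lines. This is a minimal illustration of the idea, not Verlog's actual implementation: the function name, default hyperparameter values, and the convention of crediting each step's reward to its final token are all assumptions.

```python
def dual_discount_gae(token_values, step_rewards, step_lens,
                      gam_tok=1.0, lam_tok=1.0, gam_step=0.99, lam_step=0.95):
    """Backward GAE recursion with two discount pairs (illustrative sketch).

    token_values: per-token value estimates (length T)
    step_rewards: one scalar reward per environment step, credited to the
                  step's last token (sparse-reward convention, assumed here)
    step_lens:    tokens per environment step; sums to T
    """
    T = len(token_values)
    # Mark the last token of each environment step and scatter its reward there.
    is_step_end = [False] * T
    rewards = [0.0] * T
    idx = -1
    for r, length in zip(step_rewards, step_lens):
        idx += length
        is_step_end[idx] = True
        rewards[idx] = r

    adv = [0.0] * T
    next_value, next_adv = 0.0, 0.0
    for t in reversed(range(T)):
        # Step boundaries use the step-level discounts; tokens inside a step
        # use the (typically undiscounted) token-level ones.
        gam, lam = (gam_step, lam_step) if is_step_end[t] else (gam_tok, lam_tok)
        delta = rewards[t] + gam * next_value - token_values[t]
        adv[t] = delta + gam * lam * next_adv
        next_value, next_adv = token_values[t], adv[t]
    return adv
```

With token-level discounts at 1.0, tokens inside a step share their step's advantage undiminished, while discounting only accrues across environment steps, which is what pushes the agent toward finishing in fewer steps.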
Google joins the CUA battlefield with Gemini 2.5 Computer Use: letting AI operate the browser directly
机器之心· 2025-10-08 03:18
Core Insights
- Google DeepMind has launched the Gemini 2.5 Computer Use model, which allows AI to directly control the user's browser, similar to OpenAI's Computer-Using Agent (CUA) [1][25]
- The model demonstrates state-of-the-art (SOTA) performance across benchmarks, outperforming competitors on several tasks [6][25]

Benchmark Performance
- Gemini 2.5 Computer Use achieved notable scores in benchmark tests:
  - Online-Mind2Web: 69.0% accuracy
  - Measured by Browserbase: 65.7% accuracy
  - WebVoyager: 88.9% self-reported accuracy
  - AndroidWorld: 69.7% accuracy [7]

Speed and Accuracy
- The model completes tasks with high accuracy and speed, effectively gathering information and organizing notes [5][9]
- However, it struggles with more complex tasks, indicating the limits of its current capabilities [9][11]

User Interaction and Workflow
- Users can access the model's capabilities through Google AI Studio and Vertex AI's Gemini API, with a demo environment available for testing [13]
- The model operates in a loop, analyzing user inputs and generating UI action function calls, with safety mechanisms in place to confirm sensitive actions [19][21]

Safety Mechanisms
- Google integrated safety measures during training to mitigate the risks of AI controlling computers, including user misuse and unexpected model behavior [23][26]
- Developers are given options to prevent the model from executing potentially harmful actions [24][26]

Industry Implications
- The introduction of Gemini 2.5 Computer Use signals a competitive shift in the AI-agent landscape, with major tech companies vying to redefine human-computer interaction [25]
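The observe-propose-confirm-act loop described above can be sketched generically. Everything here, the model client, the action schema, and the function names, is an illustrative assumption and not the actual Gemini API; it only shows the control flow such agents run.

```python
# Hypothetical sketch of a computer-use agent loop: observe the UI, ask the
# model for the next action, gate risky actions behind user confirmation,
# execute, and repeat until the model reports completion.
RISKY_ACTIONS = {"purchase", "send_email", "delete"}

def run_agent(goal, model, browser, max_turns=20):
    history = []
    for _ in range(max_turns):
        screenshot = browser.screenshot()            # observe current UI state
        action = model.propose_action(goal, screenshot, history)
        if action["type"] == "done":                 # model reports task finished
            return action.get("result")
        if action["type"] in RISKY_ACTIONS:          # safety gate: confirm first
            if not browser.confirm_with_user(action):
                history.append({"action": action, "outcome": "rejected"})
                continue
        outcome = browser.execute(action)            # click / type / scroll ...
        history.append({"action": action, "outcome": outcome})
    return None  # turn budget exhausted
```

The safety gate mirrors the article's point that certain actions require explicit confirmation before execution; which actions count as risky is an assumption of this sketch.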
The 2025 Nobel Prize in Physics goes to macroscopic quantum tunneling: they "built" Schrödinger's cat in the lab
机器之心· 2025-10-07 10:53
Core Viewpoint
- The 2025 Nobel Prize in Physics was awarded to John Clarke, Michel H. Devoret, and John M. Martinis for their groundbreaking work demonstrating macroscopic quantum tunneling and energy quantization in superconducting circuits, paving the way for next-generation quantum technologies [2][5][11]

Group 1: Experimental Achievements
- The laureates conducted a series of experiments in the 1980s showing that quantum tunneling effects can be observed at a macroscopic scale, specifically in superconducting circuits [11][12]
- They built a circuit with two superconductors separated by a thin insulating layer, demonstrating that the charged particles in a superconductor behave collectively as if they were a single particle [11][12]
- Their experiments confirmed that the superconducting system could escape a zero-voltage state via tunneling, producing a measurable voltage, and that its energy levels are quantized [12][28]

Group 2: Theoretical Implications
- The experiments offer significant insight into quantum mechanics, illustrating how macroscopic phenomena can arise from the collective behavior of many microscopic particles [31][33]
- The work parallels Schrödinger's cat thought experiment, suggesting that macroscopic quantum states can exist and be measured, challenging traditional views of quantum mechanics [31][33]
- The findings bear on the development of quantum technologies, including quantum computing, through the energy-quantization principles demonstrated in the research [35][37]

Group 3: Future Applications
- The research opens new avenues for experimental exploration of quantum phenomena, potentially enabling artificial atoms for use in quantum-technology applications [35]
- John Martinis's subsequent work on quantum computers builds on the principles established by the laureates, a direct application of their findings to advancing quantum computing [35]
DeepMind releases CodeMender, a code-repair AI agent unifying "passive response" and "proactive defense"
机器之心· 2025-10-07 07:00
Core Viewpoint
- The article introduces CodeMender, an AI agent developed by DeepMind that automatically repairs critical software vulnerabilities while ensuring its fixes introduce no new issues, underscoring the importance of rigorous validation in AI-driven code security [2][10]

Group 1: CodeMender Overview
- CodeMender takes a comprehensive approach to software vulnerabilities, balancing passive response with proactive defense: it immediately patches newly discovered vulnerabilities and rewrites existing code to eliminate systemic flaws [4]
- Over the past six months, DeepMind has upstreamed 72 security patches to open-source projects, with some target projects spanning up to 4.5 million lines of code [5]
- By automating the creation and application of high-quality security patches, CodeMender lets developers focus on building quality software rather than on chasing vulnerabilities [6]

Group 2: Developer Reactions
- The release has sparked discussion among developers; some highlight its ability to ensure fixes do not break other functionality, calling it a significant advance in automation [8]
- Others worry that CodeMender could disrupt income streams tied to quality assurance, security audits, and bug-bounty programs [8]

Group 3: AI Vulnerability Reward Program
- Google has also launched a reward program specifically targeting vulnerabilities in AI products; bug hunters have earned over $430,000 since the initiative began two years ago [9]

Group 4: CodeMender's Mechanism
- CodeMender runs on the latest Gemini deep-thinking models, enabling it to automatically debug and repair complex vulnerabilities while ensuring modifications are logically sound and cause no additional problems [12]
- The agent uses a range of tools, including debuggers and source-code browsers, to pinpoint root causes and design patches [14]
- It applies advanced program-analysis techniques, such as static and dynamic analysis, to systematically examine code patterns and identify vulnerabilities [18]

Group 5: Case Studies
- In one case, CodeMender traced a root cause to stack management in XML parsing, producing a patch that modified only a few lines of code [15]
- In another, it created a non-trivial patch addressing complex object-lifecycle issues, demonstrating that it can harden security by rewriting existing code [17]

Group 6: Future Developments
- Every patch CodeMender generates undergoes human review before being submitted to upstream projects, ensuring reliability and quality [24]
- DeepMind plans to share further technical papers and reports in the coming months, with the goal of eventually releasing CodeMender as a tool all developers can use to improve software security [24]
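As a toy illustration of the static-analysis side (vastly simpler than CodeMender's actual tooling, which the article describes as combining debuggers, source-code browsers, and static plus dynamic analysis), a pattern check that flags memory-unsafe C calls might look like this. The function name and the set of flagged calls are assumptions for the sketch.

```python
import re

# Toy static-analysis pass: flag calls to C functions that are classic
# sources of memory-safety bugs. Real analyzers do data-flow and
# inter-procedural analysis; this only pattern-matches call sites.
UNSAFE_CALLS = ("strcpy", "strcat", "sprintf", "gets")
CALL_RE = re.compile(r"\b(" + "|".join(UNSAFE_CALLS) + r")\s*\(")

def flag_unsafe_calls(c_source: str):
    """Return (line_number, function_name) pairs for each unsafe call site."""
    findings = []
    for lineno, line in enumerate(c_source.splitlines(), start=1):
        for match in CALL_RE.finditer(line):
            findings.append((lineno, match.group(1)))
    return findings
```

A finding here is only a candidate; the article stresses that CodeMender validates each proposed fix so that it is logically sound before a human reviews it.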
Tian Yuandong and Russell's teams join forces to prove Transformers naturally learn superposition reasoning during training
机器之心· 2025-10-07 03:57
Core Insights
- The article discusses a new reasoning paradigm called "Chain of Continuous Thought" (Coconut), which lets large language models (LLMs) keep their reasoning trajectory in a continuous latent space rather than the discrete token space, yielding significant performance gains [1][2]

Group 1: Continuous Thought Mechanism
- Coconut lets a model reason in a superposition state, retaining multiple potential reasoning paths in parallel, which is more efficient than traditional methods [3][4]
- A key result is that a two-layer Transformer decoding O(n) continuous thoughts can solve directed-graph reachability, where n is the number of nodes in the graph [5]

Group 2: Training Dynamics
- Recent research by the teams of Tian Yuandong and Stuart Russell theoretically confirms that gradient-descent training naturally converges to this structure, demonstrating that superposition emerges during training [6][8]
- The training dynamics show that even when each training sample demonstrates only a single path, superposition emerges spontaneously, with index-matching logits staying bounded, which is crucial for local-search capability [9][10]

Group 3: Experimental Results
- The experimental setup used a GPT-2-style decoder with two Transformer layers, trained for 350 epochs, reaching 96.2% accuracy on the test set [13][15]
- During the reasoning-generation phase, the model's attention concentrated on frontier edges, producing a stable logit difference that matches the theoretical predictions [19][20]

Group 4: Prediction Phase
- In the prediction phase, the model relies on two signals, residual carryover and candidate lift, which boost the logits of the correct candidates [24][27]
- These signals rise rapidly and stabilize within roughly five epochs, ensuring that the correct candidate's logit is maximized [29][30]

Group 5: Summary of Findings
- The study systematically analyzes how superposition states emerge spontaneously during continuous-thought-chain training, highlighting that bounded logits let the model balance exploration and exploitation during reasoning [32][33][34]
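Conceptually, the superposition behaves like a breadth-first-search frontier carried in a single continuous thought: each step expands every candidate path at once, so reachability resolves within n expansions. The toy set-based analog below models the search the paper analyzes, not the Transformer that learns it; the function name and edge representation are assumptions.

```python
# Toy analog of "superposition" reasoning on directed-graph reachability:
# the frontier set plays the role of one continuous thought holding all
# current candidate nodes in parallel, giving an O(n) step bound.
def reachable(edges, source, target):
    nodes = {u for edge in edges for u in edge}
    frontier = {source}               # superposed set of current candidates
    visited = set()
    for _ in range(len(nodes)):       # at most n "thought" steps
        if target in frontier:
            return True
        visited |= frontier
        # Expand every frontier node simultaneously (one parallel BFS layer).
        frontier = {v for (u, v) in edges if u in frontier} - visited
        if not frontier:
            return False
    return target in frontier
```

Tracking the whole frontier at once is exactly what serial token-by-token chains cannot do, which is why the continuous latent space buys efficiency here.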
Tsinghua, NVIDIA, and Stanford propose DiffusionNFT: a new forward-process paradigm for diffusion RL with 25x higher training efficiency
机器之心· 2025-10-07 00:14
Professor Zhu Jun's team at Tsinghua University, NVIDIA's Deep Imagination research group, and Stefano Ermon's team at Stanford have jointly proposed a new reinforcement learning (RL) paradigm for diffusion models: Diffusion Negative-aware FineTuning (DiffusionNFT). The method is the first to break with the basic assumptions existing RL approaches make about diffusion models, optimizing directly on the forward (noising) process. It removes the dependence on likelihood estimation and on specific samplers entirely while significantly improving training efficiency and generation quality. Co-first authors Zheng Kaiwen and Chen Huayu are PhD students in Tsinghua's Department of Computer Science.

In recent years, the great success of reinforcement learning in post-training large language models (LLMs) has prompted efforts to transfer similar methods to diffusion models. For example, methods such as FlowGRPO discretize the diffusion sampling procedure into a multi-step decision problem and apply policy-gradient optimization on the reverse process. This line of thinking, however, has several fundamental limitations:

Paper title: DiffusionNFT: Online Diffusion Reinforcement with Forward Process
Paper link: https://arxiv.org/abs/2509.16117
Code repository: https://github ...
Just in from OpenAI DevDay: AgentKit, Codex general availability, the Apps SDK, and the Sora 2 API
机器之心· 2025-10-07 00:14
Core Insights
- OpenAI has hit major milestones over the past two years: 4 million developers, 800 million weekly active ChatGPT users, and API consumption of 6 billion tokens per minute [2][4]

Group 1: New Tools and Features
- OpenAI introduced several new tools at the developer conference, including AgentKit, the general availability of Codex, ChatGPT built-in applications, and new APIs such as gpt-realtime-mini and the Sora 2 API [4][28][32][39][43]
- AgentKit is a comprehensive toolkit for developers and enterprises to build, deploy, and optimize intelligent agents, with components including Agent Builder, Connector Registry, and ChatKit [11][14][21][22]
- Codex is now generally available with new features, including custom tool calls and custom graders, and has seen a tenfold increase in daily active users since August [28][29][30]
- ChatGPT's new built-in applications let users interact with third-party services seamlessly within the chat interface, with initial partners including Booking.com, Canva, and Spotify [32][34]

Group 2: Performance and Usage Metrics
- A customer-service agent built with OpenAI's tools now handles two-thirds of all support tickets for Klarna, while Clay achieved tenfold growth through sales agents [24]
- Codex has become an integral part of OpenAI's own development process, with a 70% increase in the number of pull requests merged weekly since its adoption [31]
- The new Sora API lets developers create and remix video content programmatically, showcasing OpenAI's advances in generative media [44][48]

Group 3: Future Plans
- OpenAI plans to introduce a standalone Workflows API and agent-deployment options for ChatGPT in the near future [26]
- The Apps SDK has been open-sourced, enabling developers to design applications that can reach over 800 million ChatGPT users [37]
EMNLP 2025 | CARE: native retrieval-augmented reasoning with high context fidelity, no external tools needed
机器之心· 2025-10-06 04:00
Recently, a research team from MetaGPT, Université de Montréal and the Mila institute, McGill University, Yale University, and other institutions released the CARE framework, a novel native retrieval-augmented reasoning framework that teaches an LLM to organically combine in-context facts with the model's own retrieval ability during reasoning. The framework is fully open-sourced, including training datasets, training code, model checkpoints, and evaluation code, giving the community a complete, reproducible pipeline.

Project page: https://foundationagents.github.io/CARE
Paper: https://arxiv.org/abs/2509.13683
Datasets: https://huggingface.co/collections/sheryc/care-datasets-emnlp-2025-68be35242afab58f4bed7d97
Checkpoints: https://huggingface.co/collections/sheryc/care-checkpoints-emnlp-2025-68be35dbd732816c9d98f258
Open-source code: https://github.com/Founda ...

Research Background: the shift from "external search" to "native retrieval"
1. The dilemma of existing methods
Does running multiple coding agents at once get chaotic? Overseas developers weigh in
机器之心· 2025-10-06 04:00
Core Insights
- The rapid advance of AI programming tools is transforming the coding landscape, with models like GPT-5 and Gemini 2.5 enabling a degree of automation in development tasks [1][2]
- AI coding agents have become the norm not only for programmers but also for professionals in product and design roles, and the proportion of AI-generated code keeps rising [3]
- Despite the benefits, challenges remain around code quality and review efficiency, prompting developers to explore running multiple AI agents in parallel [3][5]

Summary by Sections
- Parallel Coding-Agent Lifestyle: Simon Willison initially had reservations about using multiple AI agents, worried that code review would become the bottleneck. He has since embraced the approach, finding that running many small parallel tasks stays manageable without overwhelming cognitive load [5][6]
- Task Categories for Parallel Agents:
  - Research tasks: agents answer questions or offer suggestions without modifying core project code, enabling rapid prototyping and validation of concepts [7][9]
  - System-mechanism recall: modern models quickly return detailed, actionable answers about how a system works, aiding understanding of complex codebases [10][11]
  - Small maintenance tasks: low-risk modifications, such as addressing deprecation warnings, can be delegated to agents while the developer stays focused on primary tasks [13][14]
  - Precisely specified work: reviewing code generated from a detailed specification is less burdensome, since the review reduces to verifying compliance with the stated requirements [15]
- Current Usage Patterns: Willison's primary tools include Claude Code, Codex CLI, and Codex Cloud, among others. He often runs multiple instances in different terminal windows, executing tasks in "YOLO mode", i.e. without per-action approval, where the risks are manageable [16][19]
- Developer Community Response: the blog post has garnered significant attention, resonating with current pain points in coding workflows. Many developers are experimenting with parallel AI agents, with some reporting that a substantial portion of their coding work is AI-assisted [21][22]
- Concerns and Discussions: while some developers express apprehension about the unpredictability of AI-generated code, others, including Willison, advocate the benefits of parallel agent usage, particularly for research tasks that commit no code [26][29]
Apple publishes another paper: precisely localizing LLM hallucinations, something even GPT-5 and o3 can't do
机器之心· 2025-10-06 04:00
Reported by the 机器之心 editorial team.

Apple has entered a prolific stretch of paper publishing, with new research appearing every few days.

Just recently, Apple released another heavyweight paper that has drawn attention from both academia and industry.

The paper is striking: it uses reinforcement learning to train a model to accurately mark which parts of an answer are hallucinated.

Its core breakthrough is that the model no longer merely hints that an error exists somewhere; it points directly at the specific span of text that is wrong. For users who need to revise an output or fact-check it, this saves a great deal of time.

The proposed method, RL4HS, uses span-level rewards and Class-Aware Group Relative Policy Optimization (class-aware GRPO) to keep the model from taking the lazy shortcut of always predicting "no error".

On the span-level hallucination-detection task, the method even outperforms GPT-5 and o3.

Overall, span-level rewards plus the class-balancing mechanism teach the model to genuinely check its evidence and precisely pinpoint erroneous content, an important step toward more reliable, auditable large language models.

Source: https://x.com/rohanpaul_ai/status/1974652007068967315

Next, let's look at the paper itself. In the abstract, the authors note that large language ...
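The span-level reward idea can be illustrated with a simple F1-style scorer over predicted versus gold hallucinated spans. The actual reward shaping and the class-aware GRPO machinery in the paper are more involved; the function below is an assumption-laden sketch whose name and span convention are not from the paper.

```python
# Illustrative span-level reward in the spirit of RL4HS: score predicted
# hallucinated spans against gold spans with character-overlap F1, so the
# lazy policy "predict no errors" earns zero reward whenever errors exist.
def span_f1_reward(pred_spans, gold_spans):
    """Spans are (start, end) character offsets, end exclusive."""
    if not gold_spans and not pred_spans:
        return 1.0                     # correctly predicted "no hallucination"
    pred_chars = {i for s, e in pred_spans for i in range(s, e)}
    gold_chars = {i for s, e in gold_spans for i in range(s, e)}
    overlap = len(pred_chars & gold_chars)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(gold_chars)
    return 2 * precision * recall / (precision + recall)
```

Because an empty prediction scores zero whenever gold spans exist, the reward alone already penalizes the shortcut behavior; the paper's class-aware grouping then balances learning between the "has error" and "no error" cases.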