Explainer: Deconstructing LLM Post-Training, and the Past and Present of GRPO and Its Successors
机器之心· 2025-09-01 02:49
机器之心 report. Editor: 冷猫.

GRPO is like a tree node: from here, the branches fan out. Large language models are advancing at a dizzying pace. Since DeepSeek burst onto the scene, GRPO, its post-training innovation, has leapt to become the gold-standard paradigm for reinforcement learning. GRPO is now a general-purpose RL algorithm for large models, applicable to a wide range of post-training tasks, even teaching a model to play 2048.

The popular picture of a large language model seems simple: a model trained by self-supervision on massive data that predicts the next token in a text and thereby generates language. But this picture is incomplete. It highlights only the "pre-training" stage and entirely overlooks the equally important "post-training" stage. In short, learning from massive data is called "pre-training," and it gives the model general language ability; but that alone does not guarantee outputs that match human preferences. The model may generate verbose or inaccurate content, or content that does not fit the target application. In other words, a pre-trained model can talk, but it cannot necessarily say the right thing.

This year alone, post-training research has produced several heavyweight results, including DAPO from the Seed team, GSPO from the Qwen team, and GFPO from Microsoft, and without exception they are improvements on the GRPO paradigm. The names alone are dizzying: what is it about GRPO that lets the major resear ...
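The group-relative idea at the heart of GRPO can be sketched in a few lines: sample several completions per prompt, score each one, and use its reward normalized against its own group's statistics as the advantage, removing the need for the learned value model that PPO requires. A minimal sketch (illustrative only, not DeepSeek's implementation; the reward values are made up):

```python
import math

def grpo_advantages(rewards):
    """Group-relative advantages: each completion's reward is normalized
    by the mean and standard deviation of its sampling group, replacing
    the learned value baseline used in PPO."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Four completions sampled for one prompt, scored by some reward function
rewards = [1.0, 0.0, 0.5, 0.5]
print([round(a, 3) for a in grpo_advantages(rewards)])
# → [1.414, -1.414, 0.0, 0.0]
```

These advantages then weight a clipped, PPO-style policy-gradient update; the successors named above (DAPO, GSPO, GFPO) largely differ in how this group statistic is computed, clipped, or aggregated.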
RLinf: The First Large-Scale Reinforcement Learning Framework Built for Embodied Intelligence, Open-Sourced by Tsinghua, Beijing Zhongguancun Academy, 无问芯穹, and Others
机器之心· 2025-09-01 02:49
Tsinghua University, Beijing Zhongguancun Academy, and 无问芯穹, together with Peking University, UC Berkeley, and other institutions, have open-sourced RLinf: the first large-scale reinforcement learning framework for embodied intelligence that unifies rendering, training, and inference. Artificial intelligence is leaping from "perception" to "action," and embodied intelligence built on large models is widely viewed as AI's next stage of development, a topic of shared interest across academia and industry.

机器之心 report (机器之心 editorial team).

In the large-model field, since the release of o1/R1-style reasoning models, the center of gravity in training has shifted from data-driven pre-training and post-training toward reward-driven reinforcement learning (RL). OpenAI predicts that the compute required for RL will eventually exceed that of pre-training. At the same time, RL infrastructure that can put large-scale compute to efficient use has grown ever more important, and a batch of excellent frameworks has recently emerged, greatly advancing the field.

Figure 1: OpenAI's presentation at a closed-door Sequoia Capital meeting.

However, current frameworks offer limited support for embodied intelligence. Unlike reasoning LLMs, which are pure "brain" models, the embodied-intelligence field spans brain models (reasoning and long-horizon planning, e.g., RoboBrain), cerebellum models (execution and short-horizon manipulation, e.g., OpenVLA), and combined brain-cerebellum systems (fast-slow systems, e.g., pi 0.5). Moreover, beyond the multi-step decision-making of agentic AI, embodied intelligence ...
That Day, the AI Model Remembered the Shackles of "Amnesia" That Bound It
机器之心· 2025-08-31 05:33
Core Insights
- The article discusses the advancements in memory capabilities of large language models (LLMs), highlighting how companies like Google, OpenAI, and Anthropic are integrating memory features into their AI systems to enhance user interaction and continuity in conversations [1][3][10].

Memory Capabilities of LLMs
- Google's Gemini has introduced memory capabilities that allow it to retain information across multiple conversations, making interactions more natural and coherent [1].
- OpenAI's ChatGPT has implemented a memory feature since February 2024, enabling users to instruct the model to remember specific details, which improves its performance over time [3][42].
- Anthropic's Claude has also added memory functionality, allowing it to recall previous discussions when prompted by the user [3][6].

Types of Memory in LLMs
- Memory can be categorized into sensory memory, short-term memory, and long-term memory, with a focus on long-term memory for LLMs [16][17].
- Contextual memory is a form of short-term memory where relevant information is included in the model's context window [18].
- External memory involves storing information in an external database, allowing for retrieval during interactions, which is a common method for building long-term memory [22][23].
- Parameterized memory attempts to encode information directly into the model's parameters, providing a deeper form of memory [24][29].

Innovations in Memory Systems
- New startups are emerging, focusing on memory systems for AI, such as Letta AI's MemGPT and RockAI's Yan 2.0 Preview, which aim to enhance memory capabilities [11][12].
- The concept of hybrid memory systems is gaining traction, combining different types of memory to improve AI's adaptability and performance [37][38].

Notable Memory Implementations
- OpenAI's ChatGPT allows users to manage their memory entries, while Anthropic's Claude retrieves past conversations only when requested [42][44].
- Gemini supports user input for memory management, enhancing its ability to remember user preferences [45].
- The M3-Agent developed by ByteDance, Zhejiang University, and Shanghai Jiao Tong University integrates long-term memory capabilities across multiple modalities, including video and audio [10][70].

Future Trends in AI Memory
- The future of AI memory is expected to evolve towards multi-modal and integrated memory systems, allowing for a more comprehensive understanding of user interactions [97][106].
- There is a growing emphasis on creating memory systems that can autonomously manage and optimize their memory, akin to human cognitive processes [101][106].
- The ultimate goal is to develop AI systems that can exhibit unique personalities and emotional connections through their memory capabilities, potentially leading to the emergence of artificial general intelligence (AGI) [109][110].
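The external-memory pattern described above (store facts outside the model, retrieve the most relevant ones into the context window at query time) can be sketched minimally. This is an illustrative toy that assumes bag-of-words cosine similarity in place of the dense embeddings and vector databases real systems use:

```python
from collections import Counter
import math

class ExternalMemory:
    """Toy external memory: store past facts as text and surface the
    most relevant ones for a query (a stand-in for retrieval-augmented
    long-term memory; real systems use learned embeddings)."""

    def __init__(self):
        self.entries = []

    def _vec(self, text):
        return Counter(text.lower().split())

    def _cosine(self, a, b):
        dot = sum(a[t] * b[t] for t in a if t in b)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def remember(self, fact):
        self.entries.append(fact)

    def recall(self, query, k=1):
        ranked = sorted(self.entries,
                        key=lambda e: self._cosine(self._vec(e), self._vec(query)),
                        reverse=True)
        return ranked[:k]

mem = ExternalMemory()
mem.remember("user prefers concise answers in English")
mem.remember("user is allergic to peanuts")
print(mem.recall("is the user allergic to any food?"))
# → ['user is allergic to peanuts']
```

In a full system, the recalled entries would be prepended to the model's prompt, turning external storage into usable contextual memory.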
This Absurd Website Hides 30 AI "Crazy Ideas," but I Don't Think It Will Last
机器之心· 2025-08-31 03:54
机器之心 report. Editor: 杨文.

Browsing X recently, I stumbled on a curious website: "Absurd.website." A brilliant idea is often a company's most dangerous poison. True to its name, the site is absurd, fun, and full of wild ideas, collecting all sorts of oddball little projects, some with visible traces of AI generation. For example: a project poster whose skin is suspiciously smooth, AI at a glance; interface designs on the rough side of AI; and the 100% AI project Open Celebrity, offering AI-generated, royalty-free celebrity photos for ads, social media, or any other use, with no copyright problems at all.

The site launched in 2020 and claims to release one unique project plus one members-only secret project every month, yet to date it lists only 30 projects. Website: https://absurd.website/

Next, let's look at a few of the fun ones.

A grab bag of small AI projects

Sexy Math. Who knew math could ever be linked to sex appeal? The rules: answer 10 multiplication questions correctly to unlock a photo of a beautiful woman. One user reported: "I have never seen my kids so motivated to learn multiplication! They solve problems faster than ever and even challenge themselves to beat their scores." Because the content is a bit risqué, a "disclaimer quiz" greets you before the game: are you 18 or older? But ...
R-Zero Deep Dive: How Can AI Self-Evolve Without Human Data?
机器之心· 2025-08-31 03:54
The first author of this paper, Chengsong Huang (黄呈松), is a PhD student at Washington University in St. Louis with more than five hundred Google Scholar citations; his current research interests are reinforcement learning and large language models. 机器之心 previously covered his earlier work LoraHub, which has been cited over 250 times.

The development of large language models (LLMs) has long been constrained by dependence on large-scale, high-quality human-annotated data, which is not only costly but also fundamentally limits AI's potential to push beyond the boundaries of human knowledge. "R-Zero: Self-Evolving Reasoning LLM from Zero Data" proposes a new paradigm aimed at breaking this bottleneck. The study designs a fully autonomous framework named R-Zero that lets a model start from zero and, through self-driven co-evolution, generate its own curriculum and improve its reasoning ability, offering a path toward more autonomous AI that merits close examination.

At the core of the R-Zero paper is an AI framework that self-evolves from "zero data," built around two AI roles: a Challenger and a Solver.

Paper link: https://www.arxiv.org/abs/2508.05004

Challenger-Solver co-evolution. This is a fully closed, self-driven evolutionary loop in which the AI itself gen ...
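The Challenger-Solver loop can be caricatured in a few lines. The deterministic toy below (all numbers invented, standing in for the paper's RL updates and reward design) shows only the feedback structure: the Solver improves by training on the Challenger's problems, while the Challenger is rewarded for posing problems at the Solver's frontier, so the curriculum's difficulty rises with the Solver's ability:

```python
# Toy Challenger-Solver co-evolution: a closed loop with no external
# data. `skill` stands in for the Solver model's ability and
# `difficulty` for the Challenger's curriculum level; the update sizes
# (0.02, 1.2, 1.1) are arbitrary illustrative constants.

skill = 1.0        # Solver's ability
difficulty = 1.0   # Challenger's current curriculum level

for step in range(300):
    # Solver trains on the Challenger's problems and improves a little
    skill += 0.02
    # Challenger is rewarded for staying at the Solver's frontier:
    # once problems become too easy, it raises the difficulty
    if skill / difficulty > 1.2:
        difficulty *= 1.1

print(round(skill, 2), round(difficulty, 2))
```

The point of the sketch is that neither quantity is anchored to human-labeled data: each role's progress is driven entirely by the other's, which is the closed loop the paper formalizes.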
Chaos, Infighting, Scandal: Meta Considers Bowing to Google and OpenAI
机器之心· 2025-08-31 03:54
机器之心 report. Editor: +0.

Meta has been conspicuous in the AI world lately, though the spotlight is not on model breakthroughs but on management troubles that defy easy summary. A $14.3 billion investment, an "industry genius" hired to lead, and Zuckerberg personally and loudly poaching talent everywhere have yielded data quality criticized as "poor," a wave of departing core researchers, and an eye-catching AI ethics scandal. On his way out, Agarwal even quoted Zuckerberg: "In a world that's changing so fast, the biggest risk you can take is not taking any risk." The plot could carry The Social Network 3.

The runaway "Super Bowl" team. The climax began this June. To catch up with OpenAI and Google, Zuckerberg made a heavy bet: pouring $14.3 billion into the data-labeling unicorn Scale AI and bringing in its founder, AI celebrity Alexandr Wang, to head the brand-new Meta Superintelligence Labs (MSL). At the same time, Zuckerberg launched an aggressive poaching campaign for top AI talent; he was even teased for recruiting while watching an OpenAI livestream. Ruoming Pang (庞若鸣), Apple's head of foundation models, Jason Wei, a pioneering author of chain-of-thought, and Peking University alumnus 孙之清 joined one after another. This star-studded team carried high hopes and could be called ...
Is Diffusion Really Better Positioned Than Autoregression to Achieve Grand Unification?
机器之心· 2025-08-31 01:30
机器之心 PRO member newsletter, Week 35. This week we unpack 3 noteworthy AI & Robotics industry stories.

1. Is diffusion really better positioned than autoregression to achieve grand unification? Which works argue that diffusion could replace the mainstream AR architecture? What is the theoretical basis for diffusion as a unifying architecture? Why is diffusion's parallel generation theoretically more efficient yet still slower than AR in practice? What unlocked the text-reasoning ability of diffusion language models (DLMs)? What potential of DLMs has recent work uncovered? ...

2. The "poison" and "cure" of synthetic data: what's new on model collapse? Why does synthetic data pollute the training set generation after generation during iterative training? How do early-stage and late-stage collapse differ? What do the collapse mechanisms of different generative models (LLM, VAE, GMM) have in common, and where do they diverge? What roles does synthetic data play in pre-training, fine-tuning, post-training, and evaluation, and under what circumstances does it instead degrade model performance? Which link in the collapse chain do methods such as "Token-Level Editing," "golden-ratio mixing," and "recursive training-sample control" each address? In actual training, how can one quantify "synthetic data ...
DeepSeek and GPT-5 Lead the Shift to Hybrid Reasoning: Not a Single Token to Waste
机器之心· 2025-08-30 10:06
Core Insights
- The article discusses the trend of hybrid reasoning models in AI, emphasizing the need for efficiency in computational resource usage while maintaining performance [12][11].
- Companies are increasingly adopting adaptive computing strategies to balance cost and performance, with notable implementations from major AI firms [11][12].

Group 1: Industry Trends
- The phenomenon of "overthinking" in AI models leads to significant computational waste, prompting the need for adaptive computing solutions [3][11].
- Major AI companies, including OpenAI and DeepSeek, are implementing models that can switch between reasoning modes to optimize token usage, achieving reductions of 25-80% in token consumption [7][10][11].
- The emergence of hybrid reasoning models is expected to become the new norm in the large model field, with a focus on balancing cost and performance [11][12].

Group 2: Company Developments
- OpenAI's GPT-5 introduces a routing mechanism that allows the model to select the appropriate reasoning mode based on user queries, enhancing user experience while managing computational costs [36][41].
- DeepSeek's v3.1 model combines reasoning and non-reasoning capabilities into a single model, offering a cost-effective alternative to competitors like GPT-5 [45][46].
- Other companies, such as Anthropic, Alibaba, and Tencent, are also exploring hybrid reasoning models, each with unique implementations and user control mechanisms [18][19][34][35].

Group 3: Economic Implications
- Despite decreasing token costs, subscription fees for AI models are rising due to the demand for state-of-the-art (SOTA) models, which are more expensive to operate [14][16].
- The projected increase in token consumption for advanced AI tasks could lead to significant cost implications for users, with estimates suggesting that deep research calls could rise to $72 per day per user by 2027 [15][16].
- Companies are adjusting subscription models and usage limits to manage costs, indicating a shift in the economic landscape of AI services [16][43].

Group 4: Future Directions
- The future of hybrid reasoning will focus on developing models that can intelligently self-regulate their reasoning processes to minimize costs while maximizing effectiveness [57].
- Ongoing research and development in adaptive thinking models are crucial for achieving efficient AI systems that can operate at lower costs [52][57].
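The routing idea attributed to GPT-5 above can be illustrated with a toy dispatcher that sends easy queries to a cheap direct mode and hard ones to an expensive reasoning mode. The keyword heuristic and token budgets below are invented stand-ins for a learned router, not OpenAI's actual logic:

```python
# Illustrative hybrid-reasoning router. REASONING_TOKENS, DIRECT_TOKENS,
# and HARD_HINTS are made-up values for the sketch.
REASONING_TOKENS = 2000      # assumed average chain-of-thought budget
DIRECT_TOKENS = 300          # assumed average direct-answer budget
HARD_HINTS = ("prove", "derive", "step by step", "debug")

def route(query):
    """Pick a mode and its token cost from a crude difficulty heuristic."""
    mode = "reasoning" if any(h in query.lower() for h in HARD_HINTS) else "direct"
    return mode, REASONING_TOKENS if mode == "reasoning" else DIRECT_TOKENS

queries = [
    "What is the capital of France?",
    "Prove that sqrt(2) is irrational.",
    "Summarize this paragraph.",
    "Debug this segfault step by step.",
]
with_router = sum(cost for _, cost in map(route, queries))
always_reasoning = REASONING_TOKENS * len(queries)
print(with_router, "tokens with routing vs", always_reasoning, "always reasoning")
```

With these made-up numbers the router skips the chain of thought on half the queries and spends 4,600 tokens instead of 8,000, a reduction in the same ballpark as the 25-80% figures reported above.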
The CodeAgent 2.0 Era Begins | GitTaskBench Redefines the Standard for Real-World Code-Agent Delivery
机器之心· 2025-08-30 10:06
Have you ever wondered why today's models score so high on every leaderboard yet fall short in actual use?

We have reviewed many evaluations in the AI coding space and found that most stop at "code generation" and "closed problems," overlooking the real needs every developer faces: environment configuration, dependency handling, and cross-repository resource use. Judging by toy problems alone, today's many benchmarks can no longer measure a code agent's real-world effectiveness.

To break through these limits, researchers from the Chinese Academy of Sciences, Peking University, HKUST, USTC, the National University of Singapore, and other institutions, together with the frontier open-source academic organization QuantaAlpha and 姜大昕's team at StepFun (阶跃星辰), have for the first time proposed and open-sourced GitTaskBench, a new repo-level evaluation paradigm:

1) It genuinely tests an agent's full pipeline, from repository understanding → environment configuration → incremental development / code repair → project-level delivery, pointing toward a new paradigm for iteration.

2) It is the first to fold the "economic benefit" of a "framework × model" pairing into the evaluation metrics, offering useful inspiration to academia, industry, and founders alike.

GitTaskBench at a glance: the open-source release covers 7 modalities × 7 domains × 24 subdomains and 54 real tasks, backed by 18 repositories averaging 204 files, 1,274.78 functions, and 52.63k ...
Fired by OpenAI at 23, He Founded a Hedge Fund With Off-the-Charts Returns, and His 165-Page Essay Swept Silicon Valley
机器之心· 2025-08-30 04:12
Core Viewpoint
- The article discusses the rapid rise of Leopold Aschenbrenner, a former OpenAI employee who was dismissed for allegedly leaking internal information, and his subsequent success in the investment field with a hedge fund that has significantly outperformed the market, particularly in AI-related investments.

Group 1: Background of Leopold Aschenbrenner
- Aschenbrenner was a member of OpenAI's "Superalignment" team and was considered close to the former chief scientist Ilya Sutskever before being fired for leaking internal information [7].
- He published a 165-page analysis titled "Situational Awareness: The Decade Ahead," which gained widespread attention in Silicon Valley [9][21].
- Aschenbrenner has a strong academic background, having graduated from Columbia University at 19 with degrees in mathematics, statistics, and economics, and previously worked at FTX Future Fund focusing on AI safety [16][17].

Group 2: Investment Strategy and Fund Performance
- After leaving OpenAI, Aschenbrenner founded a hedge fund named Situational Awareness, focusing on industries likely to benefit from AI advancements, such as semiconductors and emerging AI companies [10].
- The fund quickly attracted significant investments, reaching a size of $1.5 billion, supported by notable figures in the tech industry [11].
- In the first half of the year, the fund achieved a 47% return, far exceeding the S&P 500's 6% and the tech hedge fund index's 7% [14].

Group 3: Insights on AI Development
- Aschenbrenner's analysis emphasizes the exponential growth of AI capabilities, particularly from GPT-2 to GPT-4, and the importance of "Orders of Magnitude" (OOM) in evaluating AI progress [24][26].
- He identifies three main factors driving this growth: scaling laws, algorithmic innovations, and the use of massive datasets [27].
- Aschenbrenner predicts the potential arrival of Artificial General Intelligence (AGI) by 2027, which could revolutionize various industries and enhance productivity [29][30].

Group 4: Implications of AGI
- The emergence of AGI could lead to significant advancements in productivity and efficiency across sectors, but it also raises critical issues such as unemployment and ethical considerations [31].
- Aschenbrenner discusses the concept of "intelligence explosion," where AGI could rapidly improve its own capabilities beyond human understanding [31][34].
- He highlights the need for robust governance structures to manage the risks associated with fully autonomous systems [31][36].