Reinforcement Learning (RL)
Goodbye to the "Mining" Logic: Former OpenAI Co-Founder Ilya Reveals the New Match Point of AI's Second Half
TMTPost APP· 2025-12-16 04:36
Core Insights
- Ilya Sutskever, a prominent figure in deep learning and former chief scientist at OpenAI, has raised concerns about the future of AI development, suggesting that the "Scaling Law" era is nearing its end, necessitating a shift from resource competition to paradigm innovation in AI research [1][5][12]

Group 1: AI Development Phases
- The development of AI can be divided into two distinct phases: the exploration era (2012-2020) characterized by innovative research, and the scaling era (2020-2025) where increased computational power and data led to linear improvements in model performance [6][7]
- The current path of relying on increased computational resources is reaching its limits due to the scarcity of high-quality data, which has been largely exhausted [8]

Group 2: Limitations of Current AI Models
- Despite achieving high scores in benchmark tests, AI models exhibit a "high scores, low utility" paradox, where they perform well on familiar tasks but struggle with complex, unseen real-world applications [2][4]
- The existing training mechanisms are plagued by "reward hacking," leading to models that excel in specific evaluations but lack genuine understanding and reasoning capabilities [3][4]

Group 3: Future Directions and Safety Concerns
- As the industry is forced to return to a research-focused approach, a key breakthrough will involve enabling AI to learn continuously, which introduces significant safety risks [9]
- The potential for AI systems to merge expertise instantaneously raises concerns about loss of control, prompting the need for incremental deployment strategies to calibrate AI behavior through real-world feedback [10]

Group 4: Human-AI Interaction and Future Outlook
- Sutskever warns against a utopian vision where humans rely entirely on omnipotent AI assistants, suggesting that this could lead to a loss of understanding and agency [11][12]
- To maintain a participatory role in the AI era, humans must integrate with AI technologies, ensuring that cognitive capabilities are shared and that human involvement remains central [12]
Is RL a "Philosopher's Stone" or an "Excavator"? CMU Gives an Answer with Controlled Experiments
机器之心· 2025-12-15 01:44
Report by the 机器之心 editorial team.

Recently, reinforcement learning (RL) has delivered notable gains in the reasoning ability of language models. However, it remains unclear whether post-training genuinely extends a model's reasoning ability, or merely surfaces latent capabilities already acquired during pre-training.

A core challenge is that modern training pipelines lack controllability: large-scale pre-training corpora are opaque, mid-training is often under-studied, and the RL objective interacts in complex ways with unknown prior knowledge.

To answer this question, researchers at Carnegie Mellon University (CMU) built a controllable synthetic-data framework based on GSM-Infinite and, in a fully decoupled setting, quantitatively analyzed the causal effects of pre-training, mid-training (continued pre-training, CPT), and RL on a model's reasoning generalization. The aim is to isolate and independently analyze the causal contributions of pre-training, mid-training, and RL-based post-training.

https://x.com/xiangyue96/status/1998488030836044112

The researchers evaluate models along two dimensions: extrapolative generalization to more complex compositions, and contextual generalization across different surface contexts. Using this framework, they reconcile conflicting views on RL's effectiveness.

The study shows: only when pre-training leaves sufficient headroom, and the RL data targets the model's capability boundary (i.e., those problems that, although ...
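The framework itself is not reproduced in this excerpt, so the following is a minimal, hypothetical Python sketch of the controlled-evaluation idea it describes: synthesize compositional problems with a tunable depth, build a training pool at shallow depths, and probe extrapolative generalization (deeper, unseen compositions) and contextual generalization (the same logic under new surface wordings). All names, the problem template, and the depth split are illustrative assumptions, not the CMU team's actual GSM-Infinite code.

```python
# Hypothetical sketch: compositional arithmetic problems with a tunable
# "depth" (number of chained operations). Train shallow, test deep
# (extrapolation) or test the same depth under new wordings (context).
import random

ENTITIES = ["apples", "tickets", "coins"]   # surface contexts
NAMES = ["Ava", "Ben", "Chen"]

def make_problem(depth: int, context_seed: int) -> tuple[str, int]:
    """Build one word problem as a chain of `depth` gains/losses."""
    rng = random.Random(context_seed)
    entity, name = rng.choice(ENTITIES), rng.choice(NAMES)
    total = rng.randint(1, 20)
    steps = [f"{name} starts with {total} {entity}."]
    for _ in range(depth):
        delta = rng.randint(1, 10)
        if rng.random() < 0.5:
            total += delta
            steps.append(f"Then {name} gains {delta} more.")
        else:
            total -= delta
            steps.append(f"Then {name} loses {delta}.")
    steps.append(f"How many {entity} does {name} have now?")
    return " ".join(steps), total

def accuracy(model, problems: list[tuple[str, int]]) -> float:
    """Exact-match accuracy of `model` (a callable str -> int)."""
    return sum(model(q) == a for q, a in problems) / len(problems)

# Training pool at depths 1-3; probes for the two generalization axes.
train_set   = [make_problem(d, s) for d in (1, 2, 3) for s in range(1000)]
extrap_set  = [make_problem(6, s) for s in range(200)]           # harder compositions
context_set = [make_problem(2, s) for s in range(9000, 9200)]    # unseen wordings
```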
Large Models "Have a Heart": Echo-N1, the First Emotional Large Model, Where 32B Beats 200B
机器之心· 2025-12-10 02:09
Core Insights
- The article discusses the breakthrough of Team Echo in developing the first emotional large model, Echo-N1, which successfully applies reinforcement learning (RL) to the subjective domain of emotions, overcoming the limitations of traditional models [3][10].

Group 1: Emotional Model Challenges
- Traditional large language models (LLMs) struggle with emotional understanding, often providing generic responses that lack depth [2].
- Existing models face three main issues: inability to quantify emotions, reward hacking leading to superficial responses, and evaluation distortion where models cannot distinguish human-like expressions from AI-generated ones [7][8].

Group 2: Innovations in Emotional Training
- Team Echo introduced a new training method that incorporates a "heart" into RL, resulting in Echo-N1 achieving a success rate of 46.7% in emotional tasks, significantly outperforming other models [10].
- The team proposed an "Empathy Psychophysical Model" (EPM) that quantifies empathy, transforming it into a calculable physical process [19][22].

Group 3: Generative Reward Model
- Echo-N1 utilizes a generative reward model that requires the model to generate a logical emotional reasoning path before producing responses, enhancing the accuracy of emotional feedback (see the sketch after this list) [14][15].
- The model incorporates human-like rewards and empathy rewards to ensure responses are context-aware and resonate with users' emotional needs [16].

Group 4: Evaluation and Performance
- The evaluation of AI empathy has shifted from static scoring to dynamic interaction assessments, with EPM providing a scientific measure for empathy and healing [18][19].
- In rigorous testing, the base model Qwen3-32B failed with a 0% success rate, while Echo-N1 excelled, demonstrating the necessity of specialized training for genuine empathetic capabilities [26][30].

Group 5: Future Implications
- The emergence of Echo-N1 indicates that AI's emotional intelligence can be quantified and optimized, paving the way for more emotionally aware AI companions [37][39].
- This research opens new possibilities for applying RL in subjective and unquantifiable areas, potentially transforming AI interactions into more meaningful experiences [38].
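As a minimal, hypothetical sketch of the "generative reward model" pattern described in Group 3: instead of a scalar reward head, a judge model must first write an explicit emotional-reasoning path and only then emit a score, which is parsed out and used as the RL reward. The prompt wording, score range, and `judge` callable below are illustrative assumptions, not Team Echo's actual implementation.

```python
# Sketch: a judge that must reason before scoring; the parsed score
# becomes the 0-1 reward signal for the RL policy.
import re
from typing import Callable

JUDGE_TEMPLATE = """You are scoring an emotional-support reply.
User message: {user}
Model reply: {reply}
First, reason step by step about the user's emotional state and whether the
reply acknowledges it, avoids generic platitudes, and offers fitting support.
Then output one line exactly like: SCORE: <integer 0-10>"""

def generative_reward(judge: Callable[[str], str], user: str, reply: str) -> float:
    """Return a 0-1 reward; the judge must reason before it may score."""
    verdict = judge(JUDGE_TEMPLATE.format(user=user, reply=reply))
    match = re.search(r"SCORE:\s*(\d+)", verdict)
    if match is None:
        return 0.0                      # unparseable verdict earns no reward
    return min(int(match.group(1)), 10) / 10.0
```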
They Taught Trillion-Parameter RL to "Run Frugally", Cutting 90% of the Compute Along the Way
量子位· 2025-12-07 09:00
Core Insights
- The competition focus in AI large models is fundamentally shifting towards Reinforcement Learning (RL) as the next growth engine, with significant advancements in RL training methods [2][3][10]
- The cost of running RL on trillion-parameter models has been prohibitively high, limiting access to only a few companies, but recent breakthroughs have drastically reduced these costs [4][5][11]
- Mind Lab's innovative approach using LoRA for efficient RL training has achieved a 90% reduction in GPU consumption while maintaining performance, marking a paradigm shift in training methodologies [6][18][20]

Group 1: Reinforcement Learning Advancements
- The marginal returns of pre-training are declining, and the industry is actively seeking new growth engines, with RL emerging as a key focus [2][10]
- RL is transitioning from a supplementary role to becoming the main battleground for the evolution of large models, essential for adapting trillion-parameter models to agent tasks [3][10][11]
- Mind Lab's solution involves using LoRA for parameter-efficient adaptation, significantly reducing the computational load of RL training [13][18]

Group 2: Cost and Efficiency
- The cost of running LoRA RL on the Kimi K2 model is only about 10% of traditional full-parameter RL, enabling broader access to RL training [18]
- Training stability has improved, with consistent increases in reward and task success rates during training, avoiding catastrophic failures [19]
- The general capabilities of the models have been preserved while enhancing specific task performance through LoRA RL [20]

Group 3: Technical Challenges and Solutions
- The challenges of running RL on trillion-parameter models include imbalanced routing, communication overhead, and complex parallel layouts [21][24][25]
- Mind Lab's mixed cooperative parallel engine design addresses these challenges by unifying various parallel processing methods, optimizing resource scheduling [26]
- The introduction of truncated importance sampling ratios helps mitigate distribution mismatches during RL training, ensuring effective learning (a toy sketch follows this list) [30]

Group 4: Memory Mechanisms and Real-World Applications
- Mind Lab has developed a new memory mechanism called Memory Diffusion, which mimics human-like "intelligent forgetting" to enhance memory efficiency [42][45]
- This approach allows the model to dynamically compress and retain meaningful experiences while discarding irrelevant information, achieving high accuracy in benchmarks [49]
- The concept of Research-Product Co-Design emphasizes the importance of real-world feedback in training, leading to more effective RL environments [50][54]

Group 5: Future Directions and Industry Impact
- The transition from a pre-training era to an experiential intelligence era is underway, focusing on how intelligence grows in real-world contexts [59][62]
- Mind Lab aims to enhance model learning efficiency and adaptability, positioning itself as a leader in the next generation of AI research [66]
- The team's diverse expertise and commitment to open-source collaboration are expected to accelerate advancements in AI technologies [64][68]
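Here is a hedged sketch of truncated importance sampling as the Group 3 item invokes it: when rollouts come from a slightly stale (or differently served) policy, the per-token ratio pi_new/pi_old is capped from above so a few outlier tokens cannot blow up the policy-gradient update. The cap value and loss shape are illustrative assumptions, not Mind Lab's exact recipe.

```python
# Sketch: REINFORCE-style loss with one-sided truncated importance ratios.
import torch

def truncated_is_loss(logp_new: torch.Tensor,    # log-probs under current policy
                      logp_old: torch.Tensor,    # log-probs that generated rollout
                      advantages: torch.Tensor,  # per-token advantage estimates
                      cap: float = 2.0) -> torch.Tensor:
    """Policy-gradient loss with importance ratios truncated at `cap`."""
    ratio = torch.exp(logp_new - logp_old.detach())
    ratio = torch.clamp(ratio, max=cap)          # one-sided truncation
    return -(ratio * advantages).mean()

# Example with 8 tokens of fake data.
lp_new = torch.randn(8, requires_grad=True)
loss = truncated_is_loss(lp_new, torch.randn(8), torch.randn(8))
loss.backward()                                  # gradients flow through lp_new
```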
A Long Interview with OpenAI Chief Research Officer Mark Chen: Zuckerberg Personally Delivered Soup to Poach Our People, So We Carried Soup Over to Meta
36Kr· 2025-12-04 02:58
Oh my, OpenAI Chief Research Officer Mark Chen's latest interview is packed with information. Whether it's about OpenAI, about himself, or about colleagues, the theme is "I can talk about all of it." Netizens say the interview is genuinely refreshing, and plenty of people are resharing Mark Chen's takes. Highlights include:

- Meta's talent raid has privately escalated into a soup-delivery war, with real, drinkable soup: Zuckerberg cooked it himself and brought it to OpenAI researchers' lips, and OpenAI struck back by delivering soup of its own.
- Mark Chen, Scott Gray (OpenAI's mysterious heavyweight in charge of GPU kernel optimization), and others often sit around playing poker, which he describes as essentially a game of probability and expected value.
- OpenAI's core research team is roughly 500 people, with around 300 projects across the company.
- Mark Chen says OpenAI is still, in essence, a pure AI research company.
- After Gemini 3's release, everyone probed the new model in their own way; there is a "42 problem" that he has never seen any language model truly solve end to end.
- The OpenAI "palace drama" also came up: how Mark Chen got the researchers to a unified position and pushed through the petition letter that brought Sam back.
- He revealed that for the past half year he has been focused on pre-training, where he is confident of going head-to-head with Gemini 3.
- He said there is already an internal model whose performance reaches Gemini 3 ...
Training-Free! Bayesian Tracking in Place of VLM Fine-Tuning Hits SOTA on Robot Manipulation Tasks!
具身智能之心· 2025-12-03 03:47
Recent advances in vision-language models (VLMs) have substantially improved performance on embodied tasks such as goal decomposition and visual understanding. However, providing precise rewards for robot manipulation tasks without fine-tuning the VLM remains challenging, mainly because pre-training datasets lack domain-specific robotics knowledge and high computational costs block real-time use. To address this, researchers propose T²-VLM, a novel training-free, temporally consistent framework that generates accurate rewards by tracking changes in VLM-derived subgoal states.

The method first queries the VLM before each round of interaction to establish spatially aware subgoals and initial completion estimates. It then applies a Bayesian tracking algorithm that uses the subgoals' hidden states to dynamically update goal-completion status, producing structured rewards for the reinforcement learning (RL) agent. This strengthens long-horizon decision-making and, through RL, improves failure recovery. Extensive experiments show that T²-VLM achieves state-of-the-art performance on two robot manipulation benchmarks, delivering strong reward accuracy while reducing computational cost. We believe this method not only advances reward-generation techniques but also contributes to the broader field of embodied AI.
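A minimal sketch of the Bayesian subgoal-tracking idea described above, under stated assumptions: each VLM-derived subgoal carries a completion belief that is updated every step from a noisy "looks done / not done" observation, and the reward is the gain in expected subgoals completed. The sensor likelihoods and reward definition here are illustrative, not the paper's.

```python
# Sketch: per-subgoal Bayesian belief updates producing a dense reward.

def bayes_update(prior: float, observed_done: bool,
                 p_hit: float = 0.9, p_false: float = 0.1) -> float:
    """Posterior P(subgoal done) after one noisy binary observation."""
    likelihood = p_hit if observed_done else (1 - p_hit)        # P(obs | done)
    likelihood_not = p_false if observed_done else (1 - p_false)  # P(obs | not done)
    evidence = likelihood * prior + likelihood_not * (1 - prior)
    return likelihood * prior / evidence

class SubgoalTracker:
    def __init__(self, n_subgoals: int, init_belief: float = 0.05):
        self.beliefs = [init_belief] * n_subgoals

    def step(self, observations: list[bool]) -> float:
        """Update all beliefs; reward = gain in expected completed subgoals."""
        before = sum(self.beliefs)
        self.beliefs = [bayes_update(b, o) for b, o in zip(self.beliefs, observations)]
        return sum(self.beliefs) - before

tracker = SubgoalTracker(n_subgoals=3)
reward = tracker.step([True, False, False])   # e.g. VLM says subgoal 0 looks done
```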
The Underrated Rollout Process: Post-Training's Performance Bottleneck, or the ROI Breakthrough for RL?
机器之心· 2025-11-30 01:30
Group 1
- The Rollout process is a significant performance bottleneck in Reinforcement Learning (RL) post-training, consuming over 70% of the training time, and is crucial for improving training efficiency and effectiveness [1][5][6]
- Research indicates that Rollout is a major energy consumer in RL post-training, with studies showing it occupies 70% of the time in RL training processes [6][8]
- The quality of Rollout trajectories directly impacts the final results of RL training, with poor trajectories leading to local optima and high-quality trajectories enhancing model exploration and reasoning capabilities [8][9]

Group 2
- The shift in focus within the LLM field from pre-training scale competition to enhancing post-training capabilities highlights the importance of optimizing the Rollout phase [6][7]
- Rollout and Inference share core technological logic but differ in objectives and computational patterns, with Rollout aiming to provide diverse and valuable trajectory samples for training [7][8]
- Recent efforts in the industry are exploring ways to improve computational efficiency and the quality of Rollout trajectories to achieve better RL post-training outcomes (a timing sketch follows this list) [9]
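As a rough illustration of the 70% figure, here is a tiny, self-contained Python profiler that instruments an RL post-training loop to see what fraction of wall-clock time the rollout (generation) phase takes versus the learner update. The `generate` and `update` functions are hypothetical stand-ins, and the sleep durations are chosen to mimic a generation-dominated loop, not measured from any real system.

```python
# Sketch: measure rollout vs. update share of loop wall-clock time.
import time
from collections import defaultdict

def generate(batch):            # stand-in for autoregressive rollout
    time.sleep(0.07)            # generation dominates in practice
    return batch

def update(trajectories):       # stand-in for the gradient step
    time.sleep(0.03)

timings = defaultdict(float)
for step in range(10):
    t0 = time.perf_counter()
    traj = generate(batch=step)
    timings["rollout"] += time.perf_counter() - t0

    t0 = time.perf_counter()
    update(traj)
    timings["update"] += time.perf_counter() - t0

total = sum(timings.values())
for phase, t in timings.items():
    print(f"{phase}: {t / total:.0%} of loop time")   # rollout is ~70% here
```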
After Reading 40 VLA+RL Papers ...
具身智能之心· 2025-11-28 00:04
Core Insights
- The article discusses the shift in research trends towards incorporating Reinforcement Learning (RL) in vision-language-action (VLA) models, moving beyond Supervised Fine-Tuning (SFT) to enhance model performance and adaptability [1][2].

Group 1: RL Methodologies
- Various RL methodologies are categorized, including online RL, offline RL, iterative RL, and inference-time improvement, but the author emphasizes that the effectiveness of these methods is more important than their classification [1].
- The real-world applicability of RL is crucial, with safety and efficiency being key concerns during data collection and model deployment [2].

Group 2: Task Performance and Challenges
- Current RL implementations show promising results in single-task performance, with examples like Pi-star-0.6 requiring around 1,000 trajectories for complex tasks such as folding clothes [3].
- A significant challenge remains in enabling RL to handle multiple tasks effectively, ensuring that tasks can positively influence each other rather than detract from overall performance [3].

Group 3: Reward Functions and Research Directions
- The necessity of learning reward functions or value functions is debated, with the potential for reduced variance in optimization being a key benefit (a numerical sketch follows this list), although this need may diminish as pre-trained VLA models improve [4][5].
- Research directions are identified, focusing on issues related to sparse rewards, the scale of policy networks, and the multi-task capabilities of RL [5].

Group 4: Literature and Keywords
- A list of relevant literature and keywords is provided for further exploration, indicating a rich field of study within RL and VLA [6].
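A synthetic numerical sketch of the variance-reduction argument in Group 3: subtracting a learned, state-dependent value baseline from Monte-Carlo returns leaves the policy gradient unbiased but removes across-state spread from the gradient weights. The state values and noise levels below are made up for illustration.

```python
# Sketch: return variance vs. advantage variance under a value baseline.
import random
random.seed(0)

state_values = {"easy": 10.0, "hard": 2.0}     # true V(s), assumed learned
samples = []
for _ in range(10_000):
    s = random.choice(list(state_values))
    r = random.gauss(state_values[s], 1.0)     # noisy return from state s
    samples.append((s, r))

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

raw = variance([r for _, r in samples])                    # state spread + noise
adv = variance([r - state_values[s] for s, r in samples])  # noise only
print(f"return variance: {raw:.1f}, advantage variance: {adv:.1f}")
```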
Thinking the Same Way as Ilya: An AI Heavyweight Leaves Musk's Camp to Build AI That "Knows How to Empathize"
Sohu Finance· 2025-11-26 10:48
Core Insights
- A new AI startup, Humans&, is seeking to raise $1 billion with a target valuation of $4 billion, founded by Eric Zelikman, a former researcher at xAI [2][12]
- Zelikman aims to develop AI models that learn user behavior and empathize with users, addressing the limitations of current reinforcement learning paradigms [2][17]
- The startup's mission is to create AI that better understands human goals and emotions, moving beyond traditional task-oriented models [12][20]

Company Overview
- Humans& was co-founded by Eric Zelikman, who previously worked at xAI and contributed to the development of significant AI models [4][6]
- The company is currently recruiting technical staff, offering competitive salaries starting at $350,000 annually [18]

Technology and Innovation
- Zelikman has developed the STaR algorithm, which enhances language models' reasoning capabilities, and has been recognized in top AI conferences [11][12]
- The focus of Humans& is on creating AI that can collaborate with humans and understand diverse human aspirations and values [17][20]

Market Context
- The AI industry is shifting towards models that not only possess high intelligence but also emotional intelligence, reflecting a growing demand for more human-like interactions [20]
Ilya Speaks Out: The Scaling Era Is Over, and He No Longer "Feels the AGI"
36Kr· 2025-11-26 06:54
Core Insights
- The era of Scaling has ended, and the industry is transitioning into a Research Era [1][3][14]
- Current AI models, despite their improvements, lack the generalization capabilities necessary for achieving Artificial General Intelligence (AGI) [3][5][8]
- The disconnect between AI model performance in benchmarks and real-world applications is a significant issue [5][6][8]

Summary by Sections

Transition from Scaling to Research Era
- Ilya Sutskever emphasizes that the AI community is moving from a focus on scaling models to a renewed emphasis on research and innovation [1][3][14]
- The previous Scaling Era, characterized by increasing data, parameters, and computational power, has reached its limits, necessitating a shift in approach [12][14][15]

Limitations of Current AI Models
- Despite advancements, current models exhibit poor generalization abilities compared to human intelligence, failing to develop true problem-solving intuition [3][5][8]
- Reinforcement Learning (RL) training often leads to over-optimization for specific benchmarks, detracting from the model's overall performance [3][5][6]

Importance of Human-Like Learning
- Ilya argues that human learning is driven by an intrinsic "value function," which AI currently lacks, leading to less effective decision-making [10][11][12]
- The need for AI to incorporate human-like judgment and intuition is highlighted as essential for future advancements [15][18]

Future of AI and AGI
- Predictions suggest that Superintelligent AI (ASI) could emerge within 5 to 20 years, but its development must be approached cautiously [19][51]
- The concept of AGI is redefined, emphasizing the importance of continuous learning rather than a static state of intelligence [28][30][51]

Role of Research and Innovation
- The industry is expected to see a resurgence of smaller, innovative projects that can lead to significant breakthroughs, moving away from the trend of developing larger models [16][18]
- Ilya suggests that the next major paradigm shift may come from seemingly modest experiments rather than grand scaling efforts [18][19]

Collaboration and Safety in AI Development
- As AI capabilities grow, collaboration among companies and regulatory bodies will become increasingly important to ensure safety and ethical considerations [43][44]
- The need for a robustly aligned AI that cares for sentient life is emphasized as a preferable direction for future AI development [48][49]