Reinforcement Learning (RL)
This year most likely produced n VLA+RL works, right?!
具身智能之心· 2025-12-22 10:23
Core Insights
- The article emphasizes the integration of Reinforcement Learning (RL) with Vision-Language-Action (VLA) models to enhance their generalization capabilities, particularly in out-of-distribution (OOD) scenarios, where performance improvements can reach up to 42.6% [2]

Group 1: Research Directions
- The article suggests that future research should focus on the combination of VLA and RL, encouraging collaboration with research assistants for guidance on starting projects in these areas [3]
- Several notable recent works in VLA+RL are highlighted, showcasing significant advancements in the field [5][10]

Group 2: Notable Papers and Projects
- A list of representative papers from the last two years is provided, including titles such as "NORA-1.5" and "Balancing Signal and Variance," which cover various aspects of VLA and RL integration [5][10]
- Links to project homepages and paper PDFs are shared for further exploration of these works [6][9][12]

Group 3: Tools and Frameworks
- The article mentions the development of tools like Rlinf, which supports a growing number of VLA+RL methods, pointing to a trend toward more robust and versatile research tooling [2][11]
The first RL paradigm for text-to-3D generation is born, tackling geometric and physical plausibility
量子位· 2025-12-20 04:20
Can reinforcement learning be used for text-to-3D generation, to strengthen the step-by-step reasoning and generation process of 3D autoregressive models?

Contributed by the 3DGen-R1 team | QbitAI

In large language models and text-to-image generation, reinforcement learning (RL) has become a key method for improving a model's chain-of-thought and generation quality. But when we turn to the far more complex task of text-to-3D generation, does this recipe still work?

Recently, a study jointly conducted by Northwestern Polytechnical University, Peking University, The Chinese University of Hong Kong, Shanghai AI Laboratory, and The Hong Kong University of Science and Technology systematically explored this question.

Paper link: https://arxiv.org/pdf/2512.10949
Code link: https://github.com/Ivan-Tang-3D/3DGen-R1

In LLM reasoning and 2D text-to-image generation, RL has already been shown to significantly improve CoT reasoning ability and generation quality. But 3D objects are longer, denser, and more geometrically constrained. Research in this direction therefore often faces these problems:

1. How can a reward simultaneously capture semantic alignment, geometric consistency, and visual quality?
2. Are existing RL algorithms suited to autoregressive 3D generation?
3. There is a lack of benchmarks that specifically examine "3D reasoning ability ...

Progressive Investigation: dissecting Text-to-3D + RL at four levels
1. Reward design level
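Question 1 above, the reward design problem, is typically handled by combining several scoring signals into one scalar. The sketch below is a hypothetical illustration of such a composite reward; the weighting scheme and the three scoring callables are assumptions for illustration, not 3DGen-R1's actual design.

```python
# Hypothetical sketch of a composite reward for RL-based text-to-3D
# generation, combining the three axes listed above. The weights and
# the scoring callables are illustrative assumptions, not the actual
# 3DGen-R1 reward.

from dataclasses import dataclass
from typing import Callable

@dataclass
class CompositeReward3D:
    semantic_score: Callable   # e.g. CLIP-style text/render similarity in [0, 1]
    geometry_score: Callable   # e.g. multi-view consistency / mesh validity in [0, 1]
    visual_score: Callable     # e.g. aesthetic or render-quality score in [0, 1]
    w_sem: float = 0.4
    w_geo: float = 0.4
    w_vis: float = 0.2

    def __call__(self, prompt: str, generated_3d) -> float:
        # Each term covers one failure mode: off-prompt content,
        # broken geometry, or low-quality renders.
        return (self.w_sem * self.semantic_score(prompt, generated_3d)
                + self.w_geo * self.geometry_score(generated_3d)
                + self.w_vis * self.visual_score(generated_3d))
```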
The field's first RL+VLA survey: how can reinforcement learning push VLA into the real world?
具身智能之心· 2025-12-19 00:05
Authors: Haoyuan Deng et al. | Edited by 具身智能之心

Vision-Language-Action (VLA) models fuse vision, language, and action, giving robots strong zero-shot and cross-task generalization. But a VLA that relies solely on imitation learning remains brittle in real-world out-of-distribution (OOD) scenarios, lacking failure recovery, autonomous exploration, and closed-loop error correction. Reinforcement learning (RL) is emerging as the key bridge between VLA pre-training and real-world deployment.

Jointly produced by Nanyang Technological University, Beijing University of Posts and Telecommunications, and Tsinghua University, this survey systematically organizes the core methods and challenges of RL-VLA across the full "learning, optimization, deployment" life cycle, and builds a complete technical picture along four dimensions: architecture, training paradigms, real-world deployment, and evaluation.

I. RL-VLA architecture: from open-loop inference to closed-loop optimization

Through reward-driven policy updates, RL shifts VLA from "reproducing demonstrations" to outcome-oriented, closed-loop decision-making (a minimal sketch follows below):

Action modeling A
Paper links (updated monthly): https://doi.org/10.362 ...
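As a concrete reading of "reward-driven policy updates," here is a minimal REINFORCE-style sketch of one closed-loop fine-tuning step for a VLA policy. The policy and environment interfaces (a policy called on an image observation plus a language instruction, an env returning a task reward) are illustrative assumptions, not the survey's reference implementation.

```python
# Minimal policy-gradient sketch of the closed loop the survey
# describes: the VLA policy acts, the environment returns a task
# reward, and the policy is updated toward outcomes rather than
# demonstrations. All interfaces here are illustrative assumptions.

import torch

def rl_finetune_step(policy, env, optimizer, gamma=0.99):
    obs, instruction = env.reset()   # assumed env API
    log_probs, rewards = [], []
    done = False
    while not done:
        # Policy maps (image observation, language instruction) to an
        # action distribution.
        dist = policy(obs, instruction)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, done = env.step(action)
        rewards.append(reward)

    # Discounted returns, computed backward over the episode.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))

    # REINFORCE: raise the log-prob of actions in proportion to return.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return sum(rewards)
```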
Farewell to the "mining" logic: former OpenAI co-founder Ilya reveals the new battleground for AI's second half
Tai Mei Ti APP· 2025-12-16 04:36
Core Insights
- Ilya Sutskever, a prominent figure in deep learning and former chief scientist at OpenAI, has raised concerns about the future of AI development, suggesting that the "Scaling Law" era is nearing its end, necessitating a shift from resource competition to paradigm innovation in AI research [1][5][12]

Group 1: AI Development Phases
- The development of AI can be divided into two distinct phases: the exploration era (2012-2020), characterized by innovative research, and the scaling era (2020-2025), where increased computational power and data led to linear improvements in model performance [6][7]
- The current path of relying on increased computational resources is reaching its limits due to the scarcity of high-quality data, which has been largely exhausted [8]

Group 2: Limitations of Current AI Models
- Despite achieving high scores in benchmark tests, AI models exhibit a "high scores, low utility" paradox, where they perform well on familiar tasks but struggle with complex, unseen real-world applications [2][4]
- The existing training mechanisms are plagued by "reward hacking," leading to models that excel in specific evaluations but lack genuine understanding and reasoning capabilities [3][4]

Group 3: Future Directions and Safety Concerns
- As the industry is forced to return to a research-focused approach, a key breakthrough will involve enabling AI to learn continuously, which introduces significant safety risks [9]
- The potential for AI systems to merge expertise instantaneously raises concerns about loss of control, prompting the need for incremental deployment strategies to calibrate AI behavior through real-world feedback [10]

Group 4: Human-AI Interaction and Future Outlook
- Sutskever warns against a utopian vision where humans rely entirely on omnipotent AI assistants, suggesting that this could lead to a loss of understanding and agency [11][12]
- To maintain a participatory role in the AI era, humans must integrate with AI technologies, ensuring that cognitive capabilities are shared and that human involvement remains central [12]
Is RL a "philosopher's stone" or an "excavator"? CMU answers with controlled experiments
机器之心· 2025-12-15 01:44
Core Insights
- Recent advancements in reinforcement learning (RL) technology have significantly improved the reasoning capabilities of language models [1]
- The true extent to which post-training expands model reasoning capabilities or merely uncovers existing potential remains unclear [2]
- A key challenge is the lack of controllability in modern training processes, with large-scale pre-training corpora being opaque and mid-training often insufficiently studied [2]

Group 1: Research Framework and Methodology
- Researchers from Carnegie Mellon University developed a controllable synthetic data framework based on GSM-Infinite to quantitatively analyze the causal impact of pre-training, mid-training, and RL on model reasoning generalization [2][5]
- The framework allows for the decoupling of reasoning structure and surface context, enabling precise quantification of reasoning complexity and the examination of whether models genuinely learn reasoning logic or merely memorize specific text patterns [10][12]

Group 2: Key Findings on Training Interactions
- The effectiveness of RL depends on the "capability margin"; RL can only enhance reasoning abilities when tasks are challenging yet within the model's exploration range [16][17]
- Pre-training utilized 10 billion tokens focusing on basic reasoning primitives, while mid-training serves as a bridge to align the model's internal representations for RL readiness [20]
- A minimal amount of target context data during pre-training can significantly enhance cross-context generalization during RL post-training [22]

Group 3: Training Efficiency and Performance
- Mid-training is crucial for computational efficiency, with findings indicating that combining mid-training with RL yields better performance than using RL alone [26][27]
- The introduction of process-level rewards can mitigate reward hacking and improve reasoning fidelity, particularly in complex reasoning tasks [29][30]

Group 4: Practical Guidelines for Training
- RL data design should target the model's capability margin, avoiding overly easy or difficult tasks (see the sketch after this list) [31]
- Pre-training strategies must ensure at least 1% coverage of atomic capabilities in long-tail domains to provide interfaces for RL [32]
- The allocation of computational resources should be dynamically adjusted based on task difficulty, with more RL for tackling challenging problems and more mid-training for stability [33]
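The "capability margin" guideline in Group 4 can be made concrete as a pass-rate filter over candidate RL tasks: estimate how often the current model solves each task and keep only those it solves sometimes but not always. The thresholds and interfaces below are illustrative assumptions, not values from the CMU paper.

```python
# Illustrative filter implementing the "capability margin" guideline:
# keep RL training tasks the model solves sometimes but not always.
# Thresholds and the task/policy interfaces are assumptions, not
# values or APIs from the CMU paper.

def estimate_pass_rate(policy, task, n_samples=16):
    """Fraction of sampled rollouts that solve the task."""
    return sum(task.check(policy.rollout(task)) for _ in range(n_samples)) / n_samples

def filter_by_capability_margin(policy, tasks, lo=0.05, hi=0.95):
    kept = []
    for task in tasks:
        p = estimate_pass_rate(policy, task)
        # p == 0: too hard, no reward signal to learn from.
        # p == 1: too easy, nothing left to learn.
        if lo <= p <= hi:
            kept.append(task)
    return kept
```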
Large models "grow a heart": Echo-N1, the first emotional large model, with 32B beating 200B
机器之心· 2025-12-10 02:09
Core Insights
- The article discusses the breakthrough of Team Echo in developing the first emotional large model, Echo-N1, which successfully applies reinforcement learning (RL) to the subjective domain of emotions, overcoming the limitations of traditional models [3][10]

Group 1: Emotional Model Challenges
- Traditional large language models (LLMs) struggle with emotional understanding, often providing generic responses that lack depth [2]
- Existing models face three main issues: inability to quantify emotions, reward hacking leading to superficial responses, and evaluation distortion where models cannot distinguish human-like expressions from AI-generated ones [7][8]

Group 2: Innovations in Emotional Training
- Team Echo introduced a new training method that incorporates a "heart" into RL, resulting in Echo-N1 achieving a success rate of 46.7% in emotional tasks, significantly outperforming other models [10]
- The team proposed an "Empathy Psychophysical Model" (EPM) that quantifies empathy, transforming it into a calculable physical process [19][22]

Group 3: Generative Reward Model
- Echo-N1 utilizes a generative reward model that requires the model to generate a logical emotional reasoning path before producing responses, enhancing the accuracy of emotional feedback (see the sketch after this list) [14][15]
- The model incorporates human-like rewards and empathy rewards to ensure responses are context-aware and resonate with users' emotional needs [16]

Group 4: Evaluation and Performance
- The evaluation of AI empathy has shifted from static scoring to dynamic interaction assessments, with EPM providing a scientific measure for empathy and healing [18][19]
- In rigorous testing, the base model Qwen3-32B failed with a 0% success rate, while Echo-N1 excelled, demonstrating the necessity of specialized training for genuine empathetic capabilities [26][30]

Group 5: Future Implications
- The emergence of Echo-N1 indicates that AI's emotional intelligence can be quantified and optimized, paving the way for more emotionally aware AI companions [37][39]
- This research opens new possibilities for applying RL in subjective and unquantifiable areas, potentially transforming AI interactions into more meaningful experiences [38]
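The generative reward model in Group 3 can be pictured as a judge LLM that must write an explicit emotional-reasoning trace before emitting a scalar score, so the reward is conditioned on an articulated analysis rather than a bare number. The prompt format and parsing below are illustrative assumptions, not Echo-N1's actual implementation.

```python
# Sketch of a generative reward model: the judge writes a reasoning
# trace before scoring, so the scalar reward depends on an explicit
# emotional analysis. The prompt and parsing are illustrative
# assumptions, not Echo-N1's implementation.

import re

JUDGE_PROMPT = """Analyze the user's emotional state step by step, then rate
how empathetic and context-aware the reply is.

User message: {user_msg}
Model reply: {reply}

Reasoning:
...
Final score (0-10): <score>"""

def generative_reward(judge_llm, user_msg: str, reply: str) -> float:
    # judge_llm.generate is an assumed interface returning a string.
    out = judge_llm.generate(JUDGE_PROMPT.format(user_msg=user_msg, reply=reply))
    match = re.search(r"Final score \(0-10\):\s*([0-9.]+)", out)
    # Fall back to zero reward if the judge's output is malformed.
    return float(match.group(1)) / 10.0 if match else 0.0
```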
They taught trillion-parameter RL to run frugally, cutting 90% of the compute along the way
量子位· 2025-12-07 09:00
Core Insights
- The competition focus in AI large models is fundamentally shifting toward Reinforcement Learning (RL) as the next growth engine, with significant advancements in RL training methods [2][3][10]
- The cost of running RL on trillion-parameter models has been prohibitively high, limiting access to only a few companies, but recent breakthroughs have drastically reduced these costs [4][5][11]
- Mind Lab's innovative approach using LoRA for efficient RL training has achieved a 90% reduction in GPU consumption while maintaining performance, marking a paradigm shift in training methodologies [6][18][20]

Group 1: Reinforcement Learning Advancements
- The marginal returns of pre-training are declining, and the industry is actively seeking new growth engines, with RL emerging as a key focus [2][10]
- RL is transitioning from a supplementary role to becoming the main battleground for the evolution of large models, essential for adapting trillion-parameter models to agent tasks [3][10][11]
- Mind Lab's solution involves using LoRA for parameter-efficient adaptation, significantly reducing the computational load of RL training [13][18]

Group 2: Cost and Efficiency
- The cost of running LoRA RL on the Kimi K2 model is only about 10% of traditional full-parameter RL, enabling broader access to RL training [18]
- Training stability has improved, with consistent increases in reward and task success rates during training, avoiding catastrophic failures [19]
- The general capabilities of the models have been preserved while enhancing specific task performance through LoRA RL [20]

Group 3: Technical Challenges and Solutions
- The challenges of running RL on trillion-parameter models include imbalanced routing, communication overhead, and complex parallel layouts [21][24][25]
- Mind Lab's mixed cooperative parallel engine design addresses these challenges by unifying various parallel processing methods, optimizing resource scheduling [26]
- The introduction of truncated importance sampling ratios helps mitigate distribution mismatches during RL training, ensuring effective learning (see the sketch after this list) [30]

Group 4: Memory Mechanisms and Real-World Applications
- Mind Lab has developed a new memory mechanism called Memory Diffusion, which mimics human-like "intelligent forgetting" to enhance memory efficiency [42][45]
- This approach allows the model to dynamically compress and retain meaningful experiences while discarding irrelevant information, achieving high accuracy in benchmarks [49]
- The concept of Research-Product Co-Design emphasizes the importance of real-world feedback in training, leading to more effective RL environments [50][54]

Group 5: Future Directions and Industry Impact
- The transition from a pre-training era to an experiential intelligence era is underway, focusing on how intelligence grows in real-world contexts [59][62]
- Mind Lab aims to enhance model learning efficiency and adaptability, positioning itself as a leader in the next generation of AI research [66]
- The team's diverse expertise and commitment to open-source collaboration are expected to accelerate advancements in AI technologies [64][68]
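The truncated importance sampling mentioned in Group 3 bounds the probability ratio between the policy being updated and the stale policy that generated the rollouts, so off-distribution samples cannot dominate the gradient. A minimal sketch follows, with an assumed cap value rather than whatever Mind Lab actually uses.

```python
# Minimal sketch of truncated importance sampling in a policy-gradient
# update: the ratio between the current policy and the (stale) policy
# that generated the rollout is capped from above, so mismatched
# samples cannot dominate the gradient. The cap value is an assumption.

import torch

def truncated_is_loss(logp_new, logp_old, advantages, cap=2.0):
    """logp_new/logp_old: per-token log-probs under current/behavior policy."""
    ratio = torch.exp(logp_new - logp_old.detach())
    # Truncate only the upper side: bound how much any one sample can
    # up-weight the gradient when the distributions drift apart.
    truncated = torch.clamp(ratio, max=cap)
    return -(truncated * advantages).mean()
```

Truncating only the upper side, rather than clipping both sides PPO-style, keeps the full gradient signal from under-weighted samples while still capping the over-weighted ones.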
A long interview with OpenAI Chief Research Officer Mark Chen: Zuck personally brought soup to our office to poach people, so we carried soup over to Meta in return
36Ke· 2025-12-04 02:58
Core Insights
- The interview with Mark Chen, OpenAI's Chief Research Officer, reveals insights into the competitive landscape of AI talent acquisition, particularly the ongoing "soup war" between OpenAI and Meta, where both companies are aggressively trying to attract top talent [5][9][81]
- OpenAI maintains a core focus on AI research, with a team of approximately 500 researchers and around 300 ongoing projects, emphasizing the importance of pre-training and the development of next-generation models [5][15][22]
- Chen expresses confidence in OpenAI's ability to compete with Google's Gemini 3, stating that they already have models that match its performance and are preparing to release even better models soon [5][19][90]

Talent Acquisition and Competition
- The competition for AI talent has escalated, with Meta's aggressive recruitment strategies prompting OpenAI to adopt similar tactics, including sending soup to potential recruits [5][9]
- Despite Meta's efforts, many OpenAI employees have chosen to stay, indicating strong confidence in OpenAI's mission and future [9][22]
- Chen highlights the importance of protecting core talent and fostering a strong team culture amidst the competitive landscape [9][75]

Research Focus and Model Development
- OpenAI's research strategy prioritizes exploratory research over merely replicating existing benchmarks, aiming to discover new paradigms in AI [16][22]
- The company has invested heavily in understanding reasoning capabilities, which has led to significant advancements in their models [86][89]
- Chen emphasizes that the resources allocated to exploratory research often exceed those for training final products, showcasing OpenAI's commitment to innovation [17][22]

Organizational Dynamics
- The internal structure of OpenAI is designed to facilitate collaboration and communication among researchers, with a focus on aligning priorities and resource allocation [15][84]
- Chen discusses the importance of leadership in making tough decisions about project prioritization and resource distribution [18][22]
- The company has a unique culture that blends research and engineering, allowing for continuous optimization and innovation [24][56]

Future Outlook
- OpenAI is confident in its ability to continue leading in AI research, with a focus on pre-training as a critical area for future breakthroughs [89][90]
- The company believes that there is still significant potential in pre-training, contrary to the notion that scaling has reached its limits [89]
- Chen anticipates that AI models will increasingly play a role in advanced scientific research, potentially transforming fields such as mathematics and physics [40][90]
Training-free! Adapting VLMs with Bayesian tracking, achieving SOTA on robotic manipulation tasks!
具身智能之心· 2025-12-03 03:47
Core Insights
- The article discusses advancements in Vision-Language Models (VLMs) and introduces T²-VLM, a novel framework that generates temporally consistent rewards for robotic tasks without requiring training [2][5]

Group 1: VLM and T²-VLM Overview
- VLMs have significantly improved performance in embodied tasks such as goal decomposition and visual understanding, but providing precise rewards for robotic manipulation remains challenging due to the lack of domain-specific knowledge in pre-training datasets and high computational costs [2]
- T²-VLM is designed to track the state changes of sub-goals derived from VLMs to generate accurate rewards, enhancing long-term decision-making capabilities and improving fault recovery performance through reinforcement learning [2]

Group 2: Methodology and Results
- T²-VLM queries the VLM before each interaction to establish spatially aware sub-goals and initial completion estimates, then uses a Bayesian tracking algorithm to dynamically update the target completion state (see the sketch after this list) [2]
- Extensive experiments demonstrate that T²-VLM achieves state-of-the-art performance on two robotic manipulation benchmarks while reducing computational costs and exhibiting superior reward accuracy [2]

Group 3: Live Session Details
- A live session is scheduled for December 3rd, 19:30-20:30, covering the background of real-robot reinforcement learning, the current state of VLM-based reward generation research, and reflections on the T²-VLM method [5][6]
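The Bayesian tracking step in Group 2 can be sketched as a per-sub-goal Bernoulli belief update: each noisy VLM judgment shifts the probability that the sub-goal is complete, and the reward counts sub-goals believed done. The detection and false-positive rates below are illustrative assumptions, not T²-VLM's actual parameters.

```python
# Sketch of Bayesian tracking of sub-goal completion from noisy VLM
# judgments, in the spirit of T²-VLM: maintain P(sub-goal done) and
# update it with Bayes' rule at each step. The detection and
# false-positive rates are illustrative assumptions.

def update_completion_belief(prior: float, vlm_says_done: bool,
                             p_true_pos=0.85, p_false_pos=0.10) -> float:
    """One Bayes update of P(done) given a binary VLM observation."""
    if vlm_says_done:
        num = p_true_pos * prior
        den = num + p_false_pos * (1.0 - prior)
    else:
        num = (1.0 - p_true_pos) * prior
        den = num + (1.0 - p_false_pos) * (1.0 - prior)
    return num / den

def subgoal_reward(beliefs, threshold=0.9):
    # Reward = number of sub-goals believed complete; temporally
    # consistent because beliefs evolve smoothly across steps instead
    # of flipping with each noisy VLM answer.
    return sum(b >= threshold for b in beliefs)
```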
The overlooked rollout process: post-training's performance bottleneck, or RL's ROI breakthrough?
机器之心· 2025-11-30 01:30
Group 1
- The rollout process is a significant performance bottleneck in Reinforcement Learning (RL) post-training, consuming over 70% of the training time, and is crucial for improving training efficiency and effectiveness [1][5][6]
- Research indicates that rollout is the dominant time consumer in RL post-training, with studies showing it occupies 70% of the time in RL training processes [6][8]
- The quality of rollout trajectories directly impacts the final results of RL training, with poor trajectories leading to local optima and high-quality trajectories enhancing model exploration and reasoning capabilities [8][9]

Group 2
- The shift in focus within the LLM field from pre-training scale competition to enhancing post-training capabilities highlights the importance of optimizing the rollout phase [6][7]
- Rollout and inference share core technological logic but differ in objectives and computational patterns, with rollout aiming to provide diverse and valuable trajectory samples for training (see the sketch after this list) [7][8]
- Recent industry efforts are exploring ways to improve computational efficiency and the quality of rollout trajectories to achieve better RL post-training outcomes [9]
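Why rollout dominates wall-clock time follows from its computational pattern: each RL step first autoregressively generates whole trajectories, many forward passes per sample, before a single gradient update. The sketch below times the two phases; all interfaces (policy.generate, policy.rl_loss) are illustrative assumptions.

```python
# Sketch of one RL post-training step, timing the rollout phase against
# the update phase. Rollout is token-by-token generation (typically the
# 70%+ share); the update is one forward/backward pass over the batch.
# All interfaces here are illustrative assumptions.

import time

def rl_post_training_step(policy, reward_fn, prompts, optimizer):
    t0 = time.perf_counter()
    # Rollout phase: autoregressive generation of full trajectories.
    trajectories = [policy.generate(p, max_new_tokens=1024) for p in prompts]
    rewards = [reward_fn(p, t) for p, t in zip(prompts, trajectories)]
    t_rollout = time.perf_counter() - t0

    t1 = time.perf_counter()
    # Update phase: a single gradient step over the collected batch.
    loss = policy.rl_loss(prompts, trajectories, rewards)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    t_update = time.perf_counter() - t1
    return {"rollout_s": t_rollout, "update_s": t_update}
```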