Reinforcement Learning (RL)
Matching RL with SFT? Microsoft and Collaborators Propose an Efficient Post-Training Algorithm
机器之心· 2026-03-25 07:44
Core Insights
- The article discusses the importance of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) in the post-training phase of large models, highlighting their respective strengths and weaknesses [2]
- A new approach, "Towards On-Policy SFT," is proposed to combine the advantages of SFT and RL by generating on-policy data and training efficiently [3]

Group 1: On-Policy Data and Its Measurement
- On-policy data is defined as data generated by the model using its current capabilities, in contrast with off-policy data, which comes from external sources [4]
- Traditional metrics like perplexity (PPL) and log-likelihood are insufficient for measuring the distribution shift between on-policy and off-policy data, because they are confounded by noise from problem difficulty [6]
- The article introduces a new quantification metric, Centered Log-Likelihood (CLL), which separates out this noise and provides a clearer distinction between data sources [7]

Group 2: Challenges of Supervised Fine-Tuning
- SFT operates under the assumption that every word in the training set is an absolute truth, leading to severe penalties for prediction errors, which can cause catastrophic forgetting [12][13]
- The article proposes In-Distribution Fine-Tuning (IDFT) as a solution to mitigate the issues caused by rigid fitting and noise in training data [14][17]

Group 3: Hinted Decoding and Data Transformation
- Hinted Decoding is introduced as a method to convert datasets into on-policy versions by allowing the model to rewrite examples while maintaining its own style [20]
- The approach switches between self-distillation and normal training based on the entropy of the teacher model, which improves the model's distribution metrics [22]

Group 4: Experimental Results
- The new methods proposed in the article outperform well-known offline RL algorithms while using significantly fewer resources [25]
- The results indicate that the entropy-based adaptive switching mechanism is crucial for achieving better performance [25]

Group 5: Broader Implications
- The work has potential applications across various fields, including CoT completion and on-policy distillation, indicating its relevance beyond the immediate context [28]
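The CLL idea above can be illustrated with a toy sketch. The formulation below is a guess inferred from the summary, not the paper's exact definition: each response's mean token log-likelihood is centered by a per-problem baseline, so that problem difficulty cancels out and only the on-/off-policy shift remains.

```python
def mean_token_loglik(logprobs):
    # Average per-token log-likelihood of one response under the current model.
    return sum(logprobs) / len(logprobs)

def centered_log_likelihood(response_logprobs, problem_reference_logprobs):
    """Toy CLL: subtract a per-problem baseline (the mean log-likelihood over
    reference responses to the same problem), so that a hard problem's
    uniformly low likelihoods no longer mask the data-source difference."""
    baseline = sum(mean_token_loglik(r) for r in problem_reference_logprobs) \
               / len(problem_reference_logprobs)
    return mean_token_loglik(response_logprobs) - baseline

# Two responses to the same hard problem: raw likelihoods are both low, but
# centering reveals which one sits closer to the model's own distribution.
on_policy  = [-0.5, -0.6, -0.4]   # generated by the model itself
off_policy = [-2.0, -2.5, -1.5]   # copied from an external source
refs = [on_policy, off_policy]

cll_on  = centered_log_likelihood(on_policy, refs)    # positive: on-policy
cll_off = centered_log_likelihood(off_policy, refs)   # negative: off-policy
```

Raw mean log-likelihoods (-0.5 vs. -2.0) mix difficulty with distribution shift; after centering, the sign alone separates the two sources.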
Zero-Shot Sim-to-Real! Force-Controlled Grasping and In-Hand Manipulation with a Five-Fingered Dexterous Hand
机器之心· 2026-03-24 12:29
Core Viewpoint
- The article discusses a significant advancement in robotics, specifically in achieving human-level dexterity through a new reinforcement learning framework developed by ByteDance Seed, which enables zero-shot deployment of dexterous manipulation strategies in real-world scenarios without the need for additional real data [2][5].

Group 1: Key Technologies
- The research addresses the "reality gap" between simulation and reality in tactile perception, contact physics, and actuator dynamics, proposing a comprehensive Sim-to-Real solution [5][6].
- The framework consists of three core technologies that facilitate a seamless transition from simulation training to real-world deployment [6].

Group 2: Efficient Tactile Simulation
- A novel distance-field-based tactile simulation method is introduced, providing the high-resolution, high-frequency tactile feedback that reinforcement learning requires while maintaining physical realism [7].
- This method significantly enhances simulation efficiency, allowing thorough exploration of complex contact dynamics [7][9].

Group 3: Current-Torque Calibration
- The research introduces a current-torque calibration mechanism that maps normalized current signals to joint torque inputs, enabling explicit perception and control of interaction forces without expensive torque sensors [10][12].

Group 4: Actuator Dynamics Modeling
- The study models real actuator dynamics, including backlash and torque-speed saturation, and employs extensive domain randomization to improve the robustness of Sim-to-Real transfer [13].

Group 5: Full-State Policy and Innovative Training Paradigms
- The framework successfully trains and deploys two key dexterous manipulation skills: force-adaptive grasping and in-hand object reorientation [15].
- An innovative inverted "catching" training paradigm is proposed to enhance sample efficiency and robustness by simplifying the exploration process [16].

Group 6: Force-Adaptive Grasping
- In this task, the strategy dynamically adjusts grasping forces based on user input, using a composite reward function that balances contact force and joint torque penalties for robust force control [17].

Group 7: In-Hand Object Rotation
- The in-hand rotation task requires coordinated finger movements to rotate an object while maintaining stable contact, demonstrating the critical role of high-resolution tactile feedback in complex manipulations [19].

Group 8: Hardware Support
- The DexManip framework's zero-shot deployment capability is supported by Star Epoch's self-developed XHAND1 dexterous hand, which provides the hardware features essential for effective application [23].
- The XHAND1 is equipped with a high-resolution tactile array that captures fine contact changes, crucial for complex operations [25].
- The seamless integration of high-precision URDF models with tactile simulation models ensures accurate alignment between virtual and real-world sensors, reducing the reality gap [26].

Group 9: Direct-Drive Architecture
- The direct-drive architecture of the XHAND1 enhances the current-torque calibration process, allowing precise force control and rapid response to varying force commands [27].
- This advancement marks a significant breakthrough in closing the Sim-to-Real gap, paving the way for broader applications of dexterous manipulation in real-world scenarios [28].
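The current-torque calibration above amounts to fitting a map from normalized current readings to joint torque. A minimal sketch, with invented bench data and the simplest plausible (affine) model; the paper's actual calibration procedure is not specified in this summary:

```python
# Hypothetical test-bench data: normalized motor current vs. measured torque.
current = [0.1, 0.2, 0.4, 0.6, 0.8, 1.0]
torque  = [0.05, 0.11, 0.21, 0.32, 0.41, 0.52]   # N*m

# Least-squares fit of torque ~ k * current + b.
n = len(current)
mean_i = sum(current) / n
mean_t = sum(torque) / n
k = sum((i - mean_i) * (t - mean_t) for i, t in zip(current, torque)) / \
    sum((i - mean_i) ** 2 for i in current)
b = mean_t - k * mean_i

def current_to_torque(i_norm):
    """Estimate joint torque (N*m) from a normalized current reading."""
    return k * i_norm + b

def torque_to_current(tau):
    """Invert the map: the current command expected to produce torque tau."""
    return (tau - b) / k
```

Once fitted, the inverse map lets the controller command interaction forces directly through current, which is what makes torque control possible without a torque sensor.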
Training as a Service! Bringing Model Training Back to Algorithm Semantics, with RL Running in 150 Lines of Code
量子位· 2026-03-11 01:18
Core Viewpoint
- The Twinkle framework, developed by the ModelScope team, offers a new path to achieving both usability and flexibility in post-training paradigms for large models, particularly in reinforcement learning (RL) scenarios [1][6].

Group 1: Framework Features
- Twinkle adopts a client-server architecture and supports over 20 algorithm components, including Dataset, Model, and Sampler, allowing developers to orchestrate complex RL training loops in approximately 150 lines of code [1][6].
- The framework provides multiple operational modes, including local integrated training deployment, remote cluster deployment, and direct use of public training services [8][11].
- Twinkle's modular design enables dynamic updates to core components without service restarts, enhancing flexibility in training processes [24][20].

Group 2: Training Paradigms
- The framework allows concurrent multi-tenant training on shared foundation models, enabling different users to train their models in isolation while utilizing the same base model [27][32].
- Twinkle supports various training types, including pre-training and LoRA-based fine-tuning, as well as custom RL implementations [19][18].
- The design emphasizes a decoupled architecture, allowing users to focus on algorithm logic while the framework manages the complexities of the training process [12][14].

Group 3: API and Usability
- Twinkle provides a rich set of training APIs for fine-grained control over training processes, including dynamic component configuration and remote data processing [22][23].
- The framework maintains compatibility with the Tinker API, allowing developers to transition smoothly between Tinker and Twinkle services [38][21].
- The team encourages developers to use the provided cookbook for customizing datasets, advantage functions, rewards, and templates, facilitating rapid algorithm development [47][41].

Group 4: Performance Evaluation
- Twinkle has been evaluated against the veRL framework, showing similar reward trends during training, with Twinkle averaging approximately 70 seconds per global batch compared to veRL's 80 seconds [54][49].
- The framework's training efficiency is further enhanced through optimizations for domestic hardware, developed in collaboration with local technology teams [56][59].

Group 5: Future Directions
- The team envisions Twinkle as a catalyst for industry-wide collaboration in advancing methodologies for large-model training and usage, with aspirations for API-driven training processes that could integrate into agent frameworks for self-evolving models [60][61].
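The client-server split described above can be sketched in miniature: the "server" owns the model and exposes sample/train primitives, while the client script holds only the algorithm logic. Every name below is an invented stand-in, not Twinkle's actual API, and the "model" is a one-parameter Gaussian policy so the loop runs anywhere.

```python
import random

class ToyTrainingServer:
    """Stand-in for a training service that owns the model weights and
    exposes sample/train primitives (here via in-process calls)."""
    def __init__(self, lr=0.05):
        self.bias = 0.0      # stand-in for the model parameters
        self.lr = lr

    def sample(self):
        # Rollout generation: a unit-variance Gaussian "policy".
        return random.gauss(self.bias, 1.0)

    def train_step(self, action, advantage):
        # REINFORCE update for a unit-variance Gaussian policy:
        # the gradient of log-prob w.r.t. the mean is (action - mean).
        self.bias += self.lr * advantage * (action - self.bias)

# Client side: pure algorithm semantics. The toy reward prefers larger
# actions, and advantages are centered within each sampled group.
random.seed(0)
server = ToyTrainingServer()
for _ in range(200):
    group = [server.sample() for _ in range(8)]
    rewards = group[:]                       # toy reward: the action itself
    baseline = sum(rewards) / len(rewards)
    for action, r in zip(group, rewards):
        server.train_step(action, r - baseline)
```

The ~150-line figure in the article refers to a client script of exactly this shape; the framework replaces the in-process calls with remote ones without changing the algorithm code.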
Terence Tao in Conversation with an OpenAI Executive: As the Cost of Trial and Error Approaches Zero, AI Is Turning Mathematics into Heavy Industry
AI科技大本营· 2026-03-10 08:26
Core Viewpoint
- The article discusses the evolving role of AI in mathematics and scientific research, highlighting its potential to revolutionize the field through enhanced collaboration and efficiency, while also addressing the limitations and challenges that remain in AI's capabilities [1][4][34].

Group 1: AI's Role in Mathematics
- AI has found a uniquely favorable environment in mathematics, where the cost of trial and error is minimal, allowing rapid experimentation and learning [24][25].
- The dialogue between mathematician Terence Tao and AI researcher Mark Chen reveals that AI tools have improved significantly over the past year, becoming more integrated into daily research practice [10][12].
- AI can now assist with deep research tasks, such as literature searches and code generation, which were previously cumbersome for mathematicians [10][12].

Group 2: AI's Development Metrics
- OpenAI measures AI progress not just by parameters but by a metric called the "Meter Plot," which tracks how long a model can operate autonomously without failure [5][15].
- The duration of effective autonomous operation has increased from minutes to days, indicating a significant reduction in error rates and an increase in reliability [16][17].

Group 3: Collaboration and Division of Labor
- AI enables a division of labor in mathematical research, allowing mathematicians to focus on the critical aspects while offloading tedious tasks to AI [20][21].
- The introduction of AI tools has led to a cultural shift in the mathematics community, encouraging exploration of previously overlooked problems [21][22].

Group 4: Challenges and Limitations
- Despite these advances, AI struggles with complex reasoning and with generating new concepts, often relying on existing knowledge rather than creating novel frameworks [32][33].
- Balancing AI's cooperative capabilities against its reasoning skill remains a challenge, as attempts to make AI more user-friendly can diminish its analytical power [26][27].

Group 5: Future Prospects
- The potential for AI to tackle a vast array of medium-difficulty problems in mathematics is significant, paving the way for breakthroughs previously unattainable due to limits on human resources [33][34].
- As AI continues to evolve, it is expected to serve as a powerful tool for human scientists, freeing them to focus on higher-level theoretical advances [34].
Z Tech | Tsinghua's Wu Yi: Have I Ever Regretted Leaving OpenAI?
Z Potentials· 2026-03-06 03:17
Group 1
- OpenAI was initially perceived as a "second-tier team" compared to giants like Google Brain and Facebook AI Research, which were staffed by renowned PhDs, whereas OpenAI had a more diverse and less conventional team composition [2][3][4]
- Early OpenAI projects, such as using AI to play Dota, were viewed skeptically within the academic community, as they seemed to lack the rigor and prestige associated with leading research organizations [4][5]
- OpenAI's strength lay in its unified mission and engineering focus, in contrast with the more fragmented, exploratory nature of other research labs like Facebook AI Research [5][6]

Group 2
- The discussion highlights the randomness of career opportunities and the importance of making rational choices in the present rather than dwelling on missed opportunities [6][7]
- OpenAI's environment fostered a strong emphasis on scaling and large-scale systems, which resonated with the interviewee's interests and led to significant personal and professional growth [8][9]

Group 3
- The interviewee reflects on the blurring boundary between academia and industry, suggesting that the distinction is becoming less clear as opportunities in the two realms continue to merge [12][13]
- The current landscape of AI development in China is characterized by a focus on distillation and maintaining competitive benchmarks, with companies like Doubao and Kimi making notable strides despite limited resources [15][16]

Group 4
- The conversation touches on the challenges traditional enterprises face in adapting to AI, emphasizing the need for top-down transformation and the complexity of integrating AI into established organizational structures [20][21]
- The academic community is encouraged to pursue innovative and unconventional ideas, as the value of research lies not in replicating large companies but in exploring unique concepts that may lack immediate commercial value [22][23]

Group 5
- The potential of multi-agent systems is discussed, with the assertion that they are most useful in scenarios requiring parallel processing or asynchronous tasks, although the need for such systems may diminish as model capabilities improve [24][25]
- Reinforcement learning (RL) is highlighted as a critical area for future development, particularly for addressing challenges related to unclear rewards and the need for human verification in complex tasks [27][28]

Group 6
- The concept of AGI (Artificial General Intelligence) is explored, with the interviewee suggesting that current AI capabilities may already satisfy some definitions of AGI, though societal expectations continue to evolve [35][36]
- The future of AI is seen as potentially transformative, with multi-modal systems and coding capabilities central to further advances, and the integration of visual and generative models potentially unlocking new possibilities [37]
Recommender Systems Enter the "Dual-Engine" Era! A Deep Dive into the First Survey of LLM-RL Synergistic Recommendation
机器之心· 2026-03-03 02:55
Group 1
- The core viewpoint of the article emphasizes the transformative potential of integrating Large Language Models (LLMs) with Reinforcement Learning (RL) in recommendation systems, leading to a new paradigm of LLM-RL synergistic recommendation systems [2][5][29]
- The evolution of recommendation systems is outlined as a transition from static prediction to dynamic decision-making, and finally to cognitive collaboration, marking the shift from simple matching mechanisms to intelligent decision engines [6][8]

Group 2
- The introduction of LLMs is described as a fundamental reshaping of recommendation systems, enhancing their capabilities in representation space, agent positioning, environment modeling, and interaction paradigms [8][10]
- Five main collaborative paradigms are proposed for LLM-RL integration, covering the reshaping of representation space, agent positioning, environment modeling, and interaction paradigms [10][11]

Group 3
- The article discusses standard evaluation protocols for LLM-RL collaborative recommendation systems, covering tasks, datasets, evaluation strategies, and metrics [15][20]
- Several roles are identified, including LLM as Policy, Reasoner, Representer, and Explainer, each playing a crucial part in enhancing the recommendation process [17][18]

Group 4
- Challenges and future directions for LLM-RL collaborative recommendation systems are highlighted, including algorithmic bias, privacy and security concerns, computational efficiency, and managing hallucinations in outputs [26][28]
- The article concludes that the integration of RL and LLMs marks a clear path from automation to intelligence in recommendation systems, positioning them as intelligent partners rather than mere efficiency tools [29]
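The "LLM as Policy" role above can be sketched as a bandit-style loop in which a mocked scorer plays the LLM's part: it softmax-samples a candidate item, user feedback becomes the RL reward, and preferences update from that reward. The items, click probabilities, and update rule are all invented for illustration, not taken from the survey.

```python
import math
import random

items = ["sci-fi novel", "cookbook", "travel guide"]
prefs = {item: 0.0 for item in items}       # learned preference logits

def llm_policy(candidates):
    """Mocked LLM-as-policy: softmax-sample one candidate to recommend."""
    weights = [math.exp(prefs[c]) for c in candidates]
    total = sum(weights)
    r = random.random() * total
    acc = 0.0
    for c, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return c
    return candidates[-1]

def user_feedback(item):
    """Simulated environment: this user mostly clicks sci-fi novels."""
    click_prob = {"sci-fi novel": 0.8, "cookbook": 0.2, "travel guide": 0.1}
    return 1.0 if random.random() < click_prob[item] else 0.0

random.seed(0)
for _ in range(2000):
    rec = llm_policy(items)
    reward = user_feedback(rec)
    prefs[rec] += 0.05 * (reward - 0.5)     # reward-centered bandit update
```

This captures the dynamic-decision framing in miniature: the policy's behavior changes as interaction feedback accumulates, which is exactly what static matching cannot do.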
First Confirmation That RL Can Teach 3D Models to Reason, with a Leap in Generation Quality for Complex Text Descriptions
36Kr· 2026-02-27 02:33
Core Insights
- The research introduces the first systematic integration of reinforcement learning (RL) into text-to-3D autoregressive generation, addressing challenges unique to 3D generation compared with 2D [1][3][17]
- The study emphasizes the importance of designing reward models specifically for 3D generation, identifying human preference scores (HPS v2.1) as the most effective single reward signal [6][12][17]

Group 1: Challenges in 3D Generation
- 3D objects lack a "standard view," making it difficult to evaluate geometric consistency, texture quality, and semantic alignment across multiple perspectives [5][6]
- The long-range dependencies in 3D generation lead to sparser reward signals, complicating the model's ability to detect errors during the generation process [5][6]

Group 2: Reward Model Design
- The research tested various reward combinations, concluding that HPS v2.1 alone provides the strongest results, while semantic alignment and aesthetic quality can further enhance performance when combined with HPS [6][12]
- A surprising finding is that a general-purpose large model (Qwen2.5-VL) is more robust at assessing 3D consistency than specialized models, filling the gap in reward signals for 3D generation [6][12]

Group 3: Algorithm Selection and Training Paradigms
- The study reveals that token-level optimization suits 3D generation better than sequence-level operations, which can hurt performance [7][12]
- Data diversity matters more than training duration in RL training for 3D generation: doubling the training data is effective, while tripling the number of iterations can lead to overfitting [12][17]

Group 4: Evaluation Metrics
- Existing 3D generation benchmarks fail to assess models' implicit reasoning capabilities under complex text descriptions, motivating the development of the MME-3DR benchmark [10][17]
- MME-3DR includes 249 carefully selected complex 3D objects and evaluates multi-view geometric consistency, semantic detail alignment, and texture realism [10][17]

Group 5: Model Performance and Contributions
- The final model, AR3D-R1, outperformed existing state-of-the-art methods on both the MME-3DR and Toys4K benchmarks, demonstrating significant improvements in reasoning capability [13][18]
- The research establishes a systematic framework for integrating RL into 3D generation, highlighting the need for tailored rewards, algorithms, and training paradigms rather than a direct transfer of 2D experience [17][18]
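The reward-combination finding can be made concrete with a small sketch: HPS as the dominant term, with optional semantic-alignment and aesthetic terms folded in as a weighted mean. The weights and the [0, 1] normalization below are placeholder choices for illustration, not the paper's actual reward models.

```python
def combined_reward(hps, semantic=None, aesthetic=None,
                    w_hps=1.0, w_sem=0.3, w_aes=0.2):
    """Weighted mean of component scores, each assumed to lie in [0, 1].
    HPS carries the largest weight, matching the finding that it is the
    strongest single signal; the other terms are optional add-ons."""
    total, weight = w_hps * hps, w_hps
    if semantic is not None:
        total, weight = total + w_sem * semantic, weight + w_sem
    if aesthetic is not None:
        total, weight = total + w_aes * aesthetic, weight + w_aes
    return total / weight    # stays in [0, 1]
```

With no optional terms the reward degenerates to plain HPS; a high semantic score nudges it up and a low one pulls it down, with limited leverage because of the small weight, mirroring the "HPS-dominant" design.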
ICLR 2026 Workshop Opens Second Round of Submissions: Focusing on Learning, Alignment, and Evolution for Lifelong Agents
机器之心· 2026-02-05 07:52
Core Insights
- Artificial intelligence is at a new turning point: AI agents built on Large Language Models (LLMs), Reinforcement Learning (RL), and Embodied AI are rapidly emerging, showcasing multi-dimensional capabilities such as planning, reasoning, tool use, and autonomous decision-making [2]
- The current mainstream paradigm faces critical bottlenecks, necessitating a shift toward Lifelong Agents that can continuously learn, stay aligned over the long term, evolve autonomously, perceive resource constraints, and be sustainably deployed [2]

Workshop Overview
- The Lifelong Agent Workshop, initiated by institutions including UIUC, Edinburgh, Oxford, and Princeton for the ICLR 2026 conference, aims to create a cross-disciplinary forum that systematically advances the Lifelong Agent research paradigm [3]
- The workshop will address key issues for Lifelong Agents, spanning language intelligence, reinforcement learning, embodied systems, multi-agent collaboration, and AI for science, defining the next technological milestone for agent development [3]

Challenges in Lifelong Learning
- Catastrophic forgetting remains a significant challenge when models face dynamic and out-of-distribution (OOD) tasks, and alignment consistency degrades as user goals, environmental feedback, and contextual constraints evolve over time [4]
- Real-world operational constraints such as compute, token, energy, and interaction costs hinder the sustainability of these systems [4]

Workshop Details
- The workshop is scheduled for April 26-27, 2026, in Rio de Janeiro, with a hybrid format for participation [8]
- Expected attendance is 200-400 in-person participants and 500-600 online attendees [8]

Submission Topics
- The workshop encourages cross-disciplinary research on long-running agents, particularly in Lifelong Learning, Lifelong Alignment, Self-Evolving Agents, and Embodied & Real-World Lifelong Agents [7]
- Specific topics include memory-augmented RL, continual exploration, user goal-change modeling, and multi-agent lifelong collaboration ecosystems [9][10]

Future Directions
- Lifelong Agents represent an upgrade in intelligence paradigms, aiming to create stable, autonomous, and sustainably growing systems that can contribute to scientific discovery and cross-modal interaction [11]
- The workshop seeks to push Lifelong Agents toward becoming the field's next significant advance, addressing challenges in resource-constrained learning and reasoning [12]
After Clawdbot: How Far Are We from Agents That Can Be Deployed at Scale?
Founder Park· 2026-02-03 12:31
Core Insights
- Monolith Capital is an investment management firm focused on technology- and innovation-driven sectors, including technology, software, life sciences, and consumer fields [2]
- The current state of AI agents is one of impressive demos rather than scalable products, underscoring the need for sustainable systems rather than one-off task completion [5][4]
- The discussion at the "After the Model" technology salon emphasized that AI agents must clear several hard bars: stability, high throughput, cost control, and precise state management [5]

Challenges in AI Agent Development
- OpenClaw, previously known as Clawdbot, has attracted significant attention, but it poses challenges in enterprise environments, such as high cost, lack of control, privacy issues, and collaboration difficulties [3][7]
- The main barriers for AI agents currently lie in data and infrastructure, with high costs for the human expertise required for data labeling [9][10]
- Reliance on human labor for data annotation is unsustainable, pushing the industry toward reinforcement learning (RL) to reduce dependence on expensive human data [11][12]

Infrastructure and Training Issues
- Training AI agents faces a paradox: high-speed GPUs are held back by slow operating-system environments, leading to inefficient resource utilization [16][18]
- The complexity of GUI-agent environments produces sparse rewards and long feedback loops, making traditional training methods inadequate [20][21]
- Proposed solutions include decoupling the sampling and training processes to improve efficiency and reduce waiting time, yielding a significant increase in environment utilization [25][26]

Innovations in Agent Infrastructure
- The Dart framework proposes a decoupled architecture that separates sampling from training, allowing asynchronous data production and improved efficiency [23][24]
- A modular framework approach is suggested to lower the barrier for small teams, enabling easier adaptation and modification of algorithms [29][30]
- Lighter, modular middleware is needed to make AI-agent training accessible to smaller teams, presenting a significant entrepreneurial opportunity in the infrastructure space [33][34]

Memory Management and State Handling
- Current AI models lack effective state management, which is critical for complex tasks, leading to failures in logical reasoning and task execution [36][38]
- New architectures are being explored to strengthen state-management capabilities, allowing models to better handle long-term dependencies and complex reasoning [39][40]
- The concept of "Code Thinking" is introduced, suggesting that models should learn to think in code for better state management and precision in task execution [42][44]

Future of AI Agents
- The competitive landscape is shifting from model capability to system-integration capability, with infrastructure, data loops, and memory management as the key differentiators [48][49]
- New infrastructure tailored for AI agents is needed, moving away from traditional cloud computing toward specialized environments that support asynchronous training and memory systems [52]
- Future data moats will depend on the ability to build realistic simulation environments for self-evolving agents, rather than merely accumulating large datasets [53]
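The decoupled sampling/training idea can be sketched as a producer-consumer queue: sampler threads stand in for slow environment rollouts, while the trainer consumes whatever is ready instead of waiting for a synchronized batch. This is a shape sketch only; Dart's actual implementation is distributed and GPU-aware, and the names below are invented.

```python
import queue
import random
import threading

rollouts = queue.Queue(maxsize=64)   # buffer decoupling samplers from trainer

def sampler(env_id, n_steps):
    """Producer: stand-in for slow GUI-environment rollouts."""
    for step in range(n_steps):
        trajectory = (env_id, step, random.random())  # fake (env, step, reward)
        rollouts.put(trajectory)       # blocks only if the buffer is full

def trainer(total):
    """Consumer: stand-in for gradient steps on whatever data is ready."""
    processed = 0
    while processed < total:
        rollouts.get()                 # blocks only while the buffer is empty
        processed += 1
    return processed

samplers = [threading.Thread(target=sampler, args=(i, 25)) for i in range(4)]
for t in samplers:
    t.start()
done = trainer(4 * 25)
for t in samplers:
    t.join()
```

Because neither side waits for the other beyond the buffer, a slow environment no longer idles the trainer, which is the utilization gain the decoupling argument above is about.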
What Impressive Results Can Combining World Models, VLA, and Reinforcement Learning Achieve?
具身智能之心· 2026-01-15 00:32
Core Insights
- The article discusses the potential of Vision-Language-Action (VLA) models in general robotic manipulation, highlighting their reliance on expert demonstration data, which limits their ability to learn from failures and self-correct [2]
- It introduces WMPO, a world-model-based policy optimization method that enhances sample efficiency and overall performance in reinforcement learning (RL) without needing real-world interaction [3]

Group 1
- The VLA model shows strong potential in robotic tasks but struggles with self-improvement due to its dependence on expert data [2]
- Reinforcement learning can address the limitations of VLA models by enabling self-improvement through autonomous interaction with physical environments, although it faces high sample complexity when applied to real robots [2]
- WMPO focuses on pixel-based prediction tasks, aligning "imagined" trajectories with VLA features pre-trained on large-scale web images, leading to superior performance compared to traditional offline methods [3]

Group 2
- WMPO demonstrates significant advantages, including improved sample efficiency, better overall performance, the emergence of self-correcting behaviors, and robust generalization and lifelong-learning capabilities [3]
- The article provides a link to the WMPO research paper and project homepage for further exploration [4]
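The core mechanism, evaluating a policy on trajectories "imagined" by a learned world model instead of the real environment, can be sketched in one dimension. Both models below are mocked stand-ins invented for illustration; WMPO's actual world model predicts pixels aligned with pretrained VLA features.

```python
import random

def world_model(state, action):
    """Learned-dynamics stand-in: a noisy 1-D transition model."""
    return state + action + random.gauss(0.0, 0.05)

def policy(state, gain):
    """Candidate policy: proportional controller toward the goal state 1.0."""
    return gain * (1.0 - state)

def imagined_return(gain, horizon=20, n_rollouts=32):
    """Average return over rollouts imagined entirely inside the world
    model, so no real-environment interaction is needed."""
    total = 0.0
    for _ in range(n_rollouts):
        state = 0.0
        for _ in range(horizon):
            state = world_model(state, policy(state, gain))
            total -= (1.0 - state) ** 2    # reward: negative squared error
    return total / n_rollouts

random.seed(0)
# Compare two candidate policies purely in imagination; keep the better one.
best_gain = max([0.05, 0.5], key=imagined_return)
```

The slow controller (gain 0.05) leaves the state far from the goal for most of the horizon, so its imagined return is much worse; the policy search never touches the real environment, which is where the sample-efficiency gain comes from.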