RLVR

RL Infra Industry Panorama: How Environments and RLaaS Accelerate RL's "GPT-3 Moment"
海外独角兽· 2025-09-24 05:02
Core Insights
- RL Scaling is transitioning AI from the "Human Data Era" to the "Agent Experience Era," necessitating new infrastructure to bridge the "sim-to-real" gap for AI agents [2][3]
- The RL Infra landscape is categorized into three main modules: RL Environment, RLaaS, and Data/Evaluation, with each representing different business ambitions [3][12]
- The industry is expected to experience a "GPT-3 moment" for RL, significantly increasing the scale of RL data to pre-training levels [3][8]

Group 1: Need for RL Infra
- The shift to the Era of Experience emphasizes the need for dynamic environments, moving away from static data, as the performance improvements from static datasets are diminishing [6][8]
- Current RL training data is limited, with examples like DeepSeek-R1 training on only 600,000 math problems, while GPT-3 utilized 300 billion tokens [8][9]
- Existing RL environments are basic and cannot simulate the complexity of real-world tasks, leading to a "Production Environment Paradox" where real-world learning is risky [9][10]

Group 2: RL Infra Mapping Framework
- Emerging RL infrastructure startups are divided into two categories: those providing RL environments and those offering RL-as-a-Service (RLaaS) solutions [12][13]
- RL environment companies focus on creating high-fidelity simulation environments for AI agents, aiming for scalability and standardization [13][14]
- RLaaS companies work closely with enterprises to customize RL solutions for specific business needs, often resulting in high-value contracts [14][30]

Group 3: RL Environment Development
- Companies in this space aim to build realistic simulation environments that allow AI agents to train under near-real conditions, addressing challenges like sparse rewards and incomplete information [16][17]
- Key components of a simulation environment include a state management system, task scenarios, and a reward/evaluation system (a minimal interface sketch follows this summary) [17][18]
- Various types of RL environments are emerging, including application-specific sandboxes and general-purpose browser/desktop environments [18][19]

Group 4: Case Studies in RL Environment
- Mechanize is a platform that focuses on replication learning, allowing AI agents to reproduce existing software functionalities as training tasks [20][21]
- Veris AI targets high-risk industries by creating secure training environments that replicate clients' unique internal tools and workflows [23][24]
- Halluminate offers a computer use environment platform that combines realistic sandboxes with data/evaluation services to enhance agent performance [27][29]

Group 5: RLaaS Development
- RLaaS providers offer managed RL training platforms, helping enterprises implement RL in their workflows [30][31]
- The process includes reward modeling, automated scoring, and model customization, allowing for continuous improvement of AI agents [32][33]
- Companies like Fireworks AI and Applied Compute exemplify the RLaaS model, focusing on deep integration with enterprise needs and high-value contracts [34][36]

Group 6: Future Outlook
- The relationship between RL environments and data is crucial, with ongoing debates about the best approach to training agents [37][40]
- RLaaS is expected to create vertical monopolies, with providers embedding themselves deeply within client operations to optimize specific business metrics [44][45]
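To make the three components listed under Group 3 concrete (state management, a task scenario, and a reward/evaluation system), here is a minimal, hypothetical Python sketch of an agent-training environment interface. The class and method names are illustrative assumptions for this sketch, not any specific vendor's API.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class StepResult:
    observation: dict                       # what the agent sees after acting
    reward: float                           # score from the reward/evaluation system
    done: bool                              # whether the task scenario has ended
    info: dict = field(default_factory=dict)

class AgentEnvironment:
    """Minimal sketch of an agent-training environment with the three
    components named above: state management, a task scenario, and a
    reward/evaluation system."""

    def __init__(self, task: dict):
        self.task = task                    # task scenario: goal, constraints, budget
        self.state: dict = {}               # state management: full environment state

    def reset(self) -> dict:
        """Start a fresh episode and return the initial observation."""
        self.state = {"history": [], "goal": self.task["goal"]}
        return {"goal": self.task["goal"]}

    def step(self, action: Any) -> StepResult:
        """Apply the agent's action, update state, and score the episode."""
        self.state["history"].append(action)
        done = self._task_complete()
        reward = self._evaluate() if done else 0.0   # sparse, end-of-episode reward
        return StepResult({"history": self.state["history"]}, reward, done)

    def _task_complete(self) -> bool:
        return len(self.state["history"]) >= self.task.get("max_steps", 10)

    def _evaluate(self) -> float:
        # Reward/evaluation system: a real environment would use a rubric,
        # verifier, or learned reward model here.
        return 1.0 if self.task["goal"] in str(self.state["history"]) else 0.0

# Hypothetical usage of the sketch above.
env = AgentEnvironment({"goal": "submit_form", "max_steps": 3})
obs = env.reset()
result = env.step("open_page")
```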
Qwen & Tsinghua Team Overturn Conventional Wisdom: RL for Large Models Needs Only 20% of Key Tokens, and Beats Training on All Tokens
量子位· 2025-06-05 10:28
Mengchen, from Aofei Temple. QbitAI | WeChat official account QbitAI

One of the hottest recent papers on arXiv, the latest result from the Qwen & Tsinghua LeapLab team: when training large models' reasoning ability with reinforcement learning, just 20% of the tokens, the high-entropy ones, are enough to carry the entire training effect, and even outperform training on all tokens.

Using this finding, the team set new SOTA records on Qwen3-32B: 63.5 on AIME'24 and 56.7 on AIME'25, the highest scores achieved by training directly from a base model at under 600B parameters. Extending the maximum response length from 20k to 29k pushed the AIME'24 score up further to 68.1.

Decoding the entropy distribution of Chain-of-Thought

To understand this work, start from an interesting observation: the team found that when large models perform Chain-of-Thought reasoning, the token entropy distribution shows a distinctive pattern: most tokens have very low entropy, and only a small minority exhibit high entropy. Specifically, more than 50% of tokens have entropy below 0.01, while only 20% have entropy above 0.672.

The classic 80/20 rule (the Pareto principle) says that 80% of outcomes are typically driven by 20% of key factors, yet the remaining 80% usually cannot simply be discarded. But in large-model reinforcement learning, the 80 ...
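For illustration, here is a minimal Python sketch of the selection mechanism described above: compute per-token entropy from the policy's logits and restrict the RL loss to roughly the top 20% highest-entropy tokens. The function names and the top-k thresholding are assumptions made for this sketch, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-token entropy of the policy distribution.

    logits: (seq_len, vocab_size) raw model outputs for one response.
    Returns: (seq_len,) entropy H_t = -sum_v p_t(v) * log p_t(v).
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    return -(probs * log_probs).sum(dim=-1)

def high_entropy_mask(logits: torch.Tensor, top_ratio: float = 0.2) -> torch.Tensor:
    """Boolean mask selecting roughly the top `top_ratio` highest-entropy tokens.

    Only these tokens receive a policy-gradient update; the remaining
    low-entropy tokens are excluded from the RL loss.
    """
    ent = token_entropy(logits)
    k = max(1, int(top_ratio * ent.numel()))
    threshold = torch.topk(ent, k).values.min()
    return ent >= threshold

# Example: mask a per-token RL loss to the ~20% highest-entropy tokens.
logits = torch.randn(512, 32000)           # dummy (seq_len, vocab) logits
mask = high_entropy_mask(logits, 0.2)      # True for high-entropy "forking" tokens
per_token_loss = torch.randn(512)          # placeholder per-token policy-gradient loss
masked_loss = (per_token_loss * mask).sum() / mask.sum()
```

In practice the threshold would be computed per batch or per rollout, and the masked loss would feed whatever policy-gradient objective the training recipe uses; this sketch only shows the token-selection step.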