RLVR

RL Infra Industry Panorama: How Environments and RLaaS Accelerate RL's "GPT-3 Moment"
海外独角兽· 2025-09-24 05:02
Core Insights
- RL Scaling is transitioning AI from the "Human Data Era" to the "Agent Experience Era," necessitating new infrastructure to bridge the "sim-to-real" gap for AI agents [2][3]
- The RL Infra landscape is categorized into three main modules: RL Environment, RLaaS, and Data/Evaluation, with each representing different business ambitions [3][12]
- The industry is expected to experience a "GPT-3 moment" for RL, significantly increasing the scale of RL data to pre-training levels [3][8]

Group 1: Need for RL Infra
- The shift to the Era of Experience emphasizes the need for dynamic environments, moving away from static data, as the performance improvements from static datasets are diminishing [6][8]
- Current RL training data is limited, with examples like DeepSeek-R1 training on only 600,000 math problems, while GPT-3 utilized 300 billion tokens [8][9]
- Existing RL environments are basic and cannot simulate the complexity of real-world tasks, leading to a "Production Environment Paradox" where real-world learning is risky [9][10]

Group 2: RL Infra Mapping Framework
- Emerging RL infrastructure startups are divided into two categories: those providing RL environments and those offering RL-as-a-Service (RLaaS) solutions [12][13]
- RL environment companies focus on creating high-fidelity simulation environments for AI agents, aiming for scalability and standardization [13][14]
- RLaaS companies work closely with enterprises to customize RL solutions for specific business needs, often resulting in high-value contracts [14][30]

Group 3: RL Environment Development
- Companies in this space aim to build realistic simulation environments that allow AI agents to train under near-real conditions, addressing challenges like sparse rewards and incomplete information [16][17]
- Key components of a simulation environment include a state management system, task scenarios, and a reward/evaluation system (a minimal interface sketch follows this summary) [17][18]
- Various types of RL environments are emerging, including application-specific sandboxes and general-purpose browser/desktop environments [18][19]

Group 4: Case Studies in RL Environment
- Mechanize is a platform that focuses on replication learning, allowing AI agents to reproduce existing software functionalities as training tasks [20][21]
- Veris AI targets high-risk industries by creating secure training environments that replicate clients' unique internal tools and workflows [23][24]
- Halluminate offers a computer use environment platform that combines realistic sandboxes with data/evaluation services to enhance agent performance [27][29]

Group 5: RLaaS Development
- RLaaS providers offer managed RL training platforms, helping enterprises implement RL in their workflows [30][31]
- The process includes reward modeling, automated scoring, and model customization, allowing for continuous improvement of AI agents [32][33]
- Companies like Fireworks AI and Applied Compute exemplify the RLaaS model, focusing on deep integration with enterprise needs and high-value contracts [34][36]

Group 6: Future Outlook
- The relationship between RL environments and data is crucial, with ongoing debates about the best approach to training agents [37][40]
- RLaaS is expected to create vertical monopolies, with providers embedding themselves deeply within client operations to optimize specific business metrics [44][45]
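To make the three components listed under Group 3 concrete (state management, a task scenario, and a reward/evaluation system), here is a minimal, hypothetical Python sketch of an agent-training environment interface. The class and method names are illustrative assumptions for this sketch, not any specific vendor's API.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class StepResult:
    observation: dict                       # what the agent sees after acting
    reward: float                           # score from the reward/evaluation system
    done: bool                              # whether the task scenario has ended
    info: dict = field(default_factory=dict)

class AgentEnvironment:
    """Minimal sketch of an agent-training environment with the three
    components named above: state management, a task scenario, and a
    reward/evaluation system."""

    def __init__(self, task: dict):
        self.task = task                    # task scenario: goal, constraints, budget
        self.state: dict = {}               # state management: full environment state

    def reset(self) -> dict:
        """Start a fresh episode and return the initial observation."""
        self.state = {"history": [], "goal": self.task["goal"]}
        return {"goal": self.task["goal"]}

    def step(self, action: Any) -> StepResult:
        """Apply the agent's action, update state, and score the episode."""
        self.state["history"].append(action)
        done = self._task_complete()
        reward = self._evaluate() if done else 0.0   # sparse, end-of-episode reward
        return StepResult({"history": self.state["history"]}, reward, done)

    def _task_complete(self) -> bool:
        return len(self.state["history"]) >= self.task.get("max_steps", 10)

    def _evaluate(self) -> float:
        # Reward/evaluation system: a real environment would use a rubric,
        # verifier, or learned reward model here.
        return 1.0 if self.task["goal"] in str(self.state["history"]) else 0.0

# Hypothetical usage of the sketch above.
env = AgentEnvironment({"goal": "submit_form", "max_steps": 3})
obs = env.reset()
result = env.step("open_page")
```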
Qwen & Tsinghua Team Overturn Conventional Wisdom: RL for Large Models Needs Only 20% of Key Tokens, and Beats Training on All Tokens
量子位· 2025-06-05 10:28
Mengchen, from Aofei Temple. QbitAI | WeChat official account QbitAI

One of the hottest recent papers on arXiv, the latest result from the Qwen & Tsinghua LeapLab team: when training large models' reasoning ability with reinforcement learning, just 20% of the tokens, the high-entropy ones, are enough to carry the entire training effect, and even outperform training on all tokens.

Using this finding, the team set new SOTA records on Qwen3-32B: 63.5 on AIME'24 and 56.7 on AIME'25, the highest scores achieved by training directly from a base model at under 600B parameters. Extending the maximum response length from 20k to 29k pushed the AIME'24 score up further to 68.1.

Decoding the entropy distribution of Chain-of-Thought

To understand this work, start from an interesting observation: the team found that when large models perform Chain-of-Thought reasoning, the token entropy distribution shows a distinctive pattern: most tokens have very low entropy, and only a small minority exhibit high entropy. Specifically, more than 50% of tokens have entropy below 0.01, while only 20% have entropy above 0.672.

The classic 80/20 rule (the Pareto principle) says that 80% of outcomes are typically driven by 20% of key factors, yet the remaining 80% usually cannot simply be discarded. But in large-model reinforcement learning, the 80 ...
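For illustration, here is a minimal Python sketch of the selection mechanism described above: compute per-token entropy from the policy's logits and restrict the RL loss to roughly the top 20% highest-entropy tokens. The function names and the top-k thresholding are assumptions made for this sketch, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-token entropy of the policy distribution.

    logits: (seq_len, vocab_size) raw model outputs for one response.
    Returns: (seq_len,) entropy H_t = -sum_v p_t(v) * log p_t(v).
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    return -(probs * log_probs).sum(dim=-1)

def high_entropy_mask(logits: torch.Tensor, top_ratio: float = 0.2) -> torch.Tensor:
    """Boolean mask selecting roughly the top `top_ratio` highest-entropy tokens.

    Only these tokens receive a policy-gradient update; the remaining
    low-entropy tokens are excluded from the RL loss.
    """
    ent = token_entropy(logits)
    k = max(1, int(top_ratio * ent.numel()))
    threshold = torch.topk(ent, k).values.min()
    return ent >= threshold

# Example: mask a per-token RL loss to the ~20% highest-entropy tokens.
logits = torch.randn(512, 32000)           # dummy (seq_len, vocab) logits
mask = high_entropy_mask(logits, 0.2)      # True for high-entropy "forking" tokens
per_token_loss = torch.randn(512)          # placeholder per-token policy-gradient loss
masked_loss = (per_token_loss * mask).sum() / mask.sum()
```

In practice the threshold would be computed per batch or per rollout, and the masked loss would feed whatever policy-gradient objective the training recipe uses; this sketch only shows the token-selection step.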