Sebastian Raschka's 10,000-Word Year-End Review: 2025, the Year of "Reasoning Models"
机器之心· 2026-01-02 09:30
Core Insights
- The AI field continues to evolve rapidly, with major advances in reasoning models and in algorithms such as RLVR and GRPO, making 2025 a pivotal year for large language models (LLMs) [1][4][19]
- DeepSeek R1's debut shifted the focus from merely stacking parameters to strengthening reasoning capability, demonstrating that high-performance models can be built at a fraction of previously assumed costs [9][10][12]
- The review stresses collaboration between humans and AI, reflecting on the boundaries of that partnership and AI's evolving role across tasks [1][4][66]

Group 1: Reasoning Models and Algorithms
- 2025 has been characterized as a "year of reasoning," with the RLVR and GRPO algorithms rising to prominence in LLM development [5][19] (a sketch of GRPO follows this summary)
- DeepSeek R1's release showed that reasoning behavior can be elicited through reinforcement learning, improving the accuracy of model outputs [6][19]
- DeepSeek R1's estimated training cost, around $5.576 million, is far below earlier assumptions, signaling a shift in cost expectations for training advanced models [10][12]

Group 2: Focus Areas in LLM Development
- The field's focus areas have shifted year over year, with 2025 centered on RLVR and GRPO after earlier years' emphasis on RLHF and LoRA techniques [20][22][24]
- A "Benchmaxxing" trend has emerged: over-optimizing for benchmark scores at the expense of real-world applicability [60][63]
- Integrating tools into LLM training has improved performance, letting models access external information and reducing hallucination rates [54][56]

Group 3: Architectural Trends
- LLM architectures are converging on mixture-of-experts (MoE) layers and efficient attention mechanisms, a shift toward more scalable and efficient models [43][53]
- Despite these advances, the traditional transformer architecture remains dominant, with gains coming from efficiency work and engineering adjustments [43][53]

Group 4: Future Directions
- Future work is expected to extend RLVR beyond mathematics and coding and to fold reasoning evaluation into the training signal [25][27]
- Continual learning is expected to gain traction, tackling challenges such as catastrophic forgetting while improving model adaptability [31][32]
- Domain-specific data is flagged as critical for LLMs to gain a foothold in individual industries, with proprietary data a major concern for companies [85][88]
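To anchor the GRPO mentions above: the algorithm's defining trick is to replace PPO's learned critic with a group-relative baseline computed over several responses sampled for the same prompt. Below is a minimal sketch of that advantage computation; it is my illustration, not Raschka's or DeepSeek's code, and the function name and example rewards are made up.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantage: score each sampled response against
    the mean/std of its own group, so no learned value (critic)
    network is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 responses sampled for one prompt, each given a
# verifiable reward (1.0 if the final answer checks out, else 0.0).
group_rewards = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
print(grpo_advantages(group_rewards))
# Correct responses get positive advantages, wrong ones negative;
# a PPO-style policy gradient then up-weights tokens from the former.
```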
2025 AI Large-Model Resource Compilation
Sou Hu Cai Jing· 2025-12-24 10:45
In 2025 the AI large-model industry underwent a structural shift: competition moved from a pure capability race to a contest of sustainability, and deep changes across four dimensions (technical paradigm, market structure, application form, and global governance) together reshaped the industry's trajectory.

On the technical side, several breakthrough transitions landed at once. The training paradigm swung wholesale from RLHF, which depends on subjective human feedback, to objectively verifiable RLVR, letting models make a leap in reasoning ability through self-verification; this was the year's single most important technical inflection point. Mixture-of-experts (MoE) architectures staged a strong comeback, using sparse activation to balance parameter scale against compute cost in an all-out pursuit of cost-effectiveness. Multi-agent self-play and fine-tuning on synthetic data became routine, weaning models off human annotation, while retrieval-augmented generation (RAG) became standard equipment in enterprise applications, effectively addressing hallucination and knowledge-freshness problems. Models also showed a "jagged" capability profile: leaps forward in formalized intellectual domains such as mathematics and programming, but lingering weaknesses in commonsense reasoning.

[Infographic: "A New King Is Crowned", on Google Gemini 3 and the cost-driven rise of Chinese models]

The market structure showed a dual tension between concentration and democratization. Google Gemini 3, leveraging in-house TPU v5 chips and multimodal strengths, ended OpenAI's long-standing lead, while Chinese models overtook on cost-effectiveness. The market concentrated toward the top: leading startups such as Anthropic raised enormous rounds and second- and third-tier players faced a shakeout, but the open-source wave formed a counterweight, with Alibaba Tongyi Qianwen, 01.ai Yi ...
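To make the RLHF-to-RLVR pivot described above concrete: an RLVR reward is an objective check rather than a learned preference model. A minimal sketch, assuming a math-style task whose final answer can be string-matched; the \boxed{} convention and function name are illustrative, not from the article.

```python
import re

def verifiable_reward(model_output: str, reference_answer: str) -> float:
    """RLVR-style reward: instead of a learned human-preference model
    (as in RLHF), apply an objective check. Here we extract the final
    \\boxed{...} answer and compare it with a known-correct reference."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    if match is None:
        return 0.0  # no parseable answer, no reward
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

# Usage: verifiable_reward("... so the result is \\boxed{42}.", "42") -> 1.0
```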
RL Infra Industry Panorama: How Environments and RLaaS Are Accelerating RL's GPT-3 Moment
海外独角兽· 2025-09-24 05:02
Core Insights
- RL scaling is moving AI from the "Human Data Era" to the "Agent Experience Era," requiring new infrastructure to bridge the "sim-to-real" gap for AI agents [2][3]
- The RL Infra landscape falls into three modules: RL Environment, RLaaS, and Data/Evaluation, each representing a different business ambition [3][12]
- The industry is expected to reach a "GPT-3 moment" for RL, scaling RL data up to pre-training levels [3][8]

Group 1: Need for RL Infra
- The shift to the Era of Experience calls for dynamic environments rather than static data, as the gains from static datasets are diminishing [6][8]
- Current RL training data is tiny by comparison: DeepSeek-R1 trained on only 600,000 math problems, whereas GPT-3 consumed 300 billion tokens [8][9]
- Existing RL environments are rudimentary and cannot simulate real-world task complexity, creating a "Production Environment Paradox" in which learning in production is too risky [9][10]

Group 2: RL Infra Mapping Framework
- Emerging RL-infrastructure startups split into two camps: those providing RL environments and those offering RL-as-a-Service (RLaaS) [12][13]
- RL-environment companies build high-fidelity simulation environments for AI agents, aiming at scalability and standardization [13][14]
- RLaaS companies work closely with enterprises to tailor RL solutions to specific business needs, often landing high-value contracts [14][30]

Group 3: RL Environment Development
- Companies in this space build realistic simulation environments that let AI agents train under near-real conditions, addressing challenges such as sparse rewards and incomplete information [16][17]
- A simulation environment's key components are a state-management system, task scenarios, and a reward/evaluation system [17][18] (sketched in code after this summary)
- Several kinds of RL environments are emerging, including application-specific sandboxes and general-purpose browser/desktop environments [18][19]

Group 4: Case Studies in RL Environment
- Mechanize is a platform focused on replication learning, having AI agents reproduce existing software functionality as training tasks [20][21]
- Veris AI targets high-risk industries, building secure training environments that replicate clients' internal tools and workflows [23][24]
- Halluminate offers a computer-use environment platform that pairs realistic sandboxes with data/evaluation services to improve agent performance [27][29]

Group 5: RLaaS Development
- RLaaS providers run managed RL-training platforms that help enterprises bring RL into their workflows [30][31]
- The process covers reward modeling, automated scoring, and model customization, enabling continuous improvement of AI agents [32][33]
- Fireworks AI and Applied Compute exemplify the RLaaS model, built on deep integration with enterprise needs and high-value contracts [34][36]

Group 6: Future Outlook
- The relationship between RL environments and data is pivotal, with ongoing debate over the best way to train agents [37][40]
- RLaaS is expected to create vertical monopolies, with providers embedding deeply in client operations to optimize specific business metrics [44][45]
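As a rough picture of the three environment components named in Group 3, here is a minimal Gym-style interface sketch. Every name in it is hypothetical; real platforms such as Mechanize or Veris AI will look different.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class AgentEnv:
    """Skeleton of the three components the article names: state
    management, a task scenario, and a reward/evaluation system."""
    task: str                       # task scenario, e.g. "file an expense report"
    state: dict[str, Any] = field(default_factory=dict)
    max_steps: int = 50

    def reset(self) -> dict[str, Any]:
        # State management: restore the sandbox to a clean snapshot.
        self.state = {"step": 0, "done": False}
        return self.state

    def step(self, action: str) -> tuple[dict[str, Any], float, bool]:
        # Apply the agent's action, then score it.
        self.state["step"] += 1
        reward = self._evaluate(action)
        self.state["done"] = reward > 0.0 or self.state["step"] >= self.max_steps
        return self.state, reward, self.state["done"]

    def _evaluate(self, action: str) -> float:
        # Reward/evaluation system: sparse by default, i.e. 1.0 only
        # when the task's success condition is met (stubbed out here).
        return 1.0 if action == "submit" else 0.0
```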
Qwen & Tsinghua Team Overturn Conventional Wisdom: RL for Large Models with Only 20% of Key Tokens Beats Training on All of Them
量子位· 2025-06-05 10:28
Mengchen | QbitAI (WeChat official account: QbitAI)

One of the hottest recent arXiv papers, the latest result from the Qwen & Tsinghua LeapLab team: when training large-model reasoning ability with reinforcement learning, just 20% of tokens (the high-entropy ones) can carry the entire training effect, and even outperform training on all tokens.

Using this finding, the team set a new SOTA on Qwen3-32B: 63.5 on AIME'24 and 56.7 on AIME'25, the highest scores for a model under 600B parameters trained directly from a base model. Extending the maximum response length from 20k to 29k pushed the AIME'24 score up to 68.1.

Decoding the entropy distribution of Chain-of-Thought

Understanding this work starts with an interesting observation. The team found that when a large model performs chain-of-thought (CoT) reasoning, the entropy of its tokens follows a distinctive pattern: most tokens have very low entropy, and only a minority exhibit high entropy. Concretely, over 50% of tokens have entropy below 0.01, while only 20% have entropy above 0.672.

The classic 80/20 rule (the Pareto principle) holds that 80% of outcomes are typically driven by 20% of key factors, yet the remaining 80% usually cannot simply be discarded. In large-model reinforcement learning, however, the 80 ...
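A minimal sketch of the selection mechanism described above, reconstructed from this summary rather than taken from the paper's code: compute each generated token's entropy from the policy's logits, then keep only the top ~20% of positions for the RL update.

```python
import torch
import torch.nn.functional as F

def high_entropy_mask(logits: torch.Tensor, keep_ratio: float = 0.2) -> torch.Tensor:
    """Per-token Shannon entropy over the vocabulary, then a boolean
    mask keeping only the top `keep_ratio` highest-entropy positions.
    logits: (seq_len, vocab_size) for one generated response."""
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # (seq_len,)
    k = max(1, int(keep_ratio * entropy.numel()))
    threshold = entropy.topk(k).values.min()
    return entropy >= threshold

# The policy-gradient loss would then be applied only at masked
# positions, so the ~80% low-entropy tokens contribute no gradient.
```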