Reinforcement Learning with Verifiable Rewards (RLVR)
HK Stock Movers | 迈富时 surges more than 18%, up over 40% across 9 trading days as shares hit a near three-month high
Ge Long Hui· 2026-01-12 04:53
迈富时 is a benchmark provider of full-chain AI solutions for GEO (Generative Engine Optimization). As a leading global AI application platform, it has cumulatively served more than 200,000 enterprises, laid out the GEO field ahead of the curve, and formed a closed technology loop. Built on its self-developed AI-Agentforce 2.0 agent platform and Tforce marketing large model, it has launched a GEO intelligent assistant and workbench, constructing an end-to-end "content feeding - model interaction - effect tracking" service chain that can precisely identify a brand's visibility within AI ecosystems and provide optimization suggestions. On the news front, the Beijing Key Laboratory of Foundation Models at Tsinghua University convened the AGI-Next frontier summit, drawing industry attention. The meeting concluded that large-model competition has shifted from the "Chat" stage to the "Agent" stage, with the focus moving from leaderboard scores to executing complex tasks in real environments. The industry expects 2026 to be the first year of commercial value realization, with the technical path evolving toward reinforcement learning with verifiable rewards (RLVR). Hong Kong-listed AI application stocks rallied collectively; 迈富时 (2556.HK) surged more than 18%, up over 40% across 9 trading days, closing at HK$45.88, a near three-month high, with a total market capitalization of HK$11.75 billion. ...
China's "Big Four of AI" share a rare stage as Alibaba, Tencent, Kimi, and Zhipu cross swords: the next step for large models and the possibility of China overtaking
硬AI· 2026-01-11 11:12
Core Insights
- The competition in large models has shifted from "Chat" to "Agent," focusing on executing complex tasks in real environments rather than merely scoring on leaderboards. The industry anticipates 2026 as the year commercial value is realized, with a technological evolution towards reinforcement learning with verifiable rewards (RLVR) [2][4][5].

Group 1: Competition Landscape
- The engineering challenges of the Chat era have largely been resolved; future success will depend on the ability to complete complex, long-chain real-world tasks. The core value of AI is transitioning from "providing information" to "delivering productivity" [4].
- The bottleneck for Agents lies not in cognitive depth but in environmental feedback. Future training paradigms will shift from manual labeling to RLVR, enabling models to self-iterate in systems with clear right-or-wrong judgments [5][6] (a minimal illustrative sketch of such a loop follows this summary).
- The industry consensus is that while China has a high chance of catching up within the old paradigm (engineering replication, local optimization, toC applications), its probability of leading in new paradigms (underlying architecture innovation, long-term memory) is likely below 20%, owing to significant differences in how computational resources are allocated [5][11].

Group 2: Strategic Opportunities
- Opportunities for catching up lie in two variables: the global shift towards "intelligent efficiency" as scaling laws hit diminishing returns, and a potential paradigm shift driven by academia around 2026 as computational conditions improve [5][19].
- The ultimate variable for success is not leaderboard scores but tolerance for uncertainty: true advancement depends on the willingness to invest resources in uncertain but potentially transformative new paradigms rather than merely chasing scores in the old one [5][10].

Group 3: Perspectives from Industry Leaders
- Industry leaders express cautious optimism about China's potential to lead, with estimated probabilities of success varying. Lin Junyang, for instance, puts the chance of leading at 20%, citing structural differences in how computational resources are allocated and used [11][12].
- Tang Jie acknowledges the existing gap with enterprise AI lab research but bets on a paradigm shift occurring around 2026, driven by broader academic participation and the emergence of new algorithms and training paradigms [15][19].
- Yang Qiang believes China may excel in toC applications first, drawing parallels to internet history, while emphasizing the need for China to develop its own toB solutions to close existing gaps [20][24].

Group 4: Technological Innovations
- The future of AI will require advances in multi-modal capabilities, memory structures, and self-reflective abilities, which are essential for reaching higher levels of intelligence and functionality [68][70][73].
- New optimization techniques, such as the MUON optimizer, aim to improve token efficiency and long-context processing, which are critical to the performance of agent-based models [110][116].
- Linear attention mechanisms are expected to improve efficiency and performance on long-context tasks, addressing the limitations of traditional attention models [116].

Group 5: Future Directions
- The industry is focused on distinguishing between scaling known paths through increases in data and compute, and exploring unknown paths to discover new paradigms [98][99].
- The potential for AI to participate in scientific research is expected to expand significantly, opening new possibilities for innovation and application [101].
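To make the "verifiable reward instead of manual labeling" idea above concrete, here is a minimal sketch of a single GRPO-style group update driven by an exact-match verifier. Everything here (the toy policy, the `verify` rule, the weight-update surrogate) is an illustrative assumption, not any lab's actual training pipeline.

```python
import random
from dataclasses import dataclass

@dataclass
class Problem:
    question: str
    answer: str          # ground-truth answer a verifier can check exactly

def verify(problem: Problem, completion: str) -> float:
    """Verifiable reward: 1.0 if the final answer matches exactly, else 0.0.
    No human annotator is involved; the environment itself judges right or wrong."""
    return 1.0 if completion.strip() == problem.answer else 0.0

class ToyPolicy:
    """Stand-in for an LLM: samples one of a few candidate answers.
    `weights` play the role of the model's trainable output distribution."""
    def __init__(self, candidates):
        self.candidates = candidates
        self.weights = {c: 1.0 for c in candidates}

    def sample(self) -> str:
        total = sum(self.weights.values())
        r = random.uniform(0.0, total)
        for c, w in self.weights.items():
            r -= w
            if r <= 0:
                return c
        return self.candidates[-1]

    def reinforce(self, completion: str, advantage: float, lr: float = 0.5):
        # Crude surrogate for a policy-gradient step: up-weight completions
        # with positive advantage, down-weight those with negative advantage.
        self.weights[completion] = max(1e-3, self.weights[completion] * (1.0 + lr * advantage))

def rlvr_step(policy: ToyPolicy, problem: Problem, group_size: int = 8) -> float:
    """One group update: sample a group, score it with the verifier,
    use the group-mean reward as a baseline, and reinforce each sample."""
    completions = [policy.sample() for _ in range(group_size)]
    rewards = [verify(problem, c) for c in completions]
    baseline = sum(rewards) / len(rewards)
    for c, r in zip(completions, rewards):
        policy.reinforce(c, advantage=r - baseline)
    return baseline

if __name__ == "__main__":
    random.seed(0)
    prob = Problem(question="12 * 7 = ?", answer="84")
    policy = ToyPolicy(candidates=["84", "72", "96"])
    for _ in range(20):
        rlvr_step(policy, prob)
    print("final sampling weights:", policy.weights)  # mass should concentrate on "84"
```

The point of the sketch is the loop structure: generation, programmatic verification, and self-iteration without labeled data, which is the shift the panel describes.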
SimKO: Mitigating probability over-concentration in RLVR training to optimize pass@K performance
机器之心· 2025-11-08 04:02
Core Insights
- The article discusses the limitations of existing Reinforcement Learning with Verifiable Rewards (RLVR) methods in enhancing the performance of large language models, particularly on pass@K metrics, which decline relative to base models even as pass@1 improves [2][3][12].

Group 1: Problem Analysis
- The decline in exploration capability of RLVR-trained models is attributed to the model concentrating probability on a single reasoning path, sacrificing the ability to explore diverse correct solutions [3][12].
- Current RLVR algorithms, such as GRPO and DAPO, reinforce the probability of correct answers while punishing incorrect ones, concentrating probability on rank-1 candidates and inhibiting exploration of other potentially correct paths [8][23].
- Entropy is of limited use as a diversity metric because it does not reflect the shape of the probability distribution, which can lead to misleading conclusions about a model's exploration capability [9][12].

Group 2: Proposed Solution
- The research team introduces SimKO (Simple Pass@K Optimization), a new algorithm designed to improve pass@K performance by addressing probability over-concentration [4][17].
- SimKO employs an asymmetric gradient adjustment strategy, applying label smoothing to correct paths while imposing targeted penalties on incorrect paths, thereby balancing exploration and exploitation [17][23].
- The algorithm identifies high-entropy key tokens along reasoning paths and applies the adjusted updates only at these critical nodes, strengthening the model's exploration capability [18][20].

Group 3: Experimental Results
- SimKO was evaluated on multiple mathematical reasoning benchmarks, showing significant pass@K improvements while maintaining or slightly improving pass@1 accuracy [21][27] (a small pass@K-estimator sketch follows this summary).
- Compared with GRPO, SimKO showed a 31.6% increase in pass@1 and a 26.3% increase in pass@128 on in-distribution tasks, while also performing well on out-of-distribution tasks [27][26].
- The results indicate that SimKO effectively mitigates probability over-concentration, enhancing exploration and improving overall performance metrics [26][27].
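For reference, pass@K is usually estimated from n sampled completions per problem with the standard unbiased combinatorial estimator; the sketch below shows that estimator and, with invented counts, the over-concentration effect the article describes (high pass@1 but weak pass@K). The numbers are hypothetical and do not come from the SimKO paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k given n samples, c of which are correct:
    pass@k = 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

if __name__ == "__main__":
    # Invented per-problem correct counts (n = 128 samples each), for illustration:
    # the "concentrated" model nails problem A but never finds a path for problem B;
    # the "diverse" model is less reliable on A but sometimes solves B.
    models = {
        "concentrated": [(128, 120), (128, 0)],
        "diverse":      [(128, 70), (128, 6)],
    }
    for name, counts in models.items():
        for k in (1, 8, 64):
            avg = sum(pass_at_k(n, c, k) for n, c in counts) / len(counts)
            print(f"{name:12s} pass@{k:<3d} = {avg:.3f}")
```

Running this shows the concentrated model ahead at pass@1 but overtaken at pass@64, which is the trade-off SimKO aims to rebalance.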
Mixing math, programming, and logic data to boost AI's multi-domain reinforcement-learning capability in one go | Shanghai AI Lab
量子位· 2025-08-14 04:08
Core Insights
- The article discusses significant advances in the reasoning capabilities of AI large models across domains such as mathematics, programming, and logic puzzles, highlighting the potential of reinforcement learning with verifiable rewards (RLVR) [1][3].

Group 1: Multi-Domain Evaluation Framework
- The research team developed a multi-domain evaluation framework covering Math, Code, and Puzzle data, with customized reward strategies for each training dataset [3][14] (a hypothetical reward-routing sketch follows this summary).
- Experiments used the Qwen2.5-7B series models; joint training across the three domains achieved an overall average performance of 56.57, outperforming any dual-domain combination [3][31].

Group 2: Key Findings from Experiments
- The interaction between Puzzle and Math data significantly enhances overall model performance, indicating a synergistic effect [6].
- Instruct models generalize coding ability to other domains better than Base models, showcasing the cross-domain mixing effect [7].
- Diverse data can improve model robustness, but more careful designs are needed to address conflicts among the Math, Code, and Puzzle domains [8].

Group 3: Training Methodologies and Strategies
- Incorporating Supervised Fine-Tuning (SFT) before reinforcement learning can significantly enhance model performance [9].
- Template consistency is crucial: mismatched training and evaluation templates can cause substantial performance drops, pointing to challenges in generalization robustness [10][29].
- Regularly refreshing reference models and optimizer states during curriculum learning improves model stability and performance [11].

Group 4: Performance in Specific Domains
- In single-domain training, models show significant gains on specific tasks, but cross-domain effects are complex, producing both synergistic and detrimental interactions [19].
- The Base model's accuracy improved by roughly 75 percentage points on the CountDown task after targeted training, but optimizing for Math may hurt Code tasks [20].
- In the Code domain, SFT improves programming-task performance, with Instruct models showing higher performance ceilings than Base models [21].

Group 5: Cross-Domain Interactions
- The Math + Puzzle combination lifted Math task performance to 49.72, demonstrating effective cross-domain knowledge transfer, while Code tasks benefited from adding Puzzle or Math data [25].
- Combining all three domains yielded the best overall performance and robustness, avoiding performance collapse on specific tasks [31].

Group 6: Curriculum Learning and Reward Design
- Curriculum learning has proven effective in SFT; its application in RLVR is still being explored, with a focus on difficulty gradients and the "Policy Refresh" strategy [33].
- Reward design is critical: different strategies yield different results depending on task complexity and data sparsity, influencing the success of RLVR [35][37].

Group 7: Future Directions
- The research team calls for expanding data categories into new fields such as Science and General Reasoning, and for exploring the approach's adaptability to Llama and DeepSeek models [39].
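The "customized reward strategies per training dataset" idea lends itself to a simple dispatch pattern in which each domain carries its own programmatic verifier. Below is a hypothetical sketch of such a reward router; the domain names mirror the article, but the verifier logic and function names are illustrative assumptions, not Shanghai AI Lab's implementation.

```python
from typing import Callable, Dict

def math_reward(completion: str, answer: str) -> float:
    """Math: exact match on the final answer string."""
    return 1.0 if completion.strip() == answer.strip() else 0.0

def code_reward(completion: str, test_src: str) -> float:
    """Code: reward 1.0 only if the candidate program passes its unit tests.
    Real pipelines sandbox this; exec() here is for illustration only."""
    scope: dict = {}
    try:
        exec(completion, scope)   # define the candidate function(s)
        exec(test_src, scope)     # asserts raise on failure
        return 1.0
    except Exception:
        return 0.0

def puzzle_reward(completion: str, constraint: Callable[[str], bool]) -> float:
    """Puzzle: a domain-specific constraint checker decides validity."""
    return 1.0 if constraint(completion) else 0.0

# Route each training sample to its domain's verifier.
REWARD_FNS: Dict[str, Callable[..., float]] = {
    "math": math_reward,
    "code": code_reward,
    "puzzle": puzzle_reward,
}

def reward(sample: dict) -> float:
    """`sample` carries a 'domain' tag plus whatever its verifier needs."""
    fn = REWARD_FNS[sample["domain"]]
    return fn(sample["completion"], sample["target"])

if __name__ == "__main__":
    print(reward({"domain": "math", "completion": "84", "target": "84"}))
    print(reward({"domain": "code",
                  "completion": "def add(a, b):\n    return a + b",
                  "target": "assert add(2, 3) == 5"}))
    print(reward({"domain": "puzzle",
                  "completion": "RED,GREEN,BLUE",
                  "target": lambda s: len(set(s.split(","))) == 3}))
```

Keeping the per-domain verifiers behind one `reward()` entry point is one way mixed Math + Code + Puzzle batches could share a single RLVR training loop while each domain retains its own notion of correctness.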