Reinforcement Learning
The key figures behind Gemini 2.5's overtaking on the curve
Hu Xiu· 2025-06-05 03:14
Hongjun, founder of "Silicon Valley 101", invited Energent.ai co-founder Kimi Kong and HeyRevia founder Shaun Wei, both former Google technical experts, to discuss the underlying logic behind the Gemini model's climb to the top. The following is a selection from that conversation.

1. The logic behind Gemini 2.5's rise

Hongjun: The Gemini 2.5 Pro that Google just released posts the best numbers of any large model on current benchmarks. Kimi, can you analyze how it pulled that off? From being "precision-sniped" by OpenAI's 4o model on the eve of last year's conference, to Gemini 2.5 Pro sweeping the leaderboards this year: in barely a year, how did Gemini turn from chaser into front-runner?

Kimi: I left DeepMind almost a year ago, so I don't really know what new innovations my former colleagues have shipped since then. But the fundamental steps of training a large language model have not changed. There are three: Pre-training, SFT (Supervised Fine-tuning), and Alignment done with RLHF (Reinforcement Learning from Human Feedback).

Around last year's NeurIPS (Conference on Neural Information Processing Systems), the industry had already broadly conceded that the public web has essentially been scraped clean, much like fossil fuels that have already ...
The humanoid robot "arena match", and how Nanjing is playing it
Nan Jing Ri Bao· 2025-06-05 00:21
Core Insights - The article highlights the rapid development and competitive landscape of humanoid robots in Nanjing, showcasing various events and advancements in technology [1][11] - Nanjing aims to establish itself as a "Robot City" by promoting humanoid robot research and production capabilities, with a focus on enhancing core components and overall system integration [10] Group 1: Technological Advancements - Humanoid robots are utilizing reinforcement learning to improve their movement and balance, showcasing significant progress in their capabilities [2][3] - The current humanoid robots employ two main technical routes: electric servo and electro-hydraulic servo, with the latter being more powerful but complex [4] - The development of humanoid robots is expected to take around 10 years to become commonplace in households, with a focus on reducing manufacturing costs [6][7] Group 2: Industry Applications - The article discusses the application of humanoid robots in various sectors, including industrial manufacturing, healthcare, and elder care, with specific examples of robots being tested in these fields [5][7] - Nanjing's Tianchuang Electronics has introduced the world's first explosion-proof humanoid robot, targeting high-risk operational needs in industries like chemical handling and mining [7] Group 3: Future Directions - Nanjing is focusing on innovation in mechanisms and core components to enhance the competitiveness of its humanoid robot industry, with plans to support research and development in this area [8][10] - The establishment of a humanoid robot training ground is suggested to facilitate the application of robots in various scenarios and to gather data for further development [9]
High technology supports high-quality operation of new-energy power generation systems
Xin Hua Ri Bao· 2025-06-04 20:56
Group 1: Electric Automation Technology - Electric automation technology integrates multiple disciplines such as electronic technology, computer technology, and control technology, characterized by intelligence, efficiency, networking, and environmental protection, which is crucial for the high-quality operation of new energy generation systems [1] Group 2: Optimization of Energy Storage Systems - Power companies can enhance the stability and reliability of energy storage systems by implementing Model Predictive Control (MPC) algorithms, which predict photovoltaic power generation changes based on real-time solar intensity [2] - The application of reinforcement learning in the charging and discharging processes of energy storage systems can optimize control strategies to improve stability and reliability [2] - When determining the capacity of energy storage devices, historical generation data should be analyzed to understand the maximum, minimum, and fluctuation ranges of power generation, ensuring that energy storage systems can meet load demands during low generation periods [2] Group 3: Smart Grid Technology - Smart grids are modernized power networks built on integrated, high-speed bidirectional communication networks, incorporating advanced sensing, measurement, and control technologies [3] - Distribution automation systems based on electric automation technology can utilize smart meters and distributed sensors to collect real-time operational data, enabling intelligent operation and management of the distribution network [3] - Automation systems in substations allow remote operation and control of equipment, improving accuracy and efficiency while reducing safety risks associated with human error [3] Group 4: Energy Management Systems (EMS) - Energy Management Systems (EMS) are core control systems for grid operations, and optimizing them through electric automation technology significantly enhances the management capabilities of new energy generation and 
grid loads [4] - High-precision power sensors and high-speed communication networks enable real-time monitoring of new energy generation output, allowing for in-depth data analysis [4] - Intelligent algorithms and model predictive control techniques can optimize scheduling strategies, balancing new energy and traditional energy generation proportions [4] Group 5: Intelligent Diagnosis and Maintenance of Generation Equipment - The use of IoT technology in solar power plants allows for real-time monitoring of key equipment parameters, aiding in quick fault diagnosis and response [5] - Wind power plants can monitor operational parameters of wind turbines, enabling remote control and health monitoring of critical components [5] - Preventive maintenance plans should be developed based on health and fault prediction models, incorporating regular inspections and equipment maintenance to reduce failure rates and costs [5][6]
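The MPC idea in Group 2 (predict photovoltaic output, then plan battery charge/discharge over a short horizon) can be sketched as a receding-horizon dispatch loop. The greedy one-step policy, the capacity and rate figures, and the function name below are illustrative assumptions, not the article's actual control scheme:

```python
# Minimal MPC-style dispatch sketch: given a short-term PV forecast and a load
# profile, choose a charge (+) / discharge (-) action per step, respecting the
# battery's capacity and power-rate limits. All numbers are hypothetical.

def mpc_dispatch(soc, forecast, load, capacity=100.0, max_rate=20.0):
    """soc: state of charge (kWh); forecast/load: per-step PV output and demand (kW).

    Returns the per-step action plan and the final state of charge. A greedy
    surplus-absorbing rule stands in for a real horizon optimizer.
    """
    plan = []
    for pv, demand in zip(forecast, load):
        surplus = pv - demand                           # + means excess PV power
        if surplus >= 0:                                # charge with the surplus
            action = min(surplus, max_rate, capacity - soc)
        else:                                           # discharge to cover the deficit
            action = -min(-surplus, max_rate, soc)
        soc += action
        plan.append(round(action, 3))
    return plan, round(soc, 3)
```

A real MPC controller would solve a small optimization over the whole horizon at every step and re-plan as the solar-intensity forecast updates; the greedy rule above only stands in for that solver.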
Surpassing GPT-4o! A Chinese team's new framework lifts Qwen's cross-domain reasoning by 10% and sets new records on 12 benchmarks
量子位· 2025-06-04 00:17
Core Insights - A new reinforcement learning method called General-Reasoner has significantly improved the performance of the Qwen series models, surpassing GPT-4o in various benchmarks [1][2]. Group 1: Methodology and Innovations - The General-Reasoner framework enhances cross-domain reasoning accuracy by nearly 10%, addressing limitations of existing Zero-RL methods that focus on single-domain data and rigid validation methods [2][4]. - The research team created a comprehensive reasoning dataset, WebInstruct-verified, consisting of approximately 230,000 high-quality, verifiable reasoning questions across multiple fields such as physics, chemistry, and finance [5][9]. - The dataset was derived from WebInstruct, which initially included around 5 million natural instructions, with a rigorous filtering process to ensure quality and relevance [6][7]. Group 2: Validation Mechanism - A new generative answer verifier, General-Verifier, was developed to replace traditional rule-based validation, significantly improving the accuracy of answer verification across diverse domains [13]. - General-Verifier, with only 1.5 billion parameters, generates reasoning processes and outputs binary correctness judgments, providing accurate and interpretable feedback for reinforcement learning [13]. Group 3: Performance Metrics - The General-Reasoner framework was tested on 12 benchmark tests, showing a 10% improvement in cross-domain tasks compared to the base models, with specific accuracy rates such as 58.9% for Qwen2.5-7B-Base in MMLU-Pro [15]. - The optimal model, General-Reasoner-Qwen3-14B, achieved competitive results against GPT-4o, with accuracy rates of 56.1% in GPQA and 54.4% in TheoremQA [15][16]. Group 4: Future Directions - The research team aims to further optimize model performance, expand high-quality reasoning data across more domains, and enhance the robustness of the verifier to facilitate broader applications of large language models in complex real-world tasks [17].
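The generative-verifier interface described above (produce a reasoning process first, then a binary correctness judgment) can be mimicked in miniature. The normalization rules below are invented stand-ins; the real General-Verifier is a 1.5-billion-parameter model that generates its reasoning with an LLM rather than with hand-written rules:

```python
# Toy stand-in for a generative answer verifier: instead of a brittle exact
# string match, "reason" about both answers (here: normalize case, whitespace,
# and numeric form) before emitting a boolean verdict plus the trace.

from fractions import Fraction

def generative_verify(question, model_answer, reference):
    steps = []

    def normalize(s):
        s = s.strip().lower().rstrip(".").replace(" ", "")
        try:
            return Fraction(s)        # treat numeric answers symbolically: "1/2" == "0.5"
        except ValueError:
            return s                  # fall back to normalized text

    a, b = normalize(model_answer), normalize(reference)
    steps.append(f"normalized {model_answer!r} -> {a!r}")
    steps.append(f"normalized {reference!r} -> {b!r}")
    verdict = a == b
    steps.append(f"verdict: {verdict}")
    return steps, verdict
```

The point of the interface, as in the article, is that the reasoning trace makes the reward signal interpretable while the final bit is what reinforcement learning consumes.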
AGI's road of no return
虎嗅APP· 2025-06-03 13:52
Core Insights - The article discusses the rapid advancements in AI technologies, particularly focusing on the emergence of intelligent agents and their potential to replace a significant portion of entry-level jobs, with predictions that they could take over 50% of such roles by 2026 [3][4][5]. - The competition between the US and China in AI development is intensifying, with Chinese models like DeepSeek showing significant performance improvements and closing the gap with US counterparts [5][6][11]. Group 1: AI Advancements - The introduction of advanced models such as OpenAI's o3 and Gemini 2.5 pro has accelerated the development of intelligent agents, which are now capable of handling increasingly complex tasks [3][4]. - OpenAI's annual revenue has reached $10 billion, while Anthropic's revenue has surged from $1 billion to $3 billion within six months, indicating a strong market demand for AI applications [4]. Group 2: Global AI Competition - China's DeepSeek model has surpassed Gemini 2.5 pro in performance, showcasing the rapid advancements in Chinese AI technology [5][6]. - The gap between Chinese and US AI models has narrowed from two years at the time of ChatGPT's release to less than three months, highlighting China's competitive edge in AI development [11]. Group 3: Geopolitical Implications - AI is viewed as a significant economic lever and a source of geopolitical influence by both the US and China, with both nations investing heavily in AI infrastructure and talent acquisition [36][37]. - The article suggests that the next phase of AI commercialization may not follow a "winner-takes-all" model but rather a fusion and restructuring of platforms and specialized vendors [35].
Challenging the dominance of RL post-training! A brand-new unsupervised method needs only 1 example and 10 optimization steps
量子位· 2025-06-01 03:40
Contributed by the Ubiquant team
量子位 | 公众号 QbitAI

No labeled data, no elaborate reward engineering, and results within 10 steps: "entropy minimization" may be better suited than reinforcement learning for rapidly upgrading large language models.

Reinforcement learning (RL) has been highly successful at fine-tuning large language models (LLMs) in recent years, but steep annotation costs, complex reward design, and long training cycles have become bottlenecks to its wider application.

The Ubiquant research team proposes an extremely simple and effective unsupervised alternative: One-Shot Entropy Minimization (EM). Using a single unlabeled example, it significantly improves LLM performance within 10 training steps, even surpassing RL methods that consume thousands of examples.

1. From RL to EM: the fine-tuning dilemma and a new idea

After pre-training on massive data, today's LLMs exhibit striking general capability. To reach top-tier performance on specific, complex reasoning tasks (for example math, physics, or programming), however, the mainstream post-training approach is reinforcement learning, in particular reinforcement learning with verifiable rewards (RLVR).

Although RL-based fine-tuning has made notable progress in raising model performance, the process has clear drawbacks that make it costly and cumbersome. By contrast, entropy minimization (EM) offers ...
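The entropy-minimization objective described above can be shown self-contained on a toy 4-way softmax rather than an LLM's token logits: ten gradient steps that sharpen the output distribution of a single input, with no labels and no reward model. The logits, learning rate, and use of the closed-form entropy gradient are illustrative only:

```python
# One-shot entropy minimization (EM) sketch: gradient descent on the entropy
# H(softmax(z)) of a toy logit vector, using dH/dz_i = -p_i * (log p_i + H).

import math

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def em_steps(logits, lr=1.0, steps=10):
    """Run `steps` descent steps on the entropy of softmax(logits)."""
    z = list(logits)
    for _ in range(steps):
        p = softmax(z)
        H = entropy(p)
        grad = [-pi * (math.log(pi) + H) for pi in p]   # closed-form entropy gradient
        z = [zi - lr * gi for zi, gi in zip(z, grad)]
    return z

z0 = [1.0, 0.5, 0.2, 0.1]
H_before = entropy(softmax(z0))
H_after = entropy(softmax(em_steps(z0)))
```

The update pushes probability mass toward the tokens the model already prefers, which is the intuition behind why a single unlabeled example can sharpen an LLM's behavior.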
Witnessing history! DeepSeek rises to world's No. 2 AI lab, R1 takes the open-source crown, and the whole internet clamors for R2
程序员的那些事· 2025-06-01 02:04
Core Viewpoint - DeepSeek has officially announced the completion of the R1-0528 upgrade, which significantly enhances its model performance, making it a leading open-source AI model and the second-largest AI laboratory globally [1][9][46]. Performance Enhancements - The upgraded DeepSeek-R1-0528 model exhibits performance comparable to top models like o3 and Gemini 2.5 Pro in various benchmark tests, particularly in mathematics, programming, and general logic [2][15]. - The model's accuracy in complex reasoning tasks has improved significantly, with AIME 2025 test accuracy rising from 70% to 87.5% [16]. - In benchmark tests, DeepSeek-R1-0528 achieved notable scores, such as 91.4% in AIME 2024 and 87.5% in AIME 2025 [17]. Reduction in Hallucination Rate - The hallucination rate of DeepSeek-R1-0528 has been reduced by 45%-50% compared to its predecessor, addressing previous concerns about high hallucination rates [20][24]. - This improvement allows the model to provide more accurate and reliable results in tasks such as summarization and reading comprehension [25][26]. Enhanced Functionality - DeepSeek-R1-0528 supports tool calls, enabling it to summarize articles by fetching content from links, achieving competitive scores in Tau-Bench [31]. - The model's front-end code generation capabilities have been enhanced, allowing for the rapid creation of applications with comprehensive features [33]. Distillation of Qwen3-8B - Alongside the R1 upgrade, DeepSeek has distilled the R1-0528 model's reasoning chain into a new version, DeepSeek-R1-0528-Qwen3-8B, which shows strong performance in mathematical tests, surpassing Qwen3-8B [6][37]. - The Qwen3-8B model, despite having significantly fewer parameters, demonstrates competitive performance, indicating the effectiveness of the distillation process [38]. 
Industry Positioning - Following the R1 upgrade, DeepSeek has been recognized as the second-largest AI laboratory globally, surpassing competitors like xAI, Meta, and Anthropic [44][46]. - The model's intelligence index score has increased from 60 to 68, reflecting a significant advancement comparable to OpenAI's improvements [46][47].
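The distillation step above (transferring R1-0528's reasoning into the much smaller Qwen3-8B) is done in DeepSeek's case by fine-tuning on the teacher's reasoning text; the closely related classic, distribution-level form of distillation is easy to show in a few lines. The temperature and logits below are illustrative, not DeepSeek's actual recipe:

```python
# Classic knowledge distillation sketch: the student is trained to match the
# teacher's temperature-softened next-token distribution via KL divergence.

import math

def soften(logits, T=2.0):
    """Temperature-scaled softmax; higher T exposes more of the teacher's preferences."""
    m = max(logits)
    e = [math.exp((z - m) / T) for z in logits]
    s = sum(e)
    return [x / s for x in e]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher_logits = [4.0, 2.0, 1.0, 0.5]   # hypothetical teacher preferences
student_logits = [1.0, 1.0, 1.0, 1.0]   # untrained student: uniform
loss = kl_divergence(soften(teacher_logits), soften(student_logits))
```

Minimizing this loss over many positions is what lets an 8B-parameter student inherit behavior from a far larger teacher.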
From scorer to thinker: RM-R1 reshapes model value judgment through reasoning
机器之心· 2025-05-31 04:00
"Know that something is so, and also know why it is so."

This Confucian maxim stresses that true understanding lies not only in the result but in the reasoning behind it. In the post-training stage of large language models today, reward models carry the crucial duty of bridging model behavior and human values; yet existing models usually emit only a score and can hardly explain its basis. A reward without reasoning "knows that, but not why": it is hard to trust and a poor guide for further learning.

A research team at the University of Illinois Urbana-Champaign proposes the RM-R1 framework, which recasts reward modeling as a reasoning task and introduces Reasoning Reward Models (ReasRMs). RM-R1 focuses on integrating reasoning ability into the reward model so that it can evaluate and score model outputs more accurately and align better with human preferences. By generating structured evaluation criteria and explicit reasoning traces, RM-R1 improves both the interpretability and the performance of reward models.

The paper validates three core findings:

1. Scale brings gains: the larger the model and the more compute, the better RM-R1's reasoning-chain training works, with performance rising almost linearly;
2. Naively reusing old RL recipes fails: to make a model "reason well", one must precisely categorize problem types and distill the reasoning process in a targeted way; only then do the gains truly generalize;
3. Reasoning is more general than directly emitting answers: compared with conventional direct supervision, RM-R1's reasoning ability is more stable ...
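The "generate evaluation criteria and a reasoning trace, then score" interface can be illustrated with a toy scorer. The rubric, keyword checks, and weights below are invented placeholders; RM-R1 generates both the criteria and the reasoning with an LLM rather than with fixed rules:

```python
# Toy reasoning-reward interface: emit an explicit rubric-based evaluation
# trace alongside the scalar score, instead of a bare number.

def reasoning_reward(response, rubric):
    """rubric: list of (criterion_name, check_fn, weight) tuples."""
    trace, score = [], 0.0
    for name, check, weight in rubric:
        passed = check(response)
        trace.append(f"{name}: {'pass' if passed else 'fail'} (weight {weight})")
        if passed:
            score += weight
    trace.append(f"total score: {score}")
    return trace, score

# Hypothetical rubric for a short math explanation.
rubric = [
    ("cites a reason",    lambda r: "because" in r.lower(),   0.4),
    ("complete sentence", lambda r: r.rstrip().endswith("."), 0.2),
    ("non-trivial length", lambda r: len(r.split()) >= 5,     0.4),
]
```

The trace is what makes the score auditable: a human (or a downstream trainer) can see which criteria drove the judgment, which is exactly the interpretability gap the article says scalar-only reward models leave open.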
Stanford accidentally used AI to generate super-strong CUDA kernels, optimized better than human experts! Doubling and crushing native PyTorch, with Chinese lead authors
量子位· 2025-05-31 03:34
Core Insights - AI unexpectedly generated kernels outperform those optimized by human experts, showcasing significant performance improvements in deep learning operations [1][2][4] Performance Metrics - AI-optimized kernels achieved up to 400% performance improvement over native PyTorch in common deep learning operations [2] - Specific performance metrics include: - Matrix multiplication (Matmul, FP32): 101.3% of PyTorch's torch.matmul - 2D convolution (Conv2D): 179.9% of torch.nn.Conv2D - Softmax: 111.8% of torch.softmax - Layer normalization (LayerNorm): 484.4% of torch.nn.LayerNorm - Conv2D + ReLU + MaxPool combination: 290.1% of PyTorch reference implementation and 189.0% of torch.compile() reference implementation [6] Research Methodology - The research team initially aimed to generate synthetic data for training kernel generation models but discovered that the synthetic data itself could produce high-performance kernels [3][40] - The optimization process involved a language reasoning step between iterations, encouraging diverse search processes [9][10] - The team employed a multi-branch exploration strategy, allowing multiple implementations to evolve from each idea, selecting the best-performing kernel for subsequent rounds [16][19] Implementation Details - Kernels were written in pure CUDA-C without relying on libraries like CUTLASS and Triton [13] - The optimization approach diverged from traditional sequential modifications, instead utilizing natural language to generate optimization ideas before translating them into code [14][15] - The research demonstrated that the generated kernels utilized advanced optimizations and hardware features previously considered difficult to implement [41] Future Prospects - The research team expressed optimism about future developments, noting that their initial goal of generating functional kernels has evolved into achieving significant performance improvements [47][48] - They highlighted ongoing optimization efforts, 
particularly in FP16 Matmul and FP16 Flash Attention, with current performance at 52% and 9% of torch.matmul and torch.nn.functional.scaled_dot_product_attention, respectively [46]
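The multi-branch exploration loop described above (expand several optimization ideas per round, benchmark every candidate, seed the next round with the fastest) can be sketched abstractly. The cost model and the random "mutation" below are invented for illustration; the actual system generates natural-language optimization ideas, translates them into real CUDA-C kernels, and times them on hardware:

```python
# Schematic best-of-N branching search: each round spawns several candidate
# "kernels" (here just runtime costs), keeps the fastest, and iterates.

import random

def search_kernels(seed_cost, ideas_per_round=4, rounds=3, rng=None):
    """Lower cost = faster kernel. Returns the best cost after each round."""
    rng = rng or random.Random(0)          # deterministic for reproducibility
    best = seed_cost
    history = [best]
    for _ in range(rounds):
        # "idea generation": each branch perturbs the current best differently
        candidates = [best * rng.uniform(0.7, 1.1) for _ in range(ideas_per_round)]
        best = min(candidates + [best])    # never regress past the incumbent
        history.append(best)
    return history
```

Because the incumbent is always kept, the best-so-far cost is monotonically non-increasing, which mirrors why branching searches avoid the dead ends of purely sequential edit-by-edit refinement.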
No GPUs needed: a large model digests an advanced-math problem every 2 seconds. This is Huawei's strength
雷峰网· 2025-05-30 09:48
Core Viewpoint - Huawei defines the benchmark for domestic large model training through technological innovation, achieving breakthroughs in computing power utilization and post-training throughput [1][4]. Group 1: Technological Innovations - Huawei's "Ascend + Pangu Ultra MoE" combination has unlocked a fully controllable training loop for domestic computing power and models, achieving industry-leading performance in cluster training systems [4][5]. - The pre-training phase saw the Ascend Atlas 800T A2 cluster's model training utilization (MFU) increase to 41%, while the post-training phase achieved a throughput of 35K Tokens/s on a single CloudMatrix 384 super node [5][36]. - Huawei disclosed key technologies in its technical report, highlighting the efficient integration of sparse MoE reinforcement learning post-training frameworks [6][7]. Group 2: Challenges in Current Training Processes - Six main challenges were identified in the current MoE pre-training and reinforcement learning post-training processes, including difficulties in parallel strategy configuration, communication bottlenecks, uneven system load distribution, excessive operator scheduling overhead, complex training process management, and limitations in large-scale expansion [10][11]. Group 3: Solutions to Enhance Training Efficiency - Huawei proposed a complete end-to-end solution to address these challenges, focusing on enhancing training cluster utilization through intelligent parallel strategy selection, deep integration of computation and communication, and global dynamic load balancing [12][14]. - The first strategy involved optimizing parallel configurations, achieving a deployment that included 16 pipeline parallelism, 8 tensor parallelism, and 32 expert parallelism [15][16]. - The second strategy focused on releasing computing power at the single-node level, doubling the micro-batch size (MBS) and optimizing operator scheduling to fully utilize Ascend node capabilities [20][21]. 
Group 4: Reinforcement Learning Innovations - Huawei introduced the RL Fusion training and inference co-card technology, which supports flexible deployment modes and achieves a doubling of cluster utilization in post-training [28][29]. - The design of a semi-asynchronous mechanism, StaleSync, allows different tasks to execute in parallel while maintaining model accuracy, resulting in a 50% increase in overall training throughput [30]. Group 5: Performance Metrics and Future Prospects - The Pangu Ultra MoE model, with 718 billion parameters, demonstrated high performance during training, achieving a model utilization rate of 41% and a throughput of 35K Tokens/s in post-training [35][36]. - The system is designed to support ultra-large-scale clusters and models, with expectations for future iterations to achieve even higher utilization rates [35][36].
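The bounded-staleness idea behind a semi-asynchronous mechanism like StaleSync can be reduced to a single admission check: a worker may start its next step only while it stays within a fixed number of steps of the slowest worker. The threshold logic below is an assumption for illustration, not Huawei's implementation:

```python
# Bounded-staleness gate: admit a worker's next step only if, after taking it,
# the worker would be at most `max_stale` steps ahead of the slowest worker.

def can_proceed(worker_versions, worker_id, max_stale=1):
    """worker_versions: current step index per worker (list of ints)."""
    slowest = min(worker_versions)
    return worker_versions[worker_id] + 1 - slowest <= max_stale
```

Fast workers keep computing instead of idling at a global barrier, while the staleness bound caps how far their parameters can drift from the slowest worker's, which is how such schemes trade a controlled loss of synchrony for higher overall throughput.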