Reinforcement Learning
Richard Sutton, Father of Reinforcement Learning: Human Data Is Running Out, and AI Is Entering the "Era of Experience"!
AI科技大本营· 2025-06-06 10:18
Core Viewpoint
- The article emphasizes that true intelligence in AI should stem from experience rather than pre-set human data and knowledge, marking a shift towards an "Era of Experience" in AI development [5][16].

Summary by Sections

Introduction to the Era of Experience
- The current era in AI is characterized by a transition from reliance on human-generated data to a focus on experiential learning, where AI systems learn through interaction with the world [9][16].

Key Insights from Richard Sutton's Speech
- Richard Sutton argues that genuine AI must have a dynamic data source that evolves with its capabilities, as static datasets will become inadequate [6][9].
- He highlights that the essence of intelligence lies in the ability to predict and control sensory inputs, which is fundamental to AI and intelligence [13].

The Learning Process
- The learning process in both humans and animals is based on interaction with the environment, where actions determine the information received, leading to a deeper understanding [10][11].
- Sutton illustrates that AI should emulate this learning process by engaging with the world to generate new data and enhance its capabilities [10][12].

Transition from Human Data to Experience
- The article outlines a timeline of AI evolution, indicating that the current "Human Data Era" is nearing its end, paving the way for the "Experience Era" where AI learns through real-world interactions [14][16].
- Sutton emphasizes that the future of AI lies in its ability to continuously learn from experiences, which is essential for unlocking the full potential of the "Experience Era" [17].

Decentralized Cooperation
- The concept of "decentralized cooperation" is introduced as a framework for understanding social organization, where multiple agents pursue their own goals while collaborating for mutual benefit [24][25].
- Sutton argues that human prosperity and the future of AI should be built on this foundation of decentralized cooperation rather than centralized control [27][28].

Conclusion
- The article concludes by encouraging a shift in perspective towards viewing interactions between humans and AI through the lens of decentralized cooperation versus centralized control, which could provide valuable insights into future developments in AI [28].
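The interaction loop described in "The Learning Process", where the agent's own actions determine the data it receives, can be sketched as a tiny agent-environment loop. This is a minimal illustration rather than Sutton's own formulation: the two-armed bandit environment, the epsilon-greedy rule, and the names `run_experience_loop` and `payout_probs` are all assumptions chosen for brevity.

```python
import random

def run_experience_loop(payout_probs, steps=2000, epsilon=0.1, seed=0):
    """Learn action values purely from interaction (hypothetical toy setup)."""
    rng = random.Random(seed)
    values = [0.0] * len(payout_probs)   # running estimate of each action's reward
    counts = [0] * len(payout_probs)     # how often each action was tried
    for _ in range(steps):
        # Explore occasionally, otherwise act on the current estimates.
        if rng.random() < epsilon:
            action = rng.randrange(len(payout_probs))
        else:
            action = max(range(len(values)), key=values.__getitem__)
        # The environment responds only to the action actually taken:
        # the agent's behaviour determines the data it gets to learn from.
        reward = 1.0 if rng.random() < payout_probs[action] else 0.0
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]  # incremental mean
    return values, counts

values, counts = run_experience_loop([0.2, 0.8])
```

No fixed dataset appears anywhere in the loop; every training signal is generated by the agent's own choices, which is the contrast with static human data that the summary draws.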
R1-Style Training No Longer Looks Only at Whether the Answer Is Right! CUHK Releases the SophiaVL-R1 Model
机器之心· 2025-06-06 09:36
Core Insights
- The article discusses the evolution of reasoning models, particularly focusing on the introduction of the SophiaVL-R1 model, which incorporates a "thinking reward" mechanism to enhance reasoning quality and generalization capabilities [3][5][13].

Group 1: Model Development
- The SophiaVL-R1 model represents a significant advancement over previous models by not only rewarding correct answers but also evaluating the reasoning process behind those answers [3][7].
- This model has demonstrated superior performance in various mathematical and multimodal benchmark tests, outperforming larger models such as LLaVA-OneVision-72B, which has ten times the parameters [5][20].

Group 2: Thinking Reward Mechanism
- The introduction of the "thinking reward" mechanism allows for a more comprehensive assessment of the reasoning process, ensuring that models learn effective reasoning strategies rather than relying on shortcuts [7][13].
- A specially designed dataset was created to score the reasoning processes, which includes diverse thinking patterns and errors, leading to the development of a "thinking scoring model" [10][11].

Group 3: Trust-GRPO Algorithm
- To address the issue of reward hacking, the SophiaVL-R1 model employs the Trust-GRPO training algorithm, which assesses the credibility of thinking rewards based on comparative analysis of correct and incorrect answers [17][18].
- This algorithm enhances the stability and reliability of the training process by adjusting the credibility scores of rewards when discrepancies are detected [18].

Group 4: Performance Metrics
- In various evaluation benchmarks, SophiaVL-R1-7B has shown remarkable reasoning and generalization abilities, achieving scores that directly compete with or exceed those of significantly larger models [20][21].
- The model's performance in specific benchmarks includes a score of 61.3 in MMMU and 2403.8 in MME, showcasing its effectiveness [21][23].

Group 5: Experimental Validation
- Ablation studies indicate that all components of the SophiaVL-R1 model contribute effectively to its overall performance, with evidence showing faster and better training outcomes [22][23].
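One way to read the Trust-GRPO description is as a per-group weight applied to the thinking reward. The sketch below is an illustration under assumptions, not the paper's formula: the trust weight, the mixing coefficient `alpha`, and the function name are hypothetical. Only the underlying idea, lowering credibility when incorrect answers earn higher thinking rewards than correct ones, comes from the summary.

```python
from statistics import mean

def trust_weighted_rewards(is_correct, thinking_scores, alpha=0.5):
    """Combine outcome and thinking rewards for one group of sampled responses.

    is_correct:      list of bools, one per response (outcome check)
    thinking_scores: list of floats in [0, 1] from a thinking scoring model
    alpha:           hypothetical weight of the thinking reward
    """
    right = [t for c, t in zip(is_correct, thinking_scores) if c]
    wrong = [t for c, t in zip(is_correct, thinking_scores) if not c]
    if right and wrong and mean(wrong) > mean(right):
        # Incorrect answers were rated as better reasoning: treat the
        # thinking reward as less credible and shrink its influence.
        trust = max(0.0, 1.0 - (mean(wrong) - mean(right)))
    else:
        trust = 1.0
    rewards = [float(c) + alpha * trust * t
               for c, t in zip(is_correct, thinking_scores)]
    return rewards, trust

rewards, trust = trust_weighted_rewards([True, False, False], [0.4, 0.9, 0.7])
```

In this example the two incorrect responses receive higher thinking scores than the correct one, so the trust weight drops below 1 and the thinking reward contributes less to every response in the group.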
Alibaba Agent's Multi-Turn Reasoning Surpasses GPT-4o; Open-Source Models Can Do Deep Research Too
量子位· 2025-06-06 04:01
Group 1
- The core viewpoint of the article is the introduction of WebDancer, an advanced autonomous information retrieval agent developed by Tongyi Lab, which addresses the growing demand for multi-step information retrieval capabilities in an era of information overload [1][2][3].

Group 2
- Background: Traditional search engines cannot meet users' needs for deep, multi-step information retrieval across fields such as medical research, technological innovation, and business decision-making [3].
- Challenges: Building autonomous agents faces significant challenges, particularly in obtaining the high-quality training data necessary for complex multi-step reasoning [4].

Group 3
- Innovative Data Synthesis: WebDancer proposes two innovative data synthesis methods, ReAct framework and E2HQA, to address data scarcity [5][6].
- ReAct Framework: This framework involves a cycle of Thought-Action-Observation, enabling the agent to generate thoughts, take structured actions, and receive feedback iteratively [5].

Group 4
- Training Strategies: WebDancer employs a two-phase training strategy, including Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), to enhance the agent's adaptability and decision-making capabilities in dynamic environments [12][13].
- Data Quality Assurance: A multi-stage data filtering strategy is implemented to ensure high-quality training data, enhancing the agent's learning efficiency [9][10].

Group 5
- Experimental Results: WebDancer has demonstrated outstanding performance in various information retrieval benchmark tests, particularly excelling in the GAIA and WebWalkerQA datasets [17][18][19].
- Performance Metrics: The best-performing models achieved a Pass@3 score of 61.1% on the GAIA benchmark and 54.6% on the WebWalkerQA benchmark, showcasing their robust capabilities [20].

Group 6
- Future Prospects: WebDancer aims to integrate more complex tools and expand its capabilities to handle open-domain long-text writing tasks, enhancing the agent's reasoning and generative abilities [29][30].
- Emphasis on Agentic Models: The focus is on developing foundational models that inherently support reasoning, decision-making, and multi-step tool invocation, reflecting a philosophy of simplicity and universality in engineering [30][31].
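The Thought-Action-Observation cycle mentioned in Group 3 can be sketched as a short loop. This is a generic ReAct-style skeleton under assumptions, not WebDancer's implementation: `policy` stands in for the language model, and the `search` tool and `toy_policy` below are hypothetical stand-ins.

```python
def react_loop(question, policy, tools, max_steps=5):
    """Run Thought -> Action -> Observation until the policy answers."""
    trajectory = []
    for _ in range(max_steps):
        thought, action, arg = policy(question, trajectory)
        if action == "answer":
            trajectory.append((thought, action, arg, None))
            return arg, trajectory
        observation = tools[action](arg)          # structured tool call
        trajectory.append((thought, action, arg, observation))
    return None, trajectory                       # step budget exhausted

# Hypothetical stand-ins for a model and a search tool.
def toy_policy(question, trajectory):
    if not trajectory:
        return ("I should look this up.", "search", question)
    last_observation = trajectory[-1][3]
    return ("The observation answers it.", "answer", last_observation)

toy_tools = {"search": lambda q: "Paris" if "France" in q else "unknown"}

answer, steps = react_loop("What is the capital of France?", toy_policy, toy_tools)
```

The recorded trajectory of (thought, action, argument, observation) tuples is the kind of multi-step trace that supervised fine-tuning and reinforcement learning would then consume.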
赛道Hyper | ByteDance's VMR²L System Achieves Second-Level Inference in Production Engineering
Hua Er Jie Jian Wen· 2025-06-06 03:22
Core Insights
- ByteDance's ByteBrain team, in collaboration with UC Merced and UC Berkeley, has developed VMR²L, a deep reinforcement learning-based virtual machine rescheduling system that achieves near-optimal performance while reducing inference time to 1.1 seconds, thus unifying system performance with industrial deployability [1][2].

Group 1: VMR²L System Features
- VMR²L utilizes a hierarchical attention network to capture resource dependencies between virtual and physical machines, combined with asynchronous policy gradient algorithms for distributed training, enabling state evaluation and action selection within milliseconds [2].
- The dynamic graph pruning technology allows for real-time elimination of ineffective computation nodes, enhancing inference speed by 270 times compared to traditional Mixed Integer Programming (MIP) methods, reducing migration time from 50 minutes to 1.1 seconds with only a 3% higher fragmentation rate than the optimal solution [2].
- The system's two-stage agent architecture filters illegal actions through explicit constraints, naturally adhering to industrial scheduling rules such as resource capacity and affinity, with a generalization error of less than 5% across different load scenarios [2].

Group 2: Market Impact and Efficiency
- In typical cloud computing clusters, VMR²L can improve resource utilization by 18%-22% and reduce migration time from minutes to seconds, providing a feasible solution for real-time resource scheduling in high-density data centers [2][3].
- The system reduces resource fragmentation by 20% and saves over 5% in annual server procurement costs, while maintaining performance fluctuations of less than 8% across various industry load models [4].
- The lightweight model, with only 1.2GB of parameters, supports edge deployment, reducing data transmission by 70% and improving response times at edge nodes by five times [4].

Group 3: Technological Advancements and Future Directions
- VMR²L's event-driven communication protocol reduces inter-node latency to 5 milliseconds, supporting distributed decision-making for large-scale clusters with tens of thousands of nodes, improving task completion efficiency by 40% compared to traditional polling mechanisms [5].
- The system's standardized interface design provides compatibility with major cloud platforms like OpenStack and Kubernetes, significantly lowering the technical migration costs for enterprises [5].
- The development of VMR²L marks a shift in reinforcement learning from "algorithm competition" to "value creation," directly enhancing resource utilization for IaaS providers and supporting latency-sensitive fields such as autonomous driving and industrial robotics [5][6].

Group 4: Broader Implications
- The emergence of VMR²L reflects the deep integration of artificial intelligence with the real economy, offering a universal solution for real-time decision-making in smart manufacturing and smart city applications [6].
- Despite challenges in areas like autonomous driving certification and quantum computing integration, this achievement outlines a clear industrialization path for reinforcement learning technology, focusing on balancing efficiency, cost, and reliability [6][7].
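Filtering illegal actions through explicit constraints, as described under Group 1, is commonly done by masking infeasible choices before the policy picks one. The sketch below illustrates that general idea under assumptions and is not VMR²L's architecture: it checks only CPU and memory capacity, uses a greedy pick, and every name and score in it is hypothetical.

```python
import math

def feasible_hosts(vm, hosts):
    """Return indices of hosts with enough free CPU and memory for the VM."""
    return [i for i, h in enumerate(hosts)
            if h["free_cpu"] >= vm["cpu"] and h["free_mem"] >= vm["mem"]]

def masked_choice(scores, legal):
    """Pick the highest-scoring host among the legal ones (greedy policy)."""
    if not legal:
        return None                       # no legal migration for this VM
    masked = [s if i in legal else -math.inf for i, s in enumerate(scores)]
    return max(range(len(masked)), key=masked.__getitem__)

hosts = [
    {"free_cpu": 2, "free_mem": 4},
    {"free_cpu": 8, "free_mem": 16},
    {"free_cpu": 16, "free_mem": 8},
]
vm = {"cpu": 4, "mem": 8}
scores = [0.9, 0.3, 0.5]                  # hypothetical policy scores per host

legal = feasible_hosts(vm, hosts)
choice = masked_choice(scores, legal)
```

Host 0 has the highest raw score but cannot fit the VM, so the mask removes it and the choice falls to the best remaining legal host. The policy therefore never has to learn the capacity rule from penalties.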
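121K High-Difficulty Math Problems Boost Model Performance, Covering Top-Competition Difficulty Such as FIMO/Putnam, from Tencent and Shanghai Jiao Tong University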
量子位· 2025-06-06 00:58
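Submitted by the DeepTheorem team | 量子位 (official account QbitAI)

121K IMO-level "special training problems" teach AI to derive mathematical proofs the way humans do! After the "special training", theorem-proving performance rises sharply, with the 7B model matching or surpassing existing open-source models and commercial models such as Claude3.7.

The "special training problems" are DeepTheorem, the first natural-language-based mathematical theorem proving framework and dataset, jointly released by Tencent AI Lab and a team from Shanghai Jiao Tong University.

According to the team, theorem proving is an important part of the mathematical frontier, but current large language models (LLMs) doing mathematical reasoning, especially when trained with reinforcement learning (RL), usually need answers that can be verified automatically. As a result, they cannot prove theorems in natural language the way mathematicians do.

Figure (b) shows the performance of the RL-trained DeepTheorem-7B model, which matches or surpasses existing open-source and commercial models (Gemini2.0-flash, Qwen2.5-72B-Instruct, Claude3.7, etc.) and trails only the strong reasoning models o1, o3, and Gemini2.5-pro.

DeepTheorem-121K

1. Scale and difficulty: built for the "extreme challenge"

A notable feature of the DeepTheorem training set is its large scale and high difficulty. It contains 121K ...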
The Key Figures Behind Gemini 2.5's Come-from-Behind Overtake
Hu Xiu· 2025-06-05 03:14
Group 1: Core Insights on Gemini 2.5
- Gemini 2.5 Pro has achieved the best performance metrics among large models, showcasing a significant leap from being a follower to a leader in the AI model landscape [2][20].
- The training process of Gemini 2.5 emphasizes three fundamental steps: Pre-training, Supervised Fine-tuning (SFT), and Reinforcement Learning from Human Feedback (RLHF) for alignment [2][3].
- The focus on reinforcement learning, particularly in tasks with clear objectives like mathematics and programming, has contributed to Gemini's impressive performance [3][4].

Group 2: Competitive Landscape and Model Development
- Google has accumulated substantial foundational training experience from previous versions of Gemini, which has been enhanced by a greater emphasis on reinforcement learning [3][4].
- Other companies like Anthropic have prioritized coding capabilities in their models, leading to a notable quality difference in code generation compared to competitors [4][5].
- The shift in focus from human preference outputs to programming capabilities has been a strategic move for Google, allowing it to catch up with competitors like OpenAI [10][11].

Group 3: Key Personnel and Organizational Dynamics
- Key figures in Google's AI development include Jeff Dean, Oriol Vinyals, and Noam Shazeer, who have significantly influenced the model's capabilities through their expertise in pre-training, reinforcement learning, and natural language processing [15][16].
- The integration of Google and DeepMind's strengths has created a powerful synergy, enhancing the overall capabilities of the Gemini model [17].
- Sergey Brin's return to Google has reinvigorated the company's culture, fostering a more ambitious and motivated environment among employees [20].

Group 4: API Pricing Strategy
- Gemini's API pricing is significantly lower than competitors', with token costs being approximately one-fifth to one-tenth of OpenAI's [21][22].
- Google's long-term investment in TPU technology has allowed it to reduce dependency on external GPU suppliers, contributing to lower operational costs [22][23].
- The ability to customize hardware and leverage extensive infrastructure resources has enabled Google to optimize model performance and pricing effectively [23][24].
The Humanoid Robot "Arena Contest": How Nanjing "Competes"
Nan Jing Ri Bao· 2025-06-05 00:21
Core Insights
- The article highlights the rapid development and competitive landscape of humanoid robots in Nanjing, showcasing various events and advancements in technology [1][11].
- Nanjing aims to establish itself as a "Robot City" by promoting humanoid robot research and production capabilities, with a focus on enhancing core components and overall system integration [10].

Group 1: Technological Advancements
- Humanoid robots are utilizing reinforcement learning to improve their movement and balance, showcasing significant progress in their capabilities [2][3].
- The current humanoid robots employ two main technical routes: electric servo and electro-hydraulic servo, with the latter being more powerful but complex [4].
- The development of humanoid robots is expected to take around 10 years to become commonplace in households, with a focus on reducing manufacturing costs [6][7].

Group 2: Industry Applications
- The article discusses the application of humanoid robots in various sectors, including industrial manufacturing, healthcare, and elder care, with specific examples of robots being tested in these fields [5][7].
- Nanjing's Tianchuang Electronics has introduced the world's first explosion-proof humanoid robot, targeting high-risk operational needs in industries like chemical handling and mining [7].

Group 3: Future Directions
- Nanjing is focusing on innovation in mechanisms and core components to enhance the competitiveness of its humanoid robot industry, with plans to support research and development in this area [8][10].
- The establishment of a humanoid robot training ground is suggested to facilitate the application of robots in various scenarios and to gather data for further development [9].
High Technology Supports High-Quality Operation of New Energy Power Generation Systems
Xin Hua Ri Bao· 2025-06-04 20:56
Group 1: Electric Automation Technology
- Electric automation technology integrates multiple disciplines such as electronic technology, computer technology, and control technology, characterized by intelligence, efficiency, networking, and environmental protection, which is crucial for the high-quality operation of new energy generation systems [1].

Group 2: Optimization of Energy Storage Systems
- Power companies can enhance the stability and reliability of energy storage systems by implementing Model Predictive Control (MPC) algorithms, which predict photovoltaic power generation changes based on real-time solar intensity [2].
- The application of reinforcement learning in the charging and discharging processes of energy storage systems can optimize control strategies to improve stability and reliability [2].
- When determining the capacity of energy storage devices, historical generation data should be analyzed to understand the maximum, minimum, and fluctuation ranges of power generation, ensuring that energy storage systems can meet load demands during low generation periods [2].

Group 3: Smart Grid Technology
- Smart grids are modernized power networks built on integrated, high-speed bidirectional communication networks, incorporating advanced sensing, measurement, and control technologies [3].
- Distribution automation systems based on electric automation technology can utilize smart meters and distributed sensors to collect real-time operational data, enabling intelligent operation and management of the distribution network [3].
- Automation systems in substations allow remote operation and control of equipment, improving accuracy and efficiency while reducing safety risks associated with human error [3].

Group 4: Energy Management Systems (EMS)
- Energy Management Systems (EMS) are core control systems for grid operations, and optimizing them through electric automation technology significantly enhances the management capabilities of new energy generation and grid loads [4].
- High-precision power sensors and high-speed communication networks enable real-time monitoring of new energy generation output, allowing for in-depth data analysis [4].
- Intelligent algorithms and model predictive control techniques can optimize scheduling strategies, balancing new energy and traditional energy generation proportions [4].

Group 5: Intelligent Diagnosis and Maintenance of Generation Equipment
- The use of IoT technology in solar power plants allows for real-time monitoring of key equipment parameters, aiding in quick fault diagnosis and response [5].
- Wind power plants can monitor operational parameters of wind turbines, enabling remote control and health monitoring of critical components [5].
- Preventive maintenance plans should be developed based on health and fault prediction models, incorporating regular inspections and equipment maintenance to reduce failure rates and costs [5][6].
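The capacity-sizing step in Group 2, reading the maximum, minimum, and fluctuation range off historical generation and checking that storage can cover load during low-generation periods, can be sketched as follows. The hourly figures, the function name `size_storage`, and the simple rule of summing the worst consecutive shortfall are illustrative assumptions, not a sizing standard. A real study would also account for conversion losses, depth-of-discharge limits, and recharge between shortfalls.

```python
def size_storage(generation_kw, load_kw, hours_per_sample=1.0):
    """Summarise historical output and estimate the energy needed to ride
    through the worst consecutive run of generation shortfall."""
    stats = {
        "max_kw": max(generation_kw),
        "min_kw": min(generation_kw),
        "range_kw": max(generation_kw) - min(generation_kw),
    }
    worst_kwh = 0.0
    running_kwh = 0.0
    for gen, load in zip(generation_kw, load_kw):
        deficit = load - gen
        if deficit > 0:
            running_kwh += deficit * hours_per_sample   # storage must discharge
            worst_kwh = max(worst_kwh, running_kwh)
        else:
            running_kwh = 0.0                           # surplus ends the run
    stats["required_kwh"] = worst_kwh
    return stats

generation = [0, 0, 20, 60, 80, 50, 10, 0]   # hypothetical PV output, kW
load = [30, 30, 30, 30, 30, 30, 30, 30]      # hypothetical constant load, kW
stats = size_storage(generation, load)
```

With these sample numbers, the morning shortfall (30 + 30 + 10 kWh) is larger than the evening one (20 + 30 kWh), so the morning run sets the estimated requirement.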
Surpassing GPT-4o! A Chinese Team's New Framework Lifts Qwen's Cross-Domain Reasoning by 10%, Setting New Records on 12 Benchmarks
量子位· 2025-06-04 00:17
Core Insights
- A new reinforcement learning method called General-Reasoner has significantly improved the performance of the Qwen series models, surpassing GPT-4o in various benchmarks [1][2].

Group 1: Methodology and Innovations
- The General-Reasoner framework enhances cross-domain reasoning accuracy by nearly 10%, addressing limitations of existing Zero-RL methods that focus on single-domain data and rigid validation methods [2][4].
- The research team created a comprehensive reasoning dataset, WebInstruct-verified, consisting of approximately 230,000 high-quality, verifiable reasoning questions across multiple fields such as physics, chemistry, and finance [5][9].
- The dataset was derived from WebInstruct, which initially included around 5 million natural instructions, with a rigorous filtering process to ensure quality and relevance [6][7].

Group 2: Validation Mechanism
- A new generative answer verifier, General-Verifier, was developed to replace traditional rule-based validation, significantly improving the accuracy of answer verification across diverse domains [13].
- General-Verifier, with only 1.5 billion parameters, generates reasoning processes and outputs binary correctness judgments, providing accurate and interpretable feedback for reinforcement learning [13].

Group 3: Performance Metrics
- The General-Reasoner framework was tested on 12 benchmark tests, showing a 10% improvement in cross-domain tasks compared to the base models, with specific accuracy rates such as 58.9% for Qwen2.5-7B-Base in MMLU-Pro [15].
- The optimal model, General-Reasoner-Qwen3-14B, achieved competitive results against GPT-4o, with accuracy rates of 56.1% in GPQA and 54.4% in TheoremQA [15][16].

Group 4: Future Directions
- The research team aims to further optimize model performance, expand high-quality reasoning data across more domains, and enhance the robustness of the verifier to facilitate broader applications of large language models in complex real-world tasks [17].
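The contrast between rule-based validation and a verifier that returns a binary judgment can be sketched as two interchangeable reward functions. This is an illustration under assumptions: `rule_based_reward`, `verifier_reward`, and `toy_verifier` are hypothetical names, and the toy verifier only normalises numeric strings, whereas General-Verifier is described as a trained model that generates a reasoning process before its verdict.

```python
from fractions import Fraction

def rule_based_reward(prediction, reference):
    """Exact string match: brittle when equivalent answers are formatted differently."""
    return 1.0 if prediction.strip() == reference.strip() else 0.0

def verifier_reward(question, prediction, reference, verifier):
    """Binary reward from a verifier that returns (rationale, is_correct)."""
    _rationale, is_correct = verifier(question, prediction, reference)
    return 1.0 if is_correct else 0.0

def toy_verifier(question, prediction, reference):
    """Hypothetical stand-in for a learned verifier: compares values, not strings."""
    try:
        same = Fraction(prediction.strip()) == Fraction(reference.strip())
        return (f"Compared {prediction} and {reference} as numbers.", same)
    except (ValueError, ZeroDivisionError):
        same = prediction.strip().lower() == reference.strip().lower()
        return ("Fell back to case-insensitive text comparison.", same)

rule = rule_based_reward("0.5", "1/2")
learned = verifier_reward("What is 1/2 as a decimal?", "0.5", "1/2", toy_verifier)
```

The string rule rejects "0.5" against the reference "1/2" even though the two are equal, while the verifier-style reward accepts it. That gap is the kind of rigidity in rule-based validation that the summary says General-Verifier is meant to remove.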
AGI's Road of No Return
虎嗅APP· 2025-06-03 13:52
Core Insights
- The article discusses the rapid advancements in AI technologies, particularly focusing on the emergence of intelligent agents and their potential to replace a significant portion of entry-level jobs, with predictions that they could take over 50% of such roles by 2026 [3][4][5].
- The competition between the US and China in AI development is intensifying, with Chinese models like DeepSeek showing significant performance improvements and closing the gap with US counterparts [5][6][11].

Group 1: AI Advancements
- The introduction of advanced models such as OpenAI's o3 and Gemini 2.5 pro has accelerated the development of intelligent agents, which are now capable of handling increasingly complex tasks [3][4].
- OpenAI's annual revenue has reached $10 billion, while Anthropic's revenue has surged from $1 billion to $3 billion within six months, indicating a strong market demand for AI applications [4].

Group 2: Global AI Competition
- China's DeepSeek model has surpassed Gemini 2.5 pro in performance, showcasing the rapid advancements in Chinese AI technology [5][6].
- The gap between Chinese and US AI models has narrowed from two years at the time of ChatGPT's release to less than three months, highlighting China's competitive edge in AI development [11].

Group 3: Geopolitical Implications
- AI is viewed as a significant economic lever and a source of geopolitical influence by both the US and China, with both nations investing heavily in AI infrastructure and talent acquisition [36][37].
- The article suggests that the next phase of AI commercialization may not follow a "winner-takes-all" model but rather a fusion and restructuring of platforms and specialized vendors [35].