Reinforcement Learning
Tackling the "slowest link" in reinforcement learning! SJTU and ByteDance team up to speed up LLM RL training by 2.6x
QbitAI (量子位) · 2025-09-13 08:06
Core Insights
- The article discusses the inefficiencies in reinforcement learning (RL) training, particularly the rollout phase, which consumes over 80% of the training time and is limited by memory bandwidth and the autoregressive nature of generation [1][2].
Group 1: RhymeRL Framework
- A research team from Shanghai Jiao Tong University and ByteDance introduced RhymeRL, which raises RL training throughput by 2.6 times without sacrificing accuracy by leveraging historical data [2][21].
- RhymeRL is based on two key components: HistoSpec and HistoPipe [7].
Group 2: HistoSpec
- HistoSpec incorporates speculative decoding, using previous historical responses as the "best script," which transforms the rollout process from token-by-token generation into batch verification [9][10].
- This method significantly increases computational density and speeds up response generation, because drafts derived from historical sequences are accepted at a high rate [13][14].
Group 3: HistoPipe
- HistoPipe optimizes GPU resource utilization with a scheduling strategy that minimizes idle time, allowing efficient processing of tasks of varying lengths [15][19].
- It employs a "cross-step complement" approach to balance workloads across GPUs, so that resources are fully utilized without idle periods [17][18].
Group 4: Performance Improvement
- The combination of HistoSpec and HistoPipe achieves a 2.61 times increase in end-to-end training throughput on tasks such as mathematics and coding [21].
- This advancement allows researchers and companies to train more powerful models with fewer resources and in less time, accelerating the iteration of AI technologies [22].
Group 5: Significance of RhymeRL
- RhymeRL proposes a new paradigm in reinforcement learning that uses historical information to improve training efficiency, demonstrating the potential for better resource allocation and compatibility with existing training algorithms [23].
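The draft-then-verify idea described under HistoSpec can be sketched as follows. This is a minimal illustration of greedy speculative verification, not RhymeRL's actual implementation: `verify_draft` and `toy_next_token` are hypothetical names, the toy model is a stand-in for a real policy, and a real engine would score all draft positions in one batched forward pass rather than in a Python loop.

```python
from typing import Callable, List, Tuple


def verify_draft(
    prefix: List[int],
    draft: List[int],
    next_token: Callable[[List[int]], int],
) -> Tuple[List[int], int]:
    """Accept draft tokens while they match the model's own greedy choice.

    Returns the extended sequence and the number of accepted draft tokens.
    On the first mismatch, the model's token replaces the draft token and
    verification stops.
    """
    out = list(prefix)
    accepted = 0
    for tok in draft:
        model_tok = next_token(out)
        if model_tok != tok:
            out.append(model_tok)  # correction token from the model
            return out, accepted
        out.append(tok)
        accepted += 1
    return out, accepted


def toy_next_token(context: List[int]) -> int:
    # Hypothetical stand-in for the current policy: counts upward modulo 10.
    return (context[-1] + 1) % 10


# A "historical response" to the same prompt serves as the draft.
prefix = [0]
historical_draft = [1, 2, 3, 7, 8]
new_seq, n_accepted = verify_draft(prefix, historical_draft, toy_next_token)
print(new_seq, n_accepted)
```

The more the current policy agrees with its own earlier responses, the longer the accepted prefix, which is where the claimed speedup would come from.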
How to prepare for RL interview questions?
自动驾驶之心 (Heart of Autonomous Driving) · 2025-09-12 16:03
Core Insights
- The article discusses the GRPO (Group Relative Policy Optimization) framework, primarily categorizing it as on-policy but acknowledging its potential off-policy adaptations [5][6][7].
- It emphasizes the importance of understanding the data sources and the implications of using old policy data in the context of on-policy and off-policy learning [10][11].
GRPO Framework
- GRPO is typically considered on-policy, as it estimates group-relative advantage using data generated by the current behavior policy [5][6].
- Recent works have explored off-policy adaptations of GRPO, using data from older policies to improve sample efficiency and stability [4][5][7].
- The original implementation of GRPO relies on current policy data to estimate gradients and advantages, in line with traditional on-policy definitions [6][10].
Importance Sampling
- Importance Sampling (IS) is a key method in off-policy evaluation, allowing data from a behavior policy to be used to assess the value of a target policy [8][9].
- The article outlines the mathematical formulation of IS, highlighting its role in correcting biases that arise from differences in sampling distributions [12][14].
- Weighted Importance Sampling is introduced as a solution to the high variance of basic IS [15][16][17].
GSPO and DAPO
- GSPO (Group Sequence Policy Optimization) addresses high variance and instability in GRPO/PPO by shifting the focus to sequence-level importance ratios [18][21].
- DAPO (Decoupled Clip & Dynamic Sampling Policy Optimization) improves training stability and sample efficiency in long chain-of-thought tasks through various engineering techniques [20][24].
- Both GSPO and DAPO aim to make training of large-scale language models more robust, particularly in handling long sequences and mitigating entropy collapse [20][24][27].
Entropy Collapse
- Entropy collapse refers to the rapid decrease in policy randomness during training, leading to reduced exploration and potentially suboptimal convergence [28][30].
- The article discusses various strategies to mitigate entropy collapse, including entropy regularization, KL penalties, and dynamic sampling [32][33][34].
- It emphasizes the need for a balance between exploration and exploitation to maintain effective training dynamics [37][41].
Relationship Between Reward Hacking and Entropy Collapse
- Reward hacking occurs when an agent finds shortcuts to maximize rewards, often leading to entropy collapse as the policy becomes overly deterministic [41][42].
- The article outlines the cyclical relationship between reward hacking and entropy collapse, suggesting that addressing one can help mitigate the other [41][42].
- Strategies for managing both issues include refining reward functions, improving training stability, and ensuring diverse sampling [47][48].
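The Importance Sampling items above can be made concrete with a small numeric sketch. It compares the ordinary and the weighted (self-normalized) estimators on a hand-picked sample; the two-action setup, the probabilities, and the function names are illustrative assumptions, not taken from the article.

```python
from typing import List


def ordinary_is(returns: List[float], target_p: List[float], behavior_p: List[float]) -> float:
    """Ordinary importance sampling: average of ratio-weighted returns."""
    weights = [t / b for t, b in zip(target_p, behavior_p)]
    return sum(w * g for w, g in zip(weights, returns)) / len(returns)


def weighted_is(returns: List[float], target_p: List[float], behavior_p: List[float]) -> float:
    """Weighted importance sampling: normalize by the sum of the ratios."""
    weights = [t / b for t, b in zip(target_p, behavior_p)]
    return sum(w * g for w, g in zip(weights, returns)) / sum(weights)


# Hypothetical data: action "a" pays 1, action "b" pays 0.
# The behavior policy picks each with probability 0.5; the target policy
# picks "a" with probability 0.8, so the true target value is 0.8.
# The sample happens to contain "a" three times and "b" once.
returns = [1.0, 1.0, 1.0, 0.0]
target_p = [0.8, 0.8, 0.8, 0.2]
behavior_p = [0.5, 0.5, 0.5, 0.5]

ois = ordinary_is(returns, target_p, behavior_p)
wis = weighted_is(returns, target_p, behavior_p)
print(f"ordinary IS: {ois:.3f}, weighted IS: {wis:.3f}")
```

On this sample the ordinary estimate lands outside the range of possible rewards, while the weighted estimate stays inside it. That is the usual trade-off: the weighted form accepts some bias in exchange for lower variance.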
Why has GPT-5 stopped "talking nonsense"? OpenAI's new paper explains it
Tencent Research Institute (腾讯研究院) · 2025-09-12 08:58
Core Viewpoint
- The article discusses the advancements and challenges of OpenAI's GPT-5, focusing on the significant reduction in hallucination rates compared to previous models, while also highlighting the underlying mechanisms and implications of these changes [5][6][25].
Group 1: Hallucination Rates and Mechanisms
- GPT-5 has a hallucination rate that is approximately 45% lower than GPT-4 and about 80% lower than OpenAI's earlier models [6].
- The reduction in hallucination rates is attributed to enhanced reinforcement learning techniques that allow models to refine their reasoning processes and recognize their errors [8][9].
- The paper published by OpenAI indicates that hallucinations are an inevitable byproduct of the statistical learning nature of language models, since generating reliable information is harder than assessing its reliability [12][16].
Group 2: Theoretical Framework
- OpenAI introduces a theoretical "Is-It-Valid" (IIV) judgment mechanism that determines the validity of generated sentences based on their internal probabilities [13].
- The model's tendency to generate plausible-sounding but incorrect information is exacerbated by data sparsity, complexity, and noise in training data [14][16].
- The mathematical conclusion presented in the paper is that the error rate of generative models is at least double the IIV judgment error rate, indicating a compounding effect of judgment mistakes on hallucinations [15][16].
Group 3: Post-Training Challenges
- Post-training processes have not effectively mitigated hallucinations, as current evaluation metrics tend to reward models for providing confident but potentially incorrect answers [18][24].
- The article critiques the binary scoring systems used in mainstream AI evaluations, which penalize uncertainty and discourage models from expressing "I don't know" [21][24].
- Reinforcement learning processes that use binary reward paths may inadvertently promote overconfidence in models, leading to increased hallucination rates [27][29].
Group 4: Future Directions and Solutions
- The article suggests that introducing a penalty-based scoring mechanism during post-training could help models better calibrate their confidence levels and reduce hallucinations [33].
- A shift from a score-optimization focus to a truth-oriented approach is proposed as a potential solution to the hallucination problem [34].
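The contrast between binary grading and penalty-based scoring comes down to simple expected-value arithmetic, sketched below. The scoring rule here (+1 for a correct answer, minus `penalty` for a wrong one, 0 for abstaining) and the example numbers are assumptions chosen for illustration; the summary does not specify the exact scheme.

```python
def expected_score(p_correct: float, penalty: float) -> float:
    """Expected score of answering: +1 if right, -penalty if wrong."""
    return p_correct * 1.0 - (1.0 - p_correct) * penalty


def should_answer(p_correct: float, penalty: float) -> bool:
    """Answer only if it beats abstaining, which is assumed to score 0."""
    return expected_score(p_correct, penalty) > 0.0


def break_even_confidence(penalty: float) -> float:
    """Confidence at which answering and abstaining score the same."""
    return penalty / (1.0 + penalty)


# Binary grading (penalty = 0): guessing pays off even at 30% confidence.
guess_under_binary = should_answer(0.3, penalty=0.0)

# Penalty of 3 points per wrong answer: answering needs more than 75% confidence.
guess_under_penalty = should_answer(0.3, penalty=3.0)
threshold = break_even_confidence(3.0)

print(guess_under_binary, guess_under_penalty, threshold)
```

Under the binary rule a model never loses by guessing, which matches the overconfidence incentive described above. With a penalty, "I don't know" becomes the better choice below the break-even confidence.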
Viral overnight: 27-year-old Shunyu Yao leaves OpenAI. Is the Tsinghua Yao Class prodigy turning product manager?
36Kr · 2025-09-12 04:04
Core Insights
- The news highlights the significant attention surrounding Shunyu Yao, a prominent AI talent, and the implications of his rumored recruitment by Tencent, a rumor that has been officially denied [1][6].
- Yao's expertise and contributions to OpenAI's Deep Research make him a highly sought-after figure in the AI industry; circulating rumors of a salary of 100 million RMB reflect the competitive landscape for top AI talent [3][4].
Group 1: Shunyu Yao's Background and Achievements
- Shunyu Yao, aged 27, is a graduate of Tsinghua University and Princeton University, recognized for his exceptional academic performance and contributions to AI research [7][11].
- He has been a core contributor to OpenAI's projects, including the development of intelligent agents and digital automation tools, which are pivotal for advancing AI capabilities [5][11].
- His research has drawn over 15,000 citations, indicating his influence in the field of AI [11][12].
Group 2: Industry Implications
- The recruitment of top AI talent like Yao signifies a deeper shift in the global AI talent ecosystem, as companies vie for expertise to drive innovation [6][19].
- Yao's view that evaluation matters more than training in AI development suggests a potential paradigm shift in how AI models are assessed and improved, emphasizing the need for practical applications [18][20].
- Competitive offers from companies like Meta, which reportedly reached 100 million USD for core researchers, highlight the escalating financial stakes in attracting leading AI professionals [3][4].
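Bund Conference Dispatch (1): Sutton proposes a new paradigm for AI development, with reinforcement learning and multi-agent collaboration as the keys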
Haitong Securities International· 2025-09-12 02:47
Investment Rating
- The report does not explicitly provide an investment rating for the industry or specific companies within it.
Core Insights
- Richard Sutton proposes that we are entering an "Era of Experience" characterized by autonomous interaction and environmental feedback, emphasizing the need for systems that can create new knowledge through direct interaction with their environments [1][8].
- Sutton argues that public fears regarding AI, such as bias and unemployment, are overstated, and that multi-agent cooperation can lead to win-win outcomes [9].
- The report highlights continual learning and meta-learning as key areas for unlocking the potential of reinforcement learning [3][13].
Summary by Sections
Event
- Sutton's presentation at the 2025 INCLUSION Conference outlines a shift from static knowledge transfer to dynamic agent-environment interactions, marking a transition to an "Era of Experience" [1][8].
- He identifies reinforcement learning as crucial for this transition, but notes that its full potential is contingent on advancements in continual and meta-learning [1][8].
Commentary
- The report discusses the shift from "data as experience" to "capability as interaction," suggesting that firms need to develop systems that can actively engage with their environments to generate new knowledge [2][11].
- It emphasizes that the real bottleneck in reinforcement learning is not model parameters but the ability to handle time and task sequences, highlighting the need for continual and meta-learning capabilities [3][13].
Technical Bottlenecks
- The report identifies two main constraints in reinforcement learning: the need for continual learning to avoid catastrophic forgetting and the need for meta-learning to enable rapid adaptation across tasks [3][13].
- It suggests that R&D should focus on long-horizon evaluation and the integration of memory mechanisms and planning architectures [3][13].
Decentralized Collaboration
- The report posits that decentralized collaboration is not only a technical choice but also a governance issue, requiring clear incentives and transparent protocols to function effectively [4][12].
- It outlines three foundational institutional requirements for effective decentralized collaboration: open interfaces, cooperation-competition testbeds, and auditability [4][12].
Replacement Dynamics
- Sutton's view is that "replacement" will occur at the level of tasks rather than entire job roles, urging organizations to proactively deconstruct tasks and redesign processes for human-AI collaboration [5][15].
- The report recommends establishing a human-AI division of labor and reforming performance metrics to focus on collaborative efficiency [5][15].
Bund Conference once again shows Ant's true colors: a fintech company
Mei Ri Shang Bao· 2025-09-11 23:04
Group 1: Conference Overview
- The 2025 Inclusion·Bund Conference opened in Shanghai with the theme "Reshaping Innovative Growth," featuring 550 guests from 16 countries and regions, including notable figures like Richard Sutton and Yuval Noah Harari [1].
- The conference focused on five main topics: "Financial Technology," "Artificial Intelligence and Industry," "Innovation and Investment Ecology," "Global Dialogue and Cooperation," and "Responsible Innovation and Inclusive Future," comprising one main forum and 44 insight forums [1].
- The event is recognized as one of Asia's three major financial technology conferences, attracting global attention for its openness, diversity, and forward-looking nature [1].
Group 2: Insights from Richard Sutton
- Richard Sutton, the 2024 Turing Award winner, emphasized that artificial intelligence is entering an "experience era," in which AI's potential exceeds previous capabilities [2].
- He noted that current machine learning methods are reaching the limits of human data, and that new data sources generated through direct interaction between intelligent agents and the world are needed [2].
- Sutton defined "experience" as the interaction of observation, action, and reward, which is essential for learning and intelligence [2][3].
Group 3: Insights from Wang Xingxing
- Wang Xingxing, CEO of Yushu Technology (Unitree Robotics), expressed regret for not pursuing AI earlier, highlighting the rapid development of large models that now allows AI to be integrated with robotics [4].
- He discussed the emergence of a new embodied intelligence industry, in which robots can possess AGI capabilities, enabling them to perceive, plan, and act autonomously [4].
- Wang is optimistic about the future of innovation and entrepreneurship, stating that the barriers to entry have dropped significantly, creating a favorable environment for young innovators [4].
Group 4: Ant Group's Technological Advancements
- Ant Group is recognized as a leading fintech company, with significant investments in AI and various sectors [5][6].
- The conference showcased Ant Group's new AI assistant "Xiao Zheng," which integrates multiple large models to streamline government services [6].
- Ant Group's CTO announced the launch of the "Agentic Contract," which will be natively deployed on the company's new Layer2 blockchain, Jovay [6].
Tencent Research Institute AI Express 20250912
Tencent Research Institute (腾讯研究院) · 2025-09-11 16:01
Group 1
- Thinking Machines has released its first research blog addressing non-determinism in LLM inference, focusing on batch invariance [1].
- The research team improved RMSNorm, matrix multiplication, and attention mechanisms to achieve fully reproducible inference results with acceptable performance loss [1].
- The company's valuation has reached $12 billion, with a founding team primarily from OpenAI, and its first product is named Connection Machine [1].
Group 2
- OpenAI announced that ChatGPT now officially supports MCP (Model Context Protocol), allowing Plus and Pro users to automate operations with a single prompt [2].
- MCP standardizes interactions between AI models, tools, and data sources, enabling different models to share context and support plug-and-play functionality [2].
- Users can connect third-party services (like Stripe) in developer mode to complete complex tasks, although this cannot be used simultaneously with other ChatGPT features [2].
Group 3
- WeChat Official Accounts has launched an "Intelligent Reply" feature supported by Tencent's Hunyuan large model, addressing the problem of operators being unable to respond to reader inquiries in a timely manner [3].
- This feature automatically learns from the account's historical articles and reply styles, marking replies as "intelligent replies" and referencing relevant historical articles [3].
- Tencent Hunyuan will also introduce Roleplay models and AI avatar applications to provide immersive dialogue experiences, which individual creators can enable in the PC backend of the official account [3].
Group 4
- Kimi has open-sourced a new middleware called checkpoint-engine, capable of updating trillion-parameter models across thousands of GPUs in 20 seconds, significantly improving reinforcement learning efficiency [4].
- This technology employs a hybrid co-location architecture to manage parameter states through a distributed checkpoint engine, enabling parallel processing of parameter broadcasting and reloading [4].
- The system design supports complete decoupling of training and inference engines, using a pipeline approach for parameter updates to improve stability against single-point failures [4].
Group 5
- NVIDIA has released a new AI Blueprint that allows 3D artists to quickly create scene prototypes using generative AI technology, generating up to 20 3D models from text prompts [5].
- It integrates Microsoft TRELLIS and NVIDIA NIM microservices, achieving speeds 20% faster than native applications, and supports RTX 50 and 40 series GPUs with over 16GB of memory [5].
- The workflow automates the conversion from concept to 3D model, with generated models exportable to platforms like Blender for further optimization, significantly reducing prototype design time for artists [5].
Group 6
- Baidu Academic has completed an AI reconstruction, launching features like AI academic search, AI literature summarization, AI reading, and paper mapping, creating the first one-stop AI academic platform in the industry [7].
- The platform covers the entire academic chain of "search, read, create, and edit," providing literature summarization, full-text translation, topic recommendations, and professional formatting, greatly improving research efficiency [7].
- It has indexed 690 million literature resources, covering 1.04 million academic sites, and established 4.2 million scholar profiles, with plans to build an academic identity system supported by Baidu's full traffic [7].
Group 7
- Tencent Meeting has launched an AI hosting feature in collaboration with Yuanbao, allowing users to have the AI join meetings ahead of them and record in real time, addressing issues like tardiness and overlapping meetings [8].
- Users can activate "AI hosting" on the meeting page or list, enabling Yuanbao to automatically join the meeting and generate AI minutes, so that no content is missed [8].
- After the meeting, users can directly ask Yuanbao about the meeting content to assist in decision-making, ensuring that they are always "present" at key meetings [8].
Group 8
- Wang Xingxing, founder of Yushu Technology, expressed regret for not focusing on AI since 2011, believing that the current fields for AI application remain "desolate" [9].
- Yushu Technology has announced its IPO plan, expecting to submit an application by the end of 2025, with projected revenue exceeding 1 billion yuan in 2024 and four consecutive years of profitability, aiming to become the largest "quadruped and humanoid robot" stock globally [9].
- Wang revised his previous views on data, acknowledging that both robot data and models are core issues, and advised young entrepreneurs to embrace current AI technological innovations [9].
Group 9
- Sutton, known as the "father of reinforcement learning," stated in a speech that AI is entering an "experience era," in which intelligence will be gained from continuous learning rather than static knowledge accumulation [10].
- He emphasized that fears surrounding AI are exaggerated, suggesting that AI and human prosperity stem from decentralized collaboration, allowing intelligent agents to coexist peacefully under different objectives [10].
- Sutton proposed four predictive principles, asserting that human intelligence will be surpassed, power will shift to the smartest agents, and AI is an inevitable next step in the evolution of the universe [10].
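One ingredient commonly cited behind the inference non-determinism mentioned in Group 1 is that floating-point addition is not associative, so the order in which partial results are combined can change the final bits. The snippet below only demonstrates that numeric effect in plain Python. It is an illustrative assumption about the mechanism, not a reproduction of the Thinking Machines kernels, and `ordered_sum` is a hypothetical helper.

```python
from typing import Iterable


def ordered_sum(values: Iterable[float]) -> float:
    """Sum strictly left to right with plain float addition.

    An explicit loop is used instead of the built-in sum(), because newer
    Python versions apply compensated summation to floats, which would
    hide the effect being shown.
    """
    total = 0.0
    for v in values:
        total += v
    return total


values = [0.1, 0.2, 0.3]
forward = ordered_sum(values)             # (0.1 + 0.2) + 0.3
backward = ordered_sum(reversed(values))  # (0.3 + 0.2) + 0.1

print(forward, backward, forward == backward)
```

If a kernel picks its reduction order based on batch size, the same request can therefore yield slightly different numbers under different server load. Fixing the reduction order regardless of batch size is the idea behind batch invariance.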
Foreseeing AI: humanity enters a new "Era of Experience," and only an artificial sun can feed AI
Nan Fang Du Shi Bao· 2025-09-11 15:58
Group 1: AI and Innovation
- The 2025 Inclusion·Bund Conference in Shanghai, themed "Reshaping Innovation Growth," made AI a key topic, with over 40 forums and a significant technology exhibition [1].
- Richard Sutton, the 2024 Turing Award winner, emphasized that humanity is entering a new "Era of Experience," in which replacement by AI is inevitable and the data era is nearing its end [3][4].
- Sutton highlighted that the core of intelligence lies in experience, which involves observation, action, and reward, and pointed out the need for continual learning and meta-learning technologies to unlock AI's full potential [3].
Group 2: Industry Perspectives
- Wang Jian, founder of Alibaba Cloud, stated that open data and computing resources are essential for advancing AI, marking a shift from open-sourcing code to sharing resources [5][6].
- Wang also introduced the concept of "computing satellites," which will bring AI to space exploration, indicating a new frontier for AI applications beyond traditional devices [6].
- Wang Xingxing, CEO of Yushu Technology, expressed optimism about the AI era, noting that small organizations will increasingly have explosive growth potential, despite existing challenges in data quality and model algorithms [7][8].
Group 3: Organizational Challenges
- McKinsey's China Chairman, Li Yili, identified organizational culture as the biggest bottleneck in AI development, advocating for CEO-led transformations focused on profitability rather than just application scenarios [8][9].
- Li outlined three stages of globalization for Chinese enterprises, emphasizing the need for a global perspective and diverse collaboration models to expand growth opportunities [10].
Group 4: Energy and AI
- Professor Sun Xuan from the University of Science and Technology of China proposed that nuclear fusion is the key to meeting the energy demands of AI, with 1 gram of fusion fuel equivalent to the energy of 8 tons of oil [11][12].
- Sun highlighted the significant energy gap that AI could create, predicting that AI's energy consumption could exceed 20% of the Earth's total energy supply in the future [11].
- The fusion industry is seeing increased investment, with a total of $7.1 billion raised globally, indicating growing interest in commercializing fusion technology [12].
Financial large models enter the battle for "value": how to clear three hurdles?
Di Yi Cai Jing· 2025-09-11 10:11
Core Insights
- The year 2025 is identified as a pivotal year for the large-scale implementation of AI in China's financial industry, transitioning from mere usage to creating real value [1][2].
- Financial institutions are increasingly focusing on collaboration between technology and business departments to achieve actual benefits and cost control, with "value" becoming a consensus in the industry [2][3].
AI Application in Finance
- AI applications in finance have evolved from simple human assistance to intelligent agents capable of perception, learning, action, and decision-making, applicable in areas like market analysis, risk assessment, and wealth management [2][3].
- The participation of business departments in AI development has increased from 18% to 74%, indicating a shift towards practical applications of AI [3].
Accelerated Implementation
- Major banks are rapidly expanding AI applications, with examples such as ICBC's "Navi AI+" initiative introducing over 100 new AI application scenarios in key business areas [3].
- Postal Savings Bank has developed over 230 AI model scenarios, showing the industry's commitment to integrating AI into operations [3].
Strategic Considerations
- Financial institutions are beginning to systematically consider their AI strategies, aiming to become more agile and better manage light capital businesses [3].
- There is a consensus that while AI can reshape business processes, it will take time to fully realize its potential, which makes building a robust AI framework in the next 1-2 years important [3].
Data Utilization Challenges
- Companies face challenges in converting data resources into assets, and need to bridge the gap between data, technology, and algorithms to support decision-making [4][5].
- Insight platforms are proposed as a way to activate approximately 70% of "sleeping" data, transforming it into valuable resources for AI model training [4].
Security and Trust Issues
- The application of domestic AI models in finance is transitioning from isolated breakthroughs to ecosystem reconstruction, but issues like algorithm bias and privacy breaches remain unresolved [6].
- The financial sector requires high precision in decision-making, making the introduction of reinforcement learning technology crucial for improving decision accuracy [6][7].
Uncertainty in AI Deployment
- The introduction of AI brings new challenges, particularly uncertainty in investment returns and business outcomes, necessitating innovation in strategic planning and organizational design [7].
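Fears of AI are exaggerated: Bund speech by Sutton, the "father of reinforcement learning," offers four principles predicting AI's future
36Kr · 2025-09-11 08:34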
Group 1
- The core idea presented is that the human data dividend is nearing its limit, and artificial intelligence (AI) is entering an "experience era" centered on continuous learning, which has the potential to exceed previous capabilities [1][9][44].
- AI's current training methods focus primarily on transferring existing human knowledge to static models without autonomous learning capabilities, and the limitations of this approach are increasingly recognized [10][14].
- The future of AI relies on the development of two currently immature technologies, continual learning and meta-learning, which are essential for unlocking the full potential of experience-based learning [16][14].
Group 2
- AI has become a highly politicized issue, with public fears about bias, unemployment, and even human extinction being exaggerated and fueled by certain organizations and individuals [16][18][25].
- The call for regulation and control of AI reflects a broader societal tendency to fear the unknown, which can hinder the collaborative efforts necessary for progress [24][28].
- Decentralized collaboration is emphasized as a superior alternative to centralized control, allowing for coexistence among diverse intelligent agents with different goals [20][26][21].
Group 3
- Four principles are proposed to predict the future of AI: the absence of a unified global opinion on how the world should operate, the eventual understanding and creation of intelligence by humans, the inevitable surpassing of current human intelligence by superintelligent entities, and the flow of power and resources towards the most intelligent agents [35][36][37].
- The inevitability of AI's replacement of human roles is acknowledged and framed as a natural progression in the evolution of intelligence [38][44].
- The role of humans as catalysts and pioneers in the "design era" is highlighted, emphasizing the unique ability to push design to its limits through AI [42][43].