Reinforcement Learning
Understanding GPT-5's Killer Move in One Read: The Invisible Weapon That Will Decide AI's Future
36Kr · 2025-09-16 10:43
Core Insights
- The article discusses the significance of the "Universal Verifier" in the evolution of AI models, particularly in the context of GPT-5 and its performance enhancements [2][3]
- It highlights the limitations of previous reinforcement learning methods, particularly "Reinforcement Learning with Verifiable Rewards" (RLVR), in complex real-world scenarios where answers are not binary [2][4]
- The article outlines two main approaches to developing the Universal Verifier: enhancing the evaluation criteria and allowing models to self-assess their outputs [36][44]

Group 1: Universal Verifier and Its Importance
- The Universal Verifier is seen as a potential breakthrough in AI, addressing the shortcomings of RLVR by enabling models to evaluate answers in a more nuanced manner [2][10]
- The need for a more sophisticated evaluation system arises from the complexity of real-world problems, especially in fields like healthcare and education, where answers are not simply right or wrong [2][11]
- The article emphasizes that understanding the Universal Verifier is crucial for grasping the future of AI technology and competition [3]

Group 2: Approaches to Developing the Universal Verifier
- The first approach involves using large language models (LLMs) as judges to create a more complex evaluation standard, which has been explored in various research papers [4][5][6]
- The second approach focuses on self-assessment, where models evaluate their own outputs based on internal confidence levels, reducing reliance on external validation [44][45]
- The RaR (Rubrics as Rewards) framework is introduced as a method to create detailed scoring criteria for evaluating model outputs, leading to significant performance improvements in specific domains [19][21][22] (see the sketch following this summary)

Group 3: Performance Improvements and Results
- The article presents data showing that models trained using the RaR framework achieved substantial performance gains, with scores in medical evaluations increasing nearly fourfold [21][22]
- Comparisons with other evaluation methods indicate that RaR outperformed traditional approaches, demonstrating its effectiveness in complex reasoning tasks [22][24]
- The Rubicon framework further enhances the scoring system by incorporating over 10,000 evaluation criteria, leading to improved performance in subjective areas like creative writing [27][28]

Group 4: Future Directions and Challenges
- The article discusses the limitations of current approaches, noting that while RaR and Rubicon show promise, they still rely on expert-defined criteria, which may hinder scalability [69][70]
- The INTUITOR method represents a shift towards internal feedback mechanisms, allowing models to learn without predefined answers, but it also faces challenges in generalizability [59][60]
- The OaK architecture is proposed as a long-term vision for AI, aiming for a system that learns and evolves through interaction with the environment, though it remains a distant goal [70][77]
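The summary does not include any implementation details for RaR, so the snippet below is only a minimal sketch of what rubric-based reward scoring with an LLM judge generally looks like; the `RubricItem` structure, the `call_judge` placeholder, and the example criteria and weights are illustrative assumptions, not the RaR authors' code.

```python
# Minimal sketch of rubric-based reward scoring (RaR-style), not the authors' implementation.
from dataclasses import dataclass

@dataclass
class RubricItem:
    criterion: str   # what the judge should check
    weight: float    # relative importance of this criterion

def call_judge(prompt: str) -> float:
    """Placeholder: ask a judge LLM to score one criterion in [0, 1]."""
    raise NotImplementedError("wire this to an LLM-judge endpoint")

def rubric_reward(question: str, answer: str, rubric: list[RubricItem]) -> float:
    """Weighted average of per-criterion judge scores, used as the RL reward."""
    total_weight = sum(item.weight for item in rubric)
    score = 0.0
    for item in rubric:
        prompt = (
            f"Question: {question}\nAnswer: {answer}\n"
            f"Criterion: {item.criterion}\n"
            "Return a score between 0 and 1 for how well the answer meets this criterion."
        )
        score += item.weight * call_judge(prompt)
    return score / total_weight

# Illustrative rubric for a medical QA task (hypothetical criteria).
rubric = [
    RubricItem("States the most likely diagnosis", 2.0),
    RubricItem("Mentions key differential diagnoses", 1.0),
    RubricItem("Avoids unsafe or unsupported recommendations", 2.0),
]
```

The reward is a weighted average rather than a binary check, which is exactly what distinguishes this style of verifier from RLVR's right-or-wrong signal.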
SAIC-GM's "Zhijing L7" Makes Its Public Debut
Zhong Zheng Wang · 2025-09-16 06:13
Core Viewpoint
- SAIC-GM's Buick brand has launched its flagship electric sedan, the Buick Zhijing L7, which aims to compete in the high-end electric vehicle market with advanced technology and features [1]

Group 1: Product Launch
- The Buick Zhijing L7 made its national debut on September 15 in Shanghai [1]
- The vehicle is now available in Buick dealerships across the country, and an early bird program offers lifetime free maintenance for orders placed before September 28 [1]

Group 2: Technology and Features
- The Zhijing L7 utilizes "True Dragon" range-extension technology and is equipped with the "Xiaoyao Zhixing" driver assistance system [1]
- It features the Momenta R6 flywheel model based on end-to-end reinforcement learning and Qualcomm's latest SA8775P chip, providing a top-tier intelligent electric experience [1]
- The vehicle offers a pure electric range of 302 km and a combined range of 1,420 km [1]

Group 3: Market Positioning
- The Zhijing L7 combines global automotive expertise with local innovation, aiming to enter the first tier of the electric vehicle market [1]
- The vehicle is expected to create new opportunities for the Buick brand in the new era, leveraging industry-leading range-extension technology and a luxury experience [1]
Ant Group Is Hiring Large-Model Data Intelligence Algorithm Engineers (Internal Referrals Available)
自动驾驶之心 · 2025-09-15 23:33
Core Viewpoint
- The article discusses the responsibilities and requirements for a position focused on developing advanced algorithms for large model data production, emphasizing data knowledge systems, automatic classification, authoritative evaluation sets, quality assessment, and innovative solutions in artificial intelligence and deep learning [1][2][3]

Group 1: Responsibilities
- The role involves designing and developing algorithms to address key issues in large model data production, including data knowledge system generation, automatic corpus classification, authoritative evaluation set construction, and quality assessment of training data [1][5]
- Specific tasks include researching automatic knowledge graph generation based on LLMs, developing classification algorithms, and creating standardized evaluation sets to assess model performance [1][5]
- The position also requires establishing a data-driven system for quality assessment, identifying low-quality data, and synthesizing training data to improve model performance [1][5]

Group 2: Requirements
- Candidates should hold a master's degree or higher in computer science, artificial intelligence, deep learning, or a related field, and be proficient in deep learning frameworks such as PyTorch and TensorFlow [2][6]
- Strong problem-solving skills, self-motivation, and the ability to analyze and address issues are essential, along with effective communication and coordination abilities [2][6]
- Preference is given to candidates with practical experience in large model data system design, corpus classification, evaluation set construction, and data annotation algorithms [3][4][6]
Paper Walkthrough: HKUST's PLUTO, the First Planner to Surpass Rule-Based Methods!
自动驾驶之心 · 2025-09-15 23:33
Core Viewpoint
- The article discusses the development and features of the PLUTO model within the end-to-end autonomous driving domain, emphasizing its unique two-stage architecture and its direct encoding of structured perception outputs for downstream control tasks [1][2]

Summary by Sections

Overview of PLUTO
- PLUTO is trained with three main losses: a regression loss, a classification loss, and an imitation learning loss, which collectively determine the model's performance [7] (a minimal sketch of such a combined objective follows this summary)
- Additional auxiliary losses are incorporated to aid model convergence [9]

Course Introduction
- The article introduces a new course, "End-to-End and VLA Autonomous Driving," developed in collaboration with top algorithm experts from leading domestic manufacturers, aimed at addressing the challenges faced by learners in this rapidly evolving field [12][15]

Learning Challenges
- The course addresses the difficulties learners face due to the fast pace of technical development and the fragmented nature of knowledge across domains, which makes it hard for beginners to grasp the necessary concepts [13]

Course Features
- The course is designed to provide quick entry into the field, build a framework for research capability, and combine theory with practical applications [15][16][17]

Course Outline
- The course consists of several chapters covering the history and evolution of end-to-end algorithms, background knowledge on related technologies, and detailed discussions of both one-stage and two-stage end-to-end methods [20][21][22][29]

Practical Application
- The course includes practical assignments, such as RLHF fine-tuning, allowing students to apply theoretical knowledge in real-world scenarios [31]

Instructor Background
- The instructor, Jason, has a strong academic and practical background in cutting-edge end-to-end and large-model algorithms, contributing to the course's credibility [32]

Target Audience and Expected Outcomes
- The course targets individuals with a foundational understanding of autonomous driving and related technologies, with the goal of bringing them to the level of an end-to-end autonomous driving algorithm engineer within a year [36]
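The summary names PLUTO's three main losses but not how they are combined, so here is a minimal PyTorch-style sketch of a weighted multi-task objective of that shape; the weights, tensor arguments, and the auxiliary-loss term are illustrative assumptions, not PLUTO's actual implementation.

```python
import torch
import torch.nn.functional as F

def pluto_style_loss(pred_traj, target_traj, mode_logits, target_mode,
                     student_traj, expert_traj, aux_losses=(),
                     w_reg=1.0, w_cls=0.5, w_imit=1.0, w_aux=0.1):
    """Weighted sum of regression, classification, and imitation terms,
    plus optional auxiliary losses (weights here are illustrative)."""
    # Regression: distance between the predicted and ground-truth trajectory.
    reg = F.smooth_l1_loss(pred_traj, target_traj)
    # Classification: pick the correct trajectory mode / anchor.
    cls = F.cross_entropy(mode_logits, target_mode)
    # Imitation: match the expert (e.g., logged human) trajectory.
    imit = F.mse_loss(student_traj, expert_traj)
    # Auxiliary losses only help convergence, so they share a small weight.
    aux = sum(aux_losses) if aux_losses else torch.tensor(0.0)
    return w_reg * reg + w_cls * cls + w_imit * imit + w_aux * aux
```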
This ByteDance Paper Could Help Li Auto
理想TOP2 · 2025-09-15 15:32
On September 11, 2025, ByteDance released "Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents."

The relevance to Li Auto is that Li Auto intends to build agents and will very likely reference this work; it will run into the same problem: the strength of the learning signal (the gradient magnitude) is naturally, and harmfully, coupled to the model's uncertainty (entropy) at each decision.

This is actually quite similar to human learning. As long as the outcome is correct, the correctness of every step along the way tends to be over-reinforced (by analogy: once sales are high, whatever you did looks right). When a wrong path is taken with high confidence, there is little reflection and the mistake never gets corrected. And when errors occur during uncertain exploration, the learner becomes timid and stops exploring.

Confident, correct steps that should be strongly reinforced receive only a tiny update. Confident, wrong steps that should be heavily penalized likewise receive only a tiny update. Meanwhile, uncertain exploratory steps, which should be treated cautiously, absorb the largest rewards and penalties, making training very unstable.

ByteDance's paper offers an approach to this class of problem.

A more detailed discussion follows:

At its core, the paper addresses a central dilemma in current LLM agent training: in long tasks where the final outcome is simply success or failure (i.e., sparse reward), how do you know which individual decision to reward or punish?

In traditional reinforcement learning, the agent ...
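The excerpt cuts off before describing ByteDance's actual solution, so the snippet below is only a minimal sketch of the general idea the title points to: modulating each step's policy-gradient weight by the policy's entropy, so that confident steps receive a strong update and uncertain exploratory steps a damped one. The specific modulation function, normalization, and the `alpha` hyperparameter are assumptions for illustration, not the paper's method.

```python
import torch

def entropy_modulated_pg_loss(logits, chosen_ids, advantages, alpha=1.0):
    """Sketch: re-weight per-token advantages by the policy's entropy at each step.

    Low-entropy (confident) steps get their learning signal boosted, so that
    confident-and-correct or confident-and-wrong steps are updated strongly,
    while high-entropy (exploratory) steps are damped to stabilize training.
    The modulation below is an illustrative assumption.
    """
    log_probs = torch.log_softmax(logits, dim=-1)               # [T, V]
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)                  # [T]
    # Normalize entropy to [0, 1] by the maximum possible entropy log|V|.
    max_entropy = torch.log(torch.tensor(float(logits.shape[-1])))
    norm_entropy = entropy / max_entropy
    # Boost confident steps, damp uncertain ones (illustrative choice).
    modulation = (1.0 - norm_entropy) ** alpha
    token_logp = log_probs.gather(-1, chosen_ids.unsqueeze(-1)).squeeze(-1)
    # Policy-gradient surrogate loss with entropy-modulated advantages.
    return -(modulation * advantages * token_logp).mean()
```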
Charging into the First Tier of New Energy Vehicles: Buick Zhijing L7, the "New Benchmark for Range-Extended Luxury Sedans," Makes Its National Debut
Yang Zi Wan Bao Wang · 2025-09-15 13:57
Core Viewpoint
- The Buick Zhijing L7, a luxury electric vehicle, has been unveiled as the flagship model of Buick's high-end electric sub-brand, showcasing advanced technology and luxury features aimed at redefining the range-extended vehicle segment [1][3]

Group 1: Vehicle Features
- The Zhijing L7 is built on the new Buick "Xiaoyao" super fusion architecture, integrating top technologies in driving, assisted driving, and luxury comfort [3]
- It features the "Zhenlong" range-extending system, which offers a maximum power output of 252 kW, equivalent to a 3.0T V6 engine, achieving 0-100 km/h in 5.9 seconds with a combined fuel consumption of only 0.5 L per 100 km [5][7]
- The vehicle offers a pure electric range of 302 km and a total range of 1,420 km, addressing common concerns about range anxiety [5][7]

Group 2: Intelligent Driving and Experience
- The Zhijing L7 introduces the "Xiaoyao Zhixing" assisted driving system, featuring the Momenta R6 flywheel model based on end-to-end reinforcement learning, capable of handling complex driving scenarios [8]
- The vehicle has accumulated over 1 billion kilometers of safe driving with its assisted driving technology, positioning it among the top tier of intelligent driving experiences [8]

Group 3: Interior and Comfort
- The interior design of the Zhijing L7 emphasizes luxury with a spacious cabin, featuring the industry's first dual 120° zero-gravity seats for enhanced comfort [18][20]
- It is equipped with a 27-speaker Buick Sound theater-level audio system, providing an immersive sound experience akin to a top-tier concert hall [18]

Group 4: Design and Aesthetics
- The Zhijing L7 showcases a striking exterior design inspired by nature, with a luxurious silhouette and advanced features such as laser radar and high-end lighting [14][16]
- The interior uses a new pure floating-island design aesthetic, creating a sophisticated and elegant atmosphere [16]

Group 5: Market Positioning
- As a representative of Buick's redefined brand value in the new energy era, the Zhijing L7 aims to compete in the first tier of the new energy vehicle market, leveraging its advanced range-extending technology and superior luxury experience [20]
Zhang Xiaojun in Conversation with OpenAI's Yao Shunyu: A System That Generates New Worlds
Founder Park · 2025-09-15 05:59
Core Insights
- The article discusses the evolution of AI, focusing on the transition to the "second half" of AI development and emphasizing the importance of language and reasoning in creating more generalizable AI systems [4][62]

Group 1: AI Evolution and Language
- The concept of AI has evolved from rule-based systems to deep reinforcement learning, and now to language models that can reason and generalize across tasks [41][43]
- Language is highlighted as a fundamental tool for generalization, allowing AI to tackle a variety of tasks by leveraging reasoning capabilities [77][79]

Group 2: Agent Systems
- The definition of an "Agent" has expanded to include systems that can interact with their environment and make decisions based on reasoning, rather than merely following predefined rules [33][36]
- The development of language agents represents a significant shift, as they can perform tasks in more complex environments, such as coding and internet navigation, which were previously challenging for AI [43][54]

Group 3: Task Design and Reward Mechanisms
- The article emphasizes the importance of defining effective tasks and environments for AI training, suggesting that the current bottleneck lies in task design rather than model training [62][64]
- A focus on intrinsic rewards, based on outcomes rather than processes, is proposed as a key factor for successful reinforcement learning applications [88][66]

Group 4: Future Directions
- The future of AI development is seen as a combination of enhancing agent capabilities through better memory systems and intrinsic rewards, and exploring multi-agent systems [88][89]
- The potential for AI to generalize across various tasks is highlighted, with coding and mathematical tasks serving as prime examples of areas where AI can excel [80][82]
Cracking the "Slowest Link" in Reinforcement Learning! SJTU and ByteDance Team Up to Boost Large-Model RL Training Speed by 2.6x
量子位 · 2025-09-13 08:06
Core Insights
- The article discusses the inefficiencies in reinforcement learning (RL) training, particularly the rollout phase, which consumes over 80% of training time and is limited by memory bandwidth and the autoregressive nature of generation [1][2]

Group 1: RhymeRL Framework
- Shanghai Jiao Tong University and ByteDance's research team introduced RhymeRL, which raises RL training throughput by 2.6x without sacrificing accuracy by leveraging historical data [2][21]
- RhymeRL is built on two key components: HistoSpec and HistoPipe [7]

Group 2: HistoSpec
- HistoSpec applies speculative decoding, using previous historical responses as the "best script," turning the rollout process from token-by-token generation into batch verification [9][10] (a minimal sketch follows this summary)
- This significantly increases computational density and speeds up response generation, because drafts derived from historical sequences are accepted at high rates [13][14]

Group 3: HistoPipe
- HistoPipe improves GPU utilization with a scheduling strategy that minimizes idle time, allowing tasks of varying lengths to be processed efficiently [15][19]
- It employs a "cross-step complement" approach to balance workloads across GPUs, ensuring that resources are fully utilized without idle periods [17][18]

Group 4: Performance Improvement
- The combination of HistoSpec and HistoPipe yields a 2.61x increase in end-to-end training throughput on tasks such as mathematics and coding [21]
- This advance allows researchers and companies to train more powerful models with fewer resources in less time, accelerating the iteration of AI technologies [22]

Group 5: Significance of RhymeRL
- RhymeRL proposes a new paradigm in reinforcement learning by using historical information to improve training efficiency, demonstrating better resource allocation and compatibility with existing training algorithms [23]
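The summary describes HistoSpec only at a high level, so here is a minimal sketch of the general pattern it relies on: treat a previous rollout for the same prompt as a draft, verify the draft in one batched forward pass, and keep the longest accepted prefix. The greedy-acceptance rule, the `model(...).logits` interface, and the tensor shapes are illustrative assumptions rather than RhymeRL's actual algorithm.

```python
import torch

@torch.no_grad()
def verify_historical_draft(model, prompt_ids, draft_ids):
    """Sketch of draft verification in speculative decoding.

    Instead of generating draft_ids token by token, run one forward pass over
    prompt + draft and keep the longest prefix the current policy would have
    produced itself (greedy acceptance, an illustrative simplification).
    Returns the accepted tokens plus one correction token from the model.
    """
    input_ids = torch.cat([prompt_ids, draft_ids]).unsqueeze(0)   # [1, P+D]
    logits = model(input_ids).logits[0]                           # [P+D, V]
    # logits[i] predicts the token at position i+1, so these D+1 rows cover
    # every draft position plus one token beyond the draft.
    pred = logits[len(prompt_ids) - 1:].argmax(dim=-1)            # [D+1]
    matches = (pred[:-1] == draft_ids)
    n_accept = int(matches.long().cumprod(dim=0).sum())           # longest matching prefix
    accepted = draft_ids[:n_accept]
    correction = pred[n_accept:n_accept + 1]                      # model's own next token
    return torch.cat([accepted, correction])
```

Because verification is a single dense forward pass over many positions, high draft acceptance converts a memory-bound decode loop into compute-dense batch work, which is the effect the article attributes to HistoSpec.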
How Should You Prepare for RL Interview Questions?
自动驾驶之心 · 2025-09-12 16:03
Core Insights
- The article discusses the GRPO (Group Relative Policy Optimization) framework, categorizing it primarily as on-policy while acknowledging its potential off-policy adaptations [5][6][7]
- It emphasizes the importance of understanding data sources and the implications of using old-policy data when distinguishing on-policy from off-policy learning [10][11]

GRPO Framework
- GRPO is typically considered on-policy, since it estimates the group-relative advantage from data generated by the current behavior policy [5][6]
- Recent works have explored off-policy adaptations of GRPO, using data from older policies to improve sample efficiency and stability [4][5][7]
- The original implementation of GRPO relies on current-policy data to estimate gradients and advantages, matching the traditional on-policy definition [6][10] (a minimal sketch of the group-relative advantage and the importance ratio follows this summary)

Importance Sampling
- Importance sampling (IS) is a key method in off-policy evaluation, allowing data from a behavior policy to be used to assess the value of a target policy [8][9]
- The article outlines the mathematical formulation of IS, highlighting its role in correcting the bias that arises from the mismatch between sampling distributions [12][14]
- Weighted importance sampling is introduced as a remedy for the high variance of basic IS [15][16][17]

GSPO and DAPO
- GSPO (Group Sequence Policy Optimization) addresses the high variance and instability of GRPO/PPO by shifting to sequence-level importance ratios [18][21]
- DAPO (Decoupled Clip & Dynamic Sampling Policy Optimization) improves training stability and sample efficiency on long chain-of-thought tasks through a range of engineering techniques [20][24]
- Both GSPO and DAPO aim to make the training of large language models more robust, particularly in handling long sequences and mitigating entropy collapse [20][24][27]

Entropy Collapse
- Entropy collapse refers to the rapid decrease of policy randomness during training, which reduces exploration and can lead to suboptimal convergence [28][30]
- The article discusses strategies to mitigate entropy collapse, including entropy regularization, KL penalties, and dynamic sampling [32][33][34]
- It emphasizes the need to balance exploration and exploitation to maintain healthy training dynamics [37][41]

Relationship Between Reward Hacking and Entropy Collapse
- Reward hacking occurs when an agent finds shortcuts to maximize reward, often driving entropy collapse as the policy becomes overly deterministic [41][42]
- The article outlines the cyclical relationship between reward hacking and entropy collapse, suggesting that addressing one helps mitigate the other [41][42]
- Strategies for managing both issues include refining reward functions, stabilizing training, and ensuring diverse sampling [47][48]
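The summary mentions GRPO's group-relative advantage and the importance-sampling correction without showing either, so the sketch below illustrates both in their commonly described form: rewards are standardized within a group of responses to the same prompt, and the policy-gradient term is weighted by a clipped ratio between the current and the sampling policy. Function names and the clipping constant are assumptions for illustration, not taken from any specific codebase.

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize rewards within a group of responses
    sampled for the same prompt (zero mean, unit variance)."""
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO/GRPO-style clipped objective: the importance ratio pi_new / pi_old
    corrects for the fact that the data was sampled from an older policy."""
    ratio = (logp_new - logp_old).exp()
    unclipped = ratio * advantages
    clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()

# Illustrative usage: four responses to one prompt with scalar rewards.
adv = group_relative_advantages([1.0, 0.0, 0.5, 0.0])
```

GSPO's change, as the summary describes it, amounts to computing that ratio once per sequence (product of token ratios, usually in a length-normalized form) instead of per token, which reduces the variance of the correction.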
Why Has GPT-5 Stopped "Making Things Up"? OpenAI's New Paper Explains It
腾讯研究院 · 2025-09-12 08:58
Core Viewpoint
- The article discusses the advancements and challenges of OpenAI's GPT-5, focusing on the significant reduction in hallucination rates compared to previous models and highlighting the underlying mechanisms and implications of that change [5][6][25]

Group 1: Hallucination Rates and Mechanisms
- GPT-5's hallucination rate is approximately 45% lower than GPT-4's and about 80% lower than OpenAI's earlier models [6]
- The reduction in hallucination rates is attributed to enhanced reinforcement learning techniques that allow models to refine their reasoning processes and recognize their errors [8][9]
- The paper published by OpenAI indicates that hallucinations are an inevitable byproduct of the statistical learning nature of language models, since generating reliable information is harder than assessing its reliability [12][16]

Group 2: Theoretical Framework
- OpenAI introduces a theoretical "Is-It-Valid" (IIV) judgment mechanism that determines the validity of generated sentences based on their internal probabilities [13]
- The model's tendency to generate plausible-sounding but incorrect information is exacerbated by data sparsity, complexity, and noise in the training data [14][16]
- The paper's mathematical conclusion is that the error rate of generative models is at least double the IIV judgment error rate, indicating a compounding effect of judgment mistakes on hallucinations [15][16] (a rough rendering of this bound, and of the scoring fix discussed below, follows this summary)

Group 3: Post-Training Challenges
- Post-training processes have not effectively mitigated hallucinations, as current evaluation metrics reward models for providing confident but potentially incorrect answers [18][24]
- The article criticizes the binary scoring systems used in mainstream AI evaluations, which penalize uncertainty and discourage models from saying "I don't know" [21][24]
- Reinforcement learning processes that use binary reward paths may inadvertently promote overconfidence in models, leading to increased hallucination rates [27][29]

Group 4: Future Directions and Solutions
- Introducing a penalty-based scoring mechanism during post-training could help models better calibrate their confidence levels and reduce hallucinations [33]
- A shift from score optimization toward a truth-oriented objective is proposed as a potential solution to the hallucination problem [34]
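The summary states the paper's bound and the proposed scoring fix only in words; the LaTeX below restates both as simple formulas. The notation, and the expected-score threshold for abstaining, are a paraphrase of the summary rather than the paper's own statement.

```latex
% Bound as summarized: generative error is at least twice the IIV judgment error.
\[
  \mathrm{err}_{\text{generate}} \;\ge\; 2\,\mathrm{err}_{\text{IIV}}
\]

% Penalty-based scoring: an answer scores +1 if correct, -p if wrong, and 0 if
% the model abstains ("I don't know"). With confidence c that the answer is
% right, guessing has expected score c - (1-c)p, so abstaining is better whenever
\[
  c \;<\; \frac{p}{1+p}.
\]
```

Under such a rule, a well-calibrated model is no longer rewarded for confident guessing, which is the calibration incentive the article says binary scoring currently destroys.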