Paper Walkthrough: HKUST's PLUTO, the First Planner to Surpass Rule-Based Methods!
自动驾驶之心· 2025-09-15 23:33
Core Viewpoint
- The article discusses the development and features of the PLUTO model within the end-to-end autonomous driving domain, emphasizing its unique two-stage architecture and its direct encoding of structured perception outputs for downstream control tasks [1][2]

Summary by Sections

Overview of PLUTO
- PLUTO is characterized by its three main losses: regression loss, classification loss, and imitation learning loss, which collectively contribute to the model's performance [7]
- Additional auxiliary losses are incorporated to aid model convergence [9]

Course Introduction
- The article introduces a new course titled "End-to-End and VLA Autonomous Driving," developed in collaboration with top algorithm experts from domestic leading manufacturers, aimed at addressing the challenges faced by learners in this rapidly evolving field [12][15]

Learning Challenges
- The course addresses the difficulties learners face due to the fast-paced development of technology and the fragmented nature of knowledge across various domains, which make it hard for beginners to grasp the necessary concepts [13]

Course Features
- The course is designed to provide quick entry into the field, build a framework for research capabilities, and combine theory with practical applications [15][16][17]

Course Outline
- The course consists of several chapters covering topics such as the history and evolution of end-to-end algorithms, background knowledge on various technologies, and detailed discussions of both one-stage and two-stage end-to-end methods [20][21][22][29]

Practical Application
- The course includes practical assignments, such as RLHF fine-tuning, allowing students to apply their theoretical knowledge in real-world scenarios [31]

Instructor Background
- The instructor, Jason, has a strong academic and practical background in cutting-edge algorithms related to end-to-end and large models, contributing to the course's credibility [32]

Target Audience and Expected Outcomes
- The course is aimed at individuals with a foundational understanding of autonomous driving and related technologies, with the goal of elevating their skills to the level of an end-to-end autonomous driving algorithm engineer within a year [36]
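The three losses named in the PLUTO overview are typically combined as a weighted sum; in trajectory planners the imitation signal is itself a regression to the expert trajectory, so the sketch below folds those together. A minimal NumPy sketch — the function names, weights, and the auxiliary MSE term are illustrative assumptions, not PLUTO's actual implementation:

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Regression loss on predicted vs. ground-truth trajectory points."""
    d = np.abs(pred - target)
    return float(np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta).mean())

def cross_entropy(logits, label):
    """Classification loss over candidate trajectory modes."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return float(-log_probs[label])

def pluto_style_loss(pred_traj, expert_traj, mode_logits, best_mode,
                     aux_pred, aux_target, w_reg=1.0, w_cls=1.0, w_aux=0.5):
    """Hypothetical weighted sum of the three main losses plus an auxiliary
    term to aid convergence. Weights and term names are illustrative."""
    reg = smooth_l1(pred_traj, expert_traj)             # regression / imitation of the expert trajectory
    cls = cross_entropy(mode_logits, best_mode)         # which candidate mode matches the expert
    aux = float(np.mean((aux_pred - aux_target) ** 2))  # auxiliary convergence aid
    return w_reg * reg + w_cls * cls + w_aux * aux
```

A perfect trajectory with a confidently correct mode choice drives all three terms toward zero; any one of them being wrong raises the total, which is what lets the auxiliary terms shape convergence without dominating the objective.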
This ByteDance Paper Is Useful for Li Auto
理想TOP2· 2025-09-15 15:32
On September 11, 2025, ByteDance released "Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents." It matters for Li Auto because Li Auto intends to build agents, will very likely draw on this work, and will hit the same problem: the strength of the learning signal (gradient magnitude) is inherently and harmfully coupled to the model's uncertainty at decision time (entropy).

This resembles human learning. As long as the outcome is correct, it is easy to over-reinforce the correctness of every step along the way (by analogy: once sales are high, everything you did looks right). On a wrong path taken with high confidence, there is little reflection and the error goes uncorrected. And when an error is hit during hesitant exploration, it is easy to become timid and stop exploring.

Confident, correct steps that should be strongly reinforced receive only tiny updates. Confident, wrong steps that should be severely penalized also receive only tiny updates. Meanwhile, the uncertain exploratory steps that should be treated with caution absorb the most violent rewards and penalties, making training highly unstable.

This ByteDance paper offers an approach to solving this class of problem.

In more detail: the paper is essentially about resolving a core dilemma in current LLM agent training — in long tasks where the final outcome is all-or-nothing (i.e., sparse reward), how do you know which step's decision to reward or penalize? In traditional reinforcement learning, the agent ...
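The coupling described above falls straight out of the softmax policy gradient: the gradient of log π for the chosen action shrinks exactly where the policy is most confident. A small NumPy sketch — the entropy-based rescaling at the end is illustrative and may differ from the paper's exact modulation:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    return float(-(p * np.log(p + 1e-12)).sum())

def logpi_grad_norm(logits, action):
    """Norm of d/d_logits log pi(action) for a softmax policy: e_a - pi."""
    p = softmax(logits)
    g = -p
    g[action] += 1.0
    return float(np.linalg.norm(g))

confident = np.array([6.0, 0.0, 0.0, 0.0])  # low entropy: policy is nearly sure
uncertain = np.zeros(4)                     # high entropy: uniform policy

# The raw update is tiny exactly where the policy is confident, and largest
# where it is most uncertain - the harmful coupling described above.
assert logpi_grad_norm(confident, 0) < 0.01 < logpi_grad_norm(uncertain, 0)

def modulated_step(logits, action, advantage):
    """Rescale the update by a decreasing function of entropy so confident
    steps are no longer under-updated and uncertain exploratory steps no
    longer dominate. Illustrative only, not the paper's exact formula."""
    p = softmax(logits)
    g = -p
    g[action] += 1.0
    return advantage * g / (entropy(p) + 1e-3)
```

With a negative advantage, the same rescaling makes a confident wrong step pay a proportionally larger penalty, which is the other half of the imbalance the post describes.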
Charging into the First Tier of New Energy: Buick Zhijing L7, "New Benchmark for Range-Extended Luxury Sedans," Makes Its National Debut
Yang Zi Wan Bao Wang· 2025-09-15 13:57
Core Viewpoint
- The Buick Zhijing L7, a luxury electric vehicle, has been unveiled as the flagship model of Buick's high-end electric sub-brand, showcasing advanced technology and luxury features aimed at redefining the range-extended vehicle segment [1][3]

Group 1: Vehicle Features
- The Zhijing L7 is built on the new Buick "Xiaoyao" super fusion architecture, integrating top technologies in driving, assisted driving, and luxury comfort [3]
- It features the "Zhenlong" range-extending system, which offers a maximum power output of 252 kW, equivalent to a 3.0T V6 engine, achieving 0-100 km/h in just 5.9 seconds and a combined fuel consumption of only 0.5L per 100 km [5][7]
- The vehicle boasts a pure electric range of 302 km and a total range of 1420 km, addressing common concerns about range anxiety [5][7]

Group 2: Intelligent Driving and Experience
- The Zhijing L7 introduces the "Xiaoyao Zhixing" assisted driving system, featuring the Momenta R6 flywheel model based on end-to-end reinforcement learning, capable of handling complex driving scenarios [8]
- The vehicle has accumulated over 1 billion kilometers of safe driving with its assisted driving technology, positioning it among the top tier of intelligent driving experiences [8]

Group 3: Interior and Comfort
- The interior design of the Zhijing L7 emphasizes luxury with a spacious cabin, featuring the industry's first dual 120° zero-gravity seats for enhanced comfort [18][20]
- It is equipped with a 27-speaker Buick Sound theater-level audio system, providing an immersive sound experience akin to being in a top-tier concert hall [18]

Group 4: Design and Aesthetics
- The Zhijing L7 showcases a striking exterior design inspired by nature, with a luxurious silhouette and advanced features such as lidar and high-end lighting [14][16]
- The vehicle's interior utilizes a new pure floating island design aesthetic, creating a sophisticated and elegant atmosphere [16]

Group 5: Market Positioning
- As a representative of Buick's redefined brand value in the new energy era, the Zhijing L7 aims to compete in the first tier of the new energy vehicle market, leveraging its advanced range-extending technology and superior luxury experience [20]
Zhang Xiaojun in Conversation with OpenAI's Yao Shunyu: Systems That Generate New Worlds
Founder Park· 2025-09-15 05:59
Core Insights
- The article discusses the evolution of AI, particularly focusing on the transition to the "second half" of AI development, emphasizing the importance of language and reasoning in creating more generalizable AI systems [4][62]

Group 1: AI Evolution and Language
- The concept of AI has evolved from rule-based systems to deep reinforcement learning, and now to language models that can reason and generalize across tasks [41][43]
- Language is highlighted as a fundamental tool for generalization, allowing AI to tackle a variety of tasks by leveraging reasoning capabilities [77][79]

Group 2: Agent Systems
- The definition of an "Agent" has expanded to include systems that can interact with their environment and make decisions based on reasoning, rather than just following predefined rules [33][36]
- The development of language agents represents a significant shift, as they can perform tasks in more complex environments, such as coding and internet navigation, which were previously challenging for AI [43][54]

Group 3: Task Design and Reward Mechanisms
- The article emphasizes the importance of defining effective tasks and environments for AI training, suggesting that the current bottleneck lies in task design rather than model training [62][64]
- A focus on intrinsic rewards, which are based on outcomes rather than processes, is proposed as a key factor for successful reinforcement learning applications [88][66]

Group 4: Future Directions
- The future of AI development is seen as a combination of enhancing agent capabilities through better memory systems and intrinsic rewards, as well as exploring multi-agent systems [88][89]
- The potential for AI to generalize across various tasks is highlighted, with coding and mathematical tasks serving as prime examples of areas where AI can excel [80][82]
Cracking the "Slowest Link" in Reinforcement Learning: SJTU and ByteDance Team Up to Speed Up Large-Model RL Training by 2.6x
量子位· 2025-09-13 08:06
Core Insights
- The article discusses the inefficiencies in reinforcement learning (RL) training, particularly highlighting the rollout phase, which consumes over 80% of the training time and is limited by memory bandwidth and autoregressive characteristics [1][2]

Group 1: RhymeRL Framework
- Shanghai Jiao Tong University and ByteDance's research team introduced RhymeRL, which enhances RL training throughput by 2.6 times without sacrificing accuracy by leveraging historical data [2][21]
- RhymeRL is based on two key components: HistoSpec and HistoPipe [7]

Group 2: HistoSpec
- HistoSpec innovatively incorporates speculative decoding, using previous historical responses as the "best script," which transforms the rollout process from token-by-token generation into a batch verification process [9][10]
- This method significantly increases computational density and speeds up response generation by allowing high acceptance rates of drafts derived from historical sequences [13][14]

Group 3: HistoPipe
- HistoPipe optimizes GPU resource utilization by implementing a scheduling strategy that minimizes idle time, allowing for efficient processing of tasks of varying lengths [15][19]
- It employs a "cross-step complement" approach to balance workloads across GPUs, ensuring that resources are fully utilized without idle periods [17][18]

Group 4: Performance Improvement
- The combination of HistoSpec and HistoPipe results in a remarkable performance boost, achieving a 2.61 times increase in end-to-end training throughput for tasks such as mathematics and coding [21]
- This advancement allows researchers and companies to train more powerful models with fewer resources and in shorter timeframes, accelerating the iteration of AI technologies [22]

Group 5: Significance of RhymeRL
- RhymeRL proposes a new paradigm in reinforcement learning by utilizing historical information to enhance training efficiency, demonstrating the potential for better resource allocation and compatibility with existing training algorithms [23]
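The accept/reject logic behind HistoSpec's history-as-draft idea can be sketched in a few lines. This toy version checks a previous rollout token by token against the current policy's greedy choice; a real system would batch the whole verification into one forward pass (that is where the speedup comes from), and `model` here is just a stand-in callable, not any actual inference API:

```python
def greedy_next(model, prefix):
    """Stand-in for one decoding step: the model maps a prefix to its next token."""
    return model(tuple(prefix))

def rollout_with_history(model, prompt, draft, max_new=8):
    """HistoSpec-style sketch: treat a previous rollout (`draft`) as a
    speculative script, keep the longest prefix the current policy agrees
    with, then fall back to ordinary token-by-token generation."""
    out = list(prompt)
    for tok in draft:                       # verification phase
        if greedy_next(model, out) == tok:  # draft token matches current policy
            out.append(tok)                 # accepted "for free"
        else:
            break                           # first mismatch: stop trusting history
    while len(out) - len(prompt) < max_new:  # autoregressive fallback
        out.append(greedy_next(model, out))
    return out

# Toy usage: a "model" whose greedy token is len(prefix) % 3. The first three
# draft tokens agree with it, the 9 does not, so generation resumes normally.
toy_model = lambda prefix: len(prefix) % 3
out = rollout_with_history(toy_model, [0], draft=[1, 2, 0, 9], max_new=5)  # -> [0, 1, 2, 0, 1, 2]
```

Because RL rollouts from adjacent training steps are highly similar, most draft tokens are accepted, which is why the article reports high acceptance rates for history-derived drafts.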
How to Prepare for RL-Related Interview Questions?
自动驾驶之心· 2025-09-12 16:03
Core Insights
- The article discusses the GRPO (Group Relative Policy Optimization) framework, primarily categorizing it as on-policy but acknowledging its potential off-policy adaptations [5][6][7]
- It emphasizes the importance of understanding the data sources and the implications of using old policy data in the context of on-policy and off-policy learning [10][11]

GRPO Framework
- GRPO is typically considered on-policy as it estimates group-relative advantage using data generated by the current behavior policy [5][6]
- Recent works have explored off-policy adaptations of GRPO, utilizing data from older policies to enhance sample efficiency and stability [4][5][7]
- The original implementation of GRPO relies on current policy data to estimate gradients and advantages, aligning with traditional on-policy definitions [6][10]

Importance Sampling
- Importance Sampling (IS) is a key method in off-policy evaluation, allowing the use of data from a behavior policy to assess the value of a target policy [8][9]
- The article outlines the mathematical formulation of IS, highlighting its role in correcting biases arising from differences in sampling distributions [12][14]
- Weighted Importance Sampling is introduced as a solution to the high variance problem associated with basic IS [15][16][17]

GSPO and DAPO
- GSPO (Group Sequence Policy Optimization) addresses high variance and instability issues in GRPO/PPO by shifting the focus to sequence-level importance ratios [18][21]
- DAPO (Decoupled Clip & Dynamic Sampling Policy Optimization) enhances training stability and sample efficiency in long chain-of-thought tasks through various engineering techniques [20][24]
- Both GSPO and DAPO aim to improve the robustness of training processes in large-scale language models, particularly in handling long sequences and mitigating entropy collapse [20][24][27]

Entropy Collapse
- Entropy collapse refers to the rapid decrease in policy randomness during training, leading to reduced exploration and potential suboptimal convergence [28][30]
- The article discusses various strategies to mitigate entropy collapse, including entropy regularization, KL penalties, and dynamic sampling [32][33][34]
- It emphasizes the need for a balance between exploration and exploitation to maintain effective training dynamics [37][41]

Relationship Between Reward Hacking and Entropy Collapse
- Reward hacking occurs when an agent finds shortcuts to maximize rewards, often leading to entropy collapse as the policy becomes overly deterministic [41][42]
- The article outlines the cyclical relationship between reward hacking and entropy collapse, suggesting that addressing one can help mitigate the other [41][42]
- Strategies for managing both issues include refining reward functions, enhancing training stability, and ensuring diverse sampling [47][48]
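The ordinary vs. weighted importance-sampling trade-off mentioned above is easy to demonstrate numerically: the ordinary estimator is unbiased but swings wildly when the behavior and target policies disagree, while the weighted estimator trades some bias for a much tighter spread. A self-contained sketch (the policies, rewards, and sample sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
b = np.array([0.9, 0.1])        # behavior policy that generated the data
pi = np.array([0.1, 0.9])       # target policy we want to evaluate
reward = np.array([0.0, 1.0])   # deterministic reward per action
true_value = float((pi * reward).sum())  # 0.9

def estimates(n):
    a = rng.choice(2, size=n, p=b)  # actions sampled from the behavior policy
    w = pi[a] / b[a]                # importance ratios pi(a) / b(a)
    r = reward[a]
    ois = (w * r).mean()            # ordinary IS: unbiased, high variance
    wis = (w * r).sum() / w.sum()   # weighted IS: biased, lower variance
    return ois, wis

# Repeat the experiment: ordinary IS centers on the true value of 0.9 but
# with far larger spread than weighted IS.
runs = np.array([estimates(10) for _ in range(2000)])
```

The same variance problem at the sequence level is what motivates GSPO's move to sequence-level ratios and DAPO's clipping tricks: a product of per-token ratios behaves like the single large weights here, only worse.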
Why Did GPT-5 Stop "Making Things Up"? OpenAI's New Paper Explains It
腾讯研究院· 2025-09-12 08:58
Core Viewpoint
- The article discusses the advancements and challenges of OpenAI's GPT-5, particularly focusing on the significant reduction in hallucination rates compared to previous models, while also highlighting the underlying mechanisms and implications of these changes [5][6][25]

Group 1: Hallucination Rates and Mechanisms
- GPT-5 has a hallucination rate that is approximately 45% lower than GPT-4 and about 80% lower than OpenAI's earlier models [6]
- The reduction in hallucination rates is attributed to enhanced reinforcement learning techniques that allow models to refine their reasoning processes and recognize their errors [8][9]
- The paper published by OpenAI indicates that hallucinations are an inevitable byproduct of the statistical learning nature of language models, making it more challenging to generate reliable information than to assess its reliability [12][16]

Group 2: Theoretical Framework
- OpenAI introduces a theoretical "Is-It-Valid" (IIV) judgment mechanism that determines the validity of generated sentences based on their internal probabilities [13]
- The model's tendency to generate plausible-sounding but incorrect information is exacerbated by data sparsity, complexity, and noise in training data [14][16]
- The mathematical conclusion presented in the paper suggests that the error rate of generative models is at least double that of the IIV judgment errors, indicating a compounding effect of judgment mistakes on hallucinations [15][16]

Group 3: Post-Training Challenges
- Post-training processes have not effectively mitigated hallucinations, as current evaluation metrics tend to reward models for providing confident but potentially incorrect answers [18][24]
- The article critiques the binary scoring systems used in mainstream AI evaluations, which penalize uncertainty and discourage models from expressing "I don't know" [21][24]
- The reinforcement learning processes that utilize binary reward paths may inadvertently promote overconfidence in models, leading to increased hallucination rates [27][29]

Group 4: Future Directions and Solutions
- The article suggests that introducing a penalty-based scoring mechanism during post-training could help models better calibrate their confidence levels and reduce hallucinations [33]
- A shift from a score-optimization focus to a truth-oriented approach is proposed as a potential solution to the hallucination problem [34]
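The critique of binary scoring can be made concrete with expected-value arithmetic: under binary grading, a guess with any nonzero chance of being right strictly beats abstaining, so training rewards confident bluffing; a wrong-answer penalty flips that below a confidence threshold. A small sketch (the specific penalty of 3 is an illustrative number, not OpenAI's proposal):

```python
def expected_score(p_correct, wrong_penalty=0.0, idk_score=0.0):
    """Expected grade for answering vs. abstaining under a grading scheme.
    p_correct is the model's probability that its answer is right."""
    answer = p_correct * 1.0 + (1 - p_correct) * (-wrong_penalty)
    return answer, idk_score

# Binary grading (no penalty): even a 10%-confidence guess beats "I don't know".
ans, idk = expected_score(0.10, wrong_penalty=0.0)
assert ans > idk

# Penalty grading: with a penalty of 3 for wrong answers, answering only pays
# above 75% confidence (p - 3*(1 - p) > 0  <=>  p > 0.75).
ans, idk = expected_score(0.10, wrong_penalty=3.0)
assert ans < idk   # 0.1 - 0.9 * 3 = -2.6: abstaining is optimal
ans, idk = expected_score(0.80, wrong_penalty=3.0)
assert ans > idk   # 0.8 - 0.2 * 3 = 0.2: answering is optimal
```

This is the calibration mechanism the article points to: the penalty turns "I don't know" into the rational choice exactly when the model's confidence is below the threshold the grader sets.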
Going Viral Overnight: 27-Year-Old Yao Shunyu Leaves OpenAI. Is the Tsinghua Yao Class Prodigy Turning Product Manager?
36Kr· 2025-09-12 04:04
Core Insights
- The news highlights the significant attention surrounding Shunyu Yao, a prominent AI talent, and the implications of his potential recruitment by Tencent, which has been officially denied [1][6]
- Yao's expertise and contributions to OpenAI's Deep Research make him a highly sought-after figure in the AI industry, with rumors of a salary of 100 million RMB circulating, reflecting the competitive landscape for top AI talent [3][4]

Group 1: Shunyu Yao's Background and Achievements
- Shunyu Yao, aged 27, is a graduate of Tsinghua University and Princeton University, recognized for his exceptional academic performance and contributions to AI research [7][11]
- He has been a core contributor to OpenAI's projects, including the development of intelligent agents and digital automation tools, which are pivotal for advancing AI capabilities [5][11]
- His research has garnered significant recognition, with over 15,000 citations, indicating his influence in the field of AI [11][12]

Group 2: Industry Implications
- The recruitment of top AI talent like Yao signifies a deeper shift in the global AI talent ecosystem, as companies vie for expertise to drive innovation [6][19]
- Yao's perspective on the importance of evaluation over training in AI development suggests a potential paradigm shift in how AI models are assessed and improved, emphasizing the need for practical applications [18][20]
- The competitive salary offers from companies like Meta, which reportedly reached 100 million USD for core researchers, highlight the escalating financial stakes in attracting leading AI professionals [3][4]
Bund Summit Dispatch (1): Sutton Proposes a New Paradigm for AI Development, with Reinforcement Learning and Multi-Agent Collaboration as the Keys
Investment Rating
- The report does not explicitly provide an investment rating for the industry or specific companies within it

Core Insights
- Richard Sutton proposes that we are entering an "Era of Experience" characterized by autonomous interaction and environmental feedback, emphasizing the need for systems that can create new knowledge through direct interaction with their environments [1][8]
- Sutton argues that public fears regarding AI, such as bias and unemployment, are overstated, and that multi-agent cooperation can lead to win-win outcomes [9]
- The report highlights the importance of continual learning and meta-learning as key areas for unlocking the potential of reinforcement learning [3][13]

Summary by Sections

Event
- Sutton's presentation at the 2025 INCLUSION Conference outlines a shift from static knowledge transfer to dynamic agent-environment interactions, marking a transition to an "Era of Experience" [1][8]
- He identifies reinforcement learning as crucial for this transition, but notes that its full potential is contingent on advancements in continual and meta-learning [1][8]

Commentary
- The report discusses the shift from "data as experience" to "capability as interaction," suggesting that firms need to develop systems that can actively engage with their environments to generate new knowledge [2][11]
- It emphasizes that the real bottleneck in reinforcement learning is not model parameters but the ability to handle time and task sequences, highlighting the need for continual and meta-learning capabilities [3][13]

Technical Bottlenecks
- The report identifies two main constraints in reinforcement learning: the need for continual learning to avoid catastrophic forgetting and the need for meta-learning to enable rapid adaptation across tasks [3][13]
- It suggests that R&D should focus on long-horizon evaluation and the integration of memory mechanisms and planning architectures [3][13]

Decentralized Collaboration
- The report posits that decentralized collaboration is not only a technical choice but also a governance issue, requiring clear incentives and transparent protocols to function effectively [4][12]
- It outlines three foundational institutional requirements for effective decentralized collaboration: open interfaces, cooperation-competition testbeds, and auditability [4][12]

Replacement Dynamics
- Sutton's view on "replacement" suggests that it will occur at the task level rather than entire job roles, urging organizations to proactively deconstruct tasks and redesign processes for human-AI collaboration [5][15]
- The report recommends establishing a human-AI division of labor and reforming performance metrics to focus on collaborative efficiency [5][15]
The Bund Summit Again Confirms Ant's True Colors: A Fintech Company
Mei Ri Shang Bao· 2025-09-11 23:04
Group 1: Conference Overview
- The 2025 Inclusion·Bund Conference opened in Shanghai with the theme "Reshaping Innovative Growth," featuring 550 guests from 16 countries and regions, including notable figures like Richard Sutton and Yuval Noah Harari [1]
- The conference focused on five main topics: "Financial Technology," "Artificial Intelligence and Industry," "Innovation and Investment Ecology," "Global Dialogue and Cooperation," and "Responsible Innovation and Inclusive Future," comprising one main forum and 44 insight forums [1]
- The event is recognized as one of Asia's three major financial technology conferences, attracting global attention for its openness, diversity, and forward-looking nature [1]

Group 2: Insights from Richard Sutton
- Richard Sutton, the 2024 Turing Award winner, emphasized that artificial intelligence is entering an "experience era," where the potential for AI exceeds previous capabilities [2]
- He noted that current machine learning methods are reaching the limits of human data, and there is a need for new data sources generated through direct interaction between intelligent agents and the world [2]
- Sutton defined "experience" as the interaction of observation, action, and reward, which is essential for learning and intelligence [2][3]

Group 3: Insights from Wang Xingxing
- Wang Xingxing, CEO of Yushutech, expressed regret for not pursuing AI earlier, highlighting the rapid development of large models that now allow for the integration of AI with robotics [4]
- He discussed the emergence of a new embodied intelligence industry, where robots can possess AGI capabilities, enabling them to perceive, plan, and act autonomously [4]
- Wang is optimistic about the future of innovation and entrepreneurship, stating that the barriers to entry have significantly lowered, creating a favorable environment for young innovators [4]

Group 4: Ant Group's Technological Advancements
- Ant Group is recognized as a leading technology financial company, with significant investments in AI and various sectors [5][6]
- The conference showcased Ant Group's new AI assistant "Xiao Zheng," which integrates multiple large models to streamline government services [6]
- Ant Group's CTO announced the launch of the "Agentic Contract," which will be natively deployed on their new Layer2 blockchain, Jovay [6]