Reinforcement Learning
Ant Group Hiring: Large Model Data Intelligence Algorithm Engineer (Internal Referral Available)
自动驾驶之心· 2025-09-15 23:33
Core Viewpoint
- The article discusses the responsibilities and requirements for a position focused on developing advanced algorithms for large model data production, emphasizing the importance of data knowledge systems, automatic classification, authoritative evaluation sets, quality assessment, and innovative solutions in the field of artificial intelligence and deep learning [1][2][3].

Group 1: Responsibilities
- The role involves designing and developing algorithms to address key issues in large model data production, including data knowledge system generation, automatic corpus classification, authoritative evaluation set construction, and quality assessment of training data [1][5].
- Specific tasks include researching automatic knowledge graph generation based on LLMs, developing classification algorithms, and creating standardized evaluation sets to assess model performance [1][5].
- The position also requires establishing a data-driven system for quality assessment, identifying low-quality data, and synthesizing training data to improve model performance [1][5].

Group 2: Requirements
- Candidates should hold a master's degree or higher in computer science, artificial intelligence, deep learning, or a related field, and be proficient in deep learning frameworks such as PyTorch and TensorFlow [2][6].
- Strong problem-solving skills, self-motivation, and the ability to analyze and address issues are essential, along with effective communication and coordination abilities [2][6].
- Preference is given to candidates with practical experience in large model data system design, corpus classification, evaluation set construction, and data annotation algorithms [3][4][6].
Paper Walkthrough: HKUST's PLUTO, the First Planner to Surpass Rule-Based Baselines!
自动驾驶之心· 2025-09-15 23:33
Core Viewpoint
- The article discusses the development and features of the PLUTO model within the end-to-end autonomous driving domain, emphasizing its unique two-stage architecture and its direct encoding of structured perception outputs for downstream control tasks [1][2].

Summary by Sections

Overview of PLUTO
- PLUTO is characterized by its three main losses: regression loss, classification loss, and imitation learning loss, which collectively contribute to the model's performance [7].
- Additional auxiliary losses are incorporated to aid model convergence [9].

Course Introduction
- The article introduces a new course titled "End-to-End and VLA Autonomous Driving," developed in collaboration with top algorithm experts from domestic leading manufacturers, aimed at addressing the challenges faced by learners in this rapidly evolving field [12][15].

Learning Challenges
- The course addresses the difficulties learners face due to the fast-paced development of the technology and the fragmented nature of knowledge across domains, which make it hard for beginners to grasp the necessary concepts [13].

Course Features
- The course is designed to provide quick entry into the field, build a framework for research capabilities, and combine theory with practical applications [15][16][17].

Course Outline
- The course consists of several chapters covering topics such as the history and evolution of end-to-end algorithms, background knowledge on various technologies, and detailed discussions of both one-stage and two-stage end-to-end methods [20][21][22][29].

Practical Application
- The course includes practical assignments, such as RLHF fine-tuning, allowing students to apply their theoretical knowledge in real-world scenarios [31].

Instructor Background
- The instructor, Jason, has a strong academic and practical background in cutting-edge end-to-end and large-model algorithms, contributing to the course's credibility [32].

Target Audience and Expected Outcomes
- The course is aimed at individuals with a foundational understanding of autonomous driving and related technologies, with the goal of elevating their skills to the level of an end-to-end autonomous driving algorithm engineer within a year [36].
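The three-loss structure described above can be sketched as a weighted sum. The helper below is an illustrative reconstruction in plain NumPy; the function names, weights, and tensor shapes are assumptions for exposition, not PLUTO's published configuration.

```python
import numpy as np

def smooth_l1(pred, gt):
    # Huber-style regression loss, elementwise then averaged
    d = np.abs(pred - gt)
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).mean()

def cross_entropy(logits, target_idx):
    # classification loss over candidate-trajectory scores
    z = logits - logits.max()
    logp = z - np.log(np.exp(z).sum())
    return -logp[target_idx]

def total_loss(pred_traj, gt_traj, score_logits, target_idx,
               student_act, expert_act, aux=0.0,
               w=(1.0, 1.0, 1.0, 0.5)):
    reg = smooth_l1(pred_traj, gt_traj)             # regression loss
    cls = cross_entropy(score_logits, target_idx)   # classification loss
    imi = ((student_act - expert_act) ** 2).mean()  # imitation learning loss
    return w[0] * reg + w[1] * cls + w[2] * imi + w[3] * aux  # + auxiliary terms
```

In practice the auxiliary terms mentioned in the article would be additional scalar losses summed into `aux` to stabilize convergence.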
This ByteDance Paper Could Help Li Auto
理想TOP2· 2025-09-15 15:32
On September 11, 2025, ByteDance released "Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents". It matters for Li Auto because Li Auto intends to build agents and will very likely draw on this work, since it will run into the same problem: a built-in, harmful coupling between the strength of the learning signal (gradient magnitude) and the model's uncertainty at decision time (entropy).

This is actually quite similar to how humans learn. As long as the outcome is correct, the correctness of each step tends to be over-reinforced (by analogy: once sales are high, everything you did looks right). On a wrong path taken with high confidence, there is little reflection and the error never gets corrected; an error hit during tentative exploration makes the learner timid and afraid to keep exploring.

Confident and correct steps that should be strongly reinforced receive only a slight update; confident but wrong steps that should be heavily penalized likewise receive only a slight update; meanwhile, the uncertain exploratory steps that should be treated with caution absorb the most violent rewards and punishments, making training very unstable. ByteDance's paper offers an approach to solving this class of problem.

A more detailed discussion follows. At its core, the paper tackles a central dilemma in current LLM agent training: in long tasks where the final outcome is all-or-nothing (i.e., sparse reward), how do you know which step's decision to reward or punish? In traditional reinforcement learning, the agent ...
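The coupling described above falls out of the softmax policy gradient itself: the gradient of log pi(a) with respect to the logits shrinks as confidence in a grows, so confident steps (right or wrong) get tiny updates while uncertain ones get large swings. A minimal NumPy sketch; the rescaling at the end is a hypothetical illustration of "entropy modulation", not the exact rule from the ByteDance paper.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def logit_grad(probs, a):
    # For a softmax policy, d/dz log pi(a) = one_hot(a) - probs:
    # the update vanishes as pi(a) -> 1 and is largest when uncertain.
    g = -probs.copy()
    g[a] += 1.0
    return g

confident = softmax(np.array([6.0, 0.0, 0.0]))   # p(a=0) ~ 0.995
uncertain = softmax(np.array([0.1, 0.0, 0.0]))   # p(a=0) ~ 0.36

g_conf = np.linalg.norm(logit_grad(confident, 0))
g_unc = np.linalg.norm(logit_grad(uncertain, 0))
assert g_conf < g_unc  # the confident (possibly crucial) step gets the weaker update

def entropy(p):
    return -(p * np.log(p)).sum()

def modulated_grad(probs, a, eps=1e-2):
    # Hypothetical entropy modulation: damp updates on high-entropy
    # (uncertain) steps relative to low-entropy ones. Illustrative only.
    return logit_grad(probs, a) / (entropy(probs) + eps)
```

The assertion makes the pathology concrete: the near-certain step's gradient norm is two orders of magnitude smaller than the uncertain step's, which is exactly the inverted weighting the paper argues against.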
Charging into the New Energy First Tier: Buick Zhijing L7, the "New Benchmark for Range-Extended Luxury Sedans," Makes Its Nationwide Debut
Yang Zi Wan Bao Wang· 2025-09-15 13:57
Core Viewpoint
- The Buick Zhijing L7, a luxury electric vehicle, has been unveiled as the flagship model of Buick's high-end electric sub-brand, showcasing advanced technology and luxury features aimed at redefining the range-extended vehicle segment [1][3].

Group 1: Vehicle Features
- The Zhijing L7 is built on the new Buick "Xiaoyao" super fusion architecture, integrating top technologies in driving, assisted driving, and luxury comfort [3].
- It features the "Zhenlong" range-extending system, which offers a maximum power output of 252 kW, equivalent to a 3.0T V6 engine, achieving 0-100 km/h in 5.9 seconds with a combined fuel consumption of only 0.5 L per 100 km [5][7].
- The vehicle offers a pure electric range of 302 km and a total range of 1,420 km, addressing common concerns about range anxiety [5][7].

Group 2: Intelligent Driving and Experience
- The Zhijing L7 introduces the "Xiaoyao Zhixing" assisted driving system, featuring the Momenta R6 flywheel model based on end-to-end reinforcement learning, capable of handling complex driving scenarios [8].
- The assisted driving technology has accumulated over 1 billion kilometers of safe driving, positioning the vehicle among the top tier of intelligent driving experiences [8].

Group 3: Interior and Comfort
- The interior design emphasizes luxury with a spacious cabin, featuring the industry's first dual 120° zero-gravity seats for enhanced comfort [18][20].
- It is equipped with a 27-speaker Buick Sound theater-level audio system, providing an immersive sound experience akin to a top-tier concert hall [18].

Group 4: Design and Aesthetics
- The Zhijing L7 showcases a striking exterior design inspired by nature, with a luxurious silhouette and advanced features such as lidar and high-end lighting [14][16].
- The interior adopts a new pure "floating island" design aesthetic, creating a sophisticated and elegant atmosphere [16].

Group 5: Market Positioning
- As a representative of Buick's redefined brand value in the new energy era, the Zhijing L7 aims to compete in the first tier of the new energy vehicle market, leveraging its advanced range-extending technology and superior luxury experience [20].
Zhang Xiaojun in Conversation with OpenAI's Yao Shunyu: A System for Generating New Worlds
Founder Park· 2025-09-15 05:59
Core Insights
- The article discusses the evolution of AI, particularly focusing on the transition to the "second half" of AI development, emphasizing the importance of language and reasoning in creating more generalizable AI systems [4][62].

Group 1: AI Evolution and Language
- The concept of AI has evolved from rule-based systems to deep reinforcement learning, and now to language models that can reason and generalize across tasks [41][43].
- Language is highlighted as a fundamental tool for generalization, allowing AI to tackle a variety of tasks by leveraging reasoning capabilities [77][79].

Group 2: Agent Systems
- The definition of an "Agent" has expanded to include systems that interact with their environment and make decisions based on reasoning, rather than just following predefined rules [33][36].
- The development of language agents represents a significant shift, as they can perform tasks in more complex environments, such as coding and internet navigation, which were previously challenging for AI [43][54].

Group 3: Task Design and Reward Mechanisms
- The article emphasizes the importance of defining effective tasks and environments for AI training, suggesting that the current bottleneck lies in task design rather than model training [62][64].
- A focus on intrinsic rewards, which are based on outcomes rather than processes, is proposed as a key factor for successful reinforcement learning applications [88][66].

Group 4: Future Directions
- The future of AI development is seen as a combination of enhancing agent capabilities through better memory systems and intrinsic rewards, and exploring multi-agent systems [88][89].
- The potential for AI to generalize across various tasks is highlighted, with coding and mathematical tasks serving as prime examples of areas where AI can excel [80][82].
Cracking RL's "Slowest Link"! SJTU and ByteDance Join Forces to Speed Up Large Model RL Training by 2.6x
量子位· 2025-09-13 08:06
Yunzhong | QbitAI (WeChat official account: QbitAI)

Reinforcement learning training efficiency is far too low! As DeepSeek, GPT-4o, Gemini, and other models battle it out, reinforcement learning (RL) is undoubtedly the key behind large models' "deep thinking" ability.

Yet behind this race, a huge bottleneck is quietly limiting every player's speed: compared with pretraining and inference, RL training looks more like an inefficient "manual workshop", with enormous input but slow output. The Rollout (response generation) stage, which takes up over 80% of the time, has become the acknowledged Achilles' heel of the entire AI infrastructure, owing to its memory-bandwidth limits and autoregressive nature.

How to conquer this last stronghold of AI infrastructure? Now, a research team from Shanghai Jiao Tong University and ByteDance has given a brand-new answer. Their jointly developed RhymeRL starts from an overlooked phenomenon, cleverly turning historical data from waste into treasure, and boosts RL training throughput by 2.6x without sacrificing accuracy.

Model-generated answers exhibit two kinds of "historical similarity"

The team analyzed a large number of RL training runs and found that, across two adjacent training epochs, even though the model weights have been updated, the answers (rollouts) the model generates for the same prompt exhibit two kinds of "historical similarity": First, sequence similarity. The new answer ...
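The reuse of "historically similar" rollouts can be illustrated with a speculative-decoding-style accept-prefix loop. This toy sketch is an assumed mechanism loosely inspired by the description above, not RhymeRL's actual algorithm; in a real system the draft tokens would be verified in one batched forward pass, which is where the throughput win comes from.

```python
# Toy accept-prefix reuse of a previous epoch's rollout for the same prompt.
def generate_with_history(next_token, prompt, old_rollout, max_new=32):
    out, ctx = [], list(prompt)
    for draft in old_rollout:           # replay last epoch's answer as a draft
        tok = next_token(ctx)
        if tok is None:                 # end of sequence
            return out
        out.append(tok)
        ctx.append(tok)
        if tok != draft:                # policies diverged: stop reusing history
            break
    while len(out) < max_new:           # finish with ordinary decoding
        tok = next_token(ctx)
        if tok is None:
            break
        out.append(tok)
        ctx.append(tok)
    return out

# Usage with a deterministic stand-in for the updated policy:
target = ["B", "C", "D", "E"]
def next_token(ctx):                    # always continues toward `target`
    i = len(ctx) - 1                    # ctx starts with the 1-token prompt
    return target[i] if i < len(target) else None

# The old rollout agreed on the first two tokens, then diverged:
assert generate_with_history(next_token, ["A"], ["B", "C", "X", "Y"]) == ["B", "C", "D", "E"]
```

The longer the agreed prefix between epochs (the "sequence similarity" the team observed), the more generation work can be amortized into cheap verification.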
How Should You Prepare for RL Interview Questions?
自动驾驶之心· 2025-09-12 16:03
Author | Abel chen  Editor | 自动驾驶之心
Original link: https://zhuanlan.zhihu.com/p/1948681769332240910

1. Is GRPO on-policy or off-policy? Why?

Short answer: GRPO, as originally designed and commonly implemented, is on-policy (online / proximal-policy style); but it can be extended to off-policy, and there is existing work specifically studying such extensions and their trade-offs.

Why it is on-policy (explanation)

Why some argue it can be off-policy (extension)

Recent work has generalized the GRPO idea to off-policy settings (for instance, using data from other policies or old batches to estimate advantages and apply corrections), reporting potential benefits and trade-offs in sample efficiency and stability. In other words, although GRPO is essentially built on an on-policy surrogate objective, mathematically and in engineering practice one can design importance sampling, in-batch normalization, or clipping tricks to turn it into an off-policy version.

Practical advice (brief) ...
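A minimal sketch of the mechanics discussed in this Q&A: GRPO's group-relative advantage plus a PPO-style clipped ratio, where feeding in log-probabilities from an older sampling policy is exactly the kind of off-policy extension the answer mentions. The function names are illustrative, not from any specific paper.

```python
import numpy as np

def grpo_advantages(rewards):
    # Group-relative advantage: normalize rewards across the group of
    # responses sampled for the same prompt (the core of GRPO; no critic).
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def clipped_surrogate(logp_new, logp_old, advantages, clip=0.2):
    # PPO-style clipped objective. With logp_old taken from an older
    # sampling policy, the importance ratio corrects for mild off-policy
    # reuse, the kind of extension discussed above.
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip) * advantages
    return -np.minimum(unclipped, clipped).mean()
```

In the strictly on-policy case `logp_new == logp_old`, the ratio is 1 everywhere and the objective reduces to the plain advantage-weighted loss; the clipping only bites once the data is stale.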
Why Doesn't GPT-5 "Talk Nonsense" Anymore? OpenAI's New Paper Explains It
腾讯研究院· 2025-09-12 08:58
Core Viewpoint
- The article discusses the advancements and challenges of OpenAI's GPT-5, particularly focusing on the significant reduction in hallucination rates compared to previous models, while also highlighting the underlying mechanisms and implications of these changes [5][6][25].

Group 1: Hallucination Rates and Mechanisms
- GPT-5 has a hallucination rate approximately 45% lower than GPT-4 and about 80% lower than OpenAI's earlier models [6].
- The reduction in hallucination rates is attributed to enhanced reinforcement learning techniques that allow models to refine their reasoning processes and recognize their errors [8][9].
- The paper published by OpenAI indicates that hallucinations are an inevitable byproduct of the statistical learning nature of language models, making it harder to generate reliable information than to assess its reliability [12][16].

Group 2: Theoretical Framework
- OpenAI introduces a theoretical "Is-It-Valid" (IIV) judgment mechanism that determines the validity of generated sentences based on their internal probabilities [13].
- The model's tendency to generate plausible-sounding but incorrect information is exacerbated by data sparsity, complexity, and noise in the training data [14][16].
- The paper's mathematical conclusion is that the error rate of generative models is at least double the IIV judgment error rate, indicating a compounding effect of judgment mistakes on hallucinations [15][16].

Group 3: Post-Training Challenges
- Post-training processes have not effectively mitigated hallucinations, as current evaluation metrics tend to reward models for providing confident but potentially incorrect answers [18][24].
- The article critiques the binary scoring systems used in mainstream AI evaluations, which penalize uncertainty and discourage models from answering "I don't know" [21][24].
- Reinforcement learning processes that rely on binary reward paths may inadvertently promote overconfidence in models, leading to increased hallucination rates [27][29].

Group 4: Future Directions and Solutions
- The article suggests that introducing a penalty-based scoring mechanism during post-training could help models better calibrate their confidence levels and reduce hallucinations [33].
- A shift from score optimization to a truth-oriented approach is proposed as a potential solution to the hallucination problem [34].
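The case for a penalty-based scoring mechanism comes down to a two-line expected-value calculation: under binary scoring a model should always guess, while a wrong-answer penalty makes abstaining optimal below a confidence threshold. A minimal sketch, with the scoring parameters as illustrative assumptions rather than any benchmark's actual rubric:

```python
def expected_score(confidence, wrong_penalty):
    # Answering scores +1 if right, -wrong_penalty if wrong;
    # saying "I don't know" scores 0.
    return confidence * 1.0 - (1.0 - confidence) * wrong_penalty

def should_answer(confidence, wrong_penalty):
    return expected_score(confidence, wrong_penalty) > 0.0

# Binary scoring (penalty 0): guessing beats abstaining at any confidence,
# which is exactly the incentive for overconfident hallucination.
assert should_answer(0.05, wrong_penalty=0.0)

# With a 1-point penalty, guessing only pays off above 50% confidence.
assert not should_answer(0.40, wrong_penalty=1.0)
assert should_answer(0.60, wrong_penalty=1.0)
```

The threshold is confidence > wrong_penalty / (1 + wrong_penalty), so the penalty directly sets how calibrated a model must be before answering is worthwhile.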
Going Viral Overnight: 27-Year-Old Yao Shunyu Leaves OpenAI; Is the Tsinghua Yao Class Prodigy Turning Product Manager?
36Kr· 2025-09-12 04:04
Core Insights
- The news highlights the significant attention surrounding Shunyu Yao, a prominent AI talent, and the implications of his potential recruitment by Tencent, which has been officially denied [1][6].
- Yao's expertise and contributions to OpenAI's Deep Research make him a highly sought-after figure in the AI industry, with rumors of a 100 million RMB salary circulating, reflecting the competitive landscape for top AI talent [3][4].

Group 1: Shunyu Yao's Background and Achievements
- Shunyu Yao, aged 27, is a graduate of Tsinghua University and Princeton University, recognized for his exceptional academic performance and contributions to AI research [7][11].
- He has been a core contributor to OpenAI's projects, including the development of intelligent agents and digital automation tools, which are pivotal for advancing AI capabilities [5][11].
- His research has garnered significant recognition, with over 15,000 citations, indicating his influence in the field of AI [11][12].

Group 2: Industry Implications
- The recruitment of top AI talent like Yao signals a deeper shift in the global AI talent ecosystem, as companies vie for expertise to drive innovation [6][19].
- Yao's view that evaluation matters more than training in AI development suggests a potential paradigm shift in how AI models are assessed and improved, emphasizing the need for practical applications [18][20].
- Competitive salary offers from companies like Meta, which reportedly reached 100 million USD for core researchers, highlight the escalating financial stakes in attracting leading AI professionals [3][4].
Bund Summit Dispatch (1): Sutton Proposes a New Paradigm for AI Development, with Reinforcement Learning and Multi-Agent Collaboration as the Keys
Investment Rating
- The report does not explicitly provide an investment rating for the industry or the specific companies within it.

Core Insights
- Richard Sutton proposes that we are entering an "Era of Experience" characterized by autonomous interaction and environmental feedback, emphasizing the need for systems that can create new knowledge through direct interaction with their environments [1][8].
- Sutton argues that public fears regarding AI, such as bias and unemployment, are overstated, and that multi-agent cooperation can lead to win-win outcomes [9].
- The report highlights continual learning and meta-learning as the key areas for unlocking the potential of reinforcement learning [3][13].

Summary by Sections

Event
- Sutton's presentation at the 2025 INCLUSION Conference outlines a shift from static knowledge transfer to dynamic agent-environment interactions, marking the transition to an "Era of Experience" [1][8].
- He identifies reinforcement learning as crucial for this transition, but notes that its full potential is contingent on advances in continual learning and meta-learning [1][8].

Commentary
- The report discusses the shift from "data as experience" to "capability as interaction," suggesting that firms need to develop systems that can actively engage with their environments to generate new knowledge [2][11].
- It emphasizes that the real bottleneck in reinforcement learning is not model parameters but the ability to handle time and task sequences, highlighting the need for continual-learning and meta-learning capabilities [3][13].

Technical Bottlenecks
- The report identifies two main constraints in reinforcement learning: the need for continual learning to avoid catastrophic forgetting, and the need for meta-learning to enable rapid adaptation across tasks [3][13].
- It suggests that R&D should focus on long-horizon evaluation and the integration of memory mechanisms and planning architectures [3][13].

Decentralized Collaboration
- The report posits that decentralized collaboration is not only a technical choice but also a governance issue, requiring clear incentives and transparent protocols to function effectively [4][12].
- It outlines three foundational institutional requirements for effective decentralized collaboration: open interfaces, cooperation-competition testbeds, and auditability [4][12].

Replacement Dynamics
- Sutton's view on "replacement" suggests that it will occur at the task level rather than across entire job roles, urging organizations to proactively deconstruct tasks and redesign processes for human-AI collaboration [5][15].
- The report recommends establishing a clear human-AI division of labor and reforming performance metrics to focus on collaborative efficiency [5][15].