Reinforcement Learning

Four ace OpenAI researchers "defect": Meta's nine-figure signing bonuses have finally been spent
AI前线· 2025-06-28 05:13
Group 1
- Meta has recruited four former OpenAI researchers to join its newly established superintelligence lab, including Trapit Bansal, who played a key role in launching OpenAI's reinforcement learning project [1]
- The other three researchers, Lucas Beyer, Alexander Kolesnikov, and Xiaohua Zhai, previously assisted in establishing OpenAI's Zurich office and worked at DeepMind [1]
- The superintelligence lab was formed after Meta's internal large language model, Llama 4 Behemoth, ran into performance issues that delayed its release [1]

Group 2
- OpenAI revealed that Meta attempted to lure its employees with signing bonuses of up to $100 million, although many researchers declined the offers [2]
- Meta's recruitment efforts extend beyond OpenAI: it recently hired Alexandr Wang, CEO of AI training dataset provider ScaleAI, and invested $14.3 billion for a 49% stake in the company [2]
- Meta is also in advanced negotiations to acquire PlayAI, a voice AI developer that has previously raised approximately $21 million in funding [2]

Group 3
- Meta is seeking to hire tech investor Daniel Gross and former GitHub CEO Nat Friedman, who co-founded Safe Superintelligence, aiming to develop multi-task AI models that surpass human capabilities [3]
- To support its AI initiatives, Meta plans to invest up to $65 billion in data center infrastructure, including the construction of a new data center equipped with over 1.3 million NVIDIA GPUs [3]
Professor Xiao Yanghua: How far is embodied intelligence from "emergence"?
36Kr· 2025-06-27 11:30
Group 1
- The development of artificial intelligence (AI) has two clear trajectories: one represented by AIGC (Artificial Intelligence Generated Content) and the other by embodied intelligence [3][6]
- AIGC is considered a technological revolution due to its foundational nature, its ability to significantly enhance productivity, and its profound impact on societal structures [10][11]
- Embodied intelligence aims to replicate human sensory and action capabilities, but its impact on productivity is seen as limited compared to cognitive intelligence [11][13]

Group 2
- The current stage of AI development emphasizes the quality of data and training strategies over sheer data volume and computational power [3][15]
- The scaling law, which highlights the importance of large datasets and computational resources, is crucial for both AIGC and embodied intelligence [14][15]
- The industry faces challenges in gathering sufficient high-quality data for embodied intelligence, which is currently lacking compared to language models [20][21]

Group 3
- The future of embodied intelligence relies on its ability to understand and interact with human emotions, making emotional intelligence a core requirement for consumer applications [5][28]
- The development of embodied AI is hindered by the complexity of accurately modeling human experiences and environmental interactions [30][32]
- There is a need for innovative data acquisition strategies, such as combining real, synthetic, and simulated data, to overcome current limitations in embodied intelligence training [22][23]
OpenAI loses four top researchers in a row! An Ilya collaborator and core o1 contributor joins Meta; the Zurich trio on their move: a collective decision
量子位· 2025-06-27 08:09
Meng Chen, reporting from Aofeisi | QbitAI

Zuckerberg really seems to be singling out Altman! Yet another core OpenAI researcher has been poached, and one working on the most cutting-edge reasoning models at that.

The latest to jump to Meta is Trapit Bansal, who joined OpenAI in 2022, collaborated with Ilya Sutskever, played a key role in launching the company's reinforcement learning research on large models, and is listed as a core contributor to o1.

△ Trapit Bansal

After joining Meta, Bansal will continue researching reasoning models in the newly established superintelligence unit.

Bansal earned his PhD at the University of Massachusetts Amherst. After graduating he joined OpenAI, where he worked with Ilya to launch research on reinforcement learning for reasoning models. He currently has 2,800+ citations on Google Scholar, with several papers co-authored with Ilya. During his PhD he interned at OpenAI, contributing to multi-agent reinforcement learning research: letting AI discover new skills through self-play, without rewards designed specifically for those skills.
Breaking the bottleneck of general-domain reasoning! Tsinghua NLP Lab's new reinforcement learning research: RLPR
机器之心· 2025-06-27 00:49
Core Viewpoint
- The article discusses the introduction of a novel reinforcement learning technique called Reinforcement Learning with Reference Probability Reward (RLPR), which addresses the limitations of existing methods in generalizing to diverse domains beyond mathematics and coding [4][24]

Group 1: RLPR Technology Overview
- RLPR significantly enhances the quality of probability-based rewards through its Prob-to-Reward method, outperforming likelihood-based baseline methods in performance and training stability [7][24]
- The technique introduces a dynamic filtering mechanism based on reward standard deviation, further improving the stability and performance of reinforcement learning [8][17]

Group 2: Effectiveness of PR
- The research team found that the generation probability of reference answers in large language models (LLMs) directly reflects the model's assessment of its own reasoning process, indicating a strong correlation between reasoning accuracy and the probability of generating correct reference answers [11][24]
- The PR mechanism effectively captures the model's self-assessment of reasoning quality, demonstrating its reliability for evaluating output [11][13]

Group 3: Advantages Over Existing Methods
- Unlike existing RLVR methods, which require extensive human effort to build domain-specific validation rules, RLPR generates reward scores with a simple forward pass, making it more efficient at handling the complexity of natural language [13][24]
- RLPR's dynamic filtering mechanism retains samples with high reward standard deviation for training, enhancing training stability and effectiveness [17][24]

Group 4: Robustness and Validation
- The research team evaluated the quality of different reward sources using the ROC-AUC metric; PR outperformed rule-based rewards and verifier-model rewards at the 0.5B scale, with further improvements possible as model capabilities increase [19][21]
- RLPR demonstrated stable performance improvements across various training templates and base models, including Gemma and Llama, surpassing traditional rule-based RLVR baselines [22][24]
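The two mechanisms summarized above can be sketched as follows. This is a hedged illustration, not the paper's exact formulation: the use of mean token probability for PR, the function names, and the filtering threshold are all assumptions.

```python
import math

def prob_reward(token_logprobs):
    """Probability-based reward (PR), sketched: the mean probability the
    policy assigns to the reference answer's tokens, conditioned on the
    generated reasoning. Higher means the reasoning better supports the
    reference answer."""
    probs = [math.exp(lp) for lp in token_logprobs]
    return sum(probs) / len(probs)

def filter_by_reward_std(prompt_groups, min_std=0.05):
    """Dynamic filtering, sketched: keep only prompts whose sampled
    rollouts show high reward standard deviation, i.e. prompts that still
    carry a learning signal (neither trivially solved nor hopeless)."""
    kept = []
    for prompt, rewards in prompt_groups:
        mean = sum(rewards) / len(rewards)
        std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
        if std >= min_std:
            kept.append((prompt, rewards))
    return kept
```

Under this reading, prompts where every rollout earns the same reward contribute no gradient signal and are dropped, which is one way the reported stability gains could arise.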
Neural Factor Mining (V): A Reinforcement Learning Mixed-Frequency Multi-Step DQN Timing Strategy
Changjiang Securities· 2025-06-26 11:41
Financial Engineering | In-Depth Report: Neural Factor Mining (V) — A Reinforcement Learning Mixed-Frequency Multi-Step DQN Timing Strategy

Key points: The core of our DQN design is learning the latent value of the optimal trading action in a given market state. Applying the DQN to daily-frequency timing of the CSI 1000 index, the model's signals (long / short / flat) show effective predictive power. The constructed strategies significantly beat the benchmark: the long-short strategy's annualized return reaches 64.9% (rising to 79.4% after multi-step DQN optimization), and the short-only strategy shows excellent risk control (a maximum drawdown of only -14.33%, with leading Sharpe and Calmar ratios after optimization). Position changes are continuous and reasonable, avoiding meaningless high-frequency switching. The multi-step-optimized DQN further improves signal quality and the performance of every strategy (both return and risk-control metrics improve), demonstrating its great potential in quantitative timing.

Analysts: Qin Chuantao, Yang Kaijie (SAC: S0490513030001; SFC: BUT353)
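The report does not disclose its exact training objective; as a hedged illustration, a standard multi-step (n-step) DQN target, of the kind such a strategy would bootstrap from, might look like the sketch below. The discount factor, the 3-step horizon, and signed daily returns as rewards are all assumptions, not values from the report.

```python
GAMMA = 0.99   # discount factor (assumed)
N_STEPS = 3    # multi-step horizon (assumed; the report's value is not stated)

def n_step_target(rewards, q_next_max, done):
    """Multi-Step DQN target: the sum of n discounted rewards plus the
    discounted bootstrap value of the state n steps ahead.
    rewards: the next n per-step rewards (e.g. signed daily returns)
    q_next_max: max over actions of Q_target(s_{t+n}, a)
    done: whether the episode terminates within the n steps."""
    target = 0.0
    for k, r in enumerate(rewards):
        target += (GAMMA ** k) * r
    if not done:
        target += (GAMMA ** len(rewards)) * q_next_max
    return target
```

Relative to the one-step target, propagating n real rewards before bootstrapping reduces bias from an imperfect Q estimate, which is consistent with the report's finding that multi-step optimization improved signal quality.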
The Bitter Lessons on the Road to AGI
AI科技大本营· 2025-06-26 11:10
Core Viewpoint
- The article discusses the rapid advancement of AI and the potential for achieving Artificial General Intelligence (AGI) within the next 5 to 10 years, as predicted by Google DeepMind CEO Demis Hassabis, who estimates a 50% probability of this outcome [1]

Group 1: AI Development and Challenges
- The AI wave is accelerating at an unprecedented pace, but there have been numerous missteps along the way, as highlighted by Richard Sutton's 2019 essay "The Bitter Lesson", which warns against relying too heavily on human knowledge and intuition [2][4]
- Sutton argues that computational power and data, rather than human intelligence, are the fundamental engines driving AI forward [3]
- Many previously held beliefs about the paths to intelligence are becoming obstacles in this new era [4]

Group 2: Paths to AGI
- The article introduces a discussion of the "bitter lessons" learned on the road to AGI, featuring a dialogue with Liu Jia, a professor at Tsinghua University who has explored the intersection of AI, brain science, and cognitive science [5][11]
- Liu Jia identifies three paths to AGI: reinforcement learning, brain simulation, and natural language processing (NLP), but warns that each path carries its own hidden risks [9]
- Language does not equate to cognition, and models do not represent true thought: while NLP is progressing rapidly, it is not the ultimate destination [9][14]

Group 3: Technical Insights
- The article discusses the Scaling Law and the illusion of intelligence associated with large models, questioning whether their success is genuine evolution or merely an illusion [15]
- It raises concerns about the limits of brain simulation due to computational bottlenecks and theoretical blind spots, as well as the boundaries of language as a means of understanding the world [14]
Hello enters the autonomous-driving race! Backed by Ant Group and CATL, can it replicate its two-wheeler legend?
Nan Fang Du Shi Bao· 2025-06-25 15:19
Core Viewpoint
- The collaboration between Hello Chuxing, Ant Group, and CATL to establish Shanghai Zhaofu Intelligent Technology Co., Ltd. marks a significant move into the Robotaxi sector, focusing on L4 autonomous driving technology with a registered capital of 1.288 billion yuan [1][3]

Group 1: Company Collaboration
- The partnership combines complementary strengths: Hello Chuxing brings its experience in domestic and international transportation markets and in applying AI technology [3]
- Ant Group contributes substantial financial support and expertise in payment systems and data assets, which are crucial for developing payment solutions and user credit systems for autonomous vehicles [3]
- CATL, as a leader in battery technology, addresses the "range anxiety" that hinders the widespread adoption of electric mobility, providing essential technical support for Zhaofu Intelligent [3]

Group 2: Previous Collaborations
- This is not the three companies' first collaboration; they launched a battery-swap service for electric scooters in June 2019, experience that now informs their Robotaxi venture [4]

Group 3: Industry Competition
- The Robotaxi sector is highly competitive, with domestic players such as Baidu Apollo, Pony.ai, and Didi actively expanding their market share [6]
- Internationally, Tesla has launched its Robotaxi service in Austin, Texas, intensifying global competition [6]

Group 4: Challenges in the Industry
- Despite the promising outlook, the industry faces significant technical and commercial challenges, particularly around safety and reliability in complex environments [6]
- Research and development costs are high: companies such as Waymo invest over $2 billion annually, and fleet operating costs are substantial [7]
- Most companies in the sector currently operate at a loss, making a sustainable profit model the critical question for the industry's future [8]

Group 5: Unique Advantages of the Partnership
- Hello Chuxing's entry into the Robotaxi market, backed by strong partners, integrates resources across transportation, AI technology, and battery systems into a synergistic "scene + technology + energy" model [8]
- Its future in the sector will still demand major efforts in technology development, cost control, market expansion, and competition with industry giants [8]
Making multimodal large models "think before they draw"! HKU and others open-source GoT-R1: reinforcement learning unlocks a new paradigm for reasoning in visual generation
机器之心· 2025-06-25 06:50
Core Viewpoint
- The article discusses significant advances in multimodal large models for generating high-fidelity images from complex text prompts, while highlighting the challenge of accurately interpreting spatial relationships and multi-object attributes [1][2]

Group 1: Introduction of GoT-R1
- A research team from the University of Hong Kong, the Chinese University of Hong Kong, and SenseTime has introduced GoT-R1, an important advancement on the Generation Chain-of-Thought (GoT) framework [2]
- GoT-R1 enhances the semantic-spatial reasoning capabilities of multimodal large models through the innovative application of reinforcement learning, allowing the model to autonomously explore and learn better reasoning strategies [3][5]

Group 2: Limitations of the GoT Framework
- The GoT framework improves image-generation accuracy and controllability by explicitly planning semantic content and spatial layout before generation, but its reasoning is limited by supervised fine-tuning data built from predefined templates [4][13]
- GoT-R1 aims to overcome these limitations by introducing reinforcement learning into the semantic-spatial reasoning process, enabling the model to learn and optimize reasoning paths independently [5][13]

Group 3: Reward Mechanism in GoT-R1
- GoT-R1 constructs a comprehensive and effective reward mechanism for visual generation tasks, evaluating the generated results along multiple dimensions, including semantic consistency, spatial accuracy, and overall aesthetic quality [13][14]
- The reward framework includes:
  1. a reasoning-process evaluation reward (RPR) [14]
  2. a reasoning-to-image alignment reward (RRI), which quantifies adherence to the reasoning chain using Intersection over Union (IoU) [15]
  3. semantic (Rsem) and spatial (Rspa) alignment rewards, which assess the completeness and accuracy of the reasoning chain against the original text prompt [16]
  4. a text-to-image alignment reward (RPI), which evaluates the overall consistency of the generated image with the original text prompt [17]

Group 4: Performance Evaluation of GoT-R1
- Evaluated on the challenging T2I-CompBench, GoT-R1 established new state-of-the-art (SOTA) performance, achieving the highest scores in five of six evaluation categories [21][23]
- The model demonstrated significant advantages in handling complex, multi-layered instructions, particularly in the "Complex" benchmark [23]
- Compared with the baseline model, GoT-R1-7B achieved up to a 15% improvement in evaluation metrics, showcasing the effectiveness of reinforcement learning in enhancing the model's reasoning capabilities [24][25]

Group 5: Comparison of Reasoning Chains
- A comparative analysis using GPT-4o found GoT-R1's reasoning chains preferred over the baseline's across all evaluation categories, particularly in spatial-relationship understanding [25][26]
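The IoU-based RRI term can be illustrated with a minimal sketch. The (x1, y1, x2, y2) box format and the mean-over-objects aggregation are assumptions for illustration; the paper's full reward also relies on MLLM-based judging.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def alignment_reward(planned_boxes, detected_boxes):
    """Mean IoU between the layout planned in the reasoning chain and the
    object boxes detected in the generated image, one pair per object."""
    scores = [iou(p, d) for p, d in zip(planned_boxes, detected_boxes)]
    return sum(scores) / len(scores) if scores else 0.0
```

A reward of this shape directly penalizes images that ignore the spatial plan laid out in the reasoning chain, which is the failure mode the RRI term targets.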
A 7B small model surpasses DeepSeek-R1: by mimicking human teachers, weak models can teach strong reasoning LLMs | From a Transformer author's team
量子位· 2025-06-24 13:36
Bu Yuan, reporting from Aofeisi | QbitAI

With Thinking mode all the rage, teacher models should learn "heuristic" teaching too. Sakana AI, the star AI company founded by Transformer co-author Llion Jones, is here with a new method!

The method asks the teacher model to behave like a good human teacher: given a known solution, it outputs a clear step-by-step explanation, rather than solving the problem from scratch.

A 7B small model trained with Sakana AI's new method is more effective at passing on reasoning skills than the 671B DeepSeek-R1.

| Teacher | Student | Final model | AIME 2024 | MATH 500 | GPQA Diamond | Overall |
| --- | --- | --- | --- | --- | --- | --- |
| N.A. | | Qwen-7B | 10.00 | 74.20 | 33.30 | 39.17 |
| DeepSeek-R1 (671B) | Qwen-7B | Bespoke-7B | 20.00 | 82.00 | 37.80 | 46.60 |
| RLT teacher (7B) | | ... | | | | |
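A rough sketch of the idea, under stated assumptions: the prompt format and the reward shaping below are illustrative, not Sakana AI's exact recipe. The teacher is conditioned on both the question and a known correct solution, and is rewarded by how much its explanation helps the student.

```python
def teacher_prompt(question, solution):
    """RLT-style teacher input (format assumed): the teacher sees both the
    question and a known correct solution, and must produce a step-by-step
    explanation connecting them, instead of solving from scratch."""
    return (
        f"Question: {question}\n"
        f"Known solution: {solution}\n"
        "Explain, step by step, how to reach this solution:"
    )

def teacher_reward(student_logprob_with_explanation, student_logprob_baseline):
    """Hypothetical reward shaping: the teacher earns reward equal to how
    much its explanation raises the student's log-likelihood of the correct
    solution, relative to seeing the question alone."""
    return student_logprob_with_explanation - student_logprob_baseline
```

Framing the task this way explains why a 7B teacher can beat a 671B one: explaining a given solution well is an easier skill to learn than solving from scratch.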
Zero products, a $10 billion valuation! Former OpenAI CTO's "star startup" aims to build "custom AI models for enterprises"
Hua Er Jie Jian Wen· 2025-06-24 08:39
Thinking Machines Lab (TML), the AI startup founded by former OpenAI CTO Mira Murati, is rising at a startling pace, aiming to help enterprises grow revenue through customized AI models.

According to media reports, the company, founded less than five months ago, has completed a $2 billion funding round at a $10 billion valuation.

A customized-AI strategy driven by reinforcement learning

TML's business model centers on developing customized AI models with reinforcement learning. According to people who have spoken with Murati, the company ties its AI models to the specific KPIs a business tracks, aiming to help clients directly lift revenue or profit. Investors have dubbed this targeted strategy "RL for businesses", intended to give enterprises more precise solutions.

The former OpenAI CTO's new venture is valued at $10 billion and plans to develop customized AI models to help enterprises grow revenue.

According to people familiar with the matter, Meta CEO Zuckerberg has discussed a possible investment or acquisition with Murati in recent months, but the talks made no substantive progress.

In addition, Google Cloud is providing TML with rentals of NVIDIA-powered servers, which could prompt Google to invest further in hopes that TML increases its server-rental spending in the future.

This customization approach could give TML a competitive edge in specific verticals, such as customer support, investment banking, or retail, meeting niche-market needs for which clients may pay a premium. ...