Reinforcement Learning
OpenAI loses four key researchers in a row! An Ilya collaborator and core o1 contributor joins Meta, and the Zurich trio responds to the move: it was a collective decision
量子位· 2025-06-27 08:09
Core Insights
- Meta has successfully recruited key talent from OpenAI, including Trapit Bansal, who will focus on advanced reasoning models in a newly established superintelligence department [1][2][10]
- The recent hiring spree includes a group of three researchers from Zurich, indicating a strategic move by Meta to strengthen its AI capabilities [10][11]

Group 1: Talent Acquisition
- Trapit Bansal, a core contributor to OpenAI's large model reinforcement learning research, has joined Meta after a year at OpenAI [1][6]
- The Zurich trio, consisting of Lucas Beyer, Alexander Kolesnikov, and Zhai Xiaohua, confirmed their transition to Meta, emphasizing their collective decision to move [10][11][21]
- Bansal has over 2800 citations on Google Scholar, showcasing his significant impact in the field [7]

Group 2: Research Focus
- Bansal's research at Meta will continue to explore reasoning models, building on his previous work in multi-agent reinforcement learning [4][6]
- The Zurich trio is known for developing the ViT architecture, which has been widely cited, indicating their strong background in AI research [14][15]

Group 3: Strategic Moves
- Meta is not only focusing on talent acquisition but is also in talks to acquire PlayAI, a voice AI startup, to enhance its capabilities in voice technology [23][24]
- This acquisition strategy aligns with Meta's goal to integrate more voice functionalities into its AR glasses [27]
Breaking the bottleneck of general-domain reasoning! RLPR, new reinforcement learning research from Tsinghua's NLP lab
机器之心· 2025-06-27 00:49
Core Viewpoint
- The article discusses the introduction of a novel reinforcement learning technique called Reinforcement Learning with Reference Probability Reward (RLPR), which addresses the limitations of existing methods in generalizing to diverse domains beyond mathematics and coding [4][24]

Group 1: RLPR Technology Overview
- RLPR significantly enhances the quality of probability-based rewards through the Prob-to-Reward method, outperforming likelihood-based baseline methods in performance and training stability [7][24]
- The technology introduces a dynamic filtering mechanism based on reward standard deviation, further improving the stability and performance of reinforcement learning [8][17]

Group 2: Effectiveness of PR
- The research team found that the generation probability of reference answers in large language models (LLMs) directly reflects the quality assessment of the model's reasoning process, indicating a strong correlation between the model's reasoning accuracy and the probability of generating correct reference answers [11][24]
- The PR mechanism effectively captures the model's self-assessment of reasoning quality, demonstrating its reliability in evaluating output [11][13]

Group 3: Advantages Over Existing Methods
- Unlike existing RLVR methods that require extensive human resources for domain-specific validation rules, RLPR generates reward scores with a simple forward pass, making it more efficient in handling the complexity of natural language [13][24]
- RLPR's dynamic filtering mechanism retains samples with high reward standard deviation for training, enhancing training stability and effectiveness [17][24]

Group 4: Robustness and Validation
- The research team evaluated the quality of different reward sources using the ROC-AUC metric, showing that PR outperformed rule-based rewards and verifier model rewards at the 0.5-billion-parameter scale, with further improvements possible as model capabilities increase [19][21]
- RLPR demonstrated stable performance improvements across various training templates and base models, including Gemma and Llama, surpassing the performance of traditional rule-based RLVR baselines [22][24]
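The two mechanisms summarized above — scoring a rollout by the probability the policy assigns to the reference-answer tokens, and keeping only prompts whose sampled rollouts show enough reward spread — can be sketched as follows. This is a minimal illustration of the idea, not the paper's implementation; the function names and the threshold value are assumptions:

```python
import math

def prob_to_reward(token_logprobs):
    """Hypothetical Prob-to-Reward scoring: average the per-token
    probabilities the policy assigns to the reference answer, rather
    than the raw sequence likelihood (which collapses for long answers)."""
    probs = [math.exp(lp) for lp in token_logprobs]
    return sum(probs) / len(probs)

def std_filter(reward_groups, min_std=0.05):
    """Dynamic filtering: drop prompts whose sampled rollouts have
    near-zero reward standard deviation, since such groups carry no
    learning signal for the policy update."""
    kept = []
    for rewards in reward_groups:
        n = len(rewards)
        mean = sum(rewards) / n
        std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
        if std >= min_std:  # enough spread -> informative prompt
            kept.append(rewards)
    return kept
```

For example, a group of identical rewards (all rollouts judged equally good) would be filtered out, while a mixed group survives into the training batch.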
Neural Factor Mining (V): A Mixed-Frequency Reinforcement Learning Multi-Step DQN Market-Timing Strategy
Changjiang Securities· 2025-06-26 11:41
Financial Engineering | In-Depth Report

Neural Factor Mining (V): A Mixed-Frequency Reinforcement Learning Multi-Step DQN Market-Timing Strategy

Report highlights: The core of our DQN design is learning the latent value of the optimal trading action in a given market state. Applying the DQN to daily-frequency timing of the CSI 1000 index, the model's signals (long / short / flat) show effective predictive power. The constructed strategies significantly beat the benchmark: the long-short strategy's annualized return reaches 64.9% (rising to 79.4% after multi-step DQN optimization), and the short strategy has excellent risk control (maximum drawdown of only -14.33%, with leading Sharpe/Calmar ratios after optimization). Position changes are continuous and reasonable, avoiding meaningless high-frequency switching. The multi-step DQN further improves signal quality and the performance of every strategy (both return and risk-control metrics improve), demonstrating its strong potential in quantitative market timing.

Analysts: Qin Chuantao, Yang Kaijie. SAC: S0490513030001; SFC: BUT353
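The multi-step DQN summarized above builds its learning target from several future rewards instead of a single one, which is what improves signal quality over the one-step variant. A minimal sketch of the standard n-step bootstrapped target (the generic textbook form; the report's exact reward definition and network architecture are not given here):

```python
def multi_step_target(rewards, gamma, q_next, done):
    """n-step DQN target: discounted sum of the next n rewards, plus the
    bootstrapped Q-value of the state n steps ahead (unless the episode
    terminated inside the window). `rewards` holds the n observed rewards,
    `q_next` is max_a Q_target(s_{t+n}, a)."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r          # discounted reward at step t+k
    if not done:
        g += (gamma ** len(rewards)) * q_next  # bootstrap at step t+n
    return g
```

In a timing setting, the rewards would be the daily returns realized by the chosen action (long / short / flat), so the target credits an action for its effect over the next n trading days rather than a single day.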
The Bitter Lesson on the Road to AGI
AI科技大本营· 2025-06-26 11:10
Core Viewpoint
- The article discusses the rapid advancement of AI and the potential for achieving Artificial General Intelligence (AGI) within the next 5 to 10 years, as predicted by Google DeepMind CEO Demis Hassabis, who estimates a 50% probability of this achievement [1]

Group 1: AI Development and Challenges
- The AI wave is accelerating at an unprecedented pace, but there have been numerous missteps along the way, as highlighted by Richard Sutton's 2019 article "The Bitter Lesson," which emphasizes the pitfalls of relying too heavily on human knowledge and intuition [2][4]
- Sutton argues that computational power and data are the fundamental engines driving AI forward, rather than human intelligence [3]
- The article suggests that many previously held beliefs about the paths to intelligence are becoming obstacles in this new era [4]

Group 2: Paths to AGI
- The article introduces a discussion on the "bitter lessons" learned on the road to AGI, featuring a dialogue with Liu Jia, a professor at Tsinghua University, who has explored the intersection of AI, brain science, and cognitive science [5][11]
- Liu Jia identifies three paths to AGI: reinforcement learning, brain simulation, and natural language processing (NLP), but warns that each path has its own hidden risks [9]
- The article emphasizes that language does not equate to cognition, and models do not represent true thought, indicating that while NLP is progressing rapidly, it is not the ultimate destination [9][14]

Group 3: Technical Insights
- The article discusses the Scaling Law and the illusion of intelligence associated with large models, questioning whether the success of these models is genuine evolution or merely an illusion [15]
- It raises concerns about the limitations of brain simulation due to computational bottlenecks and theoretical blind spots, as well as the boundaries of language in relation to understanding the world [14]
Hello Chuxing enters the autonomous-driving race! Backed by Ant Group and CATL, can it replicate its two-wheeler success?
Nan Fang Du Shi Bao· 2025-06-25 15:19
Core Viewpoint
- The collaboration between Hello Chuxing, Ant Group, and CATL to establish Shanghai Zhaofu Intelligent Technology Co., Ltd. marks a significant move into the Robotaxi sector, focusing on L4 autonomous driving technology with a registered capital of 1.288 billion yuan [1][3]

Group 1: Company Collaboration
- The partnership showcases complementary strengths among the three companies, with Hello Chuxing leveraging its experience in domestic and international transportation markets and AI technology applications [3]
- Ant Group contributes substantial financial support and expertise in payment systems and data assets, which are crucial for developing payment solutions and user credit systems for autonomous vehicles [3]
- CATL, as a leader in battery technology, addresses the "range anxiety" issue that hinders the widespread adoption of electric mobility, providing essential technical support for Zhaofu Intelligent [3]

Group 2: Previous Collaborations
- This is not the first collaboration among the three companies; they previously launched a battery swap service for electric scooters in June 2019, which has provided valuable experience for their current Robotaxi venture [4]

Group 3: Industry Competition
- The Robotaxi sector is highly competitive, with numerous domestic companies like Baidu Apollo, Pony.ai, and Didi actively participating and expanding their market share [6]
- Internationally, Tesla has also launched its Robotaxi service in Austin, Texas, intensifying global competition in the market [6]

Group 4: Challenges in the Industry
- Despite the promising outlook for Robotaxi development, the industry faces significant challenges in both technology and commercialization, particularly regarding safety and reliability in complex environments [6]
- High research and development costs are a major concern, with companies like Waymo investing over $2 billion annually, and operational costs for fleets being substantial [7]
- Most companies in the sector are currently operating at a loss, making the establishment of a sustainable profit model a critical issue for the industry's future [8]

Group 5: Unique Advantages of the Partnership
- Hello Chuxing's entry into the Robotaxi market, backed by strong partners, integrates resources across transportation, AI technology, and battery systems, creating a synergistic model of "scene + technology + energy" [8]
- The future development of Hello Chuxing in the Robotaxi sector will require significant efforts in technology development, cost control, market expansion, and competition with industry giants [8]
Getting multimodal large models to "think it through before drawing"! HKU and others open-source GoT-R1: reinforcement learning unlocks a new paradigm for reasoning in visual generation
机器之心· 2025-06-25 06:50
Core Viewpoint
- The article discusses the significant advancements in multimodal large models for generating high-fidelity images from complex text prompts, while also highlighting the challenges faced in accurately interpreting spatial relationships and multi-object attributes [1][2]

Group 1: Introduction of GoT-R1
- A research team from the University of Hong Kong, Chinese University of Hong Kong, and SenseTime has introduced GoT-R1, an important advancement following the Generation Chain-of-Thought (GoT) framework [2]
- GoT-R1 enhances the semantic-spatial reasoning capabilities of multimodal large models through the innovative application of reinforcement learning, allowing the model to autonomously explore and learn better reasoning strategies [3][5]

Group 2: Limitations of GoT Framework
- The GoT framework improves image generation accuracy and controllability by explicitly planning semantic content and spatial layout before image generation, but its reasoning capabilities are limited by supervised fine-tuning data based on predefined templates [4][13]
- GoT-R1 aims to overcome these limitations by introducing reinforcement learning into the semantic-spatial reasoning process, enabling the model to learn and optimize reasoning paths independently [5][13]

Group 3: Reward Mechanism in GoT-R1
- GoT-R1 constructs a comprehensive and effective reward mechanism for visual generation tasks, evaluating multiple dimensions of the generated results, including semantic consistency, spatial accuracy, and overall aesthetic quality [13][14]
- The reward framework includes:
  1. Reasoning Process Evaluation Reward (RPR) [14]
  2. Reasoning-to-Image Alignment Reward (RRI), which quantifies adherence to the reasoning chain using Intersection over Union (IoU) [15]
  3. Semantic Alignment Reward (Rsem) and Spatial Alignment Reward (Rspa), which assess the completeness and accuracy of the reasoning chain against the original text prompt [16]
  4. Text-to-Image Alignment Reward (RPI), which evaluates the overall consistency of the generated image with the original text prompt [17]

Group 4: Performance Evaluation of GoT-R1
- GoT-R1 was evaluated on the challenging T2I-CompBench, where it established new state-of-the-art (SOTA) performance, achieving the highest scores in five out of six evaluation categories [21][23]
- The model demonstrated significant advantages in handling complex, multi-layered instructions, particularly in the "Complex" benchmark [23]
- Compared to the baseline model, GoT-R1-7B achieved up to a 15% improvement in evaluation metrics, showcasing the effectiveness of reinforcement learning in enhancing the model's reasoning capabilities [24][25]

Group 5: Comparison of Reasoning Chains
- A comparative analysis using GPT-4o revealed that GoT-R1 generated reasoning chains were preferred over those from the baseline model across all evaluation categories, particularly in spatial relationship understanding [25][26]
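The Reasoning-to-Image Alignment Reward (RRI) described in this item scores how well generated objects land where the reasoning chain planned them, using Intersection over Union between planned and detected bounding boxes. A minimal IoU sketch for axis-aligned boxes, purely illustrative of the metric the reward quantifies (not GoT-R1's code):

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes.
    1.0 means perfect overlap with the planned layout, 0.0 means
    the object landed entirely outside the planned region."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (may be empty).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

Averaging this score over all planned objects gives a per-image layout-adherence signal that can be fed into a reinforcement learning reward alongside the semantic and aesthetic terms.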
A 7B small model surpasses DeepSeek-R1: by imitating human teachers, even a weak model can teach a strong reasoning LLM | From a Transformer author's team
量子位· 2025-06-24 13:36
Bu Yuan | QbitAI

With the "Thinking" mode in vogue, teacher models should learn "heuristic" teaching too — and Sakana AI, the star AI company founded by Transformer co-author Llion Jones, has arrived with a new method.

The method asks the teacher model to act like a good human teacher: given a known solution, it outputs a clear step-by-step explanation, rather than solving the problem from scratch itself.

A 7B small model trained with Sakana AI's new method is more effective at passing on reasoning skills than the 671B DeepSeek-R1.

| Teacher | Student | Final model | AIME 2024 | MATH 500 | GPQA Diamond | Overall |
| --- | --- | --- | --- | --- | --- | --- |
| N.A. | | Qwen-7B | 10.00 | 74.20 | 33.30 | 39.17 |
| DeepSeek-R1 (671B) | Qwen-7B | Bespoke-7B | 20.00 | 82.00 | 37.80 | 46.60 |
| RLT teacher (7B) | | ... | | | | |
A $10 billion valuation with zero products! Former OpenAI CTO's "star startup project": building "custom AI models for enterprises"
Hua Er Jie Jian Wen· 2025-06-24 08:39
Core Insights
- Thinking Machines Lab (TML), founded by former OpenAI CTO Mira Murati, has achieved a valuation of $10 billion after raising $2 billion in funding within five months of its establishment [1][2]

Group 1: Business Model and Strategy
- TML focuses on developing customized AI models driven by reinforcement learning, linking AI models to specific KPIs tracked by businesses to enhance revenue or profit [2]
- The company aims to provide tailored solutions for specific industries such as customer support, investment banking, and retail, potentially allowing clients to pay a premium for these services [2]
- TML plans to shorten development cycles by integrating open-source models through a technique called "model merging," which combines the strengths of multiple models without additional training [2]

Group 2: Talent and Acquisition Interest
- TML has assembled a team of over 20 top researchers and engineers from leading AI companies, including OpenAI and Anthropic, making it an attractive target for acquisition [3]
- Discussions have occurred between Meta's CEO Mark Zuckerberg and Murati regarding potential investment or acquisition, although no substantial progress has been made [3]
- Google Cloud is providing TML with NVIDIA-powered server rental services, which may lead to further investment from Google as TML's server rental expenses increase [3]

Group 3: Market Challenges
- TML faces significant competition from other AI startups like Scale AI and Turing, which are also developing customized AI consulting services for specific industries [4]
- The scalability of consulting services presents challenges, potentially limiting profit margins and growth rates [4]
- TML is exploring the development of additional AI applications or software to enhance profit margins [4]

Group 4: Future Product Plans
- TML is considering launching consumer-facing products, including a chatbot that could compete with OpenAI's ChatGPT, although specific details remain unclear [5]
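The "model merging" technique mentioned in this item — combining the strengths of multiple models without additional training — is commonly realized as a parameter-wise weighted average of checkpoints that share an architecture (the "model soup" family of methods). A toy sketch under that assumption; TML's actual technique is not disclosed, and plain floats stand in for weight tensors:

```python
def merge_models(state_dicts, weights=None):
    """Naive model merge: for each parameter name, take a weighted
    average across checkpoints. Requires all checkpoints to share the
    same architecture (identical parameter names and shapes)."""
    n = len(state_dicts)
    if weights is None:
        weights = [1.0 / n] * n  # uniform average by default
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name] for w, sd in zip(weights, state_dicts))
    return merged
```

With real networks the same loop runs over tensors instead of floats; more sophisticated merges (e.g. task-vector arithmetic) weight or sign-resolve parameters rather than averaging them uniformly.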
New reinforcement learning finding: no math samples needed — game-only training greatly boosts AI reasoning
机器之心· 2025-06-24 06:46
Core Viewpoint
- The research introduces a groundbreaking method called ViGaL (Visual Game Learning), which enhances multi-modal reasoning capabilities in AI models through game training, without the need for extensive mathematical training samples [5][11][24]

Group 1: Research Findings
- The study demonstrates that training AI models on simple games like Snake can significantly improve their performance in mathematical reasoning and multi-disciplinary tasks, achieving an average accuracy increase of 2.9% on mathematical benchmarks and 2.0% on multi-disciplinary reasoning tasks [11][15]
- The research team utilized a 7B parameter model, Qwen2.5-VL, and found that reinforcement learning through game play outperformed traditional methods that relied on mathematical or multi-disciplinary data [11][15]
- The findings suggest that game training can lead to stronger cross-domain generalization, allowing models to transfer skills learned in gaming to complex reasoning tasks in mathematics and other fields [7][11]

Group 2: Game Design and Training Methodology
- The research involved two complementary training games: Snake, which focuses on path planning and spatial navigation, and a custom-designed 3D rotation game that enhances spatial geometric understanding [18][19]
- The design philosophy of the games is complementary, with Snake improving 2D coordinate-related mathematical performance and the rotation game targeting angle and length reasoning [20]
- Joint training on both games proved to be more effective than training on either game alone, showcasing the potential for diverse gaming tasks to enhance AI performance [20]

Group 3: Implications and Future Directions
- The success of ViGaL indicates a potential new trend in AI training, suggesting that well-designed games could serve as synthetic tasks to develop multi-modal reasoning capabilities when high-quality human data is scarce [22][23]
- This game-based training paradigm offers unique advantages over traditional methods, emphasizing the importance of cultivating underlying general reasoning abilities rather than solely focusing on direct task learning [23]
- The research highlights that allowing AI to "play games" may be more effective than conventional training methods, especially as challenges arise in scaling traditional approaches [24]
Everbright Securities: L4 pure vision may spark another technological shift; continue to watch the intelligent-driving theme
Zhi Tong Cai Jing· 2025-06-24 03:15
Group 1
- The report from Everbright Securities indicates a positive outlook for the domestic urban intelligent-driving penetration rate, expecting a turning point in 2025E and rapid growth from 2026E onward [1]
- The focus for the L2+ market is on promoting affordable intelligent vehicles priced between 100,000 and 200,000 yuan, while the L4 market centers on the breakthrough to commercial scale for Robotaxi [1]
- Recommendations include Tesla and the steering supplier Nexteer for L4 pure-vision Robotaxi commercialization, as well as Xpeng Motors, with a suggestion to pay attention to Li Auto, NIO, and Pony.ai [1]

Group 2
- The acceleration of Robotaxi commercialization is nearing a scale-up inflection point, with significant breakthroughs in order volumes and external collaborations among leading global Robotaxi companies since the second half of 2024 [1]
- The report suggests that the core methodology for achieving L4 may involve reinforcement learning combined with world models, in contrast with L2+, which primarily relies on imitation learning [2]
- The complexity of L4 implementation is expected to increase due to challenges in data construction, algorithm development, and the need for substantial computational resources [2]

Group 3
- The report highlights that the dual paths of lidar and pure-vision technology will continue from L2+ to L4, despite drawbacks of lidar technology such as latency and conflicts in multi-sensor fusion [3]
- The key to achieving commercial scalability for L4 lies in technological upgrades and cost reductions, as hardware costs are expected to rise [3]
- The VLA (Vision-Language-Action) model combined with world models is anticipated to become a mainstream trend in the intelligent-driving industry, although it has not yet been fully realized [4]