Reinforcement Learning
Help for those who can't crack CUDA: Devin's developer open-sources Kevin, using reinforcement learning to generate CUDA kernels
机器之心· 2025-05-07 04:34
机器之心 report | Editors: 蛋酱, 泽南

On Wednesday, Cognition AI, the well-known AI startup that previously released "the world's first AI software engineer," open-sourced Kevin-32B, a large model trained with reinforcement learning to write CUDA kernels. Kevin-32B is based on QwQ-32B and was trained with multi-turn reinforcement learning using GRPO on the KernelBench dataset, achieving top-tier reasoning performance that surpasses o3 and o4-mini. The machine learning community has shown great interest: some said they had long been waiting for DeepSeek-R1-style training to be applied to improving code efficiency, and someone has finally stepped up. In a blog post, Cognition AI detailed the mechanics of the new model's reinforcement learning training. Coding is an iterative process: we write and run a program, evaluate the result, and refine the code based on feedback. Recent advances in LLM code generation try to fold this process into inference time, using methods such as parallel sampling. While effective, these methods rely on search rather than actual learning, since the model weights stay frozen. Cognition AI instead explored multi-turn reinforcement learning, using intermediate feedback from the environment and masking the model's thinking to avoid context explosion across turns. Their proposed model, Kev ...
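The multi-turn loop described above (generate a kernel, execute it, feed back results, while dropping earlier chain-of-thought so the context stays bounded) can be sketched roughly as follows. This is an illustrative simplification, not Cognition's implementation; `generate`, `run_kernel`, and the reward shaping are all hypothetical stand-ins.

```python
# Hypothetical sketch of multi-turn RL data collection with masked thinking.
# Earlier reasoning is discarded between turns; only the code and the
# execution feedback are carried forward, keeping the context small.

def multi_turn_episode(generate, run_kernel, task, max_turns=4):
    """Collect (context, response, reward) triples for one kernel task."""
    context = [task]      # running context fed to the model
    trajectory = []
    for _ in range(max_turns):
        thinking, code = generate(context)   # model emits reasoning + kernel
        result = run_kernel(code)            # compile, run, and benchmark
        reward = 0.0
        if result.compiles:
            reward += 0.1                    # partial credit for compiling
        if result.correct:
            reward += 1.0 + result.speedup   # reward faster correct kernels
        trajectory.append((list(context), thinking + code, reward))
        # Mask the thinking: next turn sees only task, code, and feedback.
        context = [task, code, result.feedback]
        if result.correct:
            break
    return trajectory
```

The key design point is the last context assignment: the verbose reasoning never re-enters the prompt, so an episode of several refinement turns does not blow up the context window.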
A 10,000-character deep dive into reinforcement learning: can decentralized reinforcement learning be realized?
机器之心· 2025-05-07 04:34
Core Insights
- Reinforcement Learning (RL) is emerging as a pivotal method for enhancing AI models, particularly in the context of decentralized systems [2][3][20]
- The article outlines a timeline of AI scaling methods, emphasizing the shift from pre-training to RL-based approaches for model improvement [6][10][20]
- DeepSeek's innovative use of RL in their models, particularly R1-Zero, demonstrates a new paradigm for self-improvement in AI without heavy reliance on human data [25][26][51]

Group 1: Historical Context of AI Scaling
- The initial scaling laws established the importance of data in training, leading to the understanding that many models were under-trained relative to their parameters [6][10]
- The introduction of the Chinchilla Scaling Law highlighted the optimal data-to-parameter ratio, prompting researchers to utilize significantly more data for training [6][10]
- The evolution of scaling methods culminated in the recognition of the limitations of pre-training data availability, as noted by Ilya Sutskever [19][20]

Group 2: DeepSeek's Model Innovations
- DeepSeek's R1-Zero model showcases the potential of RL to enhance model performance with minimal human intervention, marking a significant advancement in AI training methodologies [25][26][51]
- The model employs a recursive improvement process, allowing it to generate and refine its own reasoning paths, thus reducing dependency on new human data [26][48]
- The transition from traditional supervised fine-tuning (SFT) to a GRPO (Group Relative Policy Optimization) framework simplifies the RL process and reduces computational overhead [44][46]

Group 3: Decentralized Reinforcement Learning
- The article discusses the necessity of a decentralized framework for training and optimizing AI models, emphasizing the need for a robust environment to generate diverse reasoning data [67][72]
- Key components of a decentralized RL system include a foundational model, a training environment for generating reasoning data, and an optimizer for fine-tuning [67][70]
- The potential for decentralized networks to facilitate collaborative learning and data generation is highlighted, suggesting a shift in how AI models can be developed and improved [72][78]

Group 4: Future Directions
- The exploration of modular and expert-based models is suggested as a promising avenue for future AI development, allowing for parallel training and improvement of specialized components [106][107]
- The integration of decentralized approaches with existing frameworks like RL Swarm indicates a trend towards more collaborative and efficient AI training methodologies [102][107]
- The ongoing research into optimizing decentralized training environments and validation mechanisms is crucial for the advancement of AI capabilities [75][78]
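GRPO, named above as the optimizer that simplifies the RL process, replaces a learned value (critic) baseline with a group-relative one: several responses are sampled per prompt, and each response's advantage is its reward normalized against its own group. A minimal sketch of that advantage computation, following the published GRPO formulation (the `eps` constant is an illustrative numerical-stability detail):

```python
from statistics import mean, pstdev

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages: normalize each sampled response's reward
    by the mean and population std of its own group. Because the baseline
    comes from the group itself, no separate value network is needed,
    which is where the computational savings over PPO-style RL come from."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]
```

Responses scoring above their group mean get positive advantage (their tokens are reinforced); below-mean responses get negative advantage, and the advantages of a group sum to roughly zero.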
First place on both VDC and VBench! A homegrown video model honed with reinforcement learning surpasses Sora and Pika
机器之心· 2025-05-06 04:11
Core Insights
- The article discusses the integration of reinforcement learning into video generation, highlighting the success of models like Cockatiel and IPOC in achieving superior performance in video generation tasks [1][14].

Group 1: Video Detailed Captioning
- The video detailed captioning model serves as a foundational element for video generation, with the Cockatiel method achieving first place on the VDC leaderboard, outperforming several prominent multimodal models [3][5].
- Cockatiel's approach involves a three-stage fine-tuning process that leverages high-quality synthetic data aligned with human preferences, resulting in a model that excels in fine-grained expression and human preference consistency [5][8].

Group 2: IPOC Framework
- The IPOC framework introduces an iterative reinforcement learning preference optimization method, achieving a total score of 86.57% on the VBench leaderboard, surpassing various well-known video generation models [14][15].
- The IPOC method consists of three stages: human preference data annotation, reward model training, and iterative reinforcement learning optimization, which collectively enhance the efficiency and effectiveness of video generation [19][20].

Group 3: Model Performance
- Experimental results indicate that the Cockatiel series models generate video descriptions with comprehensive dimensions, precise narratives, and minimal hallucination phenomena, showcasing higher reliability and accuracy compared to baseline models [7][21].
- The IPOC-2B model demonstrates significant improvements in temporal consistency, structural rationality, and aesthetic quality in generated videos, leading to more natural and coherent movements [21][25].
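The reward-model stage described above is typically trained on pairwise human preferences with a Bradley-Terry style objective: the model learns to score the human-preferred sample above the rejected one. A generic sketch of that loss (this is the standard formulation, not the IPOC authors' code; the score inputs are assumed to come from some reward model):

```python
import math

def pairwise_preference_loss(score_chosen, score_rejected):
    """Bradley-Terry style preference loss:
    loss = -log(sigmoid(s_chosen - s_rejected)).
    Minimizing it widens the margin by which the reward model
    scores the human-preferred sample over the rejected one."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At zero margin the loss is log 2; it shrinks toward zero as the reward model ranks the preferred sample ever more confidently above the rejected one, which is exactly the signal the subsequent RL optimization stage consumes.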
OpenAI abandons its for-profit conversion! Altman: the nonprofit stays in control; under tariff pressure, Temu halts direct shipping of Chinese goods to the U.S.; NVIDIA readies another China-specific AI chip
雷峰网· 2025-05-06 00:29
Group 1
- Temu has announced the cessation of direct sales of Chinese products to the U.S. due to a 130% import tariff, shifting to local sellers for U.S. market sales [5][6]
- The U.S. Customs policy change effective May 2, 2025, will eliminate the small-package tariff exemption for goods from mainland China and Hong Kong, requiring proper customs declarations and payment of applicable tariffs [5]
- The number of full-service sellers on Temu's U.S. site has significantly decreased, with some sellers experiencing over 50% of their products being delisted [6]

Group 2
- Neta Auto's app and website experienced significant downtime due to unpaid traffic fees, leading to accessibility issues during the holiday period [8]
- Neta Auto's sales have sharply declined in 2023, revealing operational challenges, including layoffs and payment delays to suppliers [9]
- The company previously achieved a sales record of approximately 152,100 vehicles in 2022, becoming a leading player among new car manufacturers [8]

Group 3
- Major car manufacturers, including Xiaomi and Huawei, have rebranded their "smart driving" features as "assisted driving," reflecting a shift in marketing strategy [10][11]
- The term "smart driving" is becoming less prominent in product promotions, with many companies opting for more conservative language in their marketing [11]

Group 4
- Xiaomi's international market department has undergone leadership changes, with Xu Fei appointed as the new general manager [16]
- Xu Fei has been with Xiaomi for 15 years and previously served as the head of the MIUI product team [16]

Group 5
- Ant Group plans to separately list its overseas division, Ant International, in Hong Kong, which accounts for approximately 20% of Ant Group's revenue [15]
- Ant International focuses on cross-border payment services, leveraging products like Alipay+ and WorldFirst [15]

Group 6
- NVIDIA is developing a new AI chip tailored for the Chinese market after the U.S. government banned the export of its H20 chip, with samples expected to be available in June [21]
- The new chip design aims to comply with U.S. export regulations while maintaining NVIDIA's market presence in China [21]

Group 7
- OpenAI has decided to maintain its non-profit structure, abandoning plans for a profit-driven transformation, which may complicate future funding efforts [20]
- The organization emphasizes its mission to ensure that AGI benefits all of humanity, contrasting with traditional profit-driven corporate governance [20]
Liang Wenfeng and Yang Zhilin "collide" again
华尔街见闻· 2025-05-05 12:26
Core Viewpoint
- The article discusses the competitive landscape of large model development in China, focusing on the advancements of DeepSeek and Kimi, and the challenges they face from larger companies like Alibaba and Baidu [2][15].

Group 1: Model Developments
- DeepSeek launched its new model, DeepSeek-Prover-V2, with a parameter scale of 671 billion, significantly larger than the previous version's 7 billion, enhancing efficiency and accuracy in mathematical tasks [3][4].
- Kimi, developed by the team at Moonshot AI, released a model called Kimina-Prover with 1.5-billion- and 7-billion-parameter distilled versions, achieving a miniF2F test pass rate of 80.7% [3][4].
- The performance of DeepSeek-Prover-V2 surpassed that of Kimina-Prover on both miniF2F and PutnamBench, indicating a competitive edge in mathematical reasoning capabilities [4].

Group 2: Competitive Challenges
- DeepSeek faces declining interest in its R1 model, with competitors like Alibaba rapidly advancing their models, prompting expectations for new releases like R2 or V4 [6][18].
- Kimi is also under pressure from ByteDance's Doubao and Tencent's Yuanbao, necessitating continuous innovation to maintain its market position [7][16].
- The article highlights the rapid growth of Kimi, which reached 20 million monthly active users in November 2024, trailing behind Doubao's 56 million [16].

Group 3: Market Dynamics
- Alibaba's new model, Qwen3, is described as a hybrid reasoning model that outperforms DeepSeek's R1, with a parameter count only one-third of R1's [19].
- Baidu's recent releases, including Wenxin 4.5 Turbo, are noted for their superior performance and lower costs compared to DeepSeek, with criticisms regarding DeepSeek's speed and pricing [20][21].
- The competitive landscape is intensifying, with more players entering the large model open-source race, emphasizing the need for advanced technology to set industry standards [22].
Learning while practicing, reasoning awakened: LUFFY makes reinforcement learning learn-as-you-go!
机器之心· 2025-05-05 03:40
Solving the dilemma of "learning without practicing" and "practicing without learning"

Imagine you are preparing for a high-level math competition. If you only memorize standard answers to past problems and never solve anything yourself, you will likely be helpless when a new type of problem appears. Conversely, if you work behind closed doors, relying only on your own trial and error without ever consulting the experience of teachers and stronger solvers, progress will be painfully slow. These mirror two long-standing extremes in AI model training: "imitation learning" copies demonstrations without self-practice, while "reinforcement learning" explores on its own without drawing on existing experience. Both the "learn without practicing" and "practice without learning" strategies have drawbacks: the former tends to learn fast but generalize poorly, while the latter explores diligently but inefficiently. Is there a way to get the best of both, letting a model draw on expert experience while retaining autonomous exploration? Recently, a research team from Shanghai AI Laboratory, together with Westlake University, Nanjing University, and the Chinese University of Hong Kong, proposed a new reinforcement learning paradigm: LUFFY (Learning to reason Under oFF-policY guidance).

Paper: https://arxiv.org/abs/2504.14945
Code: https://github.com/ElliottYan/LUFFY

Figure 1. Overall performance on six competition-level mathematical reasoning benchmarks. On A ...
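The blend of expert guidance and self-exploration described above can be illustrated as a mixed training objective: part of the loss comes from the model's own advantage-weighted rollouts, part from the likelihood it assigns to expert (off-policy) traces. This is an illustrative simplification of off-policy-guided RL, not the LUFFY paper's exact loss; the function and parameter names are hypothetical.

```python
def mixed_policy_loss(on_policy_terms, off_policy_terms, beta=0.5):
    """Hypothetical blend of self-exploration and expert guidance.
    on_policy_terms:  advantage-weighted log-probs of the model's own rollouts
    off_policy_terms: log-probs the model assigns to expert demonstration traces
    beta:             weight on the expert demonstrations (0 = pure RL,
                      1 = pure imitation)."""
    on = -sum(on_policy_terms) / max(len(on_policy_terms), 1)
    off = -sum(off_policy_terms) / max(len(off_policy_terms), 1)
    return (1 - beta) * on + beta * off
```

Setting `beta` between the extremes is the point of the analogy in the text: the model keeps practicing on its own rollouts while still being pulled toward the expert's solution paths.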
Liang Wenfeng and Yang Zhilin "collide" again
虎嗅APP· 2025-05-04 08:29
Core Viewpoint
- The article discusses the competitive landscape of large model development in China, focusing on the advancements and challenges faced by companies like DeepSeek and Kimi, as well as the impact of larger tech firms like Alibaba and Tencent on the market [2][4][12].

Group 1: Model Developments
- DeepSeek launched its new model, DeepSeek-Prover-V2, with a parameter scale of 671 billion, significantly larger than the previous version's 7 billion, resulting in improved efficiency and accuracy in mathematical tasks [2][9].
- Kimi, developed by the Moonshot AI team, also released a model for formal theorem proving, with smaller parameter scales of 1.5 billion and 7 billion, achieving an 80.7% pass rate on miniF2F tests [2][3].
- The evolution of DeepSeek's models follows a steady timeline, with Prover-series updates running from March 2024 to the latest Prover-V2 in April 2025 [8][9].

Group 2: Competitive Landscape
- DeepSeek faces increasing competition from Alibaba's new model Qwen3, which is touted as a hybrid reasoning model with superior performance despite having only one-third the parameters of DeepSeek's R1 model [14][15].
- Kimi saw rapid growth, reaching 20 million monthly active users within a year, but is now challenged by Tencent's Yuanbao, which has surpassed Kimi in user numbers due to aggressive marketing [12][13].
- The article highlights the need for multiple leading models in the Chinese market, suggesting that competition and innovation should be encouraged rather than focusing on a single dominant player [14][15].

Group 3: Future Directions
- DeepSeek's founder has indicated three paths toward AGI: mathematics and code, multimodal learning, and natural language processing, viewing mathematics as a verifiable system for high intelligence [7].
- The upcoming R2 model is expected to enhance reinforcement learning capabilities, while the V4 model may involve a longer development cycle due to significant changes in pre-training methods [10][11].
New breakthroughs in robotics! An overview of recent major papers in the top journal IJRR
机器人大讲堂· 2025-05-03 08:04
Group 1
- The article reviews seven selected papers published in the International Journal of Robotics Research, covering research directions in robotics such as soft actuators, human-robot interaction, dual-arm robots, multi-robot systems, and bipedal locomotion control [1][6][18][27][38][48][58]

Group 2
- A new low-profile soft rotary pneumatic actuator was designed, addressing the limitations of traditional soft pneumatic actuators in confined spaces and providing a compact solution for applications in wearable and biomedical devices [2][4]
- The THÖR-MAGNI dataset was introduced to overcome data bottlenecks in social navigation and human-robot interaction, featuring multi-modal data collection and extensive scene coverage [6][11][14]
- The FMB benchmark was developed to standardize robotic manipulation research, offering a diverse set of objects and a modular framework for imitation learning [18][20][24][26]
- A framework for dual-arm robots to manipulate deformable linear objects in constrained environments was proposed, achieving high success rates in complex tasks [27][30][31][34]
- A real-time planning method for large heterogeneous multi-robot systems was introduced, significantly improving computational efficiency and robustness in dynamic environments [38][40][45]
- A survey on communicating robot learning during human-robot interaction highlighted the importance of a closed-loop communication framework for enhancing collaboration [48][50][53][55]
- Reinforcement learning was applied to bipedal locomotion control, demonstrating significant advances in adaptability and robustness in complex environments [58][60][62]
OpenAI's latest technical report: the reason GPT-4o turned sycophantic was completely unexpected
量子位· 2025-05-03 04:05
一水, from 凹非寺 | 量子位 (QbitAI)

GPT-4o turned "sycophantic" after its update? The follow-up technical report has arrived. A freshly published mea culpa from OpenAI drew over a million onlookers. CEO Altman also made a show of contrition, reposting it immediately and saying: (the new report) reveals why the GPT-4o update failed, what OpenAI learned from it, and what countermeasures we will take.

In short, the report says the bug from about a week earlier lay in "reinforcement learning": the update introduced an additional reward signal based on user feedback, namely thumbs-up or thumbs-down ratings on ChatGPT. While this signal is usually useful, it may have gradually tilted the model toward more pleasing responses. In addition, although there is no clear evidence yet, user memory may in some cases have amplified the sycophantic behavior. In a word, OpenAI believes that several measures which individually might have improved the model combined to make it "sycophantic."

After reading the report, most netizens' reaction was along the lines of: not a bad attitude toward owning up~ Some even called it OpenAI's most detailed report in years. So what exactly happened? Let's dig in.

Full event recap

On April 25, OpenAI updated GPT-4o. The changelog on its website mentioned at the time ...
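The failure mode the report describes, an extra thumbs-up/down reward term nudging the model toward agreeable answers, can be illustrated abstractly. The weights and signal names below are hypothetical, not OpenAI's actual reward design; the sketch only shows why an individually reasonable signal can dominate when weighted too heavily.

```python
def combined_reward(base_reward, user_feedback, w_feedback=0.3):
    """Hypothetical combined RL reward: a base quality score plus a
    user-feedback signal (+1 thumbs up, -1 thumbs down, 0 none).
    If w_feedback is large relative to the spread of base_reward,
    maximizing this objective favors answers users *like* over answers
    that are accurate: the sycophancy failure mode in the report."""
    return base_reward + w_feedback * user_feedback
```

With a large enough `w_feedback`, a flattering but wrong answer (low base reward, thumbs up) can outscore a blunt but correct one (high base reward, thumbs down), which is the combined effect OpenAI says the individual changes produced.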
DeepSeek's new math model smashes records! A 7B model autonomously discovers skills the 671B model lacks
量子位· 2025-05-01 03:53
DeepSeek drops a bombshell! Its new model focuses on mathematical theorem proving and sets new records on several hard benchmarks. On the Putnam test, the new model DeepSeek-Prover-V2 pushed the record to 49 problems solved. The previous leader solved only 10 of 657 problems: Kimina-Prover, a collaboration between Kimi and Numina, the AIME 2024 champion team. DeepSeek-R1, which was not optimized for theorem proving, solved only 1. All of which makes the still-unreleased R2 even more anticipated.

PutnamBench leaderboard (out of 657 problems, Lean):

| # | Model | num-solved | compute |
| --- | --- | --- | --- |
| 1 | Kimina-Prover-7B-Distill | 10 | pass@192 |
| 2 | Self-play Theorem Prover | 8 | pass@3200 |
| 3 | Goedel-Prover-SFT | 7 | pass@512 |
| 4 | ABEL | 7 | pass@596 |
| 5 | InternLM2.5-StepPr ... | | |
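The pass@N figures in the leaderboard's compute column mean a problem counts as solved if any of N sampled proof attempts succeeds. When n ≥ k attempts are drawn and c of them succeed, pass@k is commonly computed with the unbiased estimator popularized by OpenAI's Codex evaluation, pass@k = 1 − C(n−c, k) / C(n, k). A small sketch (a standard formula, not tied to any one leaderboard's evaluation code):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: the probability that at least one of k
    attempts (drawn without replacement from n total attempts, c of which
    succeed) is correct. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so a success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

This is why entries at different sample budgets (pass@192 vs pass@3200) are hard to compare directly: a larger k mechanically raises the score for the same underlying success rate.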