Reinforcement Learning
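A 7B Model Outperforms DeepSeek-R1: By Imitating Human Teachers, Weak Models Can Teach Strong Reasoning LLMs | From a Transformer Author's Team
QbitAI· 2025-06-24 13:36
By Bu Yuan, reporting from Aofeisi. QbitAI | WeChat official account QbitAI. With Thinking mode in vogue, it is time for teacher models to learn "heuristic" teaching as well. Sakana AI, the high-profile AI company founded by Transformer co-author Llion Jones, has arrived with a new method. The method asks the teacher model to behave like a good human teacher: given a known solution, it outputs a clear step-by-step explanation instead of solving the problem from scratch. A 7B model trained with Sakana AI's new method is more effective at teaching reasoning skills than the 671B DeepSeek-R1.
| Teacher | Student | Final model | AIME 2024 | MATH 500 | GPQA Diamond | Overall |
| --- | --- | --- | --- | --- | --- | --- |
| N.A. | | Qwen-7B | 10.00 | 74.20 | 33.30 | 39.17 |
| DeepSeek-R1 (671B) | Qwen-7B | Bespoke-7B | 20.00 | 82.00 | 37.80 | 46.60 |
| RLT teacher (7B) | | ... | | | | |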
Valued at $10 Billion with Zero Products: The Former OpenAI CTO's "Star Startup" Aims to Build "Custom AI Models for Enterprises"
Hua Er Jie Jian Wen· 2025-06-24 08:39
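Thinking Machines Lab (TML), the AI startup founded by former OpenAI Chief Technology Officer Mira Murati, is rising at remarkable speed, with the goal of helping enterprises grow revenue through customized AI models. According to media reports, the company, founded less than five months ago, has completed a $2 billion funding round at a valuation of $10 billion.
A customized AI strategy driven by reinforcement learning: TML's business model centers on using reinforcement learning to develop customized AI models. According to people who have spoken with Murati, the company ties its AI models to the specific KPIs that an enterprise tracks, aiming to help customers directly raise revenue or profit. Investors call this targeted strategy "RL for businesses," meaning it is intended to give enterprises more precise solutions.
People familiar with the matter told the media that Meta CEO Mark Zuckerberg has discussed a possible investment or acquisition with Murati in recent months, but the talks made no substantive progress. In addition, Google Cloud is renting Nvidia-powered servers to TML, which may prompt Google to invest further in the expectation that TML will increase its server rental spending.
This customized approach may give TML a competitive edge in specific industries such as customer support, investment banking, or retail, serving niche market needs for which customers may pay a premium. ...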
New Reinforcement Learning Finding: No Math Samples Needed, Game Training Alone Sharply Boosts AI Reasoning
Jiqizhixin (Synced)· 2025-06-24 06:46
Core Viewpoint
- The research introduces a method called ViGaL (Visual Game Learning), which improves the multi-modal reasoning capabilities of AI models through game training, without requiring extensive mathematical training samples [5][11][24].
Group 1: Research Findings
- The study shows that training AI models on simple games such as Snake can significantly improve their performance on mathematical reasoning and multi-disciplinary tasks, with an average accuracy increase of 2.9% on mathematical benchmarks and 2.0% on multi-disciplinary reasoning tasks [11][15].
- The research team used a 7B-parameter model, Qwen2.5-VL, and found that reinforcement learning through game play outperformed traditional methods that relied on mathematical or multi-disciplinary data [11][15].
- The findings suggest that game training leads to stronger cross-domain generalization, allowing models to transfer skills learned in games to complex reasoning tasks in mathematics and other fields [7][11].
Group 2: Game Design and Training Methodology
- The research used two complementary training games: Snake, which focuses on path planning and spatial navigation, and a custom-designed 3D rotation game that strengthens spatial geometric understanding [18][19].
- The two games are complementary by design: Snake improves performance on mathematics involving 2D coordinates, while the rotation game targets reasoning about angles and lengths [20].
- Joint training on both games was more effective than training on either game alone, which points to the potential of diverse game tasks for improving AI performance [20].
Group 3: Implications and Future Directions
- The success of ViGaL points to a possible new trend in AI training: well-designed games could serve as synthetic tasks for developing multi-modal reasoning capabilities when high-quality human data is scarce [22][23].
- This game-based training paradigm has distinct advantages over traditional methods, because it cultivates underlying general reasoning abilities rather than focusing solely on direct task learning [23].
- The research suggests that letting AI "play games" may be more effective than conventional training methods, especially as traditional approaches become harder to scale [24].
Everbright Securities: L4 Pure Vision May Trigger Another Technological Shift; Keep Watching the Intelligent Driving Theme
Zhi Tong Cai Jing· 2025-06-24 03:15
Group 1
- The Everbright Securities report is positive on the penetration rate of urban intelligent driving in China, expecting a turning point in 2025E and rapid growth from 2026E onward [1].
- In the L2+ market, the focus is on promoting affordable intelligent vehicles priced between 100,000 and 200,000 yuan, while in the L4 market the focus is on Robotaxi reaching commercial scale [1].
- For the commercialization of L4 pure-vision Robotaxi, the report recommends Tesla and the steering supplier Nexteer, as well as Xpeng Motors, and suggests paying attention to Li Auto, NIO, and Pony.ai [1].
Group 2
- Robotaxi commercialization is accelerating toward a scale-up inflection point, with leading global Robotaxi companies achieving significant breakthroughs in order volumes and external partnerships since the second half of 2024 [1].
- The report suggests that the core methodology for achieving L4 may be reinforcement learning combined with world models, in contrast with L2+, which relies primarily on imitation learning [2].
- L4 implementation is expected to be more complex because of challenges in data construction, algorithm development, and the need for substantial computational resources [2].
Group 3
- The report notes that the two technology paths, lidar and pure vision, will both continue from L2+ into L4, despite lidar's drawbacks such as latency and conflicts in multi-sensor fusion [3].
- The key to commercial scalability for L4 lies in technological upgrades and cost reductions, as hardware costs are expected to rise [3].
- The VLA (Vision-Language-Action) model combined with world models is expected to become a mainstream trend in the intelligent driving industry, although it has not yet been fully realized [4].
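Trained Only on Math, Yet It Beats o1 in Physics, Chemistry, and Biology! New Reinforcement Learning Algorithm Delivers Significant Performance Gains and Mitigates Training Collapse
QbitAI· 2025-06-23 04:45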
Core Viewpoint
- The article introduces a new reinforcement learning algorithm, CPGD (Clipped Policy Gradient Optimization with Policy Drift), which significantly improves training stability and performance in multi-modal reasoning tasks and outperforms existing algorithms such as GRPO and RLOO [1][6][11].
Group 1: Algorithm Development
- CPGD alleviates training instability and improves performance, with an average performance gain of 11% over models trained with GRPO [1][14].
- The MM-Eureka-CPGD-7B model improves by 21.8% on the MMK12 test set compared with the base model Qwen2.5-VL-7B, demonstrating stronger generalization [1][14].
- The algorithm applies a logarithm to the policy ratio and adds a policy drift term to stabilize training and limit how far the policy changes per update, which proves more effective than existing methods [8][11].
Group 2: Model Performance
- The MM-Eureka-CPGD-32B model surpasses the o1 model in several subjects, despite being trained solely on mathematical datasets [2][14].
- The MM-Eureka series has attracted significant attention, with over 10,000 downloads and nearly 100 citations since its release [3][14].
- Performance metrics indicate that MM-Eureka-CPGD-7B outperforms leading models such as OpenAI-o1 and GPT-4o across multiple datasets [13][15].
Group 3: Data and Framework
- The MMK12 dataset, which contains over 15,000 multi-modal math reasoning questions, addresses the problems of single-type questions and inaccurate answers, and has become a key benchmark for multi-modal reasoning tasks [16][17].
- The multi-modal reinforcement learning framework, built on OpenRLHF, supports a range of models and algorithms and improves scalability and stability for large-scale training [4][5].
- MM-PRM (Multi-modal Process Reward Model) focuses on the reasoning process, providing a structured way to evaluate and guide model inference [18][21].
Group 4: Future Directions
- Combining PRM with reinforcement learning is seen as a promising direction for further work, with the aim of improving model robustness and interpretability in complex reasoning tasks [22][24].
- The team plans to continue advancing multi-modal reasoning training and systematic optimization, and invites the community to participate in development [25].
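The summary above describes CPGD only in words: a logarithm applied to the policy ratio, clipping, and a policy drift term. The sketch below is one possible reading of that description in plain Python, not the paper's exact objective. The function name `cpgd_style_loss`, the parameters `eps` and `drift_coef`, and the quadratic form of the drift penalty are all assumptions made for illustration.

```python
import math


def cpgd_style_loss(logp_new, logp_old, advantages, eps=0.2, drift_coef=0.1):
    """Illustrative per-batch loss: clipped log-ratio surrogate plus drift penalty.

    logp_new / logp_old: log-probabilities of the sampled actions under the
    current and the previous policy. All names here are hypothetical.
    """
    lower, upper = math.log(1.0 - eps), math.log(1.0 + eps)
    total = 0.0
    for new, old, adv in zip(logp_new, logp_old, advantages):
        log_ratio = new - old  # log(pi_new / pi_old), instead of the raw ratio
        clipped = min(max(log_ratio, lower), upper)
        # Pessimistic choice between the unclipped and clipped terms, as in PPO.
        surrogate = min(log_ratio * adv, clipped * adv)
        # Assumed drift term: a quadratic penalty on how far the policy moved.
        drift = 0.5 * log_ratio ** 2
        total += -surrogate + drift_coef * drift
    return total / len(advantages)


if __name__ == "__main__":
    # Identical policies: no surrogate gain and no drift, so the loss is zero.
    print(cpgd_style_loss([-1.0, -2.0], [-1.0, -2.0], [1.0, -1.0]))
    # A large policy move is clipped and additionally penalized by the drift term.
    print(cpgd_style_loss([0.0], [-1.0], [1.0]))
```

In this reading, the clipping bounds how much credit a single update can claim, while the drift penalty keeps growing as the new policy moves away from the old one, which is the stabilizing role the summary attributes to the policy drift term.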
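Post-2000s Founders Take On Embodied Intelligence, Aiming for the "Model 3" of Robotics! A 21-Degree-of-Freedom Dexterous Hand Is Already Out
QbitAI· 2025-06-22 04:46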
Core Viewpoint
- Lingchu Intelligent has launched a self-developed dexterous hand with 21 degrees of freedom, aiming for high-precision manipulation beyond what common 6-degree-of-freedom grippers can achieve [1][3][2].
Group 1: Product Features and Goals
- The dexterous hand supports 16 active degrees of freedom, enabling complex tasks such as gripping, rotating, and precise insertion [1][2].
- The company aims to bring the price of a complete robot system down to $10,000 (approximately 71,885 yuan), echoing Tesla's Model 3 pricing strategy [3][29].
- Lingchu's humanoid robot uses a "wheeled + two-handed" structure, moving away from traditional gripper designs [4][6].
Group 2: Technical Challenges and Innovations
- Achieving 21 degrees of freedom creates significant manufacturing and stability challenges, which makes it difficult to lower production costs [3][10].
- Lingchu argues that robotic hands need more dexterity and more degrees of freedom to handle tasks such as tool use and precision assembly, which simple grippers cannot do [8][10].
- The company's algorithms use a layered end-to-end architecture that combines fast and slow processing to improve task execution and decision-making [22][23].
Group 3: Market Strategy and Vision
- Lingchu's strategy is to integrate hardware and software deeply, defining the user experience at the system level rather than selling individual components [27][26].
- The company aims to build a complete product ecosystem covering the robot, action systems, data, and task delivery, positioning itself to scale in the market [28][29].
- Lingchu is working to make humanoid robots commercially viable, much as Tesla's Model 3 transformed the electric vehicle market [36][35].
From RLHF and PPO to GRPO and Training Reasoning Models: The Reinforcement Learning Primer You Need
Jiqizhixin (Synced)· 2025-06-22 04:26
Core Insights
- Reinforcement Learning (RL) has become an essential technology in AI, particularly for large language models (LLMs) [1].
- The Unsloth team has released a comprehensive reinforcement learning tutorial that covers concepts from RLHF to GRPO and is accessible to beginners and advanced users alike [2][3].
Group 1: Understanding Reinforcement Learning
- The goal of reinforcement learning is to increase the likelihood of "good" outcomes while reducing the likelihood of "bad" ones [8][10].
- The key components of RL are the environment, the agent, actions, and reward functions, which together define the learning process [9][14].
- RLHF (Reinforcement Learning from Human Feedback) has gained popularity, particularly through OpenAI's implementation, which trains agents to generate outputs that humans judge useful [16][19].
Group 2: GRPO and Its Advantages
- GRPO (Group Relative Policy Optimization) is a method developed for training reasoning models. It differs from PPO (Proximal Policy Optimization) by removing the value model and relying on custom reward functions [22][24].
- GRPO estimates the average reward by sampling multiple outputs for a given question, and uses that estimate to optimize the model [27][28].
- The approach saves a significant amount of memory and can improve tasks beyond coding and mathematics, such as email automation and legal applications [30].
Group 3: Training with Unsloth
- Unsloth provides a detailed guide to training reasoning models with GRPO, which requires a minimum of 5GB VRAM for local training of models with up to 1.5 billion parameters [44].
- Training involves generating multiple answer variants for each question, scoring them with a reward function, and updating the model weights accordingly [45][57].
- Effective training requires a well-designed reward function and enough data, with at least 500 rows recommended for good results [49][50].
Group 4: Reward Functions and Validators
- Reward functions and validators both play a role in evaluating model outputs: the validator verifies whether an output is correct, while the reward function assigns a score based on correctness and quality [46][56].
- Examples include reward functions that reward correct answers and penalize incorrect or overly verbose responses [61].
- The design of the reward function is critical, because a poorly constructed one can inadvertently degrade model performance [57].
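The steps summarized above (sample several answers per question, score them with a reward function, and compare each score against the group average) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not Unsloth's implementation: the function names, the `Answer:` output convention, the reward values, and the 50-word verbosity limit are all hypothetical.

```python
import statistics


def reward(completion: str, reference: str, max_words: int = 50) -> float:
    """Hypothetical reward: favor correct answers, penalize wrong or verbose ones."""
    # Assumed convention: the final answer follows the last "Answer:" marker.
    final = completion.rsplit("Answer:", 1)[-1].strip()
    score = 2.0 if final == reference else -1.0
    if len(completion.split()) > max_words:
        score -= 0.5  # verbosity penalty
    return score


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (no value model)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Three sampled completions for the same question, "What is 2 + 2?".
    completions = [
        "Answer: 4",
        "I think 2 + 2 is five. Answer: 5",
        "word " * 60 + "Answer: 4",
    ]
    rewards = [reward(c, "4") for c in completions]
    for c, r, a in zip(completions, rewards, group_relative_advantages(rewards)):
        print(f"reward={r:+.1f} advantage={a:+.2f} text={c[:30]!r}")
```

Completions that score above the group mean get a positive advantage and are reinforced, while those below the mean are discouraged. Because the baseline comes from the group itself, no separate value model is needed, which is the source of the memory savings mentioned above.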
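VR-Robo: real2sim2real, a New Paradigm for Robot Visual Reinforcement Learning Navigation and Locomotion Control!
Heart of Embodied Intelligence (具身智能之心)· 2025-06-20 00:44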
Core Viewpoint
- The article discusses advances in legged robot navigation and locomotion control through a unified framework called VR-Robo, which addresses the difficulty of transferring policies learned in simulation to the real world [3][16].
Related Work
- Previous research has explored various ways to bridge the Sim-to-Real gap, but many methods depend on specific sensors and struggle to combine high-fidelity rendering with accurate geometric modeling [3][4].
Solution
- The VR-Robo framework uses geometric priors from images to reconstruct consistent scenes, builds interactive simulation environments with a GS-mesh hybrid representation, and uses neural reconstruction methods such as NeRF to generate high-fidelity scene images [4][5][16].
Experimental Analysis
- Comparative experiments against baseline methods, including imitation learning and textured-mesh approaches, were run to evaluate the VR-Robo framework [11][12].
- The reported metrics are Success Rate (SR) and Average Reaching Time (ART), and VR-Robo performs best across the difficulty levels tested [14][15].
Summary and Limitations
- VR-Robo trains visual navigation policies using only RGB images, enabling autonomous navigation in complex environments without additional sensors. However, it currently applies only to static indoor environments, and it is limited by training efficiency and by the structural accuracy of the reconstructed meshes [16].
Xpeng Wants More Than Just to "Stay at the Table"
Huxiu APP· 2025-06-19 23:55
Core Viewpoint
- The article discusses the strong growth and strategic positioning of two electric vehicle manufacturers, Xpeng and Leapmotor, covering their sales performance, product strategies, and marketing approaches in a competitive market.
Group 1: Sales Performance
- In the first five months of the year, both Xpeng and Leapmotor grew rapidly, with Leapmotor's sales up 161% year-on-year and Xpeng's up 293% [3][4].
- Both companies reported substantial revenue growth in Q1, with Leapmotor's revenue up 187% and Xpeng's up 142% year-on-year [4].
- Net losses shrank by 87% for Leapmotor and by 52% for Xpeng, indicating improved financial health [4].
Group 2: Product Strategy
- Xpeng's sales rebound is attributed to the successful launch of the MONA M03, which has become a best-seller and accounted for over 50% of Xpeng's monthly sales in several months [7].
- The MONA M03 is positioned as a cost-effective option with a CLTC range of 620 kilometers, which eases consumers' range anxiety [7][12].
- The vehicle includes user-friendly features such as smart parking and improved comfort, which appeal to a younger demographic [12][14].
Group 3: Marketing and Branding
- Xpeng has adopted an aggressive marketing strategy, including multiple product launches and media events, to raise brand visibility [4][6].
- The company has attracted a large female customer base, with female users accounting for 50% of MONA M03 orders, well above the market average [16][14].
- Xpeng's marketing events are designed to resonate with younger consumers, with engaging elements and celebrity endorsements [16][18].
Group 4: Technological Advancements
- Xpeng is focusing on technological innovation, introducing its self-developed "Turing AI chip" to strengthen autonomous driving capabilities [20][21].
- The company is using large-scale models and reinforcement learning to improve its autonomous driving technology, reflecting its commitment to advancing AI in vehicles [28][30].
- Xpeng's AI team has validated that scaling laws hold in autonomous driving, which informs its strategy for improving vehicle intelligence [28][29].
Xpeng Wants More Than Just to "Stay at the Table"
Hu Xiu· 2025-06-19 23:13
Core Insights
- Both Leapmotor and Xpeng have significantly increased their sales, with Leapmotor growing 161% and Xpeng 293% year-on-year from January to May. Their Q1 revenues also grew substantially, with Leapmotor up 187% and Xpeng up 142%. Net losses narrowed sharply, with Leapmotor's loss shrinking by 87% and Xpeng's by 52% [2].
- Xpeng's proactive marketing and product launch strategy contrasts with Leapmotor's more reserved approach, reflecting a different mindset in responding to market opportunities [2].
- Xpeng's recent product, the MONA M03, has been a key driver of its sales rebound, accounting for over 50% of monthly sales since its launch [7][12].
Sales and Marketing Strategy
- Xpeng's marketing strategy includes extensive media engagement and product launch events, such as the recent X9 launch in Hong Kong, which attracted nearly 500 media representatives [3][4].
- The company has focused on building a strong brand presence through promotional activities, including events aimed at actual car owners [2][3].
- The MONA M03's competitive pricing and features, such as a 620 km range, have made it appealing to consumers, particularly by addressing range anxiety [9][8].
Product Development and Features
- The MONA M03 was designed around user needs, balancing cost control with essential features, which has resonated with consumers [8][12].
- The vehicle adds features such as an electric tailgate and smart parking, while simplifying certain others to reduce costs [10][11].
- Xpeng's product team refined the MONA model efficiently, within a short time after acquiring it from Didi [12].
Consumer Demographics and Feedback
- The MONA M03 has attracted a notably high share of female consumers, with women making up 38.6% of users, significantly above the industry average [18][19].
- Feedback from female users highlights the vehicle's aesthetics and practical features, which contribute to its popularity among this demographic [20][21].
- Xpeng has adapted quickly to market feedback by introducing new interior options that appeal to female consumers, further boosting sales [21][25].
Technological Advancements
- Xpeng is focusing on technological innovation, particularly its self-developed "Turing AI chip," which will extend the capabilities of its vehicles, including the upcoming G7 model [27][30].
- The G7 will offer computing power significantly exceeding that of competitors, as part of Xpeng's strategy to differentiate itself in the market [30][31].
- The company is also exploring the application of scaling laws in AI to improve autonomous driving, indicating a commitment to ongoing technological development [40][42].
Future Outlook
- Xpeng's CEO has emphasized building a robust system rather than relying solely on individual product successes, indicating a long-term vision for the company [26][51].
- The company aims to keep its focus on technological advancement and market responsiveness to secure its competitive position in the automotive industry [51].