Reinforcement Learning
How Did a Tsinghua Graduate Born in 1998 Lead a Grassroots Team to an Upset in the Robot Marathon?
混沌学园· 2025-05-08 11:08
Core Viewpoint
- The article discusses the journey of Songyan Power, a startup in the humanoid robot sector, highlighting its unconventional approach to overcoming challenges in funding, technology, and commercialization, ultimately demonstrating that success can come from grassroots efforts rather than elite backgrounds [1][39]

Funding Survival
- The company faced significant funding challenges due to its grassroots team lacking prestigious backgrounds, making it difficult to attract investors in a nascent market [7][8]
- Initial self-funding of 1 million RMB allowed the team to create a demonstrable humanoid robot, which subsequently attracted interest from investors, leading to a seed round of 7.6 million RMB [11][13]
- After securing initial funding, the company quickly progressed from a prototype to a functional robot, raising an additional 25 million RMB shortly thereafter [13]

Technical and Talent Bottlenecks
- The company encountered a technical and talent bottleneck after its initial successes, leading to a cash burn rate of 3 million RMB per month and a precarious financial situation [15][17]
- Acknowledging the need for a shift in strategy, the company decided to pivot toward deep reinforcement learning, a more advanced algorithmic approach, despite the high demand for and low supply of qualified engineers in this field [20][22]
- The company implemented a targeted recruitment strategy to identify and cultivate potential talent, leading to the successful onboarding of engineers who could contribute to the development of advanced humanoid robots [24][25]

Commercialization Challenges
- By 2025, the company faced a new challenge in commercialization: it lacked personnel with marketing and sales expertise, which hindered its ability to gain market visibility [27][28]
- The company recognized the need for a marketing strategy and initiated a campaign centered on a high-profile backflip demonstration, which showcased the robot's capabilities and attracted media attention [30][31]
- Following the successful marketing efforts, the company secured over 1,000 orders for its humanoid robots, marking a significant turnaround in its commercial prospects [34][35]

Key Insights
- The company learned that tangible product demonstrations were more effective than business plans alone in overcoming trust barriers with investors [37]
- Identifying and nurturing potential talent, rather than relying solely on established experts, was emphasized as a critical factor in overcoming technical challenges [37]
- Integrating marketing and sales strategies into the business model was highlighted as essential for sustainable growth and market presence [37]
国泰海通 (Guotai Haitong): Embodied Intelligence Drives the Commercial Deployment of Humanoid Robots, with Algorithm Breakthroughs Among the Catalysts for the Sector
智通财经网· 2025-05-08 07:56
Group 1
- The core viewpoint is that embodied intelligence is the key to the commercialization of humanoid robots, with a market space exceeding one trillion yuan, and the intelligence level of humanoid robots in China is expected to evolve significantly by 2045 [1]
- Humanoid robots possess human-like perception, body structure, and movement, making them highly adaptable to human society, with potential applications in manufacturing, social services, and hazardous operations [1]
- The market scale for humanoid robots is currently below ten billion yuan, but as intelligence levels progress toward embodied intelligence, the market is expected to expand significantly [1]

Group 2
- Multi-modal large models and reinforcement learning are enhancing operational control performance, with significant advancements in communication and computing power to support real-time control [2]
- Major companies like NVIDIA and Tesla are integrating multi-modal perception to improve robot interaction and decision-making accuracy, while the development of embodied reasoning models is expected to enhance performance in complex environments [2]
- The adoption of pure visual solutions and advanced sensors is anticipated to lower hardware costs and improve perception sensitivity, with EtherCAT emerging as a mainstream communication protocol due to its high real-time performance [2]
Breaking Through the Multimodal Reward Bottleneck: CAS, Tsinghua, and Kuaishou Jointly Propose R1-Reward, Using Reinforcement Learning to Give Models Long-Horizon Reasoning Ability
量子位· 2025-05-08 06:58
Core Viewpoint
- The article discusses the development of the R1-Reward model, which utilizes a stable reinforcement learning algorithm (StableReinforce) to enhance the performance of multi-modal reward models (MRMs) in multi-modal large language models (MLLMs) [1][45]

Group 1: Model Development and Performance
- The R1-Reward model achieves a performance improvement of 5%-15% over the current state-of-the-art (SOTA) models on existing multi-modal reward model benchmarks [2]
- The model's performance can increase further with more inference-time sampling, indicating the potential for significant optimization through reinforcement learning [3]
- R1-Reward demonstrates outstanding results on several mainstream multi-modal reward model evaluation benchmarks, significantly surpassing previous best models, with improvements of 8.4% and 14.3% on different leaderboards [11][38]

Group 2: Key Contributions and Innovations
- The model provides stable rewards during training, selects better sample results during evaluation, and can also function as an evaluator independently [4]
- The introduction of a "consistency reward" mechanism ensures that the model's analysis aligns with its final answer, promoting logical judgments [11][31]
- The research team collected 200,000 preference data points to construct the R1-Reward-200k dataset for training, employing a progressive-difficulty training strategy to enhance model learning [11][34]

Group 3: Algorithm Enhancements
- The StableReinforce algorithm addresses the limitations of existing reinforcement learning methods by introducing improvements such as Pre-Clip and Advantage Filter to stabilize training and enhance performance [9][26]
- The Pre-Clip strategy mitigates the impact of large ratio differences during probability calculations, while the Advantage Filter retains only samples within a specified range to avoid extreme values affecting training stability (see the sketch after this summary) [23][26]
- The model's average output length decreased by approximately 15% after reinforcement learning training, suggesting increased efficiency [44]

Group 4: Future Directions
- The article highlights the potential for further exploration of reinforcement learning in reward modeling, including advanced voting strategies for inference and improved training methods to enhance the model's foundational capabilities [45]
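The Pre-Clip and Advantage Filter ideas described above can be illustrated with a short PPO-style loss function. This is a minimal sketch based only on the description in this summary; the thresholds, names, and exact formulation used by StableReinforce in the paper may differ.

```python
import torch

def stable_policy_loss(logp_new, logp_old, advantages,
                       pre_clip=20.0, ratio_clip=0.2, adv_limit=3.0):
    """Sketch of the two stabilisation tricks described for StableReinforce.

    Pre-Clip: bound the log-probability ratio before exponentiating, so that
    very large ratios cannot produce inf/NaN losses.
    Advantage Filter: keep only samples whose standardised advantage lies
    within [-adv_limit, adv_limit], so outliers cannot destabilise training.
    """
    # Pre-Clip: clamp the log-ratio, then exponentiate.
    log_ratio = torch.clamp(logp_new - logp_old, -pre_clip, pre_clip)
    ratio = torch.exp(log_ratio)

    # Advantage Filter: standardise, then mask out extreme values.
    adv = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    keep = (adv.abs() <= adv_limit).float()

    # PPO-style clipped surrogate on the surviving samples.
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - ratio_clip, 1.0 + ratio_clip) * adv
    return -(torch.min(unclipped, clipped) * keep).sum() / keep.sum().clamp(min=1.0)

# Example with dummy per-sample statistics.
logp_new = torch.randn(8)
logp_old = torch.randn(8)
advantages = torch.randn(8)
print(stable_policy_loss(logp_new, logp_old, advantages))
```

In practice the pre-clip bound and the filter range trade off numerical stability against how much data is discarded from each batch.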
Copying Human Motions from Video Alone: The Unitree G1 Masters 100+ Actions in Minutes, as UC Berkeley Proposes a New Way to Train Robots
量子位· 2025-05-08 04:04
Core Viewpoint
- The article discusses the development of a new robotic training system called VideoMimic by a team from UC Berkeley, which allows robots to learn human movements from video without the need for motion capture technology [1][2]

Group 1: VideoMimic System Overview
- VideoMimic has successfully enabled the Unitree G1 robot to mimic over 100 human actions [2]
- The core principle of VideoMimic involves extracting pose and point cloud data from videos, training in a simulated environment, and ultimately transferring the learned actions to a physical robot [3][17]
- The system has garnered significant attention online, with comparisons made to characters like Jack Sparrow from "Pirates of the Caribbean" [4]

Group 2: Training Process
- The research team collected a dataset of 123 video clips filmed in everyday environments, showcasing various human movement skills and scenarios [5][6]
- The Unitree G1 has been trained to adapt to different terrains and perform actions such as stepping over curbs and descending stairs, demonstrating its ability to maintain balance even when slipping [7][14][16]

Group 3: Technical Workflow
- VideoMimic's workflow consists of three main steps: converting video to a simulation environment, training control policies in simulation, and validating these policies on real robots [18]
- The first step reconstructs human motion and scene geometry from single RGB videos, optimizing for accurate alignment of human movements and scene geometry [19]
- The second step processes the scene point cloud into a lightweight triangular mesh model for efficient collision detection and rendering [21]

Group 4: Policy Training and Deployment
- The training process is divided into four progressive stages, resulting in a robust control policy that requires only the robot's proprioceptive information and local height maps as input [24]
- The Unitree G1, equipped with 12 controlled degrees of freedom and various sensors, serves as the physical testing platform for deploying the trained policies [30][31]
- Deployment involves configuring the robot's PD controller to match the simulation environment and using real-time data from its depth camera and IMU for effective movement (a minimal PD-control sketch follows this summary) [35][39]

Group 5: Research Team
- The project features four co-authors, all PhD students at UC Berkeley, with diverse research interests in robotics, computer vision, and machine learning [43][48][52]
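A common pattern when deploying simulation-trained policies on real legged robots is to have the policy output target joint positions that a low-level PD controller converts into motor torques. The sketch below illustrates only that generic control law; the gains, joint count, and update rates are assumptions for illustration, not VideoMimic's actual configuration on the Unitree hardware.

```python
import numpy as np

def pd_torques(q_target, q, dq, kp, kd):
    """Joint-space PD control law: torque = Kp * (position error) - Kd * velocity.
    The RL policy outputs target joint positions at a low rate; this loop turns
    them into motor torques at a higher rate on the robot."""
    return kp * (q_target - q) - kd * dq

# Illustrative values only (assumed): 12 joints, policy at ~50 Hz, PD loop at ~500 Hz.
kp = np.full(12, 30.0)        # proportional gains
kd = np.full(12, 0.8)         # derivative gains
q = np.zeros(12)              # measured joint positions (rad)
dq = np.zeros(12)             # measured joint velocities (rad/s)
q_target = np.full(12, 0.1)   # policy action: target joint positions (rad)

print(pd_torques(q_target, q, dq, kp, kd))
```

Matching the PD gains and control rates between simulation and hardware is what allows a policy trained purely in simulation to behave consistently on the real robot.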
Liang Wenfeng and Yang Zhilin "Collide" Again
创业家· 2025-05-07 09:57
Core Viewpoint
- The article discusses the competitive landscape in the AI large model sector, focusing on the advancements and challenges faced by DeepSeek and Kimi, as well as the impact of larger players like Alibaba and Baidu on their market positions [2][5][13]

Group 1: Model Developments
- DeepSeek launched its new model, DeepSeek-Prover-V2, with a parameter scale of 671 billion, significantly larger than the previous version's 7 billion, resulting in improved efficiency and accuracy on mathematical tasks [3][4]
- DeepSeek-Prover-V2 reached 88.9% on the miniF2F test and solved 49 problems on PutnamBench, outperforming Kimi's model, which had an 80.7% pass rate and solved 10 problems [3][4]
- The evolution of DeepSeek's models is synchronized, with a timeline of Prover-series updates running from March 2024 to the latest releases in 2025 [8][9]

Group 2: Competitive Landscape
- DeepSeek and Kimi face increasing competition from major companies like Alibaba and Baidu, which are rapidly advancing their own AI models [5][15]
- Alibaba's new model, Qwen3, is described as a "hybrid reasoning model" that outperforms DeepSeek's R1 model despite having only one-third of its parameters [15][16]
- Kimi grew rapidly, reaching 20 million monthly active users within a year, but is now being challenged by Tencent's Yuanbao, which has surpassed it in user numbers [14][15]

Group 3: Future Directions
- DeepSeek's founder has identified three paths toward AGI: mathematics and code, multimodal learning, and natural language [7]
- The upcoming R2 model is anticipated to enhance DeepSeek's capabilities, with a shorter development cycle expected than for the more extensive V4 update [9][10]
- The market is eager for DeepSeek's new models, with speculation that R2 may use Huawei's Ascend chips, although there are concerns about their robustness for large model development [10][11]
There Is Hope for People Who Can't Figure Out CUDA: Devin's Developer Open-Sources Kevin, Generating CUDA Kernels with Reinforcement Learning
机器之心· 2025-05-07 04:34
Reported by 机器之心; editors: 蛋酱, 泽南

On Wednesday, Cognition AI, the well-known AI startup that previously released "the world's first AI software engineer," open-sourced Kevin-32B, a large model that uses reinforcement learning to write CUDA kernels.

Kevin-32B is based on QwQ-32B and was trained with multi-turn reinforcement learning using GRPO on the KernelBench dataset, achieving top-tier reasoning performance that surpasses o3 and o4-mini.

The machine learning community has shown great interest. Some said they had long been waiting for DeepSeek R1-style training methods to be applied to improving code efficiency, and someone has finally stepped up.

In a blog post, Cognition AI described the reinforcement learning training mechanism of the new model in detail.

Coding is an iterative process: we write and run a program, evaluate the results, and refine the code based on feedback. Recent advances in code generation with large language models (LLMs) try to fold this process into the inference stage, using methods such as parallel sampling. While effective, these approaches rely on search rather than actual learning, since the model weights stay frozen.

Cognition AI instead explored multi-turn reinforcement learning, using intermediate feedback from the environment and masking out the model's chain of thought to avoid context explosion across multiple turns (a toy sketch of per-turn reward aggregation appears after this entry).

Their proposed model, Kev ...
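To make the multi-turn setup concrete, the sketch below shows one plausible way to turn per-turn execution feedback into a discounted episode return for a kernel-writing agent. The feedback fields, reward shaping, and discount factor are illustrative assumptions, not Cognition AI's published reward design.

```python
from dataclasses import dataclass

@dataclass
class ExecFeedback:
    """Hypothetical per-turn feedback for a generated kernel; field names are
    illustrative, not KernelBench's actual API."""
    compiles: bool
    correct: bool
    speedup: float  # reference runtime / generated-kernel runtime

def turn_reward(fb: ExecFeedback) -> float:
    """Toy reward shaping: small credit for compiling, full credit plus a
    speedup bonus for a correct kernel."""
    r = 0.0
    if fb.compiles:
        r += 0.1
    if fb.correct:
        r += 1.0 + max(fb.speedup - 1.0, 0.0)
    return r

def episode_return(feedback, gamma=0.9):
    """Discounted sum of per-turn rewards across a multi-turn refinement
    episode, so earlier turns share credit for later improvements."""
    return sum(gamma ** t * turn_reward(fb) for t, fb in enumerate(feedback))

# Example: fails to compile, then correct but slow, then correct and 1.8x faster.
turns = [ExecFeedback(False, False, 0.0),
         ExecFeedback(True, True, 0.9),
         ExecFeedback(True, True, 1.8)]
print(episode_return(turns))
```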
A 10,000-Word Deep Dive into Reinforcement Learning: Can Decentralized Reinforcement Learning Be Achieved?
机器之心· 2025-05-07 04:34
Core Insights
- Reinforcement Learning (RL) is emerging as a pivotal method for enhancing AI models, particularly in the context of decentralized systems [2][3][20]
- The article outlines a timeline of AI scaling methods, emphasizing the shift from pre-training to RL-based approaches for model improvement [6][10][20]
- DeepSeek's innovative use of RL in its models, particularly R1-Zero, demonstrates a new paradigm for self-improvement in AI without heavy reliance on human data [25][26][51]

Group 1: Historical Context of AI Scaling
- The initial scaling laws established the importance of data in training, leading to the understanding that many models were under-trained relative to their parameter counts [6][10]
- The introduction of the Chinchilla scaling law highlighted the optimal data-to-parameter ratio, prompting researchers to use significantly more data for training [6][10]
- The evolution of scaling methods culminated in the recognition of the limits of pre-training data availability, as noted by Ilya Sutskever [19][20]

Group 2: DeepSeek's Model Innovations
- DeepSeek's R1-Zero model showcases the potential of RL to enhance model performance with minimal human intervention, marking a significant advance in AI training methodologies [25][26][51]
- The model employs a recursive improvement process, allowing it to generate and refine its own reasoning paths, thus reducing dependency on new human data [26][48]
- The transition from traditional supervised fine-tuning (SFT) to a GRPO (Group Relative Policy Optimization) framework simplifies the RL process and reduces computational overhead (a minimal GRPO sketch follows this summary) [44][46]

Group 3: Decentralized Reinforcement Learning
- The article discusses the necessity of a decentralized framework for training and optimizing AI models, emphasizing the need for a robust environment to generate diverse reasoning data [67][72]
- Key components of a decentralized RL system include a foundational model, a training environment for generating reasoning data, and an optimizer for fine-tuning [67][70]
- The potential for decentralized networks to facilitate collaborative learning and data generation is highlighted, suggesting a shift in how AI models can be developed and improved [72][78]

Group 4: Future Directions
- The exploration of modular and expert-based models is suggested as a promising avenue for future AI development, allowing for parallel training and improvement of specialized components [106][107]
- The integration of decentralized approaches with existing frameworks like RL Swarm indicates a trend toward more collaborative and efficient AI training methodologies [102][107]
- Ongoing research into optimizing decentralized training environments and validation mechanisms is crucial for advancing AI capabilities [75][78]
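Since the summary above credits GRPO with simplifying RL by removing the need for a separate value network, here is a minimal sketch of its core idea: each response's advantage is computed relative to the other responses sampled for the same prompt. This is a simplified illustration; the full GRPO objective also includes the clipped policy-ratio surrogate and a KL penalty term.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantage at the heart of GRPO: sample a group of G
    responses for the same prompt, then score each one by how its reward
    compares with the group mean, in units of the group standard deviation.
    No separate value network (critic) is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Example: 6 sampled answers to one prompt, rewarded 1.0 if correct, else 0.0.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0, 0.0])
print(grpo_advantages(rewards))  # correct answers receive positive advantage
```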
First on Both the VDC and VBench Leaderboards: A Chinese Video Model Polished with Reinforcement Learning Surpasses Sora and Pika
机器之心· 2025-05-06 04:11
Core Insights
- The article discusses the integration of reinforcement learning into video generation, highlighting the success of models like Cockatiel and IPOC in achieving superior performance on video generation tasks [1][14]

Group 1: Video Detailed Captioning
- The video detailed captioning model serves as a foundational element for video generation, with the Cockatiel method achieving first place on the VDC leaderboard, outperforming several prominent multimodal models [3][5]
- Cockatiel's approach involves a three-stage fine-tuning process that leverages high-quality synthetic data aligned with human preferences, resulting in a model that excels in fine-grained expression and human-preference consistency [5][8]

Group 2: IPOC Framework
- The IPOC framework introduces an iterative reinforcement learning preference optimization method, achieving a total score of 86.57% on the VBench leaderboard and surpassing various well-known video generation models [14][15]
- The IPOC method consists of three stages: human preference data annotation, reward model training, and iterative reinforcement learning optimization, which collectively enhance the efficiency and effectiveness of video generation (a generic sketch of a preference-based reward-model loss follows this summary) [19][20]

Group 3: Model Performance
- Experimental results indicate that the Cockatiel series models generate video descriptions with comprehensive coverage, precise narratives, and minimal hallucination, showing higher reliability and accuracy than baseline models [7][21]
- The IPOC-2B model demonstrates significant improvements in temporal consistency, structural rationality, and aesthetic quality of generated videos, leading to more natural and coherent movements [21][25]
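The "reward model training" stage mentioned above is typically done with a Bradley-Terry-style pairwise loss over annotated preference pairs. The snippet below is a generic sketch of that standard loss, not IPOC's exact objective or data format.

```python
import torch
import torch.nn.functional as F

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style objective for training a reward model from
    preference pairs: maximise the probability that the preferred (chosen)
    sample is scored higher than the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Example: reward-model scores for three preference pairs of generated videos.
r_chosen = torch.tensor([0.8, 1.2, 0.3])
r_rejected = torch.tensor([0.1, 0.9, 0.4])
print(preference_loss(r_chosen, r_rejected))
```

Once trained this way, the reward model can score newly generated samples and drive the iterative reinforcement learning stage described in the summary.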
OpenAI Abandons Its For-Profit Conversion; Altman Says the Nonprofit Stays in Control; Under Tariff Pressure, Temu Halts Direct Shipments of Chinese Goods to the U.S.; NVIDIA Again Prepares a China-Specific AI Chip
雷峰网· 2025-05-06 00:29
Group 1
- Temu has announced the cessation of direct sales of Chinese products to the U.S. due to a 130% import tariff, shifting to local sellers for U.S. market sales [5][6]
- The U.S. customs policy change effective May 2, 2025 will eliminate the small-package tariff exemption for goods from mainland China and Hong Kong, requiring proper customs declarations and payment of applicable tariffs [5]
- The number of full-service sellers on Temu's U.S. site has decreased significantly, with some sellers seeing over 50% of their products delisted [6]

Group 2
- Neta Auto's app and website experienced significant downtime due to unpaid traffic fees, leading to accessibility issues during the holiday period [8]
- Neta Auto's sales have declined sharply since 2023, revealing operational challenges including layoffs and payment delays to suppliers [9]
- The company previously achieved a sales record of approximately 152,100 vehicles in 2022, becoming a leading player among new car manufacturers [8]

Group 3
- Major car manufacturers, including Xiaomi and Huawei, have rebranded their "smart driving" features as "assisted driving," reflecting a shift in marketing strategy [10][11]
- The term "smart driving" is becoming less prominent in product promotions, with many companies opting for more conservative language in their marketing [11]

Group 4
- Xiaomi's international market department has undergone leadership changes, with Xu Fei appointed as the new general manager [16]
- Xu Fei has been with Xiaomi for 15 years and previously served as head of the MIUI product team [16]

Group 5
- Ant Group plans to separately list its overseas division, Ant International, in Hong Kong; the unit accounts for approximately 20% of Ant Group's revenue [15]
- Ant International focuses on cross-border payment services, leveraging products like Alipay+ and WorldFirst [15]

Group 6
- NVIDIA is developing a new AI chip tailored for the Chinese market after the U.S. government banned the export of its H20 chip, with samples expected to be available in June [21]
- The new chip design aims to comply with U.S. export regulations while maintaining NVIDIA's market presence in China [21]

Group 7
- OpenAI has decided to maintain its non-profit structure, abandoning plans for a profit-driven transformation, which may complicate future funding efforts [20]
- The organization emphasizes its mission to ensure that AGI benefits all of humanity, contrasting with traditional profit-driven corporate governance [20]
Liang Wenfeng and Yang Zhilin "Collide" Again
华尔街见闻· 2025-05-05 12:26
Core Viewpoint
- The article discusses the competitive landscape of large model development in China, focusing on the advances of DeepSeek and Kimi and the challenges they face from larger companies like Alibaba and Baidu [2][15]

Group 1: Model Developments
- DeepSeek launched its new model, DeepSeek-Prover-V2, with a parameter scale of 671 billion, significantly larger than the previous version's 7 billion, enhancing efficiency and accuracy on mathematical tasks [3][4]
- Kimi, developed by the Moonshot AI team, released a model called Kimina-Prover with 1.5 billion and 7 billion parameter distilled versions, achieving a miniF2F test pass rate of 80.7% [3][4]
- DeepSeek-Prover-V2 outperformed Kimina-Prover on both the miniF2F and PutnamBench tests, indicating a competitive edge in mathematical reasoning capabilities [4]

Group 2: Competitive Challenges
- DeepSeek faces declining interest in its R1 model as competitors like Alibaba rapidly advance their own models, prompting expectations for new releases such as R2 or V4 [6][18]
- Kimi is also under pressure from ByteDance's Doubao and Tencent's Yuanbao, necessitating continuous innovation to maintain its market position [7][16]
- Kimi grew rapidly, reaching 20 million monthly active users in November 2024, trailing Doubao's 56 million [16]

Group 3: Market Dynamics
- Alibaba's new model, Qwen3, is described as a hybrid reasoning model that outperforms DeepSeek's R1 with a parameter count only one-third of R1's [19]
- Baidu's recent releases, including Wenxin 4.5 Turbo, are noted for their strong performance and lower costs compared to DeepSeek, alongside criticism of DeepSeek's speed and pricing [20][21]
- The competitive landscape is intensifying, with more players entering the large model open-source race, emphasizing the need for advanced technology to set industry standards [22]