RL Training Keeps Collapsing? R1-Reward Stably Unlocks Long-CoT Reasoning in Reward Models
机器之心· 2025-05-12 04:31
Core Viewpoint - The article discusses the development and application of the R1-Reward model, which uses a new algorithm called StableReinforce to enhance the performance of multimodal reward models (MRMs) through reinforcement learning (RL), addressing training instability and inconsistency in reward modeling [1][38].

Group 1: R1-Reward Model and Its Applications
- R1-Reward has shown significant academic value and has been successfully deployed at Kuaishou in short-video, e-commerce, and live-streaming scenarios, achieving notable performance improvements [2].
- The R1-Reward model outperforms state-of-the-art (SOTA) models by 5%-15% on existing multimodal reward model benchmarks, with further gains as the number of inference samples increases [1][38].

Group 2: Algorithm Improvements
- The article introduces StableReinforce, a new algorithm that optimizes existing RL methods to improve training stability and efficiency [9].
- Key improvements include a gradual training strategy, a robust advantage-value handling method called Advantage Filter, and a novel "consistency reward" that checks the coherence between the model's analysis and its final answer [12][25].

Group 3: Training Methodology
- Training follows a two-step approach: a supervised fine-tuning (SFT) phase on a dataset of 200,000 preference data points, followed by a reinforcement learning phase that focuses on more challenging samples [27][30].
- The SFT phase teaches the model the task format and process, while the RL phase targets samples deemed "harder" to sharpen the model's ability to discern subtle differences [32].

Group 4: Experimental Results
- R1-Reward has demonstrated exceptional performance on multiple multimodal reward model leaderboards, significantly surpassing previous best models [34].
- A voting strategy during evaluation, in which the model outputs multiple judgments and the most frequent answer is selected, yields substantial accuracy gains: accuracy rises from approximately 71% to 86.47% with 15 votes [35].

Group 5: Future Directions
- The article suggests many unexplored avenues for applying RL to reward modeling, including advanced voting strategies and improved training methodologies to further strengthen the model's foundational capabilities [38].
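The evaluation-time voting described above amounts to a simple majority vote over repeated judgments. A minimal sketch (the sampling callable and verdict labels below are hypothetical stand-ins for querying R1-Reward multiple times with nonzero temperature):

```python
from collections import Counter
import itertools

def majority_vote(sample_judgment, prompt, k=15):
    """Query the reward model k times and keep the most frequent verdict."""
    votes = [sample_judgment(prompt) for _ in range(k)]
    verdict, count = Counter(votes).most_common(1)[0]
    return verdict, count / k  # verdict plus agreement ratio

# Toy usage: a fake sampler that answers "A" 11 times and "B" 4 times.
fake_sampler = itertools.cycle(["A"] * 11 + ["B"] * 4)
verdict, agreement = majority_vote(lambda _: next(fake_sampler),
                                   "which response is better?")
```

The gain from 71% to 86.47% comes purely from this aggregation: independent sampling errors tend to cancel when the modal answer is kept.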
Humanoid Robots: Industrial Revolution or Capital Bubble?
机器人大讲堂· 2025-05-11 04:26
Core Viewpoint - The humanoid robot industry is experiencing a capital bubble driven by blind investment in emerging technologies, despite the lack of substantial commercial progress and technological maturity [2][4][20].

Group 1: Capital Investment and Market Dynamics
- The humanoid robot sector has attracted significant capital, leading to rapid valuation increases for some startups, even those established less than a year ago [1].
- The influx of capital has not effectively driven technological advancement, resulting in a market bubble in which expectations exceed actual capabilities [2][20].
- Historical examples, such as Honda's ASIMO and Boston Dynamics' robots, illustrate the disconnect between technological aspirations and market realities, which often leads to project failures [4].

Group 2: Technological Challenges
- Humanoid robots face significant technical bottlenecks in perception and motion control, limiting their effectiveness in real-world applications [5][10].
- Despite advances in sensor systems and motion control, robots still struggle with environmental perception accuracy and dynamic adaptability [8][10].
- Current humanoid robots lack true intelligent decision-making, relying instead on pre-programmed instructions, which hinders their ability to adapt to complex environments [11][13].

Group 3: Future Prospects
- Development is expected to be gradual, with small-scale commercialization anticipated in the next 3-5 years [20].
- Emerging technologies such as reinforcement learning may enhance robots' adaptive capabilities, but effective implementation requires significant computational resources and time [16].
- The future of humanoid robots lies in improving their cognitive abilities to make autonomous decisions in dynamic environments, moving beyond mere mechanical control [16].
Former Google CEO: Never Underestimate China's AI Competitiveness
Hu Xiu· 2025-05-10 03:55
Group 1: Founder Psychology and Roles
- Eric Schmidt emphasizes the difference between founders and professional managers: founders are visionaries, while professional managers are "amplifiers" who help scale ideas [4][10]
- Schmidt reflects on his experience at Google, noting that he was not a typical entrepreneur but a professional manager who contributed during the company's scaling phase [3][4]
- He discusses the challenges of attracting talent, highlighting that many talented individuals choose to start their own companies rather than join established firms [3][5]

Group 2: Market Dynamics and Startup Ecosystem
- Schmidt points out that many startups are acquired for their talent rather than their products, indicating a market structure that can be inefficient [6][7]
- He notes that most startups fail: traditional venture capital experience suggests that 4 out of 10 will fail completely, and 5 will become "zombies" with no growth potential [7]
- The conversation highlights the importance of competition for startups, suggesting that true leadership is demonstrated when facing challenges from larger companies [11][12]

Group 3: AI and Future Trends
- Schmidt believes AI is currently underestimated rather than overhyped, citing the scaling laws that drive AI advancement [33][34]
- He discusses AI's potential to transform business processes and scientific breakthroughs, emphasizing the importance of understanding how humans will coexist with advanced AI systems [35][39]
- The conversation touches on the competitive landscape between the U.S. and China in AI development, with China investing heavily in AI as a national strategy [41][42]

Group 4: Talent Acquisition and Management
- Schmidt stresses the importance of attracting top talent by creating an environment where individuals feel they are solving significant problems [18][20]
- He distinguishes "rockstar" employees who drive change from "mediocre" employees who are self-serving, advocating for the retention of the former [21][22]
- The discussion includes insights on identifying and nurturing high-potential talent within organizations [24][25]

Group 5: Challenges in AI Development
- Schmidt highlights the difficulty of defining reward functions in reinforcement learning, which is crucial for AI's self-learning capabilities [51]
- He warns about over-investing in AI infrastructure without a clear path to profitability, suggesting that many companies may fall into economic traps [47][48]
- The conversation concludes with a call for companies to focus on the most challenging problems in AI, as solving these will yield the most significant rewards [52][53]
Einstein-Level AGI in Nine Years? OpenAI Scientist Dan Roberts on the Future of Scaling Reinforcement Learning
机器之心· 2025-05-10 03:42
Core Insights - The article's central prediction is that reinforcement learning will play an increasingly significant role in the development of AI models, potentially leading to models capable of discovering new scientific knowledge within the next nine years [2][37].

Group 1: Presentation Highlights
- Dan Roberts, a research scientist at OpenAI, discussed the importance of scaling laws in pre-training and reinforcement learning during his presentation at AI Ascent [2][4].
- The presentation highlighted a key finding: as a model's "thinking time" increases, its performance improves, indicating that models can learn to think more effectively [9][12].
- OpenAI's recent model, o3, demonstrates enhanced reasoning capabilities, allowing it to solve complex problems in a fraction of the time a human would need [14][31].

Group 2: Future Predictions
- The company aims to expand the scale of reinforcement learning significantly, with plans to invest $500 billion in computational resources to enhance model training [48].
- Predictions suggest that the length of tasks AI can handle will double approximately every seven months, potentially allowing for computations lasting up to eight years by 2034 [56][57].
- The ultimate goal is to develop models that contribute significantly to human knowledge and scientific discovery, on the scale of the time it took Einstein to formulate the theory of general relativity [31][57].
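The seven-month doubling claim can be sanity-checked with quick arithmetic. Assuming, purely for illustration, a one-hour task horizon in 2025 (the starting value is a hypothetical, not from the talk), compounding through 2034 lands in the multi-year range:

```python
# Task horizon doubling every 7 months (the article's figure), starting
# from a hypothetical 1-hour horizon in 2025.
months = (2034 - 2025) * 12            # 108 months
doublings = months / 7                 # ~15.4 doublings
horizon_hours = 1.0 * 2 ** doublings   # ~44,000 hours
horizon_years = horizon_hours / (24 * 365)
print(f"{doublings:.1f} doublings -> ~{horizon_years:.1f} years per task")
```

The exact endpoint depends entirely on the assumed starting horizon; the point is that exponential growth at this rate turns hour-scale tasks into year-scale computations within a decade.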
21 Dialogue | Zhuoyue's Chen Xiaozhi: Squeezing Extreme Performance from Limited Compute Is in Our Blood
Core Insights - The article discusses the rise of intelligent driving technology in the automotive market, focusing on Zhuoyue Technology's approach to providing cost-effective driving assistance solutions [1][2][3].

Group 1: Company Overview
- Zhuoyue Technology, formerly known as DJI Automotive, has transitioned from a team within DJI focused on intelligent driving technology to an independent entity, leveraging sensor and computer vision expertise from the drone industry [2].
- The company aims to provide high-performance driving assistance features at lower cost, using its self-developed hardware and software [1][2].

Group 2: Product Development
- Zhuoyue's 7V (7 cameras) + 32 TOPS configuration has become standard in vehicles priced between 80,000 and 150,000 RMB, enabling features such as urban memory navigation and highway driving [1].
- The company plans to launch the "Chengxing Platform" in November 2024, offering 7V and 9V solutions that reduce reliance on high-precision maps and LiDAR, thus lowering the cost of advanced driving assistance [2].

Group 3: Market Position and Strategy
- The mid-to-low-end market is expected to grow significantly by 2025, which aligns with Zhuoyue's strengths [3].
- Zhuoyue has established partnerships with major automotive manufacturers, including FAW, Volkswagen, and BYD, with over 20 models already in production and more than 30 models set to launch soon [2].

Group 4: Technological Innovations
- The company is enhancing its capabilities through the Thor platform, which offers higher computing power at lower cost than existing solutions [3][6].
- Zhuoyue is also exploring the integration of reinforcement learning and world models to improve safety and decision-making in driving assistance systems [12][19].
Group 5: Future Directions
- The company is preparing to develop hardware for L3 and L4 autonomous driving, including the necessary sensors and controllers, while emphasizing the importance of perfecting L2 assistance before advancing to higher levels of automation [9][10].
- Zhuoyue aims to enhance the user experience with a more intuitive point-to-point navigation system that mimics human driving behavior [20].
[In-Depth] AI + Automotive Intelligence Series (11): Horizon Robotics as a Case Study of the Core Competitiveness of Third-Party Intelligent Driving Suppliers
Core Viewpoint - The company is optimistic about breakthrough opportunities for leading third-party intelligent driving suppliers, driven by demand for equal access to intelligent driving and by these suppliers' performance catch-up and mass-production validation [2][8].

Group 1: Market Opportunities
- Leading third-party intelligent driving suppliers are expected to become the optimal choice for second- and third-tier automakers seeking equal access to intelligent driving, with a potential market share of around 50% of total new car sales [2][8].
- The trend toward equal access to intelligent driving is accelerating, with a focus on system cost reduction as automakers balance performance and cost in their strategies [2][8].

Group 2: Domestic Chip Comparison
- NVIDIA's Orin series chips currently dominate the high-end intelligent driving market, but domestic chip suppliers have made significant progress in performance, mass-production validation, and customer acquisition over the past five years [3][39].
- The domestic chip leader, Horizon Robotics, is entering a new cycle of product iteration and business-model elevation, leading the rollout of mid- to high-end intelligent driving chips and algorithms [11][39].

Group 3: Core Value of Third-Party Chip Suppliers
- First-mover advantage matters: intelligent driving chips typically require R&D and manufacturing cycles of over three years, so suppliers need continuous iteration capability to balance cost and performance [4][54].
- From a design and manufacturing cost perspective, a 7nm intelligent driving chip reaches cost parity with procuring a mature chip at a production volume of about 1.5 million units, underscoring the need for high volume and rapid iteration for self-developed chips [4][57].
Group 4: Algorithm Insights
- The "BEV + Transformer" approach of "heavy perception, light mapping" has been validated and widely applied, reducing risk for Tier 1 suppliers and allowing them to keep pace with cutting-edge technology [4][62].
- Horizon Robotics' latest intelligent driving algorithm, HSD, is positioned as a "showcase," balancing performance and efficiency while addressing the challenges of scaling up and scaling out intelligent driving systems [62][63].

Group 5: Industry Trends
- The intelligent driving landscape is expected to shift significantly toward equal access by 2026, with many domestic automakers planning to adopt domestic chips as their mainstream solution [28][43].
- Competition among automakers is intensifying, with a focus on high-level intelligent driving capabilities and the need for cost-effective solutions [2][8].
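The 1.5-million-unit break-even in Group 3 can be illustrated with back-of-the-envelope arithmetic. Every number below is a hypothetical placeholder chosen only so the result reproduces the article's stated volume; the article itself gives no cost breakdown:

```python
# Break-even volume: self-developed 7nm chip vs. procuring a mature chip.
# All cost figures are assumed placeholders, not from the article.
nre_cost = 3.0e8           # one-off R&D + tape-out spend (USD, assumed)
unit_cost_self = 80.0      # per-chip cost when self-developed (assumed)
unit_price_bought = 280.0  # per-chip procurement price (assumed)

# Self-development pays off once the per-unit saving covers the fixed NRE.
break_even_units = nre_cost / (unit_price_bought - unit_cost_self)
print(f"{break_even_units:,.0f} units")
```

The structure of the calculation, fixed NRE amortized against a per-unit saving, is why the report stresses that self-developed chips only make sense with high output and rapid iteration.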
Disrupting the Google Search API with Costs Cut by 88%: Alibaba Open-Sources the RL Framework ZeroSearch, Redefining AI Search!
AI科技大本营· 2025-05-09 09:35
Core Insights - Alibaba's Tongyi team has launched ZeroSearch, a generative search-engine framework that operates without external search interfaces, achieving low-cost, high-performance retrieval [1][10].

Group 1: ZeroSearch Overview
- ZeroSearch allows users to run a 14-billion-parameter model on four A100 GPUs for just $70.80, providing search capabilities that rival or exceed Google's [1][16].
- The framework employs a novel reinforcement learning approach that trains search capabilities without interacting with real search engines, addressing issues of uncontrolled document quality and high API costs [2][6].

Group 2: Training Methodology
- Training begins with lightweight supervised fine-tuning that converts a large model into a retrieval module capable of generating both relevant and irrelevant documents for a query [8].
- A curriculum learning strategy gradually lowers document quality to challenge the model's reasoning and retrieval abilities, strengthening its search learning path [2][8].

Group 3: Cost Efficiency and Performance
- ZeroSearch demonstrates an 80%-90% reduction in training costs compared with traditional methods, making it a genuinely low-cost, high-performance solution for AI search training [10][16].
- Across experimental scenarios, ZeroSearch matches or exceeds models trained with real search engines: a 7-billion-parameter model matches Google search quality, and a 14-billion-parameter version surpasses it [15][16].

Group 4: Open Source and Accessibility
- The researchers have released their code, datasets, and pre-trained models on GitHub and Hugging Face, making the work accessible to other researchers and companies [16].
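The curriculum idea, gradually degrading the quality of simulated documents, can be sketched as a noise-probability schedule. The linear ramp, its endpoints, and the simulator stand-in below are illustrative assumptions, not the paper's exact formulation:

```python
import random

def noise_probability(step, total_steps, p_start=0.1, p_end=0.7):
    """Chance the simulator emits an irrelevant document, ramped up
    linearly over training (shape and endpoints are assumed)."""
    frac = min(step / total_steps, 1.0)
    return p_start + (p_end - p_start) * frac

def simulate_document(query, step, total_steps, rng=random):
    """Hypothetical stand-in for the fine-tuned simulator LLM: returns a
    marker for a relevant or a deliberately irrelevant document."""
    if rng.random() < noise_probability(step, total_steps):
        return f"[noisy document unrelated to: {query}]"
    return f"[useful document answering: {query}]"
```

Early in training the policy mostly sees useful documents and learns the retrieval format; late in training it must reason around mostly noisy results, which is what hardens its search behavior.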
Goodbye, Expensive Google Search API! Alibaba's Open-Source RL Framework Lets Large Models Be Self-Sufficient, Cutting Costs by 88%; Netizens: The Rules of the Game Have Changed
AI前线· 2025-05-09 05:18
Core Viewpoint - Alibaba's new technology, ZeroSearch, significantly reduces the cost and complexity of training AI systems for information retrieval, eliminating the need for expensive commercial search-engine APIs [1][2][14].

Summary by Sections

Technology Overview
- ZeroSearch is a reinforcement learning framework that lets large language models (LLMs) develop advanced search capabilities through simulation, outperforming models trained against real search engines while incurring zero API costs [2][3].
- The technology is compatible with multiple model families, including Qwen-2.5 and LLaMA-3.2, and does not require a separate supervised warm-up phase [2][3].

Performance Metrics
- In comprehensive experiments across seven question-answering datasets, ZeroSearch's performance matched or exceeded that of models trained with real search engines [3][5].
- A 3-billion-parameter LLM can achieve search capabilities comparable to Google's, while a 14-billion-parameter module can surpass Google's performance [3][5].

Cost Efficiency
- Training on approximately 64,000 queries using Google search via SerpAPI costs around $586.70, while using a 14-billion-parameter simulated LLM on four A100 GPUs costs only $70.80, an 88% reduction [7][8].

Methodology
- ZeroSearch begins with a lightweight supervised fine-tuning process that transforms LLMs into retrieval modules capable of generating both relevant and irrelevant documents in response to queries [9][11].
- The system employs a curriculum-based rollout mechanism, gradually increasing the difficulty of generated documents to simulate challenging retrieval scenarios [11][12].

Implications for AI Development
- ZeroSearch represents a significant shift in AI training methods, enabling AI systems to improve without relying on external tools such as search engines [14][15].
- This technology creates a more equitable competitive environment for small AI companies and startups by drastically lowering the entry barrier associated with high API costs [14][15].
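The two cost figures quoted above imply the reported saving directly:

```python
serpapi_cost = 586.70    # ~64,000 queries through Google via SerpAPI
simulated_cost = 70.80   # 14B-parameter simulator LLM on four A100 GPUs
saving = 1 - simulated_cost / serpapi_cost
print(f"cost reduction: {saving:.0%}")  # ~88%
```

Because the simulator cost is dominated by GPU rental rather than per-query fees, the saving also grows with query volume, which is what lowers the entry barrier for smaller teams.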
Text-to-Image Reaches Its R1 Moment: CUHK MMLab Releases T2I-R1
机器之心· 2025-05-09 02:47
Core Viewpoint - The article discusses the development of T2I-R1, a novel text-to-image generation model that uses a dual-level Chain-of-Thought (CoT) reasoning framework combined with reinforcement learning to improve image quality and alignment with human expectations [1][3][11].

Group 1: Methodology
- T2I-R1 employs two distinct levels of CoT reasoning: Semantic-CoT, which handles the global structure of the image, and Token-CoT, which handles the detailed generation of image tokens [6][7].
- The model uses Semantic-CoT to plan and reason about the image before generation, optimizing the alignment between prompts and generated images [7][8].
- Token-CoT generates image tokens sequentially, ensuring visual coherence and detail in the generated images [7][8].

Group 2: Model Enhancement
- T2I-R1 enhances a unified language-and-vision model (ULM) by incorporating both Semantic-CoT and Token-CoT into a single framework for text-to-image generation [9][11].
- The model uses reinforcement learning to jointly optimize the two levels of CoT, generating multiple sets of Semantic-CoT and Token-CoT for a single image prompt [11][12].

Group 3: Experimental Results
- T2I-R1 shows improved robustness and better alignment with human expectations when generating images from prompts, particularly in unusual scenarios [13].
- Quantitative results indicate that T2I-R1 outperforms baseline models by 13% and 19% on the T2I-CompBench and WISE benchmarks, respectively, surpassing previous state-of-the-art models [16].
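Jointly optimizing multiple (Semantic-CoT, Token-CoT) rollouts per prompt typically relies on group-relative rewards. The sketch below assumes a GRPO-style normalization, a common choice for this setup rather than the paper's confirmed formulation, and the reward values are made up for illustration:

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize per-sample rewards within one prompt's rollout group:
    advantage = (r - group mean) / group std (GRPO-style, assumed)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Four rollouts generated for the same prompt, each scored by some
# reward function (values here are illustrative placeholders).
advs = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Normalizing within the group means a rollout is reinforced only for beating its siblings on the same prompt, which is what lets one reward signal train both CoT levels at once.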
StepFun's Jiang Daxin: Multimodal AI Has Yet to See Its GPT-4 Moment
Hu Xiu· 2025-05-08 11:50
Core Viewpoint - The multimodal model industry has not yet had its "GPT-4 moment," as the lack of an integrated understanding-generation architecture remains a significant bottleneck for development [1][3].

Company Overview
- The company, founded by CEO Jiang Daxin in 2023, focuses on multimodal models and has restructured internally to merge previously separate groups into a single "generation-understanding" team [1][2].
- The company currently employs over 400 people, 80% of them in technical roles, and fosters a collaborative, open work environment [2].

Technological Insights
- The integrated understanding-generation architecture is deemed crucial for the evolution of multimodal models, enabling pre-training on vast amounts of image and video data [1][3].
- The company emphasizes the importance of multimodal capabilities for achieving Artificial General Intelligence (AGI), asserting that any shortcoming in this area could delay progress [12][31].

Market Position and Competition
- The company has completed a Series B funding round of several hundred million dollars and is one of the few among the "AI six tigers" that has not abandoned pre-training [3][36].
- The competitive landscape is intense, with major players such as OpenAI, Google, and Meta releasing numerous new models, underscoring the urgency of innovation [3][4].

Future Directions
- The company plans to enhance its models by integrating reasoning capabilities and long-chain thinking, which are essential for solving complex problems [13][18].
- Future work will focus on achieving a scalable understanding-generation architecture in the visual domain, currently a significant challenge [26][28].

Application Strategy
- The company adopts a dual strategy of "super models plus super applications," aiming to leverage multimodal capabilities and reasoning skills in its applications [31][32].
- The focus on intelligent terminal agents is seen as a key area for growth, with the potential to enhance user experience and task completion through better contextual understanding [32][34].