Reinforcement Learning
Breaking Through the High-Resolution Image Reasoning Bottleneck: Fudan and Nanyang Technological University Propose MGPO, a Multi-Turn Reinforcement Learning Framework Based on Visual Grounding
机器之心· 2025-07-21 04:04
Core Insights
- The article discusses MGPO, a multi-turn reinforcement learning method that enhances the visual reasoning capabilities of large multi-modal models (LMMs) when processing high-resolution images [1][8][21]
- MGPO allows LMMs to automatically predict key-area coordinates and crop sub-images based on the question, improving the model's ability to focus on relevant information without requiring expensive grounding annotations [2][21]

Summary by Sections

Introduction
- Current LMMs, such as Qwen2.5-VL, face challenges in processing high-resolution images because the images are converted into a large number of visual tokens, many of which are irrelevant to the task [5][6]
- The human visual system employs a task-driven visual search strategy, which MGPO aims to replicate by enabling LMMs to focus on key areas of an image [6][7]

Method Overview
- MGPO simulates a multi-step visual reasoning process in which the model first predicts key-area coordinates and then crops sub-images for further reasoning (a minimal rollout sketch follows this summary) [10][21]
- The method overcomes the limitation of traditional visual grounding models, which require extensive grounding annotations for training [7][21]

Key Innovations of MGPO
- A top-down, interpretable visual reasoning mechanism that allows LMMs to conduct problem-driven visual search [2]
- The ability to accurately identify relevant area coordinates in high-resolution images, even when visual tokens are limited [2]
- The model can be trained on standard Visual Question Answering (VQA) datasets without additional grounding annotations, relying solely on answer correctness for feedback [2][21]

Experimental Results
- MGPO demonstrated significant performance improvements over methods such as SFT and GRPO, achieving gains of 5.4% and 5.2% on benchmark tests [18][19]
- The model outperformed OpenAI's models despite being trained on a smaller dataset, showcasing its effectiveness [18][19]
- The proportion of valid grounding coordinates generated by MGPO increased significantly during training, indicating that the model develops robust visual grounding capabilities on its own [20]

Conclusion
- MGPO effectively addresses visual-token redundancy and loss of key information in high-resolution image processing [21]
- The method shows that reinforcement learning can foster robust grounding capabilities without costly annotations, enhancing the efficiency of LMMs [21]
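To make the multi-turn grounding loop concrete, here is a minimal rollout sketch in Python. It is an illustration of the described procedure, not the authors' released code: the `policy` object and its `generate` method are hypothetical stand-ins for an LMM inference call, and the answer-matching rule is a deliberately simple placeholder.

```python
import re
from typing import Optional, Tuple
from PIL import Image

def parse_bbox(text: str) -> Optional[Tuple[int, int, int, int]]:
    """Extract the first '(x0, y0, x1, y1)' tuple from the model's reply, if any."""
    m = re.search(r"\((\d+),\s*(\d+),\s*(\d+),\s*(\d+)\)", text)
    return tuple(int(g) for g in m.groups()) if m else None

def mgpo_rollout(policy, question: str, image: Image.Image, gold_answer: str) -> float:
    """One multi-turn grounding rollout rewarded only by answer correctness (sketch)."""
    # Turn 1: the model sees the (token-limited) image and predicts the key region.
    grounding_reply = policy.generate(
        question=question, images=[image],
        instruction="Return the bounding box of the key region as (x0, y0, x1, y1).")
    bbox = parse_bbox(grounding_reply)

    # The environment crops the high-resolution sub-image; no grounding label is needed.
    sub_image = image.crop(bbox) if bbox else image

    # Turn 2: the model reasons over the original plus the cropped view and answers.
    final_reply = policy.generate(
        question=question, images=[image, sub_image],
        instruction="Answer the question.")

    # Binary reward from answer correctness alone, as in standard VQA training.
    return 1.0 if gold_answer.strip().lower() in final_reply.lower() else 0.0
```

Because the reward comes only from the final answer check, the sketch needs no grounding labels, which is the article's point about training on ordinary VQA data.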
A Taxonomy for Next-gen Reasoning — Nathan Lambert, Allen Institute (AI2) & Interconnects.ai
AI Engineer· 2025-07-19 21:15
Model Reasoning and Applications
- Reasoning unlocks new language model applications, exemplified by improved information retrieval [1]
- Reasoning models are enhancing applications such as website analysis and code assistance, making them more steerable and user-friendly [1]
- Reasoning models are pushing the limits of task completion, and ongoing effort is needed to determine what models require to keep progressing [1]

Planning and Training
- Planning is a new frontier for language models, requiring a shift in training approaches beyond reasoning skills alone [1][2]
- The industry needs research plans for training reasoning models that can work autonomously and have meaningful planning capabilities [1]
- Calibration is crucial for products: models tend to overthink, so output tokens must be better matched to problem difficulty [1]
- Strategy and abstraction are key subsets of planning, enabling models to choose how to break down problems and utilize tools effectively [1]

Reinforcement Learning and Compute
- Reinforcement learning with verifiable rewards is a core technique: language models generate completions, receive feedback from a verifier, and have their weights updated accordingly (see the sketch after this list) [2]
- Parallel compute enhances model robustness and exploration but doesn't solve every problem, indicating a need for balanced approaches [3]
- The industry is moving toward treating post-training as a significant portion of compute, potentially reaching parity with pre-training in GPU hours [3]
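As a concrete picture of reinforcement learning with verifiable rewards, the sketch below performs one REINFORCE-style update. The model/tokenizer interface follows HuggingFace-style causal-LM conventions as an assumption, and `verifier` stands in for any programmatic check such as a unit test or an exact-match grader; this is a minimal illustration, not any particular lab's training code.

```python
import torch

def rlvr_step(model, tokenizer, optimizer, prompt: str, verifier) -> float:
    """One REINFORCE-style update with a verifiable reward (sketch).

    Assumptions: `model`/`tokenizer` follow HuggingFace-style causal-LM conventions,
    and `verifier` maps a completion string to a scalar reward (e.g. 1.0 if a unit
    test or exact-match check passes, else 0.0).
    """
    # 1. Sample a completion from the current policy.
    inputs = tokenizer(prompt, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[1]
    completion_ids = model.generate(**inputs, do_sample=True, max_new_tokens=256)
    completion = tokenizer.decode(completion_ids[0, prompt_len:], skip_special_tokens=True)

    # 2. Score it with a programmatic verifier instead of a learned reward model.
    reward = float(verifier(completion))

    # 3. Recompute log-probs of the sampled completion tokens and scale by the reward.
    logits = model(completion_ids).logits[:, prompt_len - 1:-1, :]
    targets = completion_ids[:, prompt_len:]
    logprobs = torch.log_softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    loss = -(reward * logprobs.mean())

    # 4. Gradient step: completions that pass the verifier become more likely.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```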
How to Train Your Agent: Building Reliable Agents with RL — Kyle Corbitt, OpenPipe
AI Engineer· 2025-07-19 21:12
Core Idea
- The presentation is a case study on building ART E, an open-source natural-language assistant for answering questions from email inboxes using reinforcement learning [1][2][3]
- The speaker shares lessons learned, what worked and what didn't, and how the team built an agent that performed well with reinforcement learning [2]

Development Process & Strategy
- The speaker recommends starting with prompted models to reach the best possible performance before any training, including reinforcement learning, in order to work out bugs in the environment and potentially avoid training altogether [7][8][9]
- The company surpassed prompted-model baselines with reinforcement learning, achieving a 60% reduction in errors compared to the best prompted model (o3 reached 90% accuracy, while the RL model reached 96%) [10][15]
- Training the ART E model cost approximately $80 in GPU time plus one week of engineering time from an experienced engineer [23][24]

Key Metrics & Optimization
- The company benchmarked cost, accuracy, and latency, finding that the trained model (Qwen 2.5 14B) achieved a significant cost reduction compared to o3 ($55 per 1,000 searches) and o4-mini ($8 per 1,000 searches) [16][17]
- Latency was improved by moving to a smaller model, training the model to use fewer turns, and considering speculative decoding [19][20][21]
- The reward function was tuned to give extra credit for fewer turns and to discourage hallucination, yielding a significantly lower hallucination rate than prompted models (see the sketch after this list) [45][46][49][50]

Challenges & Solutions
- The two hard problems in applying RL are building a realistic environment and getting the reward function right [26][27][28]
- A realistic environment was created from the Enron email dataset, which contains 500,000 emails [33][34][35]
- The reward function was designed by having Gemini 2.5 Pro generate questions and answers from batches of emails, producing a verified dataset for the agent to learn from [37][38][39]
- The speaker emphasizes watching out for reward hacking, where the model exploits the reward function without actually solving the problem, and suggests modifying the reward function to penalize such behavior [51][53][61]
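A reward function with the shaping terms described above (correctness first, a small bonus for fewer turns, and a penalty for hallucinated answers) might look roughly like the following. The field names, weights, and answer-matching rule are illustrative assumptions, not OpenPipe's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    """Minimal record of one agent rollout (field names are illustrative)."""
    answer: str        # final answer produced by the agent
    gold_answer: str   # reference answer from the verified QA dataset
    num_turns: int     # tool-use turns taken before answering
    answered: bool     # whether the agent answered instead of declining

def reward(traj: Trajectory, max_turns: int = 10) -> float:
    """Correctness-dominated reward with a small turn bonus and a hallucination penalty."""
    correct = traj.gold_answer.strip().lower() in traj.answer.strip().lower()

    if correct:
        # Extra credit for finishing in fewer turns keeps the agent efficient.
        turn_bonus = 0.1 * (max_turns - traj.num_turns) / max_turns
        return 1.0 + turn_bonus
    if traj.answered:
        # A confident wrong answer (hallucination) is punished harder than abstaining.
        return -1.0
    # Declining to answer is better than hallucinating, but worse than being right.
    return 0.0
```

Keeping the turn bonus small relative to the correctness term is one simple guard against the reward hacking mentioned above, since gaming the bonus can never outweigh getting the answer right.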
L4 Industry Chain Tracking Series, Part 3: Recent Developments at a Leading Robotaxi Company (Technical Focus)
2025-07-16 06:13
Summary of Conference Call

Company and Industry
- The conference call primarily discusses advancements in the autonomous driving industry, focusing on a company working on Level 4 (L4) autonomous driving technology.

Key Points and Arguments
1. **Technological Framework**: The company uses a modular architecture for its autonomous driving system, covering perception, prediction, control, and planning. The framework has evolved to incorporate advanced techniques such as reinforcement learning and world models, although the core structure remains intact [1][2][3].
2. **Transition to Large Models**: The industry is shifting from CNN architectures to transformer-based models. The company is gradually replacing its existing models with these new frameworks, which may take longer because its current systems already perform at a high baseline [3][4].
3. **Data Utilization**: The company emphasizes both real and simulated data for model training. Real data is primarily used today, with plans to incorporate more simulated data to address shortages, especially for control models [8][9][10].
4. **Learning Techniques**: Imitation learning has been used for scenarios where rule-based approaches fail, while reinforcement learning is applied in end-to-end (E2E) models. The proportion of reinforcement learning remains small, indicating a cautious approach to its rollout [11][12].
5. **Operational Deployment**: The company has deployed autonomous vehicles in major cities such as Beijing and Guangzhou, with plans to expand to Shenzhen and Shanghai. The current fleet consists of a few hundred vehicles [14][21].
6. **Cost Structure**: Vehicle cost includes hardware such as multiple radars and cameras, with estimates suggesting the total cost could be reduced to around 200,000 yuan [15][19].
7. **Computational Resources**: The company faces challenges with compute capacity, particularly in integrating various models across different chips. It is optimizing existing resources while planning future upgrades [19][20].
8. **Profitability Goals**: The company aims to break even with a fleet of over 10,000 vehicles by 2027 or 2028; current estimates suggest profitability may require a fleet closer to 100,000 vehicles [26].
9. **Market Positioning**: The company acknowledges competition from other players in the autonomous driving space, particularly around regulatory approvals and operational capability, and aims to keep its edge by acquiring commercial licenses faster [27][28].

Other Important Content
- The discussion highlights the ongoing evolution of the autonomous driving landscape, balancing technological advancement against operational scalability. The company is committed to addressing challenges in data acquisition, model training, and fleet management to strengthen its market position [22][23][30].
Top Talent Keeps Getting Poached, and a Startup Veteran Speaks Frankly After Leaving OpenAI: Codex Was Ground Out in 7 Weeks, with No Unified Roadmap, Driven Entirely by Small Teams Charging Ahead
AI前线· 2025-07-16 05:08
Core Insights
- The article discusses the recent departure of key researchers from OpenAI to Meta's newly established superintelligence lab, highlighting the competitive landscape in AI research and talent acquisition [1][2][3]
- It offers a personal perspective on OpenAI's internal culture and operational dynamics, emphasizing an environment that fosters innovation and rapid project execution [3][4][10]

Group 1: OpenAI's Internal Culture
- OpenAI operates as a cluster of small teams rather than a centralized organization, allowing flexible and rapid execution of projects without a strict roadmap [3][11]
- The company emphasizes bottom-up decision-making: good ideas can come from any employee, and the focus is on action rather than extensive planning [11][12]
- OpenAI's culture grants researchers a high degree of autonomy, creating a dynamic environment where projects can be initiated and developed quickly [12][18]

Group 2: Talent Movement and Industry Dynamics
- The moves of researchers such as Jason Wei and Hyung Won Chung from OpenAI to Meta raise questions about OpenAI's internal environment and the factors influencing talent retention [1][2]
- The article reflects on the competitive nature of the AI industry, particularly among leading firms like OpenAI, Meta, and Google, each pursuing different strategies in the race toward AGI [33]

Group 3: Project Execution and Innovation
- The Codex project exemplifies OpenAI's ability to ship significant products quickly, with the team completing it in just seven weeks [26][27]
- OpenAI's operating model resembles a research lab: innovation comes first, and the focus is on impactful consumer applications while maintaining a commitment to safety and ethical considerations [15][16][18]
Two Days to Go Before the Course Begins! From Zero Fundamentals to Reinforcement Learning, and on to sim2real
具身智能之心· 2025-07-12 13:59
Core Viewpoint
- The article discusses the rapid advancements in embodied intelligence, highlighting its potential to transform industries by enabling robots to understand language, navigate complex environments, and make intelligent decisions [1].

Group 1: Embodied Intelligence Technology
- Embodied intelligence aims to give AI systems physical capabilities, allowing them to perceive and interact with the real world [1]
- Major tech companies such as Tesla, Boston Dynamics, OpenAI, and Google are competing in this transformative field [1]
- Potential applications span manufacturing, healthcare, service industries, and space exploration [1]

Group 2: Technical Challenges
- Achieving true embodied intelligence presents unprecedented technical challenges, requiring advanced algorithms and a deep understanding of physical simulation, robot control, and perception fusion [2]

Group 3: Role of MuJoCo
- MuJoCo (Multi-Joint dynamics with Contact) is identified as a critical technology for embodied intelligence, serving as a high-fidelity simulation engine that bridges the virtual and real worlds [3]
- It allows researchers to create realistic virtual robots and environments, enabling millions of trials and learning runs without risking expensive hardware [5]
- MuJoCo's advantages include high simulation speed, the ability to test extreme scenarios safely, and effective transfer of learned policies to real-world applications (a minimal usage sketch follows this summary) [5]

Group 4: Research and Industry Adoption
- MuJoCo has become a standard tool in academia and industry, with major companies such as Google, OpenAI, and DeepMind using it for robotics research [7]
- Mastery of MuJoCo places practitioners at the forefront of embodied intelligence technology [7]

Group 5: Practical Training and Curriculum
- A comprehensive MuJoCo development course has been created, focusing on practical applications and theoretical foundations within the embodied intelligence technology stack [9]
- The course uses project-driven learning, covering topics from physical simulation principles to deep reinforcement learning and Sim-to-Real transfer techniques [9][10]
- Six progressive projects are designed to deepen understanding and application across the technical stack, building a solid foundation for future research and work [14][15]

Group 6: Expected Outcomes
- Upon completing the course, participants will have acquired a complete embodied intelligence technology stack, strengthening their technical, engineering, and innovation capabilities [25][26]
- Participants will develop skills in building complex robot simulation environments, understanding core reinforcement learning algorithms, and applying Sim-to-Real transfer techniques [25]
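For readers new to the tool, the sketch below shows what a minimal simulation loop looks like with the official `mujoco` Python bindings. The inline pendulum model and the constant torque are toy stand-ins for illustration, not course material; a learned policy would replace the constant control value.

```python
import mujoco

# A toy single-pendulum model defined inline (illustrative only).
PENDULUM_XML = """
<mujoco>
  <option timestep="0.002" gravity="0 0 -9.81"/>
  <worldbody>
    <body name="pole" pos="0 0 1">
      <joint name="hinge" type="hinge" axis="0 1 0"/>
      <geom type="capsule" fromto="0 0 0 0 0 -0.5" size="0.02" mass="1"/>
    </body>
  </worldbody>
  <actuator>
    <motor joint="hinge" ctrlrange="-2 2"/>
  </actuator>
</mujoco>
"""

model = mujoco.MjModel.from_xml_string(PENDULUM_XML)
data = mujoco.MjData(model)

# Run 1,000 control steps; a trained policy would set data.ctrl each step.
for step in range(1000):
    data.ctrl[:] = 0.5              # constant torque stand-in for a policy action
    mujoco.mj_step(model, data)     # advance the physics by one timestep

print("final joint angle (rad):", float(data.qpos[0]))
```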
Former OpenAI Researcher Kevin Lu: Stop Fiddling with RL; the Internet Is What Actually Drives Progress in Large Models
Founder Park· 2025-07-11 12:07
Core Viewpoint
- The article argues that the internet, rather than model architectures like the Transformer alone, is the key technology driving the advancement of artificial intelligence [1][5][55].

Group 1: Importance of the Internet
- The internet provides a rich and diverse data source essential for training AI models, enabling scalable deployment and natural learning pathways [1][5][54]
- Without the internet, even advanced architectures like Transformers would lack the data needed to perform effectively, underscoring the critical role of data quality and quantity [28][30]

Group 2: Critique of Current Research Focus
- The article critiques the current emphasis on optimizing model architectures and hand-building datasets, arguing that these approaches are unlikely to yield significant improvements in model capability [1][19][55]
- It suggests researchers shift their focus from deep-learning optimizations to new ways of consuming data, particularly by leveraging the internet [16][17]

Group 3: Data Paradigms
- The article outlines two main paradigms of data consumption, a compute-bound era and a data-bound era, marking a shift in emphasis from algorithmic improvements to data availability [11][13]
- It argues that the internet's vast store of sequence data is perfectly suited to next-token prediction, the core objective behind many AI models (a minimal illustration follows this summary) [17][22]

Group 4: Role of Reinforcement Learning
- While reinforcement learning (RL) is seen as a necessary condition for advanced AI, the article notes the difficulty of obtaining high-quality reward signals for RL applications [55][61]
- The internet serves as the complementary resource to next-token prediction that RL needs in order to thrive [55][56]

Group 5: Future Directions
- The article calls for rethinking how AI research is conducted, suggesting that closer collaboration between product development and research could yield more meaningful advances [35][54]
- It emphasizes the need for diverse and economically viable data sources to support robust AI systems, noting that user engagement is vital for data contribution [51][54]
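The next-token prediction objective referenced above is simply a cross-entropy loss at every position of a sequence. The toy sketch below makes that concrete; the random logits stand in for a real language model and the token ids are arbitrary.

```python
import torch
import torch.nn.functional as F

# A toy "internet document" encoded as token ids (values are arbitrary).
tokens = torch.tensor([[5, 17, 3, 42, 7]])

# Stand-in for a language model: random logits over a 100-token vocabulary.
vocab_size = 100
logits = torch.randn(1, tokens.shape[1] - 1, vocab_size, requires_grad=True)

# Next-token prediction: the logits at position t are scored against token t+1.
targets = tokens[:, 1:]
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()   # in a real model these gradients would update its parameters

print(f"next-token cross-entropy: {loss.item():.3f}")
```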
Reward Models Finally Enter a New Era of Pre-training! Shanghai AI Lab and Fudan's POLAR Opens a New Scaling Paradigm
机器之心· 2025-07-10 04:26
Core Viewpoint
- The article discusses the limitations of current reward-modeling methods in reinforcement learning, particularly for large language models (LLMs), and introduces POLAR, a new paradigm aimed at making reward modeling more scalable and generalizable [2][3][5].

Group 1: Current Reward Modeling Methods
- Preference-based reward modeling relies on high-quality preference data that is costly and difficult to scale, and it struggles with generalization and susceptibility to reward hacking [3][4]
- Rule-based verifier methods provide accurate reward signals for verifiable tasks but fail to extend to more general settings such as open-domain dialogue and complex interactions [3][4]

Group 2: Introduction of POLAR
- POLAR, developed by a team from Shanghai AI Lab and Fudan University, uses Policy Discriminative Learning to decouple the reward from absolute preferences, enabling efficient scaling and strong generalization [5][9]
- Training measures the "distance" between a candidate policy and an optimal policy, providing a relative reward signal that does not depend on human-annotated preferences [9][10]

Group 3: Training Methodology
- POLAR's pre-training corpus is constructed through automated data synthesis, sampling from LLM pre-training data and using a large pool of models for trajectory sampling [14][15]
- The pre-training objective employs a Bradley-Terry loss to assign higher rewards to trajectories generated by similar policies, effectively modeling differences between policy distributions (a minimal sketch follows this summary) [14][15]

Group 4: Performance and Generalization
- POLAR demonstrates superior performance in preference evaluation, outperforming state-of-the-art reward models by significant margins across tasks including STEM [33]
- In reinforcement fine-tuning (RFT) experiments, models fine-tuned with POLAR improve by an average of 9.0% over their initial results, highlighting its effectiveness for enhancing LLM capabilities [34]

Group 5: Scaling Effects
- POLAR exhibits scaling laws similar to LLM next-token prediction, indicating that increased computational resources lead to better reward-model performance [35]
- Validation loss decreases as a power law in model parameters and training compute, suggesting the potential to build more powerful and generalizable reward models [35]

Conclusion
- POLAR represents a novel, scalable approach to reward modeling, offering new possibilities for LLM post-training and addressing key challenges in reinforcement learning [37].
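The Bradley-Terry objective mentioned above is the standard pairwise loss that pushes the score of a preferred trajectory above a rejected one. The sketch below is a generic version of that loss, not POLAR's code; in POLAR's setup (as described above), "preferred" would correspond to the trajectory sampled from the policy closer to the reference, whereas in classic RLHF it is the human-preferred response.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(score_preferred: torch.Tensor,
                       score_rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_preferred - r_rejected), the standard pairwise reward-model loss."""
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Toy usage: scalar scores a reward model might assign to a pair of trajectories.
preferred = torch.tensor([1.3, 0.4], requires_grad=True)
rejected = torch.tensor([0.2, 0.9], requires_grad=True)

loss = bradley_terry_loss(preferred, rejected)
loss.backward()   # gradients push preferred scores up and rejected scores down
print(f"pairwise loss: {loss.item():.3f}")
```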
Two AI Startups Founded by Chinese Entrepreneurs Each Raise Tens of Millions of Dollars: Both Founders Come from Meta
投资实习所· 2025-07-09 05:42
Core Insights
- The article highlights AI products built by Chinese teams, focusing on Pokee AI, which has raised $12 million in seed funding to automate enterprise workflows [1][12].

Group 1: Company Overview
- Pokee AI is led by Bill Zhu, a former head of the reinforcement learning group at Meta, and aims to automate users' online workflows by integrating AI capabilities into existing tools and services [1][11]
- The funding round was led by Point72 Ventures, with participation from Qualcomm, Samsung, and other notable investors, indicating strong market interest and confidence in the product [1][12]

Group 2: Product Features
- Pokee AI integrates AI capabilities into platforms such as Google Workspace, Meta platforms, LinkedIn, and more, allowing users to automate tasks without switching between tools [2][3]
- The product targets three core scenarios: AI + Productivity, AI + Social Media, and AI + Research & Engineering, addressing common pain points in workflow automation [9]

Group 3: Technology and Approach
- Unlike many AI agents that rely mainly on large language models (LLMs), Pokee AI uses reinforcement learning (RL) to handle the execution of complex tasks, which it views as a significant advantage [6][11]
- The RL approach allows the agent to learn from interactions with its environment, improving decision-making and execution and reportedly achieving over 97% accuracy when selecting among thousands of tools [11]

Group 4: Market Context
- The article notes a growing trend of Chinese AI teams building innovative solutions for enterprise-level automation, with other products also securing significant funding and market traction [12][15]
- The focus on automating repetitive tasks and enhancing productivity reflects a broader industry shift toward integrating AI into everyday business processes [8][12]
DeepSeek Retrospective: 128 Days Later, Why the Next Release Keeps Being Delayed (SemiAnalysis)
2025-07-07 15:45
Summary of DeepSeek's Impact on the AI Market

Industry Overview
- The document discusses the AI industry, specifically focusing on DeepSeek, a Chinese large language model (LLM) developer whose recently launched R1 model competes with OpenAI's offerings [4][7].

Key Points and Arguments
1. **Market Entry and Pricing Strategy**
   - DeepSeek R1 was launched at a competitive price of $0.55 input and $2.1 output, undercutting OpenAI's pricing by 80% [4][8].
   - Despite initial market-share growth, DeepSeek's user momentum has declined, indicating challenges in maintaining its competitive edge [8][9].
2. **User Engagement and Traffic Trends**
   - After the launch, DeepSeek saw a spike in consumer app traffic, but this growth has not been sustained relative to other AI applications [8].
   - Traffic to DeepSeek's own web interface has decreased, while usage of third-party hosted instances of DeepSeek has grown nearly 20x [10][13].
3. **Tokenomics and Performance Trade-offs**
   - DeepSeek's pricing strategy is shaped by its tokenomics, which involve trade-offs between latency, throughput, and context-window size (see the sketch after this summary) [17][19].
   - Latency is a significant drawback: users wait longer for responses than with competing services [22].
   - DeepSeek's context window is smaller than competitors', limiting its effectiveness on complex tasks such as coding [24].
4. **Batching and Resource Allocation**
   - DeepSeek batches requests aggressively to minimize cost, which raises latency and lowers per-user throughput [27][28].
   - The company prioritizes internal research and development over user experience, focusing on progress toward artificial general intelligence (AGI) [27].
5. **Competitive Landscape**
   - Other AI providers, such as Anthropic and Google, leverage their compute resources to enhance user experience and performance, in contrast to DeepSeek's approach [29][30].
   - Anthropic's recent progress in coding applications has outpaced DeepSeek, highlighting the competitive pressure in the AI market [30][41].
6. **Future Prospects and Challenges**
   - There are rumors of delays in the release of DeepSeek's R2 model, attributed to export controls and operational changes within the company [54][55].
   - Despite these challenges, DeepSeek continues to innovate, with recent updates showing improvements in coding performance [55][56].

Additional Important Insights
- The document emphasizes the importance of compute resources in the AI industry, noting that companies like Amazon are investing heavily in AI infrastructure [37][38].
- The shift toward treating tokens as a service rather than a bundled subscription is gaining traction, with more companies emulating Anthropic's approach [44].
- Competitive dynamics in the AI market are evolving rapidly, with cost and user experience becoming critical success factors [47][53].
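The latency-versus-cost trade-off behind aggressive batching can be shown with back-of-the-envelope arithmetic. Every number in the sketch below is an illustrative assumption (throughput curve, GPU price, batch sizes) and does not describe DeepSeek's or anyone else's actual serving stack.

```python
def serving_economics(batch_size: int, gpu_cost_per_hour: float = 2.0) -> dict:
    """Toy model: bigger batches raise total GPU throughput but slow each user down."""
    # Assume per-user decode speed falls as the batch grows (memory-bandwidth bound)...
    per_user_tps = 100.0 / (1 + 0.05 * batch_size)
    # ...while aggregate throughput still rises because many users share each pass.
    total_tps = per_user_tps * batch_size

    latency_500_tokens_s = 500 / per_user_tps                         # one 500-token reply
    cost_per_m_tokens = gpu_cost_per_hour / (total_tps * 3600 / 1e6)  # provider cost

    return {
        "per_user_tok_per_s": round(per_user_tps, 1),
        "latency_500_tok_s": round(latency_500_tokens_s, 1),
        "cost_per_M_tok_usd": round(cost_per_m_tokens, 2),
    }

# Small batches feel responsive; large batches cut the provider's cost per token
# at the price of slower individual streams, which is the trade-off described above.
for batch in (1, 16, 128):
    print(f"batch={batch:>3}", serving_economics(batch))
```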