Reinforcement Learning
Inside OpenAI’s Rocky Path to GPT-5 — The Information
2025-08-05 03:19
Summary of OpenAI's Path to GPT-5

Industry Overview
- The document discusses the challenges OpenAI has faced in developing its next flagship AI model, GPT-5, highlighting broader industry trends in performance improvements and technical difficulties [2][6][10].

Key Points and Arguments
- **Performance Expectations**: GPT-5 is expected to improve on previous models, but the gains will not match the dramatic leaps seen between earlier versions such as GPT-3 and GPT-4 [6][10].
- **Technical Challenges**: OpenAI has run into technical problems that hindered the development of models like o3, which was intended to boost performance but ultimately fell short of expectations [6][10][34].
- **Incremental Gains**: Despite these challenges, current models are generating substantial commercial value, which could sustain customer demand even for incremental improvements [11].
- **Investment Needs**: OpenAI plans to spend $45 billion over the next three and a half years on development and operations, which may attract new investors [11].
- **Microsoft Partnership**: OpenAI has a close financial relationship with Microsoft, which holds a significant equity stake in OpenAI's for-profit arm. Negotiations between the two companies are ongoing, with Microsoft likely to secure a 33% stake [18][22].
- **Competition**: OpenAI faces stiff competition from well-capitalized rivals such as Google, xAI, and Anthropic, raising concerns about its ability to maintain a leading position in AI advancements [22].

Additional Important Content
- **Model Development Issues**: The internal development of a model named Orion, originally intended to be GPT-5, failed to produce the expected results and was released as GPT-4.5 instead [23][24].
- **Resource Allocation**: OpenAI has improved its models by using more Nvidia chip servers, increasing processing power for complex tasks [30].
- **Reinforcement Learning**: The company has focused on reinforcement learning techniques to improve AI capabilities, which it regards as essential for achieving artificial general intelligence (AGI) [44].
- **Staff Changes**: OpenAI has experienced reorganizations and departures, including researchers leaving for competitors like Meta, which has affected morale and productivity [19][20].
- **Communication Challenges**: Converting advanced reasoning models into chat-based versions has led to performance degradation, indicating a need for better training in human communication [35][38].

Conclusion
- OpenAI is on a complex path to releasing GPT-5, facing both internal and external challenges. While the model is expected to bring improvements, the company must navigate technical difficulties, competitive pressure, and heavy investment needs to sustain its growth and innovation in the AI sector [6][10][11][22].
Supercharging Startups with AI Agents | Mohit Ambani | TEDxSGGSCC Studio
TEDx Talks· 2025-08-01 15:16
AI Fundamentals
- Generative AI works by probabilistically filling in the blanks based on pre-trained data, essentially acting as an advanced autocomplete [5][6]
- Pre-training involves feeding massive amounts of unstructured data into large language models (LLMs), requiring significant energy and resources for processing and refinement [7][8][9]
- Reinforcement learning and reasoning improve accuracy by taking strategic actions and scoring generated results, reducing hallucinations [11][12]

AI Applications in Business
- AI agents can automate tasks across various tools and interfaces, acting as digital employees capable of understanding unstructured data and executing actions [13][14]
- AI tools can significantly scale business operations, as demonstrated by a cosmetics brand using an AI agent to streamline influencer marketing, reducing the required team size and time [21][22]
- AI agents are being used in sales to personalize outreach and automate follow-ups, increasing order rates and reducing campaign costs [24]
- AI is being applied in operations to automate pricing and quotation processes, monitor safety incidents, and improve response times [25][26]
- AI is aiding financial analysis by enabling rapid screening of stocks against specific criteria, using open-source tools to retrieve data from millions of PDF files [28]

AI's Impact and Future
- AI is evolving beyond replacing existing processes to enabling new inventions, such as a novel use of magnetic ink in supply chain management [30][31][32][33]
- The industry is advancing rapidly toward artificial general intelligence (AGI) and artificial superintelligence (ASI), with continuous improvements in AI models and capabilities [34]
- A fundamental question is raised about the role of humans in a world where many jobs can be automated, emphasizing the importance of curiosity and relentless questioning [34][35]
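The "advanced autocomplete" framing above can be made concrete: a language model scores every candidate next token and samples one in proportion to those scores. A minimal Python sketch of that sampling step, under the assumption of a toy vocabulary and made-up scores (not real model outputs):

```python
import math
import random

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(vocab, logits, rng):
    """Probabilistically 'fill in the blank': pick one token
    with probability proportional to its score."""
    probs = softmax(logits)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Toy example: candidate words after "the cat sat on the ..."
vocab = ["mat", "roof", "equation"]
logits = [3.0, 1.5, -2.0]
rng = random.Random(0)
print(sample_next_token(vocab, logits, rng))
```

The hallucinations mentioned above are a direct consequence of this design: the sampler always produces a fluent-looking token, whether or not the underlying claim is true.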
China Went HARD...
Matthew Berman· 2025-07-24 00:30
Model Performance & Capabilities
- Qwen3 Coder rivals Anthropic's Claude family in coding performance, scoring 69.6% on SWE-bench Verified versus 70.4% for Claude Sonnet 4 [1]
- The most powerful variant, Qwen3 Coder 480B, is a mixture-of-experts model with 480 billion total parameters and 35 billion active parameters [2][3]
- The model supports a native context length of 256k tokens, extendable to 1 million tokens with extrapolation methods, strengthening its tool-calling and agentic capabilities [4]

Training Data & Methodology
- The model was pre-trained on 7.5 trillion tokens with a 70% code ratio, improving coding ability while maintaining general and math skills [5]
- Qwen2.5 Coder was used to clean and rewrite noisy data, significantly improving overall data quality [6]
- Code RL training was scaled across a broad set of diverse, real-world coding tasks to unlock the full potential of reinforcement learning [7][8]

Tooling & Infrastructure
- Qwen launched Qwen Code, a command-line tool adapted from Gemini CLI, enabling agentic, multi-turn execution with planning [2][5][9]
- A scalable system was built to run 20,000 independent environments in parallel on Alibaba Cloud infrastructure for self-play [10]

Open Source & Accessibility
- The model is hosted on Hugging Face and is free to use and try out [11]
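The 480B-total / 35B-active split described above is the hallmark of mixture-of-experts routing: a small gating network scores all experts for each token and only the top-k actually run. A minimal sketch of top-k routing, where the expert count and gate scores are illustrative assumptions rather than Qwen3 Coder's actual configuration:

```python
def top_k_experts(gate_scores, k=2):
    """Return the indices of the k highest-scoring experts.
    Only these experts execute for the current token, so the number
    of active parameters is a small fraction of the total."""
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)
    return sorted(ranked[:k])

# Toy gate output for one token over 8 experts
scores = [0.1, 2.3, -0.5, 1.7, 0.0, 3.1, -1.2, 0.4]
print(top_k_experts(scores, k=2))  # only these experts handle this token
```

This is why a 480B-parameter model can serve tokens at roughly the cost of a 35B dense model: the gate activates only a few experts per token.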
ChatGPT Agent Team Interview: What's Different When a Foundation-Model Company Builds a General-Purpose Agent, Compared with Manus?
Founder Park· 2025-07-23 13:23
Core Insights
- The article discusses OpenAI's introduction of ChatGPT Agent, which combines deep research and Operator capabilities into a versatile agent that can carry out complex tasks over extended periods without losing control [1][6][13].

Group 1: ChatGPT Agent Overview
- ChatGPT Agent is described as the first fully "embodied" agent on a computer, moving seamlessly between visual browsing, text analysis, and code execution [1][7].
- The agent can work on complex tasks for up to an hour without losing control, showcasing its advanced capabilities [13][19].

Group 2: Training Methodology
- ChatGPT Agent was trained with reinforcement learning (RL): the model was given a variety of tools and allowed to discover optimal strategies on its own [2][10].
- The agent combines a text browser with a graphical interface, improving its efficiency and flexibility in task execution [6][8].

Group 3: Functionality and Use Cases
- ChatGPT Agent can handle varied tasks, including deep research, online shopping, and creating presentations, making it suitable for both consumer and business use [13][15].
- Users have reported practical applications such as extracting data from Google Docs and generating financial models, indicating its versatility [16][17].

Group 4: Future Developments
- The team envisions continuous improvements in the agent's accuracy and capabilities, aiming to expand its functionality across a wide range of tasks [23][33].
- There is an emphasis on improving user interaction and exploring new paradigms for collaboration between users and the agent [34][36].

Group 5: Safety and Risk Management
- The article highlights the increased risks that come with the agent's ability to act in the real world, necessitating robust safety measures and ongoing monitoring [35][36].
- The development team is building a comprehensive safety framework to mitigate potentially harmful actions by the agent [37][39].
OpenAI Just Released ChatGPT Agent, Its Most Powerful Agent Yet
Sequoia Capital· 2025-07-22 09:00
Agent Capabilities & Architecture
- OpenAI has built a new agent in ChatGPT that can perform tasks that would take humans a long time, by giving the agent access to a virtual computer [6]
- The agent has access to a text browser (similar to the deep research tool), a virtual browser (similar to the Operator tool, with full GUI access), and a terminal for running code and calling APIs [6][7][8]
- All tools share state, enabling flexible and complex tasks [9]
- The agent is trained with reinforcement learning across thousands of virtual machines, letting it discover optimal strategies for tool usage [3]

Development & Training
- The agent is a collaboration between the Deep Research and Operator teams, combining the strengths of both [6]
- The agent is trained with reinforcement learning that rewards efficient and correct task completion [36]
- The model figures out when to use which tool through experimentation, without explicit instructions [38]
- Reinforcement learning is data-efficient, allowing new capabilities to be taught with smaller, high-quality datasets [75][76]

Safety & Limitations
- Safety training and mitigations were a core part of development because the agent can take actions with external side effects [44]
- The team has implemented a monitor that watches for suspicious activity, similar to antivirus software [48]
- Date picking remains a difficult task for the AI system [4][83][84]

Future Directions
- Future development will focus on improving accuracy and performance across a wide distribution of tasks [62][85]
- The team is exploring ways of interacting with the agent beyond the current chat-based interface [68][86]
- Personalization and memory will be important for future agents, allowing them to act without being explicitly asked [67][68]
Autonomous Driving Paper Express | World Models, End-to-End, VLM/VLA, Reinforcement Learning, and More
自动驾驶之心· 2025-07-21 04:14
Core Insights
- The article surveys advances in autonomous driving research, focusing on the Orbis model from the University of Freiburg, which significantly improves long-horizon prediction in driving world models [1][2].

Group 1: Orbis Model Contributions
- Orbis addresses the shortcomings of contemporary driving world models in long-horizon generation, particularly for complex maneuvers like turns, and introduces a trajectory-distribution-based evaluation metric to quantify these issues [2].
- It employs a hybrid discrete-continuous tokenizer that allows fair comparison between discrete and continuous prediction methods, showing that continuous modeling (based on flow matching) outperforms discrete modeling (based on masked generation) for long-horizon prediction [2].
- The model achieves state-of-the-art (SOTA) performance with only 469 million parameters and 280 hours of monocular video data, excelling in complex driving scenarios such as turns and urban traffic [2].

Group 2: Experimental Results
- Orbis achieved a Fréchet Video Distance (FVD, lower is better) of 132.25 on the nuPlan dataset for 6-second rollouts, far below Cosmos (291.80) and Vista (323.37), indicating superior prediction quality [6][7].
- In turn scenarios, Orbis again led with an FVD of 231.88, versus 316.99 for Cosmos and 413.61 for Vista, showing its effectiveness in challenging driving conditions [6][7].

Group 3: LaViPlan Framework
- The LaViPlan framework, developed by ETRI, uses reinforcement learning with verifiable rewards to address misalignment between the visual, language, and action components of autonomous driving, achieving a 19.91% reduction in Average Displacement Error (ADE) on easy scenarios and 14.67% on hard scenarios in the ROADWork dataset [12][14].
- It emphasizes the shift from linguistic fidelity to functional accuracy in trajectory outputs, revealing a trade-off between semantic similarity and task-specific reasoning [14].

Group 4: World Model-Based Scene Generation
- The University of Macau introduced a world-model-driven scene generation framework that enhances dynamic graph convolution networks, achieving 83.2% Average Precision (AP) and a mean Time to Anticipate (mTTA) of 3.99 seconds on the DAD dataset, a significant improvement [23][24].
- The framework combines scene generation with adaptive temporal reasoning to create high-resolution driving scenarios, addressing data scarcity and modeling limitations [24].

Group 5: ReAL-AD Framework
- The ReAL-AD framework, proposed by Shanghai University of Science and Technology and the Chinese University of Hong Kong, integrates a three-layer human cognitive decision-making model into end-to-end autonomous driving, improving planning accuracy by 33% and reducing collision rates by 32% [33][34].
- Its three core modules enhance situational awareness and structured reasoning, yielding significant gains in trajectory planning accuracy and safety [34].
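Average Displacement Error, the metric LaViPlan reduces, is simply the mean Euclidean distance between corresponding points of a predicted trajectory and the ground truth. A minimal sketch (the sample trajectories are made up for illustration):

```python
import math

def average_displacement_error(predicted, ground_truth):
    """ADE: mean Euclidean distance between corresponding points of a
    predicted trajectory and the ground-truth trajectory."""
    assert len(predicted) == len(ground_truth)
    dists = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(dists) / len(dists)

# Toy 2D trajectories: one predicted point drifts 0.5 m off course
pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.5)]
truth = [(0.0, 0.0), (1.0, 0.5), (2.0, 0.5)]
print(average_displacement_error(pred, truth))
```

A "19.91% reduction in ADE" therefore means the predicted trajectories end up, on average, roughly a fifth closer to where the vehicle actually should go.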
Breaking the High-Resolution Image Reasoning Bottleneck: Fudan University and Nanyang Technological University Propose MGPO, a Multi-Turn Reinforcement Learning Framework Based on Visual Grounding
机器之心· 2025-07-21 04:04
Core Insights
- The article discusses MGPO, a multi-turn reinforcement learning method that strengthens the visual reasoning of large multi-modal models (LMMs) on high-resolution images [1][8][21]
- MGPO lets LMMs automatically predict key-area coordinates and crop sub-images based on the question, improving the model's ability to focus on relevant information without requiring expensive grounding annotations [2][21]

Summary by Sections

Introduction
- Current LMMs such as Qwen2.5-VL struggle with high-resolution images because images are converted into a large number of visual tokens, many of which are irrelevant to the task [5][6]
- The human visual system uses a task-driven visual search strategy, which MGPO aims to replicate by enabling LMMs to focus on key areas of images [6][7]

Method Overview
- MGPO simulates a multi-step visual reasoning process in which the model first predicts key-area coordinates and then crops sub-images for further reasoning [10][21]
- The method overcomes the limitation of traditional visual grounding models, which require extensive grounding annotations for training [7][21]

Key Innovations of MGPO
- A top-down, interpretable visual reasoning mechanism that lets LMMs conduct problem-driven visual search [2]
- The ability to accurately identify relevant area coordinates in high-resolution images even when visual tokens are limited [2]
- Training on standard Visual Question Answering (VQA) datasets without additional grounding annotations, relying solely on answer correctness for feedback [2][21]

Experimental Results
- MGPO delivered significant gains over SFT and GRPO baselines, improving benchmark scores by 5.4% and 5.2% respectively [18][19]
- The model outperformed OpenAI's models despite being trained on a smaller dataset, showcasing its effectiveness [18][19]
- The proportion of valid grounding coordinates generated by MGPO rose substantially during training, indicating that the model develops robust visual grounding capabilities autonomously [20]

Conclusion
- MGPO effectively addresses visual token redundancy and key-information loss in high-resolution image processing [21]
- The method shows that reinforcement learning can foster robust grounding capabilities without costly annotations, improving the efficiency of LMMs [21]
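The crop step at the heart of MGPO is mechanically simple: the model emits a bounding box for the key region, and the pipeline cuts that region out of the full-resolution image so it can be re-encoded in a later reasoning turn. A minimal sketch using a nested-list image, where the helper name and the [x1, y1, x2, y2] box format are assumptions for illustration:

```python
def crop_key_region(image, box):
    """Crop the model-predicted key area out of a row-major image
    (a list of pixel rows). box = [x1, y1, x2, y2], half-open ranges."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]

# Toy 4x4 "image" of labelled pixels
image = [[f"p{r}{c}" for c in range(4)] for r in range(4)]
sub = crop_key_region(image, [1, 1, 3, 3])  # 2x2 centre region
print(sub)
```

Because the crop is driven only by predicted coordinates, the training signal (answer correctness) never needs box annotations, which is exactly what lets MGPO train on plain VQA data.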
A Taxonomy for Next-gen Reasoning — Nathan Lambert, Allen Institute (AI2) & Interconnects.ai
AI Engineer· 2025-07-19 21:15
Model Reasoning and Applications
- Reasoning unlocks new language model applications, exemplified by improved information retrieval [1]
- Reasoning models are enhancing applications like website analysis and code assistance, making them more steerable and user-friendly [1]
- Reasoning models are pushing the limits of task completion, and ongoing effort is needed to determine what models require to keep progressing [1]

Planning and Training
- Planning is a new frontier for language models, requiring a shift in training approaches beyond reasoning skills alone [1][2]
- The industry needs research plans for training reasoning models that can work autonomously and plan meaningfully [1]
- Calibration is crucial for products: models tend to overthink, so output tokens need to be better matched to problem difficulty [1]
- Strategy and abstraction are key subsets of planning, letting models choose how to break down problems and use tools effectively [1]

Reinforcement Learning and Compute
- Reinforcement learning with verifiable rewards is a core technique: language models generate completions and receive feedback that is used to update their weights [2]
- Parallel compute improves robustness and exploration but doesn't solve every problem, indicating a need for balanced approaches [3]
- The industry is moving toward treating post-training as a significant portion of compute, potentially reaching parity with pre-training in GPU hours [3]
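The verifiable-rewards loop described above (generate a completion, check it with a verifier, update the weights) can be sketched with a toy softmax policy over candidate answers and a REINFORCE update. Everything here, the candidate set, learning rate, and step count, is an illustrative assumption, not an actual training recipe:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def train_with_verifiable_reward(candidates, correct, steps=500,
                                 lr=0.5, seed=0):
    """Toy RLVR loop: the 'policy' is a softmax over candidate answers.
    Each step samples an answer, the verifier grants reward 1 only if
    it is exactly correct, and a REINFORCE update nudges the logits
    toward rewarded behavior."""
    rng = random.Random(seed)
    logits = [0.0] * len(candidates)
    for _ in range(steps):
        probs = softmax(logits)
        i = rng.choices(range(len(candidates)), weights=probs, k=1)[0]
        reward = 1.0 if candidates[i] == correct else 0.0  # verifiable check
        # gradient of log pi(i): indicator(j == i) - probs[j]
        for j in range(len(logits)):
            logits[j] += lr * reward * ((1.0 if j == i else 0.0) - probs[j])
    return softmax(logits)

probs = train_with_verifiable_reward(["4", "5", "22"], correct="4")
print(probs)  # probability mass concentrates on the verifiably correct answer
```

The key property, and why "verifiable" matters, is that the reward comes from a check the trainer can run mechanically (here string equality; in practice unit tests or exact-answer matching), so no human labeling is needed per rollout.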
How to Train Your Agent: Building Reliable Agents with RL — Kyle Corbitt, OpenPipe
AI Engineer· 2025-07-19 21:12
Core Idea
- The talk presents a case study on building ART E, an open-source natural-language assistant that answers questions from email inboxes, trained with reinforcement learning [1][2][3]
- The speaker shares lessons learned, what worked and what didn't, and how the team built an agent that performed well with reinforcement learning [2]

Development Process & Strategy
- The speaker recommends starting with prompted models to get the best possible performance before any training, including reinforcement learning, in order to work out bugs in the environment and potentially avoid training altogether [7][8][9]
- The team surpassed the prompted-model baselines with reinforcement learning, achieving a 60% reduction in errors compared to the best prompted model (o3 reached 90% accuracy, while the RL-trained model reached 96%) [10][15]
- Training the ART E model cost approximately $80 in GPU time plus one week of an experienced engineer's time [23][24]

Key Metrics & Optimization
- The team benchmarked cost, accuracy, and latency, finding that the trained model (Qwen2.5 14B) achieved a large cost reduction compared with o3 ($55 per 1,000 searches) and o4-mini ($8 per 1,000 searches) [16][17]
- Latency was improved by moving to a smaller model, training the model to use fewer turns, and considering speculative decoding [19][20][21]
- The reward function was tuned to give extra credit for fewer turns and to discourage hallucination, resulting in a much lower hallucination rate than the prompted models [45][46][49][50]

Challenges & Solutions
- The two hard problems in applying RL are building a realistic environment and getting the reward function right [26][27][28]
- The team built a realistic environment from the Enron email dataset, which contains 500,000 emails [33][34][35]
- The reward function was designed by having Gemini 2.5 Pro generate questions and answers from batches of emails, creating a verified dataset for the agent to learn from [37][38][39]
- The speaker stresses watching out for reward hacking, where the model exploits the reward function without actually solving the problem, and suggests modifying the reward function to penalize such behavior [51][53][61]
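The reward-shaping ideas described above (extra credit for fewer turns, discouraging hallucination) can be sketched as a scalar reward function. The coefficients and signature here are illustrative assumptions, not OpenPipe's actual implementation:

```python
def shaped_reward(correct, num_turns, hallucinated,
                  max_turns=10, turn_bonus=0.2, hallucination_penalty=1.0):
    """Sketch of a shaped reward: base credit for a correct answer,
    a bonus that grows as the agent uses fewer turns, and a penalty
    for confidently answering wrong instead of saying 'I don't know'."""
    reward = 1.0 if correct else 0.0
    if correct:
        # Extra credit proportional to the number of turns saved
        reward += turn_bonus * (max_turns - num_turns) / max_turns
    if hallucinated:
        reward -= hallucination_penalty
    return reward

# Fewer turns earn more; hallucinating scores worse than abstaining
print(shaped_reward(True, 2, False))
print(shaped_reward(True, 9, False))
print(shaped_reward(False, 3, True))
```

Making the hallucination penalty larger than the abstention score (zero) is what teaches the model to say "I don't know" rather than guess, which is the behavior change the talk reports. The reward-hacking caveat applies here too: if the turn bonus is set too high, the model may learn to answer prematurely just to keep turn counts low.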
L4 Industry Chain Tracking Series, Part 3: Progress Update on a Leading Robotaxi Company (Technology)
2025-07-16 06:13
Summary of Conference Call

Company and Industry
- The call focuses on advances in the autonomous driving industry, specifically a company working on Level 4 (L4) autonomous driving technology.

Key Points and Arguments
1. **Technological Framework**: The company uses a modular architecture for its autonomous driving system covering perception, prediction, planning, and control. The framework has evolved to incorporate advanced techniques like reinforcement learning and world models, but the core structure remains intact [1][2][3].
2. **Transition to Large Models**: The industry is shifting from CNN architectures to transformer-based models. The company is gradually replacing its existing models with these new frameworks, which may take longer because its current systems already perform at a high baseline [3][4].
3. **Data Utilization**: The company stresses the importance of both real and simulated data for model training. Real data dominates today, but simulated data will be used increasingly to address data shortages, especially for control models [8][9][10].
4. **Learning Techniques**: Imitation learning is used where rule-based approaches fail, while reinforcement learning is applied in end-to-end (E2E) models. The proportion of reinforcement learning remains small, indicating a cautious approach to its adoption [11][12].
5. **Operational Deployment**: The company has deployed autonomous vehicles in major cities such as Beijing and Guangzhou, with plans to expand into Shenzhen and Shanghai. The current fleet numbers a few hundred vehicles [14][21].
6. **Cost Structure**: Vehicle cost includes hardware such as multiple radars and cameras, with estimates suggesting the total could be reduced to around 200,000 yuan [15][19].
7. **Computational Resources**: The company faces challenges with computational capacity, particularly in integrating various models across different chips. It is focused on optimizing existing resources while planning future upgrades [19][20].
8. **Profitability Goals**: The company aims to break even with a fleet of over 10,000 vehicles by 2027 or 2028, though current estimates suggest profitability may require closer to 100,000 vehicles [26].
9. **Market Positioning**: The company acknowledges competition from other players in the autonomous driving space, particularly around regulatory approvals and operational capability, and aims to keep its edge by acquiring commercial licenses faster than rivals [27][28].

Other Important Content
- The discussion highlights the ongoing evolution of the autonomous driving landscape and the balance between technological advancement and operational scalability. The company is committed to addressing challenges in data acquisition, model training, and fleet management to strengthen its market position [22][23][30].