Reinforcement Learning

Search documents
市场铁律被 AI 攻破,NBER研究揭示:交易算法竟能完美合谋,自动组建卡特尔
3 6 Ke· 2025-08-05 08:54
Core Insights - A study by the National Bureau of Economic Research (NBER) indicates that AI-driven trading algorithms can independently develop cartel-like behaviors in financial markets [1][4][21] - The research reveals that these AI programs operate autonomously without any communication or pre-set collaborative protocols, leading to collusion through self-evolution via machine learning [4][21] Group 1: Silent Cartels - The research was led by a team from the Wharton School of the University of Pennsylvania and the Hong Kong University of Science and Technology, utilizing a standard financial market model for simulations [5] - The simulation included AI-driven speculators, passive market participants, and a market maker, with AI algorithms using reinforcement learning to make trading decisions [5][13] - Results showed that AI programs developed two distinct collaborative strategies based on market conditions, ultimately leading to excess profits for the algorithms at the expense of other market participants [5][13] Group 2: Dual Faces of Collusion - The first strategy emerged in stable market conditions with low price volatility, where AI algorithms signaled each other through price movements, effectively punishing aggressive traders [8][10] - The second strategy appeared in volatile markets, where AI programs learned to avoid aggressive trading after negative experiences, leading to a collective cautious approach termed "artificial stupidity" [11][12] - Both mechanisms allowed AI traders to achieve excess returns unattainable in fully competitive markets [13] Group 3: Regulatory Challenges - The collaborative capabilities of AI lead to decreased market efficiency, with prices failing to reflect true asset values and increased pricing errors [14][15] - Current antitrust laws focus on explicit collusion, making it difficult to address AI-driven coordination that occurs without direct communication [16][18] - The study warns that as AI plays a larger role in financial markets, this "silent collusion" may become more prevalent, necessitating new regulatory frameworks to monitor and understand algorithmic behavior [21][22][23]
OpenAI’s GPT-5 Shines in Coding Tasks — The Information
2025-08-05 03:19
Summary of Key Points from the Conference Call Industry: Artificial Intelligence (AI) Core Insights and Arguments - **Introduction of GPT-5**: OpenAI's upcoming model, GPT-5, is generating positive early feedback, particularly in coding tasks, which is a critical area for the company [3][4][5] - **Performance Improvements**: GPT-5 shows enhanced capabilities in various domains, especially in software engineering, outperforming previous models and rival Anthropic's Claude Sonnet 4 in specific tests [7][10] - **Integration of Models**: The model aims to combine traditional large language models (LLMs) with reasoning models, allowing users to control the reasoning capabilities based on task complexity [5][6] - **Practical Applications**: GPT-5 is better equipped to handle real-world programming challenges, such as modifying complex legacy code, which has been a historical weakness for OpenAI's models [8][9] - **Market Implications**: The success of GPT-5 could significantly impact OpenAI's business and its competitors, as coding assistants powered by Anthropic's models are projected to generate substantial revenue for Anthropic [10][12] Additional Important Content - **Caveats on Model Understanding**: There is uncertainty regarding the exact nature of GPT-5, with speculation that it may function as a router directing queries rather than a single, unified model [13] - **Future Improvements**: Experts suggest that future advancements may stem more from post-training reinforcement learning rather than scaling up pretraining processes [15][17] - **Investor Sentiment**: OpenAI executives are optimistic about the potential for future models, claiming they can reach "GPT-8" using current model structures [17] Implications for Stakeholders - **Impact on Suppliers and Investors**: Strong performance of GPT-5 is seen as beneficial for OpenAI's chip supplier Nvidia and data center firms, as well as for equity and debt investors concerned about AI development trajectories [12]
Inside OpenAI’s Rocky Path to GPT-5 — The Information
2025-08-05 03:19
Summary of OpenAI's Path to GPT-5 Industry Overview - The document discusses the challenges faced by OpenAI in developing its next flagship AI model, GPT-5, highlighting broader trends in the AI industry regarding performance improvements and technical difficulties [2][6][10]. Key Points and Arguments - **Performance Expectations**: GPT-5 is expected to show improvements over previous models, but these improvements will not match the significant leaps seen between earlier versions like GPT-3 and GPT-4 [6][10]. - **Technical Challenges**: OpenAI has encountered various technical problems that have hindered the development of models like o3, which was intended to enhance performance but ultimately did not meet expectations [6][10][34]. - **Incremental Gains**: Despite the challenges, the current models are generating substantial commercial value, which could increase customer demand even for incremental improvements [11]. - **Investment Needs**: OpenAI plans to spend $45 billion over the next three and a half years to support its development and operational needs, which may attract new investors [11]. - **Microsoft Partnership**: OpenAI has a close financial relationship with Microsoft, which holds a significant equity stake in OpenAI's for-profit arm. Negotiations between the two companies are ongoing, with Microsoft likely to secure a 33% stake [18][22]. - **Competition**: OpenAI faces stiff competition from well-capitalized rivals like Google, xAI, and Anthropic, which raises concerns about its ability to maintain a leading position in AI advancements [22]. Additional Important Content - **Model Development Issues**: The internal development of a model named Orion, which was supposed to be GPT-5, failed to produce the expected results, leading to its release as GPT-4.5 instead [23][24]. - **Resource Allocation**: OpenAI has improved its models by utilizing more Nvidia chip servers, enhancing processing power for complex tasks [30]. - **Reinforcement Learning**: The company has focused on reinforcement learning techniques to improve AI capabilities, which have been recognized as essential for achieving artificial general intelligence (AGI) [44]. - **Staff Changes**: OpenAI has experienced staff reorganizations and departures, including researchers moving to competitors like Meta, which has affected morale and productivity [19][20]. - **Communication Challenges**: The transition from advanced reasoning models to chat-based versions has led to performance degradation, indicating a need for better training in human communication [35][38]. Conclusion - OpenAI is on a complex journey toward releasing GPT-5, facing both internal and external challenges. While the model is expected to bring improvements, the company must navigate technical difficulties, competitive pressures, and investment needs to sustain its growth and innovation in the AI sector [6][10][11][22].
Supercharging Startups with AI Agents | Mohit Ambani | TEDxSGGSCC Studio
TEDx Talks· 2025-08-01 15:16
AI Fundamentals - Generative AI works by probabilistically filling in the blanks based on pre-trained data, essentially acting as an advanced autocomplete [5][6] - Pre-training involves feeding massive amounts of unstructured data into large language models (LLMs), requiring significant energy and resources for processing and refinement [7][8][9] - Reinforcement learning and reasoning enhance AI accuracy by implementing strategic action and assigning scores to generated results, reducing hallucinations [11][12] AI Applications in Business - AI agents can automate tasks across various tools and interfaces, acting as digital employees capable of understanding unstructured data and executing actions [13][14] - AI tools can significantly scale business operations, as demonstrated by a cosmetics brand using an AI agent to streamline influencer marketing, reducing the required team size and time [21][22] - AI agents are being used in sales to personalize outreach and automate follow-ups, leading to increased order rates and reduced campaign costs [24] - AI is being applied in operations to automate pricing and quotation processes, monitor safety incidents, and improve response times [25][26] - AI is aiding in financial analysis by enabling rapid screening of stocks based on specific criteria, leveraging open-source tools to retrieve data from millions of PDF files [28] AI's Impact and Future - AI is evolving beyond replacing existing processes to facilitating new inventions, such as a novel use case for magnetic ink in supply chain management [30][31][32][33] - The industry is rapidly advancing towards artificial generalized intelligence (AGI) and artificial super intelligence (ASI), with continuous improvements in AI models and capabilities [34] - The fundamental question is raised about the role of humans in a world where many jobs can be automated, emphasizing the importance of curiosity and relentless questioning [34][35]
China Went HARD...
Matthew Berman· 2025-07-24 00:30
Model Performance & Capabilities - Quen 3 coder rivals Anthropic's Claude family in coding performance, achieving 69.6% on SWEBench verified compared to Claude Sonnet 4's 70.4% [1] - The most powerful variant, Quen 3 coder 480B, features 480 billion parameters with 35 billion active parameters as a mixture of experts model [2][3] - The model supports a native context length of 256k tokens and up to 1 million tokens with extrapolation methods, enhancing its capabilities for tool calling and agentic uses [4] Training Data & Methodology - The model was pre-trained on 7.5 trillion tokens with a 70% code ratio, improving coding abilities while maintaining general and math skills [5] - Quen 2.5 coder was leveraged to clean and rewrite noisy data, significantly improving overall data quality [6] - Code RL training was scaled on a broader set of real-world coding tasks, focusing on diverse coding tasks to unlock the full potential of reinforcement learning [7][8] Tooling & Infrastructure - Quen launched Quen code, a command line tool adapted from Gemini code, enabling agentic and multi-turn execution with planning [2][5][9] - A scalable system was built to run 20,000 independent environments in parallel, leveraging Alibaba cloud's infrastructure for self-play [10] Open Source & Accessibility - The model is hosted on HuggingFace, making it free to use and try out [11]
ChatGPT Agent 团队专访:基模公司做通用 Agent,和 Manus 有什么不一样?
Founder Park· 2025-07-23 13:23
Core Insights - The article discusses the introduction of ChatGPT Agent by OpenAI, which combines deep research and operator capabilities to create a versatile agent capable of performing complex tasks without losing control over extended periods [1][6][13]. Group 1: ChatGPT Agent Overview - ChatGPT Agent is described as the first fully "embodied" agent on a computer, allowing seamless transitions between visual browsing, text analysis, and code execution [1][7]. - The agent can perform complex tasks for up to one hour without losing control, showcasing its advanced capabilities [13][19]. Group 2: Training Methodology - The training of ChatGPT Agent involved reinforcement learning (RL) where the model was given a variety of tools and allowed to discover optimal strategies independently [2][10]. - The agent utilizes a combination of a text browser and a graphical interface, enhancing its efficiency and flexibility in task execution [6][8]. Group 3: Functionality and Use Cases - ChatGPT Agent can handle various tasks, including deep research, online shopping, and creating presentations, making it suitable for both consumer and business applications [13][15]. - Users have reported practical applications such as data extraction from Google Docs and generating financial models, indicating its versatility [16][17]. Group 4: Future Developments - The team envisions continuous improvements in the agent's accuracy and capabilities, aiming to expand its functionality across a wide range of tasks [23][33]. - There is an emphasis on enhancing user interaction and exploring new paradigms for collaboration between users and the agent [34][36]. Group 5: Safety and Risk Management - The article highlights the increased risks associated with the agent's ability to interact with the real world, necessitating robust safety measures and ongoing monitoring [35][36]. - The development team is focused on creating a comprehensive safety framework to mitigate potential harmful actions by the agent [37][39].
OpenAI Just Released ChatGPT Agent, Its Most Powerful Agent Yet
Sequoia Capital· 2025-07-22 09:00
Agent Capabilities & Architecture - OpenAI has created a new agent in ChatGPT that can perform tasks that would take humans a long time, by giving the agent access to a virtual computer [6] - The agent has access to a text browser (similar to deep research tool), a virtual browser (similar to operator tool with full GUI access), and a terminal for running code and calling APIs [6][7][8] - All tools have shared state, allowing for flexible and complex tasks [9] - The agent is trained using reinforcement learning across thousands of virtual machines, allowing it to discover optimal strategies for tool usage [3] Development & Training - The agent is a collaboration between the Deep Research and Operator teams, combining the strengths of both [6] - The agent is trained with reinforcement learning, rewarding efficient and correct task completion [36] - The model figures out when to use which tool through experimentation, without explicit instructions [38] - Reinforcement learning is data-efficient, allowing new capabilities to be taught with smaller, high-quality datasets [75][76] Safety & Limitations - Safety training and mitigations were a core part of the development process due to the agent's ability to take actions with external side effects [44] - The team has implemented a monitor that watches for suspicious activity, similar to antivirus software [48] - Date picking remains a difficult task for the AI system [4][83][84] Future Directions - Future development will focus on improving the accuracy and performance across a wide distribution of tasks [62][85] - The team is exploring different ways of interacting with the agent, beyond the current chat-based interface [68][86] - Personalization and memory for agents will be important for future development, allowing agents to do things without being explicitly asked [67][68]
自动驾驶论文速递 | 世界模型、端到端、VLM/VLA、强化学习等~
自动驾驶之心· 2025-07-21 04:14
Core Insights - The article discusses advancements in autonomous driving technology, particularly focusing on the Orbis model developed by Freiburg University, which significantly improves long-horizon prediction in driving world models [1][2]. Group 1: Orbis Model Contributions - The Orbis model addresses shortcomings in contemporary driving world models regarding long-horizon generation, particularly in complex maneuvers like turns, and introduces a trajectory distribution-based evaluation metric to quantify these issues [2]. - It employs a hybrid discrete-continuous tokenizer that allows for fair comparisons between discrete and continuous prediction methods, demonstrating that continuous modeling (based on flow matching) outperforms discrete modeling (based on masked generation) in long-horizon predictions [2]. - The model achieves state-of-the-art (SOTA) performance with only 469 million parameters and 280 hours of monocular video data, excelling in complex driving scenarios such as turns and urban traffic [2]. Group 2: Experimental Results - The Orbis model achieved a Fréchet Video Distance (FVD) of 132.25 on the nuPlan dataset for 6-second rollouts, significantly lower than other models like Cosmos (291.80) and Vista (323.37), indicating superior performance in trajectory prediction [6][7]. - In turn scenarios, Orbis also outperformed other models, achieving a FVD of 231.88 compared to 316.99 for Cosmos and 413.61 for Vista, showcasing its effectiveness in challenging driving conditions [6][7]. Group 3: LaViPlan Framework - The LaViPlan framework, developed by ETRI, utilizes reinforcement learning with verifiable rewards to address the misalignment between visual, language, and action components in autonomous driving, achieving a 19.91% reduction in Average Displacement Error (ADE) for easy scenarios and 14.67% for hard scenarios on the ROADWork dataset [12][14]. - It emphasizes the transition from linguistic fidelity to functional accuracy in trajectory outputs, revealing a trade-off between semantic similarity and task-specific reasoning [14]. Group 4: World Model-Based Scene Generation - The University of Macau introduced a world model-driven scene generation framework that enhances dynamic graph convolution networks, achieving an 83.2% Average Precision (AP) and a 3.99 seconds mean Time to Anticipate (mTTA) on the DAD dataset, marking significant improvements [23][24]. - This framework combines scene generation with adaptive temporal reasoning to create high-resolution driving scenarios, addressing data scarcity and modeling limitations [24]. Group 5: ReAL-AD Framework - The ReAL-AD framework proposed by Shanghai University of Science and Technology and the Chinese University of Hong Kong integrates a three-layer human cognitive decision-making model into end-to-end autonomous driving, improving planning accuracy by 33% and reducing collision rates by 32% [33][34]. - It features three core modules that enhance situational awareness and structured reasoning, leading to significant improvements in trajectory planning accuracy and safety [34].
突破高分辨率图像推理瓶颈,复旦联合南洋理工提出基于视觉Grounding的多轮强化学习框架MGPO
机器之心· 2025-07-21 04:04
Core Insights - The article discusses the development of a multi-turn reinforcement learning method called MGPO, which enhances the visual reasoning capabilities of large multi-modal models (LMMs) when processing high-resolution images [1][8][21] - MGPO allows LMMs to automatically predict key area coordinates and crop sub-images based on questions, improving the model's ability to focus on relevant information without requiring expensive grounding annotations [2][21] Summary by Sections Introduction - Current LMMs, such as Qwen2.5-VL, face challenges in processing high-resolution images due to the conversion of images into a large number of visual tokens, many of which are irrelevant to the task [5][6] - The human visual system employs a task-driven visual search strategy, which MGPO aims to replicate by enabling LMMs to focus on key areas of images [6][7] Method Overview - MGPO simulates a multi-step visual reasoning process where the model first predicts key area coordinates and then crops sub-images for further reasoning [10][21] - The method overcomes the limitations of traditional visual grounding models that require extensive grounding annotations for training [7][21] Key Innovations of MGPO - A top-down, interpretable visual reasoning mechanism that allows LMMs to conduct problem-driven visual searches [2] - The ability to accurately identify relevant area coordinates from high-resolution images, even when visual tokens are limited [2] - The model can be trained on standard Visual Question Answering (VQA) datasets without additional grounding annotations, relying solely on answer correctness for feedback [2][21] Experimental Results - MGPO demonstrated significant performance improvements over other methods like SFT and GRPO, achieving increases of 5.4% and 5.2% in benchmark tests [18][19] - The model outperformed OpenAI's models despite being trained on a smaller dataset, showcasing its effectiveness [18][19] - The proportion of effective grounding coordinates generated by MGPO increased significantly during training, indicating its ability to develop robust visual grounding capabilities autonomously [20] Conclusion - MGPO effectively addresses issues of visual token redundancy and key information loss in high-resolution image processing [21] - The method proves that reinforcement learning can foster robust grounding capabilities without the need for costly annotations, enhancing the efficiency of LMMs [21]
A Taxonomy for Next-gen Reasoning — Nathan Lambert, Allen Institute (AI2) & Interconnects.ai
AI Engineer· 2025-07-19 21:15
Model Reasoning and Applications - Reasoning unlocks new language model applications, exemplified by improved information retrieval [1] - Reasoning models are enhancing applications like website analysis and code assistance, making them more steerable and user-friendly [1] - Reasoning models are pushing the limits of task completion, requiring ongoing effort to determine what models need to continue progress [1] Planning and Training - Planning is a new frontier for language models, requiring a shift in training approaches beyond just reasoning skills [1][2] - The industry needs to develop research plans to train reasoning models that can work autonomously and have meaningful planning capabilities [1] - Calibration is crucial for products, as models tend to overthink, requiring better management of output tokens relative to problem difficulty [1] - Strategy and abstraction are key subsets of planning, enabling models to choose how to break down problems and utilize tools effectively [1] Reinforcement Learning and Compute - Reinforcement learning with verifiable rewards is a core technique, where language models generate completions and receive feedback to update weights [2] - Parallel compute enhances model robustness and exploration, but doesn't solve every problem, indicating a need for balanced approaches [3] - The industry is moving towards considering post-training as a significant portion of compute, potentially reaching parity with pre-training in GPU hours [3]