Self-Rewarded Training (SRT)

Is the father of LSTM's 22-year-old vision about to come true? AI "self-evolution" papers released en masse within a single week — is a new trend emerging?
机器之心 · 2025-06-02 05:22
Core Insights
- The article discusses the evolution of AI systems toward self-improvement, highlighting recent advances in self-learning models, particularly the "Darwin Gödel Machine" (DGM) and several related frameworks [1][4][6].

Group 1: Darwin Gödel Machine (DGM)
- DGM combines foundation models with open-ended algorithms to create and evaluate new AI agents, and is capable of reading and modifying its own Python code for self-improvement [4][6].
- DGM has demonstrated significant self-improvement, with performance rising from 20.0% to 50.0% on SWE-bench and from 14.2% to 30.7% on Polyglot, surpassing manually designed agents [10].
- The system alternates self-modification with downstream task evaluation, continuously generating and scoring new agents [10][8] (see the DGM loop sketch after this summary).

Group 2: Self-Rewarded Training (SRT)
- SRT is an online self-training reinforcement learning algorithm that lets large language models supervise and train themselves without external labels, improving performance through self-generated feedback [14][16] (see the self-reward sketch below).
- Initial experiments show that SRT can match the performance of standard reinforcement learning methods that rely on gold-standard answers, although it eventually suffers performance degradation [18][21].
- Strategies to mitigate reward hacking include early stopping, self-training on offline-generated labels, and curriculum learning to preserve model performance [22][24][26] (see the early-stopping sketch below).

Group 3: Multi-Modal Unsupervised Post-Training (MM-UPT)
- MM-UPT is a framework for the continuous self-improvement of multi-modal large models in fully unsupervised settings, validated across multiple benchmarks [30][32].
- The framework uses a voting mechanism over self-generated data to produce pseudo-labels, letting models strengthen their reasoning without external supervision [39][40] (see the pseudo-labeling sketch below).
- Experiments indicate that MM-UPT improves accuracy on the MathVista benchmark from 66.3% to 72.9%, outperforming previous unsupervised methods [39][40].

Group 4: UI-Genie Framework
- UI-Genie addresses two challenges for GUI agents: trajectory validation and the acquisition of high-quality training data [45][47].
- The framework includes a reward model that efficiently processes historical context and unifies action-level and task-level rewards, strengthening the agent's learning [45][50] (see the unified-reward sketch below).
- Experimental results show that UI-Genie reaches state-of-the-art performance across multiple GUI agent benchmarks after iterative self-improvement cycles [52].
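
The alternating self-modify/evaluate loop described for DGM can be pictured with a minimal sketch. The names `evaluate` and `propose_patch` are hypothetical stand-ins (the real system rewrites its own Python source and scores agents on coding benchmarks); the archive-based parent sampling reflects the open-ended-search idea rather than the paper's exact selection rule.

```python
import random

def dgm_loop(initial_agent, evaluate, propose_patch, n_iterations=100):
    """Sketch of DGM's alternating self-modification / evaluation loop.

    `evaluate` scores an agent on a downstream benchmark (e.g. SWE-bench);
    `propose_patch` asks a foundation model to rewrite the agent's code.
    Both are illustrative placeholders, not the paper's actual API.
    """
    archive = [(initial_agent, evaluate(initial_agent))]
    for _ in range(n_iterations):
        # Sample a parent from the archive: open-ended search keeps
        # lower-scoring lineages alive instead of pure hill climbing.
        parent, parent_score = random.choice(archive)
        child = propose_patch(parent)          # self-modification step
        child_score = evaluate(child)          # downstream task evaluation
        archive.append((child, child_score))   # every variant is archived
    return max(archive, key=lambda pair: pair[1])
```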
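For SRT's "self-generated feedback," a common label-free reading is a self-consistency reward: each sampled answer is rewarded for agreeing with the batch majority. The sketch below assumes that majority-vote formulation; it is illustrative, not the paper's verbatim algorithm.

```python
from collections import Counter

def srt_self_rewards(sampled_answers):
    """Majority-vote self-reward: an answer earns 1.0 if it matches the
    most common answer among the model's own samples, else 0.0.
    No gold labels are consulted anywhere."""
    majority_answer, _ = Counter(sampled_answers).most_common(1)[0]
    return [1.0 if a == majority_answer else 0.0 for a in sampled_answers]

# Usage: these rewards replace gold-answer rewards in a standard RL
# objective (e.g. PPO/GRPO) during online self-training.
print(srt_self_rewards(["42", "42", "41", "42"]))  # [1.0, 1.0, 0.0, 1.0]
```

Reward hacking arises exactly because the majority can converge on a confidently wrong answer, which motivates the mitigations listed above.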
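Of the listed mitigations, early stopping is the simplest to sketch: self-train without labels, but monitor a small held-out labeled probe and halt before collapse. All names below (`self_train_step`, `heldout_eval`, the PyTorch-style `state_dict` calls) are assumptions for illustration.

```python
def train_with_early_stopping(model, self_train_step, heldout_eval,
                              max_steps=1000, patience=3):
    """Sketch of the early-stopping mitigation for SRT-style training:
    stop once a small held-out metric stops improving, before reward
    hacking degrades the model. Names are illustrative placeholders."""
    best_score, best_state, stale = float("-inf"), None, 0
    for step in range(max_steps):
        self_train_step(model)        # one self-rewarded update, no labels
        score = heldout_eval(model)   # tiny labeled probe set
        if score > best_score:
            best_score, best_state, stale = score, model.state_dict(), 0
        else:
            stale += 1
            if stale >= patience:     # plateau/collapse detected: stop
                break
    model.load_state_dict(best_state)  # roll back to the best checkpoint
    return model
```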
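MM-UPT's voting mechanism can be sketched similarly, here assuming the pseudo-label feeds GRPO-style group-normalized advantages for a multimodal (image, question) prompt. This is a sketch under that assumption, not the framework's exact computation.

```python
from collections import Counter

def mm_upt_pseudo_advantages(answers):
    """Sketch of MM-UPT-style pseudo-labeling: majority vote over the
    model's own sampled answers yields a pseudo-label; per-sample
    rewards are then normalized into GRPO-style group advantages."""
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]

# Samples agreeing with the majority get positive advantage, the rest
# negative — the update signal needs no external supervision.
print(mm_upt_pseudo_advantages(["A", "A", "B", "A"]))
```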
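Finally, one way to read UI-Genie's "unified action-level and task-level rewards" is as a weighted combination of per-step scores (conditioned on the history so far) and a whole-trajectory score. Everything in this sketch — `action_rm`, `task_rm`, `alpha`, the `step.action` field — is a hypothetical stand-in, not UI-Genie's actual interface.

```python
def ui_genie_reward(trajectory, action_rm, task_rm, alpha=0.5):
    """Sketch of a unified GUI-agent reward: blend action-level scores
    (each step judged in its historical context) with a single
    task-level score for the whole trajectory."""
    # Action level: score each (history, action) step.
    action_scores = [
        action_rm(history=trajectory[:i], action=step.action)
        for i, step in enumerate(trajectory)
    ]
    action_reward = sum(action_scores) / len(action_scores)
    # Task level: did the full trajectory accomplish the task?
    task_reward = task_rm(trajectory)
    return alpha * action_reward + (1 - alpha) * task_reward
```

A blended signal like this lets the reward model both validate whole trajectories and provide dense per-action feedback during iterative self-improvement.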