Online Reinforcement Learning
Real-Robot RL Goes Wild: Robots Self-Learn to a Perfect Score in 20 Minutes, with the Digital Twin Crowned
36Kr· 2026-02-13 07:32
Core Insights
- TwinRL introduces a digital twin-driven reinforcement learning framework that enhances the exploration capabilities of robots in real-world tasks, achieving a 100% success rate in various operations within approximately 20 minutes, while reducing human intervention by over 50% [1][22][36].

Group 1: Technology and Framework
- TwinRL is not a simulator but an exploration amplifier and guide, designed to expand the exploration space for robots beyond the limitations of traditional methods [16][15].
- The framework consists of three main components: exploration space expansion, parallel online reinforcement learning in the digital twin, and sim-to-real guided exploration [32][36].
- The exploration space expansion strategy utilizes high-fidelity digital twin environments to generate synthetic trajectories that exceed human demonstration coverage [25][32].

Group 2: Performance and Efficiency
- TwinRL demonstrates a significant improvement in exploration efficiency, achieving at least a 30% acceleration in convergence time compared to existing real-world reinforcement learning methods [22][39].
- In experiments, TwinRL maintained a near 100% success rate in both in-distribution and out-of-distribution areas, showcasing its robustness against environmental changes [39][46].
- The framework effectively bridges the gap between offline training and online learning, allowing for a smoother transition and reducing performance degradation during the learning process [39][34].

Group 3: Research Background and Observations
- The research highlights that the effective exploration space in real-world VLA reinforcement learning is heavily constrained by the distribution of supervised fine-tuning (SFT) data [27][30].
- The study reveals that traditional reinforcement learning methods struggle with exploration deadlock in out-of-distribution scenarios, emphasizing the need for a broader exploration strategy [30][31].
- TwinRL addresses these challenges by moving the exploration process to a controllable and expandable digital twin environment, allowing for more effective learning [15][36].
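The exploration-space expansion described above can be sketched as sampling synthetic start states anchored on human demonstrations but deliberately reaching beyond their coverage. This is a toy 1-D illustration under assumed names, not TwinRL's actual implementation:

```python
import random

def expand_exploration_space(demo_states, radius, n_samples, seed=0):
    """Generate synthetic start states around human demonstrations,
    deliberately reaching past their original coverage. A toy stand-in
    for twin-side exploration-space expansion; the 1-D state and all
    names are illustrative assumptions."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_samples):
        base = rng.choice(demo_states)         # anchor on a demonstrated state
        offset = rng.uniform(-radius, radius)  # perturb beyond demo coverage
        synthetic.append(base + offset)
    return synthetic

# hypothetical demo states along one axis; twin rollouts would then be
# run from the expanded set rather than only from demonstrated states
states = expand_exploration_space([0.0, 1.0, 2.0], radius=0.5, n_samples=100)
```

In the real system the perturbed states would seed parallel rollouts inside the digital twin, whose successful trajectories then guide real-robot exploration.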
New NAVSIM SOTA! Masked Diffusion, a New Framework for End-to-End Autonomous Driving
自动驾驶之心· 2025-12-26 03:32
Source | 机器之心. Original article: "New NAVSIM SOTA: Fudan and Yinwang Propose Masked Diffusion, a New End-to-End Autonomous Driving Framework." This piece is shared for academic purposes only. With the rise of VLA (Vision-Language-Action) models, end-to-end autonomous driving is undergoing a paradigm shift from "modular" to "unified." However, once perception, reasoning, and planning are compressed into a single model, the mainstream auto-regressive generation paradigm begins to show its limitations. Existing autoregressive models are forced to follow a "left-to-right" temporal generation order, which differs fundamentally from the intuition of human drivers: when handling complex road conditions, experienced drivers tend to plan "end-to-begin," first establishing a long-term driving intention (such as merging onto a ramp, yielding to a pedestrian, or pulling over) and then working backward to the immediate short-term control actions. In addition, imitation-learning-based models easily fall into the "average driver" trap, tending to fit the mean of the data distribution, which flattens the policy and makes it hard to switch flexibly between assertive maneuvering and conservative avoidance. To address these pain points, Fudan University and Yinwang Intelligence jointly proposed the WAM-Diff framework. The work innovatively ...
New NAVSIM SOTA: Fudan Proposes a New End-to-End Autonomous Driving Framework
具身智能之心· 2025-12-26 00:55
Core Insights
- The article discusses the transition in end-to-end autonomous driving from a modular approach to a unified paradigm with the rise of Vision-Language-Action (VLA) models, highlighting the limitations of existing autoregressive models in mimicking human driving intuition [1][2].

Group 1: WAM-Diff Framework
- The WAM-Diff framework, developed by Fudan University and Yinwang Intelligence, introduces a Discrete Masked Diffusion model for VLA autonomous driving planning, integrating a sparse mixture-of-experts (MoE) architecture and online reinforcement learning (GSPO) [2][4].
- WAM-Diff achieved state-of-the-art (SOTA) performance on the NAVSIM benchmark, scoring 91.0 PDMS and 89.7 EPDMS, demonstrating the potential of non-autoregressive generation in complex driving scenarios [2][16][18].

Group 2: Technical Innovations
- WAM-Diff employs Hybrid Discrete Action Tokenization to convert continuous 2D trajectory coordinates into high-precision discrete tokens, allowing for a shared vocabulary with driving commands [5].
- The framework utilizes Masked Diffusion for generation, enabling parallel prediction of all token positions, which enhances inference efficiency and allows for global optimization [5][9].

Group 3: Decoding Strategies
- WAM-Diff explores three decoding strategies: causal, reverse-causal, and random, finding that the reverse-causal strategy yields the best performance in closed-loop metrics, aligning with the "end-to-begin" planning intuition [9][20].
- This approach confirms that establishing long-term driving intentions before detailing immediate actions significantly improves planning consistency and safety [9][20].

Group 4: MoE and GSPO Integration
- The MoE architecture within WAM-Diff includes 64 lightweight experts, dynamically activated based on the driving context, enhancing model capacity and adaptability while controlling computational costs [12].
- The GSPO algorithm bridges the gap between open-loop training and closed-loop execution, optimizing trajectory sequences based on safety, compliance, and comfort metrics [12][14].

Group 5: Experimental Results
- In extensive experiments on the NAVSIM benchmark, WAM-Diff outperformed several leading models, achieving a PDMS score of 91.0 and an EPDMS score of 89.7, indicating its robustness in balancing safety and compliance [16][18].
- The model's performance in NAVSIM-v2, which includes stricter metrics for traffic rule adherence and comfort, improved by 5.2 points over the previous best, showcasing its capability in real-world driving scenarios [18].

Group 6: Conclusion
- WAM-Diff represents a significant advancement in autonomous driving planning, moving towards a discrete, structured, and closed-loop approach, emphasizing the importance of both "how to generate" and "what to generate" in the VLA era [25].
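The reverse-causal decoding strategy reported above can be illustrated with a toy unmasking loop: commit the farthest-horizon token first, then fill in earlier positions, mirroring "end-to-begin" planning. `predict_fn` is an illustrative stand-in for the masked-diffusion model, not WAM-Diff's actual decoder:

```python
MASK = -1  # sentinel for a still-masked token position

def reverse_causal_decode(predict_fn, length):
    """Toy unmasking loop in reverse-causal order: the model proposes
    values for every position in parallel, but we commit the latest
    (farthest-future) position first and work backward to the
    immediate action. predict_fn and the integer tokens are
    illustrative assumptions."""
    tokens = [MASK] * length
    for pos in reversed(range(length)):  # far future -> immediate action
        proposal = predict_fn(tokens)    # parallel prediction over all slots
        tokens[pos] = proposal[pos]      # commit only the chosen position
    return tokens

# usage with a trivial stand-in predictor that ignores its input
plan = reverse_causal_decode(lambda toks: [10, 11, 12, 13], length=4)  # -> [10, 11, 12, 13]
```

With a real model, `predict_fn` would condition on the already-committed future tokens, which is exactly where the causal, reverse-causal, and random orders diverge.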
New NAVSIM SOTA: Fudan and Yinwang Propose Masked Diffusion, a New End-to-End Autonomous Driving Framework
机器之心· 2025-12-25 03:12
Core Insights
- The article discusses the transition in end-to-end autonomous driving from a "modular" approach to a "unified" paradigm with the rise of Vision-Language-Action (VLA) models, highlighting the limitations of existing autoregressive generation paradigms [2].
- It introduces the WAM-Diff framework, which innovatively incorporates discrete masked diffusion models into VLA autonomous driving planning, addressing the challenges of single-direction temporal generation [2][6].

Group 1: WAM-Diff Framework
- WAM-Diff utilizes Hybrid Discrete Action Tokenization to convert continuous 2D trajectory coordinates into high-precision discrete tokens, keeping quantization error within 0.005 [6].
- The framework employs Masked Diffusion as its backbone, allowing for parallel prediction of all token positions, significantly enhancing inference efficiency and enabling global optimization [6].
- WAM-Diff explores decoding strategies, revealing that the reverse-causal strategy outperforms others in closed-loop metrics, validating the "end-to-begin" planning logic [9][20].

Group 2: Performance Metrics
- On the authoritative NAVSIM benchmark, WAM-Diff achieved state-of-the-art (SOTA) scores of 91.0 PDMS in NAVSIM-v1 and 89.7 EPDMS in NAVSIM-v2, demonstrating its potential in complex autonomous driving scenarios [3][18].
- The model surpassed competitors like DiffusionDrive and ReCogDrive, indicating its robustness in balancing safety and compliance in real-world driving conditions [18].

Group 3: Technical Innovations
- WAM-Diff integrates a Low-Rank Adaptation Mixture-of-Experts (LoRA-MoE) architecture, which includes 64 lightweight experts for dynamic routing and sparse activation, enhancing model capacity and adaptability [11].
- The Group Sequence Policy Optimization (GSPO) algorithm is introduced to bridge the gap between open-loop training and closed-loop execution, optimizing trajectory sequences based on safety, compliance, and comfort metrics [14].

Group 4: Conclusion
- The emergence of WAM-Diff marks a significant step towards discrete, structured, and closed-loop autonomous driving planning, emphasizing the importance of both "how to generate" and "what to generate" in the VLA era [25].
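A quantization error below 0.005 follows directly from a fine enough uniform grid: with bin width `step`, the round-trip error is at most `step / 2`. The bounds and bin count below are illustrative assumptions chosen to meet that figure, not WAM-Diff's actual tokenizer parameters:

```python
def tokenize_coord(x, lo=-64.0, hi=64.0, n_bins=25600):
    """Uniform quantizer sketch for discrete action tokenization:
    clamp a continuous trajectory coordinate to [lo, hi] and map it to
    a bin index. Bounds and bin count are illustrative (step = 0.005,
    so round-trip error <= 0.0025 < 0.005)."""
    x = min(max(x, lo), hi)
    step = (hi - lo) / n_bins
    return min(int((x - lo) / step), n_bins - 1)

def detokenize(token, lo=-64.0, hi=64.0, n_bins=25600):
    """Map a token back to the centre of its bin."""
    step = (hi - lo) / n_bins
    return lo + (token + 0.5) * step
```

Because the decoder returns the bin centre, the worst-case reconstruction error for in-range coordinates is half a bin width, which is what bounds the precision of the discrete token vocabulary.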
HUST & Xiaomi Jointly Propose MindDrive: the First VLA Framework to Validate the Effectiveness of Online Reinforcement Learning...
自动驾驶之心· 2025-12-17 00:03
Core Insights
- The article introduces MindDrive, a novel framework for autonomous driving that utilizes online reinforcement learning (RL) to enhance the performance of vision-language-action (VLA) models [2][4][44].
- MindDrive demonstrates significant improvements in driving scores and success rates compared to traditional end-to-end paradigms and state-of-the-art (SOTA) models, achieving a driving score (DS) of 78.04 and a success rate (SR) of 55.09% [9][38].

Background Review
- Autonomous driving relies on models that can perceive, decide, and execute actions in dynamic environments. Traditional frameworks often lack common sense and causal reasoning capabilities [4].
- Current VLA models primarily use imitation learning (IL), which can lead to causal confusion and distribution shifts, resulting in irreversible errors in closed-loop driving scenarios [4][5].

MindDrive Framework
- MindDrive consists of two main components: a decision expert and an action expert, both utilizing a shared vision encoder and text tokenizer, but differing in their low-rank adaptation (LoRA) parameters [11][18].
- The decision expert generates abstract driving decisions based on navigation commands and visual inputs, while the action expert translates these decisions into specific action trajectories [11][18].

Online Reinforcement Learning Approach
- MindDrive employs online RL to optimize the decision-making process by sampling different trajectories and receiving feedback from the environment, thus enhancing the model's understanding of causal relationships [22][30].
- The framework is designed to operate within a closed-loop simulation environment, specifically using the CARLA simulator, which allows for efficient data collection and training [8][24].

Experimental Results
- MindDrive outperforms traditional end-to-end methods and other VLA models, achieving a driving score that is 10.12 points higher than the best imitation learning model and 6.68 points higher than the best offline RL method [38][40].
- The model's performance in complex driving scenarios, such as overtaking and yielding, shows significant improvements, indicating enhanced causal reasoning and decision robustness [38][40].

Conclusion
- MindDrive represents a significant advancement in the application of online RL to autonomous driving, providing a framework that effectively maps language instructions to actions while optimizing exploration efficiency [44].
- The results suggest that MindDrive could inspire further developments in the autonomous driving sector, particularly in enhancing the capabilities of VLA models [44].
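The "shared backbone, per-expert LoRA" layout can be sketched minimally: one frozen shared weight plus a small expert-specific delta standing in for the LoRA parameters. Scalar weights and the expert names are illustrative assumptions, not MindDrive's actual architecture:

```python
class SharedBackboneWithExperts:
    """Toy sketch of a two-expert layout: a frozen shared backbone
    weight plus a per-expert delta standing in for LoRA parameters,
    so the decision and action experts reuse the same encoder.
    All values and names here are illustrative."""
    def __init__(self):
        self.w_shared = 1.0                    # frozen shared backbone
        self.lora_delta = {"decision": 0.10,   # decision-expert adapter
                           "action": -0.05}    # action-expert adapter

    def forward(self, x, expert):
        # effective weight = shared backbone + expert-specific LoRA delta
        return x * (self.w_shared + self.lora_delta[expert])

model = SharedBackboneWithExperts()
high_level = model.forward(2.0, "decision")   # 2.0 * 1.10 = 2.2
trajectory = model.forward(2.0, "action")     # 2.0 * 0.95 = 1.9
```

The design point is that only the tiny per-expert deltas differ, so two specialized behaviors share one set of backbone weights and one vision encoder.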
How Does Online RL Fine-Tune π0 and π0.5, and Why Can Performance Improve by More Than 50%?
具身智能之心· 2025-11-10 03:30
Core Viewpoint
- The article discusses the introduction of the πRL framework, which enhances flow-based vision-language-action (VLA) models through online reinforcement learning (RL) fine-tuning, significantly improving their performance and generalization capabilities [5][7].

Group 1: Introduction to VLA Models
- VLA models enable robots to understand and execute complex tasks through multimodal inputs, but large-scale RL applications face challenges due to the difficulty in handling action log-likelihood during the iterative denoising process [5].

Group 2: πRL Framework
- The πRL framework, developed by teams from Tsinghua University and Peking University, addresses the challenges of applying large-scale RL to flow-based VLA models by training them in parallel simulations [6].

Group 3: RL Algorithms in πRL
- πRL implements two RL algorithms:
  1. FlowNoise models the denoising process as a discrete-time Markov Decision Process (MDP) using a learnable noise network for precise log-likelihood calculations [7].
  2. Flow-SDE combines the denoising process with agent-environment interaction, constructing a dual-layer MDP that transitions from ODE to SDE for efficient RL exploration [7].

Group 4: Performance Evaluation
- In benchmark tests, πRL significantly improved the performance of few-shot SFT models π0 and π0.5 from 57.6% to 97.6% and from 77.1% to 98.3% on the LIBERO dataset, respectively [7].
- In the ManiSkill benchmark, πRL demonstrated scalable multi-task RL capabilities across 4,352 grasping and placing tasks using 320 parallel environments [7].

Group 5: Conclusion
- Overall, πRL shows substantial performance enhancements and stronger generalization compared to SFT models, validating the effectiveness of online RL in flow-based VLA models [7].
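The FlowNoise idea above (treating each denoising step as one stochastic step of a discrete-time MDP) makes the trajectory log-likelihood a simple sum of per-step Gaussian log-probs. A minimal sketch, assuming a fixed scalar `sigma` in place of the learnable noise network:

```python
import math

def denoising_log_likelihood(actions, means, sigma):
    """Sum of per-step Gaussian log-probs over a denoising trajectory,
    sketching how a discrete-time MDP view makes the action
    log-likelihood tractable. sigma is a fixed scalar standing in for
    the learnable noise network; in the real method it would be
    predicted per step."""
    logp = 0.0
    for a, mu in zip(actions, means):
        logp += (-0.5 * ((a - mu) / sigma) ** 2
                 - math.log(sigma) - 0.5 * math.log(2 * math.pi))
    return logp
```

Once the log-likelihood is tractable like this, standard policy-gradient machinery can be applied to the denoising chain, which is what blocks naive RL on deterministic (ODE-style) flow sampling.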
Six Pieces, Including How Figma Beat Adobe | 42章经 AI Newsletter
42章经· 2025-10-26 13:42
Group 1: Figma vs Adobe
- Figma's success is attributed to its focus on "collaboration" as a core feature, contrasting with Adobe's file-centric approach [2][3].
- Adobe's collaboration is based on file transfer, while Figma allows real-time editing on a shared canvas, enabling true synchronous collaboration [3].
- Existing giants like Adobe struggle to adapt due to their historical success paths and internal resistance to change [3].

Group 2: Online Reinforcement Learning
- Cursor's use of online reinforcement learning (RL) optimizes its code completion feature, Tab, by treating user interactions as feedback signals for real-time training [6][10].
- The model's suggestion volume has decreased by 21%, while the acceptance rate has increased by 28%, indicating improved performance [6].

Group 3: Plaud's Success
- Plaud's success is rooted in recognizing the value of context, viewing conversations as a form of intelligence and a significant data source [12][14].
- The company designs its hardware and software to effectively capture and analyze user context, positioning itself as a context collector rather than just a recording device [15].
- Plaud's approach emphasizes a "reverse thinking" strategy, focusing on how AI can serve users by prompting them for context rather than the other way around [16][18].

Group 4: Creating Delight in Products
- Delight in products is defined as a combination of joy and surprise, with three main strategies: exceeding expectations, anticipating needs, and removing friction [25][27].
- A systematic approach to creating delight involves redefining user categories based on motivations, transforming those motivations into opportunities, and ensuring that delight becomes an organizational capability [28][30].

Group 5: Evaluating AI Product Retention
- A16Z suggests that AI companies should measure retention starting from the third month (M3) to better understand their true user base, as early data may include many transient users [34][35].
- The new metric M12/M3 is proposed to assess long-term retention quality, indicating how many users remain after a year compared to the third month [36][39].

Group 6: Palantir's FDE Model
- The Forward Deployed Engineer (FDE) model involves engineers embedded at client sites to bridge the gap between product capabilities and client needs, focusing on product exploration [42][46].
- FDE teams consist of Echo (consulting analysts) and Delta (deployment engineers), each with distinct roles to ensure effective client engagement and product development [49][50].
- The FDE model is particularly relevant in the AI era, where high-value contracts justify deep client integration and where product-market fit is often unclear [53][54].
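The M12/M3 retention metric mentioned in the item above is plain arithmetic: take month 3 as the "true user" baseline (skipping transient early signups) and see what fraction of that base is still active at month 12. The cohort numbers below are made up for illustration:

```python
def m12_over_m3(active_users_by_month):
    """Long-term retention quality: month-12 survivors divided by the
    month-3 base. Keys are months since signup; a high M12/M3 means
    users who stuck around past the trial phase kept coming back."""
    return active_users_by_month[12] / active_users_by_month[3]

# hypothetical cohort: 5,000 signups, 1,000 still active at month 3,
# 640 still active at month 12
cohort = {1: 5000, 3: 1000, 12: 640}
retention_quality = m12_over_m3(cohort)  # 0.64
```

Measuring from M3 rather than M1 deliberately excludes the churned tourists, so the ratio reflects whether the product retains its real users over a year.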
AI Online Reinforcement Learning "Learns by Doing": Stanford Team Sends a Small 7B Model's Performance Soaring, Even Past GPT-4o
36Kr· 2025-10-24 12:45
Core Insights
- AgentFlow introduces a new paradigm for online reinforcement learning, enhancing the reasoning capabilities of agent systems through real-time optimization and collaboration among specialized agents [1][11][14].

Performance Metrics
- AgentFlow, based on the Qwen-2.5-7B-Instruct model, shows significant improvements across various benchmark tests: 14.9% in search tasks, 14.0% in agentic reasoning tasks, 14.5% in mathematical reasoning, and 4.1% in scientific reasoning [4][19][21].
- The performance of AgentFlow surpasses that of larger models, including GPT-4o and Llama3.1-405B, demonstrating that effective system design can outperform sheer model size [21][25].

System Architecture
- The architecture of AgentFlow consists of four specialized agents: a planner for task analysis and tool selection, an executor for tool invocation, a verifier for evaluating intermediate results, and a generator for synthesizing final outputs [11][13][14].
- The system employs a shared memory design that facilitates collaboration and reduces error propagation in multi-step reasoning processes [7][14].

Learning Mechanism
- The on-policy optimization of the planner within the agent interaction flow is crucial for adapting to environmental changes and feedback, leading to a robust and self-evolving reasoning process [13][14][22].
- The Flow-GRPO algorithm addresses the challenges of multi-turn credit assignment in reinforcement learning, enhancing training efficiency and stability in complex reasoning tasks [15][19].

Research Findings
- The study reveals that online learning in real interaction environments is essential for achieving efficient reasoning, as opposed to offline supervised learning, which can lead to performance declines [22][25].
- AgentFlow's training allows the system to autonomously discover new tool combinations and usage patterns, enhancing its problem-solving capabilities [25][29].

Future Implications
- AgentFlow represents a shift from seeking a single comprehensive model to enabling agents to adapt and learn continuously within a system, highlighting the potential of collaborative intelligence in addressing complex tasks [29].
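The four-agent loop described above can be sketched as a shared-memory control flow. The callables are illustrative stubs, not AgentFlow's actual interfaces; in the real system only the planner is trained (via Flow-GRPO) while operating inside this flow:

```python
def agentflow_step(task, planner, executor, verifier, generator, max_turns=5):
    """Toy four-role loop: planner picks a tool/subgoal, executor
    invokes it, verifier judges the intermediate result, generator
    synthesizes the final answer. All roles communicate only through
    the shared memory list. Every callable here is an illustrative
    stub."""
    memory = [("task", task)]
    for _ in range(max_turns):
        plan = planner(memory)        # analyse task, pick a tool/subgoal
        result = executor(plan)       # invoke the chosen tool
        memory.append((plan, result))
        if verifier(memory):          # good enough to stop iterating?
            break
    return generator(memory)          # synthesize the final answer

# stub agents: the verifier accepts once any tool result exists and the
# generator returns the last tool result
answer = agentflow_step(
    "2+2",
    planner=lambda mem: "calculator",
    executor=lambda plan: 4,
    verifier=lambda mem: len(mem) > 1,
    generator=lambda mem: mem[-1][1],
)  # -> 4
```

The shared memory is what lets the verifier catch a bad intermediate result before it propagates, which is the error-containment property the summary attributes to the design.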
AI Online Reinforcement Learning "Learns by Doing": Stanford Team Sends a Small 7B Model's Performance Soaring, Even Past GPT-4o
量子位· 2025-10-24 03:53
Core Insights
- The article discusses the introduction of AgentFlow, a new paradigm in online reinforcement learning that enhances the reasoning capabilities of intelligent systems, outperforming models like GPT-4o and Llama3.1-405B [1][4][23].

Group 1: AgentFlow Overview
- AgentFlow consists of a team of specialized agents including a planner, executor, verifier, and generator, which collaborate through shared memory to optimize decision-making in real-time [1][14][18].
- The Flow-GRPO method allows for on-policy optimization of the planner agent, enabling adaptive decision-making based on environmental changes and feedback from other agents [19][16].

Group 2: Performance Metrics
- AgentFlow, based on the Qwen-2.5-7B-Instruct model, shows significant improvements across various benchmark tests: 14.9% in search tasks, 14.0% in agentic reasoning, 14.5% in math reasoning, and 4.1% in scientific reasoning [3][25][27].
- The model's performance surpasses that of larger models, demonstrating that effective system design and training methods can be more impactful than simply increasing model size [27].

Group 3: Learning Mechanisms
- The article emphasizes the importance of "learning in the flow," indicating that online learning in real interactive environments is crucial for achieving efficient reasoning [28][29].
- AgentFlow's architecture allows for rapid error correction and improved task planning through real-time training, enhancing overall system performance [30][29].

Group 4: Innovations and Findings
- The system autonomously discovers new solution paths, such as combining different search tools to enhance information retrieval, showcasing its ability to adapt and innovate [33].
- AgentFlow maintains performance improvements without significantly increasing the average reasoning steps, indicating efficient handling of complex tasks [35].

Group 5: Future Implications
- The article concludes that AgentFlow presents a novel approach to intelligent agent training, advocating for systems that adapt and learn continuously rather than relying on a single comprehensive model [37][38].
- Despite the distance from research to practical application, the potential for Agentic AI remains significant, suggesting a promising future for intelligent systems [39].
A New Paradigm for GUI Agent Training! Semi-Online Reinforcement Learning Lets a 7B Model Rival GPT-4o
量子位· 2025-09-23 11:01
Core Viewpoint
- The article discusses the introduction of a new training paradigm called Semi-online Reinforcement Learning (Semi-online RL) by Zhejiang University and Tongyi Laboratory's Mobile-Agent team, which enhances the performance of models in dynamic multi-turn tasks without relying on real environment interactions [1][2][4].

Group 1: Methodology
- The Semi-online RL framework combines the stability of offline training with the long-term optimization capabilities of online learning, significantly improving model performance in dynamic tasks [2][10].
- The framework utilizes offline data to simulate online interactions, allowing the model to experience contextual changes from its own actions during training [12][15].
- A patching mechanism is introduced to adaptively correct sampling biases when the model deviates from expert trajectories, enhancing the learning process [17][19].

Group 2: Key Technologies
- The Semi-online RL framework consists of three core technologies:
  1. A semi-online mechanism that simulates online interactions using offline data [12].
  2. A Patching Module that adaptively repairs sampling biases [17].
  3. Long-term reward modeling that estimates advantages from step level to trajectory level [20].

Group 3: Evaluation and Results
- A new evaluation metric, SOP (Semi-online Performance), is proposed to better reflect the model's performance in multi-turn tasks, aligning closely with real online performance [22][23].
- Experimental results show that the UI-S1-7B model outperforms baseline models, achieving a task success rate of 34.0% on the AndroidWorld task, closely approaching the performance of top proprietary models [25][26].
- The model maintains a +7.1% gain in single-turn tasks, indicating that semi-online training does not sacrifice local accuracy while optimizing for long-term performance [28].

Group 4: Component Analysis
- The patching mechanism significantly enhances data utilization and maintains training stability, allowing for effective error correction and promoting policy diversity [30][37].
- Ablation studies confirm that the combination of trajectory-level and step-level advantage functions, along with multi-frame historical observations, positively impacts the model's decision-making capabilities in complex GUI interactions [44].
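The patching idea described above can be sketched as replaying an offline expert trajectory while letting the policy propose each step: whenever the proposal deviates from the expert, the expert action is patched in so the simulated multi-turn rollout can continue on the offline data. `policy_fn` and `match_fn` are illustrative stubs, not the paper's actual components:

```python
def semi_online_rollout(expert_actions, policy_fn, match_fn):
    """Replay an offline expert trajectory with on-policy proposals.
    On deviation, substitute the expert action (the 'patch') so the
    rollout stays on the recorded trajectory; the patch count signals
    how far the policy has drifted. All callables are illustrative."""
    history, patches = [], 0
    for step, expert_action in enumerate(expert_actions):
        proposed = policy_fn(history, step)
        if not match_fn(proposed, expert_action):
            proposed = expert_action      # patch: fall back to the expert
            patches += 1
        history.append(proposed)          # context reflects chosen actions
    return history, patches

# toy GUI-action policy that deviates at step 1 only
history, n_patches = semi_online_rollout(
    expert_actions=["tap", "swipe", "type"],
    policy_fn=lambda hist, step: "back" if step == 1
                                 else ["tap", "swipe", "type"][step],
    match_fn=lambda a, b: a == b,
)  # history == ["tap", "swipe", "type"], n_patches == 1
```

Because the model still sees context produced by its own accepted actions, this simulates online interaction cheaply, while patching keeps the rollout anchored to data that actually exists offline.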