BALROG

机器之心· 2025-10-08 04:13

Core Insights
- The article discusses the challenges faced by intelligent agents in maintaining clear reasoning and robust decision-making over long-term tasks, particularly when the task extends to hundreds of steps [2][3]
- It introduces Verlog, a multi-turn reinforcement learning framework designed to handle long-horizon tasks effectively, overcoming limitations of traditional frameworks [3][20]

Group 1: Framework Overview
- Verlog is built on the foundations of VeRL and BALROG, incorporating specialized optimization techniques to ensure stable and efficient training across tasks that can extend beyond 400 steps [3][20]
- The framework has been validated in complex environments such as BabyAI, BabaIsAI, and Crafter, demonstrating strong performance in tasks with varying episode lengths [3][19]

Group 2: Methodology
- The base model for Verlog is the Qwen-2.5 Instruct variant, which allows seamless integration with BALROG and facilitates the use of benchmark testing prompts with minimal modifications [6][7]
- A memory mechanism is employed to retain only the latest n + 1 rounds of interactions, optimizing performance for the 3B parameter Qwen model [9][10]

Group 3: Algorithmic Innovations
- The Dual Discounting GAE algorithm is introduced to decouple tokens from steps, encouraging agents to complete tasks in fewer environment steps [11][20]
- The recursive calculation of GAE enhances the stability of training, allowing for effective learning even in sparse reward scenarios [12][14]

Group 4: Experimental Results
- Verlog was tested on three challenging benchmarks: Crafter, BabyAI, and BabaIsAI, showcasing its ability to adapt to long-duration tasks with sparse rewards [16][19]
- The training of the Qwen2.5-7B-Instruct model in the Crafter environment utilized 8 H100 GPUs over approximately 36 hours, while the Qwen2.5-3B-Instruct model for BabyAI and BabaIsAI was trained on 4 A40 GPUs for about 24 hours [19]

Group 5: Future Directions
- Verlog aims to serve as a flexible research platform to advance the development of long-horizon LLM-Agent reinforcement learning [21][20]
- The framework addresses key engineering challenges such as managing long interaction histories, ensuring training stability under sparse rewards, and handling variable trajectory lengths [20][23]
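
The memory mechanism from Group 2 (retaining only the latest n + 1 rounds of interactions) can be illustrated with a bounded buffer. This is a minimal sketch, not Verlog's actual code: the `HistoryWindow` class, its method names, and the "Observation/Action" prompt layout are all hypothetical, chosen only to show how older turns fall out of the context.

```python
from collections import deque


class HistoryWindow:
    """Hypothetical helper: keep only the most recent n + 1 turns."""

    def __init__(self, n: int) -> None:
        if n < 0:
            raise ValueError("n must be non-negative")
        # A deque with maxlen silently drops the oldest entry when full.
        self.turns = deque(maxlen=n + 1)

    def add_turn(self, observation: str, action: str) -> None:
        """Record one round of interaction (observation plus agent action)."""
        self.turns.append((observation, action))

    def render(self) -> str:
        """Build the text context from the retained turns, oldest first."""
        lines = []
        for observation, action in self.turns:
            lines.append(f"Observation: {observation}")
            lines.append(f"Action: {action}")
        return "\n".join(lines)


window = HistoryWindow(n=2)
for step in range(5):
    window.add_turn(f"obs {step}", f"act {step}")

# Only turns 2, 3 and 4 remain in the context.
print(window.render())
```

Bounding the history keeps the prompt length roughly constant over an episode of several hundred steps, which is the property the summary attributes to the mechanism.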
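
The Dual Discounting GAE idea from Group 3 can be sketched as a backward recursion that uses one (gamma, lambda) pair between tokens of the same turn and another pair across turn boundaries. The summary only states that tokens are decoupled from steps, so everything specific here is an assumption: the function name `dual_discount_gae`, the flattened per-token layout, the `turn_ids` argument, and the default discount values are illustrative and not taken from the source.

```python
from typing import List


def dual_discount_gae(
    rewards: List[float],
    values: List[float],
    turn_ids: List[int],
    gamma_token: float = 1.0,   # assumed: no discount inside a turn
    lam_token: float = 1.0,
    gamma_step: float = 0.99,   # assumed: discount per environment step
    lam_step: float = 0.95,
) -> List[float]:
    """Per-token advantages for one trajectory flattened across turns.

    rewards[t] and values[t] belong to token t; turn_ids[t] is the index
    of the environment step (turn) that produced token t.
    """
    if not (len(rewards) == len(values) == len(turn_ids)):
        raise ValueError("rewards, values and turn_ids must have equal length")

    num_tokens = len(rewards)
    advantages = [0.0] * num_tokens
    next_advantage = 0.0
    next_value = 0.0  # value after the final token is treated as zero

    # Recurse backward from the last token of the last turn.
    for t in reversed(range(num_tokens)):
        crosses_turn = t + 1 < num_tokens and turn_ids[t + 1] != turn_ids[t]
        if crosses_turn:
            gamma, lam = gamma_step, lam_step
        else:
            gamma, lam = gamma_token, lam_token

        delta = rewards[t] + gamma * next_value - values[t]
        advantages[t] = delta + gamma * lam * next_advantage

        next_advantage = advantages[t]
        next_value = values[t]

    return advantages


# Two turns of two tokens each, with a sparse reward on the final token.
adv = dual_discount_gae(
    rewards=[0.0, 0.0, 0.0, 1.0],
    values=[0.0, 0.0, 0.0, 0.0],
    turn_ids=[0, 0, 1, 1],
    gamma_step=0.9,
    lam_step=1.0,
)
print(adv)
```

In this toy example the tokens of the earlier turn receive a smaller advantage than those of the final turn, while tokens within the same turn are treated equally. Under these assumptions the discount depends on the number of environment steps rather than on how many tokens the agent generates, which matches the stated goal of encouraging completion in fewer steps.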