Topping open-source SOTA! Shanghai Jiao Tong University & Xiaohongshu's LoopTool brings "data evolution" to tool-calling tasks
机器之心· 2025-11-19 04:07
Core Insights
- The article traces the evolution of large language models (LLMs) from merely "speaking" to "doing" through the integration of external tools, and argues that high-quality, diverse training data is the key to improving performance on tool-calling tasks [1][5][35]

Group 1: LoopTool Framework
- A team from Shanghai Jiao Tong University and Xiaohongshu developed LoopTool, an autonomous, model-aware, iterative data evolution framework that achieves closed-loop data-model optimization for tool-calling tasks [2][35]
- LoopTool uses the open-source model Qwen3-32B as both data generator and judge, and the smaller trained 8B model ends up surpassing its 32B counterpart in tool-calling performance [2][35]
- The framework achieves state-of-the-art (SOTA) results among open-source models on the public benchmarks BFCL-v3 and ACEBench, validating the generalizability and effectiveness of closed-loop iterative optimization across model sizes [2][35]

Group 2: Methodology
- LoopTool's core idea is an automated closed loop of data generation, label correction, and model training, driven by feedback from the model's own performance [7][35]
- The process begins with seed data construction, where semantic and constraint trees are used to generate a high-quality, diverse seed dataset while preserving consistency and semantic integrity [9][10]
- The iterative optimization phase comprises four modules: GRPO training for tool calling, greedy capability probing to identify valuable samples, judgement-guided label verification to correct mismatched labels, and error-driven data expansion to synthesize new challenging samples; a minimal sketch of this loop appears after Group 4 below [11][12][13][15][17]

Group 3: Experimental Results
- LoopTool-8B achieved an overall accuracy of 74.93% on BFCL-v3, ranking first among all 8B models and improving on the original Qwen3-8B by +8.59 percentage points [20][23]
- LoopTool-32B reached an overall accuracy of 79.32%, also ranking first, with superior performance in both single-turn and multi-turn scenarios [20][21]
- Performance kept improving across training iterations, in contrast to static training methods that plateaued or declined as the fixed data distribution drifted away from the model's evolving capabilities [23]

Group 4: Generalization and Downstream Tasks
- LoopTool not only strengthens tool-calling ability but also improves general reasoning and complex task handling, as evidenced by gains across a range of general benchmarks [30][31]
- The model showed significant improvements on instruction-following and code-generation tasks, indicating that closed-loop data evolution benefits broader model capabilities [30][31]
- In practical applications, LoopTool's enhanced tool-use ability helps address real-world problems such as API management and complex task execution, showcasing its utility in diverse scenarios [32][33]
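To make the four-module loop summarized in Group 2 concrete, below is a minimal Python sketch of one LoopTool-style iteration. It is assembled only from the module names in this summary; the model and judge objects and every method name (grpo_train, greedy_answer, check, generate_variants) are hypothetical placeholders for illustration, not the team's actual API or code.

```python
# Hypothetical sketch of LoopTool's closed-loop data evolution, based on the
# four modules described in the article: GRPO training, greedy capability
# probing, judgement-guided label verification, and error-driven data expansion.
# All names below are illustrative assumptions, not the LoopTool codebase.
from dataclasses import dataclass


@dataclass
class Sample:
    query: str                    # user request that should trigger tool calls
    label: list[str]              # reference tool-call sequence
    solved_greedy: bool = False   # filled in by capability probing


def capability_probe(model, dataset):
    """Greedy-decode each sample and keep only those the current model still fails."""
    for s in dataset:
        s.solved_greedy = model.greedy_answer(s.query) == s.label
    return [s for s in dataset if not s.solved_greedy]


def verify_labels(judge, hard_samples):
    """Ask the generator/judge model to re-check labels of failed samples and fix mismatches."""
    for s in hard_samples:
        verdict = judge.check(s.query, s.label)
        if not verdict.label_is_correct:
            s.label = verdict.corrected_label
    return hard_samples


def expand_from_errors(judge, hard_samples, n_new_per_error=2):
    """Synthesize new, harder variants around the model's remaining failure cases."""
    new_samples = []
    for s in hard_samples:
        new_samples.extend(judge.generate_variants(s, n=n_new_per_error))
    return new_samples


def looptool_iteration(model, judge, dataset):
    """One round of the data-model closed loop, per the summary above."""
    model.grpo_train(dataset)                       # 1) GRPO training for tool calling
    hard = capability_probe(model, dataset)         # 2) greedy capability probing
    hard = verify_labels(judge, hard)               # 3) judgement-guided label verification
    dataset = dataset + expand_from_errors(judge, hard)  # 4) error-driven data expansion
    return model, dataset
```

Run over several rounds, this loop keeps the training data aligned with the model's current weaknesses, which is the mechanism the article credits for the continued gains reported in Group 3, in contrast to static training on a fixed dataset.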