Core Viewpoint
- Tencent proposes Training-Free GRPO, a new method for upgrading large-model agents that significantly reduces cost and improves performance without any parameter tuning [1][5][11].

Group 1: Methodology
- Training-Free GRPO improves performance by learning from concise experiences embedded in prompts, eliminating the need for parameter updates [2][11].
- The approach keeps the model parameters frozen and instead dynamically updates an external experience library to optimize behavior [14][22]; a sketch of this loop follows the summary below.
- It retains the core logic of traditional GRPO (group rollouts scored by group-relative advantage) but recasts the update as a non-parametric reasoning process [13].

Group 2: Experimental Results
- Experiments show that DeepSeek-V3.1-Terminus equipped with Training-Free GRPO achieves significant gains on mathematical reasoning and web-search tasks [4][25].
- Compared with fine-tuning a 32B model, Training-Free GRPO requires less training data and far lower cost: roughly $18 versus over $10,000 for the traditional pipeline [5][28].
- On AIME24 and AIME25, accuracy improved from 80.0% to 82.7% and from 67.9% to 73.3% respectively, a clear gain from only a small number of training samples [28].

Group 3: Performance Evaluation
- On the WebWalkerQA benchmark the method reached a Pass@1 of 67.8%, a significant increase over the 63.2% baseline [35].
- The learned experiences help the model avoid redundant tool calls and improve decision-making efficiency [30][31].
- The effectiveness of Training-Free GRPO depends on the underlying model's reasoning and tool-use capabilities: on less capable models its gains are markedly smaller [40].
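Group 1 amounts to: freeze the weights, make the only mutable "parameters" an external experience library pasted into each prompt, and turn GRPO's group-relative advantage into a language-level comparison between the best and worst rollouts in a group. Below is a minimal sketch of that loop, assuming a generic `llm(prompt) -> str` completion wrapper and a task-specific `reward` scorer (both hypothetical placeholders; the paper drives DeepSeek-V3.1-Terminus through an API). This illustrates the idea, not the authors' exact implementation.

```python
# Minimal sketch of a Training-Free GRPO loop. Assumptions: `llm` is any
# frozen chat model behind an API; `reward` scores a rollout (e.g. checks a
# math answer). Only the experience library changes; no weights are updated.
from typing import Callable, List

def training_free_grpo(
    llm: Callable[[str], str],            # frozen model, called via prompt
    questions: List[str],                 # small training set of queries
    reward: Callable[[str, str], float],  # task-specific scorer
    group_size: int = 4,                  # rollouts per question, as in GRPO
    epochs: int = 3,
) -> List[str]:
    experiences: List[str] = []           # external knowledge base, starts empty
    for _ in range(epochs):
        for q in questions:
            # 1. Group rollout: several attempts conditioned on the current
            #    experience library.
            library = "\n".join(f"- {e}" for e in experiences)
            rollouts = [
                llm(f"Known experiences:\n{library}\n\nQuestion: {q}")
                for _ in range(group_size)
            ]
            scores = [reward(q, r) for r in rollouts]

            # 2. No spread in rewards -> no group-relative signal; skip,
            #    mirroring GRPO's mean-normalized advantage going to zero.
            if max(scores) == min(scores):
                continue

            # 3. Non-parametric "update step": distill why the best rollout
            #    beat the worst into one short, reusable lesson in plain text.
            best = rollouts[scores.index(max(scores))]
            worst = rollouts[scores.index(min(scores))]
            lesson = llm(
                "Compare a successful and a failed attempt at the same task.\n"
                f"Task: {q}\nSuccessful attempt: {best}\nFailed attempt: {worst}\n"
                "State one concise, generalizable lesson for future attempts."
            )
            experiences.append(lesson.strip())
    return experiences
```

The point the sketch makes concrete: the "gradient" here is a sentence appended to a library, so upgrading a frozen API model costs a few dollars of tokens (the quoted ~$18) rather than a GPU fine-tuning run. The paper's full procedure also revises and prunes library entries rather than only appending, and the finished library is injected into inference prompts, which is how it curbs redundant tool calls.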
Source headline: Tencent releases an ultra-low-cost AI training method: ¥120 (about $18) beats a ¥70,000 (over $10,000) fine-tuning pipeline
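For reference on the Group 3 numbers: Pass@1 is single-attempt accuracy, the fraction of benchmark questions answered correctly on the first try. A minimal sketch with illustrative names (`answer_once`, `is_correct`, and the benchmark format are assumptions, not from the paper):

```python
# Minimal sketch of Pass@1: one attempt per question, score the hit rate.
from typing import Callable, List, Tuple

def pass_at_1(
    answer_once: Callable[[str], str],       # single model attempt per question
    benchmark: List[Tuple[str, str]],        # (question, gold answer) pairs
    is_correct: Callable[[str, str], bool],  # task-specific answer checker
) -> float:
    hits = sum(is_correct(answer_once(q), gold) for q, gold in benchmark)
    return hits / len(benchmark)

# 67.8% Pass@1 on WebWalkerQA thus means ~67.8% of single-shot answers were
# judged correct, versus 63.2% for the baseline without learned experiences.
```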