Token Elasticity

ACL 2025 | Token-Budget-Aware Efficient Reasoning for Large Language Models
机器之心 · 2025-06-05 02:00
Core Insights
- The article introduces TALE (Token-Budget-Aware LLM Reasoning), a framework that improves the efficiency of large language models (LLMs) on reasoning tasks by imposing a "token budget" constraint while preserving accuracy [2][9][17].
- The framework targets the excessive token generation of current reasoning methods, which inflates computational cost and resource consumption, particularly in resource-constrained environments [6][17].

Group 1: Background and Motivation
- Chain-of-Thought (CoT) reasoning tends to produce long, redundant token sequences, significantly increasing computational and economic costs [6][17].
- The work identifies a "Token Elasticity" phenomenon: when the imposed token budget is too small, models overshoot it, and the overall cost can end up higher than under a looser budget (a measurement sketch follows below) [7][9].

Group 2: TALE Framework Implementation
- TALE comes in two implementations: TALE-EP (Estimation and Prompting) and TALE-PT (Post-Training) [9][15].
- TALE-EP has the model self-estimate the token budget a given problem requires and injects that estimate into the input prompt, cutting token usage by over 60% while maintaining accuracy (see the prompting sketch below) [12][13].
- TALE-PT internalizes token-budget awareness through supervised fine-tuning (SFT) or direct preference optimization (DPO), reducing average token usage by over 40% while preserving reasoning accuracy (see the data-construction sketch below) [15][16].

Group 3: Experimental Results
- Across multiple datasets, both TALE-EP and TALE-PT significantly outperform standard CoT prompting in token efficiency at comparable accuracy [13][16].
- The findings suggest that TALE can broaden the applicability of LLM reasoning in resource-limited scenarios [17][19].
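As a rough illustration of how the token-elasticity phenomenon could be observed, the sketch below sweeps progressively tighter prompted budgets and records the tokens the model actually spends. The prompt wording, the model name (gpt-4o-mini), and the OpenAI-compatible client are illustrative assumptions, not the paper's experimental setup.

```python
# Sketch: observe "Token Elasticity" by sweeping prompted budgets downward
# and recording the tokens actually consumed. Prompt wording and model
# choice are illustrative assumptions, not the paper's setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def tokens_used(question: str, budget: int) -> int:
    """Ask for a budget-constrained answer; return actual completion tokens."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Solve the problem step by step, using at most "
                       f"{budget} tokens.\n\n{question}",
        }],
    )
    return resp.usage.completion_tokens


question = "What is 17 * 24?"
for budget in (512, 256, 128, 64, 32, 16):
    used = tokens_used(question, budget)
    # Elasticity shows up when `used` stops tracking the budget
    # once the budget drops below what the problem genuinely needs.
    print(f"budget={budget:4d}  actually used={used}")
```

If elasticity holds, the actual usage decouples from the budget once the budget becomes too tight, so total cost bottoms out at an intermediate budget rather than at the smallest one.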
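The following is a minimal sketch of TALE-EP's two-stage idea as summarized above: first let the model self-estimate a token budget for the problem, then inject that estimate into the reasoning prompt. The prompt wording, the fallback budget, and the `chat` helper are illustrative assumptions, not the paper's exact prompts.

```python
# Sketch of TALE-EP-style two-stage, budget-aware prompting.
# Prompt wording is illustrative; the paper's exact prompts may differ.
import re

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def chat(prompt: str) -> str:
    """Single-turn helper around the chat completions API."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def estimate_budget(question: str) -> int:
    """Stage 1: have the model self-estimate the tokens it needs."""
    reply = chat(
        "Estimate the minimum number of output tokens you would need to "
        f"solve the following problem. Answer with a single integer.\n\n{question}"
    )
    match = re.search(r"\d+", reply)
    return int(match.group()) if match else 512  # fallback budget (assumed)


def budgeted_answer(question: str) -> str:
    """Stage 2: inject the estimated budget into the reasoning prompt."""
    budget = estimate_budget(question)
    return chat(
        f"Solve the problem step by step, using at most {budget} tokens "
        f"of reasoning.\n\n{question}"
    )


print(budgeted_answer("A train travels 120 km in 1.5 hours. "
                      "What is its average speed?"))
```

Because TALE-EP operates purely at the prompt level, it requires no weight updates, which makes the reported 60%+ token savings applicable even to models only reachable through an API.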
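For TALE-PT, here is a sketch of how budget-aware post-training records might be assembled: SFT supervises directly on a concise, budget-compliant reasoning trace, while DPO pairs a concise correct trace (chosen) against a verbose correct one (rejected). The field names and JSONL layout are assumptions for illustration, not the paper's released data format.

```python
# Sketch: assembling budget-aware post-training data in the spirit of
# TALE-PT. Record layouts (field names, JSONL format) are illustrative
# assumptions, not the paper's released format.
import json


def make_sft_record(question: str, short_trace: str) -> dict:
    """SFT: supervise directly on a budget-compliant reasoning trace."""
    return {"prompt": question, "completion": short_trace}


def make_dpo_record(question: str, short_trace: str, long_trace: str) -> dict:
    """DPO: prefer the concise correct trace over the verbose correct one,
    so the model learns to spend fewer tokens for the same answer."""
    return {"prompt": question, "chosen": short_trace, "rejected": long_trace}


question = "A train travels 120 km in 1.5 hours. What is its average speed?"
short = "Speed = 120 / 1.5 = 80 km/h. Answer: 80 km/h."
long = ("First recall that speed is distance divided by time. The distance "
        "is 120 km and the time is 1.5 hours, so we compute 120 / 1.5, "
        "which equals 80. Therefore the average speed is 80 km/h.")

with open("tale_pt_dpo.jsonl", "w") as f:
    f.write(json.dumps(make_dpo_record(question, short, long)) + "\n")
print(make_sft_record(question, short))
```

Either record type can then feed a standard SFT or DPO trainer, mirroring the article's split: SFT bakes conciseness in by imitation, DPO by preference.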