One-Token Verification (OTV) Framework
Boosting reasoning performance without changing the model? ICLR submission proposes OTV, a new test-time scaling paradigm
量子位· 2025-10-23 00:08
Core Insights

- The article discusses the challenges faced by large language models, including hallucinations, logical errors, and reasoning flaws, which have prompted researchers to explore new methods for enhancing output reliability [1]
- A novel approach called One-Token Verification (OTV) is introduced, which allows a model to monitor its own reasoning process in real time without altering the original model's structure or parameters [2]

Summary by Sections

Current Mainstream Paradigms

- LoRA fine-tuning is highlighted as a popular parameter-efficient tuning method that avoids full-parameter training and is easy to deploy, but it often relies on detailed supervised data and can cause "forgetting effects" [3]
- Quality screening of generated results can enhance output credibility but tends to be reactive: it is difficult to correct the model's reasoning in real time, and it offers no insight into the internal reasoning process [4]

Parallel Thinking Framework

- The article introduces the concept of Parallel Thinking, in which a language model generates multiple reasoning paths simultaneously and then filters them through a selection mechanism [5]
- OTV builds on this framework, focusing on efficiently selecting correct reasoning paths at low cost rather than on generating more paths [5]

OTV Mechanism

- OTV employs an internal verifier, implemented as a lightweight role vector via LoRA, that runs in parallel with the original model and analyzes its reasoning process [9]
- The internal verifier reads the key-value cache (KV Cache) of the Transformer architecture, capturing rich information about the model's internal dynamics during reasoning [9]
- A special token, the "Token of Truth" (ToT), is inserted during the verification phase to assess the correctness of the current reasoning path [9]

Training and Efficiency

- OTV's internal verifier is lightweight to train: heuristic pseudo-labels are assigned based on the correctness of the final answer [10]
- Training is highly parallelized, producing scoring predictions for all positions simultaneously, which makes the computational cost comparable to conventional LoRA fine-tuning [10]

Experimental Validation

- OTV was systematically evaluated on a range of open-source models, demonstrating superior accuracy and a preference for shorter, more accurate reasoning paths compared with baseline methods [14]
- The results indicate that OTV can read both the internal reasoning state and the output quality, significantly outperforming general methods that rely solely on the output text [15]

Dynamic Control of Computational Costs

- OTV lets the model dynamically control computational expense by eliminating low-quality paths in real time based on confidence scores, cutting computational load by nearly 90% while maintaining peak accuracy [17]

Future Prospects

- The OTV framework opens avenues for deeper integration with the original model, including a potential three-state scheme with an "uncertain" state to enable selective prediction [25][26]
- The approach could also be extended to other model architectures, with optimized KV cache structures further improving reasoning efficiency and representation utilization [26]
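The verification-and-selection idea described above can be illustrated with a toy sketch. This is not the paper's implementation: `verifier_score`, `select_best_path`, the pooled-state probe, and the random "KV states" are all hypothetical stand-ins, meant only to show the shape of the mechanism, namely a lightweight probe reading cached states at an appended verification position and scoring each parallel path.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # toy hidden-state dimension (stand-in for a real model's width)

def verifier_score(kv_states: np.ndarray, probe_w: np.ndarray) -> float:
    """Hypothetical internal verifier: reads a path's cached hidden states
    (stand-ins for the Transformer KV cache), pools them at an appended
    verification position (the 'Token of Truth'), and maps the result to a
    correctness probability via a LoRA-sized linear probe."""
    tot_state = kv_states.mean(axis=0)      # pooled state at the ToT position
    logit = float(tot_state @ probe_w)      # lightweight linear probe
    return 1.0 / (1.0 + np.exp(-logit))     # sigmoid -> confidence in [0, 1]

def select_best_path(paths: list, probe_w: np.ndarray) -> int:
    """Parallel-thinking selection: score every candidate reasoning path
    with the verifier and return the index of the most trusted one."""
    scores = [verifier_score(p, probe_w) for p in paths]
    return int(np.argmax(scores))

# Toy usage: 4 candidate paths of varying length, random probe weights.
paths = [rng.normal(size=(int(rng.integers(5, 20)), D)) for _ in range(4)]
probe_w = rng.normal(size=D)
best = select_best_path(paths, probe_w)
```

In a real system the probe would be trained with the heuristic pseudo-labels the article mentions (correct final answer → positive label), which is what keeps the verifier's training cost close to ordinary LoRA fine-tuning.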
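The dynamic cost control described above amounts to pruning low-confidence paths between decoding chunks so they stop consuming compute. The sketch below is a simulation under assumed names: `prune_paths`, the confidence threshold, and the random drift of scores are illustrative choices, not the paper's algorithm or numbers.

```python
import random

random.seed(0)

def prune_paths(scores, threshold=0.3, keep_min=1):
    """Keep only paths whose verifier confidence meets `threshold`,
    but never drop below `keep_min` survivors (best scores win)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = [i for i in order if scores[i] >= threshold]
    if len(kept) < keep_min:
        kept = order[:keep_min]
    return sorted(kept)

# Simulated decode loop: 8 parallel paths, re-scored after every chunk.
# Pruned paths cost nothing afterwards, so total work shrinks over time.
active = list(range(8))
confidence = {i: random.random() for i in active}   # stand-in verifier scores
chunks_spent = 0
for chunk in range(5):
    chunks_spent += len(active)                     # each active path costs one chunk
    kept = prune_paths([confidence[i] for i in active])
    active = [active[j] for j in kept]
    for i in active:                                # hypothetical score drift
        confidence[i] = min(1.0, max(0.0, confidence[i] + random.uniform(-0.1, 0.1)))
```

Because every path must be scored anyway for final selection, the only extra machinery this adds is the threshold check, which is how the article's near-90% compute reduction can coexist with unchanged accuracy: the surviving paths are exactly the ones the verifier would have ranked highest.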