Extending the External Test-Time Scaling Law: Zhongguancun Academy Finds a Lightweight Verifier Unlocks Optimal Selection for LLM Reasoning
机器之心·2025-11-06 05:28

Core Insights
- The article discusses Test-Time Scaling (TTS), a method for enhancing the reasoning capabilities of large language models (LLMs) by allocating more computational resources during the model's response phase [4][6]
- It introduces TrajSelector, a lightweight yet effective Best-of-N strategy that leverages the hidden states of the policy model to evaluate reasoning paths, without expensive process annotations or a large reward model [7][10]

Summary by Sections

Research Background
- TTS methods are categorized as internal or external; external TTS focuses on parallel reasoning, generating multiple candidate paths and then selecting a final answer among them [4][6]

Existing Methods and Their Limitations
- Traditional Best-of-N selection relies on Majority Voting or a Process Reward Model (PRM); the former is unstable while the latter is expensive and inefficient [5][10]

TrajSelector Methodology
- TrajSelector runs a three-step pipeline of parallel sampling, step scoring, and aggregation to select the optimal reasoning path (a minimal sketch of this pipeline appears after this summary) [12][14]
- It uses a lightweight scoring model (0.6B parameters) that reads the hidden states of the larger policy model to assess each reasoning step, achieving better scoring performance at a far smaller parameter count (see the scorer sketch below) [13][14]

Training Approach
- TrajSelector adopts a weakly supervised training scheme that removes the need for extensive manual step-level annotation, letting the model learn effectively from large datasets (an outcome-only loss sketch follows the other examples) [16][17]

Experimental Results
- The article reports metrics across various values of N in Best-of-N tasks, showing that TrajSelector outperforms traditional methods on multiple benchmarks [19][20]

Conclusion
- TrajSelector marks a meaningful advance in optimizing reasoning for large models, underscoring that effectively exploiting a model's existing capabilities matters more than simply scaling up model size [22][23]
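To make the Best-of-N pattern concrete, the minimal Python sketch below shows the generic selection loop the methodology section describes: sample N reasoning paths in parallel, score each with a verifier, and keep the highest-scored one, with the Majority Voting baseline from the limitations section alongside for comparison. The names (`Trajectory`, `best_of_n`, `majority_vote`) are illustrative, not the paper's code.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Trajectory:
    steps: List[str]  # the chain-of-thought, split into reasoning steps
    answer: str       # the final answer extracted from the path

def best_of_n(sample: Callable[[], Trajectory],
              score: Callable[[Trajectory], float],
              n: int = 8) -> Trajectory:
    """Best-of-N selection: draw N reasoning paths, return the best-scored one."""
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=score)

def majority_vote(candidates: List[Trajectory]) -> str:
    """Baseline for comparison: the most frequent final answer wins."""
    return Counter(t.answer for t in candidates).most_common(1)[0][0]
```

The contrast between the two functions is the article's core point: Majority Voting ignores path quality entirely, whereas a verifier-driven `score` lets a single well-reasoned minority path beat a popular but flawed one.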
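The summary states that a 0.6B scoring model reads the larger policy model's hidden states to score each reasoning step, but does not specify the interface. The sketch below stands in a simple MLP head over per-step hidden states, with illustrative sizes; this is an assumption for clarity, not the paper's actual 0.6B architecture.

```python
import torch
import torch.nn as nn

class StepScorer(nn.Module):
    """Hypothetical lightweight verifier: maps the frozen policy model's hidden
    state at each step boundary to a scalar step score, then averages over
    steps to produce one trajectory-level score. Sizes are illustrative."""

    def __init__(self, hidden_size: int = 4096, proj_size: int = 1024):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(hidden_size, proj_size),
            nn.GELU(),
            nn.Linear(proj_size, 1),
        )

    def forward(self, step_hidden: torch.Tensor) -> torch.Tensor:
        # step_hidden: (num_steps, hidden_size), taken at the end of each
        # reasoning step from the frozen policy model's last layer.
        step_scores = self.head(step_hidden).squeeze(-1)  # (num_steps,)
        return step_scores.mean()  # aggregate into one trajectory score

scorer = StepScorer()
hidden = torch.randn(5, 4096)        # fake hidden states for 5 reasoning steps
trajectory_score = scorer(hidden)    # scalar score for this path
```

The design intuition matches the article's claim: because the policy model has already encoded the reasoning in its hidden states, the verifier only needs to read them, so it can be far smaller than a standalone Process Reward Model that must re-process the text.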
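On the weak-supervision training scheme, a common way to avoid step-level annotation is to supervise the verifier with final-answer correctness alone, which is cheap to check at scale. The sketch below assumes that setup, a guess consistent with the summary rather than a confirmed detail, using a binary cross-entropy loss on the aggregated trajectory score.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def outcome_loss(trajectory_logit: torch.Tensor, answer_is_correct: bool) -> torch.Tensor:
    """Weak label = final-answer correctness; no per-step human annotation needed."""
    target = torch.tensor(float(answer_is_correct))
    return F.binary_cross_entropy_with_logits(trajectory_logit, target)

# One illustrative update step with a stand-in verifier head (hypothetical sizes):
scorer = nn.Sequential(nn.Linear(4096, 1))           # placeholder scoring head
optim = torch.optim.AdamW(scorer.parameters(), lr=1e-4)

step_hidden = torch.randn(5, 4096)                   # fake hidden states, 5 steps
logit = scorer(step_hidden).mean()                   # aggregate step scores
loss = outcome_loss(logit, answer_is_correct=True)   # label from answer checking
loss.backward()
optim.step()
```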