Test-Time Scaling
Extending the external test-time scaling law — a new finding from Zhongguancun Academy: a lightweight verifier can unlock optimal selection for LLM reasoning
机器之心· 2025-11-06 05:28
Core Insights
- The article discusses Test-Time Scaling (TTS) as a method to enhance the reasoning capabilities of large language models (LLMs) by allocating more computational resources during the model's response phase [4][6]
- It introduces TrajSelector, a lightweight yet powerful Best-of-N strategy that leverages the hidden states of large models to evaluate reasoning paths without expensive process annotations or large reward models [7][10]

Summary by Sections

Research Background
- TTS is categorized into internal and external methods, with the latter focusing on parallel reasoning that generates multiple candidate paths before settling on a final answer [4][6]

Existing Methods and Their Limitations
- Traditional Best-of-N methods include Majority Voting and the Process Reward Model (PRM), both of which have significant drawbacks such as instability and inefficiency [5][10]; a minimal voting baseline is sketched after this summary

TrajSelector Methodology
- TrajSelector operates through a three-step pipeline: parallel sampling, step scoring, and aggregation, selecting the optimal reasoning path (see the second sketch below) [12][14]
- It uses a lightweight scoring model (0.6B parameters) that assesses reasoning steps from the hidden states of a larger policy model, achieving better scoring performance at a fraction of the parameter count [13][14]

Training Approach
- TrajSelector employs a weak-supervision training scheme that eliminates the need for extensive manual annotation, allowing the model to learn effectively from large datasets [16][17]

Experimental Results
- The article reports performance for a range of N values on Best-of-N tasks, showing that TrajSelector outperforms traditional methods across multiple benchmarks [19][20]

Conclusion
- TrajSelector offers a significant advance in optimizing reasoning for large models, emphasizing that effectively utilizing existing model capabilities matters more than merely increasing model size [22][23]
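For reference, here is a minimal sketch of the Majority Voting baseline mentioned above: sample N answers and keep the most frequent one. The `samples` list stands in for N decoded final answers; this illustrates only the voting rule, not code from the paper.

```python
from collections import Counter

def majority_vote(samples: list[str]) -> str:
    """Return the most frequent final answer among N sampled answers."""
    return Counter(samples).most_common(1)[0][0]

print(majority_vote(["42", "41", "42", "42", "40"]))  # prints "42"
```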
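And a minimal, self-contained sketch of the three-step pipeline the summary describes (parallel sampling, step scoring, aggregation). The `Verifier` head and `sample_trajectory` function are toy stand-ins invented for illustration: the real method scores steps with a 0.6B model reading the policy LLM's hidden states, whereas this sketch uses random hidden states and a linear head.

```python
import torch
import torch.nn as nn

HIDDEN = 64  # toy hidden size; a real policy model's is far larger

class Verifier(nn.Module):
    """Lightweight scoring head: pooled hidden state -> scalar step score."""
    def __init__(self, hidden: int = HIDDEN):
        super().__init__()
        self.score = nn.Linear(hidden, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.score(h).squeeze(-1)

def sample_trajectory(prompt: str):
    """Toy stand-in for the policy LLM: returns (answer text, one pooled
    hidden state per reasoning step)."""
    n_steps = torch.randint(3, 8, ()).item()
    states = torch.randn(n_steps, HIDDEN)
    return f"answer-for-{prompt}-{torch.rand(()).item():.3f}", states

@torch.no_grad()
def best_of_n(prompt: str, verifier: Verifier, n: int = 8) -> str:
    best_text, best_score = "", float("-inf")
    for _ in range(n):                          # 1) parallel sampling
        text, states = sample_trajectory(prompt)
        step_scores = verifier(states)          # 2) step scoring from hidden states
        traj_score = step_scores.mean().item()  # 3) aggregation (mean over steps)
        if traj_score > best_score:
            best_text, best_score = text, traj_score
    return best_text

print(best_of_n("2+2=?", Verifier()))
```

Mean aggregation is one simple choice here; the key design point the summary highlights is that the scorer reuses the large model's hidden states rather than rerunning a separate large reward model.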
In video generation, 1.3B crushes 14B; in image generation, results approach GPT-4o! HKUST and Kuaishou open-source a new test-time scaling paradigm
机器之心· 2025-06-10 03:58
The paper's first author is He Haoran, a second-year PhD student at the Hong Kong University of Science and Technology (HKUST), whose research interests include reinforcement learning, generative flow networks (GFlowNets), and embodied intelligence. The corresponding author is Pan Ling, an assistant professor in HKUST's Department of Electronic and Computer Engineering and Department of Computer Science and Engineering.

Test-Time Scaling has greatly boosted the performance of large language models, producing breakout hits such as OpenAI's o-series models and DeepSeek R1. So what is test-time scaling in the visual domain, and how should it be defined?

To answer this question, HKUST and Kuaishou's Kling team recently introduced Evolutionary Search (EvoSearch), a method that substantially improves a model's generation quality by increasing inference-time compute. It supports both image and video generation and works with today's most advanced diffusion-based and flow-based models. EvoSearch requires no training and no gradient updates, yet achieves clearly superior results across a range of tasks, and exhibits strong scaling-up ability, robustness, and generalization.

As test-time compute increases, EvoSearch shows that SD2.1 and Flux.1-dev have the potential to match or even surpass GPT-4o. For video generation, Wan 1.3B can even surpass Wan 14B ...
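As a rough illustration of the idea, here is a minimal evolutionary-search loop in the spirit of EvoSearch: evolve a population of initial noise latents by scoring decoded samples, keeping the best, and mutating them, with no training or gradient updates. The `decode` and `reward` functions are toy stand-ins for a diffusion/flow sampler and a quality reward model; nothing below is the team's released code.

```python
import torch

def decode(latent: torch.Tensor) -> torch.Tensor:
    """Toy stand-in for a diffusion/flow sampler mapping noise -> sample."""
    return torch.tanh(latent)

def reward(sample: torch.Tensor) -> float:
    """Toy quality score; in practice an aesthetic/alignment reward model."""
    return -(sample - 0.5).pow(2).mean().item()

@torch.no_grad()
def evo_search(dim=256, pop=16, elite=4, steps=10, sigma=0.3):
    population = torch.randn(pop, dim)  # initial noise latents
    for _ in range(steps):
        # Score every candidate, select the top-`elite` as parents,
        # then produce the next generation by Gaussian mutation.
        scores = torch.tensor([reward(decode(z)) for z in population])
        parents = population[scores.topk(elite).indices]
        population = parents.repeat(pop // elite, 1) + sigma * torch.randn(pop, dim)
    scores = torch.tensor([reward(decode(z)) for z in population])
    return decode(population[scores.argmax()])

best_sample = evo_search()
```

Since only forward passes through the sampler and the reward are needed, spending more test-time compute just means a larger population or more generations, which matches the scaling-up behavior the article describes.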