Alibaba Tongyi open-sources a new "reasoning + search" pre-training framework: small models rival large ones, with significant gains on multiple open-domain QA datasets
量子位·2025-05-31 03:34

Core Viewpoint
Alibaba's Tongyi Lab has introduced MaskSearch, a new pre-training framework that enhances the reasoning and search capabilities of large models, achieving significant performance improvements on both in-domain and cross-domain open-domain question-answering tasks [1][2].

Group 1: MaskSearch Framework
- MaskSearch is a general pre-training framework that shows remarkable performance gains over baseline methods on open-domain question-answering tasks [2].
- The framework is built around a retrieval-augmented masked prediction task (RAMP), in which the model draws on external knowledge bases to predict masked text spans, thereby improving both its reasoning and its search capabilities [5][11] (a minimal sketch of RAMP example construction follows this summary).
- MaskSearch supports both supervised fine-tuning (SFT) and reinforcement learning (RL) training methods, allowing flexible model training [6].

Group 2: Training Methodology
- The SFT stage generates chain-of-thought (CoT) data through a multi-agent system that collaborates to produce reasoning chains, retaining only trajectories that reach the correct answer [12] (see the filtering sketch below).
- The RL stage uses a dynamic sampling strategy and a hybrid reward system to optimize the model's multi-step search and reasoning [15][20] (see the hybrid-reward sketch below).
- A curriculum learning strategy gradually increases sample difficulty according to the number of masked spans, steadily building up the model's reasoning skills [16][24] (see the curriculum sketch below).

Group 3: Experimental Results
- Experiments show that the two-stage MaskSearch training framework significantly strengthens the search and reasoning capabilities of large models, with recall improvements noted across multiple datasets [18][19].
- The RL approach reaches a higher performance ceiling, particularly on in-domain tasks such as HotpotQA, indicating its effectiveness at optimizing search and reasoning [19][20].
- MaskSearch's scalability is validated: smaller models show large performance gains after pre-training, while larger models improve more gradually [22].

Group 4: Additional Insights
- The masking strategy is crucial in determining the difficulty of the RAMP pre-training task; experiments indicate that a perplexity-based masking strategy can raise model recall [27][30] (see the final sketch below).
- Different reward functions in the RL training process affect model performance differently, with model-based reward functions demonstrating superior stability and efficiency [31][33].
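
To make the RAMP objective concrete, below is a minimal sketch of how a retrieval-augmented masked-prediction training example might be assembled. The `[mask]` placeholder, the prompt wording, and the span-selection interface are illustrative assumptions; the source does not specify MaskSearch's exact data format.

```python
# Minimal sketch of constructing a RAMP-style training example.
# Assumptions (not from the article): the "[mask]" token, the prompt
# template, and the span-selection heuristic are all illustrative.

MASK_TOKEN = "[mask]"

def build_ramp_example(text: str, spans_to_mask: list[str]) -> dict:
    """Mask salient spans (entities, dates, numbers) in `text`.

    The model is trained to recover the masked spans and is allowed to
    call an external search tool while reasoning, which is what couples
    the mask-prediction objective to search behavior.
    """
    masked = text
    for span in spans_to_mask:
        masked = masked.replace(span, MASK_TOKEN, 1)
    return {
        "prompt": (
            "Fill in each [mask] in the passage below. "
            "You may issue search queries to an external knowledge base "
            "before giving your final answer.\n\n" + masked
        ),
        "targets": spans_to_mask,         # supervision signal
        "difficulty": len(spans_to_mask)  # used later for curriculum ordering
    }

example = build_ramp_example(
    "Nikola Tesla was born in 1856 in Smiljan.",
    spans_to_mask=["Nikola Tesla", "1856"],
)
print(example["prompt"])
print(example["targets"])
```

Because the targets are spans the model must look up rather than memorize, solving the task rewards issuing good search queries, which is the behavior the pre-training is meant to instill.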
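The summary says the SFT data pipeline keeps only reasoning chains that end in the correct answer. A hedged sketch of that rejection-sampling loop, where `run_multi_agent_cot` is a hypothetical stand-in for the article's multi-agent system:

```python
# Sketch of filtering CoT data: generate multi-agent reasoning trajectories
# and keep only those whose final answer matches the gold label.
from typing import Callable

def collect_sft_data(
    questions: list[dict],
    run_multi_agent_cot: Callable[[str], tuple[str, str]],  # hypothetical
    samples_per_question: int = 4,
) -> list[dict]:
    dataset = []
    for item in questions:
        for _ in range(samples_per_question):
            chain, answer = run_multi_agent_cot(item["question"])
            # Only trajectories that reach the correct answer become
            # SFT targets, per the article's data-quality filter.
            if answer.strip().lower() == item["gold"].strip().lower():
                dataset.append({"question": item["question"],
                                "cot": chain,
                                "answer": answer})
                break  # one verified chain per question in this sketch
    return dataset
```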
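For the RL stage's hybrid reward, one plausible composition, shown purely as an assumption-labeled sketch, combines a rule-based format check with a model-based answer score (the summary notes model-based rewards proved more stable). The tags, weights, and `judge_score` interface are all assumptions:

```python
import re
from typing import Callable

def hybrid_reward(
    response: str,
    gold_answer: str,
    judge_score: Callable[[str, str], float],  # model-based scorer in [0, 1]
    w_format: float = 0.2,                     # illustrative weight
    w_answer: float = 0.8,                     # illustrative weight
) -> float:
    """Hedged sketch of a hybrid reward: format compliance + answer quality.

    The exact reward composition in MaskSearch is not given in this summary;
    the tags, weights, and scorer interface below are assumptions.
    """
    # Rule-based part: did the rollout follow the expected trace structure?
    has_think = bool(re.search(r"<think>.*?</think>", response, re.S))
    has_answer = bool(re.search(r"<answer>.*?</answer>", response, re.S))
    format_r = 1.0 if (has_think and has_answer) else 0.0

    # Model-based part: a reward model / judge scores the final answer,
    # the variant the article reports as more stable than rule-based rewards.
    m = re.search(r"<answer>(.*?)</answer>", response, re.S)
    answer_r = judge_score(m.group(1).strip(), gold_answer) if m else 0.0

    return w_format * format_r + w_answer * answer_r
```

A dynamic-sampling wrapper would sit outside this function, re-drawing rollouts that carry no learning signal; the summary does not detail that mechanism, so it is omitted here.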
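Curriculum learning by mask count can be as simple as staging training data by how many spans are masked. A minimal sketch, reusing the hypothetical `difficulty` field from the RAMP example above; the real schedule (stage sizes, mixing ratios, promotion criteria) is not specified in the summary:

```python
from collections import defaultdict

def curriculum_stages(dataset: list[dict], max_masks: int = 4):
    """Yield training stages of increasing difficulty (mask count). Sketch only."""
    by_difficulty = defaultdict(list)
    for ex in dataset:
        by_difficulty[min(ex["difficulty"], max_masks)].append(ex)
    for k in sorted(by_difficulty):
        yield k, by_difficulty[k]  # train on easy samples (few masks) first

# Usage sketch:
# for n_masks, batch in curriculum_stages(ramp_dataset):
#     train_one_stage(model, batch)
```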
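Group 4's perplexity-based masking strategy plausibly means masking the spans a language model finds hardest to predict. A rough sketch using Hugging Face transformers, with GPT-2 as an illustrative scorer (the article does not say which model or exact scoring rule is used):

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative scorer; the article does not specify the perplexity model.
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def span_ppl(text: str, span: str) -> float:
    """Perplexity of `span` given its left context within `text`.

    Approximate: tokenizing context and span separately can differ slightly
    from tokenizing the full passage, which is acceptable for ranking spans.
    """
    start = text.index(span)
    ctx_ids = tok(text[:start], return_tensors="pt").input_ids
    span_ids = tok(span, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, span_ids], dim=1)
    logits = lm(ids).logits[:, :-1]        # next-token predictions
    targets = ids[:, 1:]
    losses = torch.nn.functional.cross_entropy(
        logits.squeeze(0), targets.squeeze(0), reduction="none")
    span_losses = losses[-span_ids.shape[1]:]  # losses on span tokens only
    return math.exp(span_losses.mean().item())

def pick_masks(text: str, candidates: list[str], k: int = 2) -> list[str]:
    # Mask the spans the LM finds hardest to predict (highest perplexity),
    # which the article reports can raise recall relative to naive masking.
    return sorted(candidates, key=lambda s: -span_ppl(text, s))[:k]
```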