AttentionInfluence
ByteDance's latest large-model recipe: train only on data with reasoning potential! A 1.3B model selects it automatically, no labels needed
量子位· 2025-05-15 06:26
Core Viewpoint
- The ByteDance Seed team has introduced AttentionInfluence, a method that selects training data which enhances the reasoning capabilities of pre-trained language models, without manual labeling or training a separate classifier [1][2].

Group 1: Methodology
- Traditional data-selection methods rely on supervised classifiers, which can introduce domain-specific biases [3].
- AttentionInfluence instead leverages the retrieval heads of pre-trained models, which are closely tied to retrieval and in-context reasoning [4][5].
- The procedure identifies the important retrieval heads, masks them to create a "weak" model, and ranks data by the loss difference between the weak model and the original "strong" model; a minimal code sketch of this scoring step follows at the end of this summary [6][13].

Group 2: Experimental Results
- Applying AttentionInfluence to a 1.3B-parameter pre-trained language model selected 73.1 billion tokens from the SmolLM corpus, which were then used to pre-train a 7B model [7][27].
- The 7B model improved across knowledge-intensive and reasoning-intensive benchmarks, with gains such as MMLU +1.4%, MMLU-Pro +2.7%, and HumanEval +3.5% [8][30].

Group 3: Performance Analysis
- The AttentionInfluence model consistently outperformed the baseline throughout pre-training, with the advantage already visible early in training [29][30].
- The selected data often improved the 7B model precisely on the tasks associated with the specific important heads that were masked, indicating the method's predictive power [30].

Group 4: Quality Assessment
- The study introduced two metrics to quantify the quality of selected data, showing that AttentionInfluence achieved significantly higher reasoning scores than the FineWeb-Edu classifier [33].
- In certain domains, the average length of samples selected by AttentionInfluence was nearly double that of the FineWeb-Edu classifier, suggesting it favors longer, more complete documents [34].

Group 5: Conclusion
- The results validate that AttentionInfluence effectively identifies high-quality pre-training data, strengthening the knowledge and reasoning capabilities of large language models, particularly on benchmarks that require complex reasoning [38].
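As a rough illustration of the masking-and-scoring step described in Group 1, the sketch below zeroes out the contribution of chosen attention heads in a HuggingFace causal LM (turning it into the "weak" model) and scores a sample by the relative loss increase versus the original model. The checkpoint path, the `important_heads` set, the LLaMA-style module path `model.model.layers[i].self_attn.o_proj`, and the exact score formula are assumptions made for illustration, not the team's released implementation.

```python
# Minimal sketch of AttentionInfluence-style data scoring (assumptions noted above).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/1.3b-base-model"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# (layer, head) pairs previously identified as important retrieval heads
# (illustrative values; the identification step itself is not shown here).
important_heads = {(10, 3), (14, 7), (18, 1)}


def make_head_mask_hook(head_indices, head_dim):
    """Forward pre-hook on o_proj that zeroes the slices belonging to the
    masked heads, removing their contribution and yielding the 'weak' model."""
    def hook(module, inputs):
        hidden = inputs[0].clone()  # shape: [batch, seq, num_heads * head_dim]
        for h in head_indices:
            hidden[..., h * head_dim:(h + 1) * head_dim] = 0.0
        return (hidden,)
    return hook


def register_weak_model_hooks(model, heads_to_mask):
    """Attach masking hooks; assumes a LLaMA-style module layout."""
    cfg = model.config
    head_dim = cfg.hidden_size // cfg.num_attention_heads
    by_layer = {}
    for layer, head in heads_to_mask:
        by_layer.setdefault(layer, []).append(head)
    handles = []
    for layer, heads in by_layer.items():
        o_proj = model.model.layers[layer].self_attn.o_proj
        handles.append(o_proj.register_forward_pre_hook(make_head_mask_hook(heads, head_dim)))
    return handles


@torch.no_grad()
def lm_loss(model, text):
    """Standard next-token cross-entropy loss on one sample."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
    out = model(**enc, labels=enc["input_ids"])
    return out.loss.item()


@torch.no_grad()
def attention_influence_score(text):
    """Relative loss increase when the important heads are masked (assumed form).
    A larger score suggests the sample relies more on those heads' capability."""
    base_loss = lm_loss(model, text)
    handles = register_weak_model_hooks(model, important_heads)
    weak_loss = lm_loss(model, text)
    for h in handles:
        h.remove()  # restore the original "strong" model
    return (weak_loss - base_loss) / base_loss
```

Under these assumptions, one would compute `attention_influence_score` for each candidate document and keep the top-ranked samples, on the reasoning that data whose loss rises most when the retrieval heads are disabled is the data that exercises those heads' retrieval and reasoning behavior.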