Data Selection
How can large models be fed precisely without massive data? SJTU's Data Whisperer: a training-free data selection method where 10% of the data approaches full-dataset performance
机器之心· 2025-07-29 06:38
Core Viewpoint
- The article introduces "Data Whisperer," a training-free framework for efficient data selection when fine-tuning large language models (LLMs), achieving near-full-dataset performance with only 10% of the data [2][4][36].

Group 1: Methodology and Mechanism
- Data Whisperer uses the in-context learning (ICL) ability of the pre-trained model itself to pick out "golden training samples," with no separate scoring model required [2][6].
- Scoring is based on the model's own outputs and attention weights, which keeps the selection process stable and well-grounded [10][12]; a sketch of this selection loop appears after this summary.
- It introduces a new efficiency metric, the Selection-to-Tuning Ratio (STR), under which Data Whisperer is substantially more time-efficient than traditional methods [17][18]; see the note after the sketch below.

Group 2: Performance Metrics
- On the GSM8K dataset, Data Whisperer reached 72.46% accuracy using only 10% of the data, exceeding the 71.39% obtained with the full dataset [19].
- It also outperformed existing state-of-the-art methods on the DialogSum and BioInstruct tasks by notable margins [19][21].

Group 3: Robustness and Adaptability
- Data Whisperer is robust to input scale: there are identifiable optimal settings for the numbers of demonstration and query samples, indicating that it selects core samples rather than relying on sheer volume [26][28].
- It supports a weak-to-strong scheme in which a smaller model selects data for a larger one, cutting computational cost while preserving performance [22][24].

Group 4: Comparative Analysis
- Data Whisperer outperforms all mainstream data selection methods on accuracy, efficiency, and stability, especially under low selection budgets [35].
- Its theoretical grounding is the established connection between ICL and fine-tuning, which lets it preview a sample's training utility without updating any model parameters [36][37].

Group 5: Future Directions
- Future work includes extending the method to complex structured tasks in domains such as law and medicine, strengthening task alignment, and incorporating human feedback [41][42].
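Group 1 describes the mechanism only at a high level. Below is a minimal sketch of such a training-free selection loop, assuming a caller-supplied `icl_score_fn` that builds a few-shot prompt from the demonstrations, runs the frozen model on the query batch, and returns a task score. The uniform credit assignment is a simplification (the article notes the paper also weights samples by the model's attention, which this sketch omits), and all names and defaults here are illustrative.

```python
import random
from collections import defaultdict

def data_whisperer_select(candidates, queries, icl_score_fn,
                          n_rounds=200, n_demos=5, n_queries=8, budget=0.10):
    """Training-free selection loop in the spirit of Data Whisperer:
    a frozen pre-trained model scores random few-shot demonstration
    sets by its own in-context performance on query samples, and each
    candidate inherits the average score of the rounds it joined."""
    totals, counts = defaultdict(float), defaultdict(int)
    for _ in range(n_rounds):
        demo_idx = random.sample(range(len(candidates)), n_demos)
        query_batch = random.sample(queries, n_queries)
        demos = [candidates[i] for i in demo_idx]
        # icl_score_fn prompts the frozen model few-shot with `demos`
        # and returns e.g. answer accuracy on `query_batch`.
        score = icl_score_fn(demos, query_batch)
        for i in demo_idx:
            totals[i] += score
            counts[i] += 1
    # Candidates never drawn simply stay unranked; enough rounds
    # keep coverage high at a fixed, tuning-free cost.
    avg = {i: totals[i] / counts[i] for i in counts}
    ranked = sorted(avg, key=avg.get, reverse=True)
    return [candidates[i] for i in ranked[: int(budget * len(candidates))]]
```

No parameters are updated anywhere in this loop, which is what makes the selection overhead small relative to the fine-tuning it serves.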
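The summary does not spell out the STR formula. Assuming it is defined as the ratio of time spent selecting data to time spent fine-tuning on the selected subset, a one-line helper makes the comparison concrete; the function name and example timings are illustrative only.

```python
def selection_to_tuning_ratio(selection_seconds: float, tuning_seconds: float) -> float:
    """STR = data-selection time / fine-tuning time; lower means the
    selector adds less overhead relative to the tuning it enables."""
    return selection_seconds / tuning_seconds

# Illustrative comparison: a selector needing a full scoring pass
# vs. a training-free ICL pass over the same candidate pool.
print(selection_to_tuning_ratio(5400.0, 3600.0))  # 1.5 -> selection costs more than tuning
print(selection_to_tuning_ratio(360.0, 3600.0))   # 0.1 -> selection is a 10% overhead
```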
ByteDance's latest large-model recipe: train only on data with reasoning potential! A 1.3B model selects it automatically, no labels needed
量子位· 2025-05-15 06:26
Core Viewpoint
- The ByteDance Seed team has introduced AttentionInfluence, a method that selects pre-training data likely to strengthen reasoning in language models, with no manual labeling or extra training required [1][2].

Group 1: Methodology
- Traditional data selection relies on supervised classifiers, which can introduce domain-specific biases [3].
- AttentionInfluence instead exploits the retrieval heads of pre-trained models, which are closely tied to retrieval and in-context reasoning [4][5].
- The procedure identifies the important retrieval heads, masks them to obtain a deliberately weakened model, and ranks each document by the loss gap between the weak and the intact model [6][13]; a sketch of this scoring step appears after this summary.

Group 2: Experimental Results
- Applying AttentionInfluence with a 1.3B-parameter pre-trained language model selected 73.1 billion tokens from the SmolLM corpus, which were then used to pre-train a 7B model [7][27].
- The resulting model improved across knowledge-intensive and reasoning-intensive benchmarks, with gains including MMLU +1.4%, MMLU-Pro +2.7%, and HumanEval +3.5% [8][30].

Group 3: Performance Analysis
- The AttentionInfluence model consistently led baseline models throughout pre-training, with the advantage visible early in training [29][30].
- The selected data often improved the 7B model precisely on tasks tied to the masked important heads, indicating the method's predictive power [30].

Group 4: Quality Assessment
- The study introduced two metrics to quantify the quality of the selected data, under which AttentionInfluence achieved markedly higher reasoning scores than the FineWeb-Edu classifier [33].
- In some domains, the average length of samples selected by AttentionInfluence was nearly double that of the FineWeb-Edu classifier, suggesting a more comprehensive selection [34].

Group 5: Conclusion
- The results confirm that AttentionInfluence identifies high-quality pre-training data and strengthens the knowledge and reasoning abilities of large language models, particularly on benchmarks requiring complex reasoning [38].
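A minimal sketch of the weak-vs-strong scoring step from Group 1, assuming LLaMA-style HuggingFace modules and an already-identified list of (layer, head) retrieval heads. The `mask_heads` helper and the relative-loss score are illustrative choices for exposition, not the paper's exact implementation.

```python
from contextlib import contextmanager
import torch

@contextmanager
def mask_heads(model, heads, head_dim):
    """Temporarily zero the o_proj weight columns carrying the given
    (layer, head) outputs, removing those heads' contribution.
    Assumes LLaMA-style HuggingFace modules; illustrative only."""
    saved = []
    with torch.no_grad():
        for layer, head in heads:
            w = model.model.layers[layer].self_attn.o_proj.weight
            cols = slice(head * head_dim, (head + 1) * head_dim)
            saved.append((w, cols, w[:, cols].clone()))
            w[:, cols] = 0.0
    try:
        yield
    finally:
        with torch.no_grad():
            for w, cols, old in saved:
                w[:, cols] = old

@torch.no_grad()
def attention_influence_scores(model, tokenizer, docs, retrieval_heads, head_dim):
    """Rank documents by how much masking the retrieval heads hurts
    the loss: a large relative gap suggests the document exercises
    the retrieval/reasoning behavior those heads support."""
    scores = []
    for text in docs:
        batch = tokenizer(text, return_tensors="pt", truncation=True).to(model.device)
        base_loss = model(**batch, labels=batch["input_ids"]).loss.item()
        with mask_heads(model, retrieval_heads, head_dim):
            weak_loss = model(**batch, labels=batch["input_ids"]).loss.item()
        scores.append((weak_loss - base_loss) / base_loss)
    return scores  # sort docs by score, descending, and keep the top slice
```

In the article's setting the scoring model is small (1.3B), so this pass stays cheap, and the top-ranked tokens then feed the pre-training of a larger 7B model.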