Data Selection
Breaking the "data brute-force" pre-training habit: Alibaba Qwen, Shanghai Jiao Tong University, and others propose OPUS, a dynamic data selection paradigm for pre-training
机器之心· 2026-03-16 08:34
Core Viewpoint
- The article challenges the conventional belief that ever-higher-quality data is essential for training large models, presenting evidence that dynamically selecting from medium- and low-quality data can outperform traditional high-quality-data approaches [2][3][5].

Group 1: Data Selection and Optimization
- The study introduces OPUS (Optimizer-induced Projected Utility Selection), which aligns data selection with the actual update direction determined by modern optimizers such as AdamW and Muon, rather than relying on raw-gradient-based criteria [8][9][11].
- OPUS defines sample utility in the effective update space induced by the optimizer, yielding a more principled selection criterion that maximizes the utility of each update step [9][14].
- The method addresses the misalignment gap that arises when original gradients are used for data selection, emphasizing that data selection must be optimizer-dependent [5][10].

Group 2: Methodology and Implementation
- OPUS follows a three-step process: aligning targets with a Bench-Proxy pool, efficiently estimating candidate sample utility, and stabilizing selection through redundancy penalties and Boltzmann sampling [11][17][20].
- The computational overhead of OPUS is approximately 4.7%, making it practical for large-scale pre-training [21][20].
- The method complements existing data engineering techniques: static filtering first removes low-value samples, and OPUS then dynamically selects from the remaining candidates [35][36].

Group 3: Experimental Results
- In experiments, OPUS delivered an average accuracy improvement of 2.2% in FineWeb pre-training and an 8× efficiency gain in the GPT-XL setting [22][23].
- OPUS surpassed models trained on higher-quality data, achieving a 3.18% accuracy increase while using only medium-quality data [26].
- The method consistently achieved the lowest average perplexity across various domains, indicating its effectiveness in enhancing general language-modeling capability [29][30].

Group 4: Implications and Future Directions
- OPUS shifts the focus of pre-training from merely accumulating data to maximizing the efficiency of each update, suggesting a new paradigm in model training [34][37].
- The approach highlights the importance of selecting the right samples at the right time, potentially yielding better model performance with fewer resources [26][35].
- As the industry approaches a "data wall," OPUS offers a clear path to maximizing the marginal gain from each token, emphasizing precision in data utilization [5][36].
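The stabilized selection step summarized above (utility estimation, redundancy penalties, Boltzmann sampling) can be sketched as follows. This is a minimal illustration, not OPUS's actual formulation: the function name, the precomputed pairwise similarity matrix, and the subtractive penalty form are all assumptions made for the sketch.

```python
import math
import random

def opus_style_select(utilities, similarity, k, temperature=1.0, penalty=0.5, seed=0):
    """Pick k candidate indices by temperature-controlled Boltzmann sampling
    over utility scores, discounting samples similar to those already chosen.

    utilities:  per-sample utility scores (e.g. projected-utility estimates)
    similarity: similarity[i][j] in [0, 1] between samples i and j
    """
    rng = random.Random(seed)
    u = list(utilities)              # working copy; penalized as we pick
    remaining = set(range(len(u)))
    chosen = []
    for _ in range(k):
        idx = sorted(remaining)
        # Boltzmann (softmax) weights over current utilities, shifted for stability
        m = max(u[i] for i in idx)
        w = [math.exp((u[i] - m) / temperature) for i in idx]
        r = rng.random() * sum(w)
        acc = 0.0
        for i, wi in zip(idx, w):
            acc += wi
            if r <= acc:
                pick = i
                break
        chosen.append(pick)
        remaining.remove(pick)
        # redundancy penalty: down-weight candidates similar to the pick
        for j in remaining:
            u[j] -= penalty * similarity[pick][j]
    return chosen
```

The temperature trades off exploitation (low values concentrate on top-utility samples) against diversity, while the penalty prevents the batch from filling up with near-duplicates.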
Feeding large models precisely without massive data: SJTU's Data Whisperer, a training-free data selection method that approaches full-dataset performance with 10% of the data
机器之心· 2025-07-29 06:38
Core Viewpoint
- The article introduces "Data Whisperer," a novel framework for efficient data selection when fine-tuning large language models (LLMs) that requires no additional training, achieving near-optimal performance with only 10% of the data compared to full datasets [2][4][36].

Group 1: Methodology and Mechanism
- Data Whisperer exploits the in-context learning (ICL) capability of pre-trained models to select "golden training samples" without requiring a separate scoring model [2][6].
- The framework scores candidates using the model's own outputs and attention weights, yielding a stable and well-grounded selection process [10][12].
- It introduces a new efficiency metric, the Selection-to-Tuning Ratio (STR), under which Data Whisperer significantly outperforms traditional methods in time efficiency [17][18].

Group 2: Performance Metrics
- Across tasks, Data Whisperer achieved strong results, such as 72.46% accuracy on the GSM8K dataset using only 10% of the data, surpassing the full-dataset performance of 71.39% [19].
- The framework also performed strongly on the DialogSum and BioInstruct tasks, with notable improvements over existing state-of-the-art methods [19][21].

Group 3: Robustness and Adaptability
- Data Whisperer is robust to input scale, with optimal configurations identified for the number of demonstration and query samples, indicating that it selects core samples rather than relying on sheer volume [26][28].
- The framework supports a weak-to-strong mechanism, allowing smaller models to select data for larger models, reducing computational cost while maintaining performance [22][24].

Group 4: Comparative Analysis
- Data Whisperer outperforms all mainstream data selection methods in accuracy, efficiency, and stability, particularly in low-budget scenarios [35].
- The framework's theoretical foundation rests on the relationship between ICL and fine-tuning, allowing it to estimate a sample's training value without adjusting model parameters [36][37].

Group 5: Future Directions
- Potential future directions include applying the method to complex structured tasks in fields such as law and medicine, strengthening task-alignment capabilities, and integrating human feedback [41][42].
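The training-free scoring idea in Group 1, where demonstrations are credited via the model's answer confidence and attention weights, can be sketched roughly as below. The aggregation rule and both callables are illustrative assumptions; in practice they would wrap a real LLM's ICL forward pass rather than the paper's exact scoring formula.

```python
def score_demonstrations(demo_ids, query_ids, answer_conf, attn_weight):
    """Training-free, ICL-based scoring sketch.

    demo_ids:    candidate training samples placed in the prompt as demonstrations
    query_ids:   held-out samples the model answers in context
    answer_conf: answer_conf(demo_ids, q) -> model's confidence on query q, in [0, 1]
    attn_weight: attn_weight(demo_ids, q, d) -> attention mass from query q to demo d

    Returns a dict mapping each demonstration to an aggregated score; the
    top-scoring demonstrations form the selected fine-tuning subset.
    """
    scores = {d: 0.0 for d in demo_ids}
    for q in query_ids:
        conf = answer_conf(demo_ids, q)          # how well the model answers q
        for d in demo_ids:
            # credit each demonstration in proportion to attention it received
            scores[d] += conf * attn_weight(demo_ids, q, d)
    n = len(query_ids)
    return {d: s / n for d, s in scores.items()}
```

Because scoring uses only forward passes (no gradient updates), the selection cost stays far below the fine-tuning cost, which is what the STR metric is designed to capture.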
ByteDance's latest large-model recipe: train only on data with reasoning potential! A 1.3B model selects it automatically, no labels needed
量子位· 2025-05-15 06:26
Core Viewpoint
- The ByteSeed team has introduced AttentionInfluence, a method that selects training data to enhance reasoning capability in pre-trained language models without manual labeling or additional training [1][2].

Group 1: Methodology
- Traditional data selection methods rely on supervised classifiers, which can introduce domain-specific biases [3].
- AttentionInfluence leverages the retrieval heads of pre-trained models, which are closely tied to retrieval and contextual reasoning [4][5].
- The process identifies the important retrieval heads, masks them to create a "weak" model, and ranks data by the loss difference between the weak model and the intact "strong" model [6][13].

Group 2: Experimental Results
- Applying AttentionInfluence to a 1.3B-parameter pre-trained language model selected 73.1 billion tokens from the SmolLM corpus, which were then used to pre-train a 7B model [7][27].
- The model improved across knowledge-intensive and reasoning-intensive benchmarks, with gains such as MMLU +1.4%, MMLU-Pro +2.7%, and HumanEval +3.5% [8][30].

Group 3: Performance Analysis
- The AttentionInfluence model consistently outperformed baseline models throughout pre-training, with its advantage evident early on [29][30].
- The selected data often improved the 7B model's performance on the very tasks whose important heads had been masked, indicating the method's predictive power [30].

Group 4: Quality Assessment
- The study introduced two metrics to quantify the quality of selected data, showing that AttentionInfluence achieved significantly higher reasoning scores than the FineWeb-Edu classifier [33].
- The average length of samples selected by AttentionInfluence was nearly double that of the FineWeb-Edu classifier's selections in certain domains, indicating a more comprehensive selection [34].

Group 5: Conclusion
- The results validate that AttentionInfluence effectively identifies high-quality pre-training data, enhancing the knowledge and reasoning capabilities of large language models, particularly on benchmarks requiring complex reasoning [38].
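The mask-then-rank procedure in Group 1 reduces to a simple loss-difference ranking, sketched below. The two loss callables are assumptions for the sketch; in practice they would wrap the same model evaluated with and without its retrieval heads masked.

```python
def attention_influence_rank(samples, strong_loss, weak_loss):
    """Rank samples by an AttentionInfluence-style signal.

    strong_loss: strong_loss(s) -> language-modeling loss of the intact model on s
    weak_loss:   weak_loss(s)   -> loss of the same model with retrieval heads masked

    Samples whose loss rises most when retrieval heads are removed are assumed
    to depend on those heads (i.e., to exercise retrieval/contextual reasoning)
    and are ranked first; the top fraction becomes the selected corpus.
    """
    scored = [(weak_loss(s) - strong_loss(s), s) for s in samples]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [s for _, s in scored]
```

Note that the signal is entirely self-supervised: no labels or trained classifier are needed, only two forward passes per candidate sample.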