Workflow
Data Selection
How do you feed a large model precisely without massive data? SJTU's Data Whisperer: a training-free data-selection method where 10% of the data approaches full-dataset performance
机器之心· 2025-07-29 06:38
Core Viewpoint
- The article introduces "Data Whisperer," a training-free framework for efficient data selection when fine-tuning large language models (LLMs), achieving near-full-dataset performance with only 10% of the data [2][4][36].

Group 1: Methodology and Mechanism
- Data Whisperer uses the in-context learning (ICL) capability of the pre-trained model itself to pick out "golden" training samples, with no separate scoring model required [2][6].
- Samples are scored from the model's own outputs and attention weights, keeping the selection process stable and well grounded [10][12] (a minimal sketch of such a selection loop follows this summary).
- A new efficiency metric, the Selection-to-Tuning Ratio (STR), shows that Data Whisperer is substantially more time-efficient than traditional selection methods [17][18].

Group 2: Performance Metrics
- On GSM8K, Data Whisperer reaches 72.46% accuracy using only 10% of the data, surpassing the 71.39% obtained with the full dataset [19].
- It also outperforms existing state-of-the-art selection methods on the DialogSum and BioInstruct tasks [19][21].

Group 3: Robustness and Adaptability
- Data Whisperer is robust to input scale: the optimal settings identified for the number of demonstration and query samples indicate that it selects core samples rather than relying on sheer volume [26][28].
- It supports a weak-to-strong mechanism in which a smaller model selects data for a larger one, reducing computational cost while maintaining performance [22][24].

Group 4: Comparative Analysis
- Data Whisperer outperforms all mainstream data-selection methods in accuracy, efficiency, and stability, particularly in low-budget scenarios [35].
- Its theoretical foundation is the close relationship between ICL and fine-tuning, which lets in-context evaluation anticipate fine-tuning utility without updating any model parameters [36][37].

Group 5: Future Directions
- Potential extensions include complex structured tasks in fields such as law and medicine, stronger task-alignment capabilities, and the integration of human feedback [41][42].
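Below is a minimal sketch of what such an ICL-based, training-free selection loop could look like, under stated assumptions: the caller supplies two callables, `icl_answer` (runs the frozen model with the sampled demonstrations in context) and `score_answer` (compares a prediction to the reference). These names, the uniform credit assignment across demonstrations, and the omission of the attention-based weighting are illustrative simplifications, not the authors' exact implementation.

```python
# Sketch of ICL-based, training-free data selection in the spirit of Data Whisperer.
# `icl_answer` and `score_answer` are assumed caller-provided callables; the uniform
# credit assignment below is a simplification of the paper's attention-weighted scoring.
import random
from collections import defaultdict
from typing import Callable, Dict, List


def select_subset(
    pool: List[Dict],                               # each item: {"question": ..., "answer": ...}
    icl_answer: Callable[[List[Dict], Dict], str],  # frozen model answering a query given demos
    score_answer: Callable[[str, str], float],      # similarity of prediction vs. reference
    k: int,
    num_rounds: int = 200,
    n_demos: int = 5,
    n_queries: int = 3,
    seed: int = 0,
) -> List[Dict]:
    """Rank pool items by how useful they are as in-context demonstrations; return top-k."""
    rng = random.Random(seed)
    credit = defaultdict(list)
    for _ in range(num_rounds):
        demos = rng.sample(pool, n_demos)      # candidate "teacher" samples for this round
        queries = rng.sample(pool, n_queries)  # probes used to judge the demos
        # Round quality: how well the frozen model answers the queries
        # when conditioned on these demonstrations.
        round_score = sum(
            score_answer(icl_answer(demos, q), q["answer"]) for q in queries
        ) / n_queries
        # Spread the round score uniformly over its demonstrations.
        for d in demos:
            credit[id(d)].append(round_score)

    def mean_credit(d):
        s = credit.get(id(d), [])
        return sum(s) / len(s) if s else 0.0

    return sorted(pool, key=mean_credit, reverse=True)[:k]
```

In practice, `icl_answer` would wrap a generation call on the pre-trained model with the demonstrations concatenated into the prompt, and `score_answer` would be an exact-match or ROUGE-style metric depending on the task.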
ByteDance's latest large-model recipe: train only on data with reasoning potential! A 1.3B model selects it automatically, no labels needed
量子位· 2025-05-15 06:26
Xifeng, from Aofeisi | QbitAI (WeChat official account: QbitAI)

Say goodbye to hand-labeled data: the attention mechanism inside a pre-trained language model is enough to pick out training data that can elicit reasoning ability! ByteDance's Seed team has announced a new result, AttentionInfluence. No training and no labels are needed: simply using a 1.3B model to select data for a 7B model improves the larger model's reasoning ability, and even its code generation.

Previously, data-filtering methods usually relied on supervised classifiers that required annotation by humans or large language models, inevitably introducing domain-specific bias. The ByteDance Seed team noticed that retrieval heads in pre-trained models are closely tied to retrieval and in-context reasoning. Retrieval heads emerge early in training, strengthen gradually, and are firmly established by the mid-to-late stages of training, playing a crucial role in model performance.

[Figure: evolution of retrieval heads in the 1.3B-parameter dense model.]

But what happens if those heads are switched off outright? The team uses a small pre-trained language model, with a simple attention-head masking operation, as a data selector for a stronger model. Concretely: identify the important retrieval heads, mask them to create a degraded "weak" model, compute the loss difference between the "weak" model and the original "strong" model on each sample, and rank the data by how much the loss increases, yielding an influence score (a sketch of this scoring loop is given below).

Unexpectedly, the experiments produced a striking result. Applying Attent ...
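The weak-vs-strong loss-gap scoring described above can be sketched as follows. This is a hedged illustration, not ByteDance's released code: it assumes a LLaMA-style HuggingFace causal LM (so a head's contribution can be zeroed at the input of each layer's `o_proj`), and it takes the list of `(layer, head)` retrieval heads as given, since identifying them (via retrieval-score probing) is a separate step.

```python
# Sketch of AttentionInfluence-style scoring: mask assumed retrieval heads to get a
# "weak" model, then rank samples by the language-modeling loss increase versus the
# intact "strong" model. Module paths assume a LLaMA-style AutoModelForCausalLM.
import torch


def _mask_heads(model, heads, head_dim):
    """Zero the per-head input slice of o_proj for each (layer, head) pair; return hook handles."""
    handles = []
    for layer_idx, head_idx in heads:
        o_proj = model.model.layers[layer_idx].self_attn.o_proj  # assumed LLaMA-style layout
        lo, hi = head_idx * head_dim, (head_idx + 1) * head_dim

        def pre_hook(module, args, lo=lo, hi=hi):
            hidden = args[0].clone()
            hidden[..., lo:hi] = 0.0  # drop this head's contribution before the output projection
            return (hidden,) + args[1:]

        handles.append(o_proj.register_forward_pre_hook(pre_hook))
    return handles


@torch.no_grad()
def influence_scores(model, tokenizer, texts, retrieval_heads, device="cuda"):
    """Per-sample influence score = loss(weak, heads masked) - loss(strong, intact)."""
    model.eval().to(device)
    head_dim = model.config.hidden_size // model.config.num_attention_heads

    def lm_loss(text):
        ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids.to(device)
        return model(input_ids=ids, labels=ids).loss.item()  # mean next-token loss

    strong = [lm_loss(t) for t in texts]          # original "strong" model
    handles = _mask_heads(model, retrieval_heads, head_dim)
    try:
        weak = [lm_loss(t) for t in texts]        # degraded "weak" model
    finally:
        for h in handles:
            h.remove()                            # restore the original model
    # Larger loss increase => the sample leans more heavily on the retrieval heads.
    return [w - s for w, s in zip(weak, strong)]
```

The returned scores would then be sorted in descending order to pick the samples whose loss rises most when the retrieval heads are disabled, i.e. the data judged most likely to exercise retrieval and reasoning behavior.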