RefineX
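Scalpel-style denoising pushes past the LLM capability ceiling: models pre-trained from scratch gain 7.2% on average across downstream tasks | Chinese Academy of Sciences & Alibaba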
量子位· 2025-07-21 04:23
Core Viewpoint
- The article presents RefineX, a framework developed by the Institute of Computing Technology, Chinese Academy of Sciences, and Alibaba Qwen for efficiently refining large-scale pre-training data through programmatic editing tasks. It targets the noise that degrades data quality [1][2].

Group 1: Advantages of RefineX
- RefineX distills high-quality end-to-end refinement results into a simplified, deletion-only program built from editing operations, which makes data refinement more efficient [2][11].
- This high-precision distillation makes it possible to train an efficient and reliable refine model that systematically optimizes every instance in the corpus [3][12].
- While refining data efficiently, RefineX preserves the diversity and naturalness of the original text [4][19].

Group 2: Performance Metrics
- A 750M model trained on 20 billion tokens refined by RefineX achieved an average score of 44.7 across ten tasks, a 7.2% improvement over training on the original data [5][25].
- A model trained on 10 billion refined tokens outperformed models trained on 20 billion tokens of conventionally filtered data, indicating that RefineX reduces training token cost while allowing more diverse text to be considered [25].

Group 3: Data Quality Improvement
- RefineX achieved a 42.2% improvement rate on low-quality content while maintaining a "zero new vocabulary" policy, which removes the risk of hallucinated additions [29].
- The end-to-end approach showed a higher improvement rate but introduced external vocabulary at a rate of 15 new words per thousand tokens, which risks altering the semantics of the text [29].

Group 4: Methodology and Process
- RefineX distills data in two stages: it first performs end-to-end refinement, then compares the refined text with the original to generate more reliable supervision programs [11][16].
- The framework restricts program functions to deletion operations, which protects the original text from excessive modification [19][20].

Group 5: Comparative Analysis
- RefineX consistently achieved the highest average scores across tasks, outperforming both the original data and previously filtered datasets [26].
- The gains hold whether RefineX is applied to the original data or to previously filtered datasets: in both cases, models trained on the refined version performed better [26].
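
The two-stage idea described under Group 4 (refine end to end, then compare against the original and keep only deletions) can be illustrated with a small Python sketch. This is not RefineX's actual implementation: the line-level granularity, the function names `distill_deletion_program`, `apply_deletions`, and `new_words_per_thousand`, the use of Python's `difflib`, and the sample text are all assumptions made for illustration. The vocabulary check also counts words rather than model tokens, which is a simplification of the "new words per thousand tokens" figure cited above.

```python
import difflib
import re


def distill_deletion_program(original_lines, refined_lines):
    """Diff the original against an end-to-end refined version and keep
    only the pure deletions as (start, end) line ranges (hypothetical format)."""
    matcher = difflib.SequenceMatcher(a=original_lines, b=refined_lines, autojunk=False)
    program = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "delete":
            program.append((i1, i2))
        # "replace" and "insert" opcodes would introduce text that is not in
        # the original, so this sketch discards them and keeps those lines as-is.
    return program


def apply_deletions(original_lines, program):
    """Execute a deletion-only program; the output is a subsequence of the input."""
    dropped = {i for start, end in program for i in range(start, end)}
    return [line for i, line in enumerate(original_lines) if i not in dropped]


def new_words_per_thousand(original_lines, output_lines):
    """Words in the output that never occur in the original, per 1,000 output words."""
    vocab = set(re.findall(r"\w+", " ".join(original_lines).lower()))
    words = re.findall(r"\w+", " ".join(output_lines).lower())
    if not words:
        return 0.0
    return 1000.0 * sum(w not in vocab for w in words) / len(words)


# Made-up sample document with navigation and call-to-action noise.
original = [
    "Home | About | Login",
    "RefineX refines pre-training data.",
    "It keeps the original wording intact.",
    "Deletion is the only allowed operation.",
    "Click here to subscribe!",
]
# Made-up end-to-end output: the noise is gone, but one sentence was reworded.
end_to_end = [
    "RefineX refines pre-training data.",
    "It always keeps the original wording fully intact.",
    "Deletion is the only allowed operation.",
]

program = distill_deletion_program(original, end_to_end)
distilled = apply_deletions(original, program)

print("deletion program:", program)
print("distilled text:", distilled)
print("new words / 1k (end-to-end):", new_words_per_thousand(original, end_to_end))
print("new words / 1k (deletion-only):", new_words_per_thousand(original, distilled))
```

In this sketch the reworded sentence is left exactly as it appeared in the original, because only deletions survive the comparison step. That is why the deletion-only output cannot contain a word the source did not already have, while the end-to-end output can.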