New pretraining data selection method boosts data efficiency 10x! Only a fastText scorer is needed | From HKUST and vivo
QbitAI · 2025-05-15 04:26

Core Viewpoint
- vivo has publicly disclosed its self-developed data selection method, PreSelect, a lightweight and efficient approach that reduces computational requirements by a factor of 10 [1][2][3].

Group 1: Methodology and Advantages
- PreSelect introduces the concept of Predictive Strength, which quantifies the contribution of data to specific model capabilities by evaluating the order of loss across different models [3][5].
- The method is designed to be objective, generalizable, lightweight, and capable of fine-grained data selection, allowing for sample-level filtering and targeting specific ability dimensions [5][13].
- Compared to existing state-of-the-art methods, PreSelect demonstrates a solid theoretical foundation, reducing reliance on subjective human rules and enhancing the objectivity and generalizability of the selection process [13][16].

Group 2: Performance and Results
- Experimental results indicate that PreSelect outperforms other data selection methods, achieving an average improvement of 3% over baseline models across 17 downstream tasks [20][22].
- In specific evaluations, models trained with data selected by PreSelect showed significant performance enhancements, with relative improvements of up to 26.67% in certain metrics [23][24].
- The method effectively samples from high-quality content sources, leading to better representation and coverage in the selected datasets, which is crucial for enhancing model performance across various domains [25][26].
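
To make the Predictive Strength idea more concrete, here is a minimal sketch of one way to score a single document from "the order of loss across different models". The summary does not give the exact formula, so the pairwise-agreement definition below is an assumption, and the function name `predictive_strength` and the sample numbers are hypothetical illustrations, not PreSelect's actual implementation.

```python
from itertools import combinations
from typing import Sequence


def predictive_strength(losses: Sequence[float], scores: Sequence[float]) -> float:
    """Assumed definition: the fraction of model pairs in which the model
    with the higher downstream score also has the lower loss on this document.

    losses[i] is model i's loss on the document; scores[i] is model i's
    downstream benchmark score. Returns a value in [0, 1].
    """
    if len(losses) != len(scores):
        raise ValueError("losses and scores must have the same length")

    concordant = 0
    comparable = 0
    for i, j in combinations(range(len(losses)), 2):
        if scores[i] == scores[j]:
            continue  # tied models carry no ordering information
        comparable += 1
        better, worse = (i, j) if scores[i] > scores[j] else (j, i)
        if losses[better] < losses[worse]:
            concordant += 1
    return concordant / comparable if comparable else 0.0


# Hypothetical downstream scores for three models, weakest to strongest.
model_scores = [0.30, 0.42, 0.55]

# Document A: loss falls as models get stronger, so its loss order predicts ability.
doc_a_losses = [1.9, 1.6, 1.2]
# Document B: loss order mostly disagrees with the ability order.
doc_b_losses = [1.1, 1.5, 1.3]

strength_a = predictive_strength(doc_a_losses, model_scores)
strength_b = predictive_strength(doc_b_losses, model_scores)
```

Under this assumed definition, document A scores higher than document B and would be preferred during selection. Scores like these could then be used to label training examples for a lightweight classifier such as the fastText scorer named in the headline; that step is not shown here.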