2024 Large Model Training Data Whitepaper
Alibaba · 2024-05-28 09:55

Industry Overview

- The report emphasizes the importance of high-quality, diverse training data for the success of large language models (LLMs) such as GPT, identifying data quality and scale as the key drivers of model performance [11][12]
- Training data for LLMs is organized around three stages: pre-training, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF), each requiring a different kind of data (the first sketch after this summary illustrates one plausible record shape per stage) [14][16]
- For multimodal models, training data also includes image-text and video-text pairs, enabling models to understand and generate information across multiple modalities [17]

Data Types and Quality

- High-quality data is crucial to model performance, improving accuracy, stability, and generalization [23]
- The report identifies three uncertainties in defining high-quality data: what kind of data is needed, how data forms will evolve, and how different data types are best combined [25][26]
- Training data is assessed along three dimensions: quality, scale, and diversity; the report notes there is no universal standard for high-quality data (the second sketch shows toy proxies for these dimensions) [27][28]

Synthetic Data

- Synthetic data is proposed as a response to the shortage of training data, offering cost efficiency, privacy protection, and the ability to simulate rare events [31][33]
- Synthetic data can be generated by algorithms and mathematical models, either derived from real datasets or created from scratch using existing models or domain knowledge (the third sketch illustrates the from-scratch route) [34][35]
- It is particularly useful for multimodal data generation and domain-specific knowledge creation, improving model performance in specialized fields [37][39]

Data Governance and Compliance

- The report argues that model training does not depend on personal information, and that the use of copyrighted data for training is transformative and therefore falls under fair use [50][51]
- It suggests that data governance should focus on output control and after-the-fact remedies rather than strict pre-use restrictions, leaving more room for data utilization [52]

Government and Social Collaboration

- The report contrasts the data ecosystems of the US and China: the US government promotes open access to public data, while China still faces challenges in data openness and integration [54][55][60]
- In the US, social forces play a significant role in combining public and open data into high-quality training datasets; in China, reliance on overseas datasets and limited access to public data hinder the development of a robust data ecosystem [56][60]

Alibaba's Exploration in LLM Training

- Alibaba combines high-quality Chinese datasets with overseas open-source data, ensuring data compliance while optimizing training-data quality [63]
- The company uses synthetic data in e-commerce scenarios to improve recommendation systems, gaining both performance and privacy protection [64]
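To make the three-stage distinction concrete, the first sketch below shows one plausible record shape per stage. The field names and example strings are assumptions for illustration; the whitepaper does not prescribe a schema.

```python
# Illustrative record shapes for the three training stages named in the
# report. Field names here are assumptions for the sketch, not a standard.

# Pre-training: large volumes of raw text, typically one document per record.
pretraining_record = {
    "text": "Large language models are trained on web-scale corpora...",
    "source": "web_crawl",  # provenance is useful for later filtering
}

# Supervised fine-tuning (SFT): instruction-response pairs written or
# curated by humans.
sft_record = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "High-quality training data improves accuracy and stability.",
    "output": "Good data makes models more accurate and stable.",
}

# RLHF: the same prompt with a preferred and a rejected answer, ranked by
# a human, used to train a reward model.
rlhf_record = {
    "prompt": "Explain why data diversity matters.",
    "chosen": "Diverse data exposes the model to a wider range of patterns.",
    "rejected": "It just does.",
}
```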
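The second sketch gives a rough illustration of measuring a corpus along the report's three dimensions, using toy proxies: token count for scale, exact-duplicate rate as a weak quality signal, and type-token ratio for diversity. These metrics are assumptions for the sketch; production pipelines rely on far richer signals such as perplexity filters, classifier scores, and near-duplicate detection.

```python
from collections import Counter

def corpus_report(docs: list[str]) -> dict:
    """Toy proxies for the three axes: quality, scale, diversity.

    Illustrative only; this is not the whitepaper's methodology.
    """
    tokens = [tok for doc in docs for tok in doc.split()]
    unique_docs = len(set(docs))
    types = Counter(tokens)
    return {
        "scale_tokens": len(tokens),                        # raw corpus size
        "duplicate_doc_rate": 1 - unique_docs / len(docs),  # crude quality signal
        "type_token_ratio": len(types) / max(len(tokens), 1),  # crude diversity
    }

print(corpus_report([
    "data quality drives model performance",
    "data quality drives model performance",  # exact duplicate
    "synthetic data can cover rare events",
]))
```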
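The third sketch illustrates the from-scratch generation route, where domain knowledge is encoded as templates and slot values. The e-commerce categories, intents, and the synth_query helper are all hypothetical, not taken from the whitepaper; no real user data is involved, which is where the privacy benefit comes from.

```python
import random

random.seed(0)  # reproducible sketch

# Domain knowledge encoded as templates and slot values; the categories
# and phrasings here are invented for illustration.
CATEGORIES = ["headphones", "running shoes", "rice cooker"]
INTENTS = {
    "compare": "How does {a} compare with {b} for daily use?",
    "return": "What is the return policy for {a}?",
    "recommend": "Can you recommend a good {a} under {price} yuan?",
}

def synth_query() -> dict:
    """Generate one synthetic e-commerce support query from templates.

    'From-scratch' synthesis driven by domain knowledge, one of the two
    generation routes the report describes.
    """
    intent = random.choice(list(INTENTS))
    a, b = random.sample(CATEGORIES, 2)
    text = INTENTS[intent].format(a=a, b=b, price=random.choice([100, 300, 500]))
    return {"intent": intent, "query": text}

for _ in range(3):
    print(synth_query())
```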