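Zhejiang University Professor Wang Chunhui: High-Quality Datasets Are the Key Foundation for Training, Inference, and Validation of Large AI Models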
Zhong Guo Jing Ying Bao·2025-09-21 14:52

Core Insights
- The current data industry in China is entering a "fast lane" of development, with the value of data as a key production factor becoming increasingly prominent [1][2]
- High-quality datasets are essential for the reliable development of AI models, as low-quality data can lead to misleading outputs known as "hallucinations" [2][3]

Data Quality and AI Models
- The training data for large language models (LLMs) often comes from the internet, resulting in varying quality and leading to outputs based on "probabilistic matching" rather than "factual judgment" [2]
- A study indicates that when training datasets contain only 0.01% false text, harmful content output by the model increases by 11.2%, highlighting the critical issue of insufficient high-quality data supply [2]
- High-quality datasets are categorized into general datasets, industry general datasets, and industry-specific datasets, which are foundational for the application of both general and industry models [2][3]

Industry-Specific Data
- Industry general datasets include knowledge that requires a certain level of professional background to understand, such as healthcare data encompassing personal attributes, health status, and medical application data [3]
- Industry-specific datasets require deeper professional knowledge and are crucial for specific business scenarios, such as medical AI relying on high-quality expert-annotated data [3]

AI and Data Integration
- The trend is shifting towards a data-centric approach in AI development, which does not diminish the value of model-centric AI but rather complements it [3]

Prompt Engineering
- The ability to ask questions and discern answers is emphasized as crucial in the AI era, with the concept of prompt engineering introduced to guide LLMs in generating useful content [4]
- Skilled prompt engineers can enhance AI model efficiency by over 30% in fields like healthcare by designing precise prompts [4]

Policy and Industry Development
- The Chinese government has issued guidelines to strengthen the construction of high-quality AI datasets, emphasizing application-oriented approaches and the development of data processing and service industries [5]
- The shift from "data-entity integration" to "entity-data integration" reflects a focus on promoting high-quality development driven by the needs of the real economy [5]