Once internet data is "exhausted," where will high-quality training data come from? Experts weigh in
Nan Fang Du Shi Bao·2025-07-29 01:53

Group 1
- The 2025 World Artificial Intelligence Conference highlighted a consensus that internet data available for training large models will be "exhausted" around 2026, making the creation of new high-quality datasets necessary [1]
- The data annotation industry is shifting from labor-intensive to knowledge-intensive work, with academic scholars and industry experts increasingly involved in improving data quality [3][4]
- High-quality datasets are identified as a core driver of AI development, and synthetic data is emerging as a potential answer to data shortages, despite inherent issues such as bias and privacy risks [5][6]

Group 2
- The industry recognizes the need for high-quality data from vertical sectors and stresses the importance of forming data "alliances" among industries to share specialized knowledge [5][6]
- Collaboration with academic institutions is encouraged for building high-quality datasets, since academia may be ahead of industry in certain fields [6]
- Specialized companies such as KuPass have been established to address the data governance challenges specific to AI large models, which differ significantly from traditional data governance [6][7]