High-Quality Datasets
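Zhejiang University Professor Wang Chunhui: High-Quality Datasets Are the Key Foundation for Training, Inference, and Validation of Large AI Models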
Zhong Guo Jing Ying Bao· 2025-09-21 14:52
Core Insights
- The current data industry in China is entering a "fast lane" of development, with the value of data as a key production factor becoming increasingly prominent [1][2]
- High-quality datasets are essential for the reliable development of AI models, as low-quality data can lead to misleading outputs known as "hallucinations" [2][3]

Data Quality and AI Models
- The training data for large language models (LLMs) often comes from the internet, resulting in varying quality and leading to outputs based on "probabilistic matching" rather than "factual judgment" [2]
- A study indicates that when training datasets contain only 0.01% false text, harmful content output by the model increases by 11.2%, highlighting the critical issue of insufficient high-quality data supply [2]
- High-quality datasets are categorized into general datasets, industry general datasets, and industry-specific datasets, which are foundational for the application of both general and industry models [2][3]

Industry-Specific Data
- Industry general datasets include knowledge that requires a certain level of professional background to understand, such as healthcare data encompassing personal attributes, health status, and medical application data [3]
- Industry-specific datasets require deeper professional knowledge and are crucial for specific business scenarios, such as medical AI relying on high-quality expert-annotated data [3]

AI and Data Integration
- The trend is shifting towards a data-centric approach in AI development, which does not diminish the value of model-centric AI but rather complements it [3]

Prompt Engineering
- The ability to ask questions and discern answers is emphasized as crucial in the AI era, with the concept of prompt engineering introduced to guide LLMs in generating useful content [4]
- Skilled prompt engineers can enhance AI model efficiency by over 30% in fields like healthcare by designing precise prompts [4]

Policy and Industry Development
- The Chinese government has issued guidelines to strengthen the construction of high-quality AI datasets, emphasizing application-oriented approaches and the development of data processing and service industries [5]
- The shift from "data-entity integration" to "entity-data integration" reflects a focus on promoting high-quality development driven by the needs of the real economy [5]
OpenAI: ChatGPT Revenue Expected to Reach Nearly $10 Billion This Year | Chief News Daily
首席商业评论· 2025-09-07 04:09
Group 1
- Xinba, the founder of XinXuan Group, was reportedly taken away for investigation, but the company denied the claims [2]
- The film "Nanjing Photo Studio" was released in the UK, providing an opportunity for audiences to understand the history of the Nanjing Massacre [3]
- The first AI computing open architecture in China was launched at the World Intelligent Industry Expo, supporting significant AI computing capabilities [4]

Group 2
- During Yi Huiman's tenure as chairman of the China Securities Regulatory Commission, the A-share market breached the 3000-point mark 20 times [5][6]
- OpenAI is expected to generate nearly $10 billion in revenue through ChatGPT this year, with total revenue projected to reach $13 billion [10]
- A Brazilian billionaire has named football star Neymar as the sole heir to his fortune, estimated to exceed $1 billion [12]
Construction List of the First Batch of 85 High-Quality Datasets Released
Zheng Quan Shi Bao Wang· 2025-09-06 02:48
Core Insights
- The 2025 Trusted Data Space High-Quality Data Set Ecological Conference was held in Chongqing, where a construction list of the first batch of 85 high-quality data sets was released [1]
- The conference initiated the pilot projects for high-quality data set construction and the Trusted Data Space National Innovation Development Pilot in Chongqing [1]

Industry Developments
- In the automotive sector, construction of data sets for new energy vehicle power battery safety assessment and intelligent driving algorithm research will be accelerated, aimed at giving the trillion-level industrial cluster a new "data engine" [1]
- In the low-altitude economy sector, construction of data sets such as the Tianmu Constellation Global Atmospheric and Ocean Remote Sensing and Low-altitude Urban Safety Inspection Guardian will be expedited, building spatial perception capabilities to empower efficient, refined, and intelligent urban governance [1]
Trends of the Times: A Qualitative Shift in Data Leads a New Leap in Intelligent Civilization
Zheng Quan Shi Bao· 2025-09-04 21:58
Core Insights
- The construction of high-quality data sets in China has entered a large-scale phase, with a total exceeding 400PB and a cumulative transaction value of nearly 4 billion yuan [1]
- The transformation from "mass" to "quality" data reflects not only technological evolution but also a deeper philosophical shift in digital civilization, emphasizing the importance of quality over quantity [1]
- The emergence of high-quality data sets signifies a fundamental shift in AI development paradigms, moving from a phase of excessive data feeding to one focused on quality and precision [1]

Industry Implications
- High-quality data sets are considered the "cultural gene pool" of the digital age, integrating traditional Chinese cultural values into the data framework, which presents an opportunity for cultural transmission alongside technological advancement [2]
- The risk of creating a digital divide is highlighted, where institutions with access to high-quality data may monopolize AI benefits, potentially leaving data-poor entities behind in the intelligent era [2]
- The need for data classification, security measures, and equitable data policies is emphasized to prevent high-quality data from becoming a systemic risk and to ensure fair access to AI advancements [2]

Future Directions
- Establishing national standards for data quality is essential to provide measurable criteria for "high quality" [3]
- Encouraging cross-domain data integration is necessary to break down "data silos" and enable high-quality data to create multiplier effects [3]
- It is crucial to embed humanistic values in the data intelligence process to prevent AI from becoming merely a utilitarian tool, ensuring that data is not only high-quality but also meaningful and ethical [3]
Trends of the Times | A Qualitative Shift in Data Leads a New Leap in Intelligent Civilization
Zheng Quan Shi Bao· 2025-09-04 18:53
Group 1
- The construction of high-quality data sets in China has entered a large-scale phase, with a total volume exceeding 400PB and a cumulative transaction value of nearly 4 billion yuan [1]
- The transformation from "mass" to "high quality" data reflects not only technological evolution but also a deeper philosophical shift in digital civilization, emphasizing the importance of quality over quantity [1]
- The emergence of high-quality data sets signifies a fundamental shift in AI development paradigms, moving from a phase of resource wastage to one of refined knowledge cultivation [1]

Group 2
- High-quality data sets are described as the "cultural gene pool" of the digital age, integrating traditional Chinese culture and values into the data framework [2]
- The construction of high-quality data sets presents both opportunities for technological advancement and challenges related to the digital divide, where institutions with quality data may monopolize AI benefits [2]
- There is a need for data policies that balance efficiency and fairness to prevent high-quality data from becoming the private property of a few entities [2]

Group 3
- The future requires a "data alchemist" approach to reshape the data value ecosystem, including establishing national standards for data quality and encouraging cross-domain data integration [3]
- It is crucial to embed humanistic values in the data intelligence process to prevent AI from becoming merely a utilitarian tool [3]
- The era of data quality transformation emphasizes the importance of ethical, quality, and temporal considerations in data, ensuring that AI development achieves a balance between quantity and quality [3]
High-Quality Datasets Resonate with AI, Becoming the "Hard Currency" of Data Circulation
Zhong Guo Xin Wen Wang· 2025-09-02 14:32
Group 1
- The core concept of "high-quality data sets" has been introduced to support AI application innovation and the development of new business models such as "data as a service" and "knowledge as a service" [2]
- As of June 2025, over 35,000 high-quality data sets had been built in China, totaling over 400 petabytes (PB), with 3,364 high-quality data sets listed on data trading institutions and a cumulative transaction value of nearly 4 billion yuan [2]
- The relationship between high-quality data sets and AI development is symbiotic, with high-quality data sets becoming essential for AI model training and data circulation [3]

Group 2
- The quality and security of data set construction are critical for the development of AI models, necessitating a robust data security system and the integration of traditional cultural values [3]
- Shenzhen is actively exploring the integration of public and enterprise data resources to support high-quality data applications, achieving positive results in various sectors such as finance and insurance [3]
Jiangsu Releases First Batch of Key-Area Construction Lists for High-Quality Datasets
Xin Hua Ri Bao· 2025-09-01 23:24
Core Insights
- Jiangsu has released a list of key areas for the construction of high-quality datasets, aimed at fostering innovation in artificial intelligence large model technology and enhancing industrial ecosystems [1]

Group 1: Key Areas of Focus
- The first batch of construction lists targets 16 key areas including industrial manufacturing, transportation, healthcare, scientific research, financial services, cultural tourism, urban governance, human resources, green low-carbon initiatives, rural agriculture, smart energy, education, business, emergency management, meteorological services, and public safety [1]
- In addition to the key areas, the list also includes high-quality datasets for general large models, cross-border data, and government services [1]

Group 2: Impact on Society
- The "Health Information Dataset" in the healthcare sector integrates various medical and public health functions, providing support for health analysis, disease monitoring, clinical decision-making, public health emergency response, and medical quality monitoring [1]
- The "Human Resources and Social Security Industry Dataset" consolidates information on individual and corporate social security contributions, vocational qualifications, labor arbitration, and labor inspection, enabling precise public service and credit evaluation [1]
Jiangsu Releases Key-Area Construction List for High-Quality Datasets
Xin Hua Ri Bao· 2025-09-01 22:36
Core Insights
- Jiangsu has released a list of key areas for the construction of high-quality data sets, aimed at fostering innovation in artificial intelligence and enhancing industrial ecosystems [1]
- The first batch of the construction list focuses on 16 key sectors, including industrial manufacturing, transportation, healthcare, scientific research, financial services, cultural tourism, urban governance, human resources, green low-carbon initiatives, rural agriculture, smart energy, education, business, emergency management, meteorological services, and public safety [1]
- Additional areas for high-quality data sets include general large models, cross-border data, and government services [1]

Sector-Specific Summaries
- In the healthcare sector, the "Health Information Data Set" integrates various medical functions, providing support for health analysis, disease monitoring, clinical decision-making, public health emergency response, and medical quality monitoring [1]
- The "Human Resources and Social Security Data Set" compiles information on individual and enterprise social security contributions, vocational qualifications, labor arbitration, and labor inspection, enabling precise public service and credit evaluation [1]
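Special Release Features Data Infrastructure Construction Achievements; Projects Selected as Typical High-Quality Dataset Cases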
Nan Jing Ri Bao· 2025-09-01 02:18
Core Insights
- The 2025 China International Big Data Industry Expo concluded in Guiyang, focusing on themes such as data-driven industrial momentum and new development chapters, featuring 375 participating companies and 26 exchange activities [1]

Group 1: Data Infrastructure and Transactions
- The National Data Bureau announced the successful cross-domain data transaction between Nanjing and Dalian, marking the first instance of credible data flow between cities in China [2]
- This transaction utilized the national data circulation infrastructure, validating the feasibility and reliability of cross-domain data operations [2]

Group 2: High-Quality Data Sets
- The National Data Bureau released the first batch of 104 high-quality data set case studies, with two cases from Nanjing included, highlighting the importance of high-quality data sets for AI model training [4]
- The two selected cases from Nanjing are the "Public Credit Archive High-Quality Data Set" by Nanjing Lais Information Technology Co., Ltd. and the "Intelligent Inspection and Safety Control Data Set for China Huadian Power Generation" by Nanjing Nanzi Information Technology Co., Ltd. [4]

Group 3: Public Credit Archive Data Set
- The "Public Credit Archive High-Quality Data Set" has collected over 80 billion data entries, covering more than 180 million social entities and 800 million individuals, with an annual increase of over 2 billion entries [5]
- This data set is utilized in various sectors, including government services and social governance, enhancing administrative efficiency and reducing market operation costs [5]

Group 4: Intelligent Inspection Data Set
- The "Intelligent Inspection and Safety Control Data Set" addresses challenges in the power generation sector by creating a comprehensive dataset covering various energy types, including wind, solar, hydro, and thermal power [6]
- It establishes a standard system for data collection, annotation, and application, promoting industry development [6]

Group 5: High-Quality Data Set Industry Base
- The "China High-Quality Data Set Industry Base (Nanjing)" was inaugurated, focusing on key technology breakthroughs and standard system construction for high-quality data sets [7]
- The base aims to foster collaboration among industry players and enhance the development of AI data sets, contributing to the innovation and intelligence of various sectors [7]
Four Projects from Our Province Selected as National Typical High-Quality Dataset Cases
Xin Hua Ri Bao· 2025-08-30 23:21
Group 1
- The National Data Bureau officially released the first batch of 104 high-quality data set typical case lists at the China International Big Data Industry Expo held from August 28 to 30 [1]
- Four cases from Jiangsu Province were selected, including China Mobile's "China Mobile R&D Large Model High-Quality Data Set," Nanjing Lais Information Technology's "Public Credit Archive High-Quality Data Set," Nanjing Nanzi Information Technology's "China Huadian Power Generation Intelligent Inspection and Safety Control High-Quality Data Set," and China Energy Conservation Solar Technology's "Integrated Energy High-Quality Data Set Construction" [1]
- The "China Mobile R&D Large Model High-Quality Data Set" has a total data volume exceeding 10TB, covering 8 categories and 17 data sets, which can be reused in various vertical industries such as industrial, financial, and transportation for improving and evaluating data quality of large models [1]

Group 2
- The "Public Credit Archive High-Quality Data Set" has connected with 47 ministries, 31 provincial units, and the Xinjiang Production and Construction Corps, accumulating over 80 billion records by June this year, widely applied in government services and social governance [1]
- The "China Huadian Power Generation Intelligent Inspection and Safety Control High-Quality Data Set" has constructed a visual data set covering all types of power generation including wind, solar, hydro, and thermal [1]
- China Energy Conservation Solar Technology (Zhenjiang) Co., Ltd. provides integrated green and low-carbon operational scenarios and delivery service capabilities through the construction of the integrated energy high-quality data set [1]