High-Quality Datasets
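Co-Creating a New Ecosystem for Natural Resources Data Applications: Forum on High-Quality Dataset Construction and Innovative Applications in the Natural Resources Industry Successfully Held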
Sou Hu Wang· 2025-11-12 07:39
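On November 7, 2025, the Forum on High-Quality Dataset Construction and Innovative Applications in the Natural Resources Industry was held during the 2nd China Surveying, Mapping and Geoinformation Conference in Deqing, Zhejiang. The forum was held under the guidance of the Office of the Leading Group for Cybersecurity and Informatization of the Ministry of Natural Resources and the ministry's Department of Science and Technology Development. It was jointly organized by the Information Center of the Ministry of Natural Resources, the National Data Development Research Institute, the Ministry of Natural Resources Engineering Technology Innovation Center for Territorial Spatial Big Data, the Smart Land Working Committee and the Natural Resources Information Working Committee of the China Association for Geospatial Industry and Sciences, Shandong Land Development Group Co., Ltd., and Beijing Shuhui Shikong Information Technology Co., Ltd. It aimed to bring together leading industry expertise, discuss data construction standards, share innovative application results, and jointly build a new ecosystem for natural resources data applications. The forum drew numerous experts and practitioners from government, enterprises, and research institutes, and was one of the most closely watched events of the conference.
In a report titled "Building High-Quality Datasets for the Natural Resources Industry and Exploring Applications of the Houtu Large Model", Wu Hongtao, deputy director and chief engineer of the Information Center of the Ministry of Natural Resources, said that high-quality datasets have become a strategic focus of global AI competition. He outlined a closed-loop system for building high-quality natural resources datasets: from a "data refinery" (data supply) to a "data-use laboratory" (data empowerment), then to a "value operations center" (data services), ultimately forming an "open digital ecosystem for large models" (joint construction and sharing) and an "industry large-model standards system" (unified standards). He specifically noted ...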
Building High-Quality Datasets: Jiangsu Must Act, and Must Lead the Way
Xin Hua Ri Bao· 2025-11-06 08:16
Core Insights
- The "2025 National High-Quality Data Set and Data Annotation Industry Supply and Demand Matching Conference" held in Nanjing successfully attracted over 500 companies and resulted in more than 90 collaborations with a transaction value exceeding 900 million yuan [1]
- Jiangsu province aims to leverage its rich data resources to enhance the construction of high-quality data sets, which is essential for seizing opportunities in artificial intelligence development [1][2]
- The definition of high-quality data sets varies across industries, but they must meet the training needs of AI large models [2]

Industry Overview
- Jiangsu has established 321 high-quality data sets across key sectors such as healthcare, transportation, industry, energy, and cultural tourism, with a total data scale exceeding 93PB [1]
- The province has implemented a "1+N" policy framework to optimize the environment for artificial intelligence development, focusing on collaboration between supply and demand enterprises [2][7]

Challenges in Data Annotation
- Data annotation is crucial for AI development, requiring specialized knowledge and skills, particularly in complex fields like medical data [3][4]
- The industry faces challenges such as insufficient data supply and a lack of skilled data annotators, which hinder the progress of large models in niche areas [4]

Cost Considerations
- The high costs associated with data storage and processing are significant challenges for companies, with many high-quality data sets being discarded due to storage expenses [5][6]
- Companies are exploring solutions like establishing cold storage centers in less developed regions to reduce costs associated with data storage [5]

Financial Support and Standards
- The data industry is knowledge and capital-intensive, with a significant portion of costs tied to acquiring raw data [6]
- Financial institutions are encouraged to provide support for data collection and annotation, potentially through innovative financing models [6]
- The establishment of standards for high-quality data sets is underway, with guidelines and quality assessment protocols being developed to address current challenges [6]
AI High-Quality Dataset Ecosystem Development Conference Held in Yongchuan, Chongqing
Xin Hua Wang· 2025-09-29 08:41
Core Insights
- The conference focused on building high-quality datasets to empower AI development, emphasizing data labeling industry practices and innovations [1][6]
- A partnership was established between the Chongqing Big Data Application Development Management Bureau and the Yongchuan District government to create a "Chongqing Data Set Construction Application Base" [3][4]
- The West Data Labeling Research Institute and West Data Set Production Base were inaugurated to enhance digital technology sharing and data industry incubation [4][6]

Group 1: Conference Highlights
- The conference featured policy introductions, case sharing, and industry dialogues to promote AI data infrastructure and regional data innovation [1][6]
- The Yongchuan District aims to enhance data labeling efficiency and usability to support the city's AI capabilities and business scenarios [3][6]

Group 2: Strategic Initiatives
- Yongchuan District signed cooperation projects with 12 companies, including major telecom operators and technology firms, to advance high-quality dataset construction and application [6][7]
- The district plans to establish a data labeling industry park and implement a "data labeling + application" model to integrate digital and physical economies [6][7]

Group 3: Future Goals
- Yongchuan aims to become a hub for data element circulation and a data labeling service base by 2027, focusing on four key actions: building a data labeling industry park, creating a "data labeling +" ecosystem, implementing talent development initiatives, and promoting data value release [7]
How Is a High-Quality Dataset of More Than 10 Trillion Tokens Forged? An Interview with Ruan Yilong of China Telecom Tianyi AI
QbitAI (量子位) · 2025-09-26 02:08
Core Viewpoint
- The article emphasizes the importance of high-quality datasets in developing and training AI models, highlighting that such datasets are crucial for enhancing model performance and accuracy [4][6][14].

Group 1: High-Quality Data Sets
- The company has amassed over 10 trillion tokens of general model corpus data and specialized datasets covering 14 key industries, with a total storage capacity of 350TB [1][6].
- These datasets are not just raw data but are meticulously labeled and optimized, making them ready for immediate application in various industries [3][4].
- High-quality datasets are essential as they directly influence the accuracy, generalization, and usability of AI models, serving as the foundation for effective model training [4][5].

Group 2: Technological Infrastructure
- The company has developed the Xingchen MaaS platform, which operates as a data refinery, creating a complete closed loop of "data-model-service" [6][17].
- The platform includes a data toolchain that efficiently processes various data types and a model toolchain that enhances data into usable models, ensuring a robust data lifecycle management [18][19].
- The platform's capabilities allow for the generation of synthetic data for rare or extreme scenarios, enhancing model robustness and safety [18][19].

Group 3: Strategic Considerations
- The company's investment in high-quality datasets is driven by national strategy, market demand, and its own operational advantages, positioning itself as a key player in the AI landscape [15][16].
- The government has recognized AI as a national strategy, prompting the company to build data infrastructure that supports AI technology breakthroughs [15][16].
- The company aims to leverage its extensive data resources and customer base to enhance its capabilities in high-quality dataset development [16].

Group 4: Industry Applications
- The company has successfully implemented AI solutions in various sectors, such as textile quality inspection, achieving over 95% accuracy in defect detection, significantly improving production efficiency [9][26].
- High-quality datasets have been developed for multiple industries, including healthcare, agriculture, and smart cities, demonstrating the versatility and impact of AI applications [36][37].
- The company has collaborated with various sectors to create tailored datasets that address specific industry challenges, enhancing operational efficiency and service quality [36][37].

Group 5: Future Vision
- The company envisions becoming a leading provider of general AI services, focusing on technological advancement, inclusive applications, and an open ecosystem for collaboration [42][43].
- It aims to cultivate a skilled workforce in AI, ensuring that technological innovations translate into practical applications that benefit society [43][44].
- The ultimate goal is to enhance the digital economy while ensuring safety and fairness in AI applications, contributing to a more equitable society [44][45].
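Zhejiang University Professor Wang Chunhui: High-Quality Datasets Are the Key Foundation for Training, Inference, and Validation of Large AI Models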
Zhong Guo Jing Ying Bao· 2025-09-21 14:52
Core Insights
- The current data industry in China is entering a "fast lane" of development, with the value of data as a key production factor becoming increasingly prominent [1][2]
- High-quality datasets are essential for the reliable development of AI models, as low-quality data can lead to misleading outputs known as "hallucinations" [2][3]

Data Quality and AI Models
- The training data for large language models (LLMs) often comes from the internet, resulting in varying quality and leading to outputs based on "probabilistic matching" rather than "factual judgment" [2]
- A study indicates that when training datasets contain only 0.01% false text, harmful content output by the model increases by 11.2%, highlighting the critical issue of insufficient high-quality data supply [2]
- High-quality datasets are categorized into general datasets, industry general datasets, and industry-specific datasets, which are foundational for the application of both general and industry models [2][3]

Industry-Specific Data
- Industry general datasets include knowledge that requires a certain level of professional background to understand, such as healthcare data encompassing personal attributes, health status, and medical application data [3]
- Industry-specific datasets require deeper professional knowledge and are crucial for specific business scenarios, such as medical AI relying on high-quality expert-annotated data [3]

AI and Data Integration
- The trend is shifting towards a data-centric approach in AI development, which does not diminish the value of model-centric AI but rather complements it [3]

Prompt Engineering
- The ability to ask questions and discern answers is emphasized as crucial in the AI era, with the concept of prompt engineering introduced to guide LLMs in generating useful content [4]
- Skilled prompt engineers can enhance AI model efficiency by over 30% in fields like healthcare by designing precise prompts [4]

Policy and Industry Development
- The Chinese government has issued guidelines to strengthen the construction of high-quality AI datasets, emphasizing application-oriented approaches and the development of data processing and service industries [5]
- The shift from "data-entity integration" to "entity-data integration" reflects a focus on promoting high-quality development driven by the needs of the real economy [5]
OpenAI: ChatGPT Revenue Expected to Reach Nearly $10 Billion This Year | Chief News Daily
Chief Business Review (首席商业评论) · 2025-09-07 04:09
Group 1
- Xinba, the founder of XinXuan Group, was reported to be taken away for investigation, but the company denied the claims [2]
- The film "Nanjing Photo Studio" was released in the UK, providing an opportunity for audiences to understand the history of the Nanjing Massacre [3]
- The first AI computing open architecture in China was launched at the World Intelligent Industry Expo, supporting significant AI computing capabilities [4]

Group 2
- Yi Huiman, during his tenure as the chairman of the China Securities Regulatory Commission, saw the A-share market breach the 3000-point mark 20 times [5][6]
- OpenAI is expected to generate nearly $10 billion in revenue through ChatGPT this year, with total revenue projected to reach $13 billion [10]
- A Brazilian billionaire has named football star Neymar as the sole heir to his fortune, estimated to exceed $1 billion [12]
First List of 85 High-Quality Dataset Construction Projects Released
Zheng Quan Shi Bao Wang· 2025-09-06 02:48
Core Insights
- The 2025 Trusted Data Space High-Quality Data Set Ecological Conference was held in Chongqing, where the first batch of 85 high-quality data set construction lists was released [1]
- The conference initiated the pilot projects for high-quality data set construction and the Trusted Data Space National Innovation Development Pilot in Chongqing [1]

Industry Developments
- In the automotive sector, there will be an accelerated construction of data sets for new energy vehicle power battery safety assessment and intelligent driving algorithm research, aimed at enhancing the trillion-level industrial cluster with a "data new engine" [1]
- In the low-altitude economy sector, the construction of data sets such as the Tianmu Constellation Global Atmospheric and Ocean Remote Sensing and Low-altitude Urban Safety Inspection Guardian will be expedited, which will build spatial perception capabilities to empower efficient, refined, and intelligent urban governance [1]
Trends of the Times: A Qualitative Shift in Data Leads a New Leap in Intelligent Civilization
Zheng Quan Shi Bao· 2025-09-04 21:58
Core Insights
- The construction of high-quality data sets in China has entered a large-scale phase, with a total exceeding 400PB and a cumulative transaction value of nearly 4 billion yuan [1]
- The transformation from "mass" to "quality" data reflects not only technological evolution but also a deeper philosophical shift in digital civilization, emphasizing the importance of quality over quantity [1]
- The emergence of high-quality data sets signifies a fundamental shift in AI development paradigms, moving from a phase of excessive data feeding to one focused on quality and precision [1]

Industry Implications
- High-quality data sets are considered the "cultural gene pool" of the digital age, integrating traditional Chinese cultural values into the data framework, which presents an opportunity for cultural transmission alongside technological advancement [2]
- The risk of creating a digital divide is highlighted, where institutions with access to high-quality data may monopolize AI benefits, potentially leaving data-poor entities behind in the intelligent era [2]
- The need for data classification, security measures, and equitable data policies is emphasized to prevent high-quality data from becoming a systemic risk and to ensure fair access to AI advancements [2]

Future Directions
- Establishing national standards for data quality is essential to provide measurable criteria for "high quality" [3]
- Encouraging cross-domain data integration is necessary to break down "data silos" and enable high-quality data to create multiplier effects [3]
- It is crucial to embed humanistic values in the data intelligence process to prevent AI from becoming merely a utilitarian tool, ensuring that data is not only high-quality but also meaningful and ethical [3]
Trends of the Times | A Qualitative Shift in Data Leads a New Leap in Intelligent Civilization
Zheng Quan Shi Bao· 2025-09-04 18:53
Group 1
- The construction of high-quality data sets in China has entered a large-scale phase, with a total volume exceeding 400PB and a cumulative transaction value of nearly 4 billion yuan [1]
- The transformation from "mass" to "high quality" data reflects not only technological evolution but also a deeper philosophical shift in digital civilization, emphasizing the importance of quality over quantity [1]
- The emergence of high-quality data sets signifies a fundamental shift in AI development paradigms, moving from a phase of resource wastage to one of refined knowledge cultivation [1]

Group 2
- High-quality data sets are described as the "cultural gene pool" of the digital age, integrating traditional Chinese culture and values into the data framework [2]
- The construction of high-quality data sets presents both opportunities for technological advancement and challenges related to the digital divide, where institutions with quality data may monopolize AI benefits [2]
- There is a need for data policies that balance efficiency and fairness to prevent high-quality data from becoming the private property of a few entities [2]

Group 3
- The future requires a "data alchemist" approach to reshape the data value ecosystem, including establishing national standards for data quality and encouraging cross-domain data integration [3]
- It is crucial to embed humanistic values in the data intelligence process to prevent AI from becoming merely a utilitarian tool [3]
- The era of data quality transformation emphasizes the importance of ethical, quality, and temporal considerations in data, ensuring that AI development achieves a balance between quantity and quality [3]
High-Quality Datasets Resonate with AI, Becoming the "Hard Currency" of Data Circulation
Zhong Guo Xin Wen Wang· 2025-09-02 14:32
Group 1
- The core concept of "high-quality data sets" has been introduced to support AI application innovation and the development of new business models such as "data as a service" and "knowledge as a service" [2]
- By June 2025, over 35,000 high-quality data sets are expected to be built in China, totaling over 400 petabytes (PB), with 3,364 data trading institutions listing high-quality data sets and a cumulative transaction value of nearly 4 billion yuan [2]
- The relationship between high-quality data sets and AI development is symbiotic, with high-quality data sets becoming essential for AI model training and data circulation [3]

Group 2
- The quality and security of data set construction are critical for the development of AI models, necessitating a robust data security system and the integration of traditional cultural values [3]
- Shenzhen is actively exploring the integration of public and enterprise data resources to support high-quality data applications, achieving positive results in various sectors such as finance and insurance [3]