High-Quality Datasets
AI High-Quality Dataset Ecosystem Development Conference Held in Yongchuan, Chongqing
Xin Hua Wang· 2025-09-29 08:41
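The Western Data Annotation Research Institute (西部数据标注研究院) is a digital technology sharing platform, digital industry incubation platform, and digital ecosystem building platform jointly initiated by the China Information Association and the Yongchuan District People's Government. The institute will focus on the two-way empowerment of artificial intelligence and the Digital Chongqing initiative. Across fields such as AI, high-quality datasets, and data annotation, it will carry out research and innovation in emerging technologies, top-level design, project research, standards setting, and quality evaluation; it will also assemble an expert think tank and train interdisciplinary data annotation talent. In the future, the institute will also explore building science and technology innovation platforms such as data science laboratories, technology innovation centers, and enterprise technology centers, and will increase support for basic research and for frontier and original technological innovation in the data field. Its main research directions include the construction and application of high-quality datasets; the data annotation industry chain, talent, and technology; and comprehensive trials of data annotation scenarios.
The Western Dataset Production Base is being built jointly by the China Information Association and the Yongchuan District government. Drawing on the resources of its member companies, the association will encourage more dataset production companies to set up in Yongchuan, and the two sides will work together to concentrate data elements there, building a base that drives the data industry in western China and reaches nationwide.
On September 28, the AI High-Quality Dataset Ecosystem Development Conference was held in Yongchuan District, Chongqing. Themed "Building High-Quality Datasets, Empowering New AI Development," the conference focused on practice and exploration in the data annotation industry through policy briefings, case sharing, unveiling and signing ceremonies, industry dialogues, and other formats, ...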
How Is a High-Quality Dataset of More Than 10 Trillion Tokens Built? An Interview with Ruan Yilong of China Telecom Tianyi AI
Liang Zi Wei (QbitAI)· 2025-09-26 02:08
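By Jin Lei, from Aofeisi. QbitAI | WeChat official account: QbitAI
As the saying goes, "whoever has the data wins the world," and this central state-owned enterprise has clearly mastered high-quality datasets: more than 10 trillion tokens of general-purpose large-model corpus data, plus specialized datasets covering 14 key industries, with total storage of 350 TB.
At this scale, it is not a jumble of raw data but industry data, including multimodal data, that has been carefully annotated and optimized and is ready to be put to work in industry right away.
Some readers may ask: does this really matter? The answer is a definite yes.
A high-quality dataset is a collection of data that has been collected, processed, and otherwise prepared so that it can be used directly to develop and train AI models and can effectively improve model performance. Building high-quality datasets is critical because it directly determines the accuracy, generalization, and usability of AI models: quality data is the foundation for training efficient, accurate models.
That shows how important it is.
So which central enterprise is this? To get straight to the point, it is a member of the "AI national team," China Telecom Tianyi AI, whose Xingchen MaaS platform (星辰MaaS平台) is key to building high-quality datasets.
The Xingchen MaaS platform works like a data refinery: four core components operate in concert to form a complete "data, model, service" closed loop. Among them, the foundation model serves as the "power engine," providing basic cognition and reasoning capabilities; the data toolchain serves as the "raw material store," continuously supplying high-quality data resources; ...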
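Zhejiang University Professor Wang Chunhui: High-Quality Datasets Are the Key Foundation for Training, Inference, and Validation of Large AI Models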
Core Insights
- The data industry in China is entering a "fast lane" of development, with the value of data as a key production factor becoming increasingly prominent [1][2]
- High-quality datasets are essential for the reliable development of AI models, as low-quality data can lead to misleading outputs known as "hallucinations" [2][3]
Data Quality and AI Models
- The training data for large language models (LLMs) often comes from the internet, resulting in varying quality and leading to outputs based on "probabilistic matching" rather than "factual judgment" [2]
- A study indicates that when training datasets contain only 0.01% false text, harmful content output by the model increases by 11.2%, highlighting the critical issue of insufficient high-quality data supply [2]
- High-quality datasets are categorized into general datasets, industry general datasets, and industry-specific datasets, which are foundational for the application of both general and industry models [2][3]
Industry-Specific Data
- Industry general datasets include knowledge that requires a certain level of professional background to understand, such as healthcare data encompassing personal attributes, health status, and medical application data [3]
- Industry-specific datasets require deeper professional knowledge and are crucial for specific business scenarios, such as medical AI relying on high-quality expert-annotated data [3]
AI and Data Integration
- The trend is shifting towards a data-centric approach in AI development, which does not diminish the value of model-centric AI but rather complements it [3]
Prompt Engineering
- The ability to ask questions and discern answers is emphasized as crucial in the AI era, with the concept of prompt engineering introduced to guide LLMs in generating useful content [4]
- Skilled prompt engineers can enhance AI model efficiency by over 30% in fields like healthcare by designing precise prompts [4]
Policy and Industry Development
- The Chinese government has issued guidelines to strengthen the construction of high-quality AI datasets, emphasizing application-oriented approaches and the development of data processing and service industries [5]
- The shift from "data-entity integration" to "entity-data integration" reflects a focus on promoting high-quality development driven by the needs of the real economy [5]
OpenAI: ChatGPT Revenue Expected to Reach Nearly $10 Billion This Year | Chief News Daily
Shou Xi Shang Ye Ping Lun· 2025-09-07 04:09
Group 1
- Xinba, the founder of XinXuan Group, was reported to have been taken away for investigation, but the company denied the claims [2]
- The film "Nanjing Photo Studio" was released in the UK, giving audiences an opportunity to understand the history of the Nanjing Massacre [3]
- The first AI computing open architecture in China was launched at the World Intelligent Industry Expo, supporting significant AI computing capabilities [4]
Group 2
- During Yi Huiman's tenure as chairman of the China Securities Regulatory Commission, the A-share market breached the 3000-point mark 20 times [5][6]
- OpenAI is expected to generate nearly $10 billion in revenue through ChatGPT this year, with total revenue projected to reach $13 billion [10]
- A Brazilian billionaire has named football star Neymar as the sole heir to his fortune, estimated to exceed $1 billion [12]
First Construction List of 85 High-Quality Datasets Released
Core Insights
- The 2025 Trusted Data Space High-Quality Data Set Ecological Conference was held in Chongqing, where the first construction list of 85 high-quality data sets was released [1]
- The conference launched the pilot projects for high-quality data set construction and the Trusted Data Space National Innovation Development Pilot in Chongqing [1]
Industry Developments
- In the automotive sector, the construction of data sets for new energy vehicle power battery safety assessment and intelligent driving algorithm research will be accelerated, giving the trillion-level industrial cluster a "new data engine" [1]
- In the low-altitude economy sector, the construction of data sets such as the Tianmu Constellation Global Atmospheric and Ocean Remote Sensing and Low-altitude Urban Safety Inspection Guardian will be expedited, building spatial perception capabilities that support efficient, refined, and intelligent urban governance [1]
Winds of the Times: A Qualitative Shift in Data Leads a New Leap for Intelligent Civilization
Zheng Quan Shi Bao· 2025-09-04 21:58
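According to Zhong Guo Xin Wen Wang, the construction of high-quality datasets in China has entered a scaled-up stage, with total volume exceeding 400 PB and cumulative transaction value of nearly 4 billion yuan. As the AI wave elevates data to the level of a strategic resource, this shift from "massive" to "high-quality" reveals not only a pattern of technological evolution but also a deeper philosophy of digital civilization: moving from quantitative expansion to qualitative refinement is the path every industrial revolution has taken toward maturity.
Behind the "qualitative shift" in data is a fundamental change in the AI development paradigm. The early, crude practice of force-feeding data was in effect a waste of computing power and energy; the emergence of high-quality datasets marks AI's move from the wild stage of "brute force works miracles" to the civilized stage of "careful cultivation yields real knowledge." It is like humanity's shift from hunting and gathering to agriculture: the aim is not more fruit but better seeds. Tsinghua University professor Zhang Xiaojin's "two-wheel drive" metaphor captures this co-evolving, upward-spiraling relationship between data and AI: quality data nourishes the evolution of AI, and more intelligent AI in turn helps unlock the value of data, forming a virtuous cycle.
Beyond the technical level, high-quality datasets can be called the "cultural gene bank" of the digital age. Wu Shizhong, an academician of the Chinese Academy of Engineering, stresses "integrating fine traditional Chinese culture," which is in effect a call to embed civilizational values at the data layer. In the early, Western-dominated internet, data carried a great deal of cultural bias; today, building high-quality datasets is both a technical challenge and an opportunity for cultural inheritance: encoding Chinese wisdom in data so that AI is not only "smart" but also possesses Eastern ...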
Winds of the Times | A Qualitative Shift in Data Leads a New Leap for Intelligent Civilization
Zheng Quan Shi Bao· 2025-09-04 18:53
Group 1
- The construction of high-quality data sets in China has entered a large-scale phase, with a total volume exceeding 400PB and a cumulative transaction value of nearly 4 billion yuan [1]
- The transformation from "mass" to "high quality" data reflects not only technological evolution but also a deeper philosophical shift in digital civilization, emphasizing the importance of quality over quantity [1]
- The emergence of high-quality data sets signifies a fundamental shift in AI development paradigms, moving from a phase of resource wastage to one of refined knowledge cultivation [1]
Group 2
- High-quality data sets are described as the "cultural gene pool" of the digital age, integrating traditional Chinese culture and values into the data framework [2]
- The construction of high-quality data sets presents both opportunities for technological advancement and challenges related to the digital divide, where institutions with quality data may monopolize AI benefits [2]
- There is a need for data policies that balance efficiency and fairness to prevent high-quality data from becoming the private property of a few entities [2]
Group 3
- The future requires a "data alchemist" approach to reshape the data value ecosystem, including establishing national standards for data quality and encouraging cross-domain data integration [3]
- It is crucial to embed humanistic values in the data intelligence process to prevent AI from becoming merely a utilitarian tool [3]
- The era of data quality transformation emphasizes the importance of ethical, quality, and temporal considerations in data, ensuring that AI development achieves a balance between quantity and quality [3]
High-Quality Datasets Resonate with AI, Becoming the "Hard Currency" of Data Circulation
Zhong Guo Xin Wen Wang· 2025-09-02 14:32
Group 1
- The core concept of "high-quality data sets" has been introduced to support AI application innovation and the development of new business models such as "data as a service" and "knowledge as a service" [2]
- As of June 2025, over 35,000 high-quality data sets had been built in China, totaling over 400 petabytes (PB), with 3,364 data trading institutions listing high-quality data sets and a cumulative transaction value of nearly 4 billion yuan [2]
- The relationship between high-quality data sets and AI development is symbiotic, with high-quality data sets becoming essential for AI model training and data circulation [3]
Group 2
- The quality and security of data set construction are critical for the development of AI models, necessitating a robust data security system and the integration of traditional cultural values [3]
- Shenzhen is actively exploring the integration of public and enterprise data resources to support high-quality data applications, achieving positive results in sectors such as finance and insurance [3]
Jiangsu Releases First List of Key Areas for High-Quality Dataset Construction
Xin Hua Ri Bao· 2025-09-01 23:24
Core Insights
- Jiangsu has released a list of key areas for the construction of high-quality datasets, aimed at fostering innovation in artificial intelligence large model technology and enhancing industrial ecosystems [1]
Group 1: Key Areas of Focus
- The first batch of construction lists targets 16 key areas including industrial manufacturing, transportation, healthcare, scientific research, financial services, cultural tourism, urban governance, human resources, green low-carbon initiatives, rural agriculture, smart energy, education, business, emergency management, meteorological services, and public safety [1]
- In addition to the key areas, the list also includes high-quality datasets for general large models, cross-border data, and government services [1]
Group 2: Impact on Society
- The "Health Information Dataset" in the healthcare sector integrates various medical and public health functions, providing support for health analysis, disease monitoring, clinical decision-making, public health emergency response, and medical quality monitoring [1]
- The "Human Resources and Social Security Industry Dataset" consolidates information on individual and corporate social security contributions, vocational qualifications, labor arbitration, and labor inspection, enabling precise public service and credit evaluation [1]
Jiangsu Releases List of Key Areas for High-Quality Dataset Construction
Xin Hua Ri Bao· 2025-09-01 22:36
Core Insights
- Jiangsu has released a list of key areas for the construction of high-quality data sets, aimed at fostering innovation in artificial intelligence and enhancing industrial ecosystems [1]
- The first batch of the construction list focuses on 16 key sectors, including industrial manufacturing, transportation, healthcare, scientific research, financial services, cultural tourism, urban governance, human resources, green low-carbon initiatives, rural agriculture, smart energy, education, business, emergency management, meteorological services, and public safety [1]
- Additional areas for high-quality data sets include general large models, cross-border data, and government services [1]
Sector-Specific Summaries
- In the healthcare sector, the "Health Information Data Set" integrates various medical functions, providing support for health analysis, disease monitoring, clinical decision-making, public health emergency response, and medical quality monitoring [1]
- The "Human Resources and Social Security Data Set" compiles information on individual and enterprise social security contributions, vocational qualifications, labor arbitration, and labor inspection, enabling precise public service and credit evaluation [1]