Data Quality
The New AI Paradox: The Smarter the Model, the Worse the Data
36Kr · 2026-01-07 23:11
Editor's note: The reliability of an AI model depends on the quality of its underlying data. Yet a new paradox is emerging: the smarter the model, the worse the data may become. This article is a compiled translation; we hope you find it instructive. Shenyiju (神译局) is 36Kr's translation team, covering technology, business, careers, and lifestyle, with a focus on new technologies, new perspectives, and new trends from abroad.

AI promises a smarter, faster, more efficient future, but beneath the optimistic surface lies a worsening hidden problem: the data itself. We often discuss algorithms yet rarely pay attention to the infrastructure that supports them. In reality, the speed of innovation can never exceed the quality of its input data, and those inputs are now showing signs of fatigue. When the foundation begins to shake, even the most advanced systems fail.

A decade ago, scale and precision could still advance hand in hand. Today the two goals often pull in opposite directions. Privacy regulations, device-authorization restrictions, and new platform rules make acquiring high-quality first-party data harder than ever. To fill the gap, the market is flooded with recycled, fabricated, or inferred signals that look legitimate but are not.

[Image source: Eugene Mymrin/Getty Images]

1. When data volume becomes noise

For years, the industry assumed that more data meant deeper insight. Volume signaled strength; more input implied more intelligence. But the data surplus has now become distracting noise. To maintain scale, some vendors have resorted to padding data or injecting fake signals, making systems look healthy while actually eroding the data's reliab ...
Data Governance Frameworks: The Three Elements Spanning People, Process, and Technology
36Kr · 2025-12-25 09:44
What constitutes bad data? What are the focus areas of a governance framework? And how do you navigate the nuances of people, process, and technology in a large organization?

I. What constitutes "bad data"

Dirty data is information that is incomplete, inaccurate, outdated, or duplicated, and it can wreak havoc on an organization. It is a costly problem: it breeds mistrust, wastes resources, and undermines decision-making. Although data quality is critical, it is often overlooked, leading to serious business disruption and lost opportunities.

1. The impact of misjudgments caused by dirty data

The consequences of poor data quality are severe and far-reaching. Research shows that poor data quality costs businesses millions of dollars every year. Sales teams waste time and money chasing dead leads, finance departments produce reports with errors, and marketing campaigns underperform because they target the wrong audiences.

More worryingly, decisions based on bad data can steer an entire company off course, leading to missed opportunities, misallocated resources, and strategic missteps.

For example, inaccurate patient records at a healthcare provider can have grave consequences. A misdiagnosis caused by outdated or mismatched data not only endangers patient safety but can also trigger legal liability. In industries such as finance, where data-driven risk assessments guide billions of dollars of investment, the margin for error is even smaller.

2. Bad data is extremely widespread

Despite the obvious risks, the dirty-data problem persists in almost every organization, with root causes that usually include poor data-governance practices, siloed systems, and a lack of accountability. Too many companies ...
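The four defects that define dirty data above (incomplete, inaccurate, outdated, duplicated) map naturally onto simple record-level checks. A minimal sketch in Python, where the field names, email heuristic, and freshness threshold are illustrative assumptions rather than anything from the article:

```python
from datetime import date

def audit_record(rec, today=date(2025, 12, 25), max_age_days=365):
    """Return data-quality flags for one record: incomplete, inaccurate, outdated."""
    flags = []
    required = ("name", "email", "updated_on")
    # Incomplete: a required field is missing or empty.
    if any(rec.get(k) in (None, "") for k in required):
        flags.append("incomplete")
    # Inaccurate (toy heuristic): email fails a trivial shape check.
    email = rec.get("email", "")
    if email and ("@" not in email or "." not in email.split("@")[-1]):
        flags.append("inaccurate")
    # Outdated: last update is older than the freshness window.
    updated = rec.get("updated_on")
    if updated and (today - updated).days > max_age_days:
        flags.append("outdated")
    return flags

def find_duplicates(records):
    """Flag pairs of records sharing the same normalized email address."""
    seen, dupes = {}, []
    for i, rec in enumerate(records):
        key = (rec.get("email") or "").strip().lower()
        if key and key in seen:
            dupes.append((seen[key], i))
        elif key:
            seen[key] = i
    return dupes
```

In practice these checks would run inside a profiling tool rather than ad-hoc functions, but the categories are the same ones a governance framework has to operationalize.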
Deep in a Crisis of Trust: Is Trump's "Political Purge" Triggering a Collapse in Data Quality?
Jin Shi Shu Ju · 2025-12-10 13:31
According to a new report from top US statistical experts, American data agencies urgently need help from the Trump administration and Congress to carry out their basic duties and restore public confidence amid a deepening crisis.

A study led by the American Statistical Association (ASA) finds that these agencies are hobbled by fragile capacity and damaged trust, and need more funding and staff. The report lists challenges that have grown more severe since its initial edition last year, before President Trump returned to the White House.

The day before that episode unfolded, the statistical experts behind the study had released an interim report saying they were confident the data could be trusted, with no sign of interference by the executive branch. Trump's move against the Bureau of Labor Statistics forced them to rethink quickly. The document was revised to say the president's actions "undermined trust going forward by accusing the head of a statistical agency of past political manipulation."

Agencies such as the Bureau of Labor Statistics (BLS), the Bureau of Economic Analysis (BEA), and the Census Bureau publish data covering the economy and many other topics, data that is vital to decisions by policymakers, investors, companies, and the broader public. Long-standing problems such as budget cuts and falling survey response rates, along with recent threats to their independence and integrity, have made their work harder.

The group's new report cites a survey finding that public trust in federal data fell from 57% in June to 52% in September. The report, released Wednesday, says "action must be taken immediately" to halt the federal statistical ...
No Funding, No Cash Burn, No Hiring Spree: The AI Unicorn Founded by a Chinese-American CEO Has Entered Google's and Anthropic's Core Supply Chains, with Revenue Now Approaching 10 Billion Yuan
36Kr · 2025-12-10 09:12
While Meta was spending $14.3 billion for a stake in its rival Scale AI, this company, founded by a former Google engineer and with only a tenth of its competitor's headcount, had quietly surpassed $1 billion in annual revenue without ever taking outside investment.

In the AI arena, the spotlight always chases stars like OpenAI and Google as they release the next trillion-parameter model. The training data that determines a model's "thinking" and "character" sits like a forgotten foundation.

Silicon Valley is staging a drama of sharp contrasts. On one side, Meta spent $14.3 billion to acquire nearly half of data-labeling firm Scale AI, making founder Alexandr Wang a Silicon Valley celebrity.

On the other side is its low-profile rival, Surge AI: founded nearly five years ago, it has raised no funding, has issued almost no press releases in the past two years, and employs a tenth as many people, yet it has quietly reached more than $1 billion in revenue, financially surpassing the heavily funded Scale AI.

This time the protagonist is Edwin Chen. Surge AI's founder and CEO, Edwin Chen, is a Chinese-American who studied mathematics and linguistics at the Massachusetts Institute of Technology (MIT). After graduating, Edwin entered the workforce, with roles at companies including Google and Meta Platforms (formerly F ...
Smart-Cockpit Competition Shifts to a Contest of "Data Quality, Scene Granularity, and Depth of Adaptation"
Xin Jing Bao · 2025-11-28 03:47
Core Insights
- The automotive industry is experiencing a trend of functional homogenization due to the widespread adoption of large model capabilities, shifting competition from "model scale" to "data quality, scene granularity, and depth adaptation" [1]
- Companies need to build an integrated closed-loop capability of "scene-data-model" to achieve "model as application," creating differentiated experiences in real-world usage scenarios [1]

Industry Trends
- There is an increasing willingness among users to pay for comfort hardware such as smart seats and smart audio systems, with experience depth becoming a key value anchor for future smart cockpits [1]
- The interaction paradigm of smart vehicles is transitioning from "passive response tools" to "proactive cognitive partners," moving beyond user-triggered commands to proactively predicting and providing services based on integrated sensor data, user behavior, and contextual needs [1]

Future Directions
- By 2025, smart cockpits are expected to shift from "digital redundancy" to "pragmatism," with a rational transformation in smart cockpit interactions leading to a "rebalancing" of touch controls and physical buttons [1]
- The competitive focus of future smart cockpits will return from "breadth of functionality" to "depth of experience" [1]
AI Is Coming to Take a "Commission" from E-commerce
Di Yi Cai Jing Zi Xun · 2025-11-26 16:14
Core Insights
- The rise of AI is fundamentally transforming the e-commerce landscape, shifting traffic from traditional search engines to AI-driven platforms like ChatGPT and Doubao [2][3][4]
- Major e-commerce players are adapting to this trend, with companies like OpenAI and Shopify collaborating to integrate shopping experiences directly into conversational interfaces [3][4]
- In China, Doubao, supported by ByteDance, is creating a closed-loop e-commerce ecosystem with Douyin, indicating a significant shift in consumer shopping behavior [5][6]

E-commerce Traffic Shift
- AI is changing the way consumers shop, moving from traditional search methods to conversational interactions, as highlighted by Shopify's partnership with OpenAI [3][4]
- ChatGPT has surpassed 700 million monthly active users, becoming a new entry point for e-commerce and competing with established platforms like Google and Amazon [3][4]
- Traditional search engines are experiencing a decline in usage while AI models gain traction, which could disrupt the advertising revenue model of companies like Google and Baidu [4][9]

AI Integration in E-commerce
- Doubao allows users to ask for product recommendations directly within the app, linking to Douyin's shopping platform and thus streamlining the shopping process [5][6]
- Alibaba's recent launch of the Qianwen app aims to compete in the AI-driven e-commerce space, although it has yet to achieve significant user engagement compared to Doubao [6][10]
- The integration of AI in e-commerce is expected to enhance user experience by providing personalized recommendations based on historical behavior [9][12]

Future of Search Engines
- Experts predict that traditional search engines may gradually decline as AI-driven conversational interfaces become the norm for information retrieval and shopping [7][9]
- Companies like Baidu are developing AI products to adapt to this shift, with their Wenxin assistant showing significant user growth [10][11]

Data Quality and Trust Issues
- The effectiveness of AI in e-commerce is heavily dependent on the quality of the data used, with concerns about the reliability of information generated by AI models [13][16]
- The challenge of misinformation and the need for accurate data integration are critical for the successful implementation of AI in commercial settings [14][17]
Breaking the Data-Quality Divide! The Tsinghua-Tencent Bee Project Releases a 15-Million-Sample High-Quality Dataset, Setting a New Full-Stack Open-Source MLLM SOTA
量子位 · 2025-11-11 04:24
Core Insights
- The article discusses the launch of the Bee project by Tsinghua University and Tencent's Hunyuan team, aimed at bridging the performance gap between fully open-source multimodal large language models (MLLMs) and their closed or semi-open counterparts [2][5][26]

Group 1: Background and Motivation
- The current MLLM landscape exhibits a three-tier structure: (1) top-tier closed-source models (e.g., Gemini 2.5, GPT-5), (2) semi-open models trained on private data (e.g., Qwen2.5-VL), and (3) significantly underperforming fully open-source models [5]
- The core bottleneck is identified as the "data quality gap" rather than model architecture [2][10]

Group 2: Key Contributions of the Bee Project
- Honey-Data-15M: a high-quality SFT dataset comprising 15 million samples, enhanced through a dual-level Chain-of-Thought (CoT) approach [6][16]
- HoneyPipe & DataStudio: an open-source, end-to-end data-enhancement pipeline that provides a transparent and reproducible methodology for data cleaning and CoT augmentation [6][12]
- Bee-8B: a new 8-billion-parameter model trained on Honey-Data-15M, achieving state-of-the-art (SOTA) results on various benchmarks and rivaling or surpassing mainstream semi-open models [6][21][26]

Group 3: Data Quality Issues
- Existing open-source datasets suffer from two main issues: pervasive noise (e.g., factual inaccuracies, mismatched images) and a lack of complex reasoning data [11][14]
- The Bee project emphasizes that the most viable path for the open-source community is to focus on "data quality" rather than merely increasing "data quantity" [11][26]

Group 4: HoneyPipe Process
- The HoneyPipe process follows a meticulous "filter-enhance-validate" workflow that produces high-quality datasets [15][18]
- The process includes three stages: noise and irrelevance filtering, short-CoT enhancement and validation, and long-CoT enhancement for complex queries [18]

Group 5: Performance of Bee-8B
- Bee-8B demonstrates superior performance across various benchmarks, including MathVerse and LogicVista, where it scored 67.0 and 61.3 respectively, outperforming semi-open models [28]
- In general VQA tasks, Bee-8B achieved SOTA scores on multiple benchmarks, including MMStar and CountBench [28]

Group 6: Conclusion
- The Bee project effectively addresses the core data-quality issues hindering the development of fully open-source MLLMs, advocating a methodology that prioritizes data quality over sheer volume [26]
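The three-stage "filter-enhance-validate" workflow attributed to HoneyPipe can be pictured as a chain of passes over the dataset. The sketch below is a toy illustration only: the real pipeline relies on model-based judges and generated chains of thought, whereas these stand-in functions use trivial rules, and every name and field is hypothetical:

```python
# Hypothetical skeleton of a "filter -> enhance -> validate" data pipeline.
# Each sample is a dict; the rules are deliberately trivial stand-ins for
# the model-based filtering and CoT generation the article describes.

def filter_noise(samples):
    """Stage 1: drop samples pre-flagged as noisy or irrelevant."""
    return [s for s in samples if not s.get("noisy", False)]

def enhance_cot(samples):
    """Stage 2: attach a short chain-of-thought rationale to each answer."""
    for s in samples:
        s["cot"] = f"Step by step: inspect the input, then derive '{s['answer']}'."
    return samples

def validate(samples):
    """Stage 3: keep only samples whose rationale actually supports the answer
    (here, trivially: the rationale mentions the answer string)."""
    return [s for s in samples if s["answer"] in s.get("cot", "")]

def honeypipe_sketch(samples):
    return validate(enhance_cot(filter_noise(samples)))
```

The point of the structure is that validation happens after enhancement, so samples whose generated rationale fails the check are discarded rather than patched, which is how a pipeline keeps quality monotone across stages.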
After Months of Being Fed Junk Tweets, Large Models Get "Brain Rot", and There's No Cure
机器之心 · 2025-10-21 03:43
Core Viewpoint
- The article discusses a study indicating that large language models (LLMs) can experience cognitive decline, referred to as "brain rot," due to prolonged exposure to low-quality internet content, similar to the effects observed in humans [4][10][12]

Group 1: Research Findings
- The study conducted by Texas A&M University, the University of Texas at Austin, and Purdue University demonstrates that LLMs can suffer cognitive degradation when trained on viral Twitter data characterized by short, engaging posts [4][6]
- Cognitive functions such as reasoning and long-term memory showed significant declines, with reasoning ability decreasing by 23% and long-term memory by 30% after exposure to low-quality data [14][15]
- The research established a "brain rot hypothesis," suggesting that continuous exposure to poor-quality text leads to a sustained decline in the cognitive abilities of LLMs [12][29]

Group 2: Experimental Methodology
- Researchers utilized a controlled experiment on real Twitter data, creating datasets based on engagement (M1) and semantic quality (M2) to assess the impact of low-quality content on LLMs [13][20]
- M1 focused on the popularity and brevity of posts, while M2 evaluated the sensationalism or superficiality of the content, with both methods indicating a negative correlation between data quality and cognitive performance [13][22]

Group 3: Implications and Recommendations
- The findings highlight the necessity of regular "cognitive health checks" for deployed LLMs, emphasizing the importance of data quality in maintaining their cognitive capabilities [17][29]
- The study suggests that the effects of exposure to low-quality data are not easily mitigated through standard fine-tuning techniques, indicating a need for improved data-curation practices [29]
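The two dataset-construction criteria (M1: engagement and brevity; M2: semantic quality) can be illustrated as simple bucketing rules that split a corpus into a "junk" set and a control set. The thresholds and field names below are assumptions for illustration; the study's actual selection used measured engagement metrics and model-based quality scoring:

```python
# Hypothetical bucketing rules mirroring the M1/M2 split described above.
# Each tweet is a dict with illustrative fields: text, likes, quality.

def split_m1(tweets, like_cut=1000, len_cut=100):
    """M1 ('engagement'): junk = short, highly engaged posts; control = the rest."""
    junk = [t for t in tweets
            if t["likes"] >= like_cut and len(t["text"]) < len_cut]
    # O(n^2) membership test is fine for a sketch; a real pipeline would index.
    control = [t for t in tweets if t not in junk]
    return junk, control

def split_m2(tweets, quality_cut=0.5):
    """M2 ('semantic quality'): junk = posts scored as sensational/superficial."""
    junk = [t for t in tweets if t["quality"] < quality_cut]
    control = [t for t in tweets if t["quality"] >= quality_cut]
    return junk, control
```

Training matched models on the junk bucket versus the control bucket, then comparing reasoning and memory benchmarks, is the shape of the controlled comparison the study reports.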
Navigating Market Uncertainty: Morningstar Keeps Investing on Track
Morningstar 晨星 · 2025-10-16 01:05
Core Insights
- The article emphasizes the importance of bridging the information gap between individual investors and professional institutions, a mission that Morningstar has pursued since its founding in 1984 [2]

Group 1: Data Quality and Investment Solutions
- Morningstar has built one of the largest and highest-quality investment databases globally, covering approximately 800,000 investment products, with a strong focus on rigorous data-quality checks [3]
- The company connects disparate data sources by acquiring firms like PitchBook, enabling a comprehensive view of both public and private markets [3]
- Morningstar offers unique analytical tools, such as medal ratings and sustainability ratings, to facilitate quicker decision-making from vast amounts of investment data [3]

Group 2: Services for Asset Managers and Institutional Investors
- Morningstar provides professional research support for public funds and bank wealth management, offering independent and objective evaluation systems for investment decisions [10]
- The company assists institutions in constructing robust investment portfolios that align with long-term goals through macro and strategic research support [12]
- Insights into global product-innovation trends are shared to help institutions develop forward-looking financial products that meet investor needs [13]

Group 3: Commitment to Client Interests
- Morningstar aims to empower investment advisors to better serve their clients, believing that those who genuinely represent client interests will ultimately achieve market returns [7]
- The company maintains an independent viewpoint and cautious approach in its research, providing a foundation for investment advisors to guide clients toward long-term strategies amid market noise [6]

Group 4: Company Overview
- Morningstar, Inc. is a leading investment research firm with operations across North America, Europe, Australia, and Asia, providing financial information, fund, and stock analysis to various professional investors [17]
- As of June 30, 2025, Morningstar manages and advises on assets totaling approximately $352 billion across 33 global markets [20]
How the government shutdown complicates the Fed's rate cut options
YouTube · 2025-10-09 21:44
Core Insights
- The ongoing government shutdown is complicating the Federal Reserve's ability to make informed decisions due to a lack of critical labor-market and inflation data [1][3][10]
- The Fed may need to pause its decision-making process until more reliable data becomes available, which could lead to increased market volatility [3][9][10]

Labor Market Analysis
- There are signs of a slowdown in the labor market, but the Fed is uncertain whether this trend is stabilizing due to missing data [2][4]
- Alternative data sets are available, but they do not fully compensate for the absence of official data, leading to uncertainty in estimating payroll changes [10][11]

Economic Impact
- The government shutdown is estimated to create a quarterly GDP drag of approximately 10 basis points per week, equating to about $400 million per day in lost federal-worker compensation [6]
- The integrity of economic data, such as the Consumer Price Index (CPI), is being compromised, with a significant increase in the percentage of estimated goods in the CPI basket [4][5]

Market Reactions
- If the Fed decides to hold off on rate cuts due to the lack of data, it could negatively impact market sentiment, as investors have been pricing in expected cuts [8][9]
- Despite the uncertainty, underlying earnings remain strong, with growth estimates for Q3 rising from 8% to 8.8%, indicating resilience in the market [14][16]
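The two shutdown figures quoted above (roughly 10 basis points of quarterly GDP drag per week, and about $400 million per day in lost federal-worker compensation) compose into a simple cumulative estimate. A back-of-envelope sketch using only the article's numbers; these are the transcript's estimates, not official data:

```python
# Back-of-envelope calculator for the shutdown estimates quoted above.

def shutdown_drag(weeks, bp_per_week=10, comp_per_day_musd=400):
    """Return (cumulative drag on quarterly GDP in basis points,
    cumulative lost federal compensation in $ billions)."""
    drag_bp = weeks * bp_per_week
    lost_comp_busd = weeks * 7 * comp_per_day_musd / 1000
    return drag_bp, lost_comp_busd
```

So a four-week shutdown would, on these assumptions, shave about 40 basis points off quarterly GDP while withholding roughly $11 billion in federal pay, which is the scale of drag the Fed would be weighing without the data to measure it.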