Data Quality
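Bridging the Data Quality Gap! Tsinghua and Tencent's Bee Project Releases a 15-Million-Sample High-Quality Dataset, Setting a New SOTA for Fully Open-Source MLLMs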
量子位· 2025-11-11 04:24
Core Insights
- The article covers the launch of the Bee project by Tsinghua University and Tencent's Hunyuan team, which aims to close the performance gap between fully open-source multimodal large language models (MLLMs) and their closed or semi-open counterparts [2][5][26].
Group 1: Background and Motivation
- The current MLLM landscape has three tiers: (1) top-tier closed-source models (e.g., Gemini 2.5, GPT-5), (2) semi-open models trained on private data (e.g., Qwen2.5-VL), and (3) fully open-source models that lag significantly behind [5].
- The article identifies the "data quality gap", rather than model architecture, as the core bottleneck [2][10].
Group 2: Key Contributions of the Bee Project
- **Honey-Data-15M**: A high-quality SFT dataset of 15 million samples, enhanced through a dual-layer Chain of Thought (CoT) approach [6][16].
- **HoneyPipe & DataStudio**: An open-source, end-to-end data enhancement pipeline that provides a transparent and reproducible methodology for data cleaning and CoT augmentation [6][12].
- **Bee-8B**: A new 8-billion-parameter model trained on Honey-Data-15M that achieves state-of-the-art (SOTA) results on various benchmarks, rivaling or surpassing mainstream semi-open models [6][21][26].
Group 3: Data Quality Issues
- Existing open-source datasets suffer from two main problems: pervasive noise (e.g., factual inaccuracies, mismatched images) and a lack of complex reasoning data [11][14].
- The Bee project argues that the most viable path for the open-source community is to focus on data quality rather than simply increasing data quantity [11][26].
Group 4: HoneyPipe Process
- HoneyPipe follows a "filter-enhance-validate" workflow to produce high-quality datasets [15][18].
- The process has three stages: noise and irrelevance filtering, short CoT enhancement and validation, and long CoT enhancement for complex queries [18].
Group 5: Performance of Bee-8B
- Bee-8B performs strongly across various benchmarks, including MathVerse and LogicVista, where it scored 67.0 and 61.3, respectively, outperforming semi-open models [28].
- On general VQA tasks, Bee-8B achieved SOTA scores on multiple benchmarks, including MMStar and CountBench [28].
Group 6: Conclusion
- The Bee project addresses the core data quality issues holding back fully open-source MLLMs and advocates a methodology that prioritizes data quality over sheer volume [26].
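The three HoneyPipe stages described above (filtering, short CoT enhancement with validation, long CoT enhancement for complex queries) can be sketched roughly as follows. This is only an illustration of the control flow, not the project's actual code: `Sample`, `run_pipeline`, and every callback name are hypothetical, and dropping samples that fail validation is an assumption the summary does not confirm.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Sample:
    """Hypothetical record for one image-question-answer training sample."""
    image_id: str
    question: str
    answer: str
    cot: Optional[str] = None                 # reasoning trace added by enhancement
    tags: list = field(default_factory=list)  # which stages touched the sample


def run_pipeline(samples, is_clean, add_short_cot, is_valid, is_complex, add_long_cot):
    """Filter -> short CoT + validation -> long CoT for complex queries."""
    kept = []
    for s in samples:
        # Stage 1: drop noisy or irrelevant samples.
        if not is_clean(s):
            continue
        # Stage 2: attach a short CoT, then validate it.
        s.cot = add_short_cot(s)
        s.tags.append("short_cot")
        if not is_valid(s):
            continue  # assumption: samples that fail validation are discarded
        # Stage 3: complex queries get a longer reasoning trace.
        if is_complex(s):
            s.cot = add_long_cot(s)
            s.tags.append("long_cot")
        kept.append(s)
    return kept
```

In a real pipeline the callbacks would presumably wrap model calls; plain functions are used here so the flow can be exercised with toy lambdas.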
Fed Junk Tweets for Months, LLMs Develop "Brain Rot", and the Condition Can't Be Cured
机器之心· 2025-10-21 03:43
Core Viewpoint
- The article discusses a study indicating that large language models (LLMs) can experience cognitive decline, referred to as "brain rot," after prolonged exposure to low-quality internet content, similar to the effects observed in humans [4][10][12].
Group 1: Research Findings
- The study, conducted by Texas A&M University, the University of Texas at Austin, and Purdue University, shows that LLMs can suffer cognitive degradation when trained on viral Twitter data characterized by short, engaging posts [4][6].
- Cognitive functions such as reasoning and long-term memory declined significantly after exposure to low-quality data: reasoning ability fell by 23% and long-term memory by 30% [14][15].
- The research established a "brain rot hypothesis," which holds that continuous exposure to poor-quality text leads to a sustained decline in the cognitive abilities of LLMs [12][29].
Group 2: Experimental Methodology
- Researchers ran a controlled experiment on real Twitter data, building datasets based on engagement (M1) and semantic quality (M2) to assess the impact of low-quality content on LLMs [13][20].
- M1 focused on the popularity and brevity of posts, while M2 evaluated how sensational or superficial the content was; both methods showed that lower data quality goes hand in hand with weaker cognitive performance [13][22].
Group 3: Implications and Recommendations
- The findings highlight the need for regular "cognitive health checks" for deployed LLMs and underline the importance of data quality in maintaining their cognitive capabilities [17][29].
- The study suggests that the effects of exposure to low-quality data are not easily reversed through standard fine-tuning techniques, indicating a need for better data curation practices [29].
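As a rough illustration of the M1 (engagement-based) split described above, the sketch below labels short, highly popular posts as "junk" and long, low-engagement posts as "control". It is not the study's code: the `Post` fields, the function name, and both thresholds are hypothetical placeholders, since the summary does not give the actual cut-offs.

```python
from dataclasses import dataclass


@dataclass
class Post:
    """Hypothetical tweet record; field names are illustrative only."""
    text: str
    likes: int
    retweets: int


def split_by_engagement(posts, max_junk_tokens=30, min_junk_engagement=500):
    """Return (junk, control) lists using popularity and brevity.

    Both thresholds are placeholder values, not figures from the study.
    """
    junk, control = [], []
    for p in posts:
        n_tokens = len(p.text.split())       # crude whitespace token count
        engagement = p.likes + p.retweets
        if n_tokens < max_junk_tokens and engagement > min_junk_engagement:
            junk.append(p)                    # short and viral
        elif n_tokens >= max_junk_tokens and engagement <= min_junk_engagement:
            control.append(p)                 # long and little-shared
        # Posts matching neither profile are left out of both sets.
    return junk, control
```

Leaving ambiguous posts out of both sets keeps the two groups well separated, which is one plausible way to make the comparison controlled.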
Navigating Market Uncertainty: Morningstar Keeps Investing on Course
Morningstar晨星· 2025-10-16 01:05
Core Insights
- The article emphasizes the importance of bridging the information gap between individual investors and professional institutions, a mission that Morningstar has pursued since its founding in 1984 [2].
Group 1: Data Quality and Investment Solutions
- Morningstar has built one of the largest and highest-quality investment databases globally, covering approximately 800,000 investment products, with a strong focus on rigorous data quality checks [3].
- The company connects disparate data sources by acquiring firms like PitchBook, enabling a comprehensive view of both public and private markets [3].
- Morningstar offers unique analytical tools, such as medal ratings and sustainability ratings, to speed up decision-making across vast amounts of investment data [3].
Group 2: Services for Asset Managers and Institutional Investors
- Morningstar provides professional research support for public funds and bank wealth management, offering independent and objective evaluation systems for investment decisions [10].
- The company helps institutions construct robust investment portfolios that align with long-term goals through macro and strategic research support [12].
- It shares insights into global product innovation trends to help institutions develop forward-looking financial products that meet investor needs [13].
Group 3: Commitment to Client Interests
- Morningstar aims to empower investment advisors to better serve their clients, believing that those who genuinely represent client interests will ultimately achieve market returns [7].
- The company maintains an independent viewpoint and a cautious approach in its research, giving investment advisors a foundation for guiding clients toward long-term strategies amid market noise [6].
Group 4: Company Overview
- Morningstar, Inc. is a leading investment research firm with operations across North America, Europe, Australia, and Asia, providing financial information and fund and stock analysis to various professional investors [17].
- As of June 30, 2025, Morningstar manages and advises on assets totaling approximately $352 billion across 33 global markets [20].
How the government shutdown complicates the Fed's rate cut options
Youtube· 2025-10-09 21:44
Core Insights
- The ongoing government shutdown is complicating the Federal Reserve's ability to make informed decisions due to a lack of critical labor market and inflation data [1][3][10]
- The Fed may need to pause its decision-making process until more reliable data becomes available, which could lead to increased market volatility [3][9][10]
Labor Market Analysis
- There are signs of a slowdown in the labor market, but the Fed is uncertain whether this trend is stabilizing due to missing data [2][4]
- Alternative data sets are available, but they do not fully compensate for the absence of official data, leading to uncertainty in estimating payroll changes [10][11]
Economic Impact
- The government shutdown is estimated to create a quarterly GDP drag of approximately 10 basis points per week, equating to about $400 million per day in lost federal worker compensation [6]
- The integrity of economic data, such as the Consumer Price Index (CPI), is being compromised, with a significant increase in the percentage of estimated goods in the CPI basket [4][5]
Market Reactions
- If the Fed decides to hold off on rate cuts due to the lack of data, it could negatively impact market sentiment, as investors have been pricing in expected cuts [8][9]
- Despite the uncertainty, underlying earnings remain strong, with growth estimates for Q3 rising from 8% to 8.8%, indicating resilience in the market [14][16]
Making the Greater Bay Area a Model for the Secure Use of Data
Nan Fang Du Shi Bao· 2025-09-15 23:10
Core Insights
- The most critical aspect of artificial intelligence development is data quality, as emphasized by the Director of the Information Hub at Hong Kong University of Science and Technology (Guangzhou) [2][10]
- The establishment of a collaborative laboratory in the Guangdong-Hong Kong-Macao Greater Bay Area aims to integrate research capabilities from various universities to enhance the safe development of generative AI [2][4][10]
Data Quality
- Data quality is essential for effective AI applications, and the laboratory aims to create a big data platform for testing large models and improving their performance [4][10]
- The laboratory will explore new methods to ensure data quality, including collaboration with industry to accumulate high-quality data for practical applications [4][9]
Data Security
- Data security poses significant challenges, requiring a balance between data integration and safety, with suggestions for using techniques like homomorphic encryption and privacy computing [5][9]
- The laboratory is expected to establish a data security governance framework that includes both technical solutions and policy guidance to ensure proper data usage [5][8]
AI Model Development
- The successful application of large models in industries depends on their practical use, with examples provided from the insurance sector where large models can streamline claims processing [6][10]
- The laboratory is tasked with addressing the security of AI models, particularly concerning the protection of sensitive information and the prevention of data poisoning attacks [9][10]
Collaborative Efforts
- The laboratory aims to form alliances through agreements that promote data security and encourage participants to prioritize data safety and privacy [8]
- By fostering collaboration among universities and industries, the laboratory seeks to create a virtuous cycle of data sharing and usage, positioning the Greater Bay Area as a model for data security practices [8][9]
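The Whistle Blows on AI's Second Half: Data Quality Becomes the Deciding Factor as the Industry's First Enterprise AI Adoption Maturity Model Is Released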
21 Shi Ji Jing Ji Bao Dao· 2025-09-12 13:00
Core Insights
- The article emphasizes the transition in AI competition from model parameters to data quality, highlighting the importance of unique data assets and industry knowledge for businesses [2][3][10]
Group 1: AI Adoption Maturity Model (AIM)
- The AIM model was jointly developed by Shanghai Jiao Tong University and several industry leaders to provide a navigation system for enterprises in AI application [1][6]
- AIM consists of six interconnected dimensions: strategy, organization, data, technology, application, and commercial value, covering the entire process from design to value realization [6][9]
- The model aims to help businesses assess their current AI maturity level and guide them on future steps in the unique Chinese market environment [6][9]
Group 2: Industry-Specific Insights
- In the financial sector, companies have strong data foundations but need to enhance commercial value; the focus is shifting from auxiliary decision-making to autonomous financial intelligence [6][7]
- The automotive industry is transitioning from product intelligence to a dual focus on product and enterprise intelligence, emphasizing ROI-driven AI development [6][7]
- The health sector is moving towards personalized health services, leveraging AI to connect various resources and improve service efficiency [7]
- The retail industry is evolving from workflow improvement to consumer-centric experiences, with companies like L'Oréal integrating AI throughout the consumer journey [5][7]
Group 3: Actionable Guidelines
- AIM provides a five-level framework for enterprises to progress from initial AI exploration to becoming AI-native organizations, emphasizing the importance of integrating AI into the core business [9][10]
- The model breaks down the complex AI implementation process into manageable stages, helping companies identify weaknesses and plan development paths effectively [9][10]
- The future of AI competition will hinge on systemic capabilities, necessitating deep integration of AI into the core value chain for sustainable competitive advantage [10]
Trump Nominates Economist Antoni to Head the US Bureau of Labor Statistics; He Previously Suggested Suspending the Monthly Jobs Report
智通财经网· 2025-08-12 22:23
Group 1
- The nomination of EJ Antoni as the next director of the Bureau of Labor Statistics (BLS) has raised concerns because he previously suggested suspending the monthly employment reports, citing reliability problems and overestimated figures [1][2]
- The White House has indicated that, despite Antoni's suggestion, the Trump administration plans to keep releasing monthly employment reports to maintain public trust in the data [1][2]
- Recent monthly employment reports have shown weak job growth, with significant downward revisions to previous months' data, which led President Trump to dismiss the former BLS director [1]
Group 2
- The BLS has tracked monthly data revisions since 1979, and the average revision has been 51,000 jobs since new sampling methods were introduced in 2003; the revisions for May and June this year were notably larger, at 120,000 and 133,000 jobs respectively [2]
- Economists are divided on the issue, with some advocating improved data quality rather than halting publication of the monthly reports [2]
- Challenges facing the BLS include budget cuts, declining public trust, and a drop in survey response rates, which have fallen from 60% in early 2020 to below 43% [2]
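To put the revision figures above in proportion, the following sketch computes how many times larger the May and June revisions were than the post-2003 average. The numbers are taken from the summary; the variable and function names are illustrative only.

```python
# Figures quoted in the summary above (number of jobs).
AVERAGE_REVISION = 51_000
RECENT_REVISIONS = {"May": 120_000, "June": 133_000}


def revision_ratio(revision: int, baseline: int = AVERAGE_REVISION) -> float:
    """How many times larger a revision is than the baseline average."""
    return revision / baseline


for month, revision in RECENT_REVISIONS.items():
    print(f"{month}: {revision_ratio(revision):.2f}x the post-2003 average")
# May: 2.35x the post-2003 average
# June: 2.61x the post-2003 average

# Survey response rate fell from 60% to below 43%,
# i.e. a drop of more than 17 percentage points.
RESPONSE_RATE_DROP_PP = 60 - 43
```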
Uncertainty Over a September Fed Rate Cut Rises Sharply; Shanghai Gold Falls Into Wide-Range Swings
Jin Tou Wang· 2025-08-12 05:44
Group 1
- The market consensus expects a mild rise in inflation, which is unlikely to alter the Federal Reserve's anticipated easing cycle starting in September [2]
- July core CPI is projected to rise 3.0% year on year, up from 2.9% in June, pointing to a potential upward trend in inflation [2]
- The quality of data collection and potential biases are critical concerns that could affect market reactions to the Federal Reserve's decisions [2]
Group 2
- Gold futures are currently trading around 777.34 yuan per gram, down 0.95%, in a short-term range-bound pattern [1]
- Key resistance levels for gold futures are identified between 788 yuan per gram and 847 yuan per gram, while important support levels are between 770 yuan per gram and 820 yuan per gram [3]
Bootstrapped to $1 Billion ARR: How Dark Horse Surge AI Is Upending Scale's Dominance
海外独角兽· 2025-07-25 09:52
Core Insights
- Surge AI, founded in 2020, has rapidly become a leading player in the data annotation market, achieving an ARR of over $1 billion by 2024, surpassing Scale AI's $870 million revenue [3][4]
- The company focuses on providing high-quality data annotation services for AI models, emphasizing the importance of data quality over quantity [3][4]
- Surge AI's client base includes top tech companies such as Google, OpenAI, and Meta, highlighting its reputation in the industry [3]
Group 1: Data Annotation Market
- The data annotation market is divided into two main categories: BPO "human intermediaries" and AI-native "factories" like Surge AI, which provide comprehensive services to meet complex market demands [11][12]
- Clients prioritize data quality, processing speed, cost, scalability, compliance, and expertise when selecting data suppliers [12]
- Client relationships are highly fluid, with customers often running a "multi-supplier parallel" strategy to avoid over-reliance on a single vendor [12]
Group 2: Founding Intent of Surge
- Edwin Chen, the founder, faced challenges in obtaining quality data for model training, leading to the creation of Surge AI to address these needs [24]
- Surge AI's approach diverges from typical Silicon Valley practices by focusing on product quality and customer satisfaction rather than rapid fundraising [25]
- The company's commitment to data quality has established it as a recognized leader in the industry [25]
Group 3: Underlying Technology for High-Quality Delivery
- Surge AI employs a combination of machine learning and human feedback to enhance its annotation capabilities, creating a feedback loop that improves data quality [27]
- The company emphasizes the importance of understanding language nuances and context in data annotation, particularly in specialized fields [28][30]
- Surge AI's unique evaluation metrics include emotional tone and intent judgment, allowing for more accurate data classification [29]
Group 4: Customer Case Studies
- Surge AI developed the GSM8K dataset for OpenAI, which includes 8,500 elementary math problems, ensuring high quality through rigorous standards and expert involvement [36][40]
- For Anthropic, Surge AI provided a tailored data annotation solution that addressed challenges in acquiring high-quality human feedback data for their Claude model [42][50]
Group 5: Founding Team
- Edwin Chen, the CEO, has a strong background in machine learning and data annotation, having worked at major tech companies like Google and Facebook [55][56]
- The team includes experts from various fields, ensuring a diverse skill set that enhances Surge AI's capabilities in data annotation [59][62]
Powell Confronts a Data Crisis as the Tug-of-War in Paper Silver Heats Up
Jin Tou Wang· 2025-06-26 06:05
Group 1
- Silver is currently trading above 8.342, most recently up 0.77% at 8.368 per gram, indicating a bullish short-term trend [1]
- Key support for silver lies between 8.131 and 8.200 per gram, and a break below this range could bring further downward pressure [3]
- Resistance is concentrated between 8.410 and 8.490 per gram, and a breakout could lead to a test of the critical 8.500 per gram level [3]
Group 2
- Federal Reserve Chairman Jerome Powell is facing pressure to cut interest rates, while also expressing concern about the declining quality of economic data from the Bureau of Labor Statistics [2]
- Economists have noted that approximately 30% of the CPI data for May was estimated, three times the historical average, raising concerns about the accuracy of inflation and employment data [2]
- The May non-farm payroll report showed 139,000 jobs added, but analysts believe the figure may be revised down to around 100,000 [2]
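The support and resistance ranges quoted above can be read as simple price bands. The sketch below classifies a price against them; the band values come from the summary, while the function name and the labels it returns are illustrative assumptions rather than any trading rule from the article.

```python
# Bands quoted in the summary above (price per gram).
SUPPORT = (8.131, 8.200)
RESISTANCE = (8.410, 8.490)


def classify(price: float) -> str:
    """Place a price relative to the quoted support and resistance bands."""
    if price < SUPPORT[0]:
        return "below support"
    if price <= SUPPORT[1]:
        return "in support band"
    if price < RESISTANCE[0]:
        return "between bands"
    if price <= RESISTANCE[1]:
        return "in resistance band"
    return "above resistance"


print(classify(8.368))  # between bands
```

The latest quoted price of 8.368 sits between the two bands, which matches the summary's reading that neither level has yet been tested.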