Synthetic Data
ICML 2025 | How to Avoid Model Collapse When Synthesizing Text Data?
机器之心· 2025-05-14 04:36
Core Insights
- The rapid development of generative artificial intelligence has made synthetic data an essential component for training large models such as the GPT series. However, uncontrolled use of synthetic data can lead to "model collapse," significantly degrading model performance and generalization to real-world data [1][2][6].

Group 1: Challenges of Synthetic Data
- "Non-iterative collapse" occurs when a high proportion of synthetic data is mixed into the training data, causing a significant drop in model performance even after a single round of pre-training [6].
- Synthetic data has two structural defects compared with human-generated data: a lack of low-frequency and long-tail samples, which hinders the representation of linguistic diversity, and an over-concentration of linguistic features, which increases the risk of overfitting [13].

Group 2: Token-Level Editing Method
- Instead of generating entire segments, the Token-Level Editing method applies fine-grained "micro-editing" operations to real data, producing more stable and generalizable "semi-synthetic" data and mitigating the risk of model collapse (a minimal sketch of the idea follows this summary) [3][10].
- The editing process retains the long-tail structure of the original data and adjusts only "overconfident" tokens, so the model maintains coverage of the real data distribution and avoids feature over-concentration [11][15].

Group 3: Theoretical Results
- The test error of the Token-Level Editing process has a finite upper bound, preventing model collapse; the error does not grow with the number of iterations [12][16].
- The theoretical framework shows that even under multi-round training, Token-Level Editing mathematically prevents unbounded error growth, establishing a "theoretically non-collapsing" data-augmentation path [16].

Group 4: Experimental Validation
- The effectiveness of Token-Level Editing was validated through systematic experiments across three key stages of language-model training: pre-training, continual pre-training, and supervised fine-tuning [17].
- In pre-training, models trained on edited data outperformed those trained on purely synthetic data, with an average task-score gain of +0.36 percentage points on benchmarks such as PIQA, BoolQ, and Winogrande [18].
- In continual pre-training, significant cross-domain generalization gains were observed, such as a +13.6% accuracy increase on PubMedQA [18].
- In supervised fine-tuning, the method showed strong robustness on complex tasks, with LLaMA-3 improving by +0.4 to +0.5% on average [18].
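The summary describes Token-Level Editing only at a high level. The following is a minimal sketch of the idea under stated assumptions: a language model scores each token of a real sample, and only positions where the model's predicted probability exceeds a confidence threshold (the "overconfident" tokens) are resampled, while everything else, including rare long-tail tokens, is kept verbatim. The model name, the 0.99 threshold, and the resampling rule are illustrative choices, not the paper's exact specification.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def token_level_edit(text, model, tokenizer, threshold=0.99):
    """Produce a 'semi-synthetic' copy of `text`: resample only the
    tokens the model is overconfident about, keep the rest verbatim."""
    ids = tokenizer(text, return_tensors="pt").input_ids   # (1, T)
    with torch.no_grad():
        logits = model(ids).logits                         # (1, T, V)
    probs = torch.softmax(logits, dim=-1)

    edited = ids.clone()
    # The distribution at position t predicts the token at t + 1.
    for t in range(ids.size(1) - 1):
        p_next = probs[0, t, ids[0, t + 1]]
        if p_next > threshold:                             # overconfident token
            edited[0, t + 1] = torch.multinomial(probs[0, t], 1)
    return tokenizer.decode(edited[0], skip_special_tokens=True)

model = AutoModelForCausalLM.from_pretrained("gpt2")       # illustrative model choice
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(token_level_edit("The quick brown fox jumps over the lazy dog.", model, tokenizer))
```

Because edits touch only high-confidence positions, the output stays anchored to the long tail of the real distribution, which is consistent with the bounded-error argument in Group 3.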
The Route Dispute in Embodied Spatial Data Technology: Synthetic Reconstruction vs. End-to-End Generation
量子位· 2025-04-20 13:24
Core Viewpoint
- Breakthroughs in embodied intelligence rely heavily on high-quality data, with a significant focus on synthetic data generation due to the high cost of collecting real data [1][2].

Group 1: Data Challenges
- Embodied-intelligence data today is scarce and inadequate; existing sources are limited and insufficiently diverse [16][18].
- Three main categories of existing data sources are identified: real scan data, game-engine environments, and open-source synthetic datasets, each with its own limitations [17].
- Indoor embodied-intelligence scenarios require structured, semantic, and interactive 3D scene data, which is hard to collect because each household's layout and usage patterns are unique (a sketch of what such a record might look like follows this summary) [18][19].

Group 2: Technical Approaches
- Two primary technical routes exist for synthetic data generation: "video synthesis + 3D reconstruction" and "end-to-end 3D generation" [3][24].
- The "video synthesis + 3D reconstruction" route generates video or images first, which can introduce cumulative errors and limits structural accuracy [24][39].
- The "end-to-end 3D generation" route aims to synthesize structured spatial data directly but faces challenges such as low generation quality and a lack of common sense [67][68].

Group 3: Innovations in Data Generation
- A new technical approach called "modal encoding" is proposed to close the common-sense gap in end-to-end 3D generation, digitally encoding spatial design solutions so they can be learned implicitly [5][91].
- Sengine SimHub is introduced as a system that integrates design knowledge into the generation process, improving the stability and adaptability of the generated data [75][78].
- The focus is on a data-generation system that does not merely produce space but generates "understandable and usable" environments, incorporating design logic and user preferences [91][96].

Group 4: Future Directions
- The industry is at a critical juncture where a new approach to data generation is clearly needed, moving beyond mere data accumulation to creating "useful data" [95][96].
- The future of embodied intelligence may hinge on how space is defined and understood, underscoring the importance of integrating rules and preferences into spatial data generation [96][100].
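The summary calls for "structured, semantic, and interactive" 3D scene data without specifying a format. As a purely illustrative sketch (not the Sengine SimHub schema, which the article does not describe), one record of such data might pair a room layout with semantically labeled, articulable objects:

```python
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    """One semantically labeled, potentially interactive object."""
    category: str                            # semantic label, e.g. "cabinet"
    position: tuple[float, float, float]     # meters, in room coordinates
    size: tuple[float, float, float]         # bounding-box extents in meters
    articulation: str = "static"             # e.g. "static", "hinged_door", "drawer"

@dataclass
class IndoorScene:
    """A structured scene record: layout plus objects, not raw geometry."""
    room_type: str
    floor_area_m2: float
    objects: list[SceneObject] = field(default_factory=list)

kitchen = IndoorScene(
    room_type="kitchen",
    floor_area_m2=9.5,
    objects=[
        SceneObject("cabinet", (0.3, 0.0, 0.4), (0.6, 0.9, 0.8), "hinged_door"),
        SceneObject("table", (1.8, 0.0, 1.2), (1.2, 0.75, 0.8)),
    ],
)
print(f"{kitchen.room_type}: {len(kitchen.objects)} labeled objects")
```

The point of such a representation is that a robot policy can query semantics and articulation directly, which raw scans or rendered video alone cannot provide.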
Lepton AI, Founded by a Former Alibaba Cloud Executive, Has Caught Nvidia's Eye!
Zheng Quan Shi Bao Wang· 2025-03-28 12:32
Core Viewpoint
- Nvidia is set to acquire AI startup Lepton AI, founded by former Alibaba Cloud executive Jia Yangqing, at a valuation of several hundred million dollars, allowing multiple investors to exit successfully [1][3].

Group 1: Company Overview
- Lepton AI was founded in 2023 by Jia Yangqing and Junjie Bai; it currently employs around 20 people and serves several venture-backed startups [3][7].
- Lepton AI's platform is optimized for AI workloads, enabling clients to train AI models and run inference in production environments [3].

Group 2: Technology and Features
- The platform offers a visual interface for setting up training clusters in the cloud and a choice of Nvidia GPUs [3].
- It includes error detection to improve output quality and flags subtle technical issues, such as excessive memory usage during training [3].
- After model development, clients can deploy their models on optimized inference instances in the company's cloud, with processing speeds exceeding 600 tokens per second and latency below 10 milliseconds [3].

Group 3: Strategic Implications
- Nvidia's acquisition of Lepton AI is seen as a strategic move to strengthen its technology ecosystem and meet growing demand for efficient AI solutions [4].
- The acquisition will give Lepton AI additional resources and support from Nvidia, accelerating its technological innovation and market expansion [4].
- Nvidia is also acquiring another AI startup, Gretel, which specializes in creating synthetic data for training AI models, in a deal reportedly worth over $320 million [5].

Group 4: Founder Background
- Jia Yangqing is a prominent figure in AI architecture; he previously worked at Google and Facebook, where he contributed to the development of the deep-learning framework Caffe [6][7].
- After joining Alibaba in 2019, Jia left to found Lepton AI in 2023, securing angel funding from notable investors including Sequoia China [7].
After the Large-Model "Clash of the Gods" Set Off a Reproduction Wave and Major Technical Upgrades, What Should We Be Watching? | 万有引力
AI科技大本营· 2025-03-25 01:45
The following article is from CSDN, by 万有引力. Author | 万有引力. Produced by | CSDN (ID: CSDNnews)

In just the past few weeks, the information density on the large-model track has surged to unprecedented levels. DeepSeek open-sourced projects for five consecutive days, directly triggering a wave of reproductions; Alibaba's Tongyi Lab and Tencent respectively released ViDoRAG, a RAG system for visual documents, and Turbo S, a new-generation Hunyuan fast-thinking model, accelerating the pace of large-model evolution; Musk's Grok 3, trained on 200,000 GPUs, surpassed many industry benchmarks, once again validating the law that "brute force works miracles"; Claude 3.7 Sonnet received a major upgrade to its coding ability, accelerating the arrival of an era of technical parity in AI programming; a DeepSeek paper "collided" with one from Kimi, and more and more companies are investing in sparse-attention and linear-attention mechanisms, which are becoming key directions of exploration beyond the Transformer (a minimal sketch of the linear-attention idea follows below); in addition, the "virtual machine" concept of the Manus model has quickly caught on and is reshaping how large models operate...

Behind this dizzying technology race, what truly deserves our attention? What is DeepSeek's five-day release streak really after? At a 545% cost-profit margin, can other large-model companies also find room for profitability? Facing industry ...
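The roundup mentions sparse attention and linear attention only in passing. For context, here is a minimal sketch of the linear-attention idea from the literature: apply a positive feature map to queries and keys and reassociate the product as Q(KᵀV), so cost grows linearly rather than quadratically with sequence length. The elu+1 feature map and the non-causal form are textbook illustrations, not the specific designs in the DeepSeek or Kimi papers.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """O(T) attention: compute K^T V once (a d x d_v matrix), then apply Q
    to it, instead of materializing the T x T attention matrix."""
    phi_q = F.elu(q) + 1                             # positive features, (T, d)
    phi_k = F.elu(k) + 1                             # (T, d)
    kv = phi_k.T @ v                                 # (d, d_v), cost O(T * d * d_v)
    z = phi_q @ phi_k.sum(dim=0, keepdim=True).T     # (T, 1) normalizer
    return (phi_q @ kv) / (z + eps)                  # (T, d_v)

T, d = 8, 4
q, k, v = (torch.randn(T, d) for _ in range(3))
print(linear_attention(q, k, v).shape)               # torch.Size([8, 4])
```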
Express | Nvidia Acquires the 80-Person Gretel Team at a Sky-High Price, Using Synthetic Data to Fill Out Its AI Infrastructure
Z Potentials· 2025-03-20 02:56
Core Viewpoint
- Nvidia has acquired Gretel, a startup specializing in generating synthetic AI training data, at a price reportedly in the nine figures, exceeding Gretel's most recent valuation of $320 million [1][3].

Group 1: Acquisition Details
- Gretel, founded in 2019 by Alex Watson, Laszlo Bock, John Myers, and Ali Golshan, will be integrated into Nvidia, with its technology becoming part of Nvidia's generative AI service suite [2].
- The acquisition comes as major tech companies such as Microsoft, Meta, OpenAI, and Anthropic increasingly use synthetic data to train their flagship AI models, given the depletion of real-world data resources [3].

Group 2: Financial Background
- Before the acquisition, Gretel had raised more than $67 million from investors including Anthos Capital, Greylock, and Moonshots Capital [3].
AI Monthly: Musk Accelerates the GPU Race; Have Large Models Really Hit a Wall? The Spotlight Shifts to Agents
晚点LatePost· 2024-12-11 14:30
A new column launches in trial operation. Text by 贺乾明, edited by 黄俊杰.

By November, more and more people were saying that the path that made OpenAI what it is seemed to have hit a wall:

Multiple media outlets reported that when developing their next-generation models, companies including Google, OpenAI, and Anthropic failed to achieve the kind of large capability jumps seen in previous years.

Marc Andreessen (马克·安德森), founding partner of Silicon Valley venture firm a16z and an investor in OpenAI and several other large-model companies, said: "We're increasing (GPUs) at the same rate, and we're not getting the intelligence improvements at all."

Ilya Sutskever (伊尔亚·苏茨克维), OpenAI co-founder and former chief scientist, said: "The 2010s were the age of scaling; now we're back in the age of wonder and discovery once again."

Executives at these companies deny the "hit a wall" claim, and there is evidence they are still searching for breakthroughs; after all, the push to build ever-larger compute centers has not slowed and is, if anything, accelerating.

At the same time, they are pouring more resources into large-model applications. From OpenAI and Anthropic to Google and Microsoft, and on to venture firms, everyone treats the Agent, a system that lets a large model understand human instructions and orchestrate databases and tools to complete complex tasks, as the next decisive match point (a minimal sketch of such a loop follows below).

In November, ChatGPT marked its second anniversary, yet OpenAI itself was relatively qui ...
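The report's definition of an Agent, a system in which a large model interprets an instruction and orchestrates tools to finish a task, maps onto a simple loop. The sketch below is generic and hypothetical: call_llm, the decision format, and the tool registry are illustrative stand-ins, not any vendor's API.

```python
def call_llm(messages):
    """Stand-in for a chat-model call. A real agent would send `messages`
    to an LLM API and parse its tool-use decision; here the behavior is
    scripted: one tool call, then a final answer."""
    if any(m["role"] == "tool" for m in messages):
        return {"answer": f"Done: {messages[-1]['content']}"}
    return {"tool": "search_db", "input": "Q3 revenue"}

TOOLS = {
    "search_db": lambda query: f"3 rows matching {query!r}",  # toy database tool
}

def run_agent(instruction, max_steps=5):
    """Minimal agent loop: at each step, the model either picks a tool
    (whose result is fed back to it) or returns a final answer."""
    messages = [{"role": "user", "content": instruction}]
    for _ in range(max_steps):
        decision = call_llm(messages)
        if "answer" in decision:
            return decision["answer"]
        result = TOOLS[decision["tool"]](decision["input"])
        messages.append({"role": "tool", "content": result})
    return "step limit reached"

print(run_agent("Summarize Q3 revenue from the database"))
```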
IDEA Research Institute's Harry Shum (沈向洋): From PMF to TMF, AI for Science Is What We Must Do Today
IPO早知道· 2024-11-23 01:04
Advancing AI will require manufacturing and synthesizing data, which could give rise to a new ten-billion-dollar question for large-model entrepreneurship.

This article is an IPO早知道 original. Author | 苏打. WeChat account | ipozaozhidao

"If there is one thing we absolutely must do today, it is AI for Science. It is hard to imagine anything more important right now, and this year's Nobel Prizes are the best proof."

At the 2024 IDEA Conference held on November 22, Harry Shum (沈向洋), founding chairman of the IDEA Research Institute and a foreign member of the US National Academy of Engineering, argued in a keynote titled "From Technological Breakthrough to Industrial Integration" that when innovating during a period of technological explosion, a deep understanding of the technology is especially important.

Shum said that viewed over the long arc of human social development, every great leap has come from technological innovation. Global GDP grew at roughly 1%-2% a year in the industrial era and 3%-4% in the information era; what will the number be in the AI era? At the same time, he stressed that as AI's capabilities approach and even surpass human levels, AI governance has become an issue the whole world must confront together.

Shenzhen may become one of the world's computing power hubs

"The booming development of AI in recent years has filled everyone with expectations for the industry. Compute, algorithms, and data are the unavoidable 'three-piece set.'" On stage, Shum shared his updated understanding of these three elements.

First, compute is a key productive force. Over the past four or five decades of the computing industry's development, the most important thing ...