NIPS 2025 Spotlight | HKU proposes TreeSynth: generate a million-scale dataset from a single sentence
机器之心· 2025-10-03 03:39
The first author of this paper, Wang Sheng, along with Chen Peng'an and Zhou Jingqi, are from the University of Hong Kong. The corresponding authors are Prof. Chuan Wu and Prof. Lingpeng Kong of the HKU Department of Computer Science. Other authors include Li Qintong, Dong Jingwei, and Gao Jiahui from HKU, as well as Xue Boyang and Jiang Jiyue from the Chinese University of Hong Kong.

Imagine taking over a new project where you need to improve a model's performance with no data at hand. TreeSynth grew out of the authors' original question: "How can a single task description be turned into massive training data for model training?" At the same time, large-scale scalability places new demands on the diversity of synthetic data. By contrast, traditional data synthesis resembles a farmer scattering seeds aimlessly with no plan: much fertile land goes unplanted while some barren corners end up crowded with crops.

This is precisely the core challenge facing data synthesis today: how to systematically generate diverse, high-quality training data from scratch. Existing methods are often constrained by model bias, limited seed data, and low-variation prompts, leaving the synthesized data lacking in diversity and unevenly distributed. More critically, these problems grow worse as the data scale increases.

To address this challenge, the research team from HKU and CUHK proposes TreeSynth, a tree-guided subspace data synthesis method inspired by decision trees. Starting from a root node representing the entire data space, it recursively partitions the complex data domain level by level, until each ...
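The recursive partitioning described above can be sketched as follows. This is a minimal illustration of the control flow only: `propose_split` is a hypothetical stand-in for the LLM call TreeSynth would make at each node to name a splitting attribute and its values, and the attribute names here are invented for demonstration.

```python
def propose_split(subspace_desc, depth):
    """Stand-in for an LLM call that proposes an attribute and its values
    for partitioning the given subspace. Returns None when the subspace is
    considered fine-grained enough to sample data from directly."""
    canned = [
        ("topic", ["arithmetic", "geometry"]),
        ("difficulty", ["easy", "hard"]),
    ]
    if depth >= len(canned):
        return None
    return canned[depth]

def partition(subspace_desc, depth=0):
    """Recursively split the data space; collect leaf subspace descriptions."""
    split = propose_split(subspace_desc, depth)
    if split is None:
        return [subspace_desc]  # leaf node: ready for data generation
    attribute, values = split
    leaves = []
    for v in values:
        child = dict(subspace_desc, **{attribute: v})
        leaves.extend(partition(child, depth + 1))
    return leaves

leaves = partition({"task": "grade-school math word problems"})
for leaf in leaves:
    print(leaf)
```

Because each leaf describes a distinct, non-overlapping subspace, generating data per leaf spreads coverage across the whole space instead of clustering it, which is the intuition behind the farming analogy above.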
Front-end developers, take note! The first AI that generates modern front-end code from a screenshot is here | now open-source
量子位· 2025-02-26 03:51
Core Viewpoint
- The article introduces Flame, an open-source multimodal large-model solution aimed at modern front-end code generation, addressing the complexities and requirements of contemporary front-end development [1][25].

Group 1: Model Capabilities
- Flame generates code that adheres to modern front-end development standards, featuring clear external styles and a modular component structure [4].
- Unlike top models such as GPT-4o, which produce static components, Flame's approach allows for dynamic rendering and proper definition of component states and event responses [5][7].

Group 2: Data Challenges
- The primary challenge for large vision-language models (LVLMs) in generating professional front-end code is the scarcity of high-quality training data [9][12].
- Existing datasets, such as WebSight, are inadequate as they cover only static HTML, failing to meet the needs of modern front-end frameworks like React [13].

Group 3: Data Synthesis Solutions
- Flame's team proposes data synthesis as a solution to the data scarcity issue, employing a self-reflective intelligent workflow to generate high-quality data for front-end development [16].
- Three synthesis methods are designed:
  - Evolution-Based Synthesis, which generates diverse code variants through random evolution [18].
  - Waterfall-Model-Based Synthesis, which ensures clear structure and logical consistency in generated code [20].
  - Additive Development Synthesis, which incrementally adds functionality to existing code [22].

Group 4: Performance Evaluation
- Flame's performance is evaluated on a high-quality test set of 80 items, counting only code that compiles correctly and adheres to coding standards [26].
- Whereas leading models like GPT-4o achieved a maximum Pass@1 of only 11%, Flame reached over 52% under similar conditions, demonstrating significant potential [27].
- Flame accomplished this with approximately 200,000 data points, validating the effectiveness of its data synthesis methods [27].
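The Pass@1 figures above follow the standard pass@k metric: the probability that at least one of k sampled generations passes the checks. A common unbiased estimator for it (this is the general metric, not a description of Flame's specific evaluation harness) can be sketched as:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn without replacement from n generations is correct,
    given that c of the n generations pass the checks."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy example: with n=1 generation per item and k=1, pass@1 reduces
# to the fraction of items whose single generation passes.
per_item = [pass_at_k(n=1, c=c, k=1) for c in [1, 0, 1, 1]]  # 3 of 4 pass
print(sum(per_item) / len(per_item))  # → 0.75
```

Requiring the generated code to compile and satisfy coding standards before a sample counts as "passing" makes this a stricter bar than text-similarity metrics, which is why the 11% vs. 52% gap is meaningful.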