Data Synthesis
NIPS 2025 Spotlight | HKU proposes TreeSynth: generating a million-scale dataset from a single sentence
机器之心 (Synced) · 2025-10-03 03:39
Core Insights
- TreeSynth is a novel data synthesis method inspired by decision trees, addressing the challenge of generating diverse, high-quality training data from scratch [6][7][25]
- The method ensures systematic coverage of the data space, overcoming limitations of traditional data synthesis approaches [4][25]

Methodology
- TreeSynth employs a two-phase workflow: data space partitioning followed by subspace data synthesis [8][12]
- In the first phase, the data space is recursively divided into mutually exclusive subspaces using pivot samples and core criteria [9][12]
- In the second phase, samples are generated within each atomic subspace, conditioned on the path description from the root to that leaf node [13][14]

Performance and Validation
- Experimental results show that TreeSynth consistently outperforms baseline methods across benchmarks, achieving significant performance improvements [19][23]
- For instance, accuracy on the GSM8K dataset increased from 45.2% to 55.8% with the LLaMA3.1-8B model [19]
- TreeSynth also demonstrated a 45% increase in data diversity over baseline methods, with broader coverage of the embedding space [23]

Future Directions
- TreeSynth opens new avenues for synthesizing diverse and comprehensive training datasets, with potential for scalability to large-data scenarios [26][27]
- Future work may focus on optimizing tree depth and partitioning criteria, and on adapting the method to complex real-world scenarios [28]
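The two-phase workflow above can be sketched in a few lines. This is a minimal illustration, not TreeSynth's actual implementation: `propose_criterion` and `generate_samples` stand in for the LLM calls that (respectively) propose a partitioning attribute for the current subspace and synthesize samples from a root-to-leaf path description; both names and their toy behavior are assumptions for the sake of a runnable example.

```python
# Hypothetical stand-in for the LLM call that proposes how to split the
# current subspace into mutually exclusive sub-subspaces.
def propose_criterion(path):
    options = [("topic", ["arithmetic", "geometry"]),
               ("difficulty", ["easy", "hard"])]
    return options[len(path) % len(options)]

# Hypothetical stand-in for the LLM call that synthesizes samples inside
# an atomic subspace, conditioned on the full root-to-leaf path.
def generate_samples(path, n):
    desc = ", ".join(f"{k}={v}" for k, v in path)
    return [f"sample {i} ({desc})" for i in range(n)]

def tree_synth(path=(), depth=2, per_leaf=2):
    """Phase 1: recursively partition the data space.
    Phase 2: synthesize samples at each leaf from its path description."""
    if depth == 0:
        return generate_samples(path, per_leaf)
    attr, values = propose_criterion(path)
    data = []
    for v in values:  # each value defines a mutually exclusive subspace
        data.extend(tree_synth(path + ((attr, v),), depth - 1, per_leaf))
    return data
```

Because the subspaces at each level are mutually exclusive and every leaf is visited, the leaves jointly cover the whole (toy) data space, which is the systematic-coverage property the paper emphasizes.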
Front-end programmers, take note! The first AI that generates modern front-end code from a screenshot is here | Now open source
量子位 (QbitAI) · 2025-02-26 03:51
Core Viewpoint
- The article introduces Flame, an open-source multimodal large-model solution for modern front-end code generation, addressing the complexities and requirements of contemporary front-end development [1][25]

Group 1: Model Capabilities
- Flame generates code that adheres to modern front-end development standards, with cleanly separated styles and a modular component structure [4]
- Unlike top models such as GPT-4o, which tend to produce static components, Flame's output supports dynamic rendering and properly defines component state and event responses [5][7]

Group 2: Data Challenges
- The primary obstacle for large vision-language models (LVLMs) in generating professional front-end code is the scarcity of high-quality training data [9][12]
- Existing datasets such as WebSight are inadequate because they cover only static HTML, failing to meet the needs of modern front-end frameworks like React [13]

Group 3: Data Synthesis Solutions
- Flame's team addresses the data scarcity issue through data synthesis, employing a self-reflective agentic workflow to generate high-quality front-end training data [16]
- Three synthesis methods are designed:
  - Evolution-Based Synthesis, which generates diverse code variants through random evolution [18]
  - Waterfall-Model-Based Synthesis, which ensures clear structure and logical consistency in the generated code [20]
  - Additive Development Synthesis, which incrementally adds functionality to existing code [22]

Group 4: Performance Evaluation
- Flame's performance is evaluated on a high-quality test set of 80 items, scoring only code that compiles correctly and adheres to coding standards [26]
- Whereas leading models such as GPT-4o achieved a maximum Pass@1 of only 11%, Flame reached over 52% under the same conditions, demonstrating significant potential [27]
- Flame accomplished this with approximately 200,000 data points, validating the effectiveness of its data synthesis methods [27]
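The Additive Development Synthesis idea can be illustrated with a small sketch. Everything here is assumed for illustration: `add_feature` stands in for an LLM call that extends existing code with one feature, and `passes_check` stands in for the self-reflective validation step (in Flame's setting, something like a compile-and-lint check); neither function reflects Flame's real API.

```python
# Hypothetical stand-in for an LLM call that layers one new feature
# (e.g. component state, an event handler) onto existing code.
def add_feature(code, feature):
    return code + f"\n// feature: {feature}"

# Hypothetical stand-in for the self-reflective check; Flame's workflow
# would reject candidates that fail to compile or violate standards.
def passes_check(code):
    return "TODO" not in code

def additive_synthesis(seed, features):
    """Start from working seed code and incrementally add features,
    keeping only candidates that survive the validation step."""
    samples = [seed]
    code = seed
    for feature in features:
        candidate = add_feature(code, feature)
        if passes_check(candidate):
            code = candidate
            samples.append(code)  # each accepted stage becomes a training sample
    return samples
```

Each accepted intermediate stage is itself a usable training sample, which is how incremental synthesis can multiply a small seed set into a larger dataset.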