NeurIPS 2025 Spotlight | HKU Proposes TreeSynth: Generating a Million-Scale Dataset from a Single Sentence
机器之心 (Jiqizhixin) · 2025-10-03 03:39
Core Insights
- TreeSynth is a novel data synthesis method inspired by decision trees, addressing the challenge of generating diverse and high-quality training data from scratch [6][7][25]
- The method ensures systematic coverage of the data space, overcoming limitations of traditional data synthesis approaches [4][25]

Methodology
- TreeSynth employs a two-phase workflow: data space partitioning and subspace data synthesis [8][12]
- In the first phase, the data space is divided into mutually exclusive subspaces using pivot samples and core criteria [9][12]
- The second phase involves generating samples within each atomic subspace based on the path description from the root to the leaf node [13][14]

Performance and Validation
- Experimental results show that TreeSynth consistently outperforms baseline methods in various benchmarks, achieving significant performance improvements [19][23]
- For instance, accuracy on the GSM8K dataset increased from 45.2% to 55.8% using the LLaMA3.1-8B model [19]
- TreeSynth also demonstrated a 45% increase in data diversity compared to baseline methods, with improved distribution in the embedding space [23]

Future Directions
- TreeSynth opens new avenues for synthesizing diverse and comprehensive training datasets, with potential for scalability in large data scenarios [26][27]
- Future exploration may focus on optimizing tree depth and partitioning criteria, as well as adapting to complex real-world scenarios [28]
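
The two-phase workflow described under Methodology can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `build_tree`, `synthesize`, `toy_split`, and `toy_generate` are hypothetical, and the two callables stand in for LLM calls with deterministic stubs. Pivot-sample selection is assumed to happen inside the splitter and is not modeled here. The sketch only shows the structure: recursive partitioning into mutually exclusive subspaces, followed by generation in each leaf conditioned on its root-to-leaf path.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical stand-in for an LLM call: given the path so far, return a
# criterion name and a list of mutually exclusive attribute values.
Splitter = Callable[[List[str]], Tuple[str, List[str]]]
# Hypothetical stand-in for an LLM call: given a leaf path and a count,
# return that many synthetic samples.
Generator = Callable[[List[str], int], List[str]]


@dataclass
class Node:
    path: List[str]  # constraints accumulated from the root to this node
    children: List["Node"] = field(default_factory=list)


def build_tree(task: str, split: Splitter, max_depth: int) -> Node:
    """Phase 1: recursively partition the data space into subspaces."""
    def grow(node: Node, depth: int) -> None:
        if depth == max_depth:
            return  # node stays a leaf, i.e. an atomic subspace
        criterion, values = split(node.path)
        for value in values:
            child = Node(path=node.path + [f"{criterion}={value}"])
            node.children.append(child)
            grow(child, depth + 1)

    root = Node(path=[task])
    grow(root, 0)
    return root


def leaves(node: Node) -> List[Node]:
    """Collect all leaf nodes in left-to-right order."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def synthesize(root: Node, generate: Generator, per_leaf: int) -> List[str]:
    """Phase 2: generate samples in every leaf, conditioned on its path."""
    data: List[str] = []
    for leaf in leaves(root):
        data.extend(generate(leaf.path, per_leaf))
    return data


# Toy deterministic stubs (assumptions for illustration only).
def toy_split(path: List[str]) -> Tuple[str, List[str]]:
    criteria = [
        ("topic", ["algebra", "geometry"]),
        ("difficulty", ["easy", "hard"]),
    ]
    return criteria[len(path) - 1]  # one criterion per tree level


def toy_generate(path: List[str], n: int) -> List[str]:
    return [f"[{' | '.join(path)}] sample {i}" for i in range(n)]


tree = build_tree("math word problems", toy_split, max_depth=2)
dataset = synthesize(tree, toy_generate, per_leaf=3)
```

With these stubs the tree has four leaves (two topics times two difficulty levels), so `dataset` holds twelve samples, each tagged with the full path of the subspace it was drawn from. In a real pipeline, the sample budget per leaf and the tree depth would together determine the dataset size.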