Why Synthetic Data Is Overrated
20VC with Harry Stebbings · 2025-08-07 05:00

Synthetic Data Limitations
- Synthetic data models excel in academic benchmark problems but struggle with real-world applications [1]
- Companies are realizing the limitations of synthetic data after investing significant time (months) in training models with it, leading to discarding large portions of the data [2]
- High-quality human-generated data, even in small quantities (e.g., a thousand or a couple thousand pieces), can be more valuable than large volumes (e.g., 10 million pieces) of synthetic data [3]

Real-World Application
- Models trained heavily on synthetic data are often ineffective in real-world use cases [2]
- Companies have spent considerable time training models on synthetic data, only to discover its shortcomings later [2]