Trained on no human language at all, yet large models get stronger?
机器之心·2026-03-14 06:33

Core Viewpoint
- The article explores the hypothesis that language may not be the only pathway to intelligence: training language models first on non-language synthetic data can outperform training on natural text alone [1][6].

Group 1: Research Findings
- A new training paradigm called "pre-pre-training" is proposed: models are first trained on synthetic data generated by Neural Cellular Automata (NCA) and only afterwards fine-tuned on natural language [7][6] (a minimal two-stage sketch follows this summary).
- This approach improved language-modeling performance by up to 6%, accelerated training convergence by 40%, and enhanced reasoning capabilities on downstream tasks [2][38].
- Models pre-pre-trained on NCA data outperformed those trained on natural text, even when the latter had significantly larger datasets [22][27].

Group 2: Data Characteristics
- NCA data possesses rich spatiotemporal structure and statistical properties similar to natural language, while being controllable and cheap to generate [8][10] (see the data-generation sketch below).
- Each NCA sequence is governed by a unique latent rule, compelling the model to infer that rule from context, which the authors argue is fundamental to developing reasoning abilities [12][39].

Group 3: Implications for Training
- The study indicates that attention mechanisms are crucial for transferring the learned capability, while MLP layers encode more domain-specific knowledge [34] (a probe of this split is sketched last below).
- The complexity of NCA data can be tailored to match specific tasks, enabling customized training curricula [42][44].
- The long-term vision is models that first acquire reasoning capability from synthetic data and only then learn semantics from carefully selected natural-language corpora [45][46].
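To make the "unique latent rule per sequence" idea concrete, here is a minimal sketch (not the paper's actual generator) of producing NCA-style synthetic sequences: each sample evolves a 1-D grid of cell states under its own small, randomly initialised MLP update rule, then serialises the space-time grid into tokens. All sizes and the discretisation scheme are illustrative assumptions.

```python
import numpy as np

def sample_nca_sequence(width=32, steps=16, hidden=8, n_tokens=16, seed=None):
    rng = np.random.default_rng(seed)
    # The latent rule: a tiny random MLP mapping each cell's
    # 3-cell neighbourhood to its next state.
    w1 = rng.normal(size=(3, hidden))
    w2 = rng.normal(size=(hidden,))
    state = rng.uniform(-1, 1, size=width)
    frames = []
    for _ in range(steps):
        # Gather left/centre/right neighbours with wrap-around.
        neigh = np.stack([np.roll(state, 1), state, np.roll(state, -1)], axis=1)
        state = np.tanh(np.tanh(neigh @ w1) @ w2)
        frames.append(state.copy())
    # Discretise continuous states into a small token vocabulary,
    # then flatten the space-time grid into one token sequence.
    grid = np.stack(frames)                              # (steps, width)
    tokens = np.digitize(grid, np.linspace(-1, 1, n_tokens - 1))
    return tokens.reshape(-1)                            # steps * width tokens

seq = sample_nca_sequence(seed=0)
print(seq[:20], seq.shape)
```

Because the rule weights are resampled per sequence, a model trained on such data cannot memorise one transition table; it has to infer the governing rule from the tokens it has seen so far, which is the property the article credits for the reasoning gains.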
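Below is a self-contained sketch of the two-stage "pre-pre-training" recipe described in Group 1. TinyLM, the random-token stand-in batches, and the step counts and learning rates are all illustrative assumptions, not the paper's actual model or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """A toy causal transformer language model (illustrative only)."""
    def __init__(self, vocab=16, d=64, n_layers=2, n_heads=4):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        layer = nn.TransformerEncoderLayer(d, n_heads, 4 * d, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d, vocab)

    def forward(self, x):
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        return self.head(self.blocks(self.emb(x), mask=mask, is_causal=True))

def train_stage(model, next_batch, steps, lr):
    """One next-token-prediction stage over batches from next_batch()."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(steps):
        tokens = next_batch()                    # (batch, seq_len + 1)
        logits = model(tokens[:, :-1])
        loss = F.cross_entropy(logits.flatten(0, 1), tokens[:, 1:].flatten())
        opt.zero_grad()
        loss.backward()
        opt.step()

# Stand-in batch sources; in the real recipe these would be NCA sequences
# (e.g. from sample_nca_sequence above) and tokenized natural language.
nca_batch = lambda: torch.randint(0, 16, (8, 65))
text_batch = lambda: torch.randint(0, 16, (8, 65))

model = TinyLM()
train_stage(model, nca_batch, steps=100, lr=3e-4)    # Stage 1: NCA only
train_stage(model, text_batch, steps=100, lr=1e-4)   # Stage 2: language
```

The design point is that stage 1 never sees human language; the claimed 6% quality gain and 40% faster convergence [2][38] are attributed to the rule-inference ability acquired there carrying over into stage 2.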
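Finally, a hedged sketch of one way to probe the attention-vs-MLP finding from Group 3: copy only the attention parameters from the NCA-pretrained model into a freshly initialised one before language training, leaving MLP and other weights at their random init. The "self_attn" name match fits TinyLM above; a real architecture would need its own matching rule, and this is one possible probe, not the paper's stated procedure.

```python
def transfer_attention_only(pretrained, fresh):
    """Copy only attention weights from `pretrained` into `fresh`."""
    src = pretrained.state_dict()
    dst = fresh.state_dict()
    for name in dst:
        if "self_attn" in name:          # attention projections only
            dst[name] = src[name].clone()
    fresh.load_state_dict(dst)
    return fresh

# If this hybrid trains on language nearly as well as the fully
# transferred model, the transferable capability lives in attention.
probe = transfer_attention_only(model, TinyLM())
train_stage(probe, text_batch, steps=100, lr=1e-4)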
