Platonic Representation Hypothesis

Read ten thousand books, and a large model can "see" the visual world? Meta reveals the origins of LLM visual priors
机器之心· 2025-10-11 04:18
Core Insights
- The research shows that visual priors in large language models (LLMs) are not a single capability but comprise two distinct types: reasoning priors and perception priors [4][6][21]
- Reasoning priors are abstract, cross-modal abilities acquired from reasoning-focused pre-training data, while perception priors concern the recognition of specific visual concepts [4][6]

Reasoning Priors
- Reasoning priors develop through pre-training on structured text such as code, mathematics, and academic papers, enabling LLMs to solve complex visual problems [4][11]
- The study indicates that increasing the proportion of reasoning-intensive text in the pre-training data significantly improves the model's visual reasoning capabilities, with gains continuing until that proportion reaches around 75% [11][13]

Perception Priors
- Perception priors emerge from diverse general corpora and are sensitive to visual instruction fine-tuning and the choice of visual encoder [6][13]
- Unlike reasoning priors, perception priors depend more on post-training visual fine-tuning data and on the characteristics of the visual encoder [13][15]

Experimental Findings
- The research involved over 100 controlled experiments and used 500,000 GPU hours to systematically trace the sources of LLM visual priors [2][8]
- The experiments showed that a small amount of visual description text is sufficient, while a large amount of reasoning data is crucial for enhancing visual capabilities [7][11]

Data Pre-training Recipe
- The research team developed an optimal data mixing scheme that balances language capabilities and visual potential, leading to superior performance on both language and visual benchmarks [17][18]
- The balanced model trained with this recipe outperformed models optimized solely for language tasks on all visual benchmarks [19]

Implications and Future Directions
- This study shifts the cultivation of multimodal capabilities from downstream fine-tuning to the language pre-training stage, and its findings support the Platonic Representation Hypothesis [21]
- It suggests that model designers can plan for future multimodal applications from the outset by embedding visual seeds during the pre-training phase [21]
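
The data mixing idea described under Reasoning Priors and Data Pre-training Recipe can be illustrated with a minimal sketch of a weighted source sampler. This is not the research team's actual recipe: the summary only states that gains level off at around 75% reasoning-intensive text and that a small amount of visual description suffices. The names `MIXTURE`, `sample_sources`, and `empirical_shares`, and the 5% and 20% shares, are hypothetical placeholders chosen for illustration.

```python
import random
from collections import Counter

# Hypothetical mixture weights. Only the ~75% reasoning share is taken from
# the summary above; the other two values are assumed for illustration.
MIXTURE = {
    "reasoning": 0.75,           # code, mathematics, academic papers
    "visual_description": 0.05,  # text describing the visual world (assumed share)
    "general": 0.20,             # remaining general corpus (assumed share)
}


def sample_sources(mixture, n, seed=0):
    """Draw n source labels with probability proportional to the mixture weights."""
    total = sum(mixture.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"mixture weights must sum to 1, got {total}")
    rng = random.Random(seed)  # local generator keeps the draw reproducible
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


def empirical_shares(labels, mixture):
    """Return the observed fraction of draws for every source in the mixture."""
    counts = Counter(labels)
    n = len(labels)
    return {name: counts[name] / n for name in mixture}


if __name__ == "__main__":
    labels = sample_sources(MIXTURE, 10_000)
    for name, share in empirical_shares(labels, MIXTURE).items():
        print(f"{name}: {share:.3f}")
```

In a real pipeline each label would select which corpus the next training document is read from. Keeping the weights in one dictionary makes it straightforward to sweep the reasoning share and observe where visual benchmark gains flatten.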