Saining Xie, Fei-Fei Li, and Yann LeCun jointly propose a new multimodal LLM paradigm: "spatial supersensing" takes the stage
机器之心·2025-11-10 03:53

Core Insights
- The article discusses "Cambrian-S," a new research effort that marks a significant step in exploring supersensing in video space [1][4]
- It builds on the earlier work "Cambrian-1," which focused on strengthening AI's visual representation learning [2]

Group 1: Definition and Importance of Supersensing
- Supersensing is defined as the way a digital entity genuinely experiences the world: absorbing endless input streams and learning continuously [4][5]
- The research emphasizes that "supersensing" capabilities must be established before "superintelligence" can be pursued [4]

Group 2: Development Path of Multimodal Intelligence
- The team outlines a developmental path for multimodal intelligence, identifying video as the ultimate medium of human experience and a direct projection of lived reality [6]
- They divide the evolution of multimodal intelligence into several stages, from language-only understanding to predictive world modeling [9]

Group 3: Benchmarking Supersensing
- The researchers conducted a two-part study to establish benchmarks for measuring supersensing, finding that existing benchmarks focus mainly on language understanding and semantic perception while neglecting higher-level spatial and temporal reasoning [14][25]
- They introduce a new benchmark, VSI-Super, designed specifically to probe spatial intelligence in continuous, long-horizon scenarios [15][26]

Group 4: Challenges in Current Models
- Current models, including Gemini-2.5-Flash, struggle with tasks that require genuine spatial cognition and long-term memory, pointing to a fundamental gap in the current paradigm [35][38]
- Even advanced models perform notably poorly on the VSI-Super benchmark, underscoring the difficulty of integrating continuous sensory experience [35][36]

Group 5: Predictive Sensing as a New Paradigm
- The researchers propose predictive sensing as the path forward: models learn to predict their sensory inputs and build internal world models so they can handle unbounded visual streams [42][43]
- The approach is inspired by theories of human cognition, emphasizing selective retention of sensory input and the ability to anticipate incoming stimuli [42][44]

Group 6: Case Studies and Results
- The article presents case studies showing that surprise-driven event segmentation improves performance on the VSI-Super benchmark; a minimal sketch of the idea appears below [49][53]
- The surprise-driven method outperforms existing models and shows better generalization [55][57]
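To make the surprise-driven segmentation idea concrete, here is a minimal sketch, not the authors' implementation: it predicts each upcoming frame feature, treats a spike in prediction error as an event boundary, and keeps only one pooled summary per event. The names `predict_next` and `cosine_surprise`, the fixed surprise threshold, and the mean-pooled summaries are illustrative assumptions; Cambrian-S's actual memory mechanism may differ.

```python
import numpy as np

def cosine_surprise(predicted, observed):
    """Surprise as 1 - cosine similarity between predicted and observed features."""
    num = float(np.dot(predicted, observed))
    denom = float(np.linalg.norm(predicted) * np.linalg.norm(observed)) + 1e-8
    return 1.0 - num / denom

def segment_stream(frame_features, predict_next, threshold=0.5):
    """Split a stream of frame features into events by prediction error.

    predict_next(current_event) returns a predicted feature vector for the
    upcoming frame (a stand-in for an internal world model). A surprise spike
    closes the current event; only one pooled summary per event is stored, so
    memory stays bounded even for very long streams.
    """
    segments = []
    current = [frame_features[0]]
    for frame in frame_features[1:]:
        predicted = predict_next(current)          # world-model prediction (assumed interface)
        surprise = cosine_surprise(predicted, frame)
        if surprise > threshold:                   # unexpected frame: close the event
            segments.append(np.mean(current, axis=0))
            current = [frame]
        else:                                      # expected frame: extend the event
            current.append(frame)
    segments.append(np.mean(current, axis=0))
    return segments

# Toy usage: two "scenes" of noisy but stable features with one abrupt change.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base_a, base_b = rng.normal(size=64), rng.normal(size=64)
    stream = [base_a + 0.05 * rng.normal(size=64) for _ in range(100)] + \
             [base_b + 0.05 * rng.normal(size=64) for _ in range(100)]
    # Naive predictor: assume the next frame repeats the last frame of the event.
    summaries = segment_stream(stream, predict_next=lambda ev: ev[-1], threshold=0.5)
    print(f"{len(stream)} frames compressed into {len(summaries)} event summaries")
```

The point of the sketch is the memory behavior: rather than storing every frame of an unbounded stream, the model retains a compact summary per event, and it is the prediction-error spike, not a fixed time window, that decides where one event ends and the next begins.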