Core Insights
- The rise of large language models (LLMs) marks a significant leap in AI technology, but achieving Artificial General Intelligence (AGI) requires more than just text understanding and generation [1]
- The future of AI development lies in the integration of multimodal information and interaction with the physical world, with a shift towards multimodal models expected to accelerate [1][2]
- The realization of AGI necessitates long-term technological accumulation and iterative scene development, overcoming key bottlenecks such as spatial perception and data scarcity [2][8]

Multimodal Development
- The evolution of large models is transitioning from single-language models to native multimodal architectures, which integrate various types of information during the pre-training process [4][5]
- Current multimodal models need to extend from understanding to thinking, incorporating both logical and visual thinking processes [4][5]
- Domestic companies are expected to adopt multimodal models comprehensively by the second half of 2025, moving away from standalone language models [5]

Challenges in Achieving AGI
- Key challenges include the generalization of reasoning capabilities from narrow domains to complex real-life scenarios, as well as the current limitations in spatial perception of multimodal models [2][7]
- The development of agents, seen as crucial for AI's real-world application, faces significant gaps in understanding complex conditions and specific industry needs [6][7]
- The ability of agents to effectively solve problems in real scenarios is essential for their perceived value and reliability [6]

Bottlenecks in Embodied Intelligence
- Embodied intelligence must bridge the gap between digital and physical spaces, with current data acquisition methods relying heavily on limited robotic operations [8]
- The data throughput for embodied intelligence is significantly lower than that available from the internet, creating a challenge for effective development [8]
- To advance embodied intelligence, leveraging prior knowledge and multimodal data from the internet is necessary, as relying solely on real-world data is insufficient [8]
SenseTime's Lin Dahua: Embodied Intelligence Requires Connecting Digital Space and Physical Space