Native Multimodal
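SenseTime's Lin Dahua Answers the AGI Question in a 10,000-Character Essay: 4 Layers of Breakthroughs, 3 Major Challenges
QbitAI (量子位) · 2025-08-12 09:35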
Core Viewpoint - The article identifies "multimodal intelligence" as a key trend in large-model development, highlighted at the WAIC 2025 conference, where SenseTime introduced its commercial-grade multimodal model, SenseNova 6.5 (日日新 6.5) [1][2].

Group 1: Importance of Multimodal Intelligence
- Multimodal intelligence is considered essential for achieving Artificial General Intelligence (AGI) because it lets AI interact with the world in a more human-like way, processing images, sound, and text together [7][8].
- Traditional language models rely solely on text data; the article argues that true AGI requires the ability to understand and integrate multiple modalities [8].

Group 2: Technical Pathways to Multimodal Models
- SenseTime identifies two primary technical pathways for building multimodal models: Adapter-based Training and Native Training. It prefers the latter because the model learns an integrated understanding of the modalities from the outset [11][12].
- The company has committed significant computational resources to a "native multimodal" approach, moving away from a dual-track system of separate language and image models [10][12].

Group 3: Evolutionary Path of Multimodal Intelligence
- SenseTime outlines a "four-breakthrough" framework for the evolution of AI capabilities: sequence modeling, multimodal understanding, multimodal reasoning, and interaction with the physical world [13][22].
- "Interleaved image-text reasoning" is presented as a key innovation: the model can generate and manipulate images during the reasoning process, which strengthens its cognitive capabilities [16][18].

Group 4: Data Challenges and Solutions
- High-quality image-text pairs for training multimodal models are hard to acquire; SenseTime has built automated pipelines to generate these pairs at scale [26][27].
- SenseTime applies a rigorous "continuation validation" mechanism to ensure data quality: only data that demonstrates a performance improvement is admitted into training [28][29].

Group 5: Model Architecture and Efficiency
- The architecture work emphasizes efficiency over sheer size; SenseTime reports optimizing its model to achieve over three times the efficiency while maintaining performance [38][39].
- The company believes future model development will prioritize performance-cost ratio rather than simply increasing parameter counts [39].

Group 6: Organizational and Strategic Insights
- SenseTime attributes its progress to a strong technical foundation in computer vision, which gave it deep insight into the value of multimodal capabilities [40].
- The company has restructured its research organization to improve resource allocation and foster innovation, keeping the focus on high-impact projects [41].

Group 7: Long-term Vision and Integration of Technology and Business
- The article concludes that the path to AGI is a long-term endeavor that requires a symbiotic relationship between technological ideals and commercial viability [42][43].
- SenseTime aims to create a virtuous cycle among foundational infrastructure, model development, and applications, so that real-world challenges inform research directions [43].
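
The data-gating idea in Group 4 can be illustrated with a small sketch. The summary does not say how SenseTime implements "continuation validation", so everything here is an assumption: the name `continuation_validate`, the `continue_and_evaluate` callback (imagined as a short continued-training run followed by a benchmark score), the `min_gain` threshold, and the toy scores are all hypothetical.

```python
from typing import Callable, List, Sequence


def continuation_validate(
    baseline_score: float,
    candidate_batches: Sequence[str],
    continue_and_evaluate: Callable[[str], float],
    min_gain: float = 0.0,
) -> List[str]:
    """Keep only the batches whose trial run scores above the baseline.

    `continue_and_evaluate` is a hypothetical callback: it is assumed to
    continue training briefly on one batch and return a benchmark score.
    """
    accepted = []
    for batch in candidate_batches:
        score = continue_and_evaluate(batch)
        # Admit the batch only if it beats the baseline by more than min_gain.
        if score > baseline_score + min_gain:
            accepted.append(batch)
    return accepted


# Toy stand-in for real training runs: made-up scores per candidate batch.
toy_scores = {
    "synthetic_pairs_a": 0.71,
    "synthetic_pairs_b": 0.68,
    "web_pairs_c": 0.70,
}
kept = continuation_validate(0.70, list(toy_scores), toy_scores.__getitem__)
print(kept)
```

With the toy scores above, only the batch that strictly improves on the 0.70 baseline is kept; a batch that merely matches the baseline is rejected, which mirrors the "only data that demonstrates improvement" wording.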
Tencent's Zhang Zhengyou: The Three "Real Questions" Embodied Intelligence Must Answer
Synced (机器之心) · 2025-08-10 04:31
Core Viewpoint - Tencent has launched the Tairos platform for embodied intelligence, a modular platform that offers large models, development tools, and data services to support the development and application of embodied intelligence [2][3].

Group 1: Platform Development
- The Tairos platform is the culmination of more than seven years of research at Tencent's Robotics X Lab, which has built various robot prototypes to explore full-stack robotics technology [2][3].
- The platform reflects Tencent's response to current industry challenges and its strategic positioning for the future ecosystem [2][3].

Group 2: Architectural Choices
- The debate between end-to-end and layered architectures in embodied intelligence is ongoing; the article favors a layered architecture for its efficiency and practicality [4][5].
- A layered architecture allows human prior knowledge to be built into the model structure, which improves training efficiency and reduces data dependency [6][7].

Group 3: Knowledge Feedback Mechanism
- The SLAP³ architecture proposed by Tencent comprises multimodal perception models, planning models, and action models; the layers collaborate dynamically, and the flow of information between them depends on task complexity [7][11].
- A memory bank captures the unique interaction data produced by the action model. This data can be used to update the perception and planning models, creating a feedback loop for continuous learning [11][12].

Group 4: Evolution of Models
- The architecture is designed for continuous iteration, so that prior knowledge can be adjusted as new insights are gained, similar to the evolution of the Transformer architecture [12][15].
- The goal is to move toward a more efficient, native form of multimodal intelligence, despite current limitations in data availability and model exploration [15][16].

Group 5: Innovation and Commercialization
- The influx of talent and capital into embodied intelligence is beneficial, but short-term commercial gains need to be balanced against long-term technological goals [23][24].
- Companies must keep a clear view of their ultimate objectives and have the courage to forgo immediate commercial opportunities in order to focus on foundational scientific challenges [25].
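
The layered design and memory-bank feedback loop described in Group 3 can be sketched as a toy pipeline. This is not Tencent's SLAP³ implementation: the class names `MemoryBank` and `LayeredAgent`, the `complex_task` flag used to route between layers, and the lambda stand-ins for the perception, planning, and action models are all hypothetical, chosen only to make the information flow concrete.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class MemoryBank:
    """Stores (scene, plan, outcome) records produced while acting."""
    records: List[Tuple[str, str, str]] = field(default_factory=list)

    def add(self, scene: str, plan: str, outcome: str) -> None:
        self.records.append((scene, plan, outcome))


@dataclass
class LayeredAgent:
    perceive: Callable[[str], str]  # raw input -> scene description
    plan: Callable[[str], str]      # scene description -> plan
    act: Callable[[str], str]       # plan -> outcome
    memory: MemoryBank = field(default_factory=MemoryBank)

    def step(self, raw_input: str, complex_task: bool) -> str:
        scene = self.perceive(raw_input)
        # Assumed routing rule: simple tasks bypass the planning layer.
        plan = self.plan(scene) if complex_task else scene
        outcome = self.act(plan)
        # Every interaction is logged so upstream layers can be updated later.
        self.memory.add(scene, plan, outcome)
        return outcome


# Toy stand-ins for the three model layers.
agent = LayeredAgent(
    perceive=lambda x: f"scene({x})",
    plan=lambda s: f"plan[{s}]",
    act=lambda p: f"done:{p}",
)
first = agent.step("cup on table", complex_task=True)
second = agent.step("open gripper", complex_task=False)
print(first, second, len(agent.memory.records))
```

In a real system the records accumulated in `agent.memory` would serve as training data for updating the perception and planning models; here they are simply collected to show where the feedback loop would draw from.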