SenseNova V6.5 (日日新 V6.5) multimodal model
SenseTime's Lin Dahua: Cracking Image-Text Interleaved Chain-of-Thought Technology, and SenseTime's "Two-Step" Path
36Kr · 2025-08-15 09:09
Core Insights
- SenseTime has launched the SenseNova V6.5 (日日新 V6.5) multimodal model, the first commercial-grade model in China to implement image-text interleaved chain-of-thought reasoning [2]
- Multimodal intelligence is essential to achieving Artificial General Intelligence (AGI) because it integrates multiple forms of information processing, much as human sensory perception does [4][5]
- SenseTime builds multimodal intelligence through a progressive evolution of four key breakthroughs, culminating in the integration of digital and physical spaces [5][12]

Multimodal Intelligence and AGI
- Multimodal intelligence is seen as a necessary pathway to AGI because it enables autonomous interaction with the external world beyond language alone [4]
- The ability to process and analyze different modalities of information is crucial for practical applications and for delivering comprehensive value [4]

Development Pathway
- SenseTime's development strategy includes the early introduction of multimodal models and significant advances in multimodal reasoning capabilities [5][8]
- The company reached a significant milestone by completing the training of a billion-parameter multimodal model, which ranks first in domestic evaluations [8]

Native Multimodal Training
- SenseTime has opted for native multimodal training, which integrates multiple modalities from the pre-training phase onward, rather than the more common adaptive training method [7][9]
- This approach lets the model learn the relationships between the language and visual modalities more deeply, producing a more cohesive model [7]

Model Architecture and Efficiency
- The architecture of the SenseNova V6.5 model has been optimized for efficiency, improving the processing of high-resolution images and long videos and delivering more than three times the efficiency of previous models [11]
- The design philosophy emphasizes the distinction between visual perception and language processing, leading to a more effective model structure [11]

Challenges and Solutions in Embodied Intelligence
- Moving AI from digital to physical spaces requires improving the efficiency of interaction learning, which is addressed with a virtual system that simulates real-world interactions [12]
- SenseTime's "world model" draws on extensive data to strengthen its simulation and generation capabilities, improving the training of intelligent driving systems [12]

Balancing Technology and Commercialization
- SenseTime views the pursuit of AGI as a long-term endeavor that requires balancing technological breakthroughs against commercial viability [13]
- The company has established a three-pronged strategy covering infrastructure, models, and applications to create a positive feedback loop between technology and business [13][14]

Recent Achievements
- Over the past year, SenseTime has made significant progress in its foundational technology, including innovations such as native fusion training and multimodal reinforcement learning [14]
- The commercial landscape is expanding rapidly, with improved AI performance driving wider deployment in intelligent hardware and robotics applications [14]