Embodied intelligence welcomes a heavyweight: ten years of multimodal groundwork, world models paving the way, SenseTime's "Wuneng" has arrived
量子位· 2025-07-27 11:57
Core Viewpoint
- SenseTime officially announced its entry into embodied intelligence with the launch of the "Wuneng" embodied intelligence platform at the WAIC 2025 large model forum [1][2].

Group 1: SenseTime's Technological Advancements
- SenseTime introduced the "Riri Xin V6.5" multimodal reasoning model, whose image-text interleaved thinking chain significantly improves cross-modal reasoning accuracy [3][4] (a hedged sketch of what such an interleaved chain might look like appears at the end of this summary).
- The new model outperforms Gemini 2.5 Pro in multimodal reasoning across multiple datasets, underscoring its competitiveness [8].
- Compared with its predecessor, Riri Xin 6.0, the V6.5 model improves performance by 6.99% while cutting reasoning cost to 30% of the previous version, which the company presents as a fivefold gain in cost-effectiveness [10].

Group 2: Transition to Embodied Intelligence
- SenseTime's move into embodied intelligence is a natural progression from its strengths in visual perception and multimodal modeling to interaction with the physical world [12][13].
- The company's more than ten years of industry experience, particularly in autonomous driving, has supplied valuable data and world-model experience for developing embodied intelligence [13].
- The "Wuneng" platform combines the general capabilities of the Riri Xin multimodal models with the experience of building and using world models, aiming to create an ecosystem for embodied intelligence [14].

Group 3: World Model Capabilities
- The "KAIWU" world model generates multi-view videos, maintains temporal consistency for up to 150 seconds, and draws on a library of more than 100,000 3D assets [16][18].
- It understands spatial occlusion and layering as well as temporal change and motion patterns, enabling realistic object representation [17][20].
- The platform processes people, objects, and environments simultaneously, building a 4D representation of the real world [21].

Group 4: Industry Collaboration and Data Utilization
- SenseTime is pursuing a "software-hardware collaboration" strategy, partnering with humanoid robot and logistics platform manufacturers to pre-install its models and strengthen the hardware's multimodal perception and reasoning [29].
- To address the industry-wide shortage of training data, the company generates synthetic data in virtual environments and calibrates it with real-world samples [32][33] (see the toy sim-to-real calibration sketch at the end of this summary).
- Training that combines first-person and third-person perspectives lets the model learn from human demonstrations while executing tasks from its own sensory input [26][35].

Group 5: Future Outlook and Competitive Edge
- Large-scale simulation, real-data feedback from deployed hardware, and the fusion of different perspectives form a self-reinforcing data ecosystem expected to drive continuous model upgrades [39].
- SenseTime is positioned to lead in embodied intelligence by leveraging multimodal capabilities and hardware collaboration to build a competitive moat [40].
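Note on the interleaved thinking chain: the source describes this feature only at a high level, so the following is a minimal, purely illustrative Python sketch of what an image-text interleaved reasoning chain could look like as a data structure. All names here (ThoughtStep, InterleavedChain, the example fields and values) are hypothetical and are not SenseTime's API.

```python
# Illustrative only: a minimal data structure for an image-text interleaved
# reasoning chain. Names and fields are hypothetical, not SenseTime's API.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ThoughtStep:
    """One step in the chain: text reasoning, optionally grounded in an image region."""
    text: str
    image_ref: Optional[str] = None   # image the step reasons about, if any
    bbox: Optional[tuple] = None      # (x, y, w, h) region the step attends to, if any


@dataclass
class InterleavedChain:
    question: str
    steps: List[ThoughtStep] = field(default_factory=list)

    def add_text(self, text: str) -> None:
        self.steps.append(ThoughtStep(text=text))

    def add_visual(self, text: str, image_ref: str, bbox: tuple) -> None:
        # Visual steps keep a pointer to the evidence they reason over,
        # so later text steps can refer back to concrete pixels.
        self.steps.append(ThoughtStep(text=text, image_ref=image_ref, bbox=bbox))


if __name__ == "__main__":
    chain = InterleavedChain(question="Is the mug to the left of the laptop?")
    chain.add_visual("Locate the mug.", image_ref="frame_000.png", bbox=(120, 340, 60, 80))
    chain.add_visual("Locate the laptop.", image_ref="frame_000.png", bbox=(400, 300, 220, 160))
    chain.add_text("The mug's x-range (120-180) lies left of the laptop's (400-620), so yes.")
    for i, step in enumerate(chain.steps, 1):
        kind = "visual" if step.image_ref else "text"
        print(f"step {i} ({kind}): {step.text}")
```

The point of the sketch is only that reasoning steps alternate between text and visual evidence, which is what "image-text interleaved" suggests; the real model's internal format is not disclosed in the article.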
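Note on synthetic data with real-world calibration: the summary mentions generating data in virtual environments and correcting it with real samples. Below is a toy, self-contained sketch of that general sim-to-real idea, a linear calibration fitted from a handful of paired samples; the function names and numbers are invented for illustration and do not reflect SenseTime's actual pipeline.

```python
# Illustrative only: a toy sim-to-real calibration loop. The idea -- generate
# abundant synthetic samples, then correct their systematic bias with a small
# number of real measurements -- is generic; nothing here is SenseTime's code.
import random


def simulate_grasp_distance(n: int) -> list[float]:
    """Synthetic 'distance to object' readings from a virtual environment."""
    return [random.uniform(0.2, 1.5) for _ in range(n)]


def fit_linear_calibration(sim: list[float], real: list[float]) -> tuple[float, float]:
    """Least-squares fit real = a * sim + b from a few paired samples."""
    n = len(sim)
    mean_s, mean_r = sum(sim) / n, sum(real) / n
    cov = sum((s - mean_s) * (r - mean_r) for s, r in zip(sim, real))
    var = sum((s - mean_s) ** 2 for s in sim)
    a = cov / var
    b = mean_r - a * mean_s
    return a, b


if __name__ == "__main__":
    # Large synthetic set, small paired real set (here faked with a known bias).
    synthetic = simulate_grasp_distance(10_000)
    paired_sim = simulate_grasp_distance(20)
    paired_real = [0.9 * s + 0.05 for s in paired_sim]   # stand-in for real measurements

    a, b = fit_linear_calibration(paired_sim, paired_real)
    calibrated = [a * s + b for s in synthetic]
    print(f"calibration: real = {a:.3f} * sim + {b:.3f}; "
          f"{len(calibrated)} synthetic samples corrected")
```

In practice such calibration would involve far richer signals (images, poses, contact forces) and learned rather than linear corrections; the sketch only shows the cheap-synthetic-data, scarce-real-data division of labor the article alludes to.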