KAIWU World Model

SenseTime's Lin Dahua: Cracking Image-Text Interleaved Chain-of-Thought, SenseTime's "Two-Step" Path
36Ke · 2025-08-15 09:09
Core Insights
- SenseTime has launched the Riri Xin V6.5 multimodal model, which is the first commercial-grade model in China to achieve "image-text interleaved thinking chain" technology [2]
- The development of multimodal intelligence is essential for achieving Artificial General Intelligence (AGI), as it allows for the integration of various forms of information processing, similar to human sensory perception [4][5]
- SenseTime's approach to building multimodal intelligence involves a progressive evolution through four key breakthroughs, culminating in the integration of digital and physical spaces [5][12]

Multimodal Intelligence and AGI
- Multimodal intelligence is seen as a necessary pathway to AGI, as it enables autonomous interaction with the external world beyond just language [4]
- The ability to process and analyze different modalities of information is crucial for practical applications and achieving comprehensive value [4]

Development Pathway
- SenseTime's development strategy includes the early introduction of multimodal models and significant advancements in multimodal reasoning capabilities [5][8]
- The company has achieved a significant milestone by completing the training of a billion-parameter multimodal model, which ranks first in domestic evaluations [8]

Native Multimodal Training
- SenseTime has opted for native multimodal training, which integrates multiple modalities from the pre-training phase, as opposed to the more common adaptive training method [7][9]
- This approach allows for a deeper understanding of the relationships between language and visual modalities, leading to a more cohesive model [7]

Model Architecture and Efficiency
- The architecture of the Riri Xin V6.5 model has been optimized for efficiency, allowing for better processing of high-resolution images and long videos, achieving over three times the efficiency of previous models [11]
- The design philosophy emphasizes the distinction between visual perception and language processing, leading to a more effective model structure [11]

Challenges and Solutions in Embodied Intelligence
- Transitioning AI from digital to physical spaces requires addressing interaction learning efficiency, which is facilitated by a virtual system that simulates real-world interactions [12]
- SenseTime's "world model" leverages extensive data to enhance simulation and generation capabilities, improving the training of intelligent driving systems [12]

Balancing Technology and Commercialization
- SenseTime views the pursuit of AGI as a long-term endeavor that requires a balance between technological breakthroughs and commercial viability [13]
- The company has established a three-pronged strategy focusing on infrastructure, models, and applications to create a positive feedback loop between technology and business [13][14]

Recent Achievements
- Over the past year, SenseTime has made significant progress in its foundational technology, achieving innovations such as native fusion training and multimodal reinforcement learning [14]
- The commercial landscape is rapidly expanding, with AI performance leading to increased deployment in various intelligent hardware and robotics applications [14]
SenseTime's Wang Xiaogang: World Models Will Accelerate AI's Move from Digital Space into the Physical World; "Wuneng" Aims to Be That Bridge
Ji Qi Zhi Xin · 2025-08-12 07:34
Core Viewpoint
- The article discusses the emergence of embodied intelligence and the significance of the "world model" as a core component in advancing AI towards human-like intelligence, highlighting the competitive landscape in the AI industry as it evolves towards embodied intelligence [1][2]

Industry Developments
- Major companies like Google, Huawei, and ByteDance are launching various embodied intelligence platforms and models, indicating a rapid evolution in this field [3]
- SenseTime, leveraging its expertise in computer vision and multimodal large models, aims to empower the industry through its "Wuneng" embodied intelligence platform, which integrates years of technological accumulation [3][5]

Technical Challenges
- The industry faces challenges such as data scarcity, difficulty in large-scale production, and the need for generalization in embodied intelligence applications [5][13]
- The company's computer vision expertise is seen as a potential solution to enhance the learning of world models and improve the capabilities of embodied intelligence [14]

World Model Significance
- The world model is recognized as a crucial element for prediction and planning in autonomous systems, enabling robots to interact intelligently with their environments [12][17]
- SenseTime's "KAIWU" world model is designed to provide extensive data and facilitate simulation-based learning, significantly reducing data collection costs [17][20]

Platform Features
- The "Wuneng" platform offers a comprehensive approach by combining first-person and third-person perspectives for robot learning, enhancing the understanding of robot behavior [27][29]
- The platform aims to address the data challenges in the industry by providing synthetic data and facilitating the development of various robotic applications [26][31]

Future Implications
- As embodied intelligence matures, it is expected to transform human-robot interactions and create new social networks involving robots, enhancing their roles in daily life [36][37]
- The integration of embodied intelligence into common environments like homes and workplaces is anticipated to unlock significant value and functionality [39]
After ChatGPT Peaks, World Models Are AI's New Battleground: China Is Already One Step Ahead!
Lao Xu Zhua AI Qu Shi · 2025-07-31 01:03
Core Viewpoint
- The article discusses the transition from large language models (LLMs) to "world models" as the next competitive focus in AI, highlighting the limitations of LLMs and the potential of world models to reshape AI's future and drive economic growth [2][5][28]

Summary by Sections

AI's Evolution
- AI development is categorized into three stages: perceptual AI, generative AI, and embodied AI, with each stage representing significant technological advancements [5][18]

Stage One: Perceptual AI
- The breakthrough in perceptual AI occurred in 2012 when Geoffrey Hinton's team surpassed human image recognition accuracy, but its capabilities were limited to recognition without reasoning or cross-domain learning [7][9]

Stage Two: Generative AI
- The introduction of the Transformer architecture in 2017 marked a qualitative leap, enabling AI to train on vast amounts of text data, significantly increasing its knowledge base [12][13]
- However, this growth is nearing a limit, with predictions that usable internet data for training will peak around 2028 [15]

Stage Three: Embodied AI
- The next phase involves embodied AI, where AI learns through interaction with the real world rather than just textual data, necessitating the development of world models [16][18]

What is a World Model?
- A world model is a high-precision simulator that adheres to physical laws, allowing AI to learn through trial and error in a virtual environment, significantly reducing the data collection costs associated with real-world training [19][20]

Challenges of World Models
- Unlike simple video generation, world models must ensure consistency with physical laws to be effective for training AI, addressing issues like physical inconsistencies in generated scenarios [20][22]

Breakthroughs by SenseTime
- SenseTime's "KAIWU" world model allows users to describe scenarios in natural language, generating videos that comply with physical laws, thus revolutionizing training for autonomous driving and robotics [22][24]

Implications of World Models
- The shift to world models will change data production methods, enhance training efficiency, and transform industries such as autonomous driving, robotics, manufacturing, healthcare, and education [28]

Future Outlook
- The emergence of world models is anticipated to accelerate economic growth, with the potential for a "ChatGPT moment" in the next 1-2 years, driven by unprecedented investment and innovation in the AI sector [28][29]
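The "learn through trial and error in a virtual environment" idea can be sketched in a few lines. This is a toy illustration only: the one-dimensional "physics", the `ToyWorldModel` class, and the random-search loop are invented for this sketch and are not drawn from SenseTime's actual KAIWU implementation.

```python
import random

class ToyWorldModel:
    """Stand-in for a learned simulator: given a state and an action,
    predict the next state and a reward. The dynamics here are a made-up
    1-D target-reaching task, purely for illustration."""

    def __init__(self, target=1.5):
        self.target = target

    def step(self, state, action):
        next_state = state + action               # trivial "physics"
        reward = -abs(self.target - next_state)   # closer to target is better
        return next_state, reward

def train_in_simulation(model, episodes=200, seed=0):
    """Trial-and-error inside the simulator: sample random actions and
    keep the best one. Far simpler than real policy learning, but it shows
    why cheap simulated rollouts matter -- each trial costs only compute."""
    rng = random.Random(seed)
    best_action, best_reward = 0.0, float("-inf")
    for _ in range(episodes):
        action = rng.uniform(-2.0, 2.0)
        _, reward = model.step(0.0, action)
        if reward > best_reward:
            best_action, best_reward = action, reward
    return best_action

best = train_in_simulation(ToyWorldModel(target=1.5))
```

A real world model replaces the hand-written `step` with a learned neural simulator and the random search with reinforcement learning, but the economics are the same: rollouts happen in simulation rather than through costly real-world data collection.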
Can Present PPTs and Understand Commands! SenseTime's "Wuneng" Platform Lets Robots "Master" the Real World | Spotlight on the World Artificial Intelligence Conference
Guo Ji Jin Rong Bao · 2025-07-27 19:20
Core Insights
- The evolution of AI has transitioned from perceptual intelligence to generative intelligence, with future breakthroughs dependent on AI's ability to actively explore and interact with the real world [1][3]

Group 1: AI Development Stages
- The development of human intelligence is rooted in continuous interaction with the physical world, while machine intelligence has been limited by the finite supply of human knowledge [3]
- The rise of deep learning algorithms, such as CNN and ResNet, from 2011 to 2012 led to an explosive growth in perceptual AI, but these models are constrained by their reliance on manually labeled data [3]
- The introduction of the Transformer architecture around 2017 to 2018 enabled AI to extract knowledge from natural language, with models like GPT-3 processing text equivalent to 100,000 years of human creative output [3]

Group 2: Challenges in AI Evolution
- A looming crisis is anticipated as natural language data may be exhausted by 2027 to 2028, while visual data, although abundant, is challenging to extract knowledge from effectively [3]
- The production rate of visual data lags behind the growth of computational power, leading to a mismatch in data requirements for models [3]
- The next phase of AI development necessitates overcoming the challenge of scarce high-quality interactive data, drawing inspiration from human learning through physical world interactions [3]

Group 3: Solutions and Innovations
- The high cost of real-world interactions presents a significant barrier, with traditional solutions relying on simulators that often fail to bridge the "Sim-to-Real Gap" [4]
- To address these challenges, the company has introduced the "KAIWU" world model, which generates high-quality simulated data by considering temporal and spatial consistency [4]
- The "WUNENG" embodied intelligence platform, powered by the company's world model, provides robust perception, visual navigation, and multimodal interaction capabilities for robots and smart devices [6]

Group 4: Practical Applications
- The "WUNENG" platform enables various hardware to achieve a deep understanding of the world, supporting integration into edge chips for enhanced adaptability [6]
- The embodied world model can generate multi-perspective videos while ensuring temporal and spatial consistency, allowing machines to understand, generate, and edit real-world scenarios [8]
- Users can interact with the embodied world model through simple prompts, enabling autonomous generation of poses, actions, and commands in real-world contexts [8]
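To make "temporal consistency" concrete, here is a minimal, hypothetical check: frames are represented as dictionaries mapping tracked-object names to positions, and a sequence counts as consistent if no object "teleports" between adjacent frames. The representation, the `max_jump` threshold, and the function name are illustrative assumptions, not part of the KAIWU or WUNENG systems.

```python
def temporal_consistency(frames, max_jump=0.1):
    """Toy metric: a frame sequence is 'temporally consistent' if no
    tracked object's position changes by more than max_jump between
    consecutive frames. Real video world models enforce far richer
    constraints (occlusion, lighting, physics), but the principle --
    rejecting physically impossible jumps -- is the same."""
    for prev, cur in zip(frames, frames[1:]):
        for obj, pos in cur.items():
            if obj in prev and abs(pos - prev[obj]) > max_jump:
                return False
    return True

# A cup sliding smoothly passes; a cup that teleports fails.
smooth = [{"cup": 0.00}, {"cup": 0.05}, {"cup": 0.09}]
jumpy = [{"cup": 0.00}, {"cup": 0.50}]
ok = temporal_consistency(smooth)
bad = temporal_consistency(jumpy)
```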
Embodied Intelligence Gets a Heavyweight Entrant! Built on a Decade of Multimodal Work, with World Models Leading the Way: SenseTime's "Wuneng" Has Arrived
QbitAI · 2025-07-27 11:57
Core Viewpoint
- SenseTime officially announced its entry into the field of embodied intelligence with the launch of the "Wuneng" embodied intelligence platform at the WAIC 2025 large model forum [1][2]

Group 1: SenseTime's Technological Advancements
- SenseTime introduced the "Riri Xin V6.5" multimodal reasoning model, which features a unique image-text interleaved thinking chain that significantly enhances cross-modal reasoning accuracy [3][4]
- The new model outperforms Gemini 2.5 Pro in multimodal reasoning capabilities across multiple datasets, showcasing its competitive edge [8]
- Compared to its predecessor, Riri Xin 6.0, the V6.5 model has improved performance by 6.99% while reducing reasoning costs to only 30% of the previous version, resulting in a fivefold increase in cost-effectiveness [10]

Group 2: Transition to Embodied Intelligence
- SenseTime's shift towards embodied intelligence is a natural progression from its expertise in visual perception and multimodal capabilities to physical-world interaction [12][13]
- The company has accumulated over ten years of industry experience, particularly in autonomous driving, which has provided valuable data and world model experience for the development of embodied intelligence [13]
- The "Wuneng" platform integrates the general capabilities of the Riri Xin multimodal model with the experience of building and utilizing world models, aiming to create an ecosystem for embodied intelligence [14]

Group 3: World Model Capabilities
- The "KAIWU" world model supports the generation of multi-perspective videos and can maintain temporal consistency for up to 150 seconds, utilizing a database of over 100,000 3D assets [16][18]
- It can understand occlusion and layering spatially, as well as temporal changes and motion patterns, allowing for realistic object representation [17][20]
- The platform can simultaneously process people, objects, and environments, creating a 4D representation of the real world [21]

Group 4: Industry Collaboration and Data Utilization
- SenseTime is pursuing a "soft and hard collaboration" strategy, partnering with various humanoid robot and logistics platform manufacturers to pre-install its models, enhancing the multimodal perception and reasoning capabilities of hardware [29]
- The company is addressing the common industry challenge of data scarcity by generating synthetic data in virtual environments and using real-world samples for calibration [32][33]
- The integration of first-person and third-person perspectives in training enhances the model's ability to learn from human demonstrations while executing tasks from its own sensory input [26][35]

Group 5: Future Outlook and Competitive Edge
- SenseTime is establishing a self-reinforcing data ecosystem through large-scale simulations, real data feedback from hardware, and the fusion of different perspectives, which is expected to drive continuous model upgrades [39]
- The company is positioned to lead the future of embodied intelligence by leveraging multimodal capabilities and hardware collaboration to build a competitive moat in the industry [40]
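The "synthetic data in virtual environments plus real-world samples for calibration" recipe can be sketched as follows. The 20% real fraction and the `Sample`/`build_training_mix` names are illustrative assumptions; the article does not disclose SenseTime's actual mixing ratio or pipeline.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    features: list
    source: str  # "synthetic" (generated in simulation) or "real" (collected)

def build_training_mix(synthetic, real, real_fraction=0.2):
    """Hypothetical recipe: abundant, cheap synthetic data is padded with
    a smaller slice of scarce real-world samples, so the model is anchored
    to reality while most of the volume comes from simulation."""
    # Number of real samples needed so they make up real_fraction of the mix.
    n_real = max(1, int(len(synthetic) * real_fraction / (1 - real_fraction)))
    return list(synthetic) + list(real[:n_real])

synthetic = [Sample([0.0], "synthetic") for _ in range(80)]
real = [Sample([1.0], "real") for _ in range(40)]
mix = build_training_mix(synthetic, real, real_fraction=0.2)
```

The design point is the asymmetry: simulation output scales with compute, while real samples are expensive, so the real slice is kept small and used to keep the synthetic distribution calibrated.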
SenseTime Chairman and CEO Xu Li: Once Data Is Exhausted, AI's Evolution Must Connect with the Physical World
21 Shi Ji Jing Ji Bao Dao · 2025-07-27 02:41
Core Insights
- The evolution of AI has transitioned from perceptual intelligence to generative intelligence, with future breakthroughs relying on active exploration and interaction with the real world [2]
- Current natural language data may be exhausted by 2027-2028, while visual data, although abundant, is challenging to extract knowledge from [2][3]
- The growth of computational power is outpacing the generation of data, leading to a mismatch in model data requirements [3]

Group 1
- The origin of human intelligence is rooted in continuous interaction with the physical world, which has been a limitation for machine intelligence due to the finite supply of human knowledge [2]
- Deep learning algorithms, such as CNN and ResNet, spurred the explosion of perceptual AI from 2011 to 2012, but these models are limited by their reliance on manually labeled data [2]
- The introduction of the Transformer architecture in 2017-2018 allowed AI to extract knowledge from natural language, with models like GPT-3 processing text equivalent to a hundred thousand years of human creative output [2]

Group 2
- The next stage of AI development requires overcoming the challenge of scarce active interaction data, as human learning is based on interaction with the physical world rather than solely on language or visual inputs [3]
- The high cost of real-world interaction and the limitations of traditional solutions, such as simulators, contribute to the "Sim-to-Real Gap," where generated data may not accurately reflect reality [3]
- The company has introduced the "KAIWU" world model, which aims to provide high-quality simulated data by considering temporal and spatial consistency, enhancing AI training capabilities [3]
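One standard technique for narrowing the Sim-to-Real Gap discussed above is domain randomization: jittering the physical parameters of the simulator so a model trained on generated data does not overfit a single idealized world. The sketch below uses invented parameter names and ranges for illustration; it is not drawn from the KAIWU model.

```python
import random

def randomize_scene(base_friction, base_light, rng):
    """Domain randomization sketch: perturb simulator parameters around
    their nominal values so that training data covers a band of plausible
    worlds instead of one exact configuration. Ranges are illustrative."""
    return {
        "friction": base_friction * rng.uniform(0.8, 1.2),  # +/-20% jitter
        "lighting": base_light + rng.uniform(-0.1, 0.1),    # additive jitter
    }

rng = random.Random(42)
scenes = [randomize_scene(0.5, 1.0, rng) for _ in range(100)]
```

A policy that performs well across all of these randomized scenes is more likely to transfer to the real world, whose true parameters sit somewhere inside the randomized band.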
Exclusive: Wang Junping, Former Senior Director of Intelligent Driving at Nezha Auto, Joins SenseTime Jueying
Lei Feng Wang · 2025-03-24 10:04
Core Viewpoint
- SenseTime's R-UniAD end-to-end autonomous driving solution is set to be unveiled at the Shanghai Auto Show in April, with real vehicle deployment completed and delivery expected by the end of the year [1][3]

Group 1: Company Developments
- Wang Junping, former senior director of intelligent driving at Nezha Auto, joined SenseTime's autonomous driving division in February 2023, having previously been part of Baidu's intelligent driving team [2]
- SenseTime has been collaborating with Nezha Auto since September 2021, focusing on intelligent driving and smart cockpit technologies [2]
- Wang Weibao, who took over from Shijianping as the head of intelligent driving, joined SenseTime at the end of 2023 and has a background in Apple's autonomous driving team and as CTO at New Stone Unmanned Vehicle Company [3]

Group 2: Industry Context
- The autonomous driving sector is experiencing intensified competition, particularly for companies not in the top tier, highlighting the challenges faced by solution providers [3]
- SenseTime collaborates with over 30 automotive companies, including GAC, BYD, Honda, and NIO, with solutions already deployed in models like the Haobo and Nezha's super sedan [3]