NEO Multimodal Model Architecture
SenseTime (00020) Rises 3.38% After Releasing and Open-Sourcing New Multimodal Model Architecture "NEO"
Xin Lang Cai Jing· 2025-12-04 11:25
Core Viewpoint
- Shares of SenseTime (00020) rose 3.38% to HKD 2.14 on turnover of HKD 644 million, following the announcement of a new multimodal model architecture, "NEO," developed in collaboration with Nanyang Technological University's S-Lab [1]

Group 1
- The newly released "NEO" is positioned as the industry's first usable native multimodal architecture (Native VLM) to achieve deep vision-language integration [1]
- "NEO" is designed from first principles specifically for multimodal applications, targeting breakthroughs in performance, efficiency, and versatility [1]
- The new architecture lays the foundation for the SenseNova multimodal model and marks the start of the "native architecture" era in multimodal AI [1]
SenseTime Enters the Embodied Intelligence Industry: "Daxiao Robot" Benchmarks Against Figure AI as Stock Rises 3.38%
Sou Hu Cai Jing· 2025-12-04 09:27
Industry Overview
- The embodied intelligence industry is growing rapidly, with China's market projected to reach 400 billion yuan by 2030 and exceed 1 trillion yuan by 2035, driven by innovation and demand release [1]
- The industry's annual growth rate has exceeded 50% in recent years [1]

Company Developments
- SenseTime co-founder and executive director Wang Xiaogang has been appointed chairman of the embodied intelligence company "Daxiao Robot," which will launch several leading technologies and products on December 18 [1]
- Following the announcement, SenseTime's stock rose 3.38%, closing at HKD 2.14 on December 4 [1]

Technological Advancements
- Daxiao Robot has assembled a team of top AI scientists and industry experts, including chief scientist Tao Dacheng, an academician of the Australian Academy of Science, and award-winning researchers from leading universities [3]
- The company focuses on embodied intelligence and has developed the ACE R&D paradigm, building a comprehensive technology system grounded in visual data [4]
- SenseTime has released and open-sourced the multimodal model architecture NEO, providing strong technical support for applications in embodied interaction and video understanding [4]

Strategic Positioning
- SenseTime views embodied intelligence not as a passing trend but as a natural extension of its technology path, evolving from computer vision to multimodal models and integrating these capabilities into the "Wuneng" platform [5]
- This record of continuous technical breakthroughs and systematic capability integration is rare among technology companies [5]
SenseTime Open-Sources the NEO Multimodal Model Architecture, Achieving Deep Unification of Vision and Language
Xin Lang Cai Jing· 2025-12-02 11:25
Core Insights
- SenseTime has launched and open-sourced a new multimodal model architecture called NEO, developed in collaboration with Nanyang Technological University's S-Lab, which breaks with the traditional modular paradigm and achieves deep integration of vision and language through core architectural innovations [1][4]

Group 1: Architectural Innovations
- NEO is highly data-efficient: it develops top-tier visual perception from only 1/10 of the data volume (39 million image-text pairs) required by models of comparable performance [2][5]
- The architecture relies on neither massive datasets nor an additional visual encoder, yet matches leading modular flagship models such as Qwen2-VL and InternVL3 across a range of visual understanding tasks [2][5]
- NEO delivers balanced performance across multiple authoritative benchmarks, outperforming other native VLMs while maintaining "lossless accuracy" [2][5]

Group 2: Limitations of Traditional Models
- Mainstream multimodal models typically follow a "visual encoder + projector + language model" modular paradigm; while this accommodates image inputs, it remains language-centric and fuses image and language only at the data level [2][5]
- This "patchwork" design leads to inefficient learning and limits performance in complex multimodal scenarios, such as capturing fine image details or understanding complex spatial structures [2][5]

Group 3: Key Features of NEO
- NEO innovates along three critical dimensions: attention mechanisms, positional encoding, and semantic mapping, enabling the model to natively unify the processing of vision and language [2][5]
- The architecture's Native Patch Embedding eliminates discrete image tokenizers, providing a continuous mapping from pixels to tokens that improves the model's ability to capture image details [3][6]
- NEO also implements a Native Multi-Head Attention mechanism that combines autoregressive attention for text tokens with bidirectional attention for visual tokens, significantly improving the model's use of spatial structure relationships [3][6]
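To make the mixed-attention idea concrete, the sketch below builds a hybrid attention mask in which text tokens attend causally (autoregressively) while each contiguous span of visual tokens attends bidirectionally within itself. This is a minimal illustrative reconstruction based only on the description above; the function name and span-detection logic are assumptions, not NEO's actual open-sourced implementation.

```python
import numpy as np

def hybrid_attention_mask(token_types):
    """Build a boolean attention mask (True = attention allowed).

    token_types: 1-D sequence, 0 for text tokens, 1 for visual tokens.
    Text tokens use causal (autoregressive) attention; visual tokens
    additionally attend bidirectionally within their own image span.
    Illustrative sketch only, not NEO's actual implementation.
    """
    n = len(token_types)
    # Causal baseline: position i may attend to positions j <= i.
    mask = np.tril(np.ones((n, n), dtype=bool))
    # Find each contiguous run of visual tokens and make it bidirectional.
    i = 0
    while i < n:
        if token_types[i] == 1:
            j = i
            while j < n and token_types[j] == 1:
                j += 1
            mask[i:j, i:j] = True  # full attention inside the image span
            i = j
        else:
            i += 1
    return mask

# Example: two text tokens, a three-token image, one trailing text token.
m = hybrid_attention_mask([0, 0, 1, 1, 1, 0])
```

In this example, visual token 2 may attend forward to visual token 4 (bidirectional within the image), while text token 0 still cannot attend to the future text token 1; the mask would typically be applied as an additive or boolean mask inside scaled dot-product attention.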