SenseTime Open-Sources the NEO Multimodal Model Architecture, Achieving Deep Unification of Vision and Language

Core Insights
- SenseTime has launched and open-sourced a new multimodal model architecture called NEO, developed in collaboration with Nanyang Technological University's S-Lab, which aims to break the traditional modular paradigm and achieve deep integration of vision and language through core architectural innovations [1][4].

Group 1: Architectural Innovations
- NEO demonstrates high data efficiency, requiring only 1/10 of the data volume (39 million image-text pairs) used by models of similar performance to develop top-tier visual perception capabilities [2][5].
- The architecture does not rely on massive datasets or additional visual encoders, yet matches the performance of leading modular flagship models such as Qwen2-VL and InternVL3 across a range of visual understanding tasks [2][5].
- NEO's design achieves balanced performance across multiple authoritative evaluations, outperforming other native VLMs while maintaining "lossless accuracy" [2][5].

Group 2: Limitations of Traditional Models
- Current mainstream multimodal models typically follow a "visual encoder + projector + language model" modular paradigm, which, while compatible with image inputs, remains language-centric and limits the integration of image and language to the data level [2][5].
- This "patchwork" design results in inefficient learning and restricts the model's ability to handle complex multimodal scenarios, such as capturing fine image details or understanding complex spatial structures [2][5].

Group 3: Key Features of NEO
- NEO incorporates innovations in three critical dimensions: attention mechanisms, positional encoding, and semantic mapping, enabling the model to inherently unify the processing of vision and language [2][5].
- The architecture features a Native Patch Embedding that eliminates discrete image tokenizers, allowing a continuous mapping from pixels to tokens and enhancing the model's ability to capture image details (see the first sketch below) [3][6].
- NEO also implements a Native Multi-Head Attention mechanism that accommodates both autoregressive attention for text tokens and bidirectional attention for visual tokens, significantly improving the model's use of spatial structure relationships (see the second sketch below) [3][6].
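To make the "continuous mapping from pixels to tokens" idea concrete, the following is a minimal sketch of a patch-embedding layer of the kind described above. It is not SenseTime's released code; the use of PyTorch, the class name NativePatchEmbedding, and the patch size and embedding dimension are illustrative assumptions.

```python
# Hypothetical sketch (not NEO's actual code): a strided Conv2d maps raw
# pixels directly to continuous token embeddings, with no discrete image
# tokenizer or codebook in between.
import torch
import torch.nn as nn

class NativePatchEmbedding(nn.Module):
    def __init__(self, patch_size: int = 14, in_channels: int = 3, embed_dim: int = 1024):
        super().__init__()
        # One linear projection per patch, expressed as a strided convolution:
        # pixels -> continuous token embeddings, end-to-end differentiable.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, 3, H, W), with H and W divisible by patch_size
        x = self.proj(images)                # (batch, embed_dim, H/p, W/p)
        return x.flatten(2).transpose(1, 2)  # (batch, num_patches, embed_dim)

# Usage: a 224x224 image becomes a 16x16 grid of 256 continuous visual tokens.
tokens = NativePatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 256, 1024])
```

Because such a projection is a plain linear map over pixel patches, gradients can flow from the language-model loss back to the pixels, which is consistent with the article's claim that a continuous, tokenizer-free mapping helps the model capture image details.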

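The mixed attention scheme described above, autoregressive for text tokens and bidirectional for visual tokens, can likewise be illustrated with a mask-construction sketch. Again, this is an illustrative assumption rather than NEO's actual implementation; the function name and the True/False mask convention are hypothetical.

```python
# Hypothetical sketch (not NEO's actual code): visual tokens attend
# bidirectionally among themselves, while text tokens attend causally,
# so one attention layer serves both modalities.
import torch

def mixed_attention_mask(is_visual: torch.Tensor) -> torch.Tensor:
    """is_visual: (seq_len,) bool tensor, True where the token is an image patch.
    Returns a (seq_len, seq_len) bool mask where True means attention is allowed."""
    seq_len = is_visual.shape[0]
    # Default autoregressive (lower-triangular) mask for every token.
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Allow full (bidirectional) attention within the block of visual tokens.
    visual_block = is_visual.unsqueeze(0) & is_visual.unsqueeze(1)
    return causal | visual_block

# Example: 4 image-patch tokens followed by 3 text tokens.
mask = mixed_attention_mask(torch.tensor([True] * 4 + [False] * 3))
print(mask.int())
# Visual rows see all visual columns; text rows remain strictly causal.
```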