A Complete Farewell to VE and VAE! SenseTime Radically Rebuilds Multimodal Models: Cutting Out All Intermediate Encoders
SENSETIME (HK:00020) | QbitAI · 2026-03-06 06:33

Core Viewpoint
- The development paradigm of multimodal large models is being fundamentally restructured: the introduction of NEO-unify by SenseTime and Nanyang Technological University marks a significant breakthrough toward a truly "native, unified, end-to-end" multimodal model architecture [1][2][3].

Group 1: Technical Breakthroughs
- NEO-unify eliminates the reliance on traditional visual encoders (VE) and variational autoencoders (VAE), moving away from component-based approaches to a model that directly processes near-lossless pixel and text inputs [3][10].
- The innovative Mixture-of-Transformer (MoT) architecture enables seamless integration of visual and language understanding and generation capabilities within the same framework [4][13].
- This architecture signifies a shift from "modal connection" to "native unified intelligence," laying the groundwork for future integrated cross-modal cognitive and generative systems [5][6].

Group 2: Current Challenges in Multimodal Intelligence
- Existing multimodal architectures have created a natural divide between perception and generation, which has been a persistent challenge in the field [7].
- Recent attempts to create "shared encoders" have often led to new structural design trade-offs, highlighting the need for a more cohesive approach [8][9].

Group 3: Model Performance and Efficiency
- NEO-unify demonstrates superior performance metrics compared to other models, achieving notable results across benchmarks, such as a score of 86.71 on the WISE benchmark and 0.914 on LongText-en [19].
- The model's design allows for high data and computational efficiency, achieving better performance with fewer training tokens than models like Bagel [49].
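The "no visual encoder" idea described above can be illustrated with a minimal sketch: instead of passing an image through a pretrained VE or VAE, raw pixels are split into patches and projected linearly into the same token space as text. This is a generic illustration, not NEO-unify's actual implementation; the patch size, width, and weight initialization here are assumptions.

```python
# Minimal sketch: turning raw pixels into tokens with a single linear
# projection, with no intermediate visual encoder. All sizes are
# illustrative assumptions, not NEO-unify's real hyperparameters.
import numpy as np

PATCH = 4      # patch side length (assumption)
D_MODEL = 8    # model embedding width (assumption)
rng = np.random.default_rng(0)
# One learned linear map from a flattened RGB patch to the model width.
W_pix = rng.normal(scale=0.02, size=(PATCH * PATCH * 3, D_MODEL))

def pixels_to_tokens(image: np.ndarray) -> np.ndarray:
    """Split an (H, W, 3) image into non-overlapping PATCH x PATCH
    patches and project each flattened patch into the model width."""
    h, w, c = image.shape
    patches = (image.reshape(h // PATCH, PATCH, w // PATCH, PATCH, c)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(-1, PATCH * PATCH * c))
    return patches @ W_pix  # (num_patches, D_MODEL)

img = rng.random((8, 8, 3))          # a toy 8x8 RGB image
tokens = pixels_to_tokens(img)
print(tokens.shape)                  # (4, 8): 4 patches, width 8
```

Because the projected patch tokens live in the same width as text embeddings, image and text can be concatenated into one sequence and trained end to end, which is the essence of dropping the separate encoder stage.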
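The Mixture-of-Transformer (MoT) idea of unifying modalities in one framework can likewise be sketched in miniature: attention runs over the full mixed sequence, while each token's feed-forward pass is routed to a modality-specific expert. This is a hypothetical toy, assuming per-modality FFN experts with shared attention; the real MoT design may differ.

```python
# Hypothetical MoT-style layer: shared self-attention over a mixed
# image/text sequence, with modality-routed feed-forward experts.
# Names, sizes, and routing rule are illustrative assumptions.
import numpy as np

D = 8
rng = np.random.default_rng(1)

def make_ffn():
    """A small ReLU MLP acting as one modality's expert."""
    W1 = rng.normal(scale=0.02, size=(D, 4 * D))
    W2 = rng.normal(scale=0.02, size=(4 * D, D))
    return lambda x: np.maximum(x @ W1, 0.0) @ W2

experts = {"text": make_ffn(), "image": make_ffn()}

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mot_layer(tokens, modalities):
    """tokens: (N, D); modalities: per-token 'text'/'image' labels."""
    # Shared self-attention mixes information across both modalities.
    attn = softmax(tokens @ tokens.T / np.sqrt(D))
    mixed = attn @ tokens
    # Each token is then processed by its own modality's expert FFN.
    out = np.empty_like(mixed)
    for name, expert in experts.items():
        idx = [i for i, m in enumerate(modalities) if m == name]
        if idx:
            out[idx] = expert(mixed[idx])
    return tokens + out  # residual connection

seq = rng.random((6, D))
labels = ["image"] * 4 + ["text"] * 2
out = mot_layer(seq, labels)
print(out.shape)  # (6, 8): same sequence shape in and out
```

The design choice sketched here is what lets understanding and generation share one backbone: cross-modal interaction happens in the shared attention, while modality-specific capacity lives in the routed experts.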
Group 4: Future Implications
- The introduction of NEO-unify represents not just a model architecture innovation but also a clear pathway toward the next generation of intelligent systems, in which multimodal AI evolves from "component stacking" to "essential unity" [51][54].
- Ongoing research and development efforts are in a critical phase of scaling and iteration, with upcoming model results and open-source contributions expected to be released soon [55].
