Google Open-Sources Gemma 3n: Runs in 2GB of Memory, the Strongest Multimodal Model Under 10 Billion Parameters

Core Viewpoint - With the release of the new multimodal model Gemma 3n, Google has made a significant advance in edge AI: multimodal capabilities previously available only in advanced cloud models now run on devices such as smartphones, tablets, and laptops [2][3].

Group 1: Model Features
- Gemma 3n natively supports image, audio, video, and text input, and produces text output [5].
- The model is optimized for on-device efficiency. It comes in two versions, E2B and E4B, which run in as little as 2GB and 3GB of memory respectively, even though their raw parameter counts are 5 billion and 8 billion [5].
- The architecture includes new components such as MatFormer for computational flexibility, a new audio encoder, and a new vision encoder based on MobileNet-V5 [5][7].

Group 2: Architectural Innovations
- The MatFormer architecture allows for elastic inference: the model can switch dynamically between the E4B and E2B inference paths, trading off performance against memory usage according to the current task [12].
- Per-Layer Embeddings (PLE) significantly improve memory efficiency. A large portion of the parameters can be loaded and computed on the CPU, which reduces the memory load on the GPU/TPU [14][15].

Group 3: Performance Enhancements
- Gemma 3n improves quality in multilingual support, mathematics, coding, and reasoning, with the E4B version scoring over 1300 on the LMArena benchmark [5].
- The model introduces KV Cache Sharing (key-value cache sharing) to speed up the processing of long inputs, improving time-to-first-token for streaming applications [18][19].

Group 4: Audio and Visual Capabilities
- Gemma 3n's audio encoder is based on the Universal Speech Model (USM) and generates one token for every 160 milliseconds of audio, enabling high-quality speech-to-text transcription and translation [21].
- The MobileNet-V5-300M vision encoder delivers strong performance on multimodal tasks on edge devices. It supports multiple input resolutions and achieves high throughput for real-time video analysis [24][26].

Group 5: Future Developments
- Google plans to release more details in an upcoming technical report on MobileNet-V5, covering its performance improvements and architectural innovations [28].
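
To make the audio figure in Group 4 concrete, here is a minimal sketch of the arithmetic implied by one token per 160 milliseconds of audio. The function names are hypothetical and not part of any Gemma API, and the rounding rule (a partial final window still counts as one token) is an assumption; the actual encoder may pad or truncate differently.

```python
# Interval taken from the article: one audio token per 160 ms of audio.
TOKEN_INTERVAL_MS = 160


def tokens_per_second() -> float:
    """Audio tokens produced per second of input audio."""
    return 1000 / TOKEN_INTERVAL_MS


def audio_token_count(duration_seconds: float) -> int:
    """Approximate number of audio tokens for a clip of the given length.

    Assumption (not from the article): a partial final 160 ms window
    still yields one token, so the count is rounded up.
    """
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    duration_ms = round(duration_seconds * 1000)
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-duration_ms // TOKEN_INTERVAL_MS)
```

Under these assumptions, `tokens_per_second()` gives 6.25 and a 30-second clip maps to `audio_token_count(30) == 188` tokens, which is a quick way to estimate how much context an audio input would occupy.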