Core Viewpoint
- Google has officially announced Gemma 3n, a model that natively supports multiple modalities, including text, images, audio, and video [2][20].

Group 1: Model Performance and Specifications
- Gemma 3n scored 1303 on the arena leaderboard, becoming the first model under 10 billion parameters to exceed 1300 points [3].
- The model comes in two versions, E2B (about 5 billion raw parameters) and E4B (about 8 billion), with memory footprints comparable to 2B and 4B models and VRAM requirements as low as 2GB [4][17].
- The architecture's low memory consumption makes it well suited to edge devices [6][17].

Group 2: Technical Architecture
- The core of Gemma 3n is the MatFormer (Matryoshka Transformer) architecture, a nested structure designed for elastic inference (see the sketch after this summary) [11][12].
- The notion of "effective parameters" allows the E4B and E2B models to be optimized simultaneously during training [10][15].
- Google will release a tool called MatFormer Lab to help find the best model configurations [16].

Group 3: Edge Device Optimization
- The model employs Per-Layer Embeddings (PLE) to improve model quality without increasing the accelerator memory footprint [18].
- Gemma 3n also optimizes generation of the first token, improving prefill performance by 2x over the previous generation [19].

Group 4: Multimodal Support
- Gemma 3n supports multiple input modalities, including an advanced audio encoder for speech recognition and translation that can process up to 30 seconds of audio [20].
- The vision component uses a new, efficient visual encoder, MobileNet-V5-300M, which handles resolutions of 256x256, 512x512, and 768x768 pixels and reaches 60 frames per second on Google Pixel devices [21].
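The MatFormer bullet above describes a nested ("Matryoshka") design in which the smaller E2B model lives inside the E4B model's weights. The snippet below is a minimal conceptual sketch of that idea in PyTorch, assuming an illustrative feed-forward layer whose hidden width can be truncated at inference time; the class name and dimensions are invented for illustration and are not Gemma 3n's actual implementation.

```python
# Conceptual sketch of a MatFormer-style nested feed-forward layer (illustrative only,
# not Gemma 3n's real code): the "small" model reuses a prefix slice of the "large"
# model's weights, so one trained weight set serves both deployment sizes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NestedFFN(nn.Module):
    def __init__(self, d_model: int = 512, d_hidden_full: int = 2048):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden_full)    # shared up-projection
        self.down = nn.Linear(d_hidden_full, d_model)  # shared down-projection

    def forward(self, x: torch.Tensor, d_hidden_active: int) -> torch.Tensor:
        # Use only the first d_hidden_active hidden units: the small sub-model
        # is literally nested inside the large one.
        h = x @ self.up.weight[:d_hidden_active].T + self.up.bias[:d_hidden_active]
        h = F.gelu(h)
        return h @ self.down.weight[:, :d_hidden_active].T + self.down.bias

ffn = NestedFFN()
x = torch.randn(1, 16, 512)
large = ffn(x, d_hidden_active=2048)   # "E4B-like" path: full hidden width
small = ffn(x, d_hidden_active=1024)   # "E2B-like" path: nested prefix only
print(large.shape, small.shape)
```

Because the small configuration is a prefix slice of the large one, intermediate sizes can also be carved out of the same weights, which is the kind of configuration search the announced MatFormer Lab tool is meant to assist with.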
Requires as little as 2GB of VRAM: Google's open-source on-device model sets a new Arena record, with native support for images and video