T5Gemma Gets Another Update: Google Is Still Committed to the Encoder-Decoder Architecture
机器之心·2025-12-19 03:42

Core Viewpoint
- Google has recently stepped up its pace of model releases, introducing Gemini 3 Flash and, unexpectedly, T5Gemma 2, a new encoder-decoder model that builds on the Gemma 3 series [1][3].

Group 1: T5Gemma 2 Overview
- T5Gemma 2 is a new-generation encoder-decoder model and the first in the line to support multimodal inputs and long contexts, building on the capabilities of Gemma 3 [9].
- It ships in three pre-trained scales (270M-270M, 1B-1B, and 4B-4B encoder-decoder pairings) and is presented as the first high-performance encoder-decoder model in the community to support contexts of up to 128K tokens (see the loading sketch after this summary) [9][11].

Group 2: Innovations and Upgrades
- T5Gemma 2 continues T5Gemma's adaptation-training approach, converting a pre-trained decoder-only model into an encoder-decoder model, while borrowing key innovations from Gemma 3 to extend into the vision-language domain [13].
- The main architectural changes (illustrated in the sketch after this summary) are:
1. Word embeddings shared between the encoder and decoder, which lowers the total parameter count and fits more capability into the same memory footprint [15].
2. Self-attention and cross-attention merged into a single unified attention layer, which improves parallelization and inference efficiency [16][15].

Group 3: Model Capabilities
- T5Gemma 2 brings three major capability upgrades:
1. Multimodality: the model can understand and process both images and text, enabling tasks such as visual question answering and multimodal reasoning [17].
2. Long context: context windows of up to 128K tokens, handled through a local-global alternating attention mechanism (see the masking sketch after this summary) [18].
3. Broad multilingual coverage: support for more than 140 languages, thanks to training on larger and more diverse datasets [19].

Group 4: Performance Results
- T5Gemma 2 sets a new bar for compact encoder-decoder models, performing strongly across key capability areas and inheriting Gemma 3's multimodal and long-context strengths [21].
- In benchmark tests, T5Gemma 2 outperforms both Gemma 3 and T5Gemma in multimodal performance, long-context capability, and general capabilities across a range of tasks [25][29].
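
As a usage-level illustration of the checkpoints described in Group 1, the snippet below loads an encoder-decoder checkpoint with Hugging Face transformers and runs a short text-only generation. The checkpoint name `google/t5gemma-2-1b-1b` is a hypothetical placeholder, and it is an assumption (not something stated in the article) that the T5Gemma 2 checkpoints, especially the multimodal variants, load through the generic seq2seq Auto classes.

```python
# Minimal text-only sketch; the model id is a hypothetical placeholder and
# compatibility with the generic Auto classes is assumed, not confirmed.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "google/t5gemma-2-1b-1b"  # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer(
    "Summarize: encoder-decoder models pair a bidirectional encoder "
    "with an autoregressive decoder.",
    return_tensors="pt",
)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```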
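
To make the two architectural changes in Group 2 concrete, here is a minimal PyTorch sketch of (a) a word-embedding table shared by encoder and decoder, and (b) a decoder layer whose single attention operation covers both the encoder outputs and the decoder prefix, standing in for separate self- and cross-attention blocks. This is one plausible way to realize the idea the article describes, not Google's actual implementation; all module names, sizes, and details are illustrative.

```python
import torch
import torch.nn as nn

class MergedAttentionDecoderLayer(nn.Module):
    """Decoder layer with one attention over [encoder outputs ; decoder prefix],
    replacing separate self- and cross-attention (illustrative sketch only)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_attn = nn.LayerNorm(d_model)
        self.norm_mlp = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, dec_x: torch.Tensor, enc_out: torch.Tensor) -> torch.Tensor:
        t_dec, t_enc = dec_x.shape[1], enc_out.shape[1]
        q = self.norm_attn(dec_x)
        kv = torch.cat([enc_out, q], dim=1)  # keys/values span encoder + decoder
        # Decoder position i may attend to every encoder token and to decoder
        # positions <= i (causal only over the decoder part). True = blocked.
        causal = torch.triu(torch.ones(t_dec, t_dec, dtype=torch.bool), diagonal=1)
        mask = torch.cat([torch.zeros(t_dec, t_enc, dtype=torch.bool), causal], dim=1)
        attn_out, _ = self.attn(q, kv, kv, attn_mask=mask)
        x = dec_x + attn_out
        return x + self.mlp(self.norm_mlp(x))

# Shared word embeddings: a single table feeds both encoder and decoder inputs.
vocab_size, d_model = 32_000, 512
shared_emb = nn.Embedding(vocab_size, d_model)
enc_in = shared_emb(torch.randint(0, vocab_size, (2, 16)))  # encoder token ids
dec_in = shared_emb(torch.randint(0, vocab_size, (2, 8)))   # decoder token ids
layer = MergedAttentionDecoderLayer(d_model, n_heads=8)
print(layer(dec_in, enc_in).shape)  # torch.Size([2, 8, 512])
```

In this toy layer the causal structure is preserved for decoding while the encoder tokens stay fully visible, which is what lets the two attention types collapse into a single attention call, and the shared embedding table removes one of the two vocabulary-sized parameter matrices.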
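
The 128K-token context in Group 3 is attributed to a local-global alternating attention scheme. The helper below sketches how such a schedule can be expressed as per-layer causal masks: most layers restrict attention to a sliding window, while every N-th layer attends globally. The window size and local-to-global ratio here are assumptions chosen for illustration; the article does not specify T5Gemma 2's exact values.

```python
import torch

def layer_attention_mask(layer_idx: int, seq_len: int,
                         window: int = 1024, global_every: int = 6) -> torch.Tensor:
    """Boolean causal mask (True = blocked) for a local-global alternating stack.
    Every `global_every`-th layer uses full causal attention; the rest use a
    sliding window. Window size and ratio are illustrative assumptions."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    blocked = j > i                          # causal: never attend to the future
    if (layer_idx + 1) % global_every == 0:
        return blocked                       # global layer: full causal attention
    return blocked | (i - j >= window)       # local layer: recent window only

# Example: with a window of 4 over 8 positions, a local layer sees only the
# last 4 tokens, while a global layer sees the whole prefix.
print(layer_attention_mask(0, 8, window=4, global_every=6).int())
print(layer_attention_mask(5, 8, window=4, global_every=6).int())
```

Keeping most layers local bounds per-layer attention cost by the window size rather than the full 128K sequence, while the periodic global layers let information propagate across the entire context.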