Cross-Modal Understanding
Alibaba Tongyi releases and open-sources the Qwen3-VL-Embedding and Qwen3-VL-Reranker models
智通财经网· 2026-01-09 01:31
Core Insights
- The Qwen3-VL model series handles multiple modalities such as text, images, visual documents, and videos within a unified framework, achieving industry-leading performance across tasks such as image-text retrieval and visual question answering [1][8].

Group 1: Multi-Modal Capabilities
- The model series can process diverse input types, including text, images, visual documents, and videos, and demonstrates superior performance in tasks such as multi-modal content clustering [1].
- Qwen3-VL-Embedding generates rich semantic vector representations, mapping visual and textual information into a shared semantic space for efficient cross-modal similarity computation and retrieval [2][4].

Group 2: Model Architecture and Performance
- Qwen3-VL-Embedding uses a dual-tower architecture in which each modality is encoded independently and efficiently, making it well suited to large-scale parallel computation [14].
- Qwen3-VL-Reranker, the complementary model, uses a single-tower architecture with cross-attention to analyze the semantic relationship between a query and a document, yielding more precise relevance scores [14] (a minimal retrieve-then-rerank sketch follows this summary).
- On benchmarks such as MMEB-v2 and MMTEB, Qwen3-VL-Embedding-8B achieved leading results, surpassing all previous open-source models as well as closed-source commercial services [8][11].

Group 3: Practical Utility
- The Qwen3-VL series supports more than 30 languages, making it suitable for global deployment, and offers flexible vector-dimension selection and task-instruction customization [6].
- The models retain strong performance after quantization, making them straightforward for developers to integrate into existing systems [6].

Group 4: Performance Metrics
- Qwen3-VL-Embedding-2B achieved an average score of 73.4 on MMEB-v2 retrieval tasks, while Qwen3-VL-Reranker-8B scored 79.2, indicating superior performance on most tasks [13].
- The Qwen3-VL-Reranker models consistently outperformed baseline models, with the 8B version achieving the best results across tasks [11].
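To make the dual-tower / single-tower split concrete, here is a minimal retrieve-then-rerank sketch in Python. It is not the models' actual API: `embed_text`, `embed_document`, and `rerank_score` are hypothetical placeholders (backed here by random vectors) standing in for calls to an embedding model and a reranker, and `DIM` is an assumed vector dimension. The point is the composition: independent encoding plus cosine similarity for a fast shortlist, then pairwise scoring over that shortlist only.

```python
"""Two-stage multimodal retrieval sketch: dual-tower embedding retrieval
followed by single-tower reranking. All model calls are hypothetical
placeholders, not the Qwen3-VL-Embedding / Qwen3-VL-Reranker APIs."""
import numpy as np

rng = np.random.default_rng(0)
DIM = 1024  # assumed embedding dimension for this sketch


def embed_text(text: str) -> np.ndarray:
    """Placeholder for the embedding model's text tower (unit-normalized)."""
    vec = rng.normal(size=DIM)
    return vec / np.linalg.norm(vec)


def embed_document(doc: dict) -> np.ndarray:
    """Placeholder for the embedding model's image/document tower."""
    vec = rng.normal(size=DIM)
    return vec / np.linalg.norm(vec)


def rerank_score(query: str, doc: dict) -> float:
    """Placeholder for the cross-attention reranker's relevance score."""
    return float(rng.random())


def search(query: str, docs: list[dict], top_k: int = 10, rerank_k: int = 3) -> list[dict]:
    # Stage 1: dual-tower retrieval via cosine similarity in the shared space.
    q = embed_text(query)
    doc_matrix = np.stack([embed_document(d) for d in docs])  # (N, DIM)
    sims = doc_matrix @ q                                     # unit vectors -> cosine similarity
    candidates = [docs[i] for i in np.argsort(-sims)[:top_k]]

    # Stage 2: single-tower reranking of the shortlist only.
    scored = sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)
    return scored[:rerank_k]


if __name__ == "__main__":
    corpus = [{"id": i, "image": f"img_{i}.png"} for i in range(100)]
    print(search("a chart describing quarterly revenue", corpus))
```

The trade-off mirrors the summary above: the dual-tower stage scales to large corpora because document vectors can be precomputed and indexed, while the single-tower reranker is more accurate but must run once per query-candidate pair, so it is applied only to the top-k shortlist.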
Language priors are "too strong a foundation": how can visual decay in MLLMs be addressed?
机器之心· 2025-11-01 02:30
Core Viewpoint
- The article discusses the limitations of Multimodal Large Language Models (MLLMs) in effectively integrating visual information, highlighting a systemic bias toward text and diminishing attention to visual tokens over extended reasoning chains [1].

Group 1: Visual Information Neglect in MLLMs
- MLLMs, built on the Transformer architecture, have made progress in tasks such as visual question answering and image description by combining a language model's reasoning ability with visual encoding [5].
- There is a systemic bias in MLLMs' attention distribution, leading to over-reliance on language and neglect of visual information, especially in complex reasoning scenarios [5][6].
- As reasoning chains lengthen, the model's attention to image content drops significantly while attention to language tokens increases, so the model comes to rely on language cues rather than visual content [5][6] (a diagnostic sketch follows this summary).

Group 2: Amplification of Visual Errors in Deep Reasoning
- The modality imbalance in MLLMs stems from the disproportionate share of text data during training, often measured in trillions of tokens, which gives the underlying LLM strong language priors [8].
- Visual features, although represented in high-dimensional space, are often overshadowed by language features and are therefore neglected during the initial fusion stage [8][9].
- MLLM training objectives favor language data, which is more abstract and compact, leading the model to adopt shortcut-learning strategies that prioritize text over complex visual information [9].
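A hedged sketch of how the attention-decay observation could be quantified: given per-step attention weights over a fixed multimodal context and a boolean mask marking which positions are image tokens, compute the share of attention mass on visual tokens at each decoding step. The attention tensor below is synthetic and deliberately biased to drift toward text tokens so the decay is visible; `visual_attention_share` and all shapes are illustrative assumptions, not taken from the article.

```python
"""Synthetic diagnostic: fraction of attention mass on visual tokens
per generated step. A downward trend matches the reported decay."""
import numpy as np


def visual_attention_share(attn: np.ndarray, visual_mask: np.ndarray) -> np.ndarray:
    """
    attn: (num_steps, num_heads, context_len) attention weights for each
          generated token over the fixed multimodal context.
    visual_mask: (context_len,) boolean, True where the position is an image token.
    Returns: (num_steps,) fraction of attention mass on visual tokens, averaged over heads.
    """
    per_head = attn[..., visual_mask].sum(axis=-1)  # (num_steps, num_heads)
    return per_head.mean(axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    steps, heads, ctx = 64, 8, 512
    visual_mask = np.zeros(ctx, dtype=bool)
    visual_mask[:256] = True  # assume the first 256 context positions are image tokens

    # Synthetic attention that drifts toward text tokens as decoding proceeds,
    # mimicking the decay described in the article.
    logits = rng.normal(size=(steps, heads, ctx))
    drift = np.linspace(0.0, 2.0, steps)[:, None, None] * (~visual_mask)
    attn = np.exp(logits + drift)
    attn /= attn.sum(axis=-1, keepdims=True)

    share = visual_attention_share(attn, visual_mask)
    print(f"visual attention share: step 0 = {share[0]:.2f}, step {steps - 1} = {share[-1]:.2f}")
```

In a real measurement the attention weights would come from a model run with attention outputs enabled; the exact way to obtain them depends on the framework and model and is not shown here.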