Cross-Modal Understanding
Alibaba Tongyi releases and open-sources the Qwen3-VL-Embedding and Qwen3-VL-Reranker models
智通财经网· 2026-01-09 01:31
Core Insights
- The Qwen3-VL model series handles multiple modalities such as text, images, visual documents, and videos within a unified framework, achieving industry-leading performance across tasks such as image-text retrieval and visual question answering [1][8].

Group 1: Multi-Modal Capabilities
- The model series can process diverse input types, including text, images, visual documents, and videos, and demonstrates superior performance in tasks such as multi-modal content clustering [1].
- Qwen3-VL-Embedding generates rich semantic vector representations, mapping visual and textual information into a shared semantic space for efficient cross-modal similarity computation and retrieval [2][4].

Group 2: Model Architecture and Performance
- Qwen3-VL-Embedding uses a dual-tower architecture in which each modality is encoded independently and efficiently, making it well suited to large-scale parallel computation [14].
- Qwen3-VL-Reranker, the complementary model, uses a single-tower architecture with cross-attention to analyze the semantic relationship between a query and a document, yielding more precise relevance scores [14] (a minimal retrieve-then-rerank sketch follows this summary).
- On benchmarks such as MMEB-v2 and MMTEB, Qwen3-VL-Embedding-8B achieved leading results, surpassing all previous open-source models as well as closed-source commercial services [8][11].

Group 3: Practical Utility
- The Qwen3-VL series supports more than 30 languages, making it suitable for global deployment, and offers flexible vector-dimension selection and task-instruction customization [6].
- The models retain strong performance after quantization, making them straightforward for developers to integrate into existing systems [6].

Group 4: Performance Metrics
- Qwen3-VL-Embedding-2B achieved an average score of 73.4 on MMEB-v2 retrieval tasks, while Qwen3-VL-Reranker-8B scored 79.2, indicating superior performance on most tasks [13].
- The Qwen3-VL-Reranker models consistently outperformed baseline models, with the 8B version achieving the best results across tasks [11].
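To make the dual-tower / single-tower split concrete, here is a minimal retrieve-then-rerank sketch in Python. It is not the models' actual API: `embed_text`, `embed_document`, and `rerank_score` are hypothetical placeholders (backed here by random vectors) standing in for calls to an embedding model and a reranker, and `DIM` is an assumed vector dimension. The point is the composition: independent encoding plus cosine similarity for a fast shortlist, then pairwise scoring over that shortlist only.

```python
"""Two-stage multimodal retrieval sketch: dual-tower embedding retrieval
followed by single-tower reranking. All model calls are hypothetical
placeholders, not the Qwen3-VL-Embedding / Qwen3-VL-Reranker APIs."""
import numpy as np

rng = np.random.default_rng(0)
DIM = 1024  # assumed embedding dimension for this sketch


def embed_text(text: str) -> np.ndarray:
    """Placeholder for the embedding model's text tower (unit-normalized)."""
    vec = rng.normal(size=DIM)
    return vec / np.linalg.norm(vec)


def embed_document(doc: dict) -> np.ndarray:
    """Placeholder for the embedding model's image/document tower."""
    vec = rng.normal(size=DIM)
    return vec / np.linalg.norm(vec)


def rerank_score(query: str, doc: dict) -> float:
    """Placeholder for the cross-attention reranker's relevance score."""
    return float(rng.random())


def search(query: str, docs: list[dict], top_k: int = 10, rerank_k: int = 3) -> list[dict]:
    # Stage 1: dual-tower retrieval via cosine similarity in the shared space.
    q = embed_text(query)
    doc_matrix = np.stack([embed_document(d) for d in docs])  # (N, DIM)
    sims = doc_matrix @ q                                     # unit vectors -> cosine similarity
    candidates = [docs[i] for i in np.argsort(-sims)[:top_k]]

    # Stage 2: single-tower reranking of the shortlist only.
    scored = sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)
    return scored[:rerank_k]


if __name__ == "__main__":
    corpus = [{"id": i, "image": f"img_{i}.png"} for i in range(100)]
    print(search("a chart describing quarterly revenue", corpus))
```

The trade-off mirrors the summary above: the dual-tower stage scales to large corpora because document vectors can be precomputed and indexed, while the single-tower reranker is more accurate but must run once per query-candidate pair, so it is applied only to the top-k shortlist.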
Language priors are "too strong a foundation": how can visual decay in MLLMs be addressed?
机器之心· 2025-11-01 02:30
Core Viewpoint
- The article discusses the limitations of Multimodal Large Language Models (MLLMs) in effectively integrating visual information, highlighting a systemic bias toward text and diminishing attention to visual tokens over extended reasoning chains [1].

Group 1: Visual Information Neglect in MLLMs
- MLLMs, built on the Transformer architecture, have made progress in tasks such as visual question answering and image description by combining a language model's reasoning ability with visual encoding [5].
- There is a systemic bias in MLLMs' attention distribution, leading to over-reliance on language and neglect of visual information, especially in complex reasoning scenarios [5][6].
- As reasoning chains lengthen, the model's attention to image content drops significantly while attention to language tokens increases, so the model comes to rely on language cues rather than visual content [5][6] (a diagnostic sketch follows this summary).

Group 2: Amplification of Visual Errors in Deep Reasoning
- The modality imbalance in MLLMs stems from the disproportionate share of text data during training, often measured in trillions of tokens, which gives the underlying LLM strong language priors [8].
- Visual features, although represented in high-dimensional space, are often overshadowed by language features and are therefore neglected during the initial fusion stage [8][9].
- MLLM training objectives favor language data, which is more abstract and compact, leading the model to adopt shortcut-learning strategies that prioritize text over complex visual information [9].
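A hedged sketch of how the attention-decay observation could be quantified: given per-step attention weights over a fixed multimodal context and a boolean mask marking which positions are image tokens, compute the share of attention mass on visual tokens at each decoding step. The attention tensor below is synthetic and deliberately biased to drift toward text tokens so the decay is visible; `visual_attention_share` and all shapes are illustrative assumptions, not taken from the article.

```python
"""Synthetic diagnostic: fraction of attention mass on visual tokens
per generated step. A downward trend matches the reported decay."""
import numpy as np


def visual_attention_share(attn: np.ndarray, visual_mask: np.ndarray) -> np.ndarray:
    """
    attn: (num_steps, num_heads, context_len) attention weights for each
          generated token over the fixed multimodal context.
    visual_mask: (context_len,) boolean, True where the position is an image token.
    Returns: (num_steps,) fraction of attention mass on visual tokens, averaged over heads.
    """
    per_head = attn[..., visual_mask].sum(axis=-1)  # (num_steps, num_heads)
    return per_head.mean(axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    steps, heads, ctx = 64, 8, 512
    visual_mask = np.zeros(ctx, dtype=bool)
    visual_mask[:256] = True  # assume the first 256 context positions are image tokens

    # Synthetic attention that drifts toward text tokens as decoding proceeds,
    # mimicking the decay described in the article.
    logits = rng.normal(size=(steps, heads, ctx))
    drift = np.linspace(0.0, 2.0, steps)[:, None, None] * (~visual_mask)
    attn = np.exp(logits + drift)
    attn /= attn.sum(axis=-1, keepdims=True)

    share = visual_attention_share(attn, visual_mask)
    print(f"visual attention share: step 0 = {share[0]:.2f}, step {steps - 1} = {share[-1]:.2f}")
```

In a real measurement the attention weights would come from a model run with attention outputs enabled; the exact way to obtain them depends on the framework and model and is not shown here.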