Core Viewpoint
- The article discusses the launch of GLM-4.5V, a state-of-the-art open-source visual reasoning model from Zhipu, presented as a significant step toward Artificial General Intelligence (AGI) [3][4]

Group 1: Model Overview
- GLM-4.5V has 106 billion total parameters, with 12 billion active parameters, and is designed for multi-modal reasoning, a capability considered essential for AGI [3][4]
- The model builds on the earlier GLM-4.1V-Thinking and shows improved performance across visual tasks, including image, video, and document understanding [4][6]

Group 2: Performance Metrics
- Across 41 public multi-modal benchmarks, GLM-4.5V achieved state-of-the-art (SOTA) performance, outperforming other models on tasks such as general visual question answering (VQA) and visual grounding [5][6]
- Reported results include 88.2 on MMBench v1.1 for general VQA and 91.3 on RefCOCO-avg for visual grounding [5]

Group 3: Technical Features
- The model combines a visual encoder, an MLP adapter, and a language decoder, supports 64K multi-modal long contexts, and uses 3D convolution to improve video-processing efficiency [6][8] (a conceptual sketch of this pipeline follows the summary below)
- It uses a three-stage training strategy of pre-training, supervised fine-tuning (SFT), and reinforcement learning (RL), which together strengthen its multi-modal understanding and reasoning [8]

Group 4: Practical Applications
- Zhipu has built a desktop assistant application that uses GLM-4.5V for real-time screen capture and a range of visual reasoning tasks, improving user interaction and productivity [8][9]
- The company aims to empower developers by open-sourcing the model and offering API services, encouraging innovative applications of multi-modal models [9] (see the API call sketch below)
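To make the encoder/adapter/decoder pipeline described in Group 3 concrete, here is a minimal, self-contained sketch of how a visual encoder's outputs can be projected through an MLP adapter into a language decoder's token stream. All module sizes, layer choices, and names below are illustrative assumptions, not GLM-4.5V's actual implementation or dimensions.

```python
import torch
import torch.nn as nn


class MLPAdapter(nn.Module):
    """Projects visual-encoder features into the language model's embedding space."""

    def __init__(self, vision_dim: int, text_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        return self.proj(visual_features)


class ToyVisionLanguageModel(nn.Module):
    """Visual encoder -> MLP adapter -> language decoder, in miniature."""

    def __init__(self, vision_dim: int = 256, text_dim: int = 512, vocab_size: int = 32000):
        super().__init__()
        # Stand-in for a ViT-style visual encoder (here just one linear layer
        # over flattened image patches).
        self.vision_encoder = nn.Linear(3 * 14 * 14, vision_dim)
        self.adapter = MLPAdapter(vision_dim, text_dim)
        self.token_embedding = nn.Embedding(vocab_size, text_dim)
        # Stand-in for the autoregressive language decoder.
        layer = nn.TransformerEncoderLayer(d_model=text_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(text_dim, vocab_size)

    def forward(self, image_patches: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        # Encode patches, then project them into the text embedding space.
        visual_tokens = self.adapter(self.vision_encoder(image_patches))
        text_tokens = self.token_embedding(input_ids)
        # Concatenate visual and text tokens into one long multi-modal context
        # (the real model is described as supporting 64K tokens of such mixed context).
        sequence = torch.cat([visual_tokens, text_tokens], dim=1)
        hidden = self.decoder(sequence)
        # Predict next-token logits over the text positions only.
        return self.lm_head(hidden[:, visual_tokens.size(1):])


model = ToyVisionLanguageModel()
patches = torch.randn(1, 16, 3 * 14 * 14)      # 16 flattened 14x14 RGB patches
input_ids = torch.randint(0, 32000, (1, 8))    # 8 text tokens
print(model(patches, input_ids).shape)         # torch.Size([1, 8, 32000])
```

The adapter's only job is to map vision features into the same embedding space as text tokens, so that image, video, or document content and text can share one long decoder context, which is the basic idea behind the architecture summarized above.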
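For the API services mentioned in Group 4, a call might look like the sketch below, assuming Zhipu's zhipuai Python SDK and its chat-completions interface. The model identifier "glm-4.5v" and the exact multi-modal message schema are assumptions modeled on how earlier GLM-4V models are typically invoked; check the official documentation for the actual values.

```python
# A minimal sketch of calling the model through Zhipu's API, assuming the
# zhipuai Python SDK's chat-completions interface. The model id "glm-4.5v"
# and the message schema below are assumptions; verify against the docs.
from zhipuai import ZhipuAI

client = ZhipuAI(api_key="your-api-key")  # placeholder key

response = client.chat.completions.create(
    model="glm-4.5v",  # assumed model identifier
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/chart.png"}},  # placeholder image
                {"type": "text",
                 "text": "Describe the trend shown in this chart."},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

The mixed list of image and text entries in the message content mirrors the multi-modal context described above; video or document inputs would follow the same pattern of interleaving media with text.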
Zhipu releases GLM-4.5V, the world's strongest 100B-class open-source multi-modal model: SOTA on 41 benchmarks
IPO早知道·2025-08-12 01:52