A 9B "Small" Model Pulls Off Something "Big": Outperforms a Model 8x Its Size and Takes 23 SOTA Results | Open-Sourced by Zhipu
QbitAI (量子位) · 2025-07-02 04:46
Core Viewpoint
- The article covers the release of Zhipu's new vision-language model, GLM-4.1V-9B-Thinking, which is built for reasoning and has achieved state-of-the-art results across a range of evaluations, outperforming larger models on some tasks [3][4][5].

Summary by Sections

Model Performance
- GLM-4.1V-9B-Thinking achieved state-of-the-art results on 23 of 28 evaluations, making it the best-performing model in the 10B-parameter class [3].
- The model shows strong reasoning ability on complex tasks such as interpreting artwork and solving math problems [11][15][19].

Technical Architecture
- The model consists of three main components: a visual encoder, a language decoder, and a multi-layer perceptron (MLP) adapter [25][33].
- The visual encoder uses 3D convolution to process video efficiently, while the language decoder has been upgraded to better understand spatial relationships [26][28].
- Training proceeds in three phases: pre-training, supervised fine-tuning, and reinforcement learning with curriculum sampling [29][35][38].

Training Methodology
- Pre-training ran for 120,000 steps with a batch size of 1,536, covering diverse data types including image-text pairs and OCR data [31].
- Supervised fine-tuning used high-quality chain-of-thought data to improve the model's handling of complex reasoning tasks [36].
- The reinforcement learning phase used a curriculum learning strategy that presents the model with progressively harder tasks, improving its overall performance [40].

Applications and Capabilities
- The model can analyze long videos, answer questions about images, assist in solving science problems, and process professional documents [32].
- It can recognize and interact with graphical user interfaces, and generate code from design images [42].
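
The summary does not say how the curriculum in the reinforcement learning phase is implemented, so the following is only a minimal sketch of the general idea of presenting progressively harder tasks. Everything in it is an assumption made for illustration: the per-task `difficulty` score in [0, 1], the linear schedule in `target_difficulty`, the Gaussian weighting in `sampling_weights`, and all default values. It is not Zhipu's actual method.

```python
import math
import random


def target_difficulty(step, total_steps, start=0.2, end=0.9):
    """Linearly raise the difficulty the sampler aims for as training progresses.

    The schedule shape and the start/end values are illustrative assumptions.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def sampling_weights(difficulties, target, width=0.15):
    """Weight each task by a Gaussian bump centred on the target difficulty."""
    return [math.exp(-((d - target) ** 2) / (2 * width ** 2)) for d in difficulties]


def sample_batch(tasks, step, total_steps, batch_size, rng):
    """Draw a batch (with replacement) biased towards the current target difficulty."""
    target = target_difficulty(step, total_steps)
    weights = sampling_weights([t["difficulty"] for t in tasks], target)
    return rng.choices(tasks, weights=weights, k=batch_size)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical task pool: 100 tasks with difficulties spread evenly over [0, 1].
    tasks = [{"id": i, "difficulty": i / 99} for i in range(100)]
    for step in (0, 500, 1000):
        batch = sample_batch(tasks, step, 1000, 256, rng)
        mean = sum(t["difficulty"] for t in batch) / len(batch)
        print(f"step {step}: mean sampled difficulty = {mean:.2f}")
```

In this sketch, easy tasks dominate early batches and hard tasks dominate late ones, while the soft weighting keeps some mix of difficulties at every step. A practical system would likely estimate difficulty from the model's own success rate rather than from a fixed label, which the source does not detail.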