Zhipu's New Model Also Adopts DeepSeek's MLA, and an Apple M5 Can Run It
QbitAI (量子位) · 2026-01-20 04:17

Core Viewpoint
- The article covers the launch of GLM-4.7-Flash, a new lightweight language model from Zhipu AI that replaces its predecessor GLM-4.5-Flash and is available via free API access.

Group 1: Model Specifications
- GLM-4.7-Flash has 30 billion total parameters, of which only 3 billion are activated during inference, significantly reducing computational cost while maintaining performance [4][10].
- The model uses a Mixture of Experts (MoE) architecture and is positioned specifically for local programming and intelligent-assistant tasks [4][9].
- It scored 59.2 on the SWE-bench Verified code-repair benchmark, outperforming comparable models such as Qwen3-30B and GPT-OSS-20B [4].

Group 2: Performance and Applications
- The model is optimized for efficiency and retains the GLM-4 series' core coding and reasoning capabilities [7].
- Beyond programming, GLM-4.7-Flash is also recommended for creative writing, translation, long-context tasks, and role-playing scenarios [8].
- Initial tests on an Apple laptop with 32GB of unified memory reached a speed of 43 tokens per second [17].

Group 3: Technical Innovations
- Adopting the MLA (Multi-head Latent Attention) architecture, previously validated in DeepSeek-V2, marks a significant advancement [12].
- The model is similar in depth to GLM-4.5-Air and Qwen3-30B-A3B, but it uses 64 experts and activates only 5 of them during inference [13].

Group 4: Market Position and Pricing
- GLM-4.7-Flash is offered for free on the official API platform, with a high-speed version available at low cost [19].
- Compared with similar models, GLM-4.7-Flash has advantages in context-length support and output-token pricing, though latency and throughput still require optimization [19].
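The sparse-activation figures quoted above (64 experts, only 5 active per token, so roughly 3B of 30B parameters run per inference step) correspond to standard top-k MoE routing. A minimal sketch in Python, assuming a simple linear router and linear "experts" (all dimensions and names here are hypothetical, not GLM-4.7-Flash's actual implementation):

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=5):
    """Route one token vector x through the top-k of many experts.

    Illustrative top-k Mixture-of-Experts routing: only k expert
    networks run per token, so compute scales with k rather than
    with the total expert count.
    """
    logits = x @ gate_w                      # (num_experts,) router scores
    top = np.argsort(logits)[-k:]            # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                 # softmax over the selected experts
    # Combine the k expert outputs, weighted by the router.
    return sum(w * (experts[i] @ x) for w, i in zip(weights, top))

# Toy configuration: 64 experts, 5 active, mirroring the article's counts.
rng = np.random.default_rng(0)
d, num_experts = 8, 64
x = rng.standard_normal(d)
gate_w = rng.standard_normal((d, num_experts))
experts = [rng.standard_normal((d, d)) for _ in range(num_experts)]
y = moe_forward(x, gate_w, experts, k=5)
print(y.shape)  # same shape as the input token vector
```

Because the router output is a weighted sum over only 5 of the 64 expert matrices, the per-token FLOPs track the activated-parameter count (3B) rather than the total (30B).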
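The practical payoff of MLA, as validated in DeepSeek-V2, is a much smaller KV cache: instead of caching full per-head keys and values, each token caches one compressed latent vector. A back-of-the-envelope comparison, with all dimensions hypothetical for illustration (not GLM-4.7-Flash's real configuration):

```python
def kv_cache_bytes(seq_len, n_layers, width_per_token, bytes_per_elem=2):
    """KV-cache size in bytes for one sequence at fp16/bf16 precision."""
    return seq_len * n_layers * width_per_token * bytes_per_elem

# Hypothetical example dimensions.
n_heads, head_dim = 32, 128
d_latent = 512            # MLA's compressed per-token KV latent

# Standard multi-head attention caches full K and V:
# 2 * n_heads * head_dim elements per token per layer.
mha = kv_cache_bytes(seq_len=32_000, n_layers=40,
                     width_per_token=2 * n_heads * head_dim)
# MLA caches only the shared latent vector per token per layer.
mla = kv_cache_bytes(seq_len=32_000, n_layers=40,
                     width_per_token=d_latent)
print(f"MHA: {mha / 2**30:.1f} GiB, MLA: {mla / 2**30:.1f} GiB, "
      f"ratio: {mha / mla:.0f}x")
```

With these illustrative numbers the cache shrinks by 16x, which is the kind of saving that makes a 30B-parameter model with long context plausible on a laptop with 32GB of unified memory.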