Cambricon domestic chips - filings, earnings calls, financial reports, news



Guan Cha Zhe Wang· 2025-10-01 01:37

Core Insights
- Zhipu, one of the "Six Little Dragons" of China's domestic large-model developers, has released its latest model, GLM-4.6, with improvements in programming, long-context handling, reasoning, information retrieval, writing, and agent applications [1]

Group 1: Model Enhancements
- GLM-4.6's coding capabilities are on par with Claude Sonnet 4 in public benchmarks and real programming tasks [4]
- The context window has grown from 128K to 200K, accommodating longer code and agent tasks [4]
- The new model improves reasoning and supports tool invocation during reasoning [4]

Group 2: Technological Innovations
- "Model-chip linkage" is a key focus of the release: GLM-4.6 has been deployed with FP8+Int4 mixed precision on domestic Cambricon chips, marking the industry's first FP8+Int4 model-chip solution put into production on domestic chips [4]
- FP8 (Floating-Point 8) offers a wide dynamic range with minimal precision loss, while Int4 (Integer 4) provides a high compression ratio and low memory usage at the cost of more significant precision loss [4][5]

Group 3: Resource Optimization
- The mixed FP8+Int4 mode assigns a quantization format to each module according to its function, optimizing memory usage [5]
- Core parameters, which account for 60%-80% of total memory, can be compressed to 1/4 of their FP16 size through Int4 quantization, significantly reducing pressure on chip memory [5]
- Temporary dialogue data accumulated during inference can also be compressed with Int4 while keeping precision loss at a "slight" level [5]

Group 4: Industry Collaboration
- Moore Threads has completed its adaptation of GLM-4.6 on the vLLM inference framework, demonstrating the advantages of the MUSA architecture and its full-function GPU in ecosystem compatibility and rapid adaptation [5]
- The adaptation work by Cambricon and Moore Threads signifies that domestic GPUs can now iterate alongside cutting-edge large models, accelerating the establishment of an independent and controllable AI technology ecosystem [5]
- GLM-4.6 running on domestic chips will first be offered to enterprises and the public through the Zhipu MaaS platform [5]
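The Int4 figures in the summary above can be checked with a little arithmetic. The sketch below is illustrative only: the function name `compressed_total_ratio` is hypothetical, and it assumes that only the core parameters are compressed (4 bits instead of 16) while everything else in memory stays unchanged, which is a simplification of the mixed scheme the article describes.

```python
def compressed_total_ratio(weight_fraction: float, weight_ratio: float = 4 / 16) -> float:
    """Fraction of the original memory still needed after compressing only the weights.

    weight_fraction: share of total memory taken by core parameters (0.6-0.8 in the article).
    weight_ratio:    compressed size relative to FP16; Int4 is 4 bits vs 16 bits, i.e. 1/4.
    """
    if not 0.0 <= weight_fraction <= 1.0:
        raise ValueError("weight_fraction must be between 0 and 1")
    other = 1.0 - weight_fraction  # assumption: the rest of memory is left untouched
    return other + weight_fraction * weight_ratio


if __name__ == "__main__":
    for fraction in (0.60, 0.80):
        ratio = compressed_total_ratio(fraction)
        print(f"weights = {fraction:.0%} of memory -> total shrinks to {ratio:.0%}")
```

Under these assumptions, total memory falls to roughly 55% of the original when weights make up 60% of it, and to roughly 40% when they make up 80%.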

Zhipu releases GLM-4.6; Cambricon and Moore Threads complete adaptation

Guan Cha Zhe Wang· 2025-10-01 01:36

Core Insights
- Zhipu, one of the "Six Little Dragons" of China's domestic large-model developers, has released its latest model, GLM-4.6, with improvements in programming, long-context handling, reasoning, information retrieval, writing, and agent applications [1]

Model Enhancements
- GLM-4.6's coding capabilities are on par with Claude Sonnet 4 in public benchmarks and real programming tasks [4]
- The context window has grown from 128K to 200K, accommodating longer code and agent tasks [4]
- The new model improves reasoning and supports tool invocation during reasoning [4]
- Tool invocation and search-agent capabilities have also been improved [4]

Chip Integration and Cost Efficiency
- "Model-chip linkage" is a key focus of the release: GLM-4.6 has been deployed with FP8+Int4 mixed precision on domestic Cambricon chips, the first such deployment in the industry on domestic chips [4]
- The mixed-precision approach lowers inference cost while maintaining accuracy, exploring a feasible path for running large models locally on domestic chips [4]
- FP8 (Floating-Point 8) offers a wide dynamic range with minimal precision loss, while Int4 (Integer 4) provides a high compression ratio and low memory usage at the cost of relatively higher precision loss [4]

Memory Optimization
- The model's core parameters, which account for 60%-80% of total memory, can be compressed to 1/4 of their FP16 size through Int4 quantization, significantly reducing pressure on chip memory [5]
- Temporary dialogue data accumulated during inference can also be compressed with Int4 while keeping precision loss at a "slight" level [5]
- FP8 is used for numerically sensitive modules to minimize precision loss and retain fine-grained semantic information [5]

Ecosystem Development
- The adaptation of GLM-4.6 by Cambricon and Moore Threads signifies that domestic GPUs can now collaborate and iterate with cutting-edge large models, accelerating the construction of an independent and controllable AI technology ecosystem [6]
- GLM-4.6 running on domestic chips will first be offered to enterprises and the public through the Zhipu MaaS platform [6]
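To make the "1/4 of FP16 size" and "precision loss" points above concrete, here is a minimal sketch of symmetric Int4 quantization in plain Python. It is not the scheme used by Zhipu or Cambricon, which the article does not describe; the function names, the single shared scale, and the sample values are all assumptions chosen for illustration.

```python
from typing import List, Tuple

INT4_MIN, INT4_MAX = -8, 7  # range of a signed 4-bit integer


def quantize_int4(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to integer codes in [-8, 7] using one shared scale (an assumption)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / INT4_MAX if max_abs > 0 else 1.0
    codes = [max(INT4_MIN, min(INT4_MAX, round(v / scale))) for v in values]
    return codes, scale


def pack_nibbles(codes: List[int]) -> bytes:
    """Store two 4-bit codes per byte, low nibble first."""
    padded = codes + [0] if len(codes) % 2 else codes
    return bytes((lo & 0xF) | ((hi & 0xF) << 4) for lo, hi in zip(padded[0::2], padded[1::2]))


def unpack_nibbles(packed: bytes, count: int) -> List[int]:
    """Recover signed 4-bit codes from packed bytes."""
    codes: List[int] = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            codes.append(nibble - 16 if nibble >= 8 else nibble)
    return codes[:count]


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Approximate the original floats from codes and scale."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    weights = [0.7, -0.35, 0.1, 0.0, -0.7, 0.2, 0.05, -0.15]  # made-up sample values
    codes, scale = quantize_int4(weights)
    packed = pack_nibbles(codes)
    restored = dequantize(unpack_nibbles(packed, len(weights)), scale)
    fp16_bytes = 2 * len(weights)
    print(f"FP16: {fp16_bytes} bytes, Int4: {len(packed)} bytes (plus one scale)")
    print("max abs error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

Eight values take 16 bytes in FP16 and 4 bytes once packed as Int4, and the rounding error per value is bounded by half the scale. That bounded but nonzero error is the precision loss the summary refers to, and it is why a wider format such as FP8 is kept for numerically sensitive modules.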

Di Yi Cai Jing· 2025-09-30 07:38

Core Insights
- The domestic AI startup Zhipu has released its latest model, GLM-4.6, with improvements in programming, long-context handling, reasoning, information retrieval, writing, and agent applications [3]

Model Enhancements
- GLM-4.6's coding capabilities are on par with Claude Sonnet 4 in public benchmarks and real programming tasks [3]
- The context window has grown from 128K to 200K, accommodating longer code and agent tasks [3]
- The new model improves reasoning and supports tool invocation during reasoning [3]
- Tool invocation and search capabilities have also been improved [3]

Chip Integration
- "Model-chip linkage" is a key focus of the release: GLM-4.6 has been deployed with FP8+Int4 mixed quantization on domestic Cambricon chips, marking the industry's first FP8+Int4 model-chip solution put into production on domestic chips [3]
- The approach maintains accuracy while reducing inference cost, exploring a feasible path for running large models locally on domestic chips [3]

Quantization Techniques
- FP8 (Floating-Point 8) offers a wide dynamic range with minimal precision loss, while Int4 (Integer 4) provides a high compression ratio and lower memory usage at the cost of more noticeable precision loss [4]
- The "FP8+Int4 mixed" mode assigns a quantization format to each module according to its function, optimizing memory usage [4]

Memory Efficiency
- The model's core parameters, which account for 60%-80% of total memory, can be compressed to 1/4 of their FP16 size through Int4 quantization, significantly reducing pressure on chip memory [5]
- Temporary dialogue data accumulated during inference can also be compressed with Int4 while keeping precision loss minimal [5]
- FP8 is used for numerically sensitive modules to minimize precision loss and retain fine-grained semantic information [5]

Ecosystem Development
- Cambricon and Moore Threads have successfully adapted GLM-4.6 based on the vLLM inference framework, demonstrating that the new generation of GPUs can run the model stably at native FP8 precision [5]
- The adaptation signifies that domestic GPUs can now collaborate and iterate with cutting-edge large models, accelerating the development of an independent and controllable AI technology ecosystem [5]
- GLM-4.6 running on domestic chips will be offered to enterprises and the public through the Zhipu MaaS platform [5]
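The per-module allocation described above (Int4 for core parameters and temporary dialogue data, FP8 for numerically sensitive modules) can be sketched as a small planning routine. Everything here is hypothetical: the `Module` type, the module names, the element counts, and the rule in `choose_format` are placeholders for illustration, not details of the actual GLM-4.6 deployment.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

BITS_PER_ELEMENT = {"fp16": 16, "fp8": 8, "int4": 4}


@dataclass(frozen=True)
class Module:
    name: str        # hypothetical label, not a real GLM-4.6 module name
    elements: int    # number of stored values
    sensitive: bool  # whether the module is treated as numerically sensitive


def choose_format(module: Module) -> str:
    """Assumed rule: FP8 for sensitive modules, Int4 for everything else."""
    return "fp8" if module.sensitive else "int4"


def footprint_bytes(modules: Iterable[Module], fmt_for: Callable[[Module], str]) -> int:
    """Total storage in bytes when each module uses the format chosen by fmt_for."""
    return sum(m.elements * BITS_PER_ELEMENT[fmt_for(m)] // 8 for m in modules)


if __name__ == "__main__":
    # Made-up element counts chosen only to keep the arithmetic easy to follow.
    plan = [
        Module("core_parameters", 800, sensitive=False),
        Module("temporary_dialogue_data", 100, sensitive=False),
        Module("sensitive_modules", 100, sensitive=True),
    ]
    baseline = footprint_bytes(plan, lambda m: "fp16")
    mixed = footprint_bytes(plan, choose_format)
    for m in plan:
        print(f"{m.name}: {choose_format(m)}")
    print(f"FP16 baseline: {baseline} bytes, FP8+Int4 mixed: {mixed} bytes")
```

With these placeholder sizes the FP16 baseline is 2000 bytes and the mixed plan is 550 bytes. The point of the sketch is the shape of the trade-off: the bulk of the data goes to the smallest format, and only the sensitive remainder pays for the wider one.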