Model Inference Cost Reduction
Inference costs plummet 75%: gpt-oss uses a new data type to achieve 4x inference speed, and an 80GB GPU can run a 120-billion-parameter large model
36Kr · 2025-08-11 10:17
Core Insights
- OpenAI's latest open-source model, gpt-oss, uses the MXFP4 data type, cutting inference costs by 75% and quadrupling token generation speed [1][3]
- MXFP4 allows the 120-billion-parameter model to fit on a single 80GB GPU, while the 20-billion-parameter version runs on a 16GB GPU [1][3]

Summary by Sections

Model Performance and Cost Efficiency
- MXFP4 is applied to roughly 90% of the weights in gpt-oss, significantly lowering the model's operating costs [3]
- The gpt-oss model uses only one-quarter the memory of a comparable BF16 model, while token generation speed can increase by up to four times [3] (a rough throughput estimate follows this summary)

Technical Specifications
- MXFP4 compresses weight storage to one-eighth of the traditional FP32 format, so the same memory bandwidth moves proportionally more weights per read [10][12] (see the footprint arithmetic below)
- The 120-billion-parameter configuration has 5.13 billion active parameters and a 60.8 GiB checkpoint; the 20-billion-parameter configuration has 3.61 billion active parameters and a 12.8 GiB checkpoint [2]

Data Type and Its Implications
- MXFP4 stands for Micro-scaling Floating Point 4-bit; it is defined by the Open Compute Project and designed to shrink data size while preserving precision [12][23] (a minimal quantization sketch follows this summary)
- MXFP4 achieves a large reduction in data size without a substantial loss in model performance, building on research showing minimal quality loss when precision is reduced from 16 bits to 8 bits [25][26]

Comparison with Other Formats
- While MXFP4 improves on standard FP4, NVIDIA notes that its scaling block size may still lead to quality degradation relative to FP8 [26]
- OpenAI's exclusive use of MXFP4 in gpt-oss signals confidence that the format is adequate for high-performance AI applications [26]
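To make the data type concrete, here is a minimal NumPy sketch of the block-scaling idea behind MXFP4: 32 weights share one power-of-two scale, and each weight is stored as a 4-bit E2M1 value. The block size of 32 and the E2M1 value grid follow the OCP MX specification; the scale-selection rule and the brute-force nearest-value rounding are simplifications for illustration, not OpenAI's actual kernels.

```python
import numpy as np

# Representable magnitudes of the FP4 E2M1 element format
# (1 sign bit, 2 exponent bits, 1 mantissa bit): 0, 0.5, 1, 1.5, 2, 3, 4, 6.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 32  # MX block size: 32 elements share one power-of-two scale

def quantize_mxfp4_block(x):
    """Quantize a block of 32 floats to (shared scale, E2M1 values)."""
    amax = np.max(np.abs(x))
    # Choose a power-of-two (E8M0-style) scale so the block maximum
    # lands near the top of the E2M1 range (largest magnitude 6.0,
    # i.e. 1.5 * 2**2, hence the "- 2").
    exp = 0 if amax == 0.0 else int(np.floor(np.log2(amax))) - 2
    scale = 2.0 ** exp
    # Round each scaled element to the nearest representable magnitude;
    # anything beyond 6.0 saturates to 6.0.
    scaled = x / scale
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    return scale, np.sign(scaled) * E2M1_GRID[idx]

def dequantize_mxfp4_block(scale, q):
    """Reconstruct approximate weights: shared scale times element value."""
    return scale * q

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=BLOCK)  # a toy block of weights
s, q = quantize_mxfp4_block(w)
w_hat = dequantize_mxfp4_block(s, q)
print("relative error:", np.max(np.abs(w - w_hat)) / np.max(np.abs(w)))
```

The shared scale is what separates MXFP4 from plain FP4: a bare 4-bit float covers only a fixed, narrow range, while re-centering each 32-element block around its own maximum lets the same 4 bits track weights of very different magnitudes.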
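The headline memory claims also check out with back-of-the-envelope arithmetic. The sketch below assumes MXFP4 costs about 4.25 bits per weight (4-bit element plus one 8-bit shared scale per 32-weight block) and, for simplicity, ignores the small fraction of gpt-oss weights kept in higher precision.

```python
# Rough weight-memory footprint for a 120B-parameter model.
PARAMS = 120e9
GIB = 2**30

bf16_gib  = PARAMS * 16   / 8 / GIB  # ~223.5 GiB: far beyond one 80GB GPU
mxfp4_gib = PARAMS * 4.25 / 8 / GIB  # ~59.4 GiB: in line with the reported
                                     # 60.8 GiB checkpoint [2]
print(f"BF16:  {bf16_gib:.1f} GiB")
print(f"MXFP4: {mxfp4_gib:.1f} GiB (~{bf16_gib / mxfp4_gib:.1f}x smaller)")
```

The roughly 3.8x reduction matches the article's "one-quarter the memory of a comparable BF16 model" and explains how 120 billion parameters fit on a single 80GB GPU.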
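The same arithmetic suggests why token generation speeds up by about the same factor: single-stream decoding is typically memory-bandwidth-bound, so the per-token time is dominated by how many bytes of weights must be streamed from GPU memory. The 3.35 TB/s figure below is a hypothetical H100-class bandwidth chosen for illustration, not a number from the article.

```python
# Rough upper bound on decode throughput:
#   tokens/s <= bandwidth / bytes of weights read per token.
BANDWIDTH = 3.35e12  # bytes/s, assumed H100-class HBM bandwidth
ACTIVE = 5.13e9      # active parameters per token, gpt-oss-120b [2]

for fmt, bits in [("BF16", 16.0), ("MXFP4", 4.25)]:
    tokens_per_s = BANDWIDTH / (ACTIVE * bits / 8)
    print(f"{fmt}: <= {tokens_per_s:,.0f} tokens/s")
```

Quartering the bytes per weight roughly quadruples the bandwidth-limited ceiling, consistent with the article's "up to four times" speed claim.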