MXFP4

Inference costs plunge 75%! gpt-oss achieves 4x inference speed with a new data type, and an 80GB GPU can run a 120-billion-parameter model
36Kr · 2025-08-11 10:17
Core Insights
- OpenAI's latest open-source model, gpt-oss, utilizes the MXFP4 data type, resulting in a 75% reduction in inference costs and a fourfold increase in token generation speed [1][3]
- The MXFP4 data type allows a 120 billion parameter model to fit on an 80GB GPU, and a 20 billion parameter version can run on a 16GB GPU [1][3]

Summary by Sections

Model Performance and Cost Efficiency
- MXFP4 is applied to approximately 90% of the weights in gpt-oss, significantly lowering the model's operational costs [3]
- The gpt-oss model's memory usage is only one-fourth that of a comparable BF16 model, while token generation speed can increase by up to four times [3]

Technical Specifications
- The MXFP4 data type compresses weight storage to one-eighth of the traditional FP32 format, allowing faster reads and writes at the same memory bandwidth [10][12]
- The 120 billion and 20 billion total-parameter configurations activate 5.13 billion and 3.61 billion parameters per token, respectively, with checkpoint sizes of 60.8 GiB and 12.8 GiB [2]

Data Type and Its Implications
- MXFP4 stands for Micro-scaling Floating Point 4-bit, defined by the Open Compute Project, and is designed to shrink data size while maintaining precision (see the block-quantization sketch after this summary) [12][23]
- The MXFP4 format allows a significant reduction in data size without substantial loss in model performance; prior research indicates minimal quality loss when reducing precision from 16 bits to 8 bits [25][26]

Comparison with Other Formats
- While MXFP4 is an improvement over standard FP4, NVIDIA suggests it may still exhibit quality degradation compared to FP8 because of its scaling block size [26]
- OpenAI's exclusive use of MXFP4 in gpt-oss signals confidence in its adequacy for high-performance AI applications [26]
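The micro-scaling mechanism named above is straightforward to make concrete. Below is a minimal Python sketch of MXFP4-style block quantization as the OCP MX specification defines it (32 elements per block, 4-bit E2M1 values, one shared power-of-two E8M0 scale); it illustrates the technique under those assumptions, not OpenAI's actual kernel, and the function names are invented for the example.

```python
import numpy as np

# Magnitudes representable by an FP4 E2M1 element (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4_block(w):
    """Quantize one block of 32 float weights to an MXFP4-style representation."""
    assert w.size == 32, "the OCP MX spec fixes the block size at 32 elements"
    amax = np.abs(w).max()
    # The shared scale is a power of two (E8M0), chosen so the largest magnitude
    # in the block lands near 6.0, the largest value E2M1 can represent.
    exp = int(np.floor(np.log2(amax))) - 2 if amax > 0 else 0
    scale = 2.0 ** exp
    scaled = w / scale
    # Snap each magnitude to the nearest E2M1 grid point; signs are carried separately.
    mags = E2M1_GRID[np.argmin(np.abs(np.abs(scaled)[:, None] - E2M1_GRID), axis=1)]
    return scale, np.sign(scaled) * mags  # stored as one 8-bit scale + 32 four-bit codes

def dequantize_mxfp4_block(scale, q):
    return scale * q

# Storage per block: 32 * 4 bits + one 8-bit scale = 136 bits, versus
# 32 * 32 = 1024 bits in FP32 -- roughly the one-eighth figure cited above.
w = np.random.randn(32)
scale, q = quantize_mxfp4_block(w)
print("max abs error:", np.abs(w - dequantize_mxfp4_block(scale, q)).max())
```

Because one scale is shared by 32 values, a single outlier can push the rest of the block toward zero; that coarse granularity is the "scaling block size" concern behind NVIDIA's FP8 comparison noted above.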
Inference costs plunge 75%! gpt-oss achieves 4x inference speed with a new data type, and an 80GB GPU can run a 120-billion-parameter model
量子位 (QbitAI) · 2025-08-11 07:48
Core Insights
- OpenAI's latest open-source model gpt-oss utilizes the MXFP4 data type, resulting in a 75% reduction in inference costs and a fourfold increase in token generation speed [1][5][4]

Group 1: Cost Reduction and Performance Improvement
- The MXFP4 data type allows a 120 billion parameter model to fit on an 80GB GPU, and even a 16GB GPU can run a 20 billion parameter version [2]
- MXFP4 compresses memory usage to one-fourth of the equivalent BF16 model while significantly raising token generation speed [5][4]
- MXFP4 quantization is applied to approximately 90% of the weights in gpt-oss, primarily to reduce operational costs [4][5]

Group 2: Technical Mechanism
- A model's operating cost is driven by weight storage and memory bandwidth, and the choice of data type directly affects both [7][10]
- Traditional models store weights in FP32 at 4 bytes per parameter; MXFP4 reduces this to half a byte, an 87.5% reduction in weight storage size [11][12]
- This compression not only shrinks storage but also lets the same bandwidth move weights faster, speeding up inference (see the back-of-envelope sketch below) [13][14]

Group 3: MXFP4 Characteristics
- MXFP4 stands for Micro-scaling Floating Point 4-bit, defined by the Open Compute Project (OCP) [15]
- MXFP4 balances data-size reduction against precision by attaching a shared, higher-precision scaling factor to each group of 4-bit values [20][22]
- Chip throughput can roughly double with each halving of floating-point precision, significantly improving throughput during inference [24]

Group 4: Industry Implications
- OpenAI's adoption of MXFP4 suggests it is adequate for broader applications, potentially influencing industry standards [34][35]
- The MXFP4 data type is not new; it has been discussed in OCP reports, but its practical application in large language models is only now gaining traction [28]
- While MXFP4 improves on standard FP4, it may still face quality issues compared to FP8, prompting alternatives such as Nvidia's NVFP4 [32][33]
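Group 2's bandwidth argument can be checked with the storage arithmetic end to end. The sketch below is a back-of-envelope illustration, not from the article: the 2 TB/s memory bandwidth is a hypothetical stand-in for a modern accelerator, and per-block scale overhead is ignored.

```python
# Back-of-envelope: why 4-bit weights speed up memory-bound token generation.
PARAMS = 120e9                         # weights in the 120-billion-parameter model
BYTES_PER_WEIGHT = {"FP32": 4.0, "BF16": 2.0, "MXFP4": 0.5}
HBM_BANDWIDTH = 2.0e12                 # assumed bytes/s; hypothetical GPU figure

for fmt, nbytes in BYTES_PER_WEIGHT.items():
    size_gb = PARAMS * nbytes / 1e9
    # A memory-bound decode step streams the weights once: time = bytes / bandwidth.
    t_ms = PARAMS * nbytes / HBM_BANDWIDTH * 1e3
    print(f"{fmt:>5}: {size_gb:4.0f} GB of weights, ~{t_ms:3.0f} ms per pass")
```

Under these assumptions the weights occupy 480 GB in FP32, 240 GB in BF16, and 60 GB in MXFP4; only the last fits on an 80GB card (consistent with the 60.8 GiB checkpoint cited in the companion summary), and the per-pass time falls to a quarter of BF16's, matching the up-to-fourfold speedup claimed for bandwidth-bound generation.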