Inference costs plunge 75%! gpt-oss uses a new data type for 4x inference speed, letting an 80GB GPU run a 120-billion-parameter model
量子位·2025-08-11 07:48

Core Insights
- OpenAI's latest open-source model gpt-oss utilizes the MXFP4 data type, resulting in a 75% reduction in inference costs and a fourfold increase in token generation speed [1][5][4]

Group 1: Cost Reduction and Performance Improvement
- The MXFP4 data type allows a 120-billion-parameter model to fit into an 80GB GPU, and even a 16GB GPU can run a 20-billion-parameter version [2]
- MXFP4 compresses memory usage to one-quarter of the equivalent BF16 model while significantly enhancing token generation speed [5][4]
- MXFP4 quantization is applied to approximately 90% of the weights in gpt-oss, primarily to reduce operating costs [4][5]

Group 2: Technical Mechanism
- A model's operating cost is driven by weight storage and memory bandwidth, so a change of data type affects both directly [7][10]
- Traditional models store weights in FP32 at 4 bytes per parameter; MXFP4 reduces this to half a byte, an 87.5% reduction in weight storage size [11][12]
- This compression not only shrinks storage but also allows faster data read/write operations, enhancing inference speed [13][14]

Group 3: MXFP4 Characteristics
- MXFP4 stands for Micro-scaling Floating Point 4-bit, defined by the Open Compute Project (OCP) [15]
- MXFP4 balances size reduction against precision by attaching a shared scaling factor to each group of higher-precision values [20][22]
- Chip throughput can roughly double with each halving of floating-point precision, significantly improving throughput during inference [24]

Group 4: Industry Implications
- OpenAI's adoption of MXFP4 suggests it is adequate for broader applications, potentially influencing industry standards [34][35]
- The MXFP4 data type is not new; it has been discussed in OCP reports, but its practical application in large language models is only now gaining traction [28]
- While MXFP4 is an improvement over standard FP4, it may still face quality issues compared to FP8, prompting alternatives like NVFP4 from Nvidia [32][33]
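The storage arithmetic and the shared-scale scheme described above can be sketched in a few lines. This is a minimal illustration, not gpt-oss's actual quantization kernel: it assumes the OCP MX convention of 32-element blocks sharing one power-of-two scale, with each element stored in the FP4 E2M1 format (1 sign, 2 exponent, 1 mantissa bit); the function name `mxfp4_quantize` and the nearest-value rounding choice are illustrative.

```python
import numpy as np

# Representable nonnegative magnitudes of FP4 E2M1 (sign handled separately)
FP4_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

# Why a 120B-parameter model fits in 80GB: bytes per parameter by data type.
PARAMS = 120e9
fp32_bytes  = PARAMS * 4                   # 480 GB at 4 bytes/weight
bf16_bytes  = PARAMS * 2                   # 240 GB at 2 bytes/weight
# MXFP4: 4 bits/weight plus one 1-byte shared scale per 32-weight block,
# i.e. ~4.25 bits/weight, an 87.5% cut in the weight payload vs FP32.
mxfp4_bytes = PARAMS * 0.5 + PARAMS / 32   # ~63.75 GB, under the 80GB budget

def mxfp4_quantize(weights, block_size=32):
    """Quantize-then-dequantize a 1-D weight vector, MXFP4-style:
    each block of `block_size` values shares one power-of-two scale."""
    weights = np.asarray(weights, dtype=np.float32)
    pad = (-len(weights)) % block_size
    w = np.pad(weights, (0, pad)).reshape(-1, block_size)

    # Pick a per-block power-of-two scale so the largest magnitude
    # lands within FP4's representable range (max 6.0).
    amax = np.abs(w).max(axis=1, keepdims=True)
    amax[amax == 0] = 1.0
    scale = 2.0 ** np.ceil(np.log2(amax / FP4_VALUES[-1]))

    # Snap each scaled value to the nearest representable FP4 magnitude,
    # then dequantize back so we can inspect the rounding error.
    scaled = w / scale
    idx = np.abs(np.abs(scaled)[..., None] - FP4_VALUES).argmin(axis=-1)
    deq = np.sign(scaled) * FP4_VALUES[idx] * scale
    return deq.reshape(-1)[:len(weights)]
```

The shared scale is the "micro-scaling" part: the 4-bit elements keep only coarse relative precision, while the per-block scale restores dynamic range, which is why the format degrades quality far less than a plain FP4 cast would.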