Hand-Rolling an OpenAI gpt-oss Inference Engine in 1,000 Lines of Java
AI前线·2025-10-24 04:07

Core Insights
- OpenAI released gpt-oss in August 2025, providing two reasoning models, 120b and 20b, which quickly gained support from major cloud providers and inference engines [3]
- The model architecture follows mainstream designs, using tiktoken for tokenization, a Mixture-of-Experts (MoE) architecture, and a range of efficiency optimizations [5][9]
- The Java port of gpt-oss delivers a high-performance CPU inference engine in roughly 1,000 lines of code, demonstrating the feasibility of running LLMs on CPU [3][37]

Model Architecture Overview
- gpt-oss retains a conventional model architecture, employing techniques such as Grouped Query Attention and MoE to balance model capability against inference cost [5]
- The 20b model has 24 layers, each containing 32 experts, of which only 4 are activated per forward pass to reduce computation (see the routing sketch after this summary) [5]
- Thanks to mxfp4 quantization, the 20b model file is only about 13 GB [5]

Implementation Process
- The Java porting process replicated the original PyTorch model structure, focusing on the key implementations and performance optimizations [9][10]
- The MLP layer parameters are quantized with mxfp4, reducing the memory required during inference (a dequantization sketch follows after this summary) [12]

Performance Optimization
- Initial performance on AWS EC2 was 0.04 tokens/sec; optimizations raised this to roughly 7 tokens/sec for decoding and 10 tokens/sec for prefill [23][34]
- Matrix multiplication was optimized through cache optimization, vectorization, and parallel processing, yielding significant gains (see the matmul sketch after this summary) [24][28]
- The final implementation reached 61.4 GFLOPS on AWS EC2, about 42% of the machine's peak performance [27]

Memory Management
- The project used the Java Foreign Memory API to memory-map the model file, allowing the model to run with only 16 GB of memory (see the mmap sketch after this summary) [29]
- Memory copies were reduced by pre-allocating intermediate buffers and reading MLP weights directly from the mapped file [30]

Conclusion
- The project demonstrates Java's potential for high-performance LLM inference, helped by ongoing improvements in Java's performance capabilities [38]
- The experience highlights how much of LLM inference is an engineering-optimization problem, which distinguishes it from pre-training and post-training [37]
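
The 32-experts / 4-active design is a standard top-k MoE gate. As a rough illustration of what "activating only 4 experts per forward pass" means in code, here is a minimal Java sketch of top-k routing; the class and method names, the expert interface, and the softmax-over-selected-logits gating are assumptions based on common MoE practice, not the article's actual implementation.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.function.BiFunction;

/**
 * Minimal sketch of top-k MoE routing as described for gpt-oss-20b
 * (32 experts, 4 active per token). Names and the expert interface are
 * illustrative, not taken from the article's implementation.
 */
public class MoeRouterSketch {
    static final int NUM_EXPERTS = 32;
    static final int TOP_K = 4;

    /** Combine the outputs of the top-k experts, weighted by softmax of their router logits. */
    static float[] forward(float[] routerLogits, float[] hidden,
                           BiFunction<Integer, float[], float[]> expertMlp) {
        // Pick the indices of the k largest router logits.
        Integer[] idx = new Integer[NUM_EXPERTS];
        for (int i = 0; i < NUM_EXPERTS; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> -routerLogits[i]));

        // Softmax over the selected logits only (standard top-k MoE gating).
        double[] weights = new double[TOP_K];
        double max = routerLogits[idx[0]], sum = 0;
        for (int k = 0; k < TOP_K; k++) {
            weights[k] = Math.exp(routerLogits[idx[k]] - max);
            sum += weights[k];
        }

        // Weighted sum of the active experts' outputs; the other 28 experts are skipped entirely.
        float[] out = new float[hidden.length];
        for (int k = 0; k < TOP_K; k++) {
            float[] e = expertMlp.apply(idx[k], hidden);
            double w = weights[k] / sum;
            for (int d = 0; d < out.length; d++) out[d] += (float) (w * e[d]);
        }
        return out;
    }
}
```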
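
mxfp4 stores weights in blocks of 32 FP4 (E2M1) values that share one 8-bit power-of-two scale (E8M0), which is how the 20b checkpoint fits in roughly 13 GB. The sketch below decodes one such block; the two-values-per-byte packing and the block layout are assumptions taken from the OCP Microscaling format, and the actual gpt-oss weight files may pack bytes differently.

```java
/**
 * Sketch of mxfp4 dequantization: blocks of 32 FP4 (E2M1) values sharing
 * one E8M0 power-of-two scale, per the OCP Microscaling format. The byte
 * layout (two FP4 codes per byte, one scale byte per block) is an assumption,
 * not necessarily the gpt-oss weight-file packing.
 */
public class Mxfp4Sketch {
    // The 16 values representable by FP4 E2M1, indexed by the 4-bit code.
    static final float[] FP4_LUT = {
        0f, 0.5f, 1f, 1.5f, 2f, 3f, 4f, 6f,
        -0f, -0.5f, -1f, -1.5f, -2f, -3f, -4f, -6f
    };

    /** Decode one 32-element block: 16 packed bytes of FP4 codes plus one E8M0 scale byte. */
    static void dequantBlock(byte[] packed, int packedOff, int scaleByte, float[] out, int outOff) {
        // E8M0 is an unsigned biased exponent: scale = 2^(e - 127); e = 255 is reserved for NaN.
        float scale = Math.scalb(1.0f, (scaleByte & 0xFF) - 127);
        for (int i = 0; i < 16; i++) {
            int b = packed[packedOff + i] & 0xFF;
            out[outOff + 2 * i]     = FP4_LUT[b & 0xF] * scale;        // low nibble
            out[outOff + 2 * i + 1] = FP4_LUT[(b >> 4) & 0xF] * scale; // high nibble
        }
    }
}
```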
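
The article credits much of the jump from 0.04 tokens/sec to several tokens/sec to matrix-multiplication work: cache optimization, vectorization, and parallel processing. One minimal way to express those three ideas in Java is a row-parallel, SIMD-vectorized matrix-vector product built on the incubating Vector API (jdk.incubator.vector); the kernel below is a sketch under that assumption, not the author's actual kernel, and a real implementation would add blocking/tiling for the prefill (matrix-matrix) case.

```java
import java.util.stream.IntStream;
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

/**
 * Row-parallel, SIMD-vectorized matrix-vector product (compile and run with
 * --add-modules jdk.incubator.vector). Illustrates the three optimizations
 * named in the article: cache-friendly contiguous row access, vectorization,
 * and parallel processing. Shapes and loop structure are illustrative only.
 */
public class MatVecSketch {
    static final VectorSpecies<Float> S = FloatVector.SPECIES_PREFERRED;

    /** out[r] = dot(w[r, :], x) for a row-major weight matrix w of size rows x cols. */
    static void matVec(float[] w, int rows, int cols, float[] x, float[] out) {
        IntStream.range(0, rows).parallel().forEach(r -> {    // one row per parallel task
            int base = r * cols;                              // contiguous, cache-friendly row scan
            FloatVector acc = FloatVector.zero(S);
            int c = 0;
            for (; c <= cols - S.length(); c += S.length()) {
                FloatVector wv = FloatVector.fromArray(S, w, base + c);
                FloatVector xv = FloatVector.fromArray(S, x, c);
                acc = wv.fma(xv, acc);                        // lane-wise fused multiply-add
            }
            float sum = acc.reduceLanes(VectorOperators.ADD); // horizontal sum of the lanes
            for (; c < cols; c++) sum += w[base + c] * x[c];  // scalar tail
            out[r] = sum;
        });
    }
}
```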
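
Memory-mapping the weights is what allows a roughly 13 GB model to run within 16 GB of RAM: pages are faulted in on demand and never copied onto the Java heap. Below is a minimal sketch using the Foreign Function & Memory API (java.lang.foreign, finalized in JDK 22); the file name and the single-float read at the end are placeholders, not the article's loader.

```java
import java.io.IOException;
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/**
 * Sketch of memory-mapping model weights with the Foreign Function & Memory API.
 * The OS pages weights in on demand, so a ~13 GB file can be used on a 16 GB
 * machine without loading it eagerly. Path and access pattern are placeholders.
 */
public class MmapWeightsSketch {
    public static void main(String[] args) throws IOException {
        Path model = Path.of("gpt-oss-20b.bin");  // placeholder path
        try (FileChannel ch = FileChannel.open(model, StandardOpenOption.READ);
             Arena arena = Arena.ofShared()) {
            // Map the whole file read-only; nothing is read until pages are touched.
            MemorySegment weights = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);
            // Example access: read the float at byte offset 0 directly from the mapping.
            float w0 = weights.get(ValueLayout.JAVA_FLOAT, 0);
            System.out.println("first weight = " + w0);
        }
    }
}
```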