Another big open-source move from Kimi! Middleware that updates a trillion parameters in 20 seconds is here
量子位·2025-09-11 05:19

Core Viewpoint

The article introduces "checkpoint-engine", an open-source middleware that lets the trillion-parameter Kimi K2 model update its model weights in roughly 20 seconds across thousands of GPUs, a significant step forward in the efficiency of large language model training and inference [6][7].

Group 1: Middleware Functionality

- The checkpoint-engine is designed to update model weights while a large language model is serving inference [6].
- It supports both synchronous broadcasting of updated weights to all nodes and point-to-point dynamic updates to individual nodes (sketched below) [2][24].
- Parameter updates are pipelined, keeping memory usage low by moving parameters through one at a time rather than all at once (sketched below) [19][20].

Group 2: System Architecture

- Kimi K2 uses a hybrid co-location architecture in which the training and inference engines are deployed on the same set of nodes [8].
- In each reinforcement learning iteration, a centralized controller first generates new training data with the inference engine, then instructs the training engine to update parameters on that data (sketched below) [9].
- Each engine is deeply optimized, allowing the system to sustain high throughput [10].

Group 3: Parameter Update Process

- The training engine's parameters are offloaded to DRAM, so the training engine can be reactivated quickly with minimal data transfer [12].
- The checkpoint engine first obtains a local copy of the parameters from the training engine, then broadcasts the complete parameter set to all checkpoint nodes [16][17].
- The inference engine retrieves only the parameter slices it needs from the checkpoint engine, streamlining the update (sketched below) [18].

Group 4: Performance Optimization

- The design trades some data transfer efficiency for a simpler system architecture, which reduces the cost of maintenance and testing [25][26].
- When the training engine starts up, nodes selectively read parameters from disk so that expensive disk I/O is not repeated on every node (sketched below) [28].
- The checkpoint engine can restart independently after a failure, improving system resilience [33].
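
To make the two update modes in Group 1 concrete, here is a minimal sketch using torch.distributed collectives. It assumes a process group spanning the checkpoint-engine ranks is already initialized; the function names `update_weights_broadcast` and `update_weights_p2p` are illustrative and not the actual checkpoint-engine API.

```python
# Illustrative only: hypothetical helpers, not the checkpoint-engine API.
import torch
import torch.distributed as dist


def update_weights_broadcast(named_params: dict[str, torch.Tensor], src_rank: int = 0) -> None:
    """Synchronous mode: the rank holding the fresh weights broadcasts every
    tensor to all ranks in the process group; receivers are updated in place."""
    for _, tensor in sorted(named_params.items()):
        dist.broadcast(tensor, src=src_rank)


def update_weights_p2p(named_params: dict[str, torch.Tensor],
                       src_rank: int, dst_rank: int) -> None:
    """Dynamic mode: ship weights to a single node (for example an inference
    replica that just restarted) without touching the rest of the cluster."""
    rank = dist.get_rank()
    for _, tensor in sorted(named_params.items()):
        if rank == src_rank:
            dist.send(tensor, dst=dst_rank)
        elif rank == dst_rank:
            dist.recv(tensor, src=src_rank)
```

Broadcast covers the common case of every replica upgrading at once; the point-to-point path matters when a single node joins or recovers mid-run.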
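
The pipelined, low-memory update in Group 1 can be pictured as follows: while one parameter is being broadcast, the next one is already being copied from host DRAM to the GPU on a separate stream, so only a couple of tensors occupy GPU memory at any moment. This is a sketch under that assumption, not the real implementation; `apply_to_inference_engine` is a hypothetical callback, and for simplicity every rank stages buffers from its own CPU copy.

```python
# Sketch of a one-parameter-at-a-time pipelined update; names are hypothetical.
import torch
import torch.distributed as dist


def pipelined_update(cpu_params: dict[str, torch.Tensor],
                     apply_to_inference_engine,
                     src_rank: int = 0) -> None:
    copy_stream = torch.cuda.Stream()
    names = sorted(cpu_params)
    staged = cpu_params[names[0]].to("cuda", non_blocking=True) if names else None

    for i, name in enumerate(names):
        current = staged
        staged = None
        if i + 1 < len(names):
            # Prefetch the next parameter while the current one is in flight.
            with torch.cuda.stream(copy_stream):
                staged = cpu_params[names[i + 1]].to("cuda", non_blocking=True)

        dist.broadcast(current, src=src_rank)        # push to all checkpoint ranks
        apply_to_inference_engine(name, current)     # hand just this tensor to the local engine
        torch.cuda.current_stream().wait_stream(copy_stream)
        del current                                  # release GPU memory before the next step
```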
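
The reinforcement learning loop in Group 2 reduces to a small controller that alternates rollout generation, training, and the roughly 20-second weight push. The class and method names below are assumptions for illustration, not the actual Kimi K2 training stack.

```python
# Toy controller loop; every class and method name here is hypothetical.
class Controller:
    def __init__(self, inference_engine, training_engine, checkpoint_engine):
        self.infer = inference_engine
        self.train = training_engine
        self.ckpt = checkpoint_engine

    def run_iteration(self, prompts):
        rollouts = self.infer.generate(prompts)     # 1. produce new training data
        self.train.step(rollouts)                   # 2. update parameters on that data
        weights = self.train.offload_to_dram()      # 3. fresh weights land in host memory
        self.ckpt.push_to_all_replicas(weights)     # 4. ~20 s broadcast to inference nodes
```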
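
Group 3 describes a gather-then-slice flow: the checkpoint engine reconstructs full tensors from the training engine's shards, and each inference worker copies only the slice matching its own layout. The sketch below assumes sharding along dimension 0 and uses hypothetical helper names.

```python
# Sketch of the gather-then-slice flow; sharding scheme and names are assumptions.
import torch
import torch.distributed as dist


def gather_full_param(local_shard: torch.Tensor, group=None) -> torch.Tensor:
    """Steps 1-2: every checkpoint node reconstructs the full tensor from the
    training engine's shards (all_gather leaves each node with the whole parameter)."""
    world = dist.get_world_size(group)
    shards = [torch.empty_like(local_shard) for _ in range(world)]
    dist.all_gather(shards, local_shard, group=group)
    return torch.cat(shards, dim=0)


def slice_for_inference(full_param: torch.Tensor, tp_rank: int, tp_size: int) -> torch.Tensor:
    """Step 3: an inference worker takes only the slice matching its own
    tensor-parallel layout instead of copying the whole tensor."""
    return full_param.chunk(tp_size, dim=0)[tp_rank]
```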
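
One way to read the selective disk access in Group 4: only one rank touches each checkpoint file, and everyone else receives those tensors over the network. This is a simplified sketch under that assumption; the round-robin file assignment and the use of all_gather_object are illustrative choices, not the actual implementation.

```python
# Sketch of cooperative checkpoint loading; file layout and names are assumptions.
import torch
import torch.distributed as dist


def load_checkpoint_cooperatively(shard_files: list[str]) -> dict[str, torch.Tensor]:
    rank, world = dist.get_rank(), dist.get_world_size()
    params: dict[str, torch.Tensor] = {}
    for i, path in enumerate(sorted(shard_files)):
        owner = i % world                # round-robin assignment of files to ranks
        if rank == owner:
            params.update(torch.load(path, map_location="cpu"))  # only the owner hits the disk
    # Each rank's partial dict is exchanged so every rank ends up with the full
    # parameter set without N full reads of the checkpoint from disk.
    gathered = [None] * world
    dist.all_gather_object(gathered, params)
    full: dict[str, torch.Tensor] = {}
    for partial in gathered:
        full.update(partial)
    return full
```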