Core Insights
- DeepSeek has introduced a new inference system called DualPath, which addresses the I/O bottleneck that current large language models face in intelligent-agent applications [1][3][19]
- DualPath significantly improves throughput through a dual-path loading mechanism that effectively eliminates KV cache I/O overhead [1][13]

Group 1: System Innovation
- DualPath opens a new channel directly from storage to the decoding engine, allowing the KV cache to be loaded into the decoding engine and transmitted efficiently to the prefill side via RDMA [1][5]
- In tests on real intelligent-agent workloads, the system delivers up to a 1.87x increase in offline inference throughput and an average 1.96x increase in online serving throughput [1][13][17]

Group 2: Technical Components
- DualPath consists of three core components: the inference engine, the traffic manager, and the request scheduler, which together optimize data movement and resource utilization [6][7]
- The traffic manager uses a compute-network-centric traffic-management strategy to ensure that KV cache traffic does not interfere with latency-sensitive model collective communications [11][12]

Group 3: Performance Validation
- Experiments on a GPU server cluster connected via InfiniBand showed that DualPath achieves up to 1.87x acceleration over baseline inference frameworks, indicating that KV cache I/O overhead has been largely eliminated [13][15]
- The system's scalability has been validated, with near-linear scaling from 2P4D (2,000 agents) to 48P96D (48,000 agents) while maintaining consistent task completion times [17][18]

Group 4: Future Directions
- The research team acknowledges the need for more adaptive and flexible configurations for parallelism and P/D ratios, suggesting potential simulator-based or online adjustment mechanisms [19]
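The dual-path loading idea described above can be sketched in miniature. This is a hedged toy model, not DualPath's actual implementation: all names (`load_from_storage`, `dual_path`, the thread-based stand-in for the RDMA transfer) are assumptions, since the summary does not show the system's real API.

```python
import threading

def load_from_storage(num_blocks):
    """Simulate reading KV cache blocks from storage."""
    return [f"kv_block_{i}" for i in range(num_blocks)]

def single_path(num_blocks):
    # Traditional path: storage -> prefill -> decode, strictly sequential,
    # so the decode engine waits on the full relay through prefill.
    blocks = load_from_storage(num_blocks)
    prefill_cache = list(blocks)        # prefill engine receives first
    decode_cache = list(prefill_cache)  # then forwards to the decode engine
    return prefill_cache, decode_cache

def dual_path(num_blocks):
    # Dual path: the decode engine loads directly from storage while the
    # same blocks are pushed to the prefill side concurrently (a thread
    # here stands in for the RDMA transfer described in the article).
    blocks = load_from_storage(num_blocks)
    decode_cache, prefill_cache = [], []

    to_decode = threading.Thread(target=decode_cache.extend, args=(blocks,))
    to_prefill = threading.Thread(target=prefill_cache.extend, args=(blocks,))
    to_decode.start(); to_prefill.start()
    to_decode.join(); to_prefill.join()
    return prefill_cache, decode_cache
```

Both paths end with identical caches on each side; the difference the article reports is purely in how much sequential I/O sits on the critical path.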
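The traffic manager's role of keeping KV cache transfers from delaying latency-sensitive collectives can be illustrated with a simple strict-priority queue. This is a minimal sketch under assumed names and priority values; the article does not describe DualPath's actual scheduling policy beyond the prioritization goal.

```python
import heapq

# Hypothetical priority classes: lower value = higher priority.
COLLECTIVE, KV_CACHE = 0, 1

class TrafficManager:
    """Toy strict-priority scheduler: collective-communication messages
    are always dequeued before bulk KV cache transfers."""

    def __init__(self):
        self._queue = []
        self._seq = 0  # tie-breaker preserves FIFO order within a class

    def submit(self, priority, message):
        heapq.heappush(self._queue, (priority, self._seq, message))
        self._seq += 1

    def next_transfer(self):
        return heapq.heappop(self._queue)[2]

tm = TrafficManager()
tm.submit(KV_CACHE, "kv_block_0")
tm.submit(COLLECTIVE, "all_reduce")
tm.submit(KV_CACHE, "kv_block_1")

order = [tm.next_transfer() for _ in range(3)]
# the collective goes first; KV blocks keep their FIFO order
```

A strict-priority design like this matches the stated goal (collectives never wait behind cache traffic), though a real compute-network-centric manager would also have to handle bandwidth sharing and starvation, which this sketch ignores.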
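The scalability claim can be sanity-checked with simple arithmetic: going from 2P4D (2 prefill + 4 decode instances, 2,000 agents) to 48P96D (48 prefill + 96 decode instances, 48,000 agents), every dimension grows by the same factor, which is exactly what near-linear scaling with constant task completion time implies.

```python
# Reported configurations from the article; field names are assumptions.
small = {"prefill": 2, "decode": 4, "agents": 2_000}
large = {"prefill": 48, "decode": 96, "agents": 48_000}

# Growth factor along each dimension.
scale = {k: large[k] / small[k] for k in small}
# all three dimensions scale by the same 24x factor
```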
DeepSeek releases next-generation technology, with a Peking University intern playing a key role