SGLang natively supports Ascend: new models launch with one click, no code changes
(Original title: 「SGLang原生支持昇腾,新模型一键拉起无需改代码」)
量子位 (QbitAI) · 2025-12-21 14:13

Core Insights
- As agents accelerate on the application side, attention is increasingly shifting to whether inference systems can handle real-world loads [1][4]
- The SGLang AI finance meetup highlighted the engineering challenges inference systems face in financial agent scenarios: high-concurrency requests, long context windows, multi-turn reasoning, memory management, and generation consistency [4][9]

Group 1: Inference System Engineering Solutions
- The SGLang event, co-hosted with AtomGit, focused on large-model inference architecture, agents, reinforcement learning, and their application in finance [7]
- Key participants included engineering teams working on inference systems, models, and computing power; compared with traditional LLM serving, agents place higher efficiency demands on high concurrency, long context windows, multi-turn reasoning, and memory management [8]
- Specific deployment scenarios, such as financial agents, impose stricter requirements on low latency, response stability, consistency, and cost control [9]

Group 2: Technical Innovations and Implementations
- SGLang introduced the HiCache system to address KV-cache redundancy and high memory demand in high-concurrency, long-context scenarios, significantly reducing memory usage while improving inference stability and throughput (a tiering sketch follows at the end of this summary) [11]
- For hybrid models such as Qwen3-Next and Kimi Linear, SGLang implemented a Mamba Radix Tree for unified prefix management and an Elastic Memory Pool for efficient inference and memory optimization under long contexts and high concurrency (prefix matching is sketched below) [13]
- The Mooncake system, built on its Transfer Engine, sharply reduced weight-loading and model-startup times: weight-update preparation completes in under 20 seconds, and cold start drops from 85 seconds to 9 seconds (the parallel-transfer effect is sketched below) [17]

Group 3: Collaboration with the Ascend Platform
- These inference-system capabilities are not tied to a single computing platform: HiCache, Mooncake, and GLM run directly on the Ascend platform, signaling a shift in Ascend's role within the inference-system ecosystem [24][25]
- SGLang's latest advances on the Ascend platform cover model adaptation, performance optimization, and modular acceleration, reaching a throughput of 15 TPS per card for DeepSeek V3.2 under specific conditions [29]
- System-level optimizations included load balancing (sketched below), operator fusion to reduce memory access, and multi-stream parallel execution to raise resource utilization [30][31]

Group 4: Future Directions and Open Source Commitment
- Ascend's collaboration with SGLang aims to fully embrace open source and accelerate ecosystem development; gray-release (canary) testing of DeepSeek V3.2 has been completed in real business scenarios [46]
- Future work will focus on systematic engineering investment around inference systems, raising throughput for high-concurrency, low-latency workloads, and aligning with open-source engines on model deployment and performance tuning [47]
- Once models, inference engines, and computing platforms are integrated into a stable collaborative framework, the question shifts from whether a model can run to whether the system can run sustainably and at scale [47]
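For illustration, here is a minimal sketch of the KV-cache tiering idea behind HiCache as summarized in Group 2: hot KV blocks stay on the device, while least-recently-used blocks spill to host memory instead of being discarded, so a later request sharing the same prefix avoids recomputing prefill. All names here are hypothetical; the actual HiCache in SGLang manages device/host/storage tiers with asynchronous transfers and considerably more machinery.

```python
# Minimal two-tier KV-cache sketch (hypothetical names, not the SGLang API).
from collections import OrderedDict

class TieredKVCache:
    """A small 'device' tier that spills least-recently-used blocks to a 'host' tier."""

    def __init__(self, device_capacity: int):
        self.device_capacity = device_capacity
        self.device_tier: OrderedDict[str, bytes] = OrderedDict()  # hot KV blocks
        self.host_tier: dict[str, bytes] = {}                      # spilled KV blocks

    def put(self, block_hash: str, kv_block: bytes) -> None:
        self.device_tier[block_hash] = kv_block
        self.device_tier.move_to_end(block_hash)
        while len(self.device_tier) > self.device_capacity:
            # Evict the LRU block to host memory instead of dropping it.
            evicted_hash, evicted_block = self.device_tier.popitem(last=False)
            self.host_tier[evicted_hash] = evicted_block

    def get(self, block_hash: str) -> bytes | None:
        if block_hash in self.device_tier:
            self.device_tier.move_to_end(block_hash)
            return self.device_tier[block_hash]
        if block_hash in self.host_tier:
            # Promote back to the device tier on reuse: a hit here skips prefill.
            self.put(block_hash, self.host_tier.pop(block_hash))
            return self.device_tier[block_hash]
        return None  # miss: the caller must recompute this KV block
```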
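The Mamba Radix Tree item in Group 2 concerns prefix reuse for hybrid models. The sketch below shows the underlying prefix-matching technique in its simplest form, as a per-token trie rather than a compressed radix tree; it is a generic illustration, not SGLang's implementation. The hard part the talk addressed is that Mamba-style layers carry a fixed-size recurrent state rather than per-token KV pages, so cacheable points must be managed differently for the two layer types.

```python
# Generic prefix-matching sketch over token IDs (illustrative names only).

class RadixNode:
    def __init__(self):
        self.children: dict[int, "RadixNode"] = {}  # next token id -> child
        self.cache_handle = None                    # e.g. a stored KV/state reference

class PrefixTree:
    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens: list[int], cache_handle) -> None:
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())
        node.cache_handle = cache_handle

    def longest_prefix(self, tokens: list[int]):
        """Return (matched_length, cache_handle) for the longest cached prefix."""
        node, best = self.root, (0, None)
        for i, t in enumerate(tokens):
            if t not in node.children:
                break
            node = node.children[t]
            if node.cache_handle is not None:
                best = (i + 1, node.cache_handle)
        return best

tree = PrefixTree()
tree.insert([1, 2, 3, 4], cache_handle="state_A")   # a cached conversation prefix
print(tree.longest_prefix([1, 2, 3, 4, 5, 6]))      # -> (4, 'state_A'): reuse, then extend
```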
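The Mooncake figures in Group 2 (cold start cut from 85 s to 9 s) come from moving weight shards in parallel over fast transports rather than loading them serially; the toy timing below shows only that shape of speedup. The function and numbers are stand-ins, and Mooncake's actual Transfer Engine uses zero-copy transports such as RDMA, not Python threads.

```python
# Toy demonstration of serial vs. parallel shard loading (illustrative only).
import concurrent.futures
import time

def fetch_shard(shard_id: int) -> str:
    time.sleep(0.1)  # stand-in for transferring one weight shard
    return f"shard-{shard_id}"

shards = range(16)

start = time.perf_counter()
for s in shards:                         # serial: 16 x 0.1 s ~= 1.6 s
    fetch_shard(s)
serial = time.perf_counter() - start

start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(fetch_shard, shards))  # overlapped: ~= 0.1 s
parallel = time.perf_counter() - start

print(f"serial {serial:.2f}s vs parallel {parallel:.2f}s")
```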
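Of the system-level optimizations in Group 3, load balancing is the easiest to sketch. Below is a minimal least-loaded dispatch policy; the class and method names are invented for illustration, and a production router (SGLang ships a cache-aware one) also weighs KV-cache locality so that requests sharing a prefix land on the worker that already holds it.

```python
# Minimal least-loaded request dispatch (hypothetical names, not SGLang's router API).

class LeastLoadedRouter:
    def __init__(self, workers: list[str]):
        self.load = {w: 0 for w in workers}  # worker id -> in-flight requests

    def dispatch(self) -> str:
        """Route the next request to the worker with the fewest in-flight requests."""
        worker = min(self.load, key=self.load.__getitem__)
        self.load[worker] += 1
        return worker

    def complete(self, worker: str) -> None:
        """Mark one request on `worker` as finished."""
        self.load[worker] -= 1

router = LeastLoadedRouter(["npu-0", "npu-1", "npu-2", "npu-3"])
print([router.dispatch() for _ in range(6)])  # spreads requests evenly across cards
```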