Core Insights
- Huawei has officially launched its AI inference "black technology" UCM (Inference Memory Data Manager) to address challenges in AI inference efficiency and user experience [1]

Group 1: AI Inference Development
- The AI industry is shifting its focus from "maximizing model capabilities" to "optimizing the inference experience," which directly affects user satisfaction and commercial viability and has become a key metric for measuring the value of an AI model [2]
- Huawei plans to open-source UCM in September, launching it first in the Magic Engine community, then gradually contributing it to mainstream inference engine communities and sharing it with all storage vendors and ecosystem partners [2]

Group 2: UCM Features and Benefits
- UCM is an inference acceleration suite centered on the KV Cache. It integrates multiple caching acceleration algorithms to manage the KV Cache memory data generated during inference in tiers, expanding the inference context window while delivering high throughput and low latency [3]
- Using techniques such as dynamic KV offloading and positional-encoding extension, UCM can dynamically offload long-sequence KV Cache to external professional storage, achieving a tenfold increase in the inference context window [7]

Group 3: Cost Efficiency and Performance
- UCM improves inference cost efficiency by letting memory data flow on demand across HBM, DRAM, and SSD storage media, and by integrating multiple sparse attention algorithms to raise TPS (tokens per second) by 2x to 22x, thereby reducing the cost per token [8]
- Mainstream AI models in China currently deliver single-user output speeds below 60 tokens/s with latencies of 50 to 100 ms, while leading foreign models have reached the 200 tokens/s range with latencies around 5 ms [8]

Group 4: Practical Applications
- Huawei's AI inference acceleration solution, which combines UCM with Huawei AI storage (OceanStor A series), is being piloted with China UnionPay in three business scenarios: Voice of the Customer, Marketing Planning, and Office Assistant [9]
- In the Office Assistant scenario, the solution supports user inputs of more than 170,000 tokens for long-sequence inference, addressing the problem of long-sequence models failing to work effectively [10]
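The tiered KV Cache management described above can be sketched as a multi-level cache that demotes least-recently-used blocks from fast to slow tiers (HBM to DRAM to SSD) and promotes them back on access. This is a minimal illustrative sketch only; the class, method names, and tier sizes are hypothetical and do not represent Huawei's actual UCM API.

```python
from collections import OrderedDict

# Hypothetical sketch of tiered KV Cache management (HBM -> DRAM -> SSD),
# as the article describes at a high level. Not Huawei's actual UCM API.

class TieredKVCache:
    def __init__(self, capacities):
        # capacities: (tier_name, max_blocks) pairs, fastest tier first
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def put(self, block_id, kv_block):
        """Insert a KV block into the fastest tier, demoting LRU blocks."""
        self._insert(0, block_id, kv_block)

    def _insert(self, level, block_id, kv_block):
        if level >= len(self.tiers):
            return  # past the slowest tier: block is dropped
        name, cap, store = self.tiers[level]
        store[block_id] = kv_block
        store.move_to_end(block_id)            # mark as most recently used
        while len(store) > cap:
            victim_id, victim = store.popitem(last=False)  # evict LRU block
            self._insert(level + 1, victim_id, victim)     # demote one tier down

    def get(self, block_id):
        """Fetch a block; on a hit, promote it back to the fastest tier."""
        for name, cap, store in self.tiers:
            if block_id in store:
                kv_block = store.pop(block_id)
                self._insert(0, block_id, kv_block)
                return kv_block, name          # name = tier where it was found
        return None, None

# Toy capacities: 2 blocks of HBM, 4 of DRAM, 16 of SSD
cache = TieredKVCache([("HBM", 2), ("DRAM", 4), ("SSD", 16)])
for i in range(8):
    cache.put(i, f"kv{i}")
block, tier = cache.get(7)  # recent block is still resident in HBM
```

Older blocks cascade to the slower tiers as capacity fills, which mirrors the article's point: long-sequence KV data can live on cheaper external storage instead of being discarded, widening the usable context window.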
Huawei Releases AI Inference "Black Technology" to Help Solve AI Inference Efficiency and User Experience Challenges
Zhong Guo Ji Jin Bao (China Fund News) · 2025-08-12 07:50