Domestic Inference Ecosystem
Huawei Releases AI Inference "Black Technology" to Help Solve AI Inference Efficiency and User Experience Challenges
Zhong Guo Ji Jin Bao · 2025-08-12 07:50
Core Insights
- Huawei officially launched the AI inference "black technology" UCM (Inference Memory Data Manager) to address challenges in AI inference efficiency and user experience [1]

Group 1: AI Inference Development
- The AI industry is shifting focus from "maximizing model capabilities" to "optimizing the inference experience," which directly affects user satisfaction and commercial viability and has become a key metric for measuring the value of an AI model [2]
- Huawei plans to open-source UCM in September, initially launching it in the Magic Engine community, then gradually contributing it to mainstream inference engine communities and sharing it with all storage vendors and ecosystem partners [2]

Group 2: UCM Features and Benefits
- UCM is a KV Cache-centered inference acceleration suite that integrates multiple caching acceleration algorithms and manages the KV Cache memory data generated during inference in tiers, expanding the inference context window for high throughput and low latency [3]
- Using techniques such as dynamic KV offloading and position-encoding extension, UCM can dynamically offload long-sequence cache to external professional storage, achieving a tenfold increase in the inference context window [7]

Group 3: Cost Efficiency and Performance
- UCM improves inference cost efficiency by letting memory data flow across HBM, DRAM, and SSD storage media according to usage, and by integrating multiple sparse attention algorithms to raise TPS (tokens per second) by 2 to 22 times, thereby lowering the cost per token [8]
- Mainstream AI models in China currently deliver single-user output speeds below 60 tokens/s with latencies of 50 to 100 ms, while leading foreign models have reached the 200 tokens/s range with latencies around 5 ms [8]

Group 4: Practical Applications
- Huawei's AI inference acceleration solution, which combines UCM with Huawei AI storage (the OceanStor A series), is being piloted with China UnionPay in three business scenarios: Voice of the Customer, Marketing Planning, and Office Assistant [9]
- In the Office Assistant scenario, the solution supports user input of more than 170,000 tokens for long-sequence inference, addressing the problem of long-sequence models failing to work effectively [10]
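The tiered KV Cache management described above can be illustrated with a minimal sketch. UCM's actual implementation is not public, so everything here — the class name, tier capacities, and LRU demotion policy — is a hypothetical illustration of the general idea: hot KV blocks stay in fast, scarce HBM, while colder blocks cascade down to DRAM and then SSD, and a lower-tier hit is promoted back up.

```python
from collections import OrderedDict

class TieredKVCache:
    """Hypothetical LRU cache tier that demotes evicted KV blocks to a slower tier."""
    def __init__(self, capacity, lower=None):
        self.capacity = capacity      # max KV blocks held in this tier
        self.lower = lower            # next (slower, larger) tier, or None
        self.blocks = OrderedDict()   # block_id -> KV data (placeholder for tensors)

    def put(self, block_id, kv):
        self.blocks[block_id] = kv
        self.blocks.move_to_end(block_id)
        while len(self.blocks) > self.capacity:
            victim, data = self.blocks.popitem(last=False)  # evict least-recently used
            if self.lower is not None:
                self.lower.put(victim, data)                # demote instead of dropping

    def get(self, block_id):
        # Check this tier first; on a miss, fall through to slower tiers
        # and promote any hit back into this tier.
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)
            return self.blocks[block_id]
        if self.lower is not None:
            kv = self.lower.get(block_id)
            if kv is not None:
                del self.lower.blocks[block_id]
                self.put(block_id, kv)
            return kv
        return None

# Illustrative capacities only: HBM is smallest and fastest, SSD largest and slowest.
ssd  = TieredKVCache(capacity=1000)
dram = TieredKVCache(capacity=100, lower=ssd)
hbm  = TieredKVCache(capacity=10,  lower=dram)

for i in range(50):                 # generate more KV blocks than HBM can hold
    hbm.put(i, f"kv-{i}")
print(hbm.get(49))                  # recent block served from HBM
print(hbm.get(0))                   # cold block recalled from a lower tier
```

The point of the sketch is the eviction path: demoting rather than discarding is what lets the effective context window grow beyond HBM capacity, at the price of slower access to cold blocks — the trade-off UCM's caching algorithms are described as managing.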