Core Viewpoint
- Huawei has officially launched the AI inference technology "UCM" (Inference Memory Data Manager) to address challenges in AI inference efficiency and user experience [2][4].

Group 1: AI Inference Development
- The AI industry is shifting its focus from maximizing model capabilities to optimizing the inference experience, which directly affects user satisfaction and commercial viability [4].
- Huawei plans to open-source UCM in September, initially releasing it on the Magic Engine community and gradually contributing it to mainstream inference engine communities [5].

Group 2: UCM Technology and Benefits
- UCM is a KV Cache-centered inference acceleration suite that integrates various caching acceleration algorithms to manage KV Cache memory data during inference, increasing throughput and reducing latency [7].
- UCM enables longer inference sequences by offloading cache data to external storage, achieving a tenfold increase in the inference context window [8].

Group 3: Cost Efficiency and Performance
- UCM can dynamically manage memory across HBM, DRAM, and SSD based on memory usage, improving TPS (tokens per second) by 2 to 22 times and thus lowering the cost per token (a conceptual sketch of this tiering idea follows below) [11].
- Current mainstream AI models in China output fewer than 60 tokens per second with a latency of 50 to 100 ms, while leading models abroad reach 200 tokens per second with a latency of 5 ms [11].

Group 4: Practical Applications
- Huawei's AI inference acceleration solution, which combines UCM with OceanStor A series technology, is being piloted in collaboration with China UnionPay across three business scenarios: Voice of Customer, Marketing Planning, and Office Assistant [12].
- In the Office Assistant scenario, the solution supports user inputs of more than 170,000 tokens for long-sequence inference, addressing the limits that models face on long sequences [15].
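The article does not disclose UCM's internal design. As a rough illustration of the KV Cache tiering idea described in Groups 2 and 3 (keeping hot cache blocks in HBM while spilling colder ones to DRAM and SSD so longer contexts fit), the following minimal Python sketch may help. The class name TieredKVCache, the tier capacities, and the LRU eviction policy are illustrative assumptions, not Huawei's actual UCM implementation.

```python
from collections import OrderedDict

# Hypothetical tier capacities, in number of KV-cache blocks (illustrative only).
TIER_CAPACITY = {"HBM": 4, "DRAM": 16, "SSD": 1024}
TIER_ORDER = ["HBM", "DRAM", "SSD"]  # fastest to slowest


class TieredKVCache:
    """Toy KV-cache manager that spills least-recently-used blocks from
    HBM to DRAM to SSD, loosely mirroring the tiering idea the article
    attributes to UCM (not Huawei's actual design)."""

    def __init__(self):
        # One LRU map per tier: block_id -> KV block payload (here just bytes).
        self.tiers = {name: OrderedDict() for name in TIER_ORDER}

    def put(self, block_id, kv_block):
        """Insert a new KV block into the fastest tier, spilling as needed."""
        self._insert("HBM", block_id, kv_block)

    def get(self, block_id):
        """Fetch a block, promoting it back to HBM on access."""
        for name in TIER_ORDER:
            if block_id in self.tiers[name]:
                kv_block = self.tiers[name].pop(block_id)
                self._insert("HBM", block_id, kv_block)
                return kv_block
        return None  # cache miss: the inference engine would recompute this block

    def _insert(self, tier_name, block_id, kv_block):
        tier = self.tiers[tier_name]
        tier[block_id] = kv_block
        tier.move_to_end(block_id)  # mark as most recently used
        if len(tier) > TIER_CAPACITY[tier_name]:
            victim_id, victim = tier.popitem(last=False)  # evict the LRU block
            nxt = TIER_ORDER.index(tier_name) + 1
            if nxt < len(TIER_ORDER):
                self._insert(TIER_ORDER[nxt], victim_id, victim)
            # beyond the last tier, the block is simply dropped


if __name__ == "__main__":
    cache = TieredKVCache()
    for i in range(30):                       # simulate caching a long prompt
        cache.put(i, f"kv-block-{i}".encode())
    cache.get(0)                              # an old block is promoted back to HBM
    print({name: list(t.keys()) for name, t in cache.tiers.items()})
```

The point of the sketch is only that a small, fast tier plus larger, slower tiers lets the total cached context grow far beyond HBM capacity, which is the mechanism the article credits for the larger context window and higher TPS.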
Source: "AI重磅!华为'黑科技'来了" ("AI Bombshell: Huawei's 'Black Tech' Has Arrived"), 中国基金报 (China Fund News), 2025-08-12 07:37