Huawei Unveils AI Breakthrough UCM, to Be Officially Open-Sourced in September
Zheng Quan Shi Bao Wang·2025-08-12 10:16

Core Insights
- Huawei has launched UCM, an AI inference technology that significantly reduces inference latency and cost while improving efficiency [1][2][3]

Group 1: Technology and Performance
- UCM targets the high latency and high cost that constrain AI inference today: leading foreign models reach output speeds of 200 tokens/s at latencies around 5 ms, while domestic models remain below 60 tokens/s with latencies of 50-100 ms [2][3]
- UCM is built on a KV Cache-centered architecture that integrates multiple caching-acceleration algorithms to manage KV Cache data across memory tiers, expanding the inference context window while sustaining high throughput at low latency [3] (a tiered-cache sketch appears after Group 3)
- Through global prefix caching, UCM can cut first-token latency by up to 90%, and in long-sequence scenarios it can raise TPS (tokens per second) by 2 to 22 times [3][4] (a prefix-caching sketch appears after Group 3)

Group 2: Market Context and Impact
- Chinese internet companies invest in AI at roughly one-tenth the scale of their U.S. counterparts, leaving a gap in inference experience relative to overseas models [4]
- UCM aims to improve the inference experience without adding computing-infrastructure cost, supporting a positive business cycle of better experience, user growth, increased investment, and technological iteration [4]
- UCM has been piloted with China UnionPay in three business scenarios, with notable results in accelerating AI inference for smart finance [4]

Group 3: Future Plans and Industry Collaboration
- As AI applications penetrate real-world scenarios, demand for token processing is expected to surge: one cited projection puts daily token calls at 16.4 trillion by May 2025, a 137-fold increase over 2024, implying a 2024 baseline of roughly 120 billion calls per day [5]
- Huawei plans to open-source UCM in September 2025, contributing it to mainstream inference-engine communities and sharing it with industry partners to promote standardization and accelerate development of the inference field [5]
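The digest does not describe UCM's internals, so the following is a minimal Python sketch, assuming one common technique behind a KV Cache-centered design: keep hot KV blocks in fast memory and spill evicted blocks to a larger, slower tier instead of discarding them. The names `TieredKVCache`, `put`, and `get` are invented for illustration and are not UCM's actual API.

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier KV cache: a small fast tier (standing in for HBM) backed
    by a large slow tier (standing in for DRAM/SSD). Hypothetical, not UCM."""

    def __init__(self, fast_capacity: int):
        self.fast_capacity = fast_capacity
        self.fast: OrderedDict[str, bytes] = OrderedDict()  # kept in LRU order
        self.slow: dict[str, bytes] = {}

    def put(self, key: str, kv_block: bytes) -> None:
        # Insert into the fast tier; spill the least-recently-used block
        # to the slow tier instead of discarding it.
        self.fast[key] = kv_block
        self.fast.move_to_end(key)
        while len(self.fast) > self.fast_capacity:
            evicted_key, evicted_block = self.fast.popitem(last=False)
            self.slow[evicted_key] = evicted_block

    def get(self, key: str) -> bytes | None:
        # Fast-tier hit: refresh recency. Slow-tier hit: promote back up.
        if key in self.fast:
            self.fast.move_to_end(key)
            return self.fast[key]
        if key in self.slow:
            block = self.slow.pop(key)
            self.put(key, block)
            return block
        return None  # miss: caller must recompute the attention KV block
```

Because evicted blocks are spilled rather than recomputed, the context that can be served from cache grows beyond fast-memory capacity, which is one plausible reading of how a cache-centered design expands the inference context window while holding latency down.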
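To make the first-token-latency claim concrete, here is a minimal sketch of global prefix caching, assuming a simplified flow in which the prefill state is published under the full prompt and a later request reuses the longest cached prefix. `KV_STORE`, `prefill_suffix`, and `prefill_with_prefix_cache` are hypothetical names, not UCM's API.

```python
KV_STORE: dict[tuple[int, ...], object] = {}  # prompt tokens -> opaque KV state

def prefill_suffix(suffix: tuple[int, ...], past_kv: object) -> object:
    # Placeholder for running the model's prefill pass over only the
    # uncached suffix, continuing from the cached state.
    return (past_kv, suffix)

def prefill_with_prefix_cache(tokens: tuple[int, ...]) -> object:
    # Find the longest cached prompt that is a prefix of this request
    # (linear scan for clarity; real systems index hashed token blocks).
    best = max(
        (p for p in KV_STORE if tokens[: len(p)] == p),
        key=len,
        default=(),
    )
    past_kv = KV_STORE.get(best)
    # Only the tokens past the cache hit incur prefill compute, which is
    # why shared prefixes (system prompts, few-shot examples, earlier
    # conversation turns) cut time-to-first-token so sharply.
    kv = prefill_suffix(tokens[len(best):], past_kv)
    KV_STORE[tokens] = kv  # publish this prompt's state for future requests
    return kv
```

In the best case, such as a long shared system prompt followed by a short new user turn, nearly the whole prompt is served from cache; that is the regime where a reduction of up to 90% in first-token latency becomes plausible.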