KV Cache
Hands-On: Enabling KV Cache on EXAScaler
DDN· 2026-01-30 19:12
Your EXAScaler is AI-ready. Join us in this hands-on technical session as we walk through how KV Cache is enabled on EXAScaler and what to expect in real environments. This session is focused on practical enablement and operational considerations, building directly on the KV Cache demo video.
What You’ll Learn
• How KV Cache fits into the EXAScaler architecture
• Environment prerequisites and key considerations
• How to enable and verify KV Cache and interpret performance results
Who Should Attend
• Enginee ...
Qualcomm Launches New AI Inference Chips Targeting a $300 Billion Market; Bosera Sci-Tech Chip ETF (588990) Pulls Back Over 4% Intraday Amid Strong Investor Interest
Sou Hu Cai Jing· 2025-10-31 06:01
Core Viewpoint
- The semiconductor sector is experiencing volatility, influenced by major tech companies' earnings reports and new product launches in the AI chip market [3][4].
Group 1: Market Performance
- As of October 31, 2025, the Shanghai Stock Exchange Sci-Tech Innovation Board Chip Index fell by 3.70%, with mixed performance among constituent stocks [3].
- Notable gainers included Peak Technology (+1.98%), Aiwei Electronics (+1.90%), and Lexin Technology (+1.36%), while Lanqi Technology (-9.30%), Yandong Micro (-7.61%), and Shengmei Shanghai (-7.06%) led the declines [3].
- The Bosera Sci-Tech Chip ETF (588990) decreased by 3.77%, with a latest price of 2.48 yuan, but saw a 4.80% increase over the past week, ranking 2nd among comparable funds [3].
Group 2: Liquidity and Fund Flows
- The Bosera Sci-Tech Chip ETF recorded a turnover of 9.18% during the trading session, with a transaction volume of 63.18 million yuan [3].
- Over the past month, the ETF averaged daily transactions of 133 million yuan [3].
- In the last two weeks, the ETF's scale increased by 31.92 million yuan, ranking 3rd among comparable funds [4].
- However, there was a net outflow of 5.20 million yuan recently, against a total inflow of 53.12 million yuan over the last 16 trading days [4].
Group 3: Industry Insights
- Longjiang Securities anticipates that as AI inference applications materialize, demand for DDR5 and eSSD storage will rise, driven by KV Cache transitioning from HBM to DRAM and SSD [4].
- Hynix projects more than 20% growth in DRAM bit demand by 2026, with NAND Flash demand also expected to increase significantly [4].
- CITIC Securities forecasts that domestic wafer fabs could increase their global market share from 10% to 30%, indicating substantial expansion potential [4].
- The semiconductor equipment sector may experience a new growth cycle as leading storage manufacturers initiate new projects and advanced logic manufacturers ramp up production [4].
Group 4: Index Composition
- The Shanghai Stock Exchange Sci-Tech Innovation Board Chip Index includes companies involved in semiconductor materials, equipment, design, manufacturing, packaging, and testing [5].
- As of September 30, 2025, the top ten weighted stocks in the index accounted for 59.69% of the total index, including companies like Haiguang Information, Lanqi Technology, and SMIC [5].
Squeezing the Most out of GPU Performance: ZTE's Mariana Breaks Through the GPU Memory Wall
QbitAI· 2025-08-26 05:46
Core Insights
- The article discusses the challenges of expanding Key-Value Cache (KV Cache) storage in large language models (LLMs), highlighting the conflict between inference efficiency and memory cost [1]
- It emphasizes the need for innovative solutions to enhance KV Cache storage without compromising performance [1]
Industry Exploration
- Nvidia's Dynamo project implements a multi-level caching algorithm for storage systems, but faces complexities in data migration and latency issues [2]
- Microsoft's LMCache system is compatible with inference frameworks but has limitations in distributed storage support and space capacity [3]
- Alibaba proposed a remote storage solution extending KV Cache to the Tair database, which offers easy scalability but struggles with the low-latency requirements of LLM inference [3]
Emerging Technologies
- CXL (Compute Express Link) is presented as a promising high-speed interconnect technology that could alleviate memory bottlenecks in AI and high-performance computing [5]
- Research on using CXL to accelerate LLM inference is still limited, indicating a significant opportunity for exploration [5]
Mariana Exploration
- ZTE Corporation and East China Normal University introduced a distributed shared KV storage technology named Mariana, designed for high-performance distributed KV indexing [6]
- Mariana's architecture is tailored for GPU and KV Cache storage, achieving 1.7 times higher throughput and 23% lower tail latency compared to existing solutions [6]
Key Innovations of Mariana
- The Multi-Slot lock-based Concurrency Scheme (MSCS) allows fine-grained concurrency control at the entry level, significantly reducing contention and improving throughput [8]
- The Tailored Leaf Node (TLN) design optimizes data layout for faster access, enhancing read speeds by allowing key arrays to be loaded into SIMD registers simultaneously [10]
- An adaptive caching strategy using the Count-Min Sketch algorithm identifies and caches hot data efficiently, improving read performance (a minimal sketch of this idea follows this summary) [11]
Application Validation
- Mariana's architecture supports large-capacity storage by distributing data across remote memory pools, theoretically allowing unlimited storage space [13]
- Experimental results indicate that Mariana significantly improves read/write throughput and latency performance in KV Cache scenarios [14]
Future Prospects
- Mariana's design is compatible with future CXL hardware, allowing seamless migration and utilization of CXL's advantages [18]
- The advancements in Mariana and CXL technology could lead to efficient operation of large models on standard hardware, democratizing AI capabilities across various applications [18]
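To make the adaptive hot-data caching idea above concrete, here is a minimal Python sketch of Count-Min-Sketch-based hot-key detection sitting in front of a remote KV store. This is not Mariana's code: the `CountMinSketch` class, the `lookup` helper, the `fetch_remote` callback, and the `HOT_THRESHOLD` value are all illustrative assumptions.

```python
# Minimal sketch (not Mariana's implementation): a Count-Min Sketch decides
# which KV Cache entries are "hot" enough to keep in a local cache.
import hashlib


class CountMinSketch:
    """Approximate frequency counter with fixed, sub-linear memory."""

    def __init__(self, width: int = 2048, depth: int = 4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, key: str):
        # One hashed bucket per row; the row index is mixed in as a salt.
        for row in range(self.depth):
            digest = hashlib.blake2b(key.encode(), salt=row.to_bytes(8, "little")).digest()
            yield row, int.from_bytes(digest[:8], "little") % self.width

    def add(self, key: str) -> None:
        for row, col in self._buckets(key):
            self.table[row][col] += 1

    def estimate(self, key: str) -> int:
        # Counters only over-estimate; taking the minimum bounds the error.
        return min(self.table[row][col] for row, col in self._buckets(key))


HOT_THRESHOLD = 8          # assumed cut-off for treating a key as "hot"
sketch = CountMinSketch()
local_cache: dict[str, bytes] = {}


def lookup(key: str, fetch_remote) -> bytes:
    """Serve a KV block, caching it locally once it looks hot."""
    sketch.add(key)
    if key in local_cache:
        return local_cache[key]
    value = fetch_remote(key)  # e.g. a read from the remote memory pool
    if sketch.estimate(key) >= HOT_THRESHOLD:
        local_cache[key] = value
    return value
```

Because the sketch only ever over-estimates a key's frequency, any key that clears the threshold is genuinely popular with high probability, and the memory cost stays fixed (width × depth counters) no matter how many distinct keys pass through.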
Huawei's New AI Inference Technology Impresses: China UnionPay's Large-Model Efficiency Improved 125-fold
Core Insights
- Huawei has launched the Unified Cache Manager (UCM), an AI inference innovation technology aimed at optimizing inference speed, efficiency, and cost [1]
- UCM is a KV Cache-centered inference acceleration suite that integrates various caching acceleration algorithms to manage KV Cache memory data during inference, enhancing throughput and reducing per-token inference costs [1][5]
Summary by Sections
UCM Overview
- UCM is designed to address pain points in the inference process, focusing on optimizing user experience and commercial viability [4]
- The technology aims to improve inference speed, with current foreign models achieving 200 tokens/s compared to less than 60 tokens/s in China [4]
Technical Components
- UCM consists of three main components: inference engine plugins, a library of multi-level KV Cache management and acceleration algorithms, and high-performance KV Cache access adapters [5]
- The system can reduce first-token latency by up to 90% through its hierarchical adaptive global prefix caching technology (a minimal prefix-caching sketch follows this summary) [5]
Application in the Financial Sector
- Huawei has partnered with China UnionPay to pilot UCM technology in financial scenarios, achieving a 125-fold increase in inference speed and allowing rapid identification of customer inquiries [5]
- The financial industry is seen as a natural fit for early adoption due to its digital nature and high demands for speed, efficiency, and reliability [5]
Future Developments
- China UnionPay plans to collaborate with Huawei and other partners to build "AI + Finance" demonstration applications, transitioning the technology from experimental validation to large-scale application [6]
Differentiation of UCM
- UCM's advantages include the integration of professional storage capabilities, a comprehensive lifecycle management mechanism for KV Cache, and a diverse algorithm acceleration library [8]
- Unlike existing solutions that focus solely on prefix caching, UCM incorporates a wider range of algorithms and is adaptable to various inference scenarios [8]
Open Source Initiative
- Huawei has announced an open-source plan for UCM, aiming to foster collaboration among framework, storage, and computing vendors to enhance efficiency and reduce costs in the AI industry [9]
- The open-source initiative will officially launch in September, contributing to the mainstream inference engine community [9]
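To illustrate the prefix-caching idea above, here is a minimal Python sketch of block-hashed prefix reuse, the general technique behind global prefix caching in inference engines. It is not UCM's implementation: `BLOCK`, `PrefixKVCache`, and the helper names are illustrative assumptions, and a real system would keep the actual KV tensors on a multi-tier storage backend rather than in a Python dict.

```python
# Minimal sketch (not Huawei UCM's API): longest-prefix reuse of KV Cache
# blocks, hashed block by block. Block size and all names are assumptions.
import hashlib
from typing import Any

BLOCK = 16  # assumed number of tokens covered by each cache block


def block_hashes(tokens: list[int]) -> list[str]:
    """Chained hashes, so each block id also encodes every block before it."""
    hashes, running = [], hashlib.sha256()
    for start in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        running.update(str(tokens[start:start + BLOCK]).encode())
        hashes.append(running.copy().hexdigest())
    return hashes


class PrefixKVCache:
    """Maps prefix-block hashes to already-computed KV blocks (stand-in: Any)."""

    def __init__(self):
        self.store: dict[str, Any] = {}

    def longest_prefix(self, tokens: list[int]) -> tuple[int, list[Any]]:
        """Return (number of prompt tokens covered, cached KV blocks to reuse)."""
        reused = []
        for i, h in enumerate(block_hashes(tokens)):
            if h not in self.store:
                return i * BLOCK, reused
            reused.append(self.store[h])
        return (len(tokens) // BLOCK) * BLOCK, reused

    def insert(self, tokens: list[int], kv_blocks: list[Any]) -> None:
        """Register newly computed KV blocks for later prefix reuse."""
        for h, kv in zip(block_hashes(tokens), kv_blocks):
            self.store.setdefault(h, kv)
```

The first-token latency saving comes from skipping prefill for whatever span `longest_prefix` reports as covered: the engine only needs to compute attention states for the remaining suffix tokens.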