KV Cache
Unlocking Storage's Next Big Opportunity! Korean Media Detail Jensen Huang's "Mysterious Inference Context Memory Platform"
Hua Er Jie Jian Wen · 2026-01-25 05:28
Core Insights
- NVIDIA CEO Jensen Huang introduced the "Inference Context Memory Platform" (ICMS) at CES 2026, aimed at addressing the explosive data-storage demands of the AI inference stage and marking a shift in AI hardware architecture toward efficient context storage [1][2][3]

Group 1: ICMS Platform Overview
- The ICMS platform is designed to tackle the "KV cache" problem in AI inference, as existing GPU memory and server architectures struggle to meet growing data demands [1][3]
- The platform integrates a new Data Processing Unit (DPU) with massive SSDs to create a large cache pool, aiming to overcome physical limits on data storage [1][4]

Group 2: Market Implications
- The introduction of ICMS is expected to benefit major storage manufacturers such as Samsung and SK Hynix, as NAND flash is poised to enter a "golden age" similar to that of HBM [2][5]
- Demand for enterprise-grade SSDs and NAND flash is anticipated to surge due to the high storage-density requirements of ICMS [5][23]

Group 3: Technical Specifications
- The ICMS platform uses the "BlueField-4" DPU to manage a total capacity of 9600TB across 16 SSD racks, far exceeding traditional GPU rack capacities [4][16]
- Each ICMS rack can sustain a KV cache transfer speed of 200GB per second, addressing the network bottlenecks associated with large-capacity SSDs [4][18][19]

Group 4: Future Developments
- NVIDIA is advancing the "Storage Next" initiative, which lets GPUs access NAND flash directly, eliminating data-transfer bottlenecks [5][23]
- SK Hynix is collaborating with NVIDIA on a prototype storage product expected to support 25 million IOPS by the end of the year, with plans to raise performance to 100 million IOPS by 2027 [5][23]
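To see why the KV cache outgrows GPU memory and motivates SSD-backed pools like ICMS, a back-of-the-envelope estimate helps. The sketch below is illustrative only: the layer count, grouped-query-attention head count, head dimension, and fp16 precision are assumed values typical of a large open-weight model, not figures from the article.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: a K and a V vector
    per layer, each num_kv_heads * head_dim elements wide."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Assumed model shape: 80 layers, 8 KV heads (GQA), head_dim 128, fp16.
per_token = kv_cache_bytes_per_token(80, 8, 128)      # 327,680 bytes ≈ 320 KiB
context_len = 131_072                                  # 128k-token context
total_gib = per_token * context_len / 2**30            # 40.0 GiB per sequence

print(f"{per_token} bytes/token, {total_gib:.1f} GiB for one 128k context")
```

At roughly 40 GiB of cache for a single long-context sequence, even a few concurrent users exhaust HBM, which is the gap a DPU-managed SSD cache pool is meant to fill.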
First-Hand Insights from Manus: How to Build Context Engineering for AI Agents
Founder Park · 2025-07-18 18:51
Core Insights
- The article emphasizes the importance of context engineering in building AI agents, highlighting that it allows rapid improvement and adaptation as underlying models advance [3][33]
- Manus has adopted a strategy centered on context engineering, enabling faster iteration and keeping its products aligned with the evolving capabilities of foundation models [3][33]

Group 1: Context Engineering Principles
- KV cache hit rate is identified as the most critical metric for production-grade AI agents, significantly impacting latency and cost [6][7]
- The article outlines several practices for improving KV cache hit rates, including keeping prompt prefixes stable and keeping context additive rather than modifying previous actions or observations [10][11]
- A context-aware state machine for managing tool availability is recommended to prevent inefficient action selection as the action space grows [10][15]

Group 2: Handling Context Limitations
- The article discusses the challenges of context length in AI agents, noting that while modern LLMs support large context windows, practical limitations often arise [17][19]
- Manus treats the file system as the ultimate context, providing unlimited capacity and persistent memory that agents can manipulate directly [19][23]

Group 3: Attention Management and Error Handling
- Manus employs a distinctive attention-management strategy: a todo.md file is created and updated throughout task execution to keep the agent focused on its goals [24][27]
- The article advocates retaining erroneous actions in context so the model can learn from its mistakes, improving adaptability and reducing the likelihood of repeated errors [28][31]

Group 4: Avoiding Few-Shot Pitfalls
- Few-shot prompting can lead to undesirable outcomes in agent systems, as models may over-rely on repetitive patterns from similar action-observation pairs [32]
- Introducing controlled randomness into actions and observations is suggested to break fixed patterns and sharpen model attention [32]

Conclusion
- Context engineering is presented as an emerging discipline essential to AI agent systems, shaping their speed, recovery capabilities, and scalability [33][34]
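The "stable prefix, append-only context" practice from Group 1 can be sketched in a few lines. This is a minimal illustration of the general idea, not Manus's actual implementation; the function and variable names are invented, and deterministic JSON serialization stands in for whatever format a real agent uses.

```python
import json

# Stable prefix: never edited in place, so its serialized bytes (and the
# KV cache computed over them) are identical on every agent step.
SYSTEM_PROMPT = "You are an agent. Use the available tools to finish the task."

def build_context(history: list, new_observation: str) -> str:
    """Append-only context assembly. Earlier turns are never rewritten,
    so each new context is a strict extension of the previous one and
    an inference server with prefix caching can reuse its KV cache."""
    history.append({"role": "tool", "content": new_observation})
    # sort_keys makes serialization deterministic; a timestamp or
    # unstable key order in the prefix would silently break cache hits.
    body = "\n".join(json.dumps(turn, sort_keys=True) for turn in history)
    return SYSTEM_PROMPT + "\n" + body

history = []
ctx1 = build_context(history, "ls: todo.md notes.txt")
ctx2 = build_context(history, "cat todo.md: [ ] step 1")
assert ctx2.startswith(ctx1)  # prefix preserved across steps
```

Because `ctx2` extends `ctx1` byte-for-byte, only the newly appended tokens need fresh prefill; rewriting or summarizing an earlier turn would invalidate the cached prefix from that point onward.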