Decoupled Inference
Jensen Huang: NVIDIA Has Evolved from a GPU Company into an "AI Factory"
Core Insights
- NVIDIA has evolved from a GPU company into an "AI factory," emphasizing decoupled inference technology and AI factory architecture [2][3]
- Demand for AI computing is expected to grow exponentially, with required computation potentially increasing more than ten-thousand-fold within two years, driving the need for robust AI infrastructure [2][3]
- NVIDIA's CEO stresses the importance of defining vision and strategy, focusing on hard problems that play to the company's core strengths [2]

AI Factory Operations
- The AI factory operating system "Dynamo," launched roughly two and a half years ago, is positioned as the operating system of the next industrial revolution, with decoupled inference as its core technology [2]
- NVIDIA plans to integrate Groq chips to optimize workload distribution across GPUs, CPUs, switches, and network processors [2]

Market Analysis
- Physical AI is projected to become a $50 trillion industry, with NVIDIA already generating nearly $10 billion in annual revenue from this fast-growing business [3]
- Digital biology is expected to have its own "ChatGPT moment," transforming the healthcare industry in the coming years [3]

Impact of AI Agents and Open Source Models
- Open-source AI projects such as "OpenClaw" are redefining computing and are seen as the blueprint for future personal AI computers, with agents becoming crucial for achieving work outcomes [4]
- The enterprise software industry is expected to grow a hundredfold as AI agents see widespread use [4]

Autonomous Driving Strategy
- NVIDIA's autonomous driving strategy centers on providing a complete technology stack, including training, simulation, and onboard computing, without manufacturing vehicles [4]

Competitive Advantage
- NVIDIA is confident in its unique position as the only company collaborating with all global AI firms to provide end-to-end solutions deployable across any cloud and edge environment, with increasing market share [4]

Robotics Industry Outlook
- High-functionality robotic products are predicted to become mainstream within 3 to 5 years, with China a key player in the global robotics supply chain [4]

AI and Employment Perspectives
- While some jobs may be replaced by AI, more new jobs are expected to be created; Huang emphasizes becoming proficient with AI while maintaining skills in science, mathematics, and language [5]
GPUs Alone Can't Carry Trillion-Dollar Ambitions: NVIDIA Is "Reformatting" the Data Center
虎嗅APP· 2026-03-18 10:57
Core Viewpoint
- NVIDIA is transitioning from being solely a GPU manufacturer to a leader in AI computing, with significant revenue projections from new chip technologies and platforms [4][5]

Group 1: Revenue Projections and Chip Development
- CEO Jensen Huang predicts that the Blackwell and Rubin chips will generate at least $1 trillion in revenue by the end of 2027, double previous estimates [4]
- The company has built a global ecosystem of over a billion computing systems based on its CUDA architecture, which Huang describes as the "center" of NVIDIA's business [5][6]

Group 2: New Chip Technologies
- The newly launched Vera Rubin platform and seven new chips will support various stages of AI, including pre-training and real-time inference [6][7]
- The Rubin CPU is expected to become a multi-billion-dollar business, offering double the efficiency of traditional CPUs and a 50% speed increase [7]

Group 3: Strategic Partnerships and Market Position
- NVIDIA is collaborating with major cloud service providers and system manufacturers such as Alibaba, Meta, and Dell to deploy its new chips [7]
- The introduction of the Groq 3 LPU aims to address GPU limitations in high-speed token generation, strengthening NVIDIA's competitive edge in the AI market [7][9]

Group 4: OpenClaw and Software Innovations
- Huang emphasizes the OpenClaw strategy, which he views as a new operating system for AI; its download rate surpassed Linux's shortly after launch [10][11]
- The NemoClaw software toolkit aims to provide the infrastructure and security needed for enterprise applications, reinforcing demand for NVIDIA hardware [11]

Group 5: Gaming and Graphics Innovations
- NVIDIA introduced DLSS 5, a significant advance in real-time neural rendering that enables unprecedented realism in gaming graphics [13]
- The company continues to use its gaming products to attract future users, treating the gaming market as a pathway to enterprise solutions [11][12]
Jensen Huang's Token Economics
经济观察报· 2026-03-17 14:23
Core Viewpoint
- The core of Jensen Huang's GTC speech is not the $1 trillion figure itself but a new business logic: data centers are transforming from model-training facilities into token production factories [1][4]

Group 1: Market Predictions and Reactions
- Huang predicts that global demand for AI infrastructure will reach $1 trillion by 2027, with actual demand potentially exceeding that figure [2]
- Following the announcement, NVIDIA's stock price jumped over 4%, while A-share computing stocks fell sharply, with Tianfu Communication dropping more than 10% [2]
- The divergent market reactions stem from the time scale of Huang's predictions, as the next-generation Feynman chip architecture will not arrive until 2028 [3]

Group 2: Token Consumption and Economic Model
- Tokens, the basic units of information processed by large language models, have seen consumption surge after events such as the launch of ChatGPT and the release of Claude Code [6][7]
- Demand for inference services has grown 100-fold in the past year, with inference now accounting for nearly 60% of server shipments in China [8]
- Huang outlines a tiered pricing model for tokens, ranging from free to $150 per million tokens; larger models and faster response times command higher prices [9]

Group 3: Data Center Economics
- Data centers are power-constrained, so tokens produced per watt of electricity will determine profitability [11]
- A single 1GW data center could generate revenues ranging from $30 billion to $300 billion depending on the architecture used, highlighting the revenue multiplication possible with new technologies [11][12]
- Huang argues that companies have not fully utilized their existing data centers, and that upgrading to new equipment could significantly increase revenue under the same power budget [12]

Group 4: Hardware Innovations
- The newly announced Vera Rubin platform is a system rather than a single chip, featuring liquid cooling and a significant increase in inference throughput [17]
- Combining Vera Rubin GPUs with Groq's LPU enables a decoupled inference process optimized for both high throughput and low latency [19]
- Huang projects that token generation rates for the same data center could rise from 22 million to 700 million per second within two years [20]

Group 5: Future Trends and Collaborations
- Huang predicts that companies will budget for token usage much as they budget for computers and software, with engineers receiving annual token budgets [14][15]
- NVIDIA announced autonomous-driving collaborations with companies including Uber and BYD, which lifted automotive-sector stock prices [22]
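The revenue arithmetic behind the token-factory framing can be sketched in a few lines. The tokens-per-second figures (22 million rising to 700 million for the same facility) come from the article; the $5-per-million-tokens price and full utilization below are illustrative assumptions chosen from within the article's free-to-$150 pricing range, not quoted numbers.

```python
# Back-of-envelope token economics for a fixed-power data center.
# Quoted figures: throughput rising from 22M to 700M tokens/sec.
# Assumed figures: $5 per million tokens, 100% utilization.

SECONDS_PER_YEAR = 365 * 24 * 3600

def annual_token_revenue(tokens_per_sec: float,
                         usd_per_million_tokens: float,
                         utilization: float = 1.0) -> float:
    """Annual revenue (USD) for a facility selling its token output."""
    tokens_per_year = tokens_per_sec * SECONDS_PER_YEAR * utilization
    return tokens_per_year / 1e6 * usd_per_million_tokens

# Same facility, before and after the projected architecture upgrade.
before = annual_token_revenue(22e6, 5.0)
after = annual_token_revenue(700e6, 5.0)
print(f"before: ${before / 1e9:.1f}B/yr")
print(f"after:  ${after / 1e9:.1f}B/yr")
print(f"multiplier: {after / before:.1f}x")
```

Under these assumptions the revenue multiplier is simply the throughput ratio (about 32x), which is why the article frames tokens per watt, rather than raw FLOPs, as the profitability metric.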
NVIDIA's "Lobster" Takes the Stage! Jensen Huang's Bold Pronouncements, with Chips for "People, Cars, Homes, and Everything" Targeting Trillion-Dollar Revenue
36氪· 2026-03-17 09:47
Core Insights
- The article emphasizes the transition toward "Agentic AI": AI development is now focused on creating agents that perform tasks autonomously rather than merely providing information [6][11][31]

Group 1: AI Development and Architecture
- NVIDIA has introduced the Vera Rubin architecture, designed specifically for Agentic AI, with a new CPU that is twice as efficient as traditional CPUs and 50% faster [16][17]
- The architecture comprises seven chips and five rack systems, with the Rubin GPU able to handle vast amounts of memory, making it suitable for large language models [19][20]
- NVIDIA's new NVLink technology has doubled bandwidth to 260TB/s, enabling unprecedented interconnectivity among GPUs [20]

Group 2: Performance and Efficiency
- The combination of the Vera Rubin architecture and new software called Dynamo has yielded a 35-fold performance increase on high-end inference tasks, showcasing the potential for significant efficiency gains in AI operations [26][30]
- NVIDIA's cuDF and cuVS libraries handle structured and unstructured data respectively, delivering dramatic speedups and cost reductions for companies such as Nestlé [61][62]

Group 3: Open Source and Ecosystem
- OpenClaw, an agent operating system, is positioned as a transformative tool for businesses, akin to Linux in its impact [28][32]
- NVIDIA is building a comprehensive ecosystem around Agentic AI, collaborating with partners to enhance localized AI capabilities and ensure security through the NeMoClaw architecture [35][39]

Group 4: Market Impact and Future Projections
- NVIDIA predicts its Blackwell and Rubin chips will generate at least $1 trillion in revenue by the end of 2027, driven by growing demand for AI inference [68][71]
- The company is positioning itself as a leader in AI, integrating its algorithms into cloud services and effectively making cloud providers part of its extensive ecosystem [62][67]

Group 5: Industry Applications
- NVIDIA's partnerships with major automotive companies on autonomous driving signal a broad shift toward AI integration across industries, including transportation and manufacturing [86][88]
- Its AI advances extend beyond traditional sectors to innovative applications in entertainment, as seen in the integration of AI into Disney's theme parks [91]
NVIDIA and DeepSeek Both Follow Suit: Ignored 18 Months Ago, Now Dominating AI Inference
36氪· 2025-11-10 04:11
Core Insights
- The article discusses the emergence of "decoupled inference," introduced by teams from Peking University and UCSD, which has rapidly evolved from a laboratory idea into an industry standard adopted by major frameworks such as NVIDIA's and vLLM, signaling a shift toward "modular intelligence" in AI [1]

Group 1: Decoupled Inference Concept
- The DistServe system, launched in March 2024, proposed splitting large-model inference into two stages, "prefill" and "decode," which scale and schedule independently in separate resource pools [1][19]
- This decoupled architecture addresses two fundamental limitations of earlier inference frameworks, interference and coupled scaling, which hurt efficiency and raised costs in production environments [10][15][18]
- By separating prefill and decode, DistServe scales each stage independently to meet its latency requirements, significantly improving overall efficiency [19][22]

Group 2: Adoption and Impact
- The decoupled-inference idea initially met skepticism in the open-source community because of the engineering investment its deep architectural changes required [21]
- By 2025, however, it gained widespread acceptance as businesses recognized how critical latency control is to their core operations, and it became a default solution in major inference stacks [22][23]
- The decoupled architecture enables high resource utilization and flexible resource allocation, especially as model sizes and access traffic grow [22][23]

Group 3: Current State and Future Directions
- Decoupled inference has become a primary design principle in large-model inference frameworks, influencing orchestration layers, inference engines, storage systems, and emerging hardware architectures [23][31]
- Future research is exploring further disaggregation at the model level, such as "Attention-FFN Disaggregation," which places different components of the model on different nodes [33][34]
- The trend is toward a more modular approach in which functional modules evolve, expand, and optimize independently, marking a significant shift from centralized to decoupled architectures [47][48]
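The prefill/decode split described above can be illustrated with a toy latency model: each request runs one compute-heavy prefill pass over its prompt, then many lightweight decode steps, and each stage is served by its own independently sized worker pool. All pool sizes and per-token costs below are invented for the sketch; they are not measurements from DistServe or any production system.

```python
# Toy model of disaggregated (prefill/decode) inference serving.
# Illustrative assumption: work parallelizes perfectly across a pool.
from dataclasses import dataclass

@dataclass
class Pool:
    name: str
    workers: int
    step_cost: float  # seconds of GPU time per token (assumed)

    def time_for(self, tokens: int) -> float:
        # Idealized: token work divides evenly across the pool.
        return tokens * self.step_cost / self.workers

def request_latency(prompt_tokens: int, output_tokens: int,
                    prefill: Pool, decode: Pool) -> float:
    # Prefill processes the whole prompt once (sets time-to-first-token);
    # decode then emits output tokens one at a time.
    ttft = prefill.time_for(prompt_tokens)
    return ttft + decode.time_for(output_tokens)

# Scaling only the prefill pool improves time-to-first-token without
# touching decode capacity -- the core benefit of decoupling the stages.
small_prefill = Pool("prefill", workers=2, step_cost=1e-4)
big_prefill = Pool("prefill", workers=8, step_cost=1e-4)
decode = Pool("decode", workers=4, step_cost=2e-3)

print(request_latency(2048, 256, small_prefill, decode))
print(request_latency(2048, 256, big_prefill, decode))
```

In a coupled design the same workers would serve both stages, so a burst of long prompts would stall ongoing decode streams (the "interference" problem the article names); separate pools remove that contention by construction.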
AI Storage Heats Up Again
半导体行业观察· 2025-10-02 01:18
Core Viewpoint
- The rapid development of AI has made storage a critical component of AI infrastructure alongside computing power. Storage demand is surging as large models and generative AI drive up data volumes and inference workloads. Three storage technologies (HBM, HBF, and GDDR7) are redefining the future landscape of AI infrastructure [1]

Group 1: HBM (High Bandwidth Memory)
- HBM has evolved from a high-performance AI chip component into a strategic point in the storage industry, directly shaping AI chip performance limits. In under three years, HBM capacity has more than doubled and bandwidth has grown roughly 2.5 times [3]
- SK Hynix leads the HBM market, is in final testing for the sixth generation (HBM4), and has announced readiness for mass production; Samsung, by contrast, faces challenges supplying HBM4 to NVIDIA, with a two-month testing delay [3][5]
- A notable trend is HBM customization, driven by cloud giants developing their own AI chips; SK Hynix is shifting toward fully customized HBM in close collaboration with major clients [4]

Group 2: HBF (High Bandwidth Flash)
- HBF aims to overcome the limits of traditional storage by combining NAND flash capacity with HBM-class bandwidth; Sandisk leads HBF development, which is expected to meet the growing storage demands of AI applications [8][9]
- HBF is complementary to HBM, suited to applications that need large block-storage units; it is particularly advantageous where capacity demands are high but bandwidth requirements are relatively relaxed [10][11]

Group 3: GDDR7
- NVIDIA's Rubin CPX GPU uses GDDR7 instead of HBM4, reflecting a new approach to AI inference architecture: the inference process is split into two stages, with GDDR7 used effectively for context building [13]
- Demand for GDDR7 is rising, and Samsung has successfully filled NVIDIA's orders; this flexibility positions Samsung favorably in the graphics DRAM market [14]
- GDDR7's cost-effectiveness may drive broad adoption of AI inference infrastructure, and the resulting proliferation of applications could increase overall market demand for high-end HBM [15]

Group 4: Industry Trends and Future Outlook
- The collaborative evolution of storage technologies is crucial to the AI industry's growth: HBM remains essential for high-end training and inference, while HBF and GDDR7 serve diverse market needs [23]
- Storage innovation will accelerate as AI applications spread across sectors, with tailored solutions for both performance-driven and cost-sensitive users [23]
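The division of labor the article describes (HBM for bandwidth-bound training and top-end inference, HBF for capacity-heavy but bandwidth-relaxed workloads, GDDR7 for cost-sensitive context building) can be caricatured as a simple decision rule. The threshold values below are invented purely for illustration; they are not vendor specifications.

```python
# Caricature of the workload-to-memory mapping described above.
# Thresholds (1.0 TB/s, 512 GB) are illustrative assumptions only.

def pick_memory(capacity_gb_needed: float,
                bandwidth_tbps_needed: float) -> str:
    if bandwidth_tbps_needed > 1.0:
        return "HBM"    # training / high-end inference: bandwidth first
    if capacity_gb_needed > 512:
        return "HBF"    # very large models or caches, relaxed bandwidth
    return "GDDR7"      # cost-sensitive context building (Rubin CPX style)

print(pick_memory(128, 0.5))   # GDDR7
print(pick_memory(2048, 0.5))  # HBF
print(pick_memory(256, 2.0))   # HBM
```

The point of the rule is ordering: bandwidth need dominates the choice, and only once it is relaxed does the capacity-versus-cost trade-off between HBF and GDDR7 come into play.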
HBM Hits a Wall
半导体行业观察· 2025-09-13 02:48
Core Viewpoint
- NVIDIA's Rubin CPX GPU, which opts for GDDR7 memory instead of the traditional HBM, raises questions about HBM's future in AI applications and the threat posed by more cost-effective memory solutions [1][7]

Group 1: Rubin CPX GPU Overview
- The Rubin CPX GPU, unveiled on September 10, 2025, is designed specifically for long-context AI workloads and embodies a new inference acceleration concept called "disaggregated inference" [2]
- It is not a cut-down version of the standard Rubin GPU but is deeply optimized for inference performance, reflecting a shift in focus from training to inference in AI applications [2][4]
- The Rubin CPX GPU is expected to deliver up to 30 PFLOPs of raw compute with 128 GB of GDDR7 memory, versus the standard Rubin GPU's 50 PFLOPs and 288 GB of HBM4 [3]

Group 2: Architectural Differences
- The architectural split between Rubin CPX and the standard Rubin GPU reflects task specialization: Rubin CPX handles context construction while the Rubin GPU manages generation [5][9]
- Overall system performance with Rubin CPX is projected to reach 8 ExaFLOPs NVFP4, significantly surpassing previous models [4]

Group 3: Memory Transition and Implications
- The shift from HBM4 to GDDR7 is driven by the need to reduce cost while maintaining performance, as GDDR7 provides sufficient bandwidth for the Rubin CPX GPU's context-building tasks [9]
- The transition is expected to lower total system cost, making AI infrastructure accessible to a broader range of enterprises [9]
- Demand for GDDR7 is surging, with NVIDIA increasing orders from suppliers such as Samsung, which is expanding production capacity to meet it [10][12]

Group 4: Market Dynamics and Future Outlook
- GDDR7 is seen as a potential threat to HBM, but it also opens new opportunities for memory suppliers, particularly Samsung, which stands to benefit from increased orders [10][12]
- SK Hynix has announced the completion of HBM4 development, indicating that while GDDR7 gains traction, HBM technology continues to evolve and remain relevant in the market [13]