Large Language Model Inference
DeepSeek Releases Next-Generation Technology, with a Peking University Intern Making a Key Contribution
36Kr· 2026-02-27 09:09
DeepSeek has found another way to break through a large-model inference bottleneck. As Zhidx (智东西) reported on February 27, DeepSeek yesterday released a new inference system design called DualPath, aimed squarely at a shortcoming that large language models hit in agent application scenarios: the KV-cache storage I/O bottleneck. By introducing a dual-path loading mechanism, the scheme significantly raises system throughput and essentially eliminates KV-cache I/O overhead. DualPath's core innovation is a new channel running directly from storage to the decode engine. The KV cache is no longer loaded only by the prefill engine; it can instead be loaded into the decode engine and then transferred efficiently to the prefill side over RDMA on the compute network. This design both relieves pressure on the storage side and avoids network congestion, ensuring that latency-sensitive tasks are not disturbed. Working in concert with a global scheduler, DualPath dynamically balances the load on the two sides, further improving resource utilization. On real agent workloads, DualPath raised offline inference throughput by up to 1.87x and online serving throughput by an average of 1.96x. For large-scale scalability, the DualPath system was validated on up to 1,152 GPUs. Offline inference scaled near-linearly from 2P4D (2K agents) to 48P96D (48K agents), with task completion time remaining essentially constant. Notably, as with many of DeepSeek's previously published research papers, this paper's ...
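The load-balancing decision at the heart of the dual-path design can be sketched as a toy scheduler rule. Everything below (the `EngineLoad` fields, the `choose_kv_path` function, and the headroom heuristic) is an illustrative assumption, not DeepSeek's actual implementation:

```python
# Toy sketch of the dual-path KV-cache loading idea: a global scheduler
# decides which engine should read a KV block from storage.
from dataclasses import dataclass

@dataclass
class EngineLoad:
    storage_io_gbps: float  # current storage-read pressure on this engine
    rdma_gbps: float        # current compute-network (RDMA) pressure

def choose_kv_path(prefill: EngineLoad, decode: EngineLoad,
                   storage_capacity_gbps: float = 100.0) -> str:
    """Pick a loading path for a KV cache block.

    Path A (classic): the prefill engine reads KV directly from storage.
    Path B (dual-path): the decode engine reads KV from storage and
    forwards it to the prefill side over RDMA on the compute network.
    """
    prefill_headroom = storage_capacity_gbps - prefill.storage_io_gbps
    decode_headroom = storage_capacity_gbps - decode.storage_io_gbps
    # If the prefill side's storage reads are more saturated and the
    # decode side's RDMA link still has slack, shift the read over.
    if decode_headroom > prefill_headroom and decode.rdma_gbps < 0.8 * storage_capacity_gbps:
        return "decode-then-rdma"
    return "prefill-direct"
```

A real scheduler would fold in latency-sensitivity of the waiting requests, but the core trade (storage bandwidth on one side vs. a cheap RDMA hop) is the same.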
The Next HBM: Can HBF Deliver?
半导体行业观察· 2026-02-20 03:46
Core Viewpoint
- The emergence of High Bandwidth Flash (HBF) aims to address the memory bottleneck in artificial intelligence by stacking NAND flash to provide HBM-level bandwidth while achieving a 16-fold capacity increase. However, the practical application of HBF faces significant challenges that may hinder its initial promise [2][30].

Group 1: Background and Challenges
- The bottleneck for AI workloads is no longer computational performance but memory's ability to feed data fast enough for accelerators like NVIDIA's H100, which offers 989 TFLOPS of compute. HBM3 supplies 819GB/s of bandwidth but has a critical weakness in capacity, with a maximum of 192GB per GPU [5][6].
- The key-value cache (KV cache) of large models like Llama 3.1 405B requires substantial memory: a pre-computed cache needs approximately 540GB for 1 million tokens and 5.4TB for 10 million tokens, far beyond what HBM can hold [6][11].
- HBF's advantages include a capacity of about 3TB at the same 8TB/s bandwidth, with NAND costing approximately one-fifth as much as HBM, suggesting significant economic benefits [6][8].

Group 2: H³ Architecture and Assumptions
- The H³ architecture combines HBM and HBF, acknowledging the limitations of HBF when used alone. It connects HBM directly to the GPU for maximum bandwidth while linking HBF through a daisy chain [8][9].
- The core assumptions of H³ are that most LLM inference data is read-only, that the access pattern is deterministic, and that a 40MB SRAM buffer can effectively hide the latency of HBF [9][10].
- Simulation results indicate that under ideal conditions, H³ can achieve a throughput increase of 1.25x at 1 million tokens and 6.14x at 10 million tokens compared to HBM alone, with a power-efficiency improvement of up to 2.69x [10][11].
Group 3: Limitations of the Assumptions
- The assumption that model weights and shared KV caches are read-only is limited in practical LLM services, where frequent updates and model version control are common [11][12].
- The physical limitations of NAND flash, with access delays significantly higher than DRAM, present a fundamental challenge that cannot be overcome by architectural design alone [13][30].
- The cost structure of HBF is complicated by the need for additional components like SRAM and DRAM, which increases the overall system cost despite the lower price of NAND chips [15][16].

Group 4: Alternative Solutions and Market Dynamics
- HBF is set to undergo sample testing in 2026-2027, while alternative technologies like HBM4 and CXL memory are rapidly maturing, offering different approaches to memory capacity expansion [20][23][24].
- HBM4 is expected to provide a bandwidth of 1.5TB/s and capacities of 32-48GB, potentially diminishing HBF's capacity advantage [23].
- CXL memory allows for scalable memory pooling across multiple servers, offering significant flexibility and resource utilization improvements, with major industry players already beginning production [24][26].

Group 5: Strategic Importance of HBF
- Despite the challenges, HBF represents a strategic shift in the memory industry from commodity supply to platform-based solutions, allowing for greater collaboration with customers and the potential for higher profit margins [28][29].
- The collaboration between SK Hynix and SanDisk on HBF technology is a strategic move to explore the integration of storage technologies and platform solutions beyond single-product success [29].
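The KV-cache capacity figures above follow from simple arithmetic over the model shape. The sketch below uses the standard per-token formula for grouped-query-attention transformers and Llama 3.1 405B's published shape (126 layers, 8 KV heads, head dim 128); the article's ~540GB figure likely assumes slightly different precision or bookkeeping, so treat this as a ballpark check:

```python
# Back-of-envelope KV-cache sizing for a grouped-query-attention model.

def kv_cache_bytes(tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, dtype_bytes: int) -> int:
    # 2x for the separate K and V tensors stored per layer.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * tokens

# Llama 3.1 405B: 126 layers, 8 KV heads (GQA), head_dim 128, fp16 values.
per_million = kv_cache_bytes(1_000_000, n_layers=126, n_kv_heads=8,
                             head_dim=128, dtype_bytes=2)
print(f"{per_million / 1e9:.0f} GB per 1M tokens")  # ~516 GB at fp16
```

At fp16 this lands around 516GB per million tokens, the same order as the article's ~540GB, and ten million tokens pushes into multi-terabyte territory that no per-GPU HBM stack can hold.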
OpenAI's Top Reasoning Researcher Departs After 7 Years Building o3/o1/GPT-4/Codex
量子位· 2026-01-06 04:20
Core Viewpoint
- OpenAI's research vice president Jerry Tworek has announced his departure from the company after nearly seven years, citing a desire to explore research areas that are difficult to pursue at OpenAI [1][21].

Group 1: Jerry Tworek's Background and Contributions
- Jerry Tworek has a strong theoretical background, having obtained a master's degree in mathematics from the University of Warsaw [9].
- Before joining OpenAI in 2019, he spent five years in quantitative research, focusing on trading strategies in the futures market, which led him to study reinforcement learning [12].
- At OpenAI, he was involved in significant projects, including the development of Codex and research on large language models, emphasizing reasoning over mere pattern matching [16][18].

Group 2: Achievements at OpenAI
- Tworek played a key role in the development of GPT-4 and ChatGPT, and he was the lead researcher for the first reasoning model, o1 [18].
- He led a team focused on enhancing the ability of large language models to solve complex STEM problems [16].
- His work contributed to establishing a new paradigm for scaling training and inference computation, known as reasoning models [26].

Group 3: Departure and Future Plans
- Tworek expressed gratitude for his time at OpenAI, highlighting the friendships and technical breakthroughs he experienced [27][28].
- He plans to explore research avenues that were challenging to pursue within OpenAI, indicating a shift in his career focus [28].
HKUST Proposes a New Algorithm That Reshapes the LLM Reasoning Paradigm: Random-Policy Valuation Becomes a "Killer Move" for LLM Math Reasoning
36Kr· 2025-10-31 08:28
Core Insights
- The article discusses the introduction of ROVER (Random Policy Valuation for Diverse Reasoning), a novel approach to enhance reasoning capabilities in large language models (LLMs) through a simplified reinforcement learning framework [1][2][5].

Group 1: ROVER's Methodology
- ROVER simplifies the traditional reinforcement learning process by eliminating the need for policy iteration, relying instead on the value assessment of a completely random policy to identify optimal reasoning paths [1][5][7].
- The algorithm operates in three main steps: estimating Q-values, constructing policies using softmax sampling to maintain diversity, and implementing a training objective that integrates rewards into the LLM parameters without requiring an additional value network [11][12][13].

Group 2: Performance Metrics
- ROVER significantly outperforms existing methods on various mathematical reasoning benchmarks, achieving notable improvements in pass rates, such as a +8.2 increase in pass@1 and a +16.8 increase in pass@256 [5][15].
- The diversity of strategies generated by ROVER is enhanced by 17.6% compared to baseline methods, allowing for a broader exploration of problem-solving paths [17][20].

Group 3: Experimental Results
- In specific tasks like AIME24 and HMMT25, ROVER's pass@1 scores reached 30.6 and 14.6 respectively, marking substantial increases over the best baseline scores [15][16].
- ROVER's ability to discover new solution strategies is illustrated by its performance in generating multiple reasoning paths for complex problems, showcasing its effectiveness in diverse reasoning scenarios [20][22].

Group 4: Implications and Future Directions
- The introduction of ROVER represents a paradigm shift in the approach to structured tasks, emphasizing that simplicity can lead to enhanced performance in AI applications [23].
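The second of ROVER's three steps, sampling from a softmax over Q-values rather than taking the argmax, can be sketched with toy numbers. The temperature value and the Q-values themselves are illustrative assumptions here; in the real method, the Q-estimates of the uniformly random policy come from the LLM:

```python
# Toy sketch of ROVER's "softmax over Q-values" sampling step.
import math

def softmax_policy(q_values: list[float], temperature: float = 1.0) -> list[float]:
    """Turn per-token Q-value estimates into sampling probabilities.

    Sampling from softmax(Q/T) instead of taking argmax(Q) is what keeps
    the generated reasoning paths diverse.
    """
    scaled = [q / temperature for q in q_values]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax_policy([2.0, 1.0, 0.5])
# Higher-Q tokens get more probability, but every token keeps some mass.
```

Raising the temperature flattens the distribution (more exploration); lowering it approaches greedy argmax selection.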
HKUST Proposes a New Algorithm That Reshapes the LLM Reasoning Paradigm: Random-Policy Valuation Becomes a "Killer Move" for LLM Math Reasoning
机器之心· 2025-10-31 04:11
Core Insights
- The article discusses the introduction of ROVER (Random Policy Valuation for Diverse Reasoning), a novel approach that simplifies the reasoning process in large language models (LLMs) by evaluating a completely random policy to find optimal reasoning paths, thus bypassing traditional reinforcement learning (RL) iterations [3][4][11].

Group 1: ROVER's Methodology and Advantages
- ROVER significantly outperforms existing methods on various mathematical reasoning benchmarks, achieving higher quality and diversity in reasoning generation through a minimalist approach [4][9].
- The algorithm eliminates the need for maintaining a value network or a reference model, making it more lightweight compared to traditional RL methods [9][16].
- ROVER's process consists of three simple steps: estimating Q-values, constructing policies using softmax sampling to maintain diversity, and implementing a training objective that reduces computational load and enhances stability [19][21][24].

Group 2: Performance Metrics
- In high-difficulty tasks such as AIME24, AIME25, and HMMT25, ROVER improved pass@1 by +8.2 and pass@256 by +16.8, showcasing its superior performance [9][26].
- ROVER achieved a pass@1 score of 30.6 on AIME24, surpassing the best baseline (DAPO) by 19.1 points, and a pass@1 score of 14.6 on HMMT25, representing a 106% increase over the highest baseline [26][27].
- The diversity of strategies generated by ROVER is enhanced by 17.6% compared to baselines, allowing it to cover more problem-solving paths [29][31].

Group 3: Implications and Future Directions
- The introduction of ROVER reflects a methodological shift, emphasizing that simplification rather than complexity can drive performance improvements in structured tasks [38].
A Deep, Hardcore Teardown: Unpacking How the vLLM Inference System Achieves High Throughput
机器之心· 2025-10-26 04:03
Core Insights
- The article discusses the rapid development of large model applications and the focus on making inference faster and more efficient, highlighting the emergence of vLLM as a high-performance inference framework specifically optimized for large language models [1][4].

Inference Engine Basics
- The vLLM framework includes fundamental processes such as input/output request handling, scheduling, paged attention, and continuous batching [4].
- Advanced features of vLLM include chunked prefill, prefix caching, guided decoding, speculative decoding, and decoupled prefill/decoding [4].

Performance Measurement
- The performance of the inference system is measured through latency metrics (time to first token, per-iteration latency, and end-to-end latency) and throughput, along with GPU roofline models [4].

Architecture and Components
- The LLM engine is the core module of vLLM, capable of high-throughput inference in offline scenarios [8].
- Key components of the engine include the engine core, processor, output processor, model executor, and scheduler, each playing a critical role in the inference process [15][16].

Scheduling Mechanism
- The scheduling mechanism prioritizes decode requests over prefill requests, allowing for more efficient processing of inference tasks [38][39].
- The vLLM V1 scheduler can intelligently mix prefill and decode requests within the same step, enhancing overall efficiency [39].

Advanced Features
- Chunked prefill processes long prompts by breaking them into smaller chunks, preventing any one request from monopolizing resources [57].
- Prefix caching avoids redundant computation for tokens shared across multiple prompts, significantly speeding up prefill requests [69][73].

Guided and Speculative Decoding
- Guided decoding uses a finite state machine to constrain logits according to grammar rules, ensuring only syntactically valid tokens are sampled [93][95].
- Speculative decoding introduces a draft model to quickly generate candidate tokens, reducing the number of expensive forward passes needed in autoregressive generation [106][110].

Distributed System Deployment
- vLLM can be deployed across multiple GPUs and nodes, using tensor and pipeline parallelism to serve models that exceed single-GPU memory limits [146][150].
- The architecture also supports data parallelism and load balancing, ensuring efficient handling of incoming requests [130][156].
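The paged-attention bookkeeping mentioned above can be illustrated with a toy allocator: KV memory is carved into fixed-size blocks, and each request keeps a block table mapping its logical token positions to physical blocks. Class and variable names below are assumptions for illustration, not vLLM's actual API:

```python
# Toy paged KV-cache allocator in the spirit of paged attention.

BLOCK_SIZE = 16  # tokens stored per KV block

class PagedKVAllocator:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))        # pool of free physical block ids
        self.block_tables: dict[str, list[int]] = {}
        self.lengths: dict[str, int] = {}          # tokens stored per request

    def append_token(self, request_id: str) -> None:
        """Reserve KV space for one more token of a request.

        Raises IndexError if the pool is exhausted; a real scheduler would
        preempt or queue the request instead.
        """
        n = self.lengths.get(request_id, 0)
        if n % BLOCK_SIZE == 0:  # current block full (or first token): grab a new one
            self.block_tables.setdefault(request_id, []).append(self.free.pop())
        self.lengths[request_id] = n + 1

    def release(self, request_id: str) -> None:
        """Return a finished request's blocks to the free pool."""
        self.free.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)
```

Because blocks are fixed-size and non-contiguous, memory fragments far less than with per-request contiguous buffers, which is what makes continuous batching of many variable-length requests practical.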
Goodbye, Wasted Compute: New Test-Time Scaling Framework Rescues 19% of Buried Answers, Reasoning Accuracy Soars
机器之心· 2025-09-02 06:32
Core Insights
- The article discusses the development of the Stepwise Reasoning Checkpoint Analysis (SRCA) framework, which enhances the reasoning capabilities of large language models (LLMs) through improved test-time scaling methods [2][3][25].

Group 1: SRCA Framework
- The SRCA framework addresses two main issues in existing test-time scaling methods: path homogeneity and underutilization of intermediate results [2][6].
- SRCA integrates two core strategies: Answer-Clustered Search (ACS) to maintain path diversity and Checkpoint Candidate Augmentation (CCA) to utilize all intermediate answers for final decision-making [2][10][19].

Group 2: Methodology
- Checkpoint Injection is a foundational technique in SRCA, which forces the model to pause after each reasoning step and output an intermediate answer [10][12].
- ACS prevents path homogeneity by grouping similar checkpoint answers and ensuring that diverse reasoning paths are explored [14][17].
- CCA enhances the model's accuracy by salvaging intermediate answers that would otherwise be discarded during the reasoning process, thus improving resource utilization [19][20].

Group 3: Experimental Results
- The SRCA framework enabled a 1B-parameter model to achieve 65.2% accuracy on the MATH500 dataset, surpassing a 70B model's accuracy of 65.0% [25].
- SRCA requires only 16 samples to achieve the accuracy of other TTS methods that need 128 samples, an 8-fold increase in reasoning efficiency [25].
- CCA successfully rescued 19.07% of correct answers from intermediate steps that were previously discarded due to subsequent path deviations [25].
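The CCA idea of using every intermediate answer can be sketched as a weighted vote over all checkpoint answers across all paths. The weighting scheme below (later checkpoints count for more) is an illustrative assumption, not SRCA's exact scoring rule:

```python
# Toy sketch of final-answer selection that uses ALL checkpoint answers,
# not just each path's final answer.
from collections import defaultdict

def vote_with_checkpoints(paths: list[list[str]]) -> str:
    """paths[i] is the sequence of checkpoint answers of reasoning path i."""
    scores: dict[str, float] = defaultdict(float)
    for checkpoints in paths:
        for step, answer in enumerate(checkpoints, start=1):
            scores[answer] += step / len(checkpoints)  # later steps weigh more
    return max(scores, key=scores.__getitem__)

# Three paths; the last one deviates at the end, but its early checkpoints
# still contribute evidence for "42" instead of being thrown away.
paths = [["42", "42", "42"], ["41", "42", "42"], ["42", "42", "7"]]
```

This is exactly the failure mode the 19.07% rescue statistic describes: a path that was correct mid-way but derailed later still casts votes for its correct intermediate answers.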
How Do Large Models Reason? A Key Stanford CS25 Lecture by a DeepMind Principal Scientist
机器之心· 2025-08-16 05:02
Core Insights
- The article discusses the insights shared by Denny Zhou, a leading figure in AI, regarding the reasoning capabilities of large language models (LLMs) and their optimization methods [3][4].

Group 1: Key Points on LLM Reasoning
- Denny Zhou emphasizes that reasoning in LLMs involves generating a series of intermediate tokens before arriving at a final answer, which strengthens the model without increasing its size [6][15].
- The challenge lies in the fact that reasoning-based outputs often do not appear at the top of the output distribution, making standard greedy decoding ineffective [6].
- Techniques such as chain-of-thought prompting and reinforcement learning fine-tuning have emerged as powerful methods to enhance LLM reasoning capabilities [6][29].

Group 2: Theoretical Framework
- Zhou notes that any problem solvable by Boolean circuits can be addressed by a constant-sized transformer model that generates intermediate tokens, providing a theoretical grounding for reasoning [16].
- Intermediate tokens matter because they allow models to solve complex problems without requiring deep architectures [16].

Group 3: Decoding Techniques
- The lecture introduces chain-of-thought decoding, which checks multiple generated candidates rather than relying on the single most likely answer [22][27].
- This method requires some engineering effort but can significantly improve reasoning outcomes, as can guiding the model with natural-language prompts [27].

Group 4: Self-Improvement and Data Generation
- The self-improvement approach allows models to generate their own training data, reducing reliance on human-annotated datasets [39].
- Rejection sampling is introduced: the model generates many solutions and retains those whose steps lead to the correct answer [40].

Group 5: Reinforcement Learning and Fine-Tuning
- Reinforcement learning fine-tuning (RL fine-tuning) has gained attention for its ability to improve model generalization, although not all tasks can be validated by machines [42][57].
- Reliable validators are essential in RL fine-tuning, and the quality of machine-generated training data can sometimes surpass that of human-generated data [45][37].

Group 6: Future Directions
- Zhou anticipates breakthroughs on tasks that extend beyond unique, verifiable answers, suggesting a shift toward building practical applications rather than solely addressing academic benchmarks [66].
- The lecture concludes with a reminder that simplicity in research can lead to clearer insights, echoing Richard Feynman's philosophy [68].
Single-Stage LLM Fine-Tuning with Simultaneous Supervision and Reinforcement: No More "Memorize First, Drill Later", with Gains in Both Reasoning and Generalization | CAS, Meituan, et al.
量子位· 2025-07-02 02:02
Core Viewpoint
- The article introduces the Supervised Reinforcement Fine-Tuning (SRFT) method, which combines supervised fine-tuning (SFT) and reinforcement learning (RL) in a single-stage approach to enhance the reasoning performance of large language models (LLMs) [1][22].

Group 1: Methodology
- SRFT employs a dual-strategy design to effectively utilize demonstration data, using SFT for coarse-grained behavior-policy approximation and RL for fine-grained policy refinement [23][24].
- The method introduces an entropy-aware adaptive weighting mechanism to balance the influence of SFT and RL, ensuring stable training dynamics [29][44].
- SRFT achieves a significant improvement in training efficiency, speeding up the process by 2.28 times compared to traditional sequential methods [21][44].

Group 2: Performance Results
- SRFT demonstrates an average accuracy of 59.1% across five mathematical reasoning tasks, outperforming the zero-RL baseline by 9.0% [4][47].
- On out-of-distribution tasks, SRFT achieves an average accuracy of 62.5%, surpassing the best baseline by 10.9% [4][47].
- The method shows superior generalization, with consistent performance improvements across various benchmarks [47][48].

Group 3: Training Dynamics
- SRFT's training dynamics reveal a more stable and efficient learning process, with a gradual increase in response length indicating deeper reasoning [48].
- SRFT maintains more stable entropy during training, allowing continued exploration, unlike pure RL, which exhibits rapid entropy decline [20][48].
- Analysis of training trajectories indicates that SRFT effectively balances knowledge acquisition and self-exploration without excessive deviation from the initial model [15][45].
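The entropy-aware adaptive weighting can be sketched as a single combined loss whose SFT/RL mix follows the policy's entropy: an uncertain (high-entropy) policy leans on demonstrations, while a confident (low-entropy) policy leans on its own RL signal. The specific weight w(H) = H / H_max below is an illustrative assumption; the paper's actual schedule may differ:

```python
# Toy sketch of an entropy-aware blend of an SFT loss and an RL loss.
import math

def policy_entropy(probs: list[float]) -> float:
    return -sum(p * math.log(p) for p in probs if p > 0)

def combined_loss(sft_loss: float, rl_loss: float,
                  probs: list[float]) -> float:
    """High entropy (uncertain policy) -> weight the demonstration (SFT)
    term; low entropy (confident policy) -> weight the exploration (RL)
    term. Both signals act in the same training step, not sequentially."""
    h_max = math.log(len(probs))       # entropy of the uniform policy
    w = policy_entropy(probs) / h_max  # normalized to [0, 1]
    return w * sft_loss + (1.0 - w) * rl_loss

uniform = [0.25] * 4                # maximally uncertain -> pure SFT weight
peaked = [0.97, 0.01, 0.01, 0.01]   # confident -> mostly RL weight
```

This kind of smooth hand-off is one way to get the stable entropy curve the article attributes to SRFT, in contrast to pure RL's rapid entropy collapse.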
Red Hat Announces the llm-d Community, with NVIDIA and Google Cloud as Founding Contributors
Xin Lang Ke Ji· 2025-05-27 03:42
Group 1
- Red Hat has launched a new open-source project called llm-d to meet the large-scale inference demands of generative AI, collaborating with CoreWeave, Google Cloud, IBM Research, and NVIDIA [1][3].
- According to Gartner, by 2028 over 80% of data center workload accelerators will be deployed specifically for inference rather than training, indicating a shift in resource allocation [3].
- The llm-d project aims to integrate advanced inference capabilities into existing enterprise IT infrastructure, addressing the challenges posed by increasing resource demands and potential bottlenecks in AI innovation [3].

Group 2
- The llm-d platform allows IT teams to meet various service demands for critical business workloads while maximizing efficiency and significantly reducing the total cost of ownership associated with high-performance AI accelerators [3].
- The project has garnered support from a coalition of generative AI model providers, AI accelerator pioneers, and major AI cloud platforms, indicating deep industry collaboration to build large-scale LLM services [3].
- Key contributors to the llm-d project include CoreWeave, Google Cloud, IBM Research, and NVIDIA, with partners such as AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI [3][4].

Group 3
- Google Cloud, a founding contributor to the llm-d project, emphasizes the importance of efficient AI inference in deploying AI at scale and creating value for users [4].
- NVIDIA views the llm-d project as a significant addition to the open-source AI ecosystem, supporting scalable, high-performance inference as a key to the next wave of generative and agent-based AI [4].
- NVIDIA is collaborating with Red Hat and other partners to promote community engagement and industry adoption of the llm-d initiative, leveraging innovations like NIXL to accelerate its development [4].