Large Language Model Inference
DeepSeek Releases Next-Generation Technology, with a Peking University Intern Making a Key Contribution
36Kr· 2026-02-27 09:09
DeepSeek has found another way to break through a large-model inference bottleneck. As Zhidx (智东西) reported on February 27, DeepSeek yesterday released a new inference system design called DualPath, aimed squarely at a shortcoming that large language models hit in agent application scenarios: the KV-cache storage I/O bottleneck. By introducing a dual-path loading mechanism, the scheme significantly raises system throughput and essentially eliminates KV-cache I/O overhead. DualPath's core innovation is a new channel running directly from storage to the decode engine. The KV cache is no longer loaded only by the prefill engine; it can instead be loaded into the decode engine and then transferred efficiently to the prefill side over RDMA on the compute network. This design both relieves pressure on the storage side and avoids network congestion, ensuring that latency-sensitive tasks are not disturbed. Working in concert with a global scheduler, DualPath dynamically balances the load on the two sides, further improving resource utilization. On real agent workloads, DualPath raised offline inference throughput by up to 1.87x and online serving throughput by an average of 1.96x. For large-scale scalability, the DualPath system was validated on up to 1,152 GPUs. Offline inference scaled near-linearly from 2P4D (2K agents) to 48P96D (48K agents), with task completion time remaining essentially constant. Notably, as with many of DeepSeek's previously published research papers, this paper's ...
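The load-balancing decision at the heart of the dual-path design can be sketched as a toy scheduler rule. Everything below (the `EngineLoad` fields, the `choose_kv_path` function, and the headroom heuristic) is an illustrative assumption, not DeepSeek's actual implementation:

```python
# Toy sketch of the dual-path KV-cache loading idea: a global scheduler
# decides which engine should read a KV block from storage.
from dataclasses import dataclass

@dataclass
class EngineLoad:
    storage_io_gbps: float  # current storage-read pressure on this engine
    rdma_gbps: float        # current compute-network (RDMA) pressure

def choose_kv_path(prefill: EngineLoad, decode: EngineLoad,
                   storage_capacity_gbps: float = 100.0) -> str:
    """Pick a loading path for a KV cache block.

    Path A (classic): the prefill engine reads KV directly from storage.
    Path B (dual-path): the decode engine reads KV from storage and
    forwards it to the prefill side over RDMA on the compute network.
    """
    prefill_headroom = storage_capacity_gbps - prefill.storage_io_gbps
    decode_headroom = storage_capacity_gbps - decode.storage_io_gbps
    # If the prefill side's storage reads are more saturated and the
    # decode side's RDMA link still has slack, shift the read over.
    if decode_headroom > prefill_headroom and decode.rdma_gbps < 0.8 * storage_capacity_gbps:
        return "decode-then-rdma"
    return "prefill-direct"
```

A real scheduler would fold in latency-sensitivity of the waiting requests, but the core trade (storage bandwidth on one side vs. a cheap RDMA hop) is the same.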
The Next HBM: Can HBF Deliver?
半导体行业观察· 2026-02-20 03:46
Core Viewpoint
- The emergence of High Bandwidth Flash (HBF) aims to address the memory bottleneck in artificial intelligence by stacking NAND flash to provide HBM-level bandwidth while achieving a 16-fold capacity increase. However, the practical application of HBF faces significant challenges that may hinder its initial promise [2][30].

Group 1: Background and Challenges
- The bottleneck for AI workloads is no longer computational performance but memory's ability to feed data fast enough for accelerators like NVIDIA's H100, which offers 989 TFLOPS of compute. HBM3 supplies 819GB/s of bandwidth but has a critical weakness in capacity, with a maximum of 192GB per GPU [5][6].
- The key-value cache (KV cache) of large models like Llama 3.1 405B requires substantial memory: a pre-computed cache needs approximately 540GB for 1 million tokens and 5.4TB for 10 million tokens, far beyond what HBM can hold [6][11].
- HBF's advantages include a capacity of about 3TB at the same 8TB/s bandwidth, with NAND costing approximately one-fifth as much as HBM, suggesting significant economic benefits [6][8].

Group 2: H³ Architecture and Assumptions
- The H³ architecture combines HBM and HBF, acknowledging the limitations of HBF when used alone. It connects HBM directly to the GPU for maximum bandwidth while linking HBF through a daisy chain [8][9].
- The core assumptions of H³ are that most LLM inference data is read-only, that the access pattern is deterministic, and that a 40MB SRAM buffer can effectively hide the latency of HBF [9][10].
- Simulation results indicate that under ideal conditions, H³ can achieve a throughput increase of 1.25x at 1 million tokens and 6.14x at 10 million tokens compared to HBM alone, with a power-efficiency improvement of up to 2.69x [10][11].
Group 3: Limitations of the Assumptions
- The assumption that model weights and shared KV caches are read-only is limited in practical LLM services, where frequent updates and model version control are common [11][12].
- The physical limitations of NAND flash, with access delays significantly higher than DRAM, present a fundamental challenge that cannot be overcome by architectural design alone [13][30].
- The cost structure of HBF is complicated by the need for additional components like SRAM and DRAM, which increases the overall system cost despite the lower price of NAND chips [15][16].

Group 4: Alternative Solutions and Market Dynamics
- HBF is set to undergo sample testing in 2026-2027, while alternative technologies like HBM4 and CXL memory are rapidly maturing, offering different approaches to memory capacity expansion [20][23][24].
- HBM4 is expected to provide a bandwidth of 1.5TB/s and capacities of 32-48GB, potentially diminishing HBF's capacity advantage [23].
- CXL memory allows for scalable memory pooling across multiple servers, offering significant flexibility and resource utilization improvements, with major industry players already beginning production [24][26].

Group 5: Strategic Importance of HBF
- Despite the challenges, HBF represents a strategic shift in the memory industry from commodity supply to platform-based solutions, allowing for greater collaboration with customers and the potential for higher profit margins [28][29].
- The collaboration between SK Hynix and SanDisk on HBF technology is a strategic move to explore the integration of storage technologies and platform solutions beyond single-product success [29].
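The KV-cache capacity figures above follow from simple arithmetic over the model shape. The sketch below uses the standard per-token formula for grouped-query-attention transformers and Llama 3.1 405B's published shape (126 layers, 8 KV heads, head dim 128); the article's ~540GB figure likely assumes slightly different precision or bookkeeping, so treat this as a ballpark check:

```python
# Back-of-envelope KV-cache sizing for a grouped-query-attention model.

def kv_cache_bytes(tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, dtype_bytes: int) -> int:
    # 2x for the separate K and V tensors stored per layer.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * tokens

# Llama 3.1 405B: 126 layers, 8 KV heads (GQA), head_dim 128, fp16 values.
per_million = kv_cache_bytes(1_000_000, n_layers=126, n_kv_heads=8,
                             head_dim=128, dtype_bytes=2)
print(f"{per_million / 1e9:.0f} GB per 1M tokens")  # ~516 GB at fp16
```

At fp16 this lands around 516GB per million tokens, the same order as the article's ~540GB, and ten million tokens pushes into multi-terabyte territory that no per-GPU HBM stack can hold.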
OpenAI's Top Reasoning Researcher Departs After 7 Years Building o3/o1/GPT-4/Codex
量子位· 2026-01-06 04:20
Core Viewpoint
- OpenAI's research vice president Jerry Tworek has announced his departure from the company after nearly seven years, citing a desire to explore research areas that are difficult to pursue at OpenAI [1][21].

Group 1: Jerry Tworek's Background and Contributions
- Jerry Tworek has a strong theoretical background, having obtained a master's degree in mathematics from the University of Warsaw [9].
- Before joining OpenAI in 2019, he spent five years in quantitative research, focusing on trading strategies in the futures market, which led him to study reinforcement learning [12].
- At OpenAI, he was involved in significant projects, including the development of Codex and research on large language models, emphasizing reasoning over mere pattern matching [16][18].

Group 2: Achievements at OpenAI
- Tworek played a key role in the development of GPT-4 and ChatGPT, and he was the lead researcher for the first reasoning model, o1 [18].
- He led a team focused on enhancing the ability of large language models to solve complex STEM problems [16].
- His work contributed to establishing a new paradigm for scaling training and inference computation, known as reasoning models [26].

Group 3: Departure and Future Plans
- Tworek expressed gratitude for his time at OpenAI, highlighting the friendships and technical breakthroughs he experienced [27][28].
- He plans to explore research avenues that were challenging to pursue within OpenAI, indicating a shift in his career focus [28].
HKUST Proposes a New Algorithm That Reshapes the LLM Reasoning Paradigm: Random-Policy Valuation Becomes a "Killer Move" for LLM Math Reasoning
36Kr· 2025-10-31 08:28
Core Insights
- The article discusses the introduction of ROVER (Random Policy Valuation for Diverse Reasoning), a novel approach to enhance reasoning capabilities in large language models (LLMs) through a simplified reinforcement learning framework [1][2][5].

Group 1: ROVER's Methodology
- ROVER simplifies the traditional reinforcement learning process by eliminating the need for policy iteration, relying instead on the value assessment of a completely random policy to identify optimal reasoning paths [1][5][7].
- The algorithm operates in three main steps: estimating Q-values, constructing policies using softmax sampling to maintain diversity, and implementing a training objective that integrates rewards into the LLM parameters without requiring an additional value network [11][12][13].

Group 2: Performance Metrics
- ROVER significantly outperforms existing methods on various mathematical reasoning benchmarks, achieving notable improvements in pass rates, such as a +8.2 increase in pass@1 and a +16.8 increase in pass@256 [5][15].
- The diversity of strategies generated by ROVER is enhanced by 17.6% compared to baseline methods, allowing for a broader exploration of problem-solving paths [17][20].

Group 3: Experimental Results
- In specific tasks like AIME24 and HMMT25, ROVER's pass@1 scores reached 30.6 and 14.6 respectively, marking substantial increases over the best baseline scores [15][16].
- ROVER's ability to discover new solution strategies is illustrated by its performance in generating multiple reasoning paths for complex problems, showcasing its effectiveness in diverse reasoning scenarios [20][22].

Group 4: Implications and Future Directions
- The introduction of ROVER represents a paradigm shift in the approach to structured tasks, emphasizing that simplicity can lead to enhanced performance in AI applications [23].
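The second of ROVER's three steps, sampling from a softmax over Q-values rather than taking the argmax, can be sketched with toy numbers. The temperature value and the Q-values themselves are illustrative assumptions here; in the real method, the Q-estimates of the uniformly random policy come from the LLM:

```python
# Toy sketch of ROVER's "softmax over Q-values" sampling step.
import math

def softmax_policy(q_values: list[float], temperature: float = 1.0) -> list[float]:
    """Turn per-token Q-value estimates into sampling probabilities.

    Sampling from softmax(Q/T) instead of taking argmax(Q) is what keeps
    the generated reasoning paths diverse.
    """
    scaled = [q / temperature for q in q_values]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax_policy([2.0, 1.0, 0.5])
# Higher-Q tokens get more probability, but every token keeps some mass.
```

Raising the temperature flattens the distribution (more exploration); lowering it approaches greedy argmax selection.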
HKUST Proposes a New Algorithm That Reshapes the LLM Reasoning Paradigm: Random-Policy Valuation Becomes a "Killer Move" for LLM Math Reasoning
机器之心· 2025-10-31 04:11
Core Insights
- The article discusses the introduction of ROVER (Random Policy Valuation for Diverse Reasoning), a novel approach that simplifies the reasoning process in large language models (LLMs) by evaluating a completely random policy to find optimal reasoning paths, thus bypassing traditional reinforcement learning (RL) iterations [3][4][11].

Group 1: ROVER's Methodology and Advantages
- ROVER significantly outperforms existing methods on various mathematical reasoning benchmarks, achieving higher quality and diversity in reasoning generation through a minimalist approach [4][9].
- The algorithm eliminates the need for maintaining a value network or a reference model, making it more lightweight compared to traditional RL methods [9][16].
- ROVER's process consists of three simple steps: estimating Q-values, constructing policies using softmax sampling to maintain diversity, and implementing a training objective that reduces computational load and enhances stability [19][21][24].

Group 2: Performance Metrics
- In high-difficulty tasks such as AIME24, AIME25, and HMMT25, ROVER improved pass@1 by +8.2 and pass@256 by +16.8, showcasing its superior performance [9][26].
- ROVER achieved a pass@1 score of 30.6 on AIME24, surpassing the best baseline (DAPO) by 19.1 points, and a pass@1 score of 14.6 on HMMT25, representing a 106% increase over the highest baseline [26][27].
- The diversity of strategies generated by ROVER is enhanced by 17.6% compared to baselines, allowing it to cover more problem-solving paths [29][31].

Group 3: Implications and Future Directions
- The introduction of ROVER reflects a methodological shift, emphasizing that simplification rather than complexity can drive performance improvements in structured tasks [38].
A Deep, Hardcore Teardown: Unpacking How the vLLM Inference System Achieves High Throughput
机器之心· 2025-10-26 04:03
Core Insights
- The article discusses the rapid development of large model applications and the focus on making inference faster and more efficient, highlighting the emergence of vLLM as a high-performance inference framework specifically optimized for large language models [1][4].

Inference Engine Basics
- The vLLM framework includes fundamental processes such as input/output request handling, scheduling, paged attention, and continuous batching [4].
- Advanced features of vLLM include chunked prefill, prefix caching, guided decoding, speculative decoding, and decoupled prefill/decoding [4].

Performance Measurement
- The performance of the inference system is measured through latency metrics (time to first token, per-iteration latency, and end-to-end latency) and throughput, along with GPU roofline models [4].

Architecture and Components
- The LLM engine is the core module of vLLM, capable of high-throughput inference in offline scenarios [8].
- Key components of the engine include the engine core, processor, output processor, model executor, and scheduler, each playing a critical role in the inference process [15][16].

Scheduling Mechanism
- The scheduling mechanism prioritizes decode requests over prefill requests, allowing for more efficient processing of inference tasks [38][39].
- The vLLM V1 scheduler can intelligently mix prefill and decode requests within the same step, enhancing overall efficiency [39].

Advanced Features
- Chunked prefill processes long prompts by breaking them into smaller chunks, preventing any one request from monopolizing resources [57].
- Prefix caching avoids redundant computation for tokens shared across multiple prompts, significantly speeding up prefill requests [69][73].

Guided and Speculative Decoding
- Guided decoding uses a finite state machine to constrain logits according to grammar rules, ensuring only syntactically valid tokens are sampled [93][95].
- Speculative decoding introduces a draft model to quickly generate candidate tokens, reducing the number of expensive forward passes needed in autoregressive generation [106][110].

Distributed System Deployment
- vLLM can be deployed across multiple GPUs and nodes, using tensor and pipeline parallelism to serve models that exceed single-GPU memory limits [146][150].
- The architecture also supports data parallelism and load balancing, ensuring efficient handling of incoming requests [130][156].
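The paged-attention bookkeeping mentioned above can be illustrated with a toy allocator: KV memory is carved into fixed-size blocks, and each request keeps a block table mapping its logical token positions to physical blocks. Class and variable names below are assumptions for illustration, not vLLM's actual API:

```python
# Toy paged KV-cache allocator in the spirit of paged attention.

BLOCK_SIZE = 16  # tokens stored per KV block

class PagedKVAllocator:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))        # pool of free physical block ids
        self.block_tables: dict[str, list[int]] = {}
        self.lengths: dict[str, int] = {}          # tokens stored per request

    def append_token(self, request_id: str) -> None:
        """Reserve KV space for one more token of a request.

        Raises IndexError if the pool is exhausted; a real scheduler would
        preempt or queue the request instead.
        """
        n = self.lengths.get(request_id, 0)
        if n % BLOCK_SIZE == 0:  # current block full (or first token): grab a new one
            self.block_tables.setdefault(request_id, []).append(self.free.pop())
        self.lengths[request_id] = n + 1

    def release(self, request_id: str) -> None:
        """Return a finished request's blocks to the free pool."""
        self.free.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)
```

Because blocks are fixed-size and non-contiguous, memory fragments far less than with per-request contiguous buffers, which is what makes continuous batching of many variable-length requests practical.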
Goodbye, Wasted Compute: New Test-Time Scaling Framework Rescues 19% of Buried Answers, Reasoning Accuracy Soars
机器之心· 2025-09-02 06:32
Core Insights
- The article discusses the development of the Stepwise Reasoning Checkpoint Analysis (SRCA) framework, which enhances the reasoning capabilities of large language models (LLMs) through improved test-time scaling methods [2][3][25].

Group 1: SRCA Framework
- The SRCA framework addresses two main issues in existing test-time scaling methods: path homogeneity and underutilization of intermediate results [2][6].
- SRCA integrates two core strategies: Answer-Clustered Search (ACS) to maintain path diversity and Checkpoint Candidate Augmentation (CCA) to utilize all intermediate answers for final decision-making [2][10][19].

Group 2: Methodology
- Checkpoint Injection is a foundational technique in SRCA, which forces the model to pause after each reasoning step and output an intermediate answer [10][12].
- ACS prevents path homogeneity by grouping similar checkpoint answers and ensuring that diverse reasoning paths are explored [14][17].
- CCA enhances the model's accuracy by salvaging intermediate answers that would otherwise be discarded during the reasoning process, thus improving resource utilization [19][20].

Group 3: Experimental Results
- The SRCA framework enabled a 1B-parameter model to achieve 65.2% accuracy on the MATH500 dataset, surpassing a 70B model's accuracy of 65.0% [25].
- SRCA requires only 16 samples to achieve the accuracy of other TTS methods that need 128 samples, an 8-fold increase in reasoning efficiency [25].
- CCA successfully rescued 19.07% of correct answers from intermediate steps that were previously discarded due to subsequent path deviations [25].
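The CCA idea of using every intermediate answer can be sketched as a weighted vote over all checkpoint answers across all paths. The weighting scheme below (later checkpoints count for more) is an illustrative assumption, not SRCA's exact scoring rule:

```python
# Toy sketch of final-answer selection that uses ALL checkpoint answers,
# not just each path's final answer.
from collections import defaultdict

def vote_with_checkpoints(paths: list[list[str]]) -> str:
    """paths[i] is the sequence of checkpoint answers of reasoning path i."""
    scores: dict[str, float] = defaultdict(float)
    for checkpoints in paths:
        for step, answer in enumerate(checkpoints, start=1):
            scores[answer] += step / len(checkpoints)  # later steps weigh more
    return max(scores, key=scores.__getitem__)

# Three paths; the last one deviates at the end, but its early checkpoints
# still contribute evidence for "42" instead of being thrown away.
paths = [["42", "42", "42"], ["41", "42", "42"], ["42", "42", "7"]]
```

This is exactly the failure mode the 19.07% rescue statistic describes: a path that was correct mid-way but derailed later still casts votes for its correct intermediate answers.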
How Do Large Models Reason? A Key Stanford CS25 Lecture by a DeepMind Principal Scientist
机器之心· 2025-08-16 05:02
Core Insights
- The article discusses the insights shared by Denny Zhou, a leading figure in AI, regarding the reasoning capabilities of large language models (LLMs) and their optimization methods [3][4].

Group 1: Key Points on LLM Reasoning
- Denny Zhou emphasizes that reasoning in LLMs involves generating a series of intermediate tokens before arriving at a final answer, which strengthens the model without increasing its size [6][15].
- The challenge lies in the fact that reasoning-based outputs often do not appear at the top of the output distribution, making standard greedy decoding ineffective [6].
- Techniques such as chain-of-thought prompting and reinforcement learning fine-tuning have emerged as powerful methods to enhance LLM reasoning capabilities [6][29].

Group 2: Theoretical Framework
- Zhou notes that any problem solvable by Boolean circuits can be addressed by a constant-sized transformer model that generates intermediate tokens, providing a theoretical grounding for reasoning [16].
- Intermediate tokens matter because they allow models to solve complex problems without requiring deep architectures [16].

Group 3: Decoding Techniques
- The lecture introduces chain-of-thought decoding, which checks multiple generated candidates rather than relying on the single most likely answer [22][27].
- This method requires some engineering effort but can significantly improve reasoning outcomes, as can guiding the model with natural-language prompts [27].

Group 4: Self-Improvement and Data Generation
- The self-improvement approach allows models to generate their own training data, reducing reliance on human-annotated datasets [39].
- Rejection sampling is introduced: the model generates many solutions and retains those whose steps lead to the correct answer [40].

Group 5: Reinforcement Learning and Fine-Tuning
- Reinforcement learning fine-tuning (RL fine-tuning) has gained attention for its ability to improve model generalization, although not all tasks can be validated by machines [42][57].
- Reliable validators are essential in RL fine-tuning, and the quality of machine-generated training data can sometimes surpass that of human-generated data [45][37].

Group 6: Future Directions
- Zhou anticipates breakthroughs on tasks that extend beyond unique, verifiable answers, suggesting a shift toward building practical applications rather than solely addressing academic benchmarks [66].
- The lecture concludes with a reminder that simplicity in research can lead to clearer insights, echoing Richard Feynman's philosophy [68].
Single-Stage LLM Fine-Tuning with Simultaneous Supervision and Reinforcement: No More "Memorize First, Drill Later", with Gains in Both Reasoning and Generalization | CAS, Meituan, et al.
量子位· 2025-07-02 02:02
Core Viewpoint
- The article introduces the Supervised Reinforcement Fine-Tuning (SRFT) method, which combines supervised fine-tuning (SFT) and reinforcement learning (RL) in a single-stage approach to enhance the reasoning performance of large language models (LLMs) [1][22].

Group 1: Methodology
- SRFT employs a dual-strategy design to effectively utilize demonstration data, using SFT for coarse-grained behavior-policy approximation and RL for fine-grained policy refinement [23][24].
- The method introduces an entropy-aware adaptive weighting mechanism to balance the influence of SFT and RL, ensuring stable training dynamics [29][44].
- SRFT achieves a significant improvement in training efficiency, speeding up the process by 2.28 times compared to traditional sequential methods [21][44].

Group 2: Performance Results
- SRFT demonstrates an average accuracy of 59.1% across five mathematical reasoning tasks, outperforming the zero-RL baseline by 9.0% [4][47].
- On out-of-distribution tasks, SRFT achieves an average accuracy of 62.5%, surpassing the best baseline by 10.9% [4][47].
- The method shows superior generalization, with consistent performance improvements across various benchmarks [47][48].

Group 3: Training Dynamics
- SRFT's training dynamics reveal a more stable and efficient learning process, with a gradual increase in response length indicating deeper reasoning [48].
- SRFT maintains more stable entropy during training, allowing continued exploration, unlike pure RL, which exhibits rapid entropy decline [20][48].
- Analysis of training trajectories indicates that SRFT effectively balances knowledge acquisition and self-exploration without excessive deviation from the initial model [15][45].
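The entropy-aware adaptive weighting can be sketched as a single combined loss whose SFT/RL mix follows the policy's entropy: an uncertain (high-entropy) policy leans on demonstrations, while a confident (low-entropy) policy leans on its own RL signal. The specific weight w(H) = H / H_max below is an illustrative assumption; the paper's actual schedule may differ:

```python
# Toy sketch of an entropy-aware blend of an SFT loss and an RL loss.
import math

def policy_entropy(probs: list[float]) -> float:
    return -sum(p * math.log(p) for p in probs if p > 0)

def combined_loss(sft_loss: float, rl_loss: float,
                  probs: list[float]) -> float:
    """High entropy (uncertain policy) -> weight the demonstration (SFT)
    term; low entropy (confident policy) -> weight the exploration (RL)
    term. Both signals act in the same training step, not sequentially."""
    h_max = math.log(len(probs))       # entropy of the uniform policy
    w = policy_entropy(probs) / h_max  # normalized to [0, 1]
    return w * sft_loss + (1.0 - w) * rl_loss

uniform = [0.25] * 4                # maximally uncertain -> pure SFT weight
peaked = [0.97, 0.01, 0.01, 0.01]   # confident -> mostly RL weight
```

This kind of smooth hand-off is one way to get the stable entropy curve the article attributes to SRFT, in contrast to pure RL's rapid entropy collapse.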
Red Hat Announces the llm-d Community, with NVIDIA and Google Cloud as Founding Contributors
Xin Lang Ke Ji· 2025-05-27 03:42
Group 1
- Red Hat has launched a new open-source project called llm-d to meet the large-scale inference demands of generative AI, collaborating with CoreWeave, Google Cloud, IBM Research, and NVIDIA [1][3].
- According to Gartner, by 2028 over 80% of data center workload accelerators will be deployed specifically for inference rather than training, indicating a shift in resource allocation [3].
- The llm-d project aims to integrate advanced inference capabilities into existing enterprise IT infrastructure, addressing the challenges posed by increasing resource demands and potential bottlenecks in AI innovation [3].

Group 2
- The llm-d platform allows IT teams to meet various service demands for critical business workloads while maximizing efficiency and significantly reducing the total cost of ownership associated with high-performance AI accelerators [3].
- The project has garnered support from a coalition of generative AI model providers, AI accelerator pioneers, and major AI cloud platforms, indicating deep industry collaboration to build large-scale LLM services [3].
- Key contributors to the llm-d project include CoreWeave, Google Cloud, IBM Research, and NVIDIA, with partners such as AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI [3][4].

Group 3
- Google Cloud, a founding contributor to the llm-d project, emphasizes the importance of efficient AI inference in deploying AI at scale and creating value for users [4].
- NVIDIA views the llm-d project as a significant addition to the open-source AI ecosystem, supporting scalable, high-performance inference as a key to the next wave of generative and agent-based AI [4].
- NVIDIA is collaborating with Red Hat and other partners to promote community engagement and industry adoption of the llm-d initiative, leveraging innovations like NIXL to accelerate its development [4].