Large Model Inference
NeurIPS 2025 | DynaAct: Beyond DeepSeek R1, Exploring Another Path for Large Model Inference
机器之心· 2025-11-29 09:33
The first author of this work is Zhao Xueliang, a PhD student in the Department of Computer Science at the University of Hong Kong; Wu Wei and Guan Jian of Ant Group are co-contributors. After R1 and O1 set off the wave of "deep reasoning," the field of large model inference is reaching a new fork in the road. The recent explosion in large model reasoning stems from a shift in the scaling paradigm: from train-time scaling to test-time scaling (TTS), i.e., deploying more compute at the inference stage. The typical realization is the long-CoT approach represented by DeepSeek R1: improving answer accuracy by lengthening the chain of thought. But is long CoT the only way to realize TTS? In response to this question, a research team from Ant Group and the HKU natural language group (hereafter "the team") offers another route to TTS: let the model not only "think longer," but also "think more precisely." Unlike traditional token-by-token CoT, DynaAct proposes a TTS paradigm centered on Action Space Optimization: at each reasoning step it dynamically constructs a set of candidate actions and uses a learned algorithm to select the optimal one, making the reasoning path more efficient and more structured. Following this idea, the team developed DynaAct, and the work has been accepted at NeurIPS 2025. ...
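A minimal sketch of the action-space-optimized TTS loop described above: at each step the model builds a candidate action set, scores it, and commits the best action. The `propose_actions` and `score_action` callables, the greedy selection, and the `FINAL:` stop marker are illustrative assumptions, not DynaAct's actual construction or training procedure.

```python
# Minimal sketch of action-space-optimized test-time scaling: at every
# reasoning step, build a candidate action set, score it, commit the best
# action. `propose_actions`, `score_action`, and the "FINAL:" stop marker
# are illustrative assumptions, not DynaAct's actual components.
from typing import Callable, List

def reason_with_action_selection(
    question: str,
    propose_actions: Callable[[str], List[str]],  # dynamic action-set builder
    score_action: Callable[[str, str], float],    # learned utility estimate
    max_steps: int = 16,
) -> str:
    trace = question
    for _ in range(max_steps):
        candidates = propose_actions(trace)        # action set for this step
        if not candidates:
            break
        best = max(candidates, key=lambda a: score_action(trace, a))
        trace += "\n" + best                       # commit the chosen action
        if best.strip().startswith("FINAL:"):      # assumed stop convention
            break
    return trace
```

Compared with token-by-token long CoT, the compute here is spent on widening and pruning the step-level action set rather than simply lengthening the trace.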
Huawei Unveils Its Big Move for "Near-Trillion-Parameter MoE Inference," Open-Sourcing Two Killer Optimization Techniques
机器之心· 2025-11-28 04:11
机器之心 report. Editor: Du Wei. As 2025 draws to a close, large models have accelerated this year from point tools for efficiency gains into underlying infrastructure supporting business systems, and inference efficiency now determines whether large models can truly be deployed. For ultra-large MoE models, complex inference pipelines pose challenges in compute, communication, and memory access, and the industry urgently needs an efficient, controllable inference path. Huawei has presented a complete technology stack for near-trillion-parameter MoE inference: openPangu-Ultra-MoE-718B-V1.1 demonstrates the potential of the MoE architecture, while Ascend-affinity acceleration techniques, including the Omni Proxy scheduling feature and the AMLA technique that pushes Ascend hardware compute utilization to 86%, make production-grade deployment of ultra-large MoE models practically feasible. Open-source implementation: https://gitcode.com/ascend-tribe/ascend-inference-cluster# If the focus of large model competition in past years was training scale and capability breakthroughs, inference efficiency is now rapidly becoming the key variable in whether a model can be deployed. Model GitCode address: https://ai.gitcode.com/ascend-tribe/openPangu-Ultra-MoE-718B-V1.1-Int8 In terms of task characteristics, ...
Chip ETF (512760) Logs Over 400 Million Yuan of Net Inflows for Five Straight Days; Nvidia Announces Chip Shipment Expectations
Mei Ri Jing Ji Xin Wen· 2025-11-05 07:05
Group 1
- Nvidia announced the release of Blackwell and Rubin architecture rack-scale solutions at GTC 2025, with the first-generation Rubin NVL144 showing a performance improvement of approximately 3.3 times over the GB300 NVL72, and the second-generation Rubin Ultra NVL576 expected to deliver about 14 times the performance, set to launch in the second half of 2027 [1]
- The Vera Rubin Superchip was showcased, featuring an 88-core Arm CPU, dual Rubin GPUs, and 2TB of memory, achieving a computing power of 100 PFLOPS; additionally, the CPX computing board was presented, supporting context acceleration for over one million tokens, aimed at large model inference scenarios [1]
- Nvidia anticipates GPU sales exceeding $500 billion over the next five quarters, with projected lifecycle shipments of Blackwell and Rubin reaching 20 million units, significantly higher than the 4 million units of Hopper [1]

Group 2
- Nvidia will collaborate with Oracle and the U.S. Department of Energy to build the Solstice and Equinox supercomputing systems, deploying 100,000 and 10,000 Blackwell GPUs respectively, with a total computing power of approximately 2200 EFLOPS, expected to be operational by the first half of 2026 [1]
- The chip ETF (512760) tracks the China Semiconductor Chip Index (990001), which selects listed companies involved in semiconductor chip materials, equipment, design, manufacturing, packaging, and testing from the Shanghai and Shenzhen markets to reflect the overall performance of listed companies in the semiconductor industry [1]
Jinqiu Fund Joins Micro Nano Core's 100-Million-Yuan-Plus Financing Round; First-of-Its-Kind 3D Computing-in-Memory 3D-CIM™ Chip Opens a New Chapter for Large Model Inference | Jinqiu Spotlight
锦秋集· 2025-10-30 13:34
Core Insights
- Jinqiu Fund has completed an investment in Micro Nano Core, a leading AI chip company specializing in 3D integrated computing-in-memory (CIM) technology [2][4][12]
- The investment highlights the market's strong consensus on the disruptive potential of 3D-CIM™ technology in AI computing applications, aiming to enhance edge AI capabilities [4][6]
- The global market for edge AI chips is projected to grow significantly, from $2 billion in 2024 to $16.7 billion by 2028, driven by the increasing demand for high-performance, low-power, and cost-effective solutions [6][7]

Investment Overview
- Micro Nano Core has successfully completed over 100 million yuan in Series B strategic financing, led by BlueRun Ventures, with participation from top-tier institutions including Jinqiu Fund [4]
- The funding will support the development of the world's fastest mass-producible 3D edge AI chips, which are expected to unlock new AI application scenarios [4][9]

Technology and Innovation
- Micro Nano Core's 3D-CIM™ chip technology combines 3D near-memory computing and computing-in-memory (CIM) to achieve significant improvements in performance, power efficiency, and cost [8][12]
- The company claims to have achieved over 4 times the compute density and a more than 10-fold reduction in power consumption compared to traditional von Neumann architectures [8][9]
- The RV-CIM™ full-stack technology addresses chip usability, ensuring compatibility with the RISC-V open-source ecosystem [8][10]

Market Trends
- The evolution of AI agents from execution tools to decision-making partners is expected to drive a revolution across industries, with China targeting 70% penetration of intelligent terminals by 2027 [6]
- Demand for high-performance, low-power, and low-cost chips is critical for the widespread adoption of AI agents, with industry consensus converging on 3D stacking as a solution [7][8]

Team and Ecosystem
- Micro Nano Core boasts a world-class innovation team with a strong track record in chip design, having published numerous record-breaking results in international competitions [10][13]
- The company is leading the development of global RISC-V CIM standards and collaborating with multiple industry leaders to build a robust ecosystem around its technology [11][13]
- Strategic partnerships with major storage and terminal manufacturers aim to create a self-sustaining ecosystem that accelerates the adoption of 3D-CIM™ chips across applications [11][12]
Exclusive | In Conversation with Tensormesh's Three Co-Founders: From Academia to the Front Lines of the Large Model Inference Industry
Z Potentials· 2025-10-24 08:18
Core Insights
- Tensormesh, a company focused on providing cache-accelerated inference optimization for enterprises, has officially launched and secured $4.5 million in seed funding led by Laude Ventures [2]
- The founding team, consisting of Junchen Jiang, Yihua Cheng, and Kuntai Du, aims to bridge the gap between AI inference engines and storage services, leveraging their academic backgrounds to create a commercially viable product [3][4]

Company Overview
- Tensormesh is the first commercial platform to productize large-scale AI inference caching, inspired by the open-source project LMCache, combining advanced technology with enterprise-level usability, security, and manageability (a toy sketch of the underlying caching idea follows this list) [2][4]
- The product allows enterprises to deploy large model services easily, cutting operational costs to about one-tenth of public API usage while delivering up to ten times the performance of mainstream solutions [4][29]

Funding and Growth
- The funding process was unconventional, relying on personal connections rather than traditional methods like business plans or roadshows, resulting in a swift investment agreement [5][48]
- The seed funding will primarily be used for product refinement and team expansion, with a strategic focus on a strong open-source engine as the entry point for commercial value [5][40]

Market Position and Challenges
- The inference industry is still emerging, and with usage growing, the cost of inference now surpasses training costs, highlighting the need for efficient solutions [30][32]
- Tensormesh addresses three main challenges in deploying large models: privacy concerns, complex cluster management, and high operational costs [26][28]

Product Features and Innovations
- The product offers one-click deployment of in-house large model services, ensuring data privacy while significantly lowering costs and improving performance [29][30]
- Tensormesh aims to fill a market gap by providing a comprehensive solution that integrates inference engines, storage, scheduling, and routing, which the industry currently lacks [38]

Future Aspirations
- The company aspires to become the go-to solution for large model inference, much as Databricks is for big data [44][45]
- The long-term vision is to evolve with AI advancements, keeping Tensormesh relevant as the industry shifts from reliance on single models to more complex systems [51][52]
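As a minimal sketch of the prefix KV-cache reuse that LMCache-style inference caching is built around: hash token prefixes, look up the longest cached prefix, and prefill only the uncached suffix. The `kv_store`, hashing scheme, and `engine.prefill`/`engine.decode` calls are hypothetical stand-ins for illustration, not Tensormesh's actual interfaces.

```python
# Illustrative sketch of prefix KV-cache reuse, the general idea behind
# LMCache-style inference caching. The store, hashing scheme, and engine
# calls below are assumptions for illustration, not Tensormesh's API.
import hashlib

kv_store: dict[str, object] = {}  # maps prefix hash -> serialized KV tensors

def prefix_key(tokens: list[int]) -> str:
    return hashlib.sha256(str(tokens).encode("utf-8")).hexdigest()

def generate(engine, tokens: list[int]):
    # Find the longest cached prefix so we only prefill the new suffix.
    best_len, cached_kv = 0, None
    for n in range(len(tokens), 0, -1):
        kv = kv_store.get(prefix_key(tokens[:n]))
        if kv is not None:
            best_len, cached_kv = n, kv
            break
    # Hypothetical engine API: prefill the uncached suffix, reuse cached KV.
    kv = engine.prefill(tokens[best_len:], past_kv=cached_kv)
    kv_store[prefix_key(tokens)] = kv     # store for future requests
    return engine.decode(kv)
```

Reusing cached prefixes skips redundant prefill compute, which is where the cost and latency savings described above come from.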
KTransformers Accepted at a Top Computer Systems Conference and Partnering with Mainstream Frameworks: 趋境 & Tsinghua Make "Heterogeneous" a New Inference Paradigm
量子位· 2025-10-22 09:12
Core Insights
- KTransformers, an open-source project developed by Turing Technology and Tsinghua University's KVCache.AI team, focuses on system innovation in the inference phase of large models, enabling efficient operation on diverse hardware architectures with lower computational power [2][4]

Group 1: KTransformers Overview
- KTransformers is a high-performance heterogeneous inference framework that makes optimal use of varied computing resources such as GPUs, CPUs, and memory [2]
- The project paper was recognized at the prestigious SOSP 2025 conference, highlighting its significance in the field of computer systems [2][4]

Group 2: Technical Innovations
- The framework introduces an "Expert Deferral" mechanism for efficient scheduling of experts in Mixture of Experts (MoE) models, reducing computational load without sacrificing model performance (see the sketch after this list) [7][13]
- KTransformers achieves nearly 4x speedup on a single Intel Xeon processor compared to traditional PyTorch implementations, significantly enhancing CPU performance in expert calculations [12]
- The system dynamically overlaps CPU and GPU loads, increasing model throughput by approximately 1.45 times with minimal impact on model accuracy [15][16]

Group 3: Collaboration and Ecosystem
- KTransformers has partnered with SGLang, a mainstream inference framework, to integrate full GPU inference with heterogeneous inference, enhancing the overall architecture for large model deployment [5][19]
- This collaboration gives developers seamless access to both full-GPU and heterogeneous inference capabilities, which is particularly valuable when GPU resources are limited [21]

Group 4: Market Position and Future Directions
- KTransformers has gained significant traction in the developer community, with over 15.2K stars on GitHub, indicating its widespread adoption as a foundational framework for large model inference [24]
- The project aims to democratize AI capabilities beyond elite computational paths and is actively collaborating with various domestic CPU and GPU platforms to promote cost-effective solutions [28][29]
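To make the CPU/GPU overlap concrete, here is a conceptual sketch in the spirit of expert deferral: low-weight experts are computed asynchronously on the CPU and merged back later, so CPU and GPU work overlap within an MoE layer. The routing, threshold, and executor split are illustrative assumptions, not KTransformers' actual scheduler.

```python
# Conceptual sketch of CPU/GPU expert scheduling with deferral in an MoE
# layer, in the spirit of (but not identical to) KTransformers' mechanism.
# Routing, thresholds, and the executor split are illustrative assumptions.
import concurrent.futures as cf
import numpy as np

def moe_layer(x, experts_gpu, experts_cpu, router, defer_thresh=0.1):
    """Run hot experts on the GPU path now; defer low-weight experts to CPU
    workers whose results are merged back without blocking the hot path."""
    weights = router(x)                                # per-expert gate weights
    out = np.zeros_like(x)
    deferred = []
    with cf.ThreadPoolExecutor() as pool:
        for e in np.argsort(weights)[::-1]:            # highest weight first
            if weights[e] >= defer_thresh and e in experts_gpu:
                out += weights[e] * experts_gpu[e](x)  # hot path ("GPU")
            elif e in experts_cpu:
                # Low-impact expert: compute asynchronously ("CPU"), merge later.
                deferred.append((weights[e], pool.submit(experts_cpu[e], x)))
        for w, fut in deferred:                        # merge deferred results
            out += w * fut.result()
    return out
```

The design choice mirrors the article's point: shifting low-impact expert work off the critical path buys throughput at negligible accuracy cost.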
Tsinghua and Kuaishou Propose AttnRL: Letting Large Models Explore with "Attention"
机器之心· 2025-10-21 09:32
Core Insights
- The article discusses advances in reinforcement learning (RL), focusing on Process-Supervised RL (PSRL) and the introduction of a new framework called AttnRL, which enhances exploration efficiency and performance in reasoning models [3][4][9]

Group 1: Challenges in Traditional Methods
- Traditional PSRL methods assign equal reward signals to all tokens, neglecting fine-grained quality within the reasoning process [7]
- Existing PSRL approaches face significant bottlenecks in exploration efficiency and training costs, leading to high computational expense [4][10]

Group 2: Introduction of AttnRL
- AttnRL introduces an innovative exploration method that uses attention to guide the reasoning process, allowing the model to branch from high-attention steps [9][12]
- The framework employs Attention-based Tree Branching (ATB), which analyzes the reasoning sequence and computes Forward Context Influence (FCI) scores to identify the most impactful steps for branching (see the sketch after this list) [13][16]

Group 3: Adaptive Sampling Mechanisms
- AttnRL incorporates two adaptive sampling mechanisms, difficulty-aware exploration and dynamic batch adjustment, focusing the learning process on challenging problems while reducing computational load on simpler ones [20][22]
- The training process is streamlined into a one-step off-policy approach, significantly reducing sampling costs compared to previous PSRL methods [23]

Group 4: Experimental Results
- AttnRL demonstrates superior performance across mathematical reasoning benchmarks, achieving average accuracy of 57.2% for 1.5B models and 68.7% for 7B models, outperforming baselines such as GRPO and TreeRL [28]
- The framework samples more efficiently, achieving a higher effective ratio and better performance in fewer training steps than traditional methods [29][31]

Group 5: Future Outlook
- Introducing attention scores into PSRL exploration decisions opens new avenues for model interpretability and RL research, suggesting that efficiency and intelligence can coexist through more effective exploration strategies [34]
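To make the FCI idea concrete, here is a small sketch that computes an attention-based influence score per reasoning step and picks branch points from the highest-scoring steps. The exact normalization and aggregation AttnRL uses may differ, and `step_spans` is an assumed representation of step boundaries.

```python
# Illustrative computation of a Forward Context Influence (FCI) style score
# from an attention matrix: how much later tokens attend back to each step.
# The exact normalization and aggregation in AttnRL may differ.
import numpy as np

def fci_scores(attn: np.ndarray, step_spans: list) -> np.ndarray:
    """attn: (seq, seq) causal attention, averaged over heads/layers.
    step_spans: [start, end) token index pairs for each reasoning step."""
    scores = np.zeros(len(step_spans))
    for i, (s, e) in enumerate(step_spans):
        # Total attention that all *later* tokens pay to this step's tokens,
        # normalized by how many later tokens there are.
        scores[i] = attn[e:, s:e].sum() / max(attn.shape[0] - e, 1)
    return scores

def pick_branch_points(attn: np.ndarray, step_spans: list, k: int = 2):
    """Branch new rollouts from the k highest-influence steps."""
    return np.argsort(fci_scores(attn, step_spans))[::-1][:k]
```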
Hook Nvidia's Desktop Supercomputer Up to an Apple Mac Studio and It Really Flies: Inference Speed Surges to 277%
量子位· 2025-10-17 04:58
Core Viewpoint
- EXO Labs has developed a new framework that accelerates large model inference by combining NVIDIA's DGX Spark with Apple's M3 Ultra, achieving a speedup of up to 2.77 times for model deployment [1][5][18]

Group 1: Technology and Implementation
- The framework uses a PD (prefill and decode) separation approach: DGX Spark handles the prefill phase thanks to its high computational power, while the M3 Ultra manages the decode phase, benefiting from its high memory bandwidth [11][18]
- The prefill phase's computational demand grows quadratically with prompt length, while the decode phase is limited primarily by memory bandwidth, which makes separating the two tasks advantageous [8][11]
- EXO Labs streams the KV cache between the two devices, overlapping computation and data transfer to minimize communication costs (see the sketch after this list) [16][18]

Group 2: Performance Metrics
- Combining DGX Spark and M3 Ultra yields significant gains: prefill runs at 3.79 times the speed of the M3 Ultra alone, and decode at 3.37 times the speed of the DGX Spark alone [18][19]
- Overall, the combined system reduces total processing time to 2.32 seconds, a speedup of 2.8 times compared to using the M3 Ultra alone [19]

Group 3: Industry Context
- NVIDIA is exploring a similar PD separation technique with its upcoming Rubin CPX platform, which pairs a compute-intensive processor for prefill with a high-bandwidth-memory chip for decode [20]
- Recent deliveries of DGX Spark systems to notable figures in the tech industry indicate growing interest and investment in advanced AI inference technologies [22]
- Apple's latest M5 chip shows improvements in AI performance, but comparisons suggest the M3 Ultra may hold more value in the current AI hardware landscape [26][30]
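The overlap described above can be sketched as a producer/consumer pipeline: the prefill device emits each layer's KV cache as soon as it is computed, while the decode device installs caches concurrently, hiding transfer time behind the remaining prefill work. The `prefill_dev`/`decode_dev` objects and their methods are hypothetical stand-ins, not EXO's actual implementation.

```python
# Sketch of prefill/decode (PD) separation with layer-wise KV streaming,
# mirroring the overlap idea described above. Device objects and their
# methods are hypothetical stand-ins for the actual EXO implementation.
import queue
import threading

def run_pd_separated(prefill_dev, decode_dev, prompt, num_layers, max_new=128):
    kv_stream: queue.Queue = queue.Queue()

    def prefill():
        for layer in range(num_layers):
            kv = prefill_dev.prefill_layer(layer, prompt)  # compute-bound
            kv_stream.put((layer, kv))   # ship KV while the next layer runs
        kv_stream.put(None)              # end-of-stream marker

    t = threading.Thread(target=prefill)
    t.start()
    # The decode device installs KV caches as they arrive, overlapping the
    # transfer with the remaining prefill computation on the other device.
    while (item := kv_stream.get()) is not None:
        layer, kv = item
        decode_dev.load_kv(layer, kv)    # bandwidth-bound
    t.join()
    return decode_dev.decode(max_new_tokens=max_new)
```

Because transfer is per layer rather than one bulk copy at the end, communication cost is largely hidden, which is what makes the 2.77x combined speedup possible despite two devices being involved.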
China Telecom Completes the Industry's First Technology Verification of Heterogeneous Compute Collaboration for Large Model Inference
Xin Lang Cai Jing· 2025-10-13 23:42
Group 1
- The core takeaway is the successful deployment of the DeepSeek series models by China Telecom Research Institute, in collaboration with various industry partners, achieving cost reduction and efficiency improvement in large model inference by combining NVIDIA and domestic computing power [1][2]
- The DeepSeek 671B model demonstrated throughput improvements of 30% to 72% across multiple scenarios, with concurrency doubled and inference costs reduced by up to 42% at the same throughput [1]
- The successful verification of heterogeneous compute collaboration for large model inference reflects China Telecom's deep understanding of intelligent computing optimization technology and its innovative practice in adapting domestic computing power [2]

Group 2
- Industry consensus is shifting toward optimizing chip design for the prefill and decode stages of inference, with NVIDIA and Huawei each releasing chip design plans that pair "high compute, low storage" and "low compute, high storage" strategies (see the sketch after this list) [2]
- China Telecom Research Institute has developed a fully self-developed heterogeneous mixed inference system with three core advantages: efficient transfer between heterogeneous-chip PD pools, automatic recommendation and real-time optimization of PD resource allocation, and dynamic scheduling of inference tasks [2]
- China Telecom aims to continue advancing the high-quality development of domestic computing power, building a "connected and efficiently collaborative" heterogeneous computing ecosystem for large model training and inference [2]
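As a toy illustration of the "high compute, low storage / low compute, high storage" split, the heuristic below assigns heterogeneous nodes to prefill or decode pools based on their compute-to-bandwidth ratio. The metrics, threshold, and node figures are illustrative only, not China Telecom's allocation algorithm.

```python
# Toy heuristic for assigning heterogeneous nodes to prefill vs decode
# pools, echoing the "high compute, low storage / low compute, high
# storage" split above. Metrics and thresholds are illustrative only.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    tflops: float       # compute throughput
    mem_bw_gbs: float   # memory bandwidth

def assign_pd_pools(nodes: list) -> tuple:
    """Compute-heavy nodes serve prefill; bandwidth-heavy nodes serve decode."""
    prefill, decode = [], []
    for n in nodes:
        # Prefill is compute-bound, decode is memory-bandwidth-bound, so the
        # compute-to-bandwidth ratio decides which pool a node joins.
        (prefill if n.tflops / n.mem_bw_gbs > 1.0 else decode).append(n)
    return prefill, decode

pools = assign_pd_pools([
    Node("chip-a", tflops=1000, mem_bw_gbs=800),   # compute-rich -> prefill
    Node("chip-b", tflops=300, mem_bw_gbs=1600),   # bandwidth-rich -> decode
])
```

A production scheduler would also adjust the split online as traffic shifts, which is what the "real-time optimization" advantage above refers to.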
Beating DeepSeek R1 and Others with 7 Million Parameters: A Viral Solo-Authored Samsung Paper Uses Recursion to Upend Large Model Reasoning
机器之心· 2025-10-09 04:43
Core Viewpoint
- The article discusses the emergence of new models for AI reasoning, particularly the Hierarchical Reasoning Model (HRM) and the Tiny Recursive Model (TRM), highlighting their efficiency and performance on complex reasoning tasks despite having far fewer parameters than traditional large models [1][4][29]

Group 1: Hierarchical Reasoning Model (HRM)
- HRM, proposed by researchers from Sapient Intelligence, uses a hierarchical reasoning structure with 27 million parameters and achieves remarkable performance with only 1,000 training samples [1]
- The model's architecture is based on a two-network design, which increases the parameter count relative to conventional single-network supervised learning [12]
- HRM is benchmarked across various tasks, showing its accuracy on Sudoku-Extreme and Maze-Hard [25][29]

Group 2: Tiny Recursive Model (TRM)
- TRM, introduced by researchers at the Samsung Advanced Institute of Technology, contains only 7 million parameters yet outperforms larger models such as o3-mini and Gemini 2.5 Pro on challenging reasoning tasks [4][29]
- The model runs a recursive reasoning process, iterating up to 16 times to refine its answers, demonstrating the principle of "less is more" (a sketch of this loop follows below) [6][9]
- TRM's experimental results show superior accuracy on Sudoku-Extreme (87.4%) and competitive performance on other benchmarks compared to HRM [27][29]

Group 3: Experimental Results and Comparisons
- A comparison of accuracy rates between HRM and TRM across datasets shows TRM achieving higher accuracy with fewer parameters [23][29]
- On the ARC-AGI benchmarks, the TRM-Att and TRM-MLP variants outperform HRM, emphasizing the advantages of parameter efficiency and generalization capability [26][29]
- The findings suggest that reducing model complexity while increasing recursive iterations can improve performance, challenging traditional assumptions about model depth and parameter count [15][17]
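A minimal sketch of a TRM-style recursive refinement loop: a small network alternately updates a latent reasoning state and a candidate answer over up to 16 outer iterations. The shapes, the GRU-based latent update, and the residual answer refinement are simplified assumptions, not the paper's exact architecture.

```python
# Minimal sketch of Tiny-Recursive-Model-style inference: a small network
# repeatedly refines a latent state z and a candidate answer y over up to
# 16 outer iterations. Shapes and update rules are simplified assumptions.
import torch
import torch.nn as nn

class TinyRecursiveSketch(nn.Module):
    def __init__(self, dim=128, inner_steps=6, outer_steps=16):
        super().__init__()
        self.latent_update = nn.GRUCell(dim * 2, dim)  # refines latent z
        self.answer_update = nn.Linear(dim * 2, dim)   # refines answer y
        self.inner_steps, self.outer_steps = inner_steps, outer_steps

    def forward(self, x):                    # x: (batch, dim) embedded puzzle
        z = torch.zeros_like(x)              # latent reasoning state
        y = torch.zeros_like(x)              # current answer embedding
        for _ in range(self.outer_steps):    # recursive refinement loop
            for _ in range(self.inner_steps):
                z = self.latent_update(torch.cat([x, y], dim=-1), z)
            y = y + self.answer_update(torch.cat([z, y], dim=-1))
        return y

out = TinyRecursiveSketch()(torch.randn(2, 128))  # smoke test
```

The same tiny network is reused at every iteration, so extra "depth" comes from recursion rather than from added parameters, which is the paper's central trade.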