Large Model Inference
Chip ETF (512760) Logs Over 400 Million Yuan of Net Inflows Over Five Consecutive Days; Nvidia Discloses Chip Shipment Expectations
Mei Ri Jing Ji Xin Wen· 2025-11-05 07:05
Group 1 - Nvidia announced the release of Blackwell and Rubin architecture cabinet solutions at GTC2025, with the first-generation Rubin NVL144 showing a performance improvement of approximately 3.3 times over the GB300 NVL72, and the second-generation Rubin Ultra576 expected to enhance performance by about 14 times, set to launch in the second half of 2027 [1] - The Vera Rubin Superchip was showcased, featuring an 88-core Arm CPU, dual Rubin GPUs, and 2TB of memory, achieving a computing power of 100 PFLOPS; additionally, the CPX computing board was presented, supporting context acceleration for over one million tokens, aimed at large model inference scenarios [1] - Nvidia anticipates GPU sales exceeding $500 billion over the next five quarters, with projected lifecycle shipments of Blackwell and Rubin reaching 20 million units, significantly higher than the 4 million units of Hopper [1] Group 2 - Nvidia will collaborate with Oracle and the U.S. Department of Energy to build the Solstice and Equinox supercomputing systems, deploying 100,000 and 10,000 Blackwell GPUs respectively, with a total computing power of approximately 2200 EFLOPS, expected to be operational by the first half of 2026 [1] - The chip ETF (512760) tracks the China Semiconductor Chip Index (990001), which selects listed companies involved in semiconductor chip materials, equipment, design, manufacturing, packaging, and testing from the Shanghai and Shenzhen markets to reflect the overall performance of related listed companies in the semiconductor industry [1]
Jinqiu Fund Joins Micro Nano Core's 100-Million-Yuan-Plus Financing; Its Pioneering 3D Computing-in-Memory 3D-CIM™ Chip Opens a New Chapter in Large Model Inference | Jinqiu Spotlight
锦秋集· 2025-10-30 13:34
Core Insights
- Jinqiu Fund has completed an investment in Micro Nano Core, a leading AI chip company specializing in 3D integrated computing-in-memory (CIM) technology [2][4][12]
- The investment highlights the market's strong consensus on the disruptive potential of 3D-CIM™ technology in AI computing applications, aiming to enhance edge AI capabilities [4][6]
- The global market for edge AI chips is projected to grow significantly, from $2 billion in 2024 to $16.7 billion by 2028, driven by the increasing demand for high-performance, low-power, and cost-effective solutions [6][7]
Investment Overview
- Micro Nano Core has successfully completed over 100 million yuan in Series B strategic financing, led by BlueRun Ventures, with participation from top-tier institutions including Jinqiu Fund [4]
- The funding will support the development of the world's fastest mass-producible 3D edge AI chips, which are expected to unlock new AI application scenarios [4][9]
Technology and Innovation
- Micro Nano Core's 3D-CIM™ chip technology combines 3D near-memory computing and computing-in-memory (CIM) to achieve significant improvements in performance, power efficiency, and cost [8][12]
- The company claims over 4 times the computing density and a more than 10-fold reduction in power consumption compared to traditional von Neumann architectures [8][9]
- The RV-CIM™ full-stack technology addresses chip usability, ensuring compatibility with the RISC-V open-source ecosystem [8][10]
Market Trends
- The evolution of AI agents from execution tools to decision-making partners is expected to drive a revolution in various industries, with a target of 70% penetration of intelligent terminals by 2027 in China [6]
- The demand for high-performance, low-power, and low-cost chips is critical for the widespread adoption of AI agents, with the industry consensus leaning towards 3D stacking as a solution [7][8]
Team and Ecosystem
- Micro Nano Core boasts a world-class innovation team with a strong track record in chip design, having published numerous record-breaking results in international competitions [10][13]
- The company is leading the development of global RISC-V CIM standards and is collaborating with multiple industry leaders to build a robust ecosystem around its technology [11][13]
- The strategic partnerships with major storage and terminal manufacturers aim to create a self-sustaining ecosystem that enhances the adoption of 3D-CIM™ chips across various applications [11][12]
Exclusive | A Conversation with Tensormesh's Three Co-Founders: How Do You Go from Academia to the Front Line of Large Model Inference?
Z Potentials· 2025-10-24 08:18
Core Insights - Tensormesh, a company focused on providing cache-accelerated inference optimization for enterprises, has officially launched and secured $4.5 million in seed funding led by Laude Ventures [2] - The founding team, consisting of Junchen Jiang, Yihua Cheng, and Kuntai Du, aims to bridge the gap between AI inference engines and storage services, leveraging their academic backgrounds to create a commercially viable product [3][4] Company Overview - Tensormesh is the first commercial platform to productize large-scale AI inference caching, inspired by the open-source project LMCache, which combines advanced technology with enterprise-level usability, security, and manageability [2][4] - The company’s product allows enterprises to deploy large model services easily, significantly reducing operational costs to about one-tenth of public API usage while enhancing performance by up to ten times compared to mainstream solutions [4][29] Funding and Growth - The funding process for Tensormesh was unconventional, relying on personal connections rather than traditional methods like business plans or roadshows, resulting in a swift investment agreement [5][48] - The seed funding will primarily be used for product refinement and team expansion, with a strategic focus on creating a strong open-source engine as an entry point for commercial value [5][40] Market Position and Challenges - The inference industry is emerging, with the cost of inference surpassing training costs due to increased usage, highlighting the need for efficient solutions [30][32] - Tensormesh addresses three main challenges in deploying large models: privacy concerns, complex cluster management, and high operational costs [26][28] Product Features and Innovations - The product offers a one-click deployment solution for in-house large model services, ensuring data privacy while significantly lowering costs and improving performance [29][30] - Tensormesh aims to fill a market gap by providing a comprehensive solution that integrates inference engines, storage, scheduling, and routing, which is currently lacking in the industry [38] Future Aspirations - The company aspires to become the go-to solution for large model inference, similar to how Databricks is recognized in big data [44][45] - The long-term vision includes evolving with AI advancements, ensuring that Tensormesh remains relevant as the industry shifts from reliance on single models to more complex systems [51][52]
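The caching idea behind this productization can be illustrated with a minimal sketch. The snippet below is an illustrative toy, not Tensormesh's or LMCache's actual API: it keys previously computed KV tensors by the token prefix, so a new request that repeats an earlier prefix skips recomputing prefill for those tokens. The `compute_kv` placeholder and the cache layout are assumptions made purely for illustration.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

def compute_kv(tokens: List[int]) -> dict:
    """Placeholder for running the model's prefill over a token span and
    returning its key/value tensors (a real system would return GPU tensors)."""
    return {"kv_for": tuple(tokens)}

class PrefixKVCache:
    """Toy prefix cache: the longest cached prefix of a request is reused,
    and only the uncached suffix goes through prefill."""

    def __init__(self) -> None:
        self._store: Dict[str, dict] = {}

    @staticmethod
    def _key(tokens: List[int]) -> str:
        return hashlib.sha256(str(tokens).encode()).hexdigest()

    def put(self, tokens: List[int], kv: dict) -> None:
        self._store[self._key(tokens)] = kv

    def longest_prefix_hit(self, tokens: List[int]) -> Tuple[int, Optional[dict]]:
        # Scan from the full sequence down to length 1 for a cached prefix.
        for n in range(len(tokens), 0, -1):
            kv = self._store.get(self._key(tokens[:n]))
            if kv is not None:
                return n, kv
        return 0, None

cache = PrefixKVCache()

def prefill_with_cache(tokens: List[int]) -> dict:
    hit_len, cached_kv = cache.longest_prefix_hit(tokens)
    if hit_len == len(tokens):
        return cached_kv                          # full hit: no prefill at all
    new_kv = compute_kv(tokens[hit_len:])         # prefill only the new suffix
    merged = {"reused": cached_kv, "new": new_kv}
    cache.put(tokens, merged)
    return merged

# The second call shares the first four tokens, so only two tokens are prefilled.
prefill_with_cache([1, 2, 3, 4])
print(prefill_with_cache([1, 2, 3, 4, 5, 6]))
```

In a production system the store would span GPU, CPU, and disk tiers with size- and recency-based eviction; the point here is only the reuse-by-prefix mechanism that makes repeated prompts far cheaper than re-running full prefill or calling a public API each time.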
KTransformers Makes a Top Computer Systems Conference and Partners with Mainstream Frameworks: 趋境 and Tsinghua Turn "Heterogeneous" into a New Paradigm for Inference
量子位· 2025-10-22 09:12
Core Insights - KTransformers, an open-source project developed by Turing Technology and Tsinghua University's KVCache.AI team, focuses on system innovation during the inference phase of large models, enabling efficient operation on diverse hardware architectures with lower computational power [2][4]. Group 1: KTransformers Overview - KTransformers is a high-performance heterogeneous inference framework that optimally utilizes various computing resources such as GPUs, CPUs, and memory [2]. - The project paper was recognized at the prestigious SOSP 2025 conference, highlighting its significance in the field of computer systems [2][4]. Group 2: Technical Innovations - The framework introduces an "Expert Deferral" mechanism, allowing for efficient scheduling of experts in Mixture of Experts (MoE) models, which reduces computational load without sacrificing model performance [7][13]. - KTransformers achieves nearly 4x speedup on a single Intel Xeon processor compared to traditional PyTorch implementations, significantly enhancing CPU performance in expert calculations [12]. - The system allows for dynamic overlapping of CPU and GPU loads, resulting in a model throughput increase of approximately 1.45 times, with minimal impact on model accuracy [15][16]. Group 3: Collaboration and Ecosystem - KTransformers has partnered with SGLang, a mainstream inference framework, to integrate full GPU inference with heterogeneous inference, enhancing the overall architecture for large model deployment [5][19]. - This collaboration enables developers to access both full GPU and heterogeneous inference capabilities seamlessly, particularly beneficial in scenarios with limited GPU resources [21]. Group 4: Market Position and Future Directions - KTransformers has gained significant traction in the developer community, with over 15.2K stars on GitHub, indicating its widespread adoption as a foundational framework for large model inference [24]. - The project aims to democratize AI capabilities, making them accessible beyond elite computational paths, and is actively collaborating with various domestic CPU and GPU platforms to promote cost-effective solutions [28][29].
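As a rough illustration of the heterogeneous split described above, the sketch below routes a MoE layer's "hot" experts to the GPU and places the long tail of experts on the CPU. It is a simplified assumption of how such a split can look, not KTransformers' actual implementation: expert placement, the gating rule, and tensor shapes are invented, and the toy runs the two pools sequentially rather than overlapping them as a real system would.

```python
import torch
import torch.nn as nn

class HeterogeneousMoE(nn.Module):
    """Toy MoE layer: a few frequently used experts live on GPU,
    the remaining experts are evaluated on CPU."""

    def __init__(self, d_model: int, n_experts: int, n_gpu_experts: int):
        super().__init__()
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.gate = nn.Linear(d_model, n_experts).to(self.device)
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
        self.gpu_ids = set(range(n_gpu_experts))          # assume the first few are "hot"
        for i, expert in enumerate(self.experts):
            expert.to(self.device if i in self.gpu_ids else "cpu")

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: [tokens, d_model]
        x = x.to(self.device)
        top1 = self.gate(x).argmax(dim=-1)                # pick one expert per token
        out = torch.zeros_like(x)
        for eid in top1.unique().tolist():
            mask = top1 == eid
            if eid in self.gpu_ids:
                out[mask] = self.experts[eid](x[mask])            # stays on GPU
            else:
                cpu_in = x[mask].to("cpu")                        # defer this expert to CPU
                out[mask] = self.experts[eid](cpu_in).to(self.device)
        return out

# Example: 8 experts, 2 kept on GPU, the rest computed on CPU.
layer = HeterogeneousMoE(d_model=64, n_experts=8, n_gpu_experts=2)
tokens = torch.randn(16, 64)
print(layer(tokens).shape)
```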
Tsinghua and Kuaishou Propose AttnRL: Letting Large Models Explore with "Attention"
机器之心· 2025-10-21 09:32
Core Insights - The article discusses the advancements in reinforcement learning (RL), particularly focusing on Process-Supervised RL (PSRL) and the introduction of a new framework called AttnRL, which enhances exploration efficiency and performance in reasoning models [3][4][9]. Group 1: Challenges in Traditional Methods - Traditional PSRL methods assign equal reward signals to all tokens, neglecting the fine-grained quality during the reasoning process [7]. - Existing PSRL approaches face significant bottlenecks in exploration efficiency and training costs, leading to high computational expenses [4][10]. Group 2: Introduction of AttnRL - AttnRL introduces an innovative exploration method by utilizing attention mechanisms to guide the reasoning process, allowing the model to branch from high-attention steps [9][12]. - The framework employs Attention-based Tree Branching (ATB), which analyzes the reasoning sequence and calculates Forward Context Influence (FCI) scores to determine the most impactful steps for branching [13][16]. Group 3: Adaptive Sampling Mechanisms - AttnRL incorporates two adaptive sampling mechanisms: difficulty-aware exploration and dynamic batch adjustment, optimizing the learning process by focusing on challenging problems while reducing computational load on simpler ones [20][22]. - The training process is streamlined to a One-Step Off-Policy approach, significantly reducing sampling costs compared to previous PSRL methods [23]. Group 4: Experimental Results - AttnRL demonstrates superior performance across various mathematical reasoning benchmarks, achieving average accuracy rates of 57.2% for 1.5B models and 68.7% for 7B models, outperforming baseline methods like GRPO and TreeRL [28]. - The framework shows improved efficiency in sampling, with a higher effective ratio and better performance in fewer training steps compared to traditional methods [29][31]. Group 5: Future Outlook - The introduction of attention scores in PSRL exploration decisions opens new avenues for enhancing model interpretability and RL research, suggesting that efficiency and intelligence can coexist through more effective exploration strategies [34].
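To make the branching idea concrete, the sketch below shows one plausible way to score reasoning steps by how much later tokens attend back to them, then branch from the highest-scoring step. The score definition, the step segmentation, and all shapes are assumptions for illustration only and are not claimed to match the paper's exact FCI formula.

```python
import numpy as np

def forward_context_influence(attn: np.ndarray, step_spans: list) -> np.ndarray:
    """attn: [seq_len, seq_len] lower-triangular attention weights
    (attn[i, j] = weight token i places on earlier token j).
    step_spans: (start, end) token-index ranges of each reasoning step.
    Returns one score per step: the average attention that tokens *after*
    the step pay to tokens *inside* the step."""
    scores = []
    for start, end in step_spans:
        future = attn[end:, start:end]       # rows = later tokens, cols = this step
        scores.append(future.sum() / max(1, future.size))
    return np.array(scores)

def pick_branch_step(attn: np.ndarray, step_spans: list) -> int:
    """Branch from the step whose content most strongly shapes what follows."""
    return int(forward_context_influence(attn, step_spans).argmax())

# Toy example: 12 tokens split into 3 reasoning steps of 4 tokens each.
rng = np.random.default_rng(0)
attn = np.tril(rng.random((12, 12)))
spans = [(0, 4), (4, 8), (8, 12)]
print("branch from step:", pick_branch_step(attn, spans))
```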
Add an Apple Mac Studio to Nvidia's Desktop Supercomputer and It Flies: Inference Speed Surges to 277%
量子位· 2025-10-17 04:58
Core Viewpoint - EXO Labs has developed a new framework that enhances large model inference speed by combining NVIDIA's DGX Spark with Apple's M3 Ultra, achieving a speedup of up to 2.77 times for model deployment [1][5][18]. Group 1: Technology and Implementation - The framework utilizes a PD (Prefill and Decode) separation approach, where DGX Spark handles the Prefill phase due to its high computational power, while M3 Ultra manages the Decode phase, benefiting from its high memory bandwidth [11][18]. - The Prefill phase's computational demand grows quadratically with prompt length, while the Decode phase is primarily limited by memory bandwidth, making the separation of tasks advantageous [8][11]. - EXO Labs employs a streaming transmission method for KV cache, allowing for overlapping computation and data transfer between the two devices, which minimizes communication costs [16][18]. Group 2: Performance Metrics - The combination of DGX Spark and M3 Ultra results in significant performance improvements: Prefill speed increases to 3.79 times that of M3 Ultra alone, and Decode speed improves to 3.37 times that of DGX Spark [18][19]. - The overall performance metrics show that the combined system reduces total processing time to 2.32 seconds, achieving a speedup of 2.8 times compared to using M3 Ultra alone [19]. Group 3: Industry Context - NVIDIA is also exploring similar PD separation techniques with its upcoming Rubin CPX platform, which will utilize a compute-intensive processor for Prefill and a high-bandwidth memory chip for Decode [20]. - The recent delivery of DGX Spark systems to notable figures in the tech industry indicates a growing interest and investment in advanced AI inference technologies [22]. - Apple's latest M5 chip shows improvements in AI performance, but comparisons suggest that M3 Ultra may hold more value in the current landscape of AI hardware [26][30].
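The overlap pattern described above can be sketched as follows: the prefill device computes KV for one layer, hands it to a background sender, and immediately starts the next layer, so transfer to the decode device overlaps with compute. The device timings and the `compute_layer_kv` / `send_to_decoder` helpers are hypothetical stand-ins; this illustrates the overlap idea, not EXO Labs' code.

```python
import queue
import threading
import time

NUM_LAYERS = 4

def compute_layer_kv(layer: int, prompt: str) -> bytes:
    time.sleep(0.05)      # stand-in for prefill compute on the compute-heavy device
    return f"kv[layer={layer}]".encode()

def send_to_decoder(payload: bytes) -> None:
    time.sleep(0.03)      # stand-in for shipping KV to the bandwidth-heavy decode device

def prefill_with_streaming(prompt: str) -> None:
    outbox = queue.Queue()                # layer-by-layer KV handoff to a background sender

    def sender() -> None:
        while True:
            item = outbox.get()
            if item is None:              # end-of-stream marker
                return
            send_to_decoder(item)

    tx = threading.Thread(target=sender)
    tx.start()
    for layer in range(NUM_LAYERS):
        kv = compute_layer_kv(layer, prompt)   # compute layer L on the prefill device...
        outbox.put(kv)                         # ...while the previous layer's KV is still in transit
    outbox.put(None)
    tx.join()

start = time.time()
prefill_with_streaming("a very long prompt ...")
print(f"wall time: {time.time() - start:.2f}s (most transfer time hidden behind compute)")
```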
China Telecom Completes the Industry's First Technology Verification of Heterogeneous Compute Collaboration for Large Model Inference
Xin Lang Cai Jing· 2025-10-13 23:42
The success of this technology verification reflects China Telecom's deep understanding of inference optimization for intelligent computing and its hands-on innovation in adapting and tuning domestic compute, demonstrating its commitment, as a state-owned builder of compute infrastructure, to moving domestic compute from merely "usable" to genuinely "good to use." Going forward, China Telecom will continue to deepen its layout for the high-quality development of domestic compute, building an "interconnected, efficiently coordinated" heterogeneous compute ecosystem for integrated large-model training and inference and for multi-agent systems, and promoting the coordinated development of new information infrastructure.
Optimizing chip design around the distinct characteristics of the Prefill and Decode phases of inference is becoming an industry consensus: Nvidia and Huawei have each announced chip roadmaps that bake the "high compute, low memory" (Prefill) versus "low compute, high memory" (Decode) split directly into silicon. In early 2025, China Telecom Research Institute recognized that PD-disaggregated inference demands heterogeneous compute and built a fully in-house heterogeneous mixed-inference stack covering heterogeneous communication optimization, PD resource allocation, and inference task scheduling. It offers three core advantages: first, a self-developed heterogeneous transport engine enables efficient KVCache transfer between Prefill and Decode pools built on chips of different architectures; second, the self-developed domestic-compute enablement tool "翼芯" automatically recommends and continuously optimizes the PD resource ratio based on workload characteristics and compute performance; third, an AI inference platform dynamically schedules inference tasks between the Prefill pool and the Decode pool.
(Staff report by reporter 翼研) Recently, China Telecom Research Institute, together with the Beijing Academy of Artificial Intelligence, Kunlunxin Technology Co., Ltd., ZTE, 北京基流 ...
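As a back-of-the-envelope illustration of the PD resource-ratio idea mentioned above (an illustration only, not the "翼芯" tool): prefill work scales with prompt tokens, decode work with generated tokens, and the two pools can be sized so that neither side idles. All numbers and the cost model below are invented assumptions.

```python
def recommend_pd_ratio(avg_prompt_tokens: float,
                       avg_output_tokens: float,
                       prefill_tokens_per_s_per_chip: float,
                       decode_tokens_per_s_per_chip: float) -> float:
    """Return (prefill chips) / (decode chips) so that, per average request,
    both pools spend roughly the same wall time."""
    prefill_time_per_chip = avg_prompt_tokens / prefill_tokens_per_s_per_chip
    decode_time_per_chip = avg_output_tokens / decode_tokens_per_s_per_chip
    return prefill_time_per_chip / decode_time_per_chip

# Invented example: long prompts, short answers, compute-heavy prefill chips,
# bandwidth-heavy decode chips -> a relatively small prefill pool suffices.
ratio = recommend_pd_ratio(avg_prompt_tokens=4000,
                           avg_output_tokens=400,
                           prefill_tokens_per_s_per_chip=20000,
                           decode_tokens_per_s_per_chip=150)
print(f"prefill:decode chip ratio ~ {ratio:.2f} : 1")
```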
7 Million Parameters Beat DeepSeek R1 and Others: A Single-Author Samsung Paper Goes Viral, Using Recursion to Upend Large-Model Reasoning
机器之心· 2025-10-09 04:43
Core Viewpoint - The article discusses the emergence of new models in AI reasoning, particularly the Hierarchical Reasoning Model (HRM) and the Tiny Recursive Model (TRM), highlighting their efficiency and performance in complex reasoning tasks despite having significantly fewer parameters compared to traditional large models [1][4][29]. Group 1: Hierarchical Reasoning Model (HRM) - HRM, proposed by researchers from Sapient Intelligence, utilizes a hierarchical reasoning structure and has 27 million parameters, achieving remarkable performance with only 1,000 training samples [1]. - The model's architecture is based on a two-network design, which increases the parameter count compared to conventional single-network supervised learning [12]. - HRM's performance is benchmarked against various tasks, showing its accuracy in Sudoku-Extreme and Maze-Hard [25][29]. Group 2: Tiny Recursive Model (TRM) - TRM, introduced by researchers from Samsung Advanced Technology Institute, contains only 7 million parameters and outperforms larger models like o3-mini and Gemini 2.5 Pro in challenging reasoning tasks [4][29]. - The model operates through a recursive reasoning process, iterating up to 16 times to refine its answers, demonstrating the principle of "less is more" [6][9]. - TRM's experimental results indicate superior accuracy in Sudoku-Extreme (87.4%) and competitive performance in other benchmarks compared to HRM [27][29]. Group 3: Experimental Results and Comparisons - The article presents a comparison of accuracy rates between HRM and TRM across various datasets, showing TRM's efficiency in achieving higher accuracy with fewer parameters [23][29]. - In the ARC-AGI benchmarks, TRM-Att and TRM-MLP models demonstrate better performance than HRM, emphasizing the advantages of parameter efficiency and generalization capabilities [26][29]. - The findings suggest that reducing model complexity while increasing recursive iterations can lead to improved performance, challenging traditional assumptions about model depth and parameter size [15][17].
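A minimal sketch of the recursive-refinement pattern described above (not the actual TRM architecture; the sizes, update rules, and fixed iteration count are assumptions): a tiny network is applied repeatedly, refining a latent scratchpad and a current answer embedding from the question for up to 16 passes instead of stacking many layers.

```python
import torch
import torch.nn as nn

class TinyRecursiveRefiner(nn.Module):
    """Toy recursive reasoner: one small network applied repeatedly,
    refining (latent, answer) rather than adding depth or parameters."""

    def __init__(self, dim: int = 64, max_iters: int = 16):
        super().__init__()
        self.max_iters = max_iters
        self.update_latent = nn.Sequential(
            nn.Linear(3 * dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.update_answer = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, question: torch.Tensor) -> torch.Tensor:   # question: [batch, dim]
        latent = torch.zeros_like(question)
        answer = torch.zeros_like(question)
        for _ in range(self.max_iters):
            # Re-read the question, the current answer, and the latent scratchpad each pass.
            latent = self.update_latent(torch.cat([question, answer, latent], dim=-1))
            answer = self.update_answer(torch.cat([answer, latent], dim=-1))
        return answer

model = TinyRecursiveRefiner()
print(sum(p.numel() for p in model.parameters()), "parameters")   # small by design
q = torch.randn(2, 64)
print(model(q).shape)   # torch.Size([2, 64])
```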
How Were the Most Popular Open-Source Large Model Inference Frameworks, vLLM and SGLang, Built?
AI科技大本营· 2025-09-24 02:01
Core Viewpoint - The article discusses the development stories of vLLM and SGLang, two prominent open-source inference engines for large language models (LLMs), highlighting their innovations, community engagement, and performance metrics. Group 1: LLM Inference Challenges - The core challenge of LLM inference lies in deploying models with hundreds of billions of parameters under strict constraints of latency, throughput, and cost [3] - The inference process involves applying learned knowledge to new data, which requires efficient computation and memory management [2][3] Group 2: vLLM Development - vLLM originated from a 2023 paper on PagedAttention, which innovatively applied operating system techniques for memory management, significantly enhancing throughput [7][8] - vLLM demonstrated remarkable performance improvements, handling up to 5 times the traffic and increasing throughput by 30 times compared to previous backends [9] - The project quickly evolved from a research initiative to a community-driven open-source project, amassing over 56,000 stars on GitHub and engaging thousands of developers [15][9] Group 3: SGLang Development - SGLang was developed from the paper "SGLang: Efficient Execution of Structured Language Model Programs," featuring RadixAttention for optimized performance [12] - SGLang retains the KVCache from previous requests to reduce computation during the prefill phase, showing significant performance advantages over traditional inference engines [12] - Although SGLang's community is smaller than vLLM's, it has over 2,000 participants and has shown rapid iteration and growth [13] Group 4: Community Engagement - vLLM has a robust community with over 12,000 participants in issues and pull requests, while SGLang's community is less than half that size [15][13] - Both projects have faced challenges in managing a growing number of issues and pull requests, with vLLM generally responding faster than SGLang [13] Group 5: Performance Metrics and Comparisons - vLLM and SGLang have both integrated advanced features like Continuous Batching and various attention mechanisms, leading to significant performance enhancements [29] - The competition between these two projects has intensified, with both claiming performance leadership in their respective releases [26] Group 6: Future Trends and Developments - The article notes that as the performance race heats up, both vLLM and SGLang are focusing on reproducible methods and real-world metrics rather than just benchmark results [26] - The trend indicates a convergence in model architectures and features among leading inference engines, with a shift in competition towards factors beyond performance [29] Group 7: Investment and Support - Both projects have attracted attention from investment firms and open-source foundations, with vLLM receiving support from a16z and SGLang being recognized in the PyTorch ecosystem [31][40]
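The PagedAttention idea mentioned above can be illustrated with a toy block table: instead of reserving one contiguous KV buffer per request, KV entries are written into fixed-size blocks allocated on demand from a shared pool, much like virtual-memory pages. The block size and data layout below are invented; this is a sketch of the memory-management idea, not vLLM's implementation.

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative)

class PagedKVAllocator:
    """Toy allocator: each sequence owns a block table mapping its
    logical token positions to physical blocks from a shared free pool."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}            # seq_id -> list of physical block ids

    def append_token(self, seq_id: int, position: int):
        """Record where token `position` of sequence `seq_id` lives.
        Returns (physical_block, offset_within_block)."""
        table = self.block_tables.setdefault(seq_id, [])
        if position // BLOCK_SIZE >= len(table):      # need a new "page"
            table.append(self.free_blocks.pop())
        return table[position // BLOCK_SIZE], position % BLOCK_SIZE

    def free_sequence(self, seq_id: int) -> None:
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

alloc = PagedKVAllocator(num_blocks=8)
for pos in range(20):                      # 20 tokens -> spans two blocks
    block, offset = alloc.append_token(seq_id=0, position=pos)
print(alloc.block_tables[0])               # e.g. [7, 6]: non-contiguous physical blocks
alloc.free_sequence(0)
print(len(alloc.free_blocks))              # 8 again: memory reclaimed without fragmentation
```

Because blocks are allocated only as tokens arrive and returned the moment a sequence finishes, many more concurrent requests fit in the same GPU memory, which is the source of the throughput gains described above; RadixAttention adds prefix sharing on top of a similar KV organization.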
The Full Story of SenseTime's Chip Business Spin-Off: A Baidu Founding Member Joins, 1.5 Billion Yuan Raised in Six Months
36氪· 2025-09-19 13:42
Core Viewpoint
- The article discusses the emergence of AI chip startups in China, focusing on the establishment of "曦望" (Sunrise) as a subsidiary of 商汤 (SenseTime) to develop large model inference chips, aiming to reduce inference costs significantly and capitalize on the growing AI chip market [4][7][9].
Company Overview
- "曦望" was formed as part of 商汤's "1+X" strategy, which involves splitting off high-potential but resource-intensive chip development into an independent entity [5][9].
- The company aims to leverage 商汤's five years of experience in chip development to accelerate its growth and market entry [11][13].
Leadership and Team
- 王湛, a former key figure at 百度 (Baidu), has joined "曦望" as co-CEO, bringing extensive experience in managing large teams and product development [5][6].
- The executive team includes 王勇, who has 20 years of chip industry experience, and the team has grown by 50% to nearly 200 members, with many coming from major tech companies [12][13].
Financial Investment and Product Development
- 商汤 has invested over 1.1 billion yuan in chip development over the past five years, with "曦望" raising over 1.5 billion yuan in recent funding rounds [13][14].
- "曦望" has successfully produced two chips: the S1 chip for cloud-edge visual inference and the S2 chip for large model inference, with plans for the S3 chip to reduce inference costs by 90% [14][15][17].
Market Context and Competitive Landscape
- The Chinese AI chip industry is at a pivotal moment, with companies like 寒武纪 (Cambricon) and others gaining significant market traction [9][22].
- The article highlights the importance of timing in entering the AI chip market, suggesting that "曦望" is well-positioned to capitalize on the current market dynamics [24][25].
Strategic Focus and Future Outlook
- "曦望" aims to focus on specific market segments and leverage its relationship with industry capital to ensure successful product commercialization [18][19].
- The company believes that the future of AI chips will hinge on integrated hardware and software capabilities, as well as the ability to predict market trends [25].