Large Model Inference
A New Breakthrough in Large Model Infra! Tencent Hunyuan Open-Sources an LLM Inference Operator Library, Lifting Inference Throughput by 30%
量子位· 2026-01-22 11:13
Core Viewpoint
- In the large model race, computational efficiency has become a critical bottleneck for AI applications and development, forcing a shift from merely stacking GPUs to squeezing out efficiency [1][7].

Group 1: HPC-Ops Development
- Tencent's Hunyuan AI Infra team has open-sourced HPC-Ops, a high-performance LLM inference operator library, to address the performance shortfalls of mainstream operator libraries on inference GPUs such as the H20 [2][15].
- HPC-Ops is built from scratch on CUDA and CuTe, with deep architecture-specific adaptations and optimizations that lower the development threshold for core operators while delivering significant performance gains [4][15].

Group 2: Performance Improvements
- With HPC-Ops, inference performance improves by 30% for the Hunyuan model and 17% for the DeepSeek model [5][27].
- Against current SOTA baselines, HPC-Ops reaches up to 2.22x the Attention performance of FlashInfer/FlashAttention, 1.88x the GroupGEMM performance of DeepGEMM, and 1.49x the FusedMoE performance of TensorRT-LLM [6][47].

Group 3: Pain Points of Existing Operator Libraries
- Mainstream operator libraries are costly to adopt, complex in design, and demand deep familiarity with their code, making adaptation difficult for ordinary AI researchers [11].
- Existing state-of-the-art (SOTA) operator libraries often fail to exploit the hardware's full performance potential, particularly on inference cards such as the H20, whose characteristics differ from high-end training cards [8][13].

Group 4: Technical Innovations
- HPC-Ops ships FusedMoE, Attention, and GroupGEMM modules, with optimizations that match task characteristics to hardware capabilities, achieving over 80% of peak hardware bandwidth [20][47].
- The library employs persistent kernels to hide launch overhead and uses novel data-rearrangement techniques to outperform current SOTA implementations [24][28].

Group 5: Future Development Directions
- HPC-Ops plans to develop sparse Attention operators to relieve the memory and compute bottlenecks of long-context large models, and to extend its quantization strategies to mixed precision [50].
- The library will also explore computation-communication overlap to cut communication overhead in distributed inference scenarios, supporting efficient deployment of ultra-large models [51].
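The "over 80% of peak hardware bandwidth" claim is the standard way memory-bound kernels are scored. A minimal sketch of how such a utilization figure is computed; the 64 GB working set, 20 ms kernel time, and 4,000 GB/s peak are hypothetical stand-ins, not HPC-Ops or H20 measurements:

```python
# Illustrative only: how a "fraction of peak bandwidth" figure is
# typically computed for a memory-bound kernel. All numbers below are
# hypothetical assumptions, not HPC-Ops benchmarks.

def bandwidth_utilization(bytes_moved: float, elapsed_s: float,
                          peak_gbps: float) -> float:
    """Achieved bandwidth as a fraction of the hardware peak.

    bytes_moved: total bytes read + written by the kernel
    elapsed_s:   measured kernel execution time in seconds
    peak_gbps:   theoretical peak memory bandwidth in GB/s
    """
    achieved_gbps = bytes_moved / elapsed_s / 1e9
    return achieved_gbps / peak_gbps

# Example: a decode-attention kernel streaming a 64 GB KV cache
# in 20 ms on a card with a 4,000 GB/s HBM peak:
util = bandwidth_utilization(bytes_moved=64e9, elapsed_s=0.020,
                             peak_gbps=4000)
print(f"{util:.0%}")  # 80%
```

The same arithmetic applies to any bandwidth-bound operator: count the bytes the kernel must touch, time it, and divide by the datasheet peak.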
3 Billion Yuan in Financing Poured into Inference Compute! Target: One Cent per Million Tokens!
是说芯语· 2026-01-22 10:21
Core Viewpoint
- The article highlights the nearly 3 billion yuan strategic financing raised by Hangzhou-based GPU company Xiwang, aimed at advancing the development, mass production, and ecosystem of its next-generation inference GPUs and strengthening its position in the competitive AI computing power landscape [1].

Group 1: Financing and Support
- Xiwang secured nearly 3 billion yuan in strategic financing within a year, with participation from industrial capital, leading VC/PE firms, and state-owned enterprises, signaling strong confidence in its technology and commercialization capabilities [1].
- The financing arrangement provides end-to-end support from technology R&D to market expansion, enhancing Xiwang's growth potential [2].

Group 2: Leadership and Team
- Xiwang is led by a dual-CEO team: Wang Yong, a chip-industry veteran, and Wang Zhan, a core member of Baidu's founding team, combining deep chip-development and commercialization experience [3].
- The core team of roughly 300 members, many from companies such as NVIDIA and AMD, averages 15 years of industry experience and holds over 200 core patents, addressing the slow R&D and commercialization that have hampered domestic chip development [5].

Group 3: Differentiation Strategy
- Unlike competitors pursuing combined training-and-inference designs, Xiwang targets inference scenarios with a natively restructured architecture, optimizing key components to cut inference costs substantially [6].
- The company aims to lower the cost of large model inference and make computing power broadly accessible [7].

Group 4: Product Development
- Xiwang has built a three-generation chip matrix: the S1, launched in 2020 for visual inference; the S2, in 2024; and the upcoming S3, targeting a new benchmark of "one cent per million tokens" [8].
- The S3 is designed for low-precision inference, sharply reducing cost and energy consumption, which could redefine the cost structure of AI commercialization [8].

Group 5: Ecosystem Collaboration
- Xiwang emphasizes ecosystem collaboration over zero-sum competition, positioning itself as a "cost optimization layer" within existing computing systems and fostering partnerships with domestic chip manufacturers [10].
- This strategy aims to create a virtuous cycle of broader application, refined technology, and lower costs, ultimately strengthening domestic computing power overall [10].
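The "one cent per million tokens" target implies a throughput-versus-cost relationship that is easy to sketch. A back-of-the-envelope Python model; the 2-yuan hourly cost and the derived throughput are purely hypothetical assumptions, not published S3 specifications:

```python
# Back-of-the-envelope check of a "one cent per million tokens" target.
# All figures are hypothetical assumptions, not vendor specs.

def cost_per_million_tokens(tokens_per_sec: float,
                            hourly_cost_yuan: float) -> float:
    """Serving cost in yuan per million tokens for one accelerator."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_cost_yuan / tokens_per_hour * 1e6

# To hit 0.01 yuan (one fen) per million tokens at an assumed 2-yuan/hour
# all-in cost, one card must sustain roughly 55,556 tokens/s:
required = 2 / 0.01 * 1e6 / 3600
print(f"{required:,.0f} tokens/s")  # 55,556 tokens/s
cost = cost_per_million_tokens(tokens_per_sec=required, hourly_cost_yuan=2)
print(f"{cost:.2f} yuan per 1M tokens")  # 0.01
```

The point of the sketch: the target is as much about amortized hardware and energy cost per hour as it is about raw throughput.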
Exclusive | Xiwang Completes Nearly 3 Billion Yuan Strategic Financing, All-in on Inference GPUs
投中网· 2026-01-22 07:07
Core Insights
- The article frames the AI industry's transition from "training dividends" to "inference dividends," evidenced by the significant investment in Xiwang, which focuses on inference GPU chips [2][4].

Group 1: Investment and Market Trends
- Xiwang has completed nearly 3 billion yuan in strategic financing within a year, with investments from various industry players and well-known VC/PE institutions [2].
- The funding will primarily go to developing the next generation of inference GPUs, mass production, and ecosystem building [2].

Group 2: Company Background and Team
- Founded in 2020 out of SenseTime's chip department, Xiwang has a deep understanding of model evolution, operator optimization, and customer needs [4].
- The core team consists of seasoned professionals from companies such as AMD, Baidu, and SenseTime, averaging 15 years in the industry [4].

Group 3: Product Development and Innovation
- Xiwang has developed three generations of chips, focusing on reducing inference costs and improving efficiency rather than competing on general-purpose GPU parameters [6].
- The company has invested 2 billion yuan in R&D over the past few years and holds over 200 core patents [6].

Group 4: Strategic Positioning
- The company aims to cut inference costs significantly while providing stable service, positioning itself as a "profit-and-loss optimizer" for the AI industry [8].
- Its focus on real-world economic metrics rather than headline technical specifications differentiates it from other domestic chip manufacturers [8].

Group 5: Future Goals and Industry Impact
- The goal is to drastically lower the cost of large model inference and widen access, thereby unlocking the full potential of AGI [9].
- Xiwang's rise signals a shift in the domestic AI chip landscape from "catching up" to "leading through differentiation" [9].
Surging Nearly 28%! One Remark from Jensen Huang Ignites Storage Stocks; Institutions See the Storage Supercycle Running Through 2027
Jin Rong Jie· 2026-01-07 00:49
The memory-chip market is being swept by an unprecedented boom. In overnight US trading, SanDisk surged nearly 28%, hard-drive makers Seagate and Western Digital both rose more than 14%, Micron jumped 10%, and the entire sector spiked.

Nomura said this storage supercycle, which began in the second half of this year, will last at least through 2027, with meaningful new supply arriving no earlier than early 2028. Nomura's analyst team advised investors to stay overweight the storage leaders in 2026, treating the memory-chip price-profit-valuation triple play as the main storage investment theme for 2026 rather than viewing storage as a single HBM story; the firm expects the three major memory makers (Samsung Electronics, SK Hynix, Micron) to post record earnings.

These remarks are not groundless: behind them is the explosive growth in data-storage demand from AI, and from large model inference in particular. At CES, Jensen Huang detailed a key innovation, a "context memory architecture" that integrates high-speed storage (the KV cache) directly into the GPU rack. As model context lengths move from the hundred-thousand level toward the hundred-million level, the required storage grows geometrically. Legacy storage architectures are overwhelmed, and moving storage closer to the GPU aims to relieve the network congestion caused by constantly shuttling massive volumes of data. Huang called the idea "completely revolutionary"; it ...
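The storage pressure behind these remarks follows directly from how the KV cache scales: its size grows linearly with context length, with a per-token footprint fixed by the model's shape. A hedged Python sketch; the model dimensions below are hypothetical, not those of any specific model:

```python
# Why context length explodes storage demand: the KV cache stores keys
# and values for every layer and every token of a sequence. The model
# shape used here (60 layers, 8 KV heads of dim 128, FP16) is a
# hypothetical mid-sized configuration, not a real model's spec.

def kv_cache_bytes(context_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size for one sequence: K and V tensors across all layers."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return context_len * per_token

for ctx in (128_000, 1_000_000, 100_000_000):
    gb = kv_cache_bytes(ctx, n_layers=60, n_kv_heads=8, head_dim=128) / 1e9
    print(f"{ctx:>11,} tokens -> {gb:,.1f} GB")
```

For this hypothetical shape, a single 100-million-token context needs tens of terabytes of KV cache, which is why keeping that data close to the GPU, rather than streaming it over the network, becomes the bottleneck.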
Tian Yuandong's 2025 Year-End Summary: On the Layoff and Research Directions for 2026
自动驾驶之心· 2026-01-06 00:28
Core Insights
- The article recounts the complexities and challenges the author faced in project management and personal career decisions, particularly in AI and machine learning research [3][4][5].

Group 1: Project Management and Challenges
- The author came under significant pressure when asked to help rescue the Llama 4 project, a complex decision that meant weighing potential outcomes against personal integrity [3].
- Despite the challenges, his team made progress in core areas of reinforcement learning, including training stability and model architecture design, which shifted his research perspective [3].

Group 2: Career Decisions and Transitions
- After more than a decade at the company, he had contemplated leaving, weighing economic and personal factors, but ultimately decided to stay, reflecting the difficulty of such transitions [4].
- Navigating the workplace's ups and downs supplied valuable material for future creative writing, blending professional and personal growth [5].

Group 3: Research Directions
- Two main research directions are highlighted: large model inference and opening the "black box" of models, a line of work that gained traction after the release of his continuous latent space reasoning research [6].
- Efforts to improve inference efficiency include innovative approaches such as discrete tokens and parallel reasoning chains, which have shown promising reductions in computational cost while enhancing performance [7].

Group 4: Interpretability and Future Directions
- The author stresses the importance of interpretability in AI, arguing that understanding how AI systems work is crucial for ensuring the ethical and effective use of the technology [10].
- Current efforts to demystify model training are still in their early stages, focused on deriving guiding principles from first principles for future AI model design [11].
Tian Yuandong's 2025 Year-End Summary: Firefighting Llama 4, Then Laid Off; Now Co-founder of a Stealth Startup
机器之心· 2026-01-04 08:05
Core Insights
- The article covers the experiences and reflections of a prominent AI researcher, including the impact of the layoffs at Meta and his future work plans [1][2][3].

Group 1: Layoffs and Career Reflections
- The researcher was pulled into the Llama 4 project at a critical moment and faced complex decision-making under pressure, which deepened his understanding of societal dynamics [4].
- After more than a decade at Meta, he had contemplated leaving but ultimately decided to stay until the company made the decision for him, which supplied new material for his creative writing [5].
- Following the layoffs, he received numerous job offers but chose to become a co-founder of a new startup, marking a shift toward entrepreneurship [6].

Group 2: Research Directions for 2025
- The main research directions include large model inference and opening the "black box" of models, with a focus on improving training efficiency and interpretability [7][8].
- His team has made significant contributions to the field, including theoretical analyses and practical techniques that enhance model performance and efficiency [8][9].

Group 3: Importance of Interpretability
- The article emphasizes the critical need for interpretability in AI, arguing that understanding how models work is essential for trust and effective deployment [11][12].
- Explaining model behavior from first principles remains challenging, calling for deeper insight into the emergent structures and training dynamics of AI models [12].

Group 4: Future of Work and AI Integration
- AI's integration into the workforce is transforming traditional roles, shifting value from human experience toward the ability to enhance AI capabilities [20][23].
- The article presents two potential scenarios: one where AI achieves superintelligence and one where traditional scaling methods fail; both underscore the necessity of interpretability [21][23].

Group 5: The Role of Independent Thinking
- The future landscape will require individuals to maintain independent thought and creativity, as reliance on AI-generated content may erode original thinking [29][30].
- The transition from employee to entrepreneur or founder roles is emphasized, with clear goals as the driver of proactive thinking and innovation [31][33].
LeCun Reveals Meta's Benchmark Gaming; Tian Yuandong: I Didn't Expect This Ending
量子位· 2026-01-04 05:21
Core Viewpoint
- The article examines the fallout from Meta's Llama 4 release, highlighting internal conflicts and the departures of key figures such as LeCun and Tian Yuandong, who are now pursuing entrepreneurial ventures out of dissatisfaction with Meta's direction in AI development [1][3][22].

Group 1: Llama 4 and Internal Conflicts
- Llama 4 faced significant criticism and allegations of cheating in benchmark tests, eroding the confidence of Meta's leadership [1][10].
- The release of DeepSeek, a competing AI model, pressured Meta to accelerate its AI investments, causing internal turmoil and shifting team dynamics [4][6].
- Communication within the team broke down amid conflicting priorities: LeCun's team wanted to innovate while leadership preferred proven technologies [7][8].

Group 2: Departures and New Ventures
- Both LeCun and Tian Yuandong announced plans to start new companies after leaving Meta, with LeCun focusing on world models and Tian Yuandong on new AI initiatives [27][33].
- LeCun's new venture, Advanced Machine Intelligence (AMI), aims to explore advanced machine intelligence through open-source projects, with LeCun serving as executive chairman [27][30].
- Tian Yuandong's move to co-found a startup reflects a broader trend of former Meta employees seeking opportunities outside the company [33].

Group 3: Future Directions in AI
- LeCun's focus on the V-JEPA architecture aims to enhance AI's understanding of the physical world through video and spatial data, with significant progress expected within 12 months [32].
- The article echoes LeCun's critique of the current fixation on large language models, emphasizing the need for AI to move beyond the limitations of language [25][26].
Capital Online Conference Call, 2025-12-30
2025-12-31 16:02
Summary
- Capital Online benefits from the growth of large model inference applications and the resulting rise in edge-cloud demand; the company is building a platform compatible with both domestic chips and NVIDIA, supporting large-model companies such as MiniMax going overseas, with growing customer reliance.
- New business based on MaaS services and the ComfyUI platform is growing 20%-30% per month and is expected to significantly lift the company's gross margin and revenue.
- Capital Online transformed early, building deep partnerships with companies such as Zhipu, accumulating advantages in technology investment and user understanding, and offering global resource matching to meet customers' domestic and overseas deployment needs.
- Through its "iron triangle" strategy (sales, product and technical solutions, key-account service), the company cultivates large customers and provides domestic-equipment adaptation services to ensure compliance and business continuity.
- The company is proactively entering frontier areas, including space-related computing: it has built a computing center in Wenchang, set up a computing center in Qingyang supporting Jiuquan satellite launches, and is exploring industry applications such as AI-generated short dramas.
- The company expands assets and partnerships incrementally based on customer demand, with eight key domestic nodes including Qingyang, Huailai, Wuhu, and Hainan, and overseas sites in Dallas and Singapore.
- Through MaaS services and ComfyUI, Capital Online partners with model vendors to promote its business at higher gross margins, and achieves strong regional profitability with the help of government subsidies and policy support.

Q&A
How have recent large-model launches and AI application growth affected Capital ...
2025 Industry Insight Report on Large Model Inference Optimization and Deployment Practice - Cloud Computing Open Source Industry Alliance
Sou Hu Cai Jing· 2025-12-25 02:34
Group 1
- The core point of the report: the large model industry has transitioned from "model innovation" into a critical period of "scale implementation," where inference optimization and efficient deployment have become core competitive advantages, driving rapid market growth [1][13]
- The global AI inference computing power market is expected to grow nearly tenfold from 2021 to 2024, reaching USD 13.958 billion in 2024 and projected to increase to USD 18.355 billion in 2025; the Chinese market is growing even faster, expected to reach CNY 43.83 billion by 2025 at a compound annual growth rate of 66.3% [1][39][43]
- The competitive landscape is diverse: Tianyi Cloud, Alibaba Cloud, and Huawei Cloud lead the domestic market while Amazon, Google, and Microsoft dominate internationally; token-based billing has become mainstream, and the model-as-a-service (MaaS) business model is rapidly gaining popularity [1][39]

Group 2
- Deployment forms have diversified to meet different scenario needs, with four main methods emerging: MaaS, integrated inference appliances, private deployment platforms, and cloud-edge-end collaborative inference [2][59]
- Full-stack optimization has become the core enabler, breaking through performance bottlenecks across hardware adaptation, inference engines, the model layer, and parallel computing [2][3]
- The industry faces multiple challenges, including high costs, lack of standards, talent shortages, fragmented ecosystems, and complex security compliance; the report recommends accelerating a technical standards system and fostering collaborative innovation mechanisms [3][14]

Group 3
- Industry applications are taking root, with notable practical results: CITIC Securities has processed over 200 million service requests through an inference acceleration engine, and a robotics company has achieved an 80% efficiency improvement in private deployment [3][14]
- Inference workloads are projected to reach 70.5% of China's AI computing power by 2026, marking the shift in focus from training to inference [43][47]
- Deployment preferences for large model inference platforms are expected to shift markedly by 2027, with public cloud deployment rising from 49% to 58% and private cloud deployment from 16% to 26% [58][59]
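Growth figures like these can be cross-checked with the standard compound-annual-growth-rate formula. A minimal sketch in Python; the inputs echo the report's cited numbers, and the function name is illustrative:

```python
# Cross-checking market growth figures with the standard CAGR formula:
# CAGR = (end / start) ** (1 / years) - 1

def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate between two values `years` apart."""
    return (end / start) ** (1 / years) - 1

# "Nearly tenfold from 2021 to 2024" (3 compounding years) implies
# roughly this annual rate:
print(f"{cagr(1, 10, 3):.1%}")  # 115.4%

# Forward growth from USD 13.958B (2024) to USD 18.355B (2025):
print(f"{cagr(13.958, 18.355, 1):.1%}")
```

Note the one-year growth to 2025 is far below the historical rate, consistent with the report's framing of the Chinese market, not the global one, as the faster grower.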
Domestic Computing Power Enters the "Ten-Thousand-Card" Era: Moore Threads Releases a New-Generation GPU Architecture, Sugon Launches a Ten-Thousand-Card Supercluster
Jing Ji Guan Cha Wang· 2025-12-20 06:47
Core Insights
- The article covers advances in the domestic GPU industry: Moore Threads' launch of its "Huagang" architecture and Sugon's "scaleX" supercluster system, signaling a shift in focus from individual GPU performance toward building scalable systems capable of massive computational workloads [2][6].

Group 1: Moore Threads Developments
- Moore Threads unveiled its latest "Huagang" architecture, claiming a 50% increase in computing density and a 10-fold improvement in efficiency over the previous generation [3].
- "Huagang" supports full-precision computation from FP4 to FP64 and introduces new support for MTFP6, MTFP4, and mixed low precision [3].
- Future chip plans include "Huashan," aimed at AI training and inference, and "Lushan," focused on high-performance graphics rendering; "Lushan" claims a 64-fold increase in AI compute performance and a 50% improvement in ray-tracing performance [4].

Group 2: Sugon Developments
- Sugon's "scaleX" supercluster system, in its public debut, links 16 scaleX640 supernodes over the scaleFabric high-speed network and can deploy 10,240 AI accelerator cards [10].
- The scaleX system uses immersion phase-change liquid cooling to address heat dissipation, achieving a 20-fold increase in per-rack computing density and a PUE (Power Usage Effectiveness) of 1.04 [11][12].
- The system supports accelerator cards from multiple brands and has optimized compatibility with over 400 mainstream large models, reflecting a strategy of providing a versatile platform for diverse domestic computing resources [14].

Group 3: Industry Challenges and Solutions
- Scaling up computing power runs into heat, power-supply, and physical-space limits when thousands of high-power chips are deployed in data centers [8][9].
- Both companies are attacking communication delays in distributed computing: Moore Threads pairs a new asynchronous programming model with self-developed MTLink technology to support clusters exceeding 100,000 cards, while Sugon's scaleFabric network reaches 400 Gb/s bandwidth with sub-microsecond communication latency [12][13].

Group 4: Software Ecosystem and Compatibility
- With hardware specifications approaching international standards, the focus is shifting to the software stack: Moore Threads announced an upgrade to its MUSA unified architecture, with core computing libraries exceeding 98% efficiency [13].
- Sugon emphasizes compatibility with accelerator cards from various brands, promoting an open-architecture strategy that allows multiple chips to coexist [14].
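PUE, the efficiency metric behind the 1.04 figure cited for the liquid-cooled supercluster, is simply total facility power divided by IT equipment power. A minimal illustration; the 10 MW load is a hypothetical assumption, not a vendor specification:

```python
# PUE (Power Usage Effectiveness) = total facility power / IT power.
# A PUE of 1.04 means only 4% overhead goes to cooling and power
# delivery. The 10 MW load below is a hypothetical illustration,
# not a published spec.

def pue(total_facility_kw: float, it_kw: float) -> float:
    """Ratio of total facility power draw to IT equipment power draw."""
    return total_facility_kw / it_kw

it_load = 10_000  # kW of IT equipment (hypothetical)
print(pue(10_400, it_load))  # 1.04

# Versus a conventional air-cooled facility at PUE 1.5, the overhead
# saved at this load:
print(f"{(1.5 - 1.04) * it_load:.0f} kW saved")  # 4600 kW saved
```

The closer PUE gets to 1.0, the less energy is spent on anything other than computation, which is why dense liquid-cooled racks report it prominently.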