Large Model Inference
NetEase Games' Tmax Platform in Practice: A Fluid-Based Cloud-Native Acceleration Architecture for Large-Model AI Inference
AI前线 · 2026-03-03 04:05
Author | Liao Haifeng, Zhang Xiang

Background: infrastructure keeps evolving amid the game industry's wave of intelligence. As a leading Chinese game developer and operator, NetEase Games owns national-scale titles such as Fantasy Westward Journey (梦幻西游), Westward Journey (大话西游), and Eggy Party (蛋仔派对), along with important service ecosystems such as the game-asset trading platform Cangbaoge (藏宝阁). As its product matrix expands and user-experience expectations rise, the data types and business scenarios NetEase Games must handle grow ever more complex and diverse. Large models are profoundly reshaping the game industry: in NPC intelligence, automated plot generation, character motion capture, and game-asset generation, especially in RPG and social games, large models have become a core competitive advantage. To better support the business with generative AI, NetEase Games built the cloud-native Tmax AI machine-learning platform, which provides flexible resource scheduling, efficient AI development, and easy hosting of AI services.

Challenge: the "impossible triangle" of large-model inference serving. The Tmax platform is built on Kubernetes and integrates Kubeflow, a homegrown scheduler, and the CubeFS file-management system, supporting full AI-lifecycle management from Jupyter interactive development through distributed training to model-inference deployment. Yet as large-model inference workloads exploded, the platform faced severe challenges in resource elasticity, data-access efficiency, and multi-region coordination. In building inference ser ...
OpenAI Secures Over $100 Billion in New Investment; Storage Vendors Adjust Payment Terms Amid Rapid Price Hikes | Tech Trends
21st Century Business Herald (21世纪经济报道) · 2026-02-28 03:07
A roundup by the 21st Century Business Herald New Quality Productivity Research Institute. Good morning, a new day has begun. What interesting things happened in the tech industry over the past 24 hours? Take a look with 21tech.

[Giant Trends] OpenAI announces $110 billion in new investment. OpenAI announced it has secured $110 billion in new investment at a pre-money valuation of $730 billion. The round includes $30 billion from SoftBank, $30 billion from NVIDIA, and $50 billion from Amazon. The company also signed a strategic partnership with Amazon and reached a next-generation inference-computing technology agreement with NVIDIA. More financial investors are expected to join as the round proceeds. The new financing puts the value of the OpenAI Foundation's stake in OpenAI Group above $180 billion, further cementing its status as one of the best-resourced nonprofits in history and expanding its capacity to fund philanthropy in areas such as health breakthroughs and AI resilience.

Insta360 wins final ruling in the US Section 337 investigation. On February 27, Insta360 (影石创新) announced that on February 26 (US time) it learned of the final ruling by the ITC (US International Trade Commission): of the six patents at issue, the ITC confirmed that for three invention patents asserted by GoPro, the accused products do not infringe and the patent claims are invalid or partially invalid; for one invention patent asserted by GoPro, the accused products do not infr ...
Unknown Institution: From Training to Extreme Inference: LPU Architecture Reshapes the Compute Foundation (Northeast Securities, Computers; Paradigm Shift) - 20260228
Unknown institution · 2026-02-28 02:55
From Training to Extreme Inference: LPU Architecture Reshapes the Compute Foundation [Northeast Securities, Computers]

Paradigm shift: the inference-side "low-latency revolution" gives rise to the LPU architecture. As large models enter the era of large-scale application, compute demand is evolving from "brute-force computation" toward "extreme interactivity." Traditional GPU architectures often hit high-latency bottlenecks in the Decode phase of LLM inference. The LPU (Language Processing Unit) architecture has emerged in response.

Technical core: unlike GPUs, which rely on HBM, LPUs tend to store model parameters directly in large-scale on-chip SRAM, eliminating memory-access latency; they also use static timing scheduling to lock computation paths precisely within clock cycles. This ASIC-style design pursues absolute high throughput and low latency on the inference side.

The supply chain is accelerating its shift to M9-grade and above base materials, whose core standards are: on the resin side, ultra-low-loss specialty resin systems must be used; on the electronic-cloth side, traditional glass cloth's dielectric consis ...
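The memory-bandwidth argument above can be made concrete with a back-of-envelope calculation: in the Decode phase, each generated token requires streaming the full set of model weights from memory, so per-token latency is lower-bounded by parameter bytes divided by memory bandwidth. A minimal sketch; all bandwidth and model-size figures are illustrative assumptions, not specifications of any particular GPU or LPU.

```python
# Back-of-envelope model of decode latency for a memory-bandwidth-bound
# LLM. All numbers below are assumptions for illustration only.

def min_decode_latency_ms(param_bytes: float, bandwidth_gbs: float) -> float:
    """Lower bound on per-token decode latency: every parameter must be
    read from memory once per generated token."""
    return param_bytes / (bandwidth_gbs * 1e9) * 1e3

# A hypothetical 70B-parameter model stored at 1 byte per parameter.
params = 70e9

hbm = min_decode_latency_ms(params, 3350)    # HBM-class GPU, ~3.35 TB/s
sram = min_decode_latency_ms(params, 80000)  # aggregated on-chip SRAM, ~80 TB/s

print(f"HBM-bound:  {hbm:.2f} ms/token")    # ~20.90 ms/token
print(f"SRAM-bound: {sram:.3f} ms/token")   # ~0.875 ms/token
```

The ratio of the two bounds is simply the bandwidth ratio, which is why SRAM-heavy designs target the Decode phase specifically.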
DeepSeek's New Paper Previews a New V4 Framework: Idle NICs Accelerate Agent Inference and Break the PD-Separation Bottleneck
36Kr (36氪) · 2026-02-27 02:29
Core Insights
- A new reasoning framework for agents called DualPath has been introduced, which addresses I/O bottlenecks in long-text reasoning scenarios by optimizing the speed of loading KV-Cache from external storage [1][3].

Group 1: DualPath Framework
- DualPath changes the traditional Storage-to-Prefill loading mode by introducing a second path, Storage-to-Decode, allowing for more efficient data handling [3][6].
- The framework utilizes idle storage network interface card (SNIC) bandwidth on the decoding engine (DE) to read caches and employs the high-speed compute network (RDMA) to transfer data to the prefill engine (PE), achieving global pooling of storage bandwidth and dynamic load balancing [3][13].

Group 2: Performance Improvements
- In tests with a production-level model of 660 billion parameters, DualPath increased offline inference throughput by 1.87x and online service throughput by an average of 1.96x [3][14].
- The framework significantly optimizes time to first token (TTFT) under high load while keeping token generation speed (TPOT) stable [5][14].

Group 3: Technical Innovations
- DualPath allows KV-Cache to be loaded into the decoding engine first and then transmitted to the prefill engine, alleviating bandwidth pressure on the prefill side [7][9].
- The architecture includes a central scheduler that dynamically allocates tasks based on I/O pressure and computational load, preventing congestion on any single network interface or compute resource [14][18].

Group 4: Research and Development
- The paper's first author, Wu Yongtong, is a PhD student at Peking University focusing on system software and large-model infrastructure, particularly inference-system optimization for large-scale deployment [15][16].
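The dual-path idea reduces to a load-balancing decision: route each KV-Cache load to whichever path (direct Storage-to-Prefill, or Storage-to-Decode followed by an RDMA hop) will finish soonest given the I/O already queued on its NIC. A minimal sketch of that scheduling decision; class and method names are hypothetical, not from the DeepSeek paper.

```python
# Sketch of DualPath-style routing: pick the less-loaded of two NIC
# paths for each KV-Cache transfer. Names and numbers are illustrative.

from dataclasses import dataclass

@dataclass
class NicPath:
    name: str
    bandwidth_gbs: float   # sustained NIC bandwidth, GB/s
    queued_gb: float = 0.0 # outstanding bytes already scheduled

    def eta_s(self, size_gb: float) -> float:
        # Estimated time until this path would finish the new transfer.
        return (self.queued_gb + size_gb) / self.bandwidth_gbs

@dataclass
class DualPathScheduler:
    prefill_path: NicPath  # direct Storage-to-Prefill
    decode_path: NicPath   # Storage-to-Decode, then RDMA to prefill

    def route(self, size_gb: float) -> str:
        # Choose the path with the earlier estimated completion time.
        chosen = min((self.prefill_path, self.decode_path),
                     key=lambda p: p.eta_s(size_gb))
        chosen.queued_gb += size_gb
        return chosen.name

sched = DualPathScheduler(NicPath("storage->prefill", 25.0),
                          NicPath("storage->decode", 25.0))
sched.prefill_path.queued_gb = 40.0   # prefill NIC already congested
print(sched.route(4.0))               # prints "storage->decode"
```

The real system adds the second hop's cost and computational load to this estimate; the sketch only shows why pooling both NICs smooths out congestion on the prefill side.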
Four Cards, 96GB of VRAM, Brute-Force Output! Reviewing the Intel Arc Pro B60 and Great Wall Shiheng X-AIGC Workstation
Sina Finance (新浪财经) · 2026-02-10 12:41
Core Viewpoint
- Intel's Arc Pro B60 graphics card is positioned as a cost-effective solution for AI inference, offering significant advantages in memory capacity and performance over NVIDIA's offerings, particularly for large-model inference.

Group 1: Product Overview
- The Arc Pro B60 features a complete BMG-G21 GPU core with 20 Xe2 cores, 2560 FP32 units, and 24GB of GDDR6 memory, double the capacity of its predecessor, the Intel Arc B580 [6][59].
- The card provides 12.28 TFLOPS of FP32 performance and 197 TOPS of INT8 AI performance, with a memory bandwidth of 456GB/s [8][59].
- Compared to NVIDIA's RTX Pro 2000, the Arc Pro B60 offers 50% more memory capacity and bandwidth at a significantly lower price point, making it a competitive option for high-performance AI inference [9][46].

Group 2: Market Positioning
- Intel's transition to a "full-stack AI company" is challenging NVIDIA's long-standing dominance in the GPU market, particularly in AI applications [1][52].
- oneAPI lets developers migrate code from NVIDIA's CUDA environment to Intel hardware, improving the usability of Intel's GPUs for AI tasks [4][55].
- The Arc Pro B60 is highlighted as the most cost-effective way to build the large memory pools (96GB to 192GB) needed to run extensive AI models [9][59].

Group 3: Performance Testing
- In tests with the GPT-OSS-120B model, the Arc Pro B60 handled 100 concurrent requests successfully, indicating robustness for real-time applications [27][50].
- Mean time to first token (TTFT) was 91.37ms, showing strong prefill-phase performance [31][50].
- As concurrency increased, throughput rose significantly, peaking at 701 tokens per second under high load, enough to support up to 1000 simultaneous users [36][40].

Group 4: Competitive Analysis
- Against NVIDIA's RTX Pro 2000, the Arc Pro B60 led in both memory capacity and processing power, delivering roughly 50% better performance in multi-GPU setups [46][49].
- Its large memory capacity lets it run bigger models without extreme quantization, a limitation for NVIDIA's offerings at similar price points [47][49].
- Intel's pricing positions the Arc Pro B60 as a viable alternative for enterprises building high-performance local LLM inference stations at a fraction of the cost of NVIDIA's equivalents [50][51].
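The TTFT and throughput figures in the review above come from timing a streaming response: TTFT is the delay before the first token arrives (dominated by prefill), and throughput is tokens delivered per second over the whole response. A minimal measurement sketch; the streaming generator is a stand-in with assumed timings, not a real inference endpoint.

```python
# Sketch of measuring TTFT and token throughput from a token stream.
# fake_stream simulates an endpoint; timings are assumptions.

import time

def fake_stream(n_tokens: int, prefill_s: float, tpot_s: float):
    """Stand-in for a streaming LLM response."""
    time.sleep(prefill_s)          # prefill phase precedes first token
    for _ in range(n_tokens):
        time.sleep(tpot_s)         # one decode step per token
        yield "tok"

def measure(stream):
    """Return (TTFT seconds, tokens per second) for one response."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream:
        if ttft is None:
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return ttft, count / total

ttft, tps = measure(fake_stream(50, prefill_s=0.09, tpot_s=0.002))
print(f"TTFT {ttft*1e3:.0f} ms, throughput {tps:.0f} tok/s")
```

Under concurrency, the same harness is run across many parallel requests and the per-request TTFTs are averaged while throughputs are summed, which is how figures like "701 tokens per second at high load" are typically aggregated.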
Tencent Open-Sources Hunyuan AI Infra Core Technology, Boosting Inference Throughput by 30%
Sohu Finance (搜狐财经) · 2026-02-04 12:22
▲ HPC-Ops operator library architecture diagram

IT Home, February 4: Tencent's Hunyuan AI Infra team today announced HPC-Ops, an open-source, production-grade, high-performance core operator library for LLM inference. Built from scratch in CUDA and CuTe around production pain points, the library lowers the barrier to low-level operator development through an abstracted engineering architecture, deep microarchitecture adaptation, and instruction-level optimization, pushing core-operator performance close to the hardware peak.

In real-world scenarios, HPC-Ops lifts Hunyuan model inference QPM by 30% and DeepSeek model QPM by 17%. At the single-operator level, its Attention delivers up to 2.22x over FlashInfer / FlashAttention; GroupGEMM up to 1.88x over DeepGEMM; and FusedMoE up to 1.49x over TensorRT-LLM.

In its future roadmap, HPC-Ops will keep pushing large-model inference performance: on one hand, developing sparse Attention operators to address the memory and compute bottlenecks of long-context models; on the other, expanding richer quantization stra ...

IT Home attaches the HPC-Ops open-source address below:
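The gap between the per-operator speedups (up to 2.22x for Attention) and the end-to-end gains (30% QPM) is what Amdahl's law predicts: an operator can only accelerate the fraction of runtime it occupies. A minimal sketch of that relationship; the runtime fractions are illustrative assumptions, not Hunyuan profiling data.

```python
# Amdahl's-law sketch: how a per-operator speedup translates into
# end-to-end gain. The 40% attention fraction below is an assumption.

def end_to_end_speedup(fraction: float, op_speedup: float) -> float:
    """fraction = share of total runtime spent in the accelerated
    operator; the rest of the runtime is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / op_speedup)

# If attention were, say, 40% of inference time and got 2.22x faster:
s = end_to_end_speedup(0.40, 2.22)
print(f"end-to-end speedup: {s:.2f}x")   # prints "end-to-end speedup: 1.28x"
```

Stacking the GroupGEMM and FusedMoE gains over their own runtime fractions is what closes the rest of the distance to a 30% throughput lift.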
"China's NVIDIA" Suddenly Plunges! Cambricon Falls 14%, Market Cap Drops Below 500 Billion Yuan; Guidance "Rumor Essays" Circulate, Company Says Many Rumors Are False
JRJ (金融界) · 2026-02-03 03:42
Today, "rumor essays" about Cambricon's 2026 earnings guidance circulated on social media. According to National Business Daily, Cambricon responded that it does not know the specific reason for the stock's swings, that many market rumors are false, and that it hopes investors stay rational.

As the A-share tech bellwether, Cambricon's core investment thesis rests on three pillars. First, accelerating domestic substitution: amid geopolitical pressures, demand from domestic cloud vendors and internet giants for self-controllable AI chips is growing rapidly, and Cambricon, as the domestic AI-chip leader, benefits directly. Second, exploding large-model inference demand: the rapid rise of homegrown models such as DeepSeek is driving strong demand for high-performance AI inference chips. Third, industry leadership: dubbed "China's NVIDIA," Cambricon's accumulated expertise in AI-chip architecture design and hardware-software co-optimization is steadily proving its value.

On January 31, Cambricon issued a preliminary results announcement, forecasting 2025 full-year revenue of 6.0 to 7.0 billion yuan, up 410.87% to 496.02% year on year; net profit after non-recurring items of 1.6 to 1.9 billion yuan; and net profit attributable to shareholders of 1.85 to 2.15 billion yuan. The company said revenue surged and profits turned positive on continued growth in AI compute demand. In addition, Cambricon's private placement application was approved by the Shanghai Stock Exch ...
Sunrise Chairman Xu Bing: Taking Large-Model Inference to the Extreme
Sohu Finance (搜狐财经) · 2026-01-29 11:35
Core Insights
- Whoever masters efficient, controllable, and sustainable inference infrastructure will dominate the pace of AI deployment [3][5].

Group 1: Company Overview
- The company, Xi Wang (Sunrise), is positioned as a leading GPU chip company focused on inference, aiming to optimize large-model inference [4].
- Its mission is to excel at large-model inference as the AI industry transitions from training-driven to inference-driven [4][5].
- The company was established in 2020, evolving from SenseTime's chip division, and has accumulated a decade of AI application experience [5][6].

Group 2: Market Trends
- By 2026, inference computing power is projected to account for 66% of AI workloads, surpassing training, indicating a structural shift in the industry [4].
- Demand for real-time interaction and complex scenarios such as 3D and video generation is driving the need for high-frequency response in AI applications [4][5].

Group 3: Cost Structure and Strategy
- Inference costs currently represent 70% of AI application expenses, making them critical to profitability and commercial success [4][5].
- The company aims to cut inference costs dramatically, from "per unit" to "per fraction" pricing, making AI infrastructure as accessible as utilities [4][7].

Group 4: Product Development and Innovation
- Xi Wang has invested 2 billion yuan in R&D over the past eight years, producing the S1 and S2 chips, with the S3 chip recently launched [7][8].
- The company aims to set a new industry benchmark of "one cent per million tokens" for inference cost [7][8].

Group 5: Business Model
- The company is not merely a chip seller; it aims to build a comprehensive "chip + system + ecosystem" stack [8][9].
- Xi Wang intends to collaborate with major AI firms and computing-power providers to optimize existing systems and improve cost efficiency [8][9].

Group 6: Future Vision
- The company envisions becoming the foundational infrastructure for affordable, stable computing power in the AI era, linking technology, policy, and commercial models [9].
- China's AI future is expected to rely on scalable, cost-effective inference infrastructure, marking the domestic AI-chip sector's transition from follower to leader [9].
Sunrise Releases Next-Generation Inference GPU Qiwang S3, Cutting Per-Token Inference Cost by 90%
Sina Finance (新浪财经) · 2026-01-27 11:36
(Source: Zhitong Finance)

On January 27, Zhitong Finance learned that domestic GPU maker Sunrise (曦望) has released its next-generation inference GPU, the Qiwang S3. This is Sunrise's first major public appearance after completing roughly 3 billion yuan in strategic financing over the past year. In 2025, Sunrise's chip shipments surpassed 10,000 units.

The Qiwang S3 is a GPGPU customized for large-model inference. Zhitong Finance understands that its overall price-performance in typical inference scenarios improves more than 10x over the previous generation. On compute and memory, the chip supports precision switching from FP16 down to FP4 and adopts an LPDDR6 memory solution, quadrupling memory capacity. In inference with mainstream large models such as the full-strength DeepSeek V3/R1, per-token cost falls roughly 90% versus the previous generation. ...
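Claims like "per-token cost down 90%" and the "one cent per million tokens" benchmark reduce to simple unit economics: amortized hardware-plus-power cost per hour divided by tokens served per hour. A rough sketch; every number here is an assumption for illustration, not Sunrise pricing data.

```python
# Unit-economics sketch behind "cost per million tokens".
# All inputs are illustrative assumptions.

chip_cost_per_hour = 0.50     # amortized hardware + power, USD/hour
tokens_per_second = 3000      # aggregate decode throughput per chip

tokens_per_hour = tokens_per_second * 3600
cost_per_million = chip_cost_per_hour / tokens_per_hour * 1e6
print(f"${cost_per_million:.3f} per million tokens")
```

Under this model, a 90% per-token cost cut requires the product of throughput gain and cost reduction to reach 10x, which is consistent with the claimed 10x price-performance improvement.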
GPU Startup Sunrise Raises 3 Billion Yuan in a Year: Born from SenseTime, Betting on Inference
Guancha (观察者网) · 2026-01-22 13:13
Group 1
- Sunrise has completed nearly 3 billion yuan in financing within a year, with investors including SANY Group's Huaxu Fund, Paradigm Intelligence, and IDG Capital [1]
- The funds raised will be used for developing the next generation of inference GPUs, large-scale production, and ecosystem building [1]
- Sunrise was established in 2020 as a spin-off from SenseTime's chip division, focusing on high-performance GPUs and multimodal scene-inference chips [1]

Group 2
- The core team numbers over 200 employees, more than 80% of them technical and R&D staff, many from leading companies such as AMD and Huawei [1]
- Co-CEO Wang Yong has over 20 years of chip-industry experience, having previously worked at AMD and Baidu [2]
- Co-CEO Wang Zhan, a founding member of Baidu, has significant experience in the field [2]

Group 3
- Unlike traditional GPU makers such as NVIDIA that pursue "training-inference integration," Sunrise focuses solely on "pure inference" [4]
- The company aims to cut inference costs by 90% and provide stable service, positioning itself as a transformative player in the AI industry [4]
- Sunrise has invested a total of 2 billion yuan in R&D, developing a product matrix spanning three generations of chips [4]

Group 4
- The S1 chip, launched in 2020, is China's first visual inference chip and has been licensed to major clients including Sony and Xiaomi [4]
- The S2 chip, set for mass production in 2024, competes with mainstream A/H-series GPUs and is compatible with the CUDA ecosystem [4]
- The S3 chip, expected in 2026, targets major reductions in cost and energy consumption, aiming at a new industry benchmark of "one cent per million tokens" [5]