Large Model Inference
NetEase Games' Tmax Platform in Practice: A Cloud-Native AI Large-Model Inference Acceleration Architecture Based on Fluid
AI前线· 2026-03-03 04:05
Authors | Liao Haifeng, Zhang Xiang

Background: infrastructure keeps evolving amid the game industry's wave of intelligence. As a leading game developer and operator in China, NetEase Games owns nationally popular titles such as 《梦幻西游》 (Fantasy Westward Journey), 《大话西游》 (Westward Journey Online), and 《蛋仔派对》 (Eggy Party), as well as important service ecosystems such as the game-asset trading platform 藏宝阁 (Cangbaoge). As the product matrix expands and user-experience expectations keep rising, the data types and business scenarios NetEase Games must handle grow increasingly complex and diverse. Meanwhile, large models are profoundly reshaping the game industry: in scenarios such as NPC intelligence, automated plot generation, character motion capture, and game-asset generation, especially in RPG and social games, large models have become a core competitive advantage. To better support the business with generative AI, NetEase Games built the cloud-native Tmax AI machine-learning platform, providing flexible resource scheduling, efficient AI development, and easy hosting of AI services.

Challenge: the "impossible triangle" of large-model inference services. The Tmax platform is built on Kubernetes and integrates Kubeflow, a homegrown scheduler, and the CubeFS file-management system, supporting full AI-lifecycle management from Jupyter interactive development through distributed training to model-inference deployment. However, as large-model inference workloads have exploded, the platform faces severe challenges in resource elasticity, data-access efficiency, and multi-region coordination. In building inference ser ...
OpenAI secures over $100 billion in new investment; storage vendors adjust payment terms as prices surge | Tech Barometer
21 Shi Ji Jing Ji Bao Dao· 2026-02-28 03:07
A roundup by the 21st Century Business Herald New Quality Productivity Research Institute. Good morning, a new day has begun. What interesting things happened in tech over the past 24 hours? Take a look with 21tech.

[Giants' Barometer] OpenAI announces $110 billion in new investment. OpenAI announced it has secured $110 billion in new investment at a pre-money valuation of $730 billion. The round includes $30 billion from SoftBank, $30 billion from NVIDIA, and $50 billion from Amazon. The company also signed a strategic partnership with Amazon and reached a next-generation inference-computing technology agreement with NVIDIA. More financial investors are expected to join as the round proceeds. The new round values the OpenAI Foundation's stake in OpenAI Group at over $180 billion, further cementing its position as one of the best-resourced nonprofits in history and expanding its ability to fund philanthropy in areas such as health breakthroughs and AI resilience.

Insta360 wins final ruling in the US Section 337 investigation. On February 27, Insta360 (影石创新) announced that on February 26 (US time) it learned of the final ruling by the ITC (US International Trade Commission): of the 6 patents at issue, the ITC confirmed that for 3 invention patents asserted by GoPro, the accused products do not infringe and the patent claims are invalid or partially invalid; for 1 invention patent asserted by GoPro, the accused products do not infr ...
Unknown institution: From training to extreme inference, the LPU architecture reshapes the compute foundation [Northeast Securities, Computers], a paradigm shift - 20260228
Unknown institution· 2026-02-28 02:55
From training to extreme inference: the LPU architecture reshapes the compute foundation [Northeast Securities, Computers]

Paradigm shift: the inference-side "low-latency revolution" gives rise to the LPU architecture. As large models enter the large-scale application phase, compute demand is evolving from "brute-force computation" toward "extreme interaction." Traditional GPU architectures often hit high-latency bottlenecks in the decode stage of LLM inference; the LPU (Language Processing Unit) architecture has emerged in response.

Technical core: unlike GPUs, which rely on HBM, LPUs tend to store model parameters directly in large on-chip SRAM, eliminating memory-access latency; they also use static timing scheduling to lock the compute path precisely to clock cycles. This ASIC-style design pursues absolute high throughput and low latency on the inference side.

The supply chain is accelerating its shift to M9-grade and above substrates, whose core requirements are: on the resin side, extremely low-loss specialty resin systems must be used; on the electronic-fabric side, traditional glass fabric's dielectric consist ...
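The SRAM-versus-HBM argument above rests on decode being memory-bandwidth-bound: per-token latency is roughly the bytes of weights streamed divided by memory bandwidth. A back-of-the-envelope sketch (all bandwidth figures below are illustrative assumptions, not vendor specifications):

```python
# Decode-stage latency estimate: LLM decode reads essentially all weights
# once per token, so latency ~ weight_bytes / memory_bandwidth.
# The bandwidth numbers are illustrative assumptions, not vendor specs.

def per_token_latency_ms(params_billion: float, bytes_per_param: float,
                         bandwidth_tb_s: float) -> float:
    """Time to stream all weights once, in milliseconds."""
    bytes_total = params_billion * 1e9 * bytes_per_param
    return bytes_total / (bandwidth_tb_s * 1e12) * 1e3

# A hypothetical 7B model in FP16 (2 bytes per parameter):
hbm = per_token_latency_ms(7, 2, 3.35)    # HBM3-class bandwidth, ~3.35 TB/s
sram = per_token_latency_ms(7, 2, 80.0)   # aggregated on-chip SRAM, assumed ~80 TB/s

print(f"HBM-bound:  {hbm:.2f} ms/token")
print(f"SRAM-bound: {sram:.3f} ms/token")
```

The roughly order-of-magnitude gap in the sketch is what makes on-chip parameter storage attractive for the decode stage, at the cost of ASIC-style inflexibility.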
DeepSeek's new paper previews a new framework for V4: using idle NICs to accelerate agent inference and break the PD-separation bottleneck
36Kr· 2026-02-27 02:29
Core Insights
- A new reasoning framework for agents called DualPath has been introduced, which addresses I/O bottlenecks in long-text reasoning scenarios by optimizing the speed of loading KV-Cache from external storage [1][3].

Group 1: DualPath Framework
- DualPath changes the traditional Storage-to-Prefill loading mode by introducing a second path, Storage-to-Decode, allowing for more efficient data handling [3][6].
- The framework utilizes idle storage network interface card (SNIC) bandwidth on the decoding engine (DE) to read caches and employs the high-speed compute network (RDMA) to transfer data to the prefill engine (PE), achieving global pooling of storage bandwidth and dynamic load balancing [3][13].

Group 2: Performance Improvements
- In tests with a production-level model of 660 billion parameters, DualPath increased offline inference throughput by 1.87x and online service throughput by an average of 1.96x [3][14].
- The framework significantly improves time to first token (TTFT) under high load while keeping token generation speed (TPOT) stable [5][14].

Group 3: Technical Innovations
- DualPath allows KV-Cache to be loaded into the decoding engine first and then transmitted to the prefill engine, alleviating bandwidth pressure on the prefill side [7][9].
- The architecture includes a central scheduler that dynamically allocates tasks based on I/O pressure and computational load, preventing congestion on any single network interface or compute resource [14][18].

Group 4: Research and Development
- The paper's first author, Wu Yongtong, is a PhD student at Peking University focusing on system software and large-model infrastructure, particularly inference-system optimization for large-scale deployment [15][16].
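The scheduling idea described above, routing each KV-Cache load over whichever storage NIC has spare bandwidth, can be sketched as a toy path selector. All names, types, and thresholds here are assumptions for illustration, not the paper's actual interfaces:

```python
# Toy sketch of DualPath-style path selection (names and numbers are
# illustrative assumptions, not the paper's interfaces). The scheduler
# routes a KV-cache load either directly to the prefill engine
# (Storage-to-Prefill) or through the decode engine's idle storage NIC
# and then over RDMA to prefill (Storage-to-Decode).
from dataclasses import dataclass

@dataclass
class NicState:
    capacity_gb_s: float   # total storage-NIC bandwidth
    in_flight_gb_s: float  # bandwidth already committed to other loads

    @property
    def free_gb_s(self) -> float:
        return max(self.capacity_gb_s - self.in_flight_gb_s, 0.0)

def choose_path(pe_snic: NicState, de_snic: NicState) -> str:
    """Pick the path whose storage NIC has more spare bandwidth."""
    if de_snic.free_gb_s > pe_snic.free_gb_s:
        return "storage->decode->rdma->prefill"
    return "storage->prefill"

# Prefill-side NIC nearly saturated, decode-side NIC mostly idle:
path = choose_path(NicState(100, 95), NicState(100, 10))
print(path)  # storage->decode->rdma->prefill
```

A real scheduler would also weigh RDMA-link load and compute pressure, as the article notes, but the core effect is the same: storage bandwidth on both sides is treated as one pooled resource.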
4 cards, 96GB of VRAM, brute-force output! A review of the Intel Arc Pro B60 and the Great Wall Shiheng X-AIGC workstation
Xin Lang Cai Jing· 2026-02-10 12:41
Source: Kuai Keji (MyDrivers) official

1. Foreword: the most cost-effective 96GB/192GB AI inference cards today

Thanks to its long-cultivated CUDA moat, NVIDIA once held "pricing power" in AI, gradually pushing the prices of its GPUs and related products out of ordinary consumers' reach. However, as hardware giant Intel rapidly transforms into a "full-stack AI company," that absolute monopoly is being broken.

As early as 2019, Intel released the oneAPI cross-architecture programming model, which aims to make code portable across CPU, GPU, and NPU. Developers can invoke all of Intel's compute from a single codebase, lowering migration costs. oneAPI also lets developers quickly port code that previously ran only in NVIDIA's CUDA environment to Intel hardware via its migration tool (SYCLomatic), laying a solid software foundation for running mainstream large models on Arc graphics cards.

Last year, Intel released the Intel Arc Pro B60, a professional graphics card based on the second-generation Xe2 (Battlemage) architecture. Core partners such as Maxsun, SPARKLE, and GUNNIR then brought it to the global market, aiming squarely at high-performance AI inference. Like the previously released consumer-grade Intel Arc B580, the Intel Arc Pro B60 ...
Tencent Hunyuan open-sources core AI Infra technology, boosting inference throughput by 30%
Sou Hu Cai Jing· 2026-02-04 12:22
▲ HPC-Ops operator-library architecture diagram

IT Home, February 4 - Tencent's Hunyuan AI Infra team today announced HPC-Ops, an open-source, production-grade library of high-performance core operators for LLM inference. The library is described as built from scratch in CUDA and CuTe around production pain points; through an abstracted engineering architecture, deep microarchitecture adaptation, and instruction-level optimization, it lowers the barrier to low-level operator development and pushes core-operator performance close to the hardware peak.

In real-world scenarios, HPC-Ops lifts Hunyuan model inference QPM by 30% and DeepSeek model QPM by 17%. On single-operator performance, its Attention is up to 2.22x faster than FlashInfer / FlashAttention, its GroupGEMM up to 1.88x faster than DeepGEMM, and its FusedMoE up to 1.49x faster than TensorRT-LLM.

Going forward, HPC-Ops will keep pushing on large-model inference performance: on one hand, it will focus on sparse-attention operators to address the memory and compute bottlenecks of long-context large models; on the other, it will expand richer quantization stra ...

IT Home attaches the HPC-Ops open-source address as follows:
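For readers unfamiliar with the GroupGEMM and FusedMoE operators benchmarked above, the computation they accelerate is a set of per-expert matrix multiplies over router-assigned tokens. A pure-NumPy reference for the math only (sizes and routing here are illustrative; HPC-Ops implements this as fused CUDA kernels):

```python
# What a GroupGEMM / FusedMoE kernel computes: each token is routed to one
# expert, and the per-expert matmuls are batched into grouped GEMMs.
# Pure-NumPy reference for the math only, with illustrative sizes.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_ff, n_experts = 8, 4, 6, 3

x = rng.standard_normal((n_tokens, d_model))
experts = rng.standard_normal((n_experts, d_model, d_ff))  # one weight matrix per expert
route = rng.integers(0, n_experts, size=n_tokens)          # router's expert assignment

# Grouped GEMM: gather each expert's tokens, then one matmul per group.
out = np.empty((n_tokens, d_ff))
for e in range(n_experts):
    idx = np.where(route == e)[0]
    if idx.size:
        out[idx] = x[idx] @ experts[e]

# Reference: token-by-token matmul must match the grouped result.
ref = np.stack([x[t] @ experts[route[t]] for t in range(n_tokens)])
assert np.allclose(out, ref)
print("grouped GEMM matches per-token reference")
```

The fused kernels' advantage comes from doing the gather, the grouped matmuls, and any activation in one pass over memory rather than from changing this math.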
"China's NVIDIA" suddenly plunges! Cambricon drops 14%, market cap falls below 500 billion yuan; a "short essay" on earnings guidance circulates, and the company says many rumors are false
Jin Rong Jie· 2026-02-03 03:42
Today a "short essay" about Cambricon's 2026 earnings guidance circulated on social media. According to National Business Daily, Cambricon responded that it does not know the specific cause of the share-price swing, that many market rumors are false, and that it hopes investors will stay rational.

As the leading A-share tech stock, Cambricon's core investment thesis rests on three pillars. First, accelerating domestic substitution: under geopolitical pressure, demand from domestic cloud providers and internet giants for self-controllable AI chips is growing fast, and Cambricon, as the domestic AI-chip leader, benefits directly. Second, exploding large-model inference demand: the rapid rise of homegrown large models such as DeepSeek is driving strong demand for high-performance AI inference chips. Third, industry leadership: dubbed "China's NVIDIA," Cambricon's accumulated expertise in AI-chip architecture design and hardware-software co-optimization is increasingly showing its value.

On January 31, Cambricon issued an earnings pre-announcement projecting 2025 full-year revenue of 6.0-7.0 billion yuan, up 410.87%-496.02% year on year; net profit after deducting non-recurring items of 1.6-1.9 billion yuan; and net profit attributable to shareholders of 1.85-2.15 billion yuan. The company attributed the sharp revenue growth and return to profitability to continued growth in AI compute demand. In addition, Cambricon's private-placement application was approved by the Shanghai Stock Exch ...
Sunrise (Xi Wang) chairman Xu Bing: take large-model inference to the extreme
Sou Hu Cai Jing· 2026-01-29 11:35
Core Insights
- The core message: whoever masters efficient, controllable, and sustainable inference infrastructure will dominate the pace of AI adoption [3][5].

Group 1: Company Overview
- The company, Xi Wang (Sunrise), is positioned as a leading GPU-chip company focused on inference, aiming to optimize large-model inference [4].
- Xi Wang's mission is to excel at large-model inference as the AI industry transitions from training-driven to inference-driven [4][5].
- The company was established in 2020 as a spin-off from SenseTime's chip division and has accumulated a decade of AI application experience [5][6].

Group 2: Market Trends
- By 2026, inference is projected to account for 66% of AI compute workloads, surpassing training and signaling a structural shift in the industry [4].
- Demand for real-time interaction and complex scenarios, such as 3D and video generation, is driving the need for high-frequency response in AI applications [4][5].

Group 3: Cost Structure and Strategy
- Inference currently represents 70% of AI application costs, making it critical for profitability and commercial success [4][5].
- The company aims to cut inference costs dramatically, from "per unit" to "per fraction," making AI infrastructure as accessible as a utility [4][7].

Group 4: Product Development and Innovation
- Xi Wang has invested 2 billion yuan in R&D over the past eight years, shipping the S1 and S2 chips and recently launching the S3 [7][8].
- The company aims to set a new industry benchmark of "one cent per million tokens" for inference [7][8].

Group 5: Business Model
- The company is not merely a chip seller; it aims to build a comprehensive "chip + system + ecosystem" stack [8][9].
- Xi Wang intends to collaborate with major AI firms and various computing-power providers to optimize existing systems and improve cost efficiency [8][9].

Group 6: Future Vision
- The company envisions becoming the foundational infrastructure for affordable, stable computing power in the AI era, linking technology, policy, and commercial models [9].
- China's AI future is expected to rely on scalable, cost-effective inference infrastructure, marking the domestic AI-chip market's transition from follower to leader [9].
Sunrise releases next-generation inference GPU chip Qiwang S3, cutting per-token inference cost by 90%
Xin Lang Cai Jing· 2026-01-27 11:36
(Source: Zhitong Finance)

On January 27, a Zhitong Finance reporter learned that domestic GPU maker Sunrise (曦望) has released its next-generation inference GPU chip, the Qiwang S3. This is Sunrise's first major public appearance since completing roughly 3 billion yuan in strategic financing over the past year; its chip shipments exceeded 10,000 units in 2025.

The Qiwang S3 is a GPGPU customized for large-model inference. Zhitong Finance learned that its overall cost-performance in typical inference scenarios is more than 10x that of the previous generation. On compute and memory, the chip supports precision switching from FP16 down to FP4 and adopts an LPDDR6 memory solution with 4x the memory capacity. In inference with mainstream large models such as the full-strength DeepSeek V3/R1, per-token cost is about 90% lower than the previous generation. ...
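The "90% lower per-token cost" claim is easy to sanity-check arithmetically. The baseline figure below is a hypothetical assumption chosen only to illustrate the scale of the "one cent per million tokens" target mentioned elsewhere in this digest, not a published price:

```python
# Arithmetic on the claimed 90% per-token cost reduction.
# The baseline price is a hypothetical assumption, not a published figure.
prev_cost_per_m_tokens = 0.10                      # assumed baseline, yuan per million tokens
new_cost = prev_cost_per_m_tokens * (1 - 0.90)     # 90% reduction

tokens_per_yuan_before = 1e6 / prev_cost_per_m_tokens
tokens_per_yuan_after = 1e6 / new_cost

print(f"cost: {new_cost:.3f} yuan per million tokens")
print(f"tokens served per yuan: {tokens_per_yuan_after / tokens_per_yuan_before:.0f}x")
```

Under that assumed baseline, a 90% cut lands at 0.01 yuan (one cent) per million tokens, and the same compute budget serves 10x the tokens.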
GPU startup Sunrise raises 3 billion yuan in a year: born from SenseTime, betting on inference
Guan Cha Zhe Wang· 2026-01-22 13:13
Group 1
- Sunrise has completed nearly 3 billion yuan in financing within a year, with investors including SANY Group's Huaxu Fund, Paradigm Intelligence, and IDG Capital [1].
- The funds will go toward developing the next generation of inference GPUs, large-scale production, and ecosystem building [1].
- Sunrise was established in 2020 as a spin-off from SenseTime's chip division, focusing on high-performance GPUs and multimodal scene-inference chips [1].

Group 2
- The core team has over 200 employees, more than 80% of them technical and R&D staff, many from leading companies such as AMD and Huawei [1].
- Co-CEO Wang Yong has over 20 years of chip-industry experience, having previously worked at AMD and Baidu [2].
- Co-CEO Wang Zhan, a founding member of Baidu, also brings significant experience to the field [2].

Group 3
- Unlike traditional GPU makers such as NVIDIA that pursue "training-inference integration," Sunrise focuses solely on pure inference [4].
- The company aims to cut inference costs by 90% while providing stable service, positioning itself as a transformative player in the AI industry [4].
- Sunrise has invested a total of 2 billion yuan in R&D, building a product matrix of three chip generations [4].

Group 4
- The S1 chip, launched in 2020, is China's first visual-inference chip and has been licensed to major clients such as Sony and Xiaomi [4].
- The S2 chip, set for mass production in 2024, competes with mainstream A/H-series GPUs and is compatible with the CUDA ecosystem [4].
- The S3 chip, expected in 2026, targets major cost and energy-consumption reductions, aiming at a new industry benchmark of "one cent per million tokens" [5].