Large Model Inference
NetEase Games' Tmax Platform in Practice: A Fluid-Based Cloud-Native Acceleration Architecture for Large-Model AI Inference
AI前线 · 2026-03-03 04:05
Author | Liao Haifeng, Zhang Xiang

Background: infrastructure keeps evolving amid the game industry's wave of intelligence. As a leading Chinese game developer and operator, NetEase Games owns national-scale titles such as Fantasy Westward Journey (梦幻西游), Westward Journey (大话西游), and Eggy Party (蛋仔派对), along with important service ecosystems such as the game-asset trading platform Cangbaoge (藏宝阁). As its product matrix expands and user-experience expectations rise, the data types and business scenarios NetEase Games must handle grow ever more complex and diverse. Large models are profoundly reshaping the game industry: in NPC intelligence, automated plot generation, character motion capture, and game-asset generation, especially in RPG and social games, large models have become a core competitive advantage. To better support the business with generative AI, NetEase Games built the cloud-native Tmax AI machine-learning platform, which provides flexible resource scheduling, efficient AI development, and easy hosting of AI services.

Challenge: the "impossible triangle" of large-model inference serving. The Tmax platform is built on Kubernetes and integrates Kubeflow, a homegrown scheduler, and the CubeFS file-management system, supporting full AI-lifecycle management from Jupyter interactive development through distributed training to model-inference deployment. Yet as large-model inference workloads exploded, the platform faced severe challenges in resource elasticity, data-access efficiency, and multi-region coordination. In building inference ser ...
OpenAI Secures Over $100 Billion in New Investment; Storage Vendors Adjust Payment Terms Amid Rapid Price Hikes | Tech Trends
21st Century Business Herald (21世纪经济报道) · 2026-02-28 03:07
A roundup by the 21st Century Business Herald New Quality Productivity Research Institute. Good morning, a new day has begun. What interesting things happened in the tech industry over the past 24 hours? Take a look with 21tech.

[Giant Trends] OpenAI announces $110 billion in new investment. OpenAI announced it has secured $110 billion in new investment at a pre-money valuation of $730 billion. The round includes $30 billion from SoftBank, $30 billion from NVIDIA, and $50 billion from Amazon. The company also signed a strategic partnership with Amazon and reached a next-generation inference-computing technology agreement with NVIDIA. More financial investors are expected to join as the round proceeds. The new financing puts the value of the OpenAI Foundation's stake in OpenAI Group above $180 billion, further cementing its status as one of the best-resourced nonprofits in history and expanding its capacity to fund philanthropy in areas such as health breakthroughs and AI resilience.

Insta360 wins final ruling in the US Section 337 investigation. On February 27, Insta360 (影石创新) announced that on February 26 (US time) it learned of the final ruling by the ITC (US International Trade Commission): of the six patents at issue, the ITC confirmed that for three invention patents asserted by GoPro, the accused products do not infringe and the patent claims are invalid or partially invalid; for one invention patent asserted by GoPro, the accused products do not infr ...
Unknown Institution: From Training to Extreme Inference: LPU Architecture Reshapes the Compute Foundation (Northeast Securities, Computers; Paradigm Shift) - 20260228
Unknown institution · 2026-02-28 02:55
From Training to Extreme Inference: LPU Architecture Reshapes the Compute Foundation [Northeast Securities, Computers]

Paradigm shift: the inference-side "low-latency revolution" gives rise to the LPU architecture. As large models enter the era of large-scale application, compute demand is evolving from "brute-force computation" toward "extreme interactivity." Traditional GPU architectures often hit high-latency bottlenecks in the Decode phase of LLM inference. The LPU (Language Processing Unit) architecture has emerged in response.

Technical core: unlike GPUs, which rely on HBM, LPUs tend to store model parameters directly in large-scale on-chip SRAM, eliminating memory-access latency; they also use static timing scheduling to lock computation paths precisely within clock cycles. This ASIC-style design pursues absolute high throughput and low latency on the inference side.

The supply chain is accelerating its shift to M9-grade and above base materials, whose core standards are: on the resin side, ultra-low-loss specialty resin systems must be used; on the electronic-cloth side, traditional glass cloth's dielectric consis ...
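The memory-bandwidth argument above can be made concrete with a back-of-envelope calculation: in the Decode phase, each generated token requires streaming the full set of model weights from memory, so per-token latency is lower-bounded by parameter bytes divided by memory bandwidth. A minimal sketch; all bandwidth and model-size figures are illustrative assumptions, not specifications of any particular GPU or LPU.

```python
# Back-of-envelope model of decode latency for a memory-bandwidth-bound
# LLM. All numbers below are assumptions for illustration only.

def min_decode_latency_ms(param_bytes: float, bandwidth_gbs: float) -> float:
    """Lower bound on per-token decode latency: every parameter must be
    read from memory once per generated token."""
    return param_bytes / (bandwidth_gbs * 1e9) * 1e3

# A hypothetical 70B-parameter model stored at 1 byte per parameter.
params = 70e9

hbm = min_decode_latency_ms(params, 3350)    # HBM-class GPU, ~3.35 TB/s
sram = min_decode_latency_ms(params, 80000)  # aggregated on-chip SRAM, ~80 TB/s

print(f"HBM-bound:  {hbm:.2f} ms/token")    # ~20.90 ms/token
print(f"SRAM-bound: {sram:.3f} ms/token")   # ~0.875 ms/token
```

The ratio of the two bounds is simply the bandwidth ratio, which is why SRAM-heavy designs target the Decode phase specifically.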
DeepSeek's New Paper Previews a New V4 Framework: Idle NICs Accelerate Agent Inference and Break the PD-Separation Bottleneck
36Kr (36氪) · 2026-02-27 02:29
Core Insights
- A new reasoning framework for agents called DualPath has been introduced, which addresses I/O bottlenecks in long-text reasoning scenarios by optimizing the speed of loading KV-Cache from external storage [1][3].

Group 1: DualPath Framework
- DualPath changes the traditional Storage-to-Prefill loading mode by introducing a second path, Storage-to-Decode, allowing for more efficient data handling [3][6].
- The framework utilizes idle storage network interface card (SNIC) bandwidth on the decoding engine (DE) to read caches and employs the high-speed compute network (RDMA) to transfer data to the prefill engine (PE), achieving global pooling of storage bandwidth and dynamic load balancing [3][13].

Group 2: Performance Improvements
- In tests with a production-level model of 660 billion parameters, DualPath increased offline inference throughput by 1.87x and online service throughput by an average of 1.96x [3][14].
- The framework significantly optimizes time to first token (TTFT) under high load while keeping token generation speed (TPOT) stable [5][14].

Group 3: Technical Innovations
- DualPath allows KV-Cache to be loaded into the decoding engine first and then transmitted to the prefill engine, alleviating bandwidth pressure on the prefill side [7][9].
- The architecture includes a central scheduler that dynamically allocates tasks based on I/O pressure and computational load, preventing congestion on any single network interface or compute resource [14][18].

Group 4: Research and Development
- The paper's first author, Wu Yongtong, is a PhD student at Peking University focusing on system software and large-model infrastructure, particularly inference-system optimization for large-scale deployment [15][16].
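The dual-path idea reduces to a load-balancing decision: route each KV-Cache load to whichever path (direct Storage-to-Prefill, or Storage-to-Decode followed by an RDMA hop) will finish soonest given the I/O already queued on its NIC. A minimal sketch of that scheduling decision; class and method names are hypothetical, not from the DeepSeek paper.

```python
# Sketch of DualPath-style routing: pick the less-loaded of two NIC
# paths for each KV-Cache transfer. Names and numbers are illustrative.

from dataclasses import dataclass

@dataclass
class NicPath:
    name: str
    bandwidth_gbs: float   # sustained NIC bandwidth, GB/s
    queued_gb: float = 0.0 # outstanding bytes already scheduled

    def eta_s(self, size_gb: float) -> float:
        # Estimated time until this path would finish the new transfer.
        return (self.queued_gb + size_gb) / self.bandwidth_gbs

@dataclass
class DualPathScheduler:
    prefill_path: NicPath  # direct Storage-to-Prefill
    decode_path: NicPath   # Storage-to-Decode, then RDMA to prefill

    def route(self, size_gb: float) -> str:
        # Choose the path with the earlier estimated completion time.
        chosen = min((self.prefill_path, self.decode_path),
                     key=lambda p: p.eta_s(size_gb))
        chosen.queued_gb += size_gb
        return chosen.name

sched = DualPathScheduler(NicPath("storage->prefill", 25.0),
                          NicPath("storage->decode", 25.0))
sched.prefill_path.queued_gb = 40.0   # prefill NIC already congested
print(sched.route(4.0))               # prints "storage->decode"
```

The real system adds the second hop's cost and computational load to this estimate; the sketch only shows why pooling both NICs smooths out congestion on the prefill side.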
Four Cards, 96GB of VRAM, Brute-Force Output! Reviewing the Intel Arc Pro B60 and Great Wall Shiheng X-AIGC Workstation
Sina Finance (新浪财经) · 2026-02-10 12:41
Core Viewpoint
- Intel's Arc Pro B60 graphics card is positioned as a cost-effective solution for AI inference, offering significant advantages in memory capacity and performance over NVIDIA's offerings, particularly for large-model inference.

Group 1: Product Overview
- The Arc Pro B60 features a complete BMG-G21 GPU core with 20 Xe2 cores, 2560 FP32 units, and 24GB of GDDR6 memory, double the capacity of its predecessor, the Intel Arc B580 [6][59].
- The card provides 12.28 TFLOPS of FP32 performance and 197 TOPS of INT8 AI performance, with a memory bandwidth of 456GB/s [8][59].
- Compared to NVIDIA's RTX Pro 2000, the Arc Pro B60 offers 50% more memory capacity and bandwidth at a significantly lower price point, making it a competitive option for high-performance AI inference [9][46].

Group 2: Market Positioning
- Intel's transition to a "full-stack AI company" is challenging NVIDIA's long-standing dominance in the GPU market, particularly in AI applications [1][52].
- oneAPI lets developers migrate code from NVIDIA's CUDA environment to Intel hardware, improving the usability of Intel's GPUs for AI tasks [4][55].
- The Arc Pro B60 is highlighted as the most cost-effective way to build the large memory pools (96GB to 192GB) needed to run extensive AI models [9][59].

Group 3: Performance Testing
- In tests with the GPT-OSS-120B model, the Arc Pro B60 handled 100 concurrent requests successfully, indicating robustness for real-time applications [27][50].
- Mean time to first token (TTFT) was 91.37ms, showing strong prefill-phase performance [31][50].
- As concurrency increased, throughput rose significantly, peaking at 701 tokens per second under high load, enough to support up to 1000 simultaneous users [36][40].

Group 4: Competitive Analysis
- Against NVIDIA's RTX Pro 2000, the Arc Pro B60 led in both memory capacity and processing power, delivering roughly 50% better performance in multi-GPU setups [46][49].
- Its large memory capacity lets it run bigger models without extreme quantization, a limitation for NVIDIA's offerings at similar price points [47][49].
- Intel's pricing positions the Arc Pro B60 as a viable alternative for enterprises building high-performance local LLM inference stations at a fraction of the cost of NVIDIA's equivalents [50][51].
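The TTFT and throughput figures in the review above come from timing a streaming response: TTFT is the delay before the first token arrives (dominated by prefill), and throughput is tokens delivered per second over the whole response. A minimal measurement sketch; the streaming generator is a stand-in with assumed timings, not a real inference endpoint.

```python
# Sketch of measuring TTFT and token throughput from a token stream.
# fake_stream simulates an endpoint; timings are assumptions.

import time

def fake_stream(n_tokens: int, prefill_s: float, tpot_s: float):
    """Stand-in for a streaming LLM response."""
    time.sleep(prefill_s)          # prefill phase precedes first token
    for _ in range(n_tokens):
        time.sleep(tpot_s)         # one decode step per token
        yield "tok"

def measure(stream):
    """Return (TTFT seconds, tokens per second) for one response."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream:
        if ttft is None:
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return ttft, count / total

ttft, tps = measure(fake_stream(50, prefill_s=0.09, tpot_s=0.002))
print(f"TTFT {ttft*1e3:.0f} ms, throughput {tps:.0f} tok/s")
```

Under concurrency, the same harness is run across many parallel requests and the per-request TTFTs are averaged while throughputs are summed, which is how figures like "701 tokens per second at high load" are typically aggregated.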
Tencent Open-Sources Hunyuan AI Infra Core Technology, Boosting Inference Throughput by 30%
Sohu Finance (搜狐财经) · 2026-02-04 12:22
▲ HPC-Ops operator library architecture diagram

IT Home, February 4: Tencent's Hunyuan AI Infra team today announced HPC-Ops, an open-source, production-grade, high-performance core operator library for LLM inference. Built from scratch in CUDA and CuTe around production pain points, the library lowers the barrier to low-level operator development through an abstracted engineering architecture, deep microarchitecture adaptation, and instruction-level optimization, pushing core-operator performance close to the hardware peak.

In real-world scenarios, HPC-Ops lifts Hunyuan model inference QPM by 30% and DeepSeek model QPM by 17%. At the single-operator level, its Attention delivers up to 2.22x over FlashInfer / FlashAttention; GroupGEMM up to 1.88x over DeepGEMM; and FusedMoE up to 1.49x over TensorRT-LLM.

In its future roadmap, HPC-Ops will keep pushing large-model inference performance: on one hand, developing sparse Attention operators to address the memory and compute bottlenecks of long-context models; on the other, expanding richer quantization stra ...

IT Home attaches the HPC-Ops open-source address below:
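The gap between the per-operator speedups (up to 2.22x for Attention) and the end-to-end gains (30% QPM) is what Amdahl's law predicts: an operator can only accelerate the fraction of runtime it occupies. A minimal sketch of that relationship; the runtime fractions are illustrative assumptions, not Hunyuan profiling data.

```python
# Amdahl's-law sketch: how a per-operator speedup translates into
# end-to-end gain. The 40% attention fraction below is an assumption.

def end_to_end_speedup(fraction: float, op_speedup: float) -> float:
    """fraction = share of total runtime spent in the accelerated
    operator; the rest of the runtime is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / op_speedup)

# If attention were, say, 40% of inference time and got 2.22x faster:
s = end_to_end_speedup(0.40, 2.22)
print(f"end-to-end speedup: {s:.2f}x")   # prints "end-to-end speedup: 1.28x"
```

Stacking the GroupGEMM and FusedMoE gains over their own runtime fractions is what closes the rest of the distance to a 30% throughput lift.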
"China's NVIDIA" Suddenly Plunges! Cambricon Falls 14%, Market Cap Drops Below 500 Billion Yuan; Guidance "Rumor Essays" Circulate, Company Says Many Rumors Are False
JRJ (金融界) · 2026-02-03 03:42
Today, "rumor essays" about Cambricon's 2026 earnings guidance circulated on social media. According to National Business Daily, Cambricon responded that it does not know the specific reason for the stock's swings, that many market rumors are false, and that it hopes investors stay rational.

As the A-share tech bellwether, Cambricon's core investment thesis rests on three pillars. First, accelerating domestic substitution: amid geopolitical pressures, demand from domestic cloud vendors and internet giants for self-controllable AI chips is growing rapidly, and Cambricon, as the domestic AI-chip leader, benefits directly. Second, exploding large-model inference demand: the rapid rise of homegrown models such as DeepSeek is driving strong demand for high-performance AI inference chips. Third, industry leadership: dubbed "China's NVIDIA," Cambricon's accumulated expertise in AI-chip architecture design and hardware-software co-optimization is steadily proving its value.

On January 31, Cambricon issued a preliminary results announcement, forecasting 2025 full-year revenue of 6.0 to 7.0 billion yuan, up 410.87% to 496.02% year on year; net profit after non-recurring items of 1.6 to 1.9 billion yuan; and net profit attributable to shareholders of 1.85 to 2.15 billion yuan. The company said revenue surged and profits turned positive on continued growth in AI compute demand. In addition, Cambricon's private placement application was approved by the Shanghai Stock Exch ...
Sunrise Chairman Xu Bing: Taking Large-Model Inference to the Extreme
Sohu Finance (搜狐财经) · 2026-01-29 11:35
Core Insights
- Whoever masters efficient, controllable, and sustainable inference infrastructure will dominate the pace of AI deployment [3][5].

Group 1: Company Overview
- The company, Xi Wang (Sunrise), is positioned as a leading GPU chip company focused on inference, aiming to optimize large-model inference [4].
- Its mission is to excel at large-model inference as the AI industry transitions from training-driven to inference-driven [4][5].
- The company was established in 2020, evolving from SenseTime's chip division, and has accumulated a decade of AI application experience [5][6].

Group 2: Market Trends
- By 2026, inference computing power is projected to account for 66% of AI workloads, surpassing training, indicating a structural shift in the industry [4].
- Demand for real-time interaction and complex scenarios such as 3D and video generation is driving the need for high-frequency response in AI applications [4][5].

Group 3: Cost Structure and Strategy
- Inference costs currently represent 70% of AI application expenses, making them critical to profitability and commercial success [4][5].
- The company aims to cut inference costs dramatically, from "per unit" to "per fraction" pricing, making AI infrastructure as accessible as utilities [4][7].

Group 4: Product Development and Innovation
- Xi Wang has invested 2 billion yuan in R&D over the past eight years, producing the S1 and S2 chips, with the S3 chip recently launched [7][8].
- The company aims to set a new industry benchmark of "one cent per million tokens" for inference cost [7][8].

Group 5: Business Model
- The company is not merely a chip seller; it aims to build a comprehensive "chip + system + ecosystem" stack [8][9].
- Xi Wang intends to collaborate with major AI firms and computing-power providers to optimize existing systems and improve cost efficiency [8][9].

Group 6: Future Vision
- The company envisions becoming the foundational infrastructure for affordable, stable computing power in the AI era, linking technology, policy, and commercial models [9].
- China's AI future is expected to rely on scalable, cost-effective inference infrastructure, marking the domestic AI-chip sector's transition from follower to leader [9].
Sunrise Releases Next-Generation Inference GPU Qiwang S3, Cutting Per-Token Inference Cost by 90%
Sina Finance (新浪财经) · 2026-01-27 11:36
(Source: Zhitong Finance)

On January 27, Zhitong Finance learned that domestic GPU maker Sunrise (曦望) has released its next-generation inference GPU, the Qiwang S3. This is Sunrise's first major public appearance after completing roughly 3 billion yuan in strategic financing over the past year. In 2025, Sunrise's chip shipments surpassed 10,000 units.

The Qiwang S3 is a GPGPU customized for large-model inference. Zhitong Finance understands that its overall price-performance in typical inference scenarios improves more than 10x over the previous generation. On compute and memory, the chip supports precision switching from FP16 down to FP4 and adopts an LPDDR6 memory solution, quadrupling memory capacity. In inference with mainstream large models such as the full-strength DeepSeek V3/R1, per-token cost falls roughly 90% versus the previous generation. ...
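Claims like "per-token cost down 90%" and the "one cent per million tokens" benchmark reduce to simple unit economics: amortized hardware-plus-power cost per hour divided by tokens served per hour. A rough sketch; every number here is an assumption for illustration, not Sunrise pricing data.

```python
# Unit-economics sketch behind "cost per million tokens".
# All inputs are illustrative assumptions.

chip_cost_per_hour = 0.50     # amortized hardware + power, USD/hour
tokens_per_second = 3000      # aggregate decode throughput per chip

tokens_per_hour = tokens_per_second * 3600
cost_per_million = chip_cost_per_hour / tokens_per_hour * 1e6
print(f"${cost_per_million:.3f} per million tokens")
```

Under this model, a 90% per-token cost cut requires the product of throughput gain and cost reduction to reach 10x, which is consistent with the claimed 10x price-performance improvement.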
GPU Startup Sunrise Raises 3 Billion Yuan in a Year: Born from SenseTime, Betting on Inference
Guancha (观察者网) · 2026-01-22 13:13
Group 1
- Sunrise has completed nearly 3 billion yuan in financing within a year, with investors including SANY Group's Huaxu Fund, Paradigm Intelligence, and IDG Capital [1]
- The funds raised will be used for developing the next generation of inference GPUs, large-scale production, and ecosystem building [1]
- Sunrise was established in 2020 as a spin-off from SenseTime's chip division, focusing on high-performance GPUs and multimodal scene-inference chips [1]

Group 2
- The core team numbers over 200 employees, more than 80% of them technical and R&D staff, many from leading companies such as AMD and Huawei [1]
- Co-CEO Wang Yong has over 20 years of chip-industry experience, having previously worked at AMD and Baidu [2]
- Co-CEO Wang Zhan, a founding member of Baidu, has significant experience in the field [2]

Group 3
- Unlike traditional GPU makers such as NVIDIA that pursue "training-inference integration," Sunrise focuses solely on "pure inference" [4]
- The company aims to cut inference costs by 90% and provide stable service, positioning itself as a transformative player in the AI industry [4]
- Sunrise has invested a total of 2 billion yuan in R&D, developing a product matrix spanning three generations of chips [4]

Group 4
- The S1 chip, launched in 2020, is China's first visual inference chip and has been licensed to major clients including Sony and Xiaomi [4]
- The S2 chip, set for mass production in 2024, competes with mainstream A/H-series GPUs and is compatible with the CUDA ecosystem [4]
- The S3 chip, expected in 2026, targets major reductions in cost and energy consumption, aiming at a new industry benchmark of "one cent per million tokens" [5]