InferenceX v2: NVIDIA Blackwell vs AMD vs Hopper (formerly InferenceMAX)
2026-02-24 14:19
Summary of InferenceX v2: NVIDIA Blackwell vs AMD vs Hopper

**Industry and Company Involved**
- The document discusses the competitive landscape of AI inference performance, focusing on NVIDIA's Blackwell architecture and AMD's offerings, particularly in the context of inference benchmarks and optimizations.

**Core Points and Arguments**
- **InferenceX v2 Overview**: InferenceX v2 builds on InferenceMAXv1, establishing a new standard for AI inference performance and economics through continuous testing across numerous GPUs and frameworks [3][4][7]
- **Benchmarking Capabilities**: InferenceX v2 is the first suite to benchmark NVIDIA's Blackwell Ultra GB300 NVL72 and B300, as well as AMD's MI355X, across the entire Pareto frontier curve [9][10]
- **Performance Comparison**:
  - AMD's MI355X shows competitive performance per total cost of ownership (TCO) against NVIDIA's B200 in FP8 precision using disaggregated and wide expert parallelism [21][23]
  - However, NVIDIA's solutions, particularly the B200 and B300, maintain a significant performance lead over AMD's offerings in many scenarios [28][34]
- **Energy Efficiency**: NVIDIA GPUs demonstrate superior energy efficiency, consuming significantly fewer picojoules per token across all workloads compared to AMD [28]
- **Composability Issues**: AMD's inference optimizations struggle with composability: individual optimizations perform well in isolation but fail to deliver competitive results when combined [29][30][31]
- **Future Focus for AMD**: AMD is advised to enhance the composability of its inference optimizations and is reportedly planning to focus on software composability of FP4 and distributed inferencing after the Chinese New Year [31][33][70]

**Additional Important Content**
- **Performance Improvements**: AMD has made notable improvements in SGLang DeepSeek R1 FP4 configurations, nearly doubling throughput in under two months [66][67]
- **NVIDIA's Consistency**: NVIDIA's performance results have been more stable, with minor improvements noted for the B200 on SGLang over a similar timeframe [73]
- **Market Dynamics**: The document highlights the competitive dynamics between NVIDIA and AMD, emphasizing the need for AMD to increase contributions to open-source projects and improve its software stack to remain competitive [70][42]
- **Technical Concepts**: The document explains key technical concepts such as disaggregated prefill, tensor parallelism, and the trade-offs between interactivity and throughput in LLM inference [49][57][61]

This summary encapsulates the critical insights and data points from the InferenceX v2 report, providing a comprehensive overview of the competitive landscape in AI inference technology.
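The Pareto frontier curve in the benchmarks trades interactivity (tokens per second per user) against throughput (tokens per second per GPU). As a minimal sketch of how such a frontier is extracted from benchmark points (the function and the data below are hypothetical illustrations, not from the report):

```python
# Extract the Pareto frontier from (interactivity, throughput) benchmark points.
# A configuration is Pareto-optimal if no other configuration is at least as
# good on both axes and strictly better on one.

def pareto_frontier(points):
    """points: list of (interactivity_tok_s_per_user, throughput_tok_s_per_gpu)."""
    frontier = []
    for p in points:
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and q != p
            for q in points
        )
        if not dominated:
            frontier.append(p)
    # Sort by interactivity so the points trace the frontier curve.
    return sorted(frontier)

# Hypothetical benchmark points: high throughput comes at low interactivity.
configs = [(10, 9000), (25, 6000), (50, 2500), (25, 5000), (40, 2400)]
print(pareto_frontier(configs))  # → [(10, 9000), (25, 6000), (50, 2500)]
```

The two dominated points, (25, 5000) and (40, 2400), are exactly the configurations a cost-per-token comparison would discard before comparing vendors.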
SemiAnalysis Founder Podcast Notes: NVIDIA, Huawei, and AI Data Center Rumors
傅里叶的猫· 2026-02-22 13:41
I recently listened to a podcast episode by SemiAnalysis founder Dylan Patel, and it was packed with information. From NVIDIA's acquisition of Groq, to the frenzied involution of China's semiconductor industry, to the rumors about "AI draining water resources," this article organizes his views.

NVIDIA's anxiety: from "one chip fits all" to a diversified portfolio

Not long ago, NVIDIA was still saying that "a single GPU can handle every AI task," yet it has now turned around and acquired Groq. Behind this lies Jensen Huang's deeper anxiety.

Dylan raises a key point: AI model workloads have grown large enough to support dedicated chips. A chip like Groq's is weak at general-purpose work: it cannot train models and is not economical for running large ones, but it has one specialty: blisteringly fast inference. This is a textbook case of "dedicated silicon beating general-purpose silicon."

Future AI models may no longer think in a single thread, but instead open 100 parallel streams of thought at once. Some Pro models from Google and OpenAI already work this way: rather than a single reasoning chain, they run several chains simultaneously and pick the best answer. In that scenario, what you need is not "extreme speed" but "sufficiently wide parallel processing capacity."

So NVIDIA's current strategy is clear: defend its general-purpose GPU base while diversifying through the Groq acquisition, the CPX chip, and other ...
LLM-in-Sandbox: Give a Large Model a Computer to Unlock General Agent Capabilities
机器之心· 2026-01-30 04:25
**Core Idea**
- The article presents the concept of LLM-in-Sandbox, which allows large language models (LLMs) to explore tasks in a virtual computer environment, significantly enhancing their performance in various non-code domains without additional training [5][40].

**Group 1: Technical Advancements**
- The evolution of large models is being unlocked through different paradigms, including In-Context Learning, Chain-of-Thought, and the recent intelligent agent framework that enables multi-turn interactions and tool usage [2][3].
- LLM-in-Sandbox is proposed as a new paradigm that combines LLMs with a virtual computer, allowing them to autonomously explore and complete tasks, leading to improved performance in fields such as mathematics, physics, chemistry, and long-text understanding [3][7].

**Group 2: Design and Implementation**
- LLM-in-Sandbox features a lightweight, general-purpose design that contrasts with existing software engineering agents that require task-specific environments, thus enhancing generalization and scalability [10][11].
- The environment is based on a Docker Ubuntu setup with minimal pre-installed tools, allowing models to autonomously acquire domain-specific tools as needed [12][13].

**Group 3: Experimental Results**
- Experiments across six non-code domains showed significant performance improvements for LLMs in the LLM-in-Sandbox mode, with enhancements observed in mathematics (+6.6% to +24.2%), physics (+1.0% to +11.1%), and other areas without additional training [20][21].
- The model's ability to autonomously utilize the sandbox environment was demonstrated through case studies, showcasing its capacity for external resource access, file management, and computational execution [21][22][23].

**Group 4: Reinforcement Learning Integration**
- LLM-in-Sandbox RL is introduced to enhance the generalization capabilities of weaker models by training them in the sandbox environment using context-based tasks, which require active exploration [26][29].
- The approach has shown consistent performance improvements across various models, indicating its broad applicability and effectiveness [31].

**Group 5: Efficiency and Performance**
- LLM-in-Sandbox demonstrates cross-domain generalization, achieving consistent performance improvements in multiple downstream tasks, including software engineering [31].
- The deployment of LLM-in-Sandbox can significantly reduce token consumption in long-text scenarios, with reductions of up to 8 times, while maintaining competitive throughput speeds [32][34].

**Group 6: Future Prospects**
- LLM-in-Sandbox transcends traditional text generation capabilities, enabling cross-modal abilities and direct file generation, which could evolve into a universal digital creation system [35][38].
- The article concludes that LLM-in-Sandbox should become the default deployment paradigm for large models, as it offers substantial performance enhancements with minimal deployment costs [40].
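The core mechanism described above, a model issuing shell commands into a virtual computer and reading the results back, can be sketched as an execute-and-observe loop. This is a hedged illustration rather than the paper's implementation: `run_in_sandbox`, `model_step`, and the host-side `subprocess` stand-in for the Docker Ubuntu container are all assumptions.

```python
import subprocess

def run_in_sandbox(command, timeout=30):
    """Execute a model-issued shell command and return its combined output.
    (In the paper's setup this would run inside a Docker Ubuntu container;
    a host-side subprocess stands in for that here.)"""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr

def agent_loop(model_step, task, max_turns=8):
    """Alternate model turns and tool executions until the model stops
    issuing commands. `model_step` is a hypothetical stand-in that maps
    the dialogue history to a shell command, or None when finished."""
    history = [task]
    for _ in range(max_turns):
        action = model_step(history)
        if action is None:
            break
        history.append(run_in_sandbox(action))
    return history

print(run_in_sandbox("echo 42").strip())  # → 42
```

The interesting design point is that the environment itself is minimal; capability comes from the model deciding which commands (and which tool installations) to issue.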
Come to This Salon for a Look at SGLang's Frontier Practice: Ultra-Long-Context Scaling, RL Post-Training Frameworks, Diffusion Language Models, and More
机器之心· 2026-01-29 08:12
**Core Insights**
- The article discusses the transition of artificial intelligence from a "chat" paradigm to an "actionable" intelligent agent era, emphasizing the need for deep collaboration and experience sharing among developers in optimizing LLM systems [2]

**Event Overview**
- A Meetup organized by the SGLang community, Machine Heart, and Zhangjiang Incubator will take place on February 6, focusing on LLM system optimization and practical implementation [2]
- The event will feature discussions on SGLang's technical roadmap, long-context expansion, RL post-training frameworks, and diffusion language model exploration [2]

**Event Schedule**
- 13:30-14:00: Registration
- 14:00-14:30: Keynote on the SGLang roadmap by Zhang Bozhou, core developer of SGLang [5]
- 14:30-15:00: Keynote on Omni-infer performance optimization by Zheng Jinhwan, core developer of Omni-infer [5]
- 15:00-15:30: Keynote on the slime RL scaling post-training framework by Xie Chengxing, Tsinghua University PhD student [5]
- 15:30-16:00: Keynote on SGLang CPP for long-context scaling by Cai Shangming, core developer of SGLang and Mooncake [5]

**Guest Introductions**
- Zhang Bozhou: Core developer of SGLang, focusing on open-source LLM support and optimization across different CUDA hardware [8]
- Zheng Jinhwan: Huawei technical expert and core contributor to Omni-infer, specializing in high-performance systems and inference optimization [9]
- Xie Chengxing: PhD student at Tsinghua University and core developer of the slime RL framework, with a focus on enhancing LLM reasoning and decision-making capabilities [10]
- Cai Shangming: Researcher at Alibaba Cloud, core contributor to SGLang and Mooncake, with expertise in high-performance inference systems and distributed machine learning [10]
- Li Zehuan: System engineer at Ant Group and core contributor to SGLang, focusing on AI infrastructure optimization [11]
Scoring AI Models, and Ending Up with a $1.7B Unicorn?
36Kr · 2026-01-07 11:04
**Core Insights**
- LMArena has successfully secured $150 million in Series A funding, raising its valuation to $1.7 billion, marking a strong start to the new year [1]
- The funding round was led by Felicis and UC Investments, with participation from Andreessen Horowitz and The House Fund, indicating strong investor confidence in the AI model evaluation sector [3]

**Company Background**
- LMArena originated from Chatbot Arena, created by the open-source organization LMSYS, comprised mainly of members from top universities like UC Berkeley and Stanford [4]
- The team developed the open-source inference engine SGLang, which achieved performance comparable to DeepSeek's official report on 96 H100 GPUs [4]
- The primary focus of LMArena is on evaluating AI models, having established a crowdsourced benchmarking platform during the rise of models like ChatGPT and Claude [6][7]

**Evaluation Methodology**
- LMArena employs a unique evaluation method where users anonymously vote on model responses, ensuring unbiased assessments [10]
- The platform utilizes an Elo rating system based on the Bradley–Terry model to score models, allowing for real-time updates and fair comparisons [10]
- LMArena has become a go-to platform for new models to be tested, with Gemini 3 Pro currently leading the rankings with a score of 1490 [10][11]

**Growth and Future Plans**
- Since its seed funding of $100 million last year, LMArena has rapidly expanded, accumulating 50 million votes across various modalities and evaluating over 400 models [12]
- The newly raised funds will be used to enhance platform operations, improve user experience, and expand the technical team [12]
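The Elo-style scoring over pairwise votes can be sketched with the standard online update, whose expected-score term has the Bradley–Terry logistic form. The K-factor and the ratings below are hypothetical, and LMArena's actual fitting procedure (batch maximum-likelihood rather than online updates) may differ.

```python
def elo_update(r_a, r_b, winner_a, k=32):
    """Online Elo update after one head-to-head vote between models A and B.
    The expected score follows the Bradley–Terry / logistic form."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    score_a = 1.0 if winner_a else 0.0
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1 - score_a) - (1 - expected_a))
    return r_a_new, r_b_new

# Two models rated 1490 and 1450; the lower-rated one wins the vote,
# so it gains more points than it would against an equal opponent.
a, b = elo_update(1490, 1450, winner_a=False)
print(round(a), round(b))  # → 1472 1468
```

Note that the total rating mass is conserved: whatever A loses, B gains, which is what makes the leaderboard a zero-sum comparison rather than an absolute score.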
Scoring AI Models, and Ending Up with a $1.7B Unicorn???
量子位· 2026-01-07 09:11
**Core Insights**
- LMArena has successfully secured $150 million in Series A funding, raising its valuation to $1.7 billion, marking a strong start to the new year [1][3].

**Group 1: Funding and Valuation**
- The funding round was led by Felicis and UC Investments, with participation from Andreessen Horowitz and The House Fund [3].
- The significant investment reflects the attractiveness of the AI model evaluation sector in the current market [4].

**Group 2: Company Background**
- LMArena originated from Chatbot Arena, which was created by the open-source organization LMSYS following the emergence of ChatGPT in 2023 [5][4].
- The core team consists of highly educated individuals from top universities such as UC Berkeley, Stanford, UCSD, and CMU [6].

**Group 3: Technology and Evaluation Methodology**
- LMArena's open-source inference engine, SGLang, has achieved performance comparable to DeepSeek's official report on 96 H100 GPUs [7].
- SGLang has been widely adopted by major companies including xAI, NVIDIA, AMD, Google Cloud, Oracle Cloud, Alibaba Cloud, Meituan, and Tencent Cloud [8].
- The primary focus of LMArena is on evaluating AI models, which they began with the launch of Chatbot Arena, a crowdsourced benchmarking platform [9][10].

**Group 4: Evaluation Process**
- LMArena employs a unique evaluation process that includes anonymous battles, an Elo-style scoring system, and human-machine collaboration [20].
- Users input questions, and the system randomly matches two models for anonymous responses, allowing users to vote on the quality of the answers without knowing the model identities [21][22].
- The platform's Elo scoring mechanism updates model rankings based on performance, ensuring a fair and objective evaluation process [22].

**Group 5: Growth and Future Plans**
- Since securing $100 million in seed funding, LMArena has rapidly exceeded expectations, accumulating 50 million votes across various modalities and evaluating over 400 models [25].
- The newly raised funds will be used to enhance platform operations, improve user experience, and expand the technical team to support further development [25].
SGLang Natively Supports Ascend: New Models Launch with One Click, No Code Changes Needed
量子位· 2025-12-21 14:13
**Core Insights**
- The article discusses the increasing focus on the ability of inference systems to handle real-world loads as agents accelerate on the application side [1][4]
- The SGLang AI finance meetup highlighted engineering challenges in inference systems, including high-concurrency requests, long context windows, multi-turn reasoning, memory management, and consistent generation in financial agent scenarios [4][9]

**Group 1: Inference System Engineering Solutions**
- The SGLang event, co-hosted with AtomGit, focused on large model inference architecture, agents, reinforcement learning, and their application in finance [7]
- Key participants included engineering teams from inference systems, models, and computing power, emphasizing the higher demands agents place on efficiency in high concurrency, long context windows, multi-turn reasoning, and memory management compared to traditional LLM serving [8]
- Specific deployment scenarios, such as financial agents, have stricter requirements for low latency, response stability, consistency, and cost control [9]

**Group 2: Technical Innovations and Implementations**
- SGLang introduced the HiCache system to address KV cache redundancy and high memory demand in high-concurrency and long-context scenarios, significantly reducing memory usage and improving inference stability and throughput [11]
- For hybrid models like Qwen3-Next and Kimi Linear, SGLang implemented a Mamba Radix Tree for unified prefix management and an Elastic Memory Pool for efficient inference and memory optimization in long-context and high-concurrency scenarios [13]
- The Mooncake system, based on Transfer Engine, significantly reduced weight loading and model startup times, achieving weight-update preparation in under 20 seconds and cutting cold-start times from 85 seconds to 9 seconds [17]

**Group 3: Collaboration with the Ascend Platform**
- The capabilities of these inference systems are not limited to a specific computing platform: HiCache, Mooncake, and GLM can run directly on the Ascend platform, indicating a shift in Ascend's role in the inference system ecosystem [24][25]
- SGLang's latest advancements on the Ascend platform include model adaptation, performance optimization, and modular acceleration capabilities, achieving a throughput of 15 TPS per card for DeepSeek V3.2 under specific conditions [29]
- System-level optimizations included load balancing, operator fusion to reduce memory access, and multi-stream parallel execution to enhance resource utilization [30][31]

**Group 4: Future Directions and Open-Source Commitment**
- Ascend's collaboration with SGLang aims to fully embrace open source and accelerate ecosystem development, having completed gray-release testing of DeepSeek V3.2 in real business scenarios [46]
- Future developments will focus on systematic engineering investment around inference systems, enhancing throughput for high-concurrency, low-latency workloads, and aligning with open-source engines for model deployment and performance tuning [47]
- The integration of models, inference engines, and computing platforms into a stable collaborative framework will shift the focus from whether a model can run to whether the system can run sustainably and at scale [47]
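The prefix sharing behind HiCache and the Mamba Radix Tree rests on matching a new request's leading tokens against previously cached sequences, so that prefill work for the shared prefix can be skipped. A minimal sketch follows; it is an uncompressed trie over token IDs (production radix trees compress token runs into single edges), and `kv` is a stand-in for real KV tensors.

```python
class RadixNode:
    def __init__(self):
        self.children = {}   # token_id -> RadixNode
        self.kv = None       # placeholder for cached KV state at this prefix

def insert(root, tokens):
    """Insert a token sequence, marking every prefix along the path as cached."""
    node = root
    for t in tokens:
        node = node.children.setdefault(t, RadixNode())
        node.kv = object()   # stand-in for the real KV tensors
    return node

def longest_cached_prefix(root, tokens):
    """Return how many leading tokens of `tokens` already have cached KV
    state, i.e. how much prefill a new request can skip."""
    node, matched = root, 0
    for t in tokens:
        if t not in node.children:
            break
        node = node.children[t]
        matched += 1
    return matched

root = RadixNode()
insert(root, [1, 2, 3, 4])                        # first request fills the cache
print(longest_cached_prefix(root, [1, 2, 3, 9]))  # → 3 tokens reusable
```

Multi-turn agent traffic benefits most from this structure, since each turn re-sends the whole conversation and the entire previous context is a cached prefix.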
Building a Production-Grade Cloud-Native LLM Inference Platform with SGLang RBG + Mooncake
AI前线· 2025-12-12 00:40
**Core Insights**
- The article emphasizes the rapid evolution of large language model (LLM) inference services into core enterprise infrastructure, focusing on the balance of performance, stability, and cost in building high-performance inference systems [2]
- It discusses the transition from monolithic to distributed architectures in LLM inference, highlighting the need for an external KVCache to alleviate memory pressure and enhance performance in high-demand scenarios [2][4]

**Distributed KVCache and Mooncake**
- Mooncake is introduced as a leading distributed KVCache storage engine designed to provide high throughput and low latency for inference frameworks like SGLang [3]
- The article outlines the challenges of managing distributed KVCache systems in production environments, which necessitated the development of RoleBasedGroup (RBG) for unified management of caching and inference nodes [4]

**RoleBasedGroup (RBG) Design and Challenges**
- RBG is presented as a Kubernetes-native API aimed at AI inference, facilitating multi-role orchestration to ensure stable and high-performance operations [4][12]
- The article identifies five fundamental challenges in deploying large model inference services, including the need for strong state management and performance optimization [12][15]

**SCOPE Framework**
- The SCOPE framework is introduced, focusing on five core capabilities: Stability, Coordination, Orchestration, Performance, and Extensibility, which are essential for managing LLM inference services [16][18]
- RBG's design allows for rapid architecture iteration and performance-sensitive operations, addressing the complexities of multi-role dependencies and operational efficiency [15][24]

**Benchmark Testing and Performance Metrics**
- Benchmark tests demonstrate significant improvements in KVCache hit rates and inference performance, with the L3 Mooncake cache achieving a 64.67% hit rate and reducing average TTFT to 2.58 seconds [32][48]
- The article highlights the importance of a multi-tier caching architecture in enhancing performance for applications like multi-turn dialogue and AI agents [44]

**Conclusion and Future Outlook**
- The integration of RBG and Mooncake is positioned as a transformative approach to building production-grade LLM inference services, emphasizing the need for deep integration of high-performance design with cloud-native operational capabilities [43][44]
- The article concludes with a call for community collaboration to advance this paradigm and lay the foundation for the next generation of AI infrastructure [43]
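The multi-tier layout behind those hit-rate numbers can be sketched as an ordered lookup with promotion on hit. The L1/L2/L3 naming follows the article, but the code is a hypothetical simplification: real tiers hold KV blocks in GPU memory, host memory, and the remote Mooncake store, not Python dicts.

```python
def tiered_get(key, tiers):
    """Look up `key` across ordered cache tiers (fastest first). On a hit
    in a slower tier, promote the entry into every faster tier so the next
    access is served closer to the GPU."""
    for i, tier in enumerate(tiers):
        if key in tier:
            value = tier[key]
            for upper in tiers[:i]:      # promote into faster tiers
                upper[key] = value
            return value, i              # value and the tier index it came from
    return None, -1                      # miss in every tier

# Hypothetical tiers: L1 GPU memory, L2 host memory, L3 Mooncake store.
l1_gpu, l2_host, l3_mooncake = {}, {}, {"prefix-abc": "kv-blob"}
tiers = [l1_gpu, l2_host, l3_mooncake]

print(tiered_get("prefix-abc", tiers))  # → ('kv-blob', 2)  hit in L3, promoted
print(tiered_get("prefix-abc", tiers))  # → ('kv-blob', 0)  now served from L1
```

The reported TTFT gains come from exactly this effect: a request whose prefix hits any tier skips recomputing that prefill, and promotion keeps hot prefixes in the fastest tier.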
Z Event | Z Potentials × SGLang NeurIPS Global Frontier Researchers Summit Night
Z Potentials· 2025-11-26 04:34
**Core Insights**
- NeurIPS 2025 is set to be a historic event for the future of AI technology, gathering top researchers and engineers in San Diego [1]
- Z Potentials is collaborating with SGLang, a leading open-source inference engine community, to create a unique networking opportunity for frontier researchers [2]

**Event Details**
- The event will feature prominent researchers from organizations like OpenAI, DeepMind, and Nvidia, focusing on next-generation generative AI and system innovations [1]
- The event is scheduled for December 5, from 6:00 PM to 8:00 PM, near the NeurIPS venue in San Diego [6]

**Collaboration and Support**
- Z Potentials aims to bridge investment, research, and infrastructure, with SGLang recognized as a standard in the large model inference field [2]
- Atlas Cloud is providing significant computational support for the event, enabling the gathering of leading researchers [3]
Top LLM Minds Gather for a Hardcore Open-Source Meetup: the SGLang Community Holds Its First Meetup in China
机器之心· 2025-10-28 06:29
**Core Insights**
- The PyTorch Conference 2025 showcased the vibrant community and significant developments in deep learning, particularly highlighting SGLang's contributions and potential in the industry [1][3][4].

**SGLang Overview**
- SGLang, an open-source high-performance inference engine for large language models and vision language models, originated from RadixAttention and is incubated by the non-profit organization LMSYS. It offers low-latency, high-throughput inference across environments ranging from single GPUs to large distributed clusters [7][8].

**Community Engagement**
- The first Meetup event in Beijing, co-hosted by SGLang, Meituan, and Amazon Web Services, attracted numerous contributors, developers, and scholars, indicating a strong community presence and development potential [4][8].

**Technical Developments**
- The Meetup featured technical discussions on SGLang's architecture, including advancements in KV Cache, Piecewise CUDA Graph, and Spec Decoding, aimed at improving efficiency and compatibility [21][22].
- SGLang's quantization strategies were also discussed, focusing on expanding the range of applications and optimizing model performance [34][35].

**Application and Practice**
- Various industry applications of SGLang were presented, including its integration with Baidu's Ernie 4.5 model for large-scale deployment and optimization in search scenarios [41][42].
- The application of SGLang in WeChat's search function was highlighted, emphasizing the need for high throughput and low latency in the user experience [44].

**Future Directions**
- The roadmap for SGLang includes further integration with various hardware and software solutions, aiming to enhance stability and compatibility across different platforms [22][35].
- The SpecForge framework, developed by the SGLang team, aims to accelerate large language model inference and has been adopted by major companies like Meituan and NVIDIA [57][58].
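Spec(ulative) Decoding, one of the architecture topics above, lets a cheap draft model propose several tokens that the larger target model then verifies, emitting multiple tokens per target-model step when they agree. Below is a greedy-acceptance sketch with toy stand-in models; real systems (including SGLang's implementation) accept draft tokens probabilistically and verify all proposals in a single batched forward pass.

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """One greedy speculative-decoding step: the draft model proposes k
    tokens, the target model checks them in order and keeps the longest
    run it agrees with, emitting its own token at the first disagreement."""
    seq = list(prefix)
    proposed = []
    for _ in range(k):                       # cheap draft pass
        proposed.append(draft_next(seq + proposed))
    accepted = []
    for t in proposed:                       # verification against the target
        if target_next(seq + accepted) == t:
            accepted.append(t)               # target agrees: token is "free"
        else:
            accepted.append(target_next(seq + accepted))  # correction token
            break
    return accepted

# Toy next-token models over integers (hypothetical stand-ins for real LMs):
# the draft increments the last token; the target agrees but caps values at 3.
draft = lambda s: s[-1] + 1
target = lambda s: min(s[-1] + 1, 3)
print(speculative_step(draft, target, [0]))  # → [1, 2, 3, 3]
```

Three of the four draft tokens are accepted, so one logical step yields four output tokens, which is exactly the latency win frameworks like SpecForge train draft models to maximize.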