vLLM

o3-pro通关“推箱子”,人类怀旧小游戏成了大模型新Benchmark
量子位· 2025-06-16 04:50
Core Viewpoint
- Classic nostalgic games like Sokoban and Tetris have become benchmarks for evaluating large models, with the o3-pro model recently surpassing previous performance limits in these games [1][2][6].

Group 1: Benchmark Performance
- The o3-pro model completed all levels of Sokoban, a benchmark on which the previous best run had stalled at the sixth level [3][8].
- Compared with the previous state-of-the-art (SOTA) model, o3, o3-pro's score has doubled [3][10].
- The Tetris score is the number of placed blocks plus ten times the number of cleared lines, accumulated until the game ends (a scoring sketch follows this summary) [13][22].

Group 2: Game Characteristics and Evaluation
- The Lmgame benchmark includes several games, such as 2048, Candy Crush, Super Mario Bros, and Phoenix Wright, each with its own evaluation criteria [18][24].
- 2048 is scored by the total value of merged tiles, while Candy Crush is scored by the total candies eliminated within a fixed number of rounds [24].
- The evaluation methods do not factor in time, focusing instead on game-specific performance metrics [22][24].

Group 3: Model Development and Support
- The project is developed by the Hao AI Lab at UCSD, which is affiliated with the university's machine learning systems and NLP labs [28].
- The lab has received funding from Google and NVIDIA, with NVIDIA donating a DGX B200 system to support its research [34].
- The benchmark is open source, so anyone interested can download it and test their own models [23].
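For concreteness, the toy scorers below implement the evaluation rules just described (Tetris: placed blocks plus ten times cleared lines; 2048: total value of merged tiles; Candy Crush: total candies cleared within a fixed number of rounds). This is an illustration of the stated formulas, not the Lmgame benchmark's actual evaluation code.

```python
# Toy scorers for the evaluation rules described in the article; these illustrate
# the stated formulas and are not the Lmgame benchmark's actual code.

def tetris_score(blocks_placed: int, lines_cleared: int) -> int:
    # Score accumulates until the game ends: placed blocks + 10 * cleared lines.
    return blocks_placed + 10 * lines_cleared

def game_2048_score(merged_tile_values: list[int]) -> int:
    # Total value of all tiles produced by merges over the run.
    return sum(merged_tile_values)

def candy_crush_score(candies_cleared_per_round: list[int]) -> int:
    # Total candies eliminated within a fixed number of rounds.
    return sum(candies_cleared_per_round)

print(tetris_score(blocks_placed=42, lines_cleared=7))   # 112
print(game_2048_score([4, 8, 8, 16, 32]))                # 68
print(candy_crush_score([12, 9, 15]))                    # 36
```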
o3-pro通关“推箱子”,人类怀旧小游戏成了大模型新Benchmark
量子位· 2025-06-16 04:49
Core Viewpoint - Classic nostalgic games like "Sokoban" and "Tetris" have become benchmarks for evaluating large models, with the o3-pro model achieving significant breakthroughs in these games [1][6]. Group 1: Benchmark Performance - The o3-pro model surpassed previous benchmarks by completing all levels of Sokoban, while the best prior model, o3, only reached the sixth level [2][3]. - In Tetris, the scoring system combines the number of placed blocks with ten times the number of cleared lines, and o3-pro's performance doubled that of o3 [3][13]. - The o3-pro model's performance is notable for its time-consuming operations, taking several minutes for each move [17]. Group 2: Game Evaluation Standards - The Lmgame benchmark includes various games, with specific evaluation metrics for each, such as total distance moved in Super Mario Bros and total candy cleared in Candy Crush [6][24]. - The evaluation does not consider time as a factor, focusing instead on game-specific performance metrics [22]. - The benchmark is open-source, allowing others to download and test their models [23]. Group 3: Development and Support - The project is developed by the Hao AI Lab at UCSD, which has received support from Google and NVIDIA [28][34]. - The lab has created multiple open-source projects, with FastVideo being the most starred on GitHub [32].
对话红帽全球副总裁曹衡康:AI成本下降了 芯片的量一定会起来
Mei Ri Jing Ji Xin Wen· 2025-06-14 09:02
Core Viewpoint
- The industry consensus is that the cost of computing power will eventually fall, but no single path has been settled on among data centers, all-in-one appliances, and inference servers [1].

Group 1: The Year of AI Inference
- 2025 is seen as the year of AI inference, when enterprises formally put AI applications into production to generate revenue and control internal costs [1].
- Red Hat has adopted the vLLM framework, a high-performance large language model inference framework that has become a de facto standard in the open-source community [1].

Group 2: Contribution and Market Potential
- Contributors from China account for 35% of contributions to the vLLM community, pointing to strong potential for inference technology to deliver enterprise value in China [1].
- The company identifies two technical challenges in inference: achieving high-performance inference with minimal hardware and cost, and distributing inference workloads across multiple servers (a deployment sketch follows this summary) [1].

Group 3: Future of Computing Power Costs
- Red Hat plans to launch inference servers in 2025, emphasizing that their main advantage is reducing computing costs for enterprises [2].
- The company does not produce hardware but focuses on software, aiming to lower the barrier to AI adoption for businesses [2].
- As computing costs fall, demand for GPU cards is expected to rise sharply, potentially growing the number of enterprises using AI from 1,000 to 100,000 or even 1 million [2].
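As a rough illustration of what adopting vLLM looks like in practice, here is a minimal offline-inference sketch; the model name and parallel degree are placeholders, and this is not Red Hat's product configuration. Genuinely multi-server deployments additionally involve an orchestration layer (for example a Ray cluster), which is omitted here.

```python
# Minimal sketch (hypothetical configuration, not Red Hat's deployment): running a
# model across several GPUs on one server with vLLM's offline API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-8B",      # placeholder; any Hugging Face checkpoint works here
    tensor_parallel_size=2,      # shard the model weights across two GPUs
)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Why does inference cost matter for enterprises?"], params)
print(outputs[0].outputs[0].text)
```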
DeepSeek研究员1200行代码复刻vLLM,H800硬件实测性能反超原版
量子位· 2025-06-13 07:05
Core Viewpoint
- The article highlights Nano-vLLM, an open-source project by DeepSeek researcher Yu Xingkai that delivers a lightweight, fully readable re-implementation of vLLM in under 1200 lines of code while maintaining performance comparable to the original framework [1][27].

Group 1: Project Overview
- Nano-vLLM has three main characteristics: a minimal codebase, high readability, and competitive performance [2].
- In benchmark tests on an RTX 4070 with the Qwen3-0.6B model, vLLM reached a throughput of 1353.86 tokens/s in 98.95 seconds, while Nano-vLLM reached 1314.65 tokens/s in 101.90 seconds, so vLLM was slightly ahead [3][4].
- On H800 hardware with the Qwen3-8B model, Nano-vLLM surpassed vLLM, achieving 6731.42 tokens/s in 86.73 seconds versus vLLM's 5916.89 tokens/s in 98.67 seconds, a significant improvement [9].

Group 2: Technical Insights
- vLLM is designed to optimize inference and deployment of large language models (LLMs) and was initially developed by the Sky Computing Lab at UC Berkeley [16].
- vLLM's core technique is inspired by the operating system's virtual-memory paging mechanism, addressing fragmentation in the memory used for key-value (KV) caches [19].
- The PagedAttention algorithm allows KV pairs to be stored non-contiguously, improving memory management and reducing waste, which raises throughput 2-4x over earlier systems such as FasterTransformer and Orca [24]. A toy illustration of the paging idea follows this summary.

Group 3: Features and Compatibility
- vLLM integrates seamlessly with popular Hugging Face models and supports various decoding algorithms for high-throughput serving, including parallel sampling and beam search [25].
- It is compatible with multiple hardware platforms, including NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, TPUs, and AWS Neuron [26].
- The vLLM engine comprises about 8500 lines of Python and 2000 lines of C++/CUDA, while Nano-vLLM delivers similar core functionality with a far smaller codebase [26][27].
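Below is a toy sketch of the paging idea described above, assuming nothing about vLLM's internals beyond what the article states: the KV cache is carved into fixed-size physical blocks, and each sequence keeps a block table mapping its logical token positions to whichever blocks it was allocated, so storage need not be contiguous.

```python
# Toy illustration of the paging idea behind PagedAttention (not vLLM's code):
# the KV cache lives in fixed-size physical blocks, and each sequence keeps a
# block table mapping logical token positions to physical block indices, so
# blocks are allocated on demand and need not be contiguous.
BLOCK_SIZE = 16

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # pool of physical blocks
        self.block_tables = {}                      # seq_id -> [physical block ids]
        self.seq_lens = {}                          # seq_id -> tokens written so far

    def append_token(self, seq_id: int) -> None:
        """Reserve room for one more token of `seq_id`, allocating a block if needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:                # current block is full (or first token)
            table.append(self.free_blocks.pop())    # grab any free physical block
        self.seq_lens[seq_id] = length + 1

    def slot(self, seq_id: int, pos: int) -> tuple[int, int]:
        """Translate a logical token position into (physical_block, offset)."""
        table = self.block_tables[seq_id]
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

cache = PagedKVCache(num_blocks=64)
for _ in range(40):
    cache.append_token(seq_id=0)
print(cache.block_tables[0], cache.slot(0, 37))     # 3 blocks allocated on demand for 40 tokens
```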
1200行代码逆袭!DeepSeek工程师开源轻量级vLLM,吞吐量逼近原版
机器之心· 2025-06-13 04:31
Core Viewpoint
- vLLM is a high-performance, open-source LLM inference and serving engine developed at UC Berkeley, aimed at improving inference speed and resource utilization, particularly memory efficiency, while staying compatible with popular model hubs like Hugging Face [2][3].

Group 1: vLLM and Nano-vLLM
- vLLM lets mainstream models such as GPT, Mistral, and LLaMA run faster and consume fewer resources through its PagedAttention attention mechanism [3].
- A lightweight implementation of vLLM, named Nano-vLLM, was built by DeepSeek AI researcher Yu Xingkai, trimming the code to under 1200 lines [4][7].
- Nano-vLLM has gained over 200 stars on GitHub, indicating community interest and engagement [5].

Group 2: Features of Nano-vLLM
- Nano-vLLM offers three core features:
  1. Fast offline inference with performance comparable to vLLM [6].
  2. A readable, simplified codebase [7].
  3. An optimization suite that includes prefix caching, Torch compilation, and CUDA graphs [8].

Group 3: Benchmarking Results
- Benchmark tests showed that Nano-vLLM produced the same output tokens as vLLM but took slightly longer, yielding a throughput of 1314.65 tokens/s versus vLLM's 1353.86 tokens/s (see the calculation sketch after this summary) [9][11].
- The test configuration used an RTX 4070 GPU, the Qwen3-0.6B model, and input and output lengths randomly sampled between 100 and 1024 tokens [10].
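To make the relationship between these figures explicit, the short calculation below back-derives the implied total token count from the reported vLLM run and recomputes both throughputs; the token total is an inference from the article's numbers, not a figure the article states.

```python
# Throughput = output tokens / wall-clock time. The token total below is
# back-derived from the reported vLLM numbers, not stated in the article.
def throughput(total_output_tokens: int, wall_time_s: float) -> float:
    return total_output_tokens / wall_time_s

implied_tokens = round(1353.86 * 98.95)   # ~133,964 tokens implied by the vLLM run
print(f"vLLM:      {throughput(implied_tokens, 98.95):.2f} tok/s")    # ~1353.86
print(f"Nano-vLLM: {throughput(implied_tokens, 101.90):.2f} tok/s")   # ~1314.7, matching the article
```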
开源AI开发生态大洗牌:低代码平台逆袭,传统LLM框架日渐式微
量子位· 2025-05-28 07:28
Core Insights
- The report and accompanying panorama released by Ant Group offer a detailed analysis of the current open-source ecosystem for large models, tracing its evolution and trends [1][4][40].

Group 1: Overview of the Open-Source Ecosystem
- The open-source ecosystem for large models is described as a "real-world hackathon," underscoring the collaborative nature of its development [2][3].
- Ant Group's panorama covers 19 technical fields and 135 projects, from model infrastructure to intelligent applications [5][10].
- The analysis identifies three dominant technical tracks: model training frameworks, efficient inference engines, and low-code application development frameworks [10][11].

Group 2: Key Projects and Trends
- The report lists the top 20 projects for 2025, highlighting significant growth and decline across projects [7].
- PyTorch ranks first in influence among all projects in the panorama, while vLLM and SGLang are noted for rapid iteration in the inference category [14][31].
- Dify and RAGFlow are emerging as leading application development platforms, driven by low-code workflows that meet enterprise user needs [18][35].

Group 3: Development Paradigms and Standards
- The shift toward low-code development is becoming mainstream, while traditional agent frameworks decline in popularity [20][17].
- New communication standards for models and applications are being established, such as MCP and the A2A protocol, which facilitate interaction between different agents [22][25].
- The report argues that the standard protocol layer will become a strategic battleground for leading players as large model services mature [24][26].

Group 4: Implications for Developers
- Developers are encouraged to focus on user experience and a deeper understanding of specific application scenarios to gain a competitive edge [34][35].
- They also need to adapt to rapid project cycles and embrace a trial-and-error approach to development [37][38].
- Overall, the report serves as a useful resource for understanding the mechanisms behind the large-model open-source ecosystem and where it is heading [41][42].
SemiAnalysis:AMD vs NVIDIA 推理基准测试:谁赢了?--性能与每百万令牌成本分析
2025-05-25 14:09
Summary of AMD vs NVIDIA Inference Benchmarking Conference Call

Industry and Companies Involved
- **Industry**: Artificial Intelligence (AI) Inference Solutions
- **Companies**: Advanced Micro Devices (AMD) and NVIDIA

Core Insights and Arguments
1. **Performance Comparison**: AMD has claimed its AI servers deliver better inference performance per total cost of ownership (TCO) than NVIDIA's, but the results are nuanced, with performance varying across tasks such as chat applications, document processing, and reasoning [4][5][6]
2. **Workload Performance**: For hyperscalers and enterprises that own their GPUs, NVIDIA wins some workloads and AMD wins others; for short- to medium-term rentals, however, NVIDIA consistently offers better performance per dollar because of the scarcity of AMD GPU rental providers [6][12][13]
3. **Market Dynamics**: The MI325X, intended to compete with NVIDIA's H200, faced shipment delays, leading customers to choose the B200 instead. The MI355X is not expected to ship until later in 2025, further weakening AMD's competitive position [8][10][24]
4. **Software and Developer Experience**: AMD's software support for its GPUs still lags NVIDIA's, particularly in developer experience and continuous integration (CI) coverage, which has contributed to AMD's ongoing struggles in the AI software space [9][15][14]
5. **Market Share Trends**: AMD's share of datacenter AI GPUs has been rising but is expected to dip in Q2 CY2025 as NVIDIA launches new products; the upcoming MI355X and software improvements may help AMD regain some share [26][27]

Additional Important Points
1. **Benchmarking Methodology**: The methodology emphasizes online throughput against end-to-end latency, giving a realistic view of performance under operational conditions [30][31]
2. **Latency and Throughput Relationship**: There is a trade-off between throughput and latency; optimizing for one typically hurts the other, and understanding this balance is crucial when choosing a configuration for a given application (see the cost sketch following this summary) [35][36]
3. **Inference Engine Selection**: vLLM is the primary inference engine used for benchmarking, with TensorRT-LLM (TRT-LLM) also evaluated; despite improvements, TRT-LLM still trails vLLM in user experience [54][55]
4. **Future Developments**: AMD is encouraged to invest more in internal cluster resources to improve developer experience and software capabilities, which could yield better long-term shareholder returns [15]

This summary captures the key insights from the conference call, highlighting the competitive landscape between AMD and NVIDIA in the AI inference market.
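The report's headline metric, cost per million tokens, follows directly from sustained throughput and GPU pricing. The sketch below shows that arithmetic with placeholder prices and throughput; none of the numbers are taken from the SemiAnalysis report.

```python
# Illustrative arithmetic only: deriving cost per million output tokens from GPU
# pricing and sustained throughput. The hourly rate, GPU count, and throughput
# are placeholders, not figures from the SemiAnalysis report.
def cost_per_million_tokens(gpu_hourly_usd: float, num_gpus: int,
                            tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return (gpu_hourly_usd * num_gpus) / tokens_per_hour * 1_000_000

# e.g. an 8-GPU server rented at $2.50 per GPU-hour sustaining 20,000 tok/s aggregate:
print(f"${cost_per_million_tokens(2.50, 8, 20_000):.3f} per 1M tokens")   # ~$0.278
```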
LLM Inference 和 LLM Serving 视角下的 MCP
AI前线· 2025-05-16 07:48
Core Viewpoint
- The article stresses the importance of distinguishing LLM Inference from LLM Serving, since the rapid development of LLM technology has blurred these concepts across the industry [1][3].

Summary by Sections

LLM Inference and LLM Serving Concepts
- LLM Inference is the process of running a trained LLM to generate predictions or outputs from user inputs, focusing on execution of the model itself [5].
- LLM Serving is oriented toward user and client needs, applying IT engineering practice to the challenges of operating large language models as a service [7].

Characteristics and Responsibilities
- LLM Inference is compute-intensive and typically requires specialized hardware such as GPUs or TPUs [4].
- LLM Inference is responsible for the model's runtime state and execution, while LLM Serving covers the end-to-end service path, including request handling and model management [10].

Technical Frameworks
- vLLM is cited as a typical LLM Inference framework, optimizing memory usage during inference [5][7].
- KServe is presented as an example of LLM Serving, providing model versioning and a standardized serving experience across machine learning frameworks [7][10]. A minimal sketch contrasting the two perspectives follows this summary.

Model Context Protocol (MCP)
- MCP is described as a standardized protocol that connects AI models to various data sources and tools, acting as a bridge between LLM Inference and LLM Serving [11][12].
- Architecturally, MCP plays a role similar to LLM Serving while also touching aspects of LLM Inference [12][16].

Future Development of MCP
- The article predicts that MCP will evolve to strengthen authentication, load balancing, and other infrastructure services, while more clearly delineating the responsibilities of LLM Inference and LLM Serving [17].
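To make the inference/serving split concrete, here is a minimal sketch assuming a local vLLM installation and a vLLM OpenAI-compatible server already running on localhost:8000; the model name is a placeholder, and nothing here is specific to KServe or MCP.

```python
# Minimal sketch of the distinction drawn in the article, under an assumed local setup.
import requests
from vllm import LLM, SamplingParams

# "LLM Inference": run the model in-process and manage its execution yourself.
llm = LLM(model="Qwen/Qwen3-0.6B")  # model choice is illustrative
out = llm.generate(["What is MCP?"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)

# "LLM Serving": the model sits behind a managed endpoint (here a vLLM
# OpenAI-compatible server assumed to be running on localhost:8000); the client
# deals with requests and responses, not with the model's runtime state.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"model": "Qwen/Qwen3-0.6B", "prompt": "What is MCP?", "max_tokens": 64},
)
print(resp.json()["choices"][0]["text"])
```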
腾讯、华为、微软、阿里专家齐聚一堂,共谈推理优化实践 | AICon
AI前线· 2025-04-23 07:28
Core Viewpoint
- The article emphasizes the rapid evolution of artificial intelligence and the critical role of inference performance optimization for large models in addressing computational cost, memory bottlenecks, and communication pressure [1].

Summary by Sections

Inference Performance Optimization
- Current optimization work focuses on three areas: model optimization, inference acceleration, and engineering optimization. Techniques such as quantization, pruning, and distillation reduce computational complexity and improve inference efficiency (a toy quantization sketch follows this summary) [1].
- The DeepSeek-R1-Distill-Qwen-32B model uses a distillation strategy to sharply compress resource consumption while retaining strong performance [1].

AICon Conference
- The AICon global AI development and application conference will take place on May 23-24, featuring a special forum on "Strategies for Optimizing Inference Performance of Large Models" led by industry practitioners [1][10].

Expert Presentations
- **Xiang Qianbiao - Tencent**: His talk covers the AngelHCF inference acceleration framework, detailing its work on operator design, communication optimization, and architecture adjustments, which yield significant cost and performance advantages [1][2].
- **Zhang Jun - Huawei**: He will discuss optimization practice on Huawei's Ascend AI stack, focusing on hybrid model advantages, kernel optimization, and strategies for very large MoE models to relieve communication bottlenecks [3][4].
- **Jiang Huiqiang - Microsoft**: His talk addresses efficient long-text methods centered on KV caching, exploring the challenges and strategies involved in inference [5][7].
- **Li Yuanlong - Alibaba Cloud**: He will present cross-layer optimization practices for large model inference, covering operator fusion, model quantization, and dynamic batching to maximize hardware resource efficiency [6][8].

Technical Trends and Future Directions
- The article highlights the importance of understanding the full lifecycle of the KV cache and its impact on long-text processing, as well as the need for end-to-end optimization from model architecture down to hardware acceleration [7][8].
- The conference will also explore collaborative optimization strategies and the future landscape of inference performance work, including model parallelism and hardware selection [10].
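As a concrete example of one technique named above, here is a toy sketch of post-training int8 weight quantization (symmetric, per-tensor). Production frameworks use per-channel scales, calibration data, and fused kernels; this shows only the core arithmetic and is not tied to any speaker's implementation.

```python
# Toy post-training int8 weight quantization (symmetric, per-tensor). Real systems
# use per-channel scales, calibration data, and fused kernels; this shows only the
# core arithmetic.
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0                       # map largest magnitude to int8 range
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)              # stand-in for a weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```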
与 00 后开源者聊 DeepSeek 开源周:一直开源最强模型,可能是不想赚钱,也可能是想推动更大变化丨开源对话#2
晚点LatePost· 2025-02-27 14:03
"当 AI 足够强大后,开源还是不是一个好选择?" 整理丨刘倩 程曼祺 嘉宾丨美国西北大学 MLL Lab 博士王子涵 ▲扫描上图中的二维码,可收听播客。《晚点聊 LateTalk》#102 期节目。欢迎在小宇宙、喜马拉雅、苹果 Podcast 等渠道关注、收听我们。 《晚点聊 LateTalk》是《晚点 LatePost》 推出的播客节目。"最一手的商业、科技访谈,最真实的从业者思考。" 这是《晚点 LatePost》 「开源对话」系列的第 2 篇。该系列将收录与开源相关的访谈与讨论。系列文章见文末的合集#开源对话。 上周五,DeepSeek 在官方 Twitter 上预告了下一周会连续 5 天开源 5 个代码库,进入 "open-source week"开源周。 目前 DeepSeek 已放出的 4 个库,主要涉及 DeepSeek-V3/R1 相关的训练与推理代码 。 这是比发布技术报告和开源模型权重更深度的开源。 有了训练和推理 工具,开发者才能更好地在自己的系统里,实现 DeepSeek 系列模型的高效表现。 (注:所有 4 个库和后续开源可见 DeepSeek GitHub 中的 Open-Inf ...