DeepSeek Researcher Replicates vLLM in 1,200 Lines of Code; Measured Performance on H800 Hardware Surpasses the Original
量子位 (QbitAI) · 2025-06-13 07:05
Core Viewpoint
- The article highlights Nano-vLLM, an open-source project by DeepSeek researcher Yu Xingkai that delivers a lightweight, fully readable reimplementation of vLLM in under 1,200 lines of code while maintaining performance comparable to the original framework [1][27].

Group 1: Project Overview
- Nano-vLLM has three headline characteristics: a minimal codebase, high readability, and competitive performance; its interface is presented as a drop-in counterpart to vLLM's (a hedged interface sketch appears at the end of this summary) [2].
- In a benchmark on RTX 4070 hardware with the Qwen3-0.6B model, vLLM reached 1,353.86 tokens/s in 98.95 seconds versus Nano-vLLM's 1,314.65 tokens/s in 101.90 seconds, so vLLM held a slight edge; by the reported figures both runs decode essentially the same total workload (roughly 134K tokens), so throughput scales inversely with wall time [3][4].
- On H800 hardware with the Qwen3-8B model, Nano-vLLM overtook vLLM, reaching 6,731.42 tokens/s in 86.73 seconds against vLLM's 5,916.89 tokens/s in 98.67 seconds, roughly a 14% throughput gain [9].

Group 2: Technical Insights
- vLLM is a framework for optimizing the inference and serving of large language models (LLMs), originally developed by the Sky Computing Lab at UC Berkeley [16].
- Its core technology is inspired by the operating system's virtual-memory paging mechanism, addressing the fragmentation that plagues contiguous storage of key-value (KV) caches [19].
- The PagedAttention algorithm stores KV pairs in non-contiguous fixed-size blocks, improving memory management and reducing waste, which lifts throughput 2-4x over earlier systems such as FasterTransformer and Orca (a minimal paged-KV-cache sketch follows after Group 3) [24].

Group 3: Features and Compatibility
- vLLM integrates seamlessly with popular Hugging Face models and supports various decoding algorithms for high-throughput serving, including parallel sampling and beam search (a usage sketch follows below) [25].
- It is compatible with multiple hardware platforms, including NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, TPUs, and AWS Neuron [26].
- The vLLM engine comprises roughly 8,500 lines of Python and 2,000 lines of C++/CUDA, while Nano-vLLM achieves similar functionality with a significantly smaller codebase [26][27].
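
To make the paging analogy in Group 2 concrete, here is a minimal, self-contained sketch of a paged KV cache in Python. It illustrates the idea only: the names (PagedKVCache, append_token, block_size) are hypothetical and do not correspond to vLLM's or Nano-vLLM's actual internals.

```python
# Illustrative sketch of the paged-KV-cache idea behind PagedAttention.
# All names here are hypothetical, not vLLM internals.
from typing import Dict, List

class PagedKVCache:
    """KV cache carved into fixed-size blocks, like OS virtual-memory pages.

    A sequence's KV pairs live in non-contiguous physical blocks; a
    per-sequence block table maps logical token positions to physical
    blocks, so memory is allocated block-by-block as the sequence grows.
    """

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size                  # tokens per block (e.g. 16)
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[int, List[int]] = {}  # seq_id -> physical block ids

    def append_token(self, seq_id: int, num_tokens_so_far: int) -> int:
        """Return the physical block for the next token, allocating on demand."""
        table = self.block_tables.setdefault(seq_id, [])
        if num_tokens_so_far % self.block_size == 0:  # current block full (or first token)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; scheduler must preempt a sequence")
            table.append(self.free_blocks.pop())      # any free block works: no contiguity needed
        return table[-1]

    def free_sequence(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

# Usage: two sequences interleave allocations without contiguous reservations.
cache = PagedKVCache(num_blocks=8, block_size=4)
for t in range(6):
    cache.append_token(seq_id=0, num_tokens_so_far=t)
    cache.append_token(seq_id=1, num_tokens_so_far=t)
print(cache.block_tables)  # e.g. {0: [7, 5], 1: [6, 4]} -> non-contiguous blocks
cache.free_sequence(0)
```

Because any free block can back any logical position, the allocator never needs a contiguous region sized for the maximum possible sequence length, which is precisely where pre-paging systems wasted memory.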
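
For the Hugging Face integration and decoding features mentioned in Group 3, the sketch below follows vLLM's documented offline-inference entry points (LLM and SamplingParams); the model id and sampling values are illustrative choices, not recommendations from the article.

```python
# Minimal vLLM usage sketch: loading a Hugging Face model and requesting
# parallel sampling (n candidates per prompt). Model id is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B")  # pulls weights straight from Hugging Face

# n=4 asks for four sampled completions per prompt (parallel sampling).
params = SamplingParams(n=4, temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["Explain paged attention in one sentence."], params)
for out in outputs:
    for candidate in out.outputs:
        print(candidate.text)
```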
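
Since the article credits Nano-vLLM with vLLM-comparable functionality in a fraction of the code, its interface would presumably mirror the vLLM call above. Everything in the sketch below is an assumption: the nanovllm import path, constructor signature, and output schema are guesses based on the project's stated goal, not verified against the repository.

```python
# Assumed Nano-vLLM interface mirroring vLLM; nothing below is verified.
from nanovllm import LLM, SamplingParams  # assumed package/module name

llm = LLM("Qwen/Qwen3-8B")  # assumed constructor signature
params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(["Hello, Nano-vLLM."], params)
print(outputs)  # exact output schema is assumed to parallel vLLM's
```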