vLLM
Hear the Creators of LLaMA Factory, vLLM, and RAGFlow Share First-Hand the Growth Rules of Top Open-Source Projects | GOBI 2025
AI科技大本营· 2025-12-17 09:42
For developers, open-sourcing a project is easy: a single command is enough. Maintaining one, however, means fixing bugs and polishing documentation alone on top of your day job, revising PRs late at night that nobody looks at, and facing the Issues that pile in, all by yourself. Staring at a quiet repository, one question keeps nagging at you in the small hours: "Open-sourcing is easy, so why is it so hard to keep a project alive?" You have surely looked up at those projects with tens of thousands of Stars on GitHub, full of admiration yet feeling they are slightly out of reach, and wished your own project's Stars could grow from 1 to 10,000. So how do you make it through this "darkest hour"? To answer that question, at the GOBI 2025 Global Open-Source Business Innovation Conference on December 21 (this Sunday), the organizing committee has brought to the stage the people who have actually worn the crown and gone further still. They are not theorists but practitioners who fought their way out of the trenches. How is an open-source project with tens of thousands of Stars forged? Zheng Yaowei, author of LLaMA Factory. On the panel "Joining Forces: The Evolution and Future of Open-Source Communities, Gathered Sparks Can Become a Galaxy", Zheng Yaowei of the 60,000+ Star LLaMA Factory, vLLM community core contributor Zhang Jiaju, and the rising star of enterprise-grade RAG engines, RAG ...
DeepSeek Forces vLLM to Upgrade, Chip Competition Intensifies, and MoE Sweeps Thousands of Models; a vLLM Core Maintainer Responds Exclusively: How PyTorch Secures the "Iron Throne" of Inference
36Kr· 2025-12-15 00:36
Core Insights
- vLLM has rapidly become a preferred inference engine for global tech companies, with GitHub stars increasing from 40,000 to 65,000 in just over a year, driven by the open-source PagedAttention technology [1]
- Neural Magic played a crucial role in vLLM's success, utilizing a "free platform + open-source tools" strategy to build a robust enterprise-level inference stack and maintain a library of pre-optimized models [1]
- Red Hat's acquisition of Neural Magic in November 2024, including key team members like Michael Goin, is expected to enhance vLLM's competitive edge in the AI large model sector [1][2]

Development and Optimization
- The vLLM core team, led by Michael Goin, has shifted focus from optimizing Llama models to enhancing features related to the DeepSeek model, particularly with the release of DeepSeek R1 [3]
- The development cycle for version 0.7.2 was tight, efficiently supporting Qwen 2.5 VL and introducing a Transformers backend for running Hugging Face models [3]
- Version 0.7.3 marked a significant update with numerous contributors involved, enhancing DeepSeek with multi-token prediction and MLA attention optimizations, as well as expanding support for AMD hardware [4]

Hardware Compatibility and Ecosystem
- The vLLM team is committed to building an open and efficient hardware inference ecosystem, supporting various mainstream chips and collaborating closely with hardware teams like NVIDIA and AMD [8]
- The integration of PyTorch as a foundational layer allows vLLM to support a wide range of hardware, simplifying the adaptation process for hardware vendors [10][11]
- The team's collaboration with hardware partners ensures that vLLM can maintain high performance across different platforms, with a focus on optimizing the architecture for new hardware like the Blackwell chip [8][9]

Multi-Modal Capabilities
- vLLM has evolved from a text-only inference engine to a unified service platform supporting multi-modal generation and understanding, including text, images, audio, and video [17][19]
- The introduction of multi-modal prefix caching significantly improves efficiency in processing various input types, while the decoupling of encoders enhances resource utilization for large-scale inference [18][19]
- The release of vLLM-Omni marks a milestone in multi-modal inference, allowing for seamless integration and resource allocation across different modalities [19][21]

Community and Feedback Loop
- The growing trend of companies contributing modifications back to the upstream vLLM project reflects a positive feedback loop driven by the speed of community version iterations [22][23]
- Collaboration with leading model labs and companies enables rapid feedback collection, ensuring that vLLM remains competitive and aligned with industry developments [23][24]
- The vLLM team is actively addressing developer concerns, such as startup speed, by implementing tracking projects and optimizing performance through community engagement [24][25]

Strategic Positioning
- Red Hat's deep involvement in vLLM is rooted in the strategic understanding that inference is a critical component of AI application costs, aiming to integrate cutting-edge model optimizations [26][27]
- The governance structure of vLLM is decentralized, with contributions from multiple organizations, allowing Red Hat to influence the project while adhering to open-source principles [26][27]
- The collaboration with the PyTorch team has led to significant improvements in supporting new hardware and models, reinforcing vLLM's position as a standard in inference services [27]
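For context on what using vLLM looks like in practice, below is a minimal sketch of offline inference with its Python API; the model name, sampling settings, and tensor_parallel_size value are illustrative assumptions rather than details from the interview.

```python
# A minimal offline-inference sketch with vLLM's Python API.
# Model name, sampling settings, and tensor_parallel_size are illustrative
# assumptions; adjust them to your checkpoint and GPU count.
from vllm import LLM, SamplingParams

prompts = [
    "Explain what PagedAttention does in one sentence.",
    "Why does paging the KV cache improve GPU memory utilization?",
]
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

# tensor_parallel_size shards the model across GPUs; 1 keeps it on a single GPU.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", tensor_parallel_size=1)

for output in llm.generate(prompts, sampling_params):
    print(output.prompt)
    print(output.outputs[0].text)
```

Raising tensor_parallel_size shards weights across several GPUs, the simplest of the parallelism modes covered later in this digest.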
LMCache: An LLM Inference Optimization Scheme Based on KV Cache Reuse
Xin Lang Cai Jing· 2025-12-09 13:41
(Source: DeepHub IMBA)

```python
# Old way: slow as molasses
def get_answer(prompt):
    memory = build_memory_from_zero(prompt)  # GPU cries
    return model.answer(memory)
```

In LLM inference serving, TTFT (Time-To-First-Token) has long been a core metric: the shorter the wait between a user sending a request and seeing the first output token, the better the experience, yet real-world deployments run into all sorts of problems here. LMCache tackles TTFT with a scheme for persisting and reusing KV caches. The project is open source and is already deeply integrated with vLLM.

How it works

Large-model inference has one defining trait: every time input text is processed, the KV cache has to be computed anew. The KV cache can be thought of as the intermediate state the model produces while "reading" the text, much like notes it takes along the way. The problem is that traditional setups never reuse these "notes": if the same text comes in again, the entire KV cache is recomputed from scratch. LMCache instead stores the KV cache, not only in GPU memory but also in CPU memory and on disk. The next time the same text shows up (note: not just prefix matches, but text reuse at any position), the cache is fetched directly and the redundant computation is skipped. Measured results: paired with vLLM, in scenarios such as multi-turn dialogue and RAG, response ...
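The excerpt's toy "old way" snippet is cut off before its counterpart; as a purely conceptual sketch of the reuse idea (not the actual LMCache or vLLM API), a "new way" that checks GPU, CPU, and disk tiers for a stored KV cache before recomputing might look like the following, with the model and prefill function stubbed out.

```python
# Conceptual sketch only; not the actual LMCache or vLLM API.
# build_memory_from_zero and StubModel stand in for a real model whose
# KV cache ("memory") is expensive to build during prefill.
import hashlib

def build_memory_from_zero(prompt: str) -> str:
    return f"kv-cache for: {prompt}"            # stands in for an expensive prefill

class StubModel:
    def answer(self, memory: str) -> str:
        return f"answer derived from ({memory})"

model = StubModel()

cache_gpu, cache_cpu, cache_disk = {}, {}, {}   # fastest to slowest storage tiers

def kv_key(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def get_answer_cached(prompt: str) -> str:
    key = kv_key(prompt)
    for tier in (cache_gpu, cache_cpu, cache_disk):   # look for reusable "notes"
        if key in tier:
            memory = tier[key]
            break
    else:
        memory = build_memory_from_zero(prompt)       # pay the prefill cost once
        cache_gpu[key] = memory                       # persist for later requests
    return model.answer(memory)

print(get_answer_cached("What is TTFT?"))   # first call builds the KV cache
print(get_answer_cached("What is TTFT?"))   # second call reuses it, cutting TTFT
```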
Open Source Breaks the Deadlock in AI Adoption: Technology Equity for SMEs and the Giants' Covert Ecosystem War
21 Shi Ji Jing Ji Bao Dao· 2025-11-11 14:20
Core Insights
- The competition between open-source and closed-source AI solutions has evolved, with open-source significantly impacting the speed and model of AI deployment in enterprises [1]
- Over 50% of surveyed companies are utilizing open-source technologies in their AI tech stack, with the highest adoption in the technology, media, and telecommunications sectors at 70% [1]
- Open-source allows for rapid customization of solutions based on specific business needs, contrasting with closed-source tools that restrict access to core technologies [1]

Group 1
- The "hundred model battle" in open-source AI has lowered the technical barriers for small and medium enterprises, making models more accessible for AI implementation [1]
- Companies face challenges in efficiently utilizing heterogeneous resources, including diverse computing power and various deployment environments [2]
- Open-source ecosystems can accommodate different business needs and environments, enhancing resource management [3]

Group 2
- The narrative around open-source AI is shifting from "building models" to "running models," focusing on ecosystem development rather than just algorithm competition [4]
- Companies require flexible and scalable AI application platforms that balance cost and information security, with AI operating systems (AI OS) serving as the core hub for task scheduling and standard interfaces [4][5]
- The AI OS must support multiple models and hardware through standardized and modular design to ensure efficient operation [5]

Group 3
- Despite the growing discussion around inference engines, over 51% of surveyed companies have yet to deploy any inference engine [5]
- vLLM, developed by the University of California, Berkeley, aims to enhance LLM inference speed and GPU resource utilization while being compatible with popular model libraries [6]
- Open-source inference engines like vLLM and SGLang are more suitable for enterprise scenarios due to their compatibility with multiple models and hardware, allowing companies to choose the best technology without vendor lock-in [6]
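One concrete way to see the "no vendor lock-in" point is that vLLM exposes an OpenAI-compatible HTTP API, so application code written against the OpenAI client can be pointed at a self-hosted model. The sketch below assumes a server started separately; the model name, port, and prompt are illustrative, not taken from the survey.

```python
# Sketch: calling a self-hosted vLLM server through its OpenAI-compatible API.
# Assumes a server was started separately, for example with:
#   vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000
# Model name, port, and prompt are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Summarize what an inference engine does."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```

Because the endpoint mimics the OpenAI schema, switching between a hosted API and a self-hosted vLLM or SGLang deployment is largely a matter of changing base_url and the model name.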
Deep Teardown, Hardcore Deconstruction: Unveiling How the vLLM Inference System Achieves High Throughput
机器之心· 2025-10-26 04:03
Core Insights - The article discusses the rapid development of large model applications and the focus on making inference faster and more efficient, highlighting the emergence of vLLM as a high-performance inference framework specifically optimized for large language models [1][4]. Inference Engine Basics - The vLLM framework includes fundamental processes such as input/output request handling, scheduling, paged attention, and continuous batching [4]. - Advanced features of vLLM include chunked prefill, prefix caching, guided decoding, speculative decoding, and decoupled prefill/decoding [4]. Performance Measurement - The performance of the inference system is measured through metrics such as latency (including time to first token, iteration latency, end-to-end latency, and throughput time) and throughput, along with GPU performance roofline models [4]. Architecture and Components - The LLM engine is the core module of vLLM, capable of achieving high throughput inference in offline scenarios [8]. - Key components of the engine include the engine core, processor, output processor, model executor, and scheduler, each playing a critical role in the inference process [15][16]. Scheduling Mechanism - The scheduling mechanism prioritizes decode requests over prefill requests, allowing for more efficient processing of inference tasks [38][39]. - The vLLM V1 scheduler can intelligently mix prefill and decode requests within the same step, enhancing overall efficiency [39]. Advanced Features - Chunked prefill allows for processing long prompts by breaking them into smaller chunks, preventing resource monopolization [57]. - Prefix caching avoids redundant computations for shared tokens across multiple prompts, significantly speeding up prefill requests [69][73]. Guided and Speculative Decoding - Guided decoding utilizes a finite state machine to constrain logits based on grammar rules, ensuring only syntactically valid tokens are sampled [93][95]. - Speculative decoding introduces a draft model to quickly generate candidate tokens, reducing the time required for each forward pass in autoregressive generation [106][110]. Distributed System Deployment - vLLM can be deployed across multiple GPUs and nodes, utilizing tensor and pipeline parallelism to manage large models that exceed single GPU memory limits [146][150]. - The architecture supports both data parallelism and load balancing, ensuring efficient handling of incoming requests [130][156].
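To make the prefix-caching idea concrete, here is a small conceptual sketch (not vLLM's actual implementation), assuming a toy block size and string stand-ins for KV data: prompts are split into fixed-size token blocks, each block is keyed by a hash of the full prefix ending at that block, and blocks already produced for earlier prompts are reused instead of recomputed.

```python
# Conceptual prefix-caching sketch; not vLLM's real implementation.
# BLOCK_SIZE and the string "KV" payloads are illustrative assumptions.
import hashlib

BLOCK_SIZE = 4                        # tokens per KV block (real systems use larger blocks)
kv_block_cache: dict[str, str] = {}   # hash of the prefix -> stored KV block

def block_hash(prefix: tuple[str, ...]) -> str:
    return hashlib.sha256(" ".join(prefix).encode()).hexdigest()

def prefill(tokens: list[str]) -> list[str]:
    """Return the KV blocks for a prompt, reusing blocks shared with earlier prompts."""
    blocks, computed = [], 0
    for start in range(0, len(tokens), BLOCK_SIZE):
        prefix = tuple(tokens[: start + BLOCK_SIZE])   # key covers the whole prefix so far
        key = block_hash(prefix)
        if key not in kv_block_cache:
            kv_block_cache[key] = f"KV({' '.join(tokens[start:start + BLOCK_SIZE])})"
            computed += 1
        blocks.append(kv_block_cache[key])
    print(f"{computed} block(s) computed, {len(blocks) - computed} reused")
    return blocks

shared = "You are a helpful assistant .".split()
prefill(shared + "What is PagedAttention ?".split())    # first prompt: all blocks computed
prefill(shared + "Explain chunked prefill .".split())   # shared-prefix block is reused
```

Real engines key blocks by token IDs rather than strings and store actual key/value tensors, but the lookup-before-compute structure is the same idea behind prefix caching and LMCache-style reuse.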
The Road Toward Artificial Superintelligence
36Kr· 2025-09-29 09:33
Core Insights - The core viewpoint is that AI represents a new leap in technology, with the potential to enhance human intelligence and evolve into Artificial Superintelligence (ASI) beyond Artificial General Intelligence (AGI) [1][11][19] - The increasing adoption of AI Agents in business operations is leading to automation of repetitive tasks, improved efficiency, and enhanced decision-making capabilities [1][2][16] Group 1: AI Agent Adoption and Impact - A survey by PwC revealed that 79% of companies are already using AI Agents in some capacity, with 66% reporting productivity improvements and 57% noting cost reductions [1][2] - Major tech companies are actively developing AI Agents, with products like OpenAI's Agent Mode and Microsoft's Copilot gaining traction [2][3] - Alibaba Cloud's Bailian platform aims to provide a comprehensive environment for enterprises to develop and deploy AI Agents, integrating all necessary components for effective implementation [2][12] Group 2: Infrastructure and Model Development - Alibaba Cloud has upgraded to a "full-stack AI service provider," focusing on building robust infrastructure and foundational models to support AI Agent deployment [3][19] - The strength of foundational models, such as the Tongyi Qianwen series, is crucial for the performance of AI Agents, with recent evaluations showing competitive advantages over international counterparts [5][6] - The introduction of multiple new models at the Yunqi Conference demonstrates Alibaba Cloud's commitment to advancing AI capabilities across various applications [6][8] Group 3: Scalability and Reliability - Scalability is a primary requirement for AI platforms, with Alibaba Cloud offering serverless architectures to handle unpredictable traffic and resource demands [7][9] - High availability and stability are essential for enterprises to trust AI Agents in critical processes, with Alibaba Cloud ensuring low-cost, high-concurrency storage and reliable computing capabilities [7][9] - The integration of memory management and retrieval systems is vital for AI Agents to evolve and retain knowledge over time, enhancing their productivity [8][9] Group 4: Development Framework and Business Integration - Alibaba Cloud's "1+2+7" framework for enterprise-level AI Agents includes a model service, two development modes, and seven key capabilities to facilitate integration into business processes [13][14] - The dual-track approach allows companies to quickly prototype using low-code solutions and transition to high-code for deeper customization, reducing exploration costs and ensuring business continuity [14][15] - Successful implementations of AI Agents in various sectors, such as finance and recruitment, highlight the tangible benefits and efficiency gains achieved through Alibaba Cloud's solutions [15][16] Group 5: Strategic Positioning and Future Outlook - Alibaba Cloud's leadership in the AI and cloud computing market is underscored by its significant market share and the trust of over 100,000 enterprise customers [18][21] - The development of AI Agents is seen as a critical step in the evolution of AI from theoretical models to practical applications that drive business growth [19][21] - The comprehensive strategy of combining models, platforms, and infrastructure positions Alibaba Cloud as a global leader in the AI space, enabling local enterprises to innovate without relying on foreign solutions [21]
From Models to Ecosystem: Previewing the "Open-Source Models and Frameworks" Track at the 2025 Global Machine Learning Technology Conference
AI科技大本营· 2025-09-26 05:49
Core Insights - The article discusses the growing divide between open-source and closed-source AI models, highlighting that the performance gap has narrowed from 8% to 1.7% as of 2025, indicating that open-source models are catching up [1][12]. Open Source Models and Frameworks - The 2025 Global Machine Learning Technology Conference will feature a special topic on "Open Source Models and Frameworks," inviting creators and practitioners to share their insights and experiences [1][12]. - Various open-source projects are being developed, including mobile large language model inference, reinforcement learning frameworks, and efficient inference services, aimed at making open-source technology more accessible to developers [2][7]. Key Contributors - Notable contributors to the open-source projects include: - Wang Zhaode, a technical expert from Alibaba Taotian Group, focusing on mobile large language model inference [4][23]. - Chen Haiquan, an engineer from ByteDance, contributing to the Verl project for flexible and efficient reinforcement learning programming [4][10]. - Jiang Yong, a senior architect at Dify, involved in the development of open-source tools [4][23]. - You Kaichao, the core maintainer of vLLM, which provides low-cost large model inference services [4][7]. - Li Shenggui, a core developer of SGLang, currently a PhD student at Nanyang Technological University [4][23]. Conference Highlights - The conference will feature discussions on the evolution of AI competition, which now encompasses data, models, systems, and evaluation, with major players like Meta, Google, and Alibaba vying for dominance in the AI ecosystem [12][13]. - Attendees will have the opportunity to hear from leading experts, including Lukasz Kaiser, a co-inventor of GPT-5 and Transformer, who will provide insights into the future of AI technology [12][13]. Event Details - The conference is set to take place soon, with a focus on the latest technological insights and industry trends, encouraging developers to participate and share their experiences [12][13].
How Were the Most Popular Open-Source LLM Inference Frameworks, vLLM and SGLang, Forged?
AI科技大本营· 2025-09-24 02:01
Core Viewpoint - The article discusses the development stories of vLLM and SGLang, two prominent open-source inference engines for large language models (LLMs), highlighting their innovations, community engagement, and performance metrics. Group 1: LLM Inference Challenges - The core challenge of LLM inference lies in deploying models with hundreds of billions of parameters under strict constraints of latency, throughput, and cost [3] - The inference process involves applying learned knowledge to new data, which requires efficient computation and memory management [2][3] Group 2: vLLM Development - vLLM originated from a 2023 paper on PagedAttention, which innovatively applied operating system techniques for memory management, significantly enhancing throughput [7][8] - vLLM demonstrated remarkable performance improvements, handling up to 5 times the traffic and increasing throughput by 30 times compared to previous backends [9] - The project quickly evolved from a research initiative to a community-driven open-source project, amassing over 56,000 stars on GitHub and engaging thousands of developers [15][9] Group 3: SGLang Development - SGLang was developed from the paper "SGLang: Efficient Execution of Structured Language Model Programs," featuring RadixAttention for optimized performance [12] - SGLang retains the KVCache from previous requests to reduce computation during the prefill phase, showing significant performance advantages over traditional inference engines [12] - Although SGLang's community is smaller than vLLM's, it has over 2,000 participants and has shown rapid iteration and growth [13] Group 4: Community Engagement - vLLM has a robust community with over 12,000 participants in issues and pull requests, while SGLang's community is less than half that size [15][13] - Both projects have faced challenges in managing a growing number of issues and pull requests, with vLLM generally responding faster than SGLang [13] Group 5: Performance Metrics and Comparisons - vLLM and SGLang have both integrated advanced features like Continuous Batching and various attention mechanisms, leading to significant performance enhancements [29] - The competition between these two projects has intensified, with both claiming performance leadership in their respective releases [26] Group 6: Future Trends and Developments - The article notes that as the performance race heats up, both vLLM and SGLang are focusing on reproducible methods and real-world metrics rather than just benchmark results [26] - The trend indicates a convergence in model architectures and features among leading inference engines, with a shift in competition towards factors beyond performance [29] Group 7: Investment and Support - Both projects have attracted attention from investment firms and open-source foundations, with vLLM receiving support from a16z and SGLang being recognized in the PyTorch ecosystem [31][40]
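The PagedAttention idea at the heart of the vLLM story is easiest to picture through the operating-system analogy the paper itself draws: each sequence keeps a block table mapping logical KV-cache blocks to physical blocks drawn from a shared pool, so sequences can grow without reserving contiguous memory up front. The sketch below is only a conceptual illustration of that analogy, with an assumed block size and a trivial free-list allocator, not vLLM's real memory manager.

```python
# Conceptual PagedAttention-style block table; not vLLM's real memory manager.
# BLOCK_SIZE and the free-list allocator are illustrative assumptions.
BLOCK_SIZE = 4                                   # tokens per physical KV block

class KVBlockPool:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))      # indices of free physical blocks
        self.storage = [[] for _ in range(num_blocks)]

    def allocate(self) -> int:
        return self.free.pop()                   # grab any free block (non-contiguous)

class Sequence:
    def __init__(self, pool: KVBlockPool):
        self.pool = pool
        self.block_table: list[int] = []         # logical block -> physical block
        self.num_tokens = 0

    def append_token(self, kv_entry: str) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:    # current block full: map a new one
            self.block_table.append(self.pool.allocate())
        physical = self.block_table[self.num_tokens // BLOCK_SIZE]
        self.pool.storage[physical].append(kv_entry)
        self.num_tokens += 1

pool = KVBlockPool(num_blocks=8)
seq_a, seq_b = Sequence(pool), Sequence(pool)
for t in range(6):
    seq_a.append_token(f"a{t}")                  # sequences grow block by block,
    seq_b.append_token(f"b{t}")                  # interleaved in physical memory
print("seq_a block table:", seq_a.block_table)
print("seq_b block table:", seq_b.block_table)
```

Because blocks are allocated on demand and freed when a sequence finishes, memory that would otherwise be reserved for worst-case sequence lengths can instead serve more concurrent requests, which is where the throughput gains reported in the article come from.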
The LLM Open-Source 2.0 Reshuffle: 60 Out, 39 In, AI Coding Goes Wild, TensorFlow Is Dead
36Kr· 2025-09-17 08:57
Core Insights - Ant Group's open-source team unveiled the 2.0 version of the "2025 Large Model Open Source Development Ecosystem Panorama" at the Shanghai Bund Conference, showcasing significant changes in the open-source landscape [2][4][10] Group 1: Ecosystem Changes - The updated panorama includes 114 projects, a decrease of 21 from the previous version, with 39 new projects and 60 projects that have exited the stage, including notable ones like TensorFlow, which has been overtaken by PyTorch [4][5] - The overall trend indicates a significant reshuffling within the ecosystem, with a median age of only 30 months for projects, highlighting a youthful and rapidly evolving environment [5][10] - Since the "GPT moment" in October 2022, 62% of the projects have emerged, indicating a dynamic influx of new entrants and exits [5][10] Group 2: Project Performance - The top ten most active open-source projects reflect a focus on AI, LLM, Agent, and Data, indicating the primary areas of interest within the ecosystem [7][9] - The classification framework has evolved from broad categories to more specific segments, including AI Agent, AI Infra, and AI Data, emphasizing the shift towards an "agent-centric" era [10][19] Group 3: Contributions by Region - Among 366,521 developers, the US and China contribute over 55%, with the US leading at 37.41% [10][12] - In specific areas, the US shows a significant advantage in AI Infra and AI Data, with contributions of 43.39% and 35.76% respectively, compared to China's 22.03% and 21.5% [12][14] Group 4: Methodological Evolution - The methodology for selecting projects has shifted from a known starting point to a broader approach that captures high-activity projects, increasing the threshold for inclusion [15][18] - The new methodology aligns with Ant Group's goal of providing insights for internal decision-making and guidance for the open-source community [15][18] Group 5: AI Agent Developments - The AI Agent category has evolved into a structured system with various specialized tools, indicating a transition from chaotic growth to systematic differentiation [19][21] - AI Coding has expanded its capabilities, covering the entire development lifecycle and supporting multimodal and context-aware functionalities [23][27] Group 6: Market Trends - The report predicts significant commercial potential in AI Coding, with new revenue models emerging from subscription services and value-added features [24][27] - Chatbot applications have seen a peak but are now stabilizing, with a shift towards integrating knowledge management for long-term productivity [28][30] Group 7: Infrastructure and Operations - The Model Serving segment remains a key battleground, with high-performance cloud inference solutions like vLLM and SGLang leading the way [42][45] - LLMOps is rapidly growing, focusing on the full lifecycle management of models, emphasizing stability and observability [50][52] Group 8: Data Ecosystem - The AI Data sector appears stable, with many projects originating from the AI 1.0 era, but is facing challenges in innovation and engagement [58][60] - The evolution of data infrastructure is anticipated, moving from static repositories to dynamic systems that provide real-time insights for models [60][61] Group 9: Open Source Dynamics - A trend towards customized open-source licenses is emerging, allowing for more control and flexibility in commercial negotiations [62][63] - The landscape of open-source projects is being challenged, with some projects operating under 
restrictive licenses, raising questions about the definition of "open source" [62][63]

Group 10: Competitive Landscape
- The competitive landscape is marked by a divergence between open-source and closed-source models, with Chinese projects flourishing while Western firms tighten their open-source strategies [67][68]
- The introduction of MoE architectures and advancements in reasoning capabilities are becoming standard features in new models, indicating a shift in focus from scale to reasoning [69][70]