vLLM
Clawdbot adaptation for domestic chips is complete! A Tsinghua special-award winner steps in with an open-source framework for one-click deployment
量子位· 2026-02-03 04:52
Core Viewpoint
- Clawdbot, now known as OpenClaw, has gained significant popularity, reaching 120,000 stars on GitHub within a week, with its Mac mini accessories sold out and rapid integration by major companies like Alibaba and Tencent [1][4].

Group 1: Clawdbot Features and Functionality
- Clawdbot transforms AI from a standard chatbot into a 24/7 AI employee, capable of performing tasks while users are occupied or asleep [5].
- It can respond to messages on mobile devices and proactively notify users upon task completion [6].
- Users have reported high costs associated with using Clawdbot, as it can quickly consume hundreds of dollars in token fees for minimal output [10].

Group 2: Introduction of Xuanwu CLI
- Xuanwu CLI is a new open-source framework that allows users to run Clawdbot locally without needing to purchase a Mac mini or incur API costs, making it more accessible [13][14].
- It simplifies the local deployment of models, providing an "app store-like" experience for users to select and use models without complex configurations [18].
- The command system of Xuanwu CLI is highly compatible with Ollama, allowing an easy transition for users familiar with that platform [20].

Group 3: Technical Advantages of Xuanwu CLI
- Xuanwu CLI supports local AI engines, enabling integration with Clawdbot for continuous operation and interaction [25].
- It is designed to be user-friendly, requiring minimal setup and allowing for quick service startup, often within one minute [29].
- The framework is compatible with OpenAI API standards, facilitating easy integration with existing applications and reducing the cost of switching from cloud to local models (a minimal client sketch follows this summary) [30].

Group 4: Adaptation to Domestic Chips
- Xuanwu CLI is uniquely adapted to domestic chips, providing a cost-effective solution for running models locally, unlike other solutions that primarily rely on NVIDIA hardware [34].
- It addresses common issues faced with domestic chips, such as configuration complexity and performance variability, by encapsulating hardware differences and providing a unified resource pool [39].
- The architecture of Xuanwu CLI allows for intelligent scheduling and optimal resource allocation, ensuring stability and performance across different hardware setups [46].

Group 5: Company Background
- Qingmiao Intelligent, founded in 2022, focuses on chip adaptation and the optimization of models, frameworks, and operators [48].
- The company has received significant investment and aims to create a comprehensive optimization system from hardware to intelligent agents [51].
- Qingmiao has successfully developed various domestic integrated machine solutions, achieving high performance and adaptability across multiple chip platforms [52].
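Since the summary highlights Xuanwu CLI's OpenAI-compatible API, here is a minimal client sketch of what calling such a locally served endpoint typically looks like. The base URL, port, and model name are placeholder assumptions for illustration, not values from Xuanwu CLI's documentation.

```python
# Hypothetical call to a locally served, OpenAI-compatible endpoint.
# Base URL, port, and model name are illustrative assumptions only.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="local-model",  # placeholder; use whatever name the local server exposes
    messages=[{"role": "user", "content": "Summarize today's unread messages."}],
)
print(resp.choices[0].message.content)
```

Because the endpoint follows the OpenAI wire format, switching an existing application from a cloud model to a local one is, in principle, a matter of changing the base URL and model name.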
LLM-in-Sandbox: Give large models a computer to unlock general agent capabilities
机器之心· 2026-01-30 04:25
The capabilities of large models are being unlocked step by step by different paradigms: In-Context Learning showed that models can generalize to new tasks without fine-tuning; Chain-of-Thought improves complex problem solving by guiding the model to reason step by step; more recently, agent frameworks have given models the ability to call tools and interact over multiple turns. Along this line of technical evolution, what comes next? Recently, researchers from the Gaoling School of Artificial Intelligence at Renmin University of China, Microsoft Research, and Tsinghua University proposed a simple yet effective paradigm: LLM-in-Sandbox — letting the large model freely explore a code sandbox (i.e., a virtual computer) to complete tasks. Experiments show that the paradigm is effective not only on code tasks but also significantly improves model performance in many non-code domains such as mathematics, physics, chemistry, biomedicine, long-text understanding, and instruction following, all without additional training, while substantially reducing token consumption in long-text scenarios and maintaining comparable inference speed. The researchers have open-sourced LLM-in-Sandbox as a Python package that integrates seamlessly with mainstream inference backends such as vLLM and SGLang. They argue that LLM-in-Sandbox should become the default deployment paradigm for large models, replacing pure LLM inference. 1. Core idea: give the large model a computer. The computer may be the most general-purpose tool humanity has created; almost any task can be completed with one. This generality ...
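The article describes the paradigm but not the package's API, so the following is only a rough sketch of the loop it implies: the model writes code, a sandbox executes it, and the output is fed back as an observation. Every name in the sketch (ask_model, run_in_sandbox, solve) is invented for illustration and is not the LLM-in-Sandbox API.

```python
# Rough conceptual sketch of an LLM-in-Sandbox-style loop; all names are hypothetical.
# A real implementation would use an isolated container, not a bare subprocess.
import subprocess
import tempfile
import textwrap

def ask_model(history: str) -> str:
    """Placeholder for a call to any inference backend (e.g. a vLLM or SGLang server)."""
    raise NotImplementedError

def run_in_sandbox(code: str, workdir: str) -> str:
    """Execute model-written Python inside a working directory and capture its output."""
    proc = subprocess.run(
        ["python", "-c", textwrap.dedent(code)],
        cwd=workdir, capture_output=True, text=True, timeout=30,
    )
    return proc.stdout + proc.stderr

def solve(task: str, max_turns: int = 5) -> str:
    history = f"Task: {task}\nWrite Python to make progress; print your results."
    with tempfile.TemporaryDirectory() as workdir:
        for _ in range(max_turns):
            code = ask_model(history)                 # model proposes code
            result = run_in_sandbox(code, workdir)    # virtual computer runs it
            history += f"\n[executed]\n{result}"      # observation fed back to the model
            if "FINAL ANSWER" in result:
                return result
    return history
```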
vLLM team goes entrepreneurial with a 1.05 billion yuan seed round! Tsinghua special-award winner Kaichao You joins
量子位· 2026-01-23 05:03
Core Insights
- The core viewpoint of the article is the establishment of a new company, Inferact, by the core team behind the open-source inference framework vLLM, which has successfully raised $150 million in seed funding, achieving a valuation of $800 million [1][2][7].

Funding and Market Trends
- The $150 million seed round marks a new high in AI infrastructure funding and is one of the largest seed rounds in history [2].
- Investors highlight a shift in focus from training to inference as AI applications mature, with a growing need for low-cost, reliable operation of existing models [4][9].

Company Mission and Strategy
- Inferact aims to address the "inference bottleneck" by building the next-generation commercial engine to tackle large-scale deployment challenges [5].
- The company plans to maintain a dual approach, supporting vLLM as an independent open-source project while developing commercial products to enhance hardware efficiency for AI model deployment [12][14].

Technology and Market Validation
- vLLM has already been deployed in real-world industrial environments, including Amazon's core shopping application, validating its stability under high concurrency [10][11].
- The demand for low-cost, reliable operation of existing models has surpassed expectations for new model development [9].

Founding Team and Expertise
- Simon Mo, the CEO, has a background in machine learning systems design and was an early engineer at Anyscale, bringing experience in transforming research into industrial-grade products [26][27].
- Co-founder Woosuk Kwon, a PhD from UC Berkeley, contributed significant innovations to vLLM, including the PagedAttention algorithm [30][31].
- The team also includes Kaichao You, a Tsinghua University award winner, and experienced advisors from academia and industry, enhancing the company's technical and strategic capabilities [33][36].
Flash | a16z involved all the way: the father of vLLM founds AI inference startup Inferact, funded by a top-tier investor lineup at an $800 million valuation
Sou Hu Cai Jing· 2026-01-23 04:46
Core Insights
- Inferact, an AI startup founded by the creators of the open-source software vLLM, has completed a $150 million seed funding round, achieving a valuation of $800 million [2]
- The funding round was led by Andreessen Horowitz and Lightspeed Venture Partners, with participation from Sequoia Capital, Altitude Capital, Redpoint Ventures, and ZhenFund [2]
- Inferact focuses on the inference stage of AI, which involves running existing models efficiently and reliably, rather than building new models [2][4]

Company Overview
- Inferact was founded in November 2025 and is led by CEO Simon Mo, one of the original maintainers of the vLLM project [3]
- The company aims to support vLLM as an independent open-source project while also developing commercial products to help enterprises run AI models more efficiently on various hardware [4]
- The vLLM project, initiated at the University of California, Berkeley, has attracted contributions from thousands of developers in the AI industry [2][3]

Market Context
- The interest from investors reflects a broader shift in the AI industry, where developers can now utilize existing powerful models without waiting for significant upgrades [3]
- The inference stage is becoming a bottleneck, increasing costs and putting pressure on systems, which may worsen in the coming years [4]
- The significant seed funding indicates the scale of the market opportunity, with even minor efficiency improvements having a substantial impact on costs [4]

Application Example
- An example of vLLM's widespread application is Amazon, which relies on the software for both its cloud services and shopping applications to run internal AI systems [5]
Flash | a16z involved all the way: the father of vLLM founds AI inference startup Inferact, funded by a top-tier investor lineup at an $800 million valuation
Z Potentials· 2026-01-23 04:13
Core Insights
- Inferact, an AI startup founded by the creators of the open-source software vLLM, has raised $150 million in seed funding, achieving a valuation of $800 million [2]
- The company focuses on the inference stage of AI, where trained models begin to answer questions and solve tasks, predicting that the biggest challenge in the AI industry will shift from building new models to operating existing models efficiently and reliably [2][4]

Funding and Investment
- The seed round was led by Andreessen Horowitz and Lightspeed Venture Partners, with participation from Sequoia Capital, Altitude Capital, Redpoint Ventures, and ZhenFund [2]
- Andreessen Horowitz's involvement dates back to the early stages of the vLLM project, which became the first recipient of its "AI Open Source Grant Program" in 2023 [3]

Technology and Development
- Inferact's core technology is built around vLLM, an open-source project launched in 2023 to help enterprises efficiently deploy AI models on data center hardware [2][4]
- The company aims to support vLLM as an independent open-source project while also developing commercial products to help businesses run AI models more efficiently on various hardware [4]

Market Trends
- The AI industry is experiencing a shift where developers can utilize existing powerful models without waiting for significant upgrades, in contrast with the past when new model releases took years [3]
- The inference stage is becoming a bottleneck, increasing costs and putting pressure on systems, which may worsen in the coming years [4]

Business Strategy
- Inferact's significant seed funding reflects the scale of the market opportunity, indicating that even small efficiency improvements can have a substantial impact on costs [4]
- The company does not aim to replace or limit open-source projects but seeks to build a business that supports and expands the vLLM project [4]
vLLM team officially announces its startup: $150 million raised, with Tsinghua special-award winner Kaichao You as co-founder
机器之心· 2026-01-23 00:45
Editor | Zenan

vLLM, the cornerstone of large-model inference, has now become a startup.

News broke in the early hours of Friday, Beijing time: Inferact, an AI startup founded by the creators of the open-source software vLLM, has been officially established, raising $150 million (roughly 1 billion RMB) in its seed round at a valuation of $800 million. The company believes that the biggest challenge facing the AI industry is not building new models, but running existing models at low cost and with high reliability.

Unquestionably, the core of Inferact is the open-source project vLLM, launched in 2023 to help enterprises run AI models efficiently on data-center hardware.

[Screenshot of the vllm-project/vllm GitHub repository, showing roughly 68.2k stars and 12.8k forks]
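For context on what running models with vLLM looks like in practice, here is a minimal offline-inference sketch using vLLM's public Python API; the model name and sampling settings are illustrative choices, not taken from the article.

```python
# Minimal offline-inference sketch with vLLM's Python API.
# The model name and sampling settings below are illustrative assumptions.
from vllm import LLM, SamplingParams

prompts = [
    "Explain what PagedAttention does in one sentence.",
    "Why has AI infrastructure spending shifted toward inference?",
]
sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

# Loads the model once and batches requests through the paged KV cache.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt)
    print(output.outputs[0].text.strip())
```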
Hear the authors of LLaMA Factory, vLLM, and RAGFlow share first-hand the growth rules of top open-source projects | GOBI 2025
AI科技大本营· 2025-12-17 09:42
Core Insights
- The article discusses the challenges of maintaining open-source projects, emphasizing that while initiating a project is easy, sustaining it requires significant effort and dedication [1][2]
- The GOBI 2025 Global Open-source Business Innovation Conference aims to address these challenges by bringing together successful open-source contributors to share their experiences and strategies [2][14]

Group 1: Conference Overview
- The GOBI 2025 conference will feature prominent figures from the open-source community, including contributors from projects with over 60,000 stars on GitHub [2][14]
- The event will take place on December 21, from 10:00 to 17:15, at the Renaissance Beijing Dongsheng Hotel [5][19]
- The conference will include various panels discussing the evolution of open-source communities and the intersection of AI and business [6][19]

Group 2: Key Themes and Discussions
- The conference will explore how to transition from individual contributions to community-driven projects, focusing on leveraging community power for personal and project growth [3][14]
- Discussions will include strategies for converting observers into co-creators, igniting project momentum, and fostering a sense of community among members [3][14]
- The event will feature keynote speeches and roundtable discussions on sustainable open-source development and the commercialization of open-source in the AI era [20][21]
DeepSeek pushes vLLM to upgrade as chip competition heats up and MoE sweeps through thousands of models; vLLM's core maintainer responds exclusively on how PyTorch keeps it on the "iron throne" of inference
36Ke· 2025-12-15 00:36
Core Insights
- vLLM has rapidly become a preferred inference engine for global tech companies, with GitHub stars increasing from 40,000 to 65,000 in just over a year, driven by the open-source PagedAttention technology (a conceptual sketch of the block-based KV cache idea follows this summary) [1]
- Neural Magic played a crucial role in vLLM's success, utilizing a "free platform + open-source tools" strategy to build a robust enterprise-level inference stack and maintain a library of pre-optimized models [1]
- Red Hat's acquisition of Neural Magic in November 2024, including key team members like Michael Goin, is expected to enhance vLLM's competitive edge in the AI large model sector [1][2]

Development and Optimization
- The vLLM core team, led by Michael Goin, has shifted focus from optimizing Llama models to enhancing features related to the DeepSeek model, particularly with the release of DeepSeek R1 [3]
- The development cycle for version 0.7.2 was tight, efficiently supporting Qwen 2.5 VL and introducing a Transformers backend for running Hugging Face models [3]
- Version 0.7.3 marked a significant update with numerous contributors involved, enhancing DeepSeek with multi-token prediction and MLA attention optimizations, as well as expanding support for AMD hardware [4]

Hardware Compatibility and Ecosystem
- The vLLM team is committed to building an open and efficient hardware inference ecosystem, supporting various mainstream chips and collaborating closely with hardware teams like NVIDIA and AMD [8]
- The integration of PyTorch as a foundational layer allows vLLM to support a wide range of hardware, simplifying the adaptation process for hardware vendors [10][11]
- The team's collaboration with hardware partners ensures that vLLM can maintain high performance across different platforms, with a focus on optimizing the architecture for new hardware like the Blackwell chip [8][9]

Multi-Modal Capabilities
- vLLM has evolved from a text-only inference engine to a unified service platform supporting multi-modal generation and understanding, including text, images, audio, and video [17][19]
- The introduction of multi-modal prefix caching significantly improves efficiency in processing various input types, while the decoupling of encoders enhances resource utilization for large-scale inference [18][19]
- The release of vLLM-Omni marks a milestone in multi-modal inference, allowing for seamless integration and resource allocation across different modalities [19][21]

Community and Feedback Loop
- The growing trend of companies contributing modifications back to the upstream vLLM project reflects a positive feedback loop driven by the speed of community version iterations [22][23]
- Collaboration with leading model labs and companies enables rapid feedback collection, ensuring that vLLM remains competitive and aligned with industry developments [23][24]
- The vLLM team is actively addressing developer concerns, such as startup speed, by implementing tracking projects and optimizing performance through community engagement [24][25]

Strategic Positioning
- Red Hat's deep involvement in vLLM is rooted in the strategic understanding that inference is a critical component of AI application costs, aiming to integrate cutting-edge model optimizations [26][27]
- The governance structure of vLLM is decentralized, with contributions from multiple organizations, allowing Red Hat to influence the project while adhering to open-source principles [26][27]
- The collaboration with the PyTorch team has led to significant improvements in supporting new hardware and models, reinforcing vLLM's position as a standard in inference services [27]
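Since PagedAttention comes up repeatedly in this piece, a tiny conceptual sketch of the idea behind it may help: the KV cache is stored in fixed-size blocks, and each sequence keeps a block table mapping logical positions to physical blocks, so memory is allocated on demand rather than pre-reserved for the maximum sequence length. This illustrates the concept only and is not vLLM's actual implementation.

```python
# Conceptual illustration of block-based KV cache allocation (the idea behind
# PagedAttention), NOT vLLM's real data structures.
BLOCK_SIZE = 16  # tokens per KV block

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)

class Sequence:
    """Maps a growing sequence to physical KV blocks via a block table."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new physical block only when the previous one is full,
        # so memory grows in BLOCK_SIZE increments instead of being
        # pre-reserved for the maximum possible sequence length.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=1024)
seq = Sequence(allocator)
for _ in range(40):          # 40 tokens -> ceil(40 / 16) = 3 blocks
    seq.append_token()
print(seq.block_table)       # e.g. [1023, 1022, 1021]
```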
LMCache: An LLM inference optimization solution based on KV cache reuse
Xin Lang Cai Jing· 2025-12-09 13:41
Core Insights
- The article discusses the importance of Time-To-First-Token (TTFT) in LLM inference services, emphasizing that a shorter TTFT leads to a better user experience, but practical deployments often face challenges [1][15].

Group 1: LMCache Overview
- LMCache proposes a solution for TTFT by implementing a KV cache persistence and reuse mechanism; it is open-source and deeply integrated with vLLM [1][16].
- Traditional methods require recalculating KV caches for each input, while LMCache allows KV caches to be stored not only in GPU memory but also in CPU memory and on disk, enabling faster retrieval for repeated text (a conceptual sketch of this pattern follows this summary) [2][18].

Group 2: Performance Improvements
- Testing shows that when used with vLLM, LMCache can improve response speeds by 3 to 10 times in scenarios like multi-turn conversations and RAG applications [2][18].
- The cache read speed is approximately 7 times faster than native solutions, with increased throughput, allowing text matches regardless of their position in the prompt [5][19].

Group 3: Storage and Integration Features
- LMCache supports multi-level storage across GPU memory, CPU memory, and disk, which can significantly reduce GPU load [6][20].
- It features deep integration with vLLM v1, supporting cross-device sharing of KV caches and cross-node transmission, making it compatible with tools like llm-d and KServe in production environments [7][21].

Group 4: Installation and Requirements
- Currently, LMCache primarily supports Linux, with Windows compatibility available through WSL or community adaptations [9][23].
- Basic requirements include Python 3.9+, NVIDIA GPUs (such as V100 or H100), and CUDA 12.8 or higher, with offline functionality available post-installation [10][24].

Group 5: Summary and Future Outlook
- The concept of KV cache reuse is becoming standard, and LMCache implements it comprehensively with features like multi-level storage and arbitrary-position matching, effectively addressing real-world issues [14][26].
- While LMCache is primarily tied to the vLLM ecosystem and focuses on Linux, it is an open-source solution worth monitoring, as AMD GPU support is still being developed [14][27].
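To make the reuse idea concrete, below is a purely conceptual sketch of prefix-chunk hashing plus tiered (GPU/CPU/disk) storage, the general pattern the article attributes to LMCache. The class names, tier sizes, and chunking parameters are assumptions for illustration, not LMCache's real API.

```python
# Illustrative sketch of chunk-keyed KV cache reuse across storage tiers.
# Not LMCache's real API; names, tiers, and sizes are arbitrary assumptions.
from __future__ import annotations

import hashlib
from collections import OrderedDict

CHUNK_TOKENS = 256  # reuse granularity: hash fixed-size token chunks

def chunk_keys(token_ids: list[int]) -> list[str]:
    """Hash each full chunk so identical text maps to the same key,
    regardless of where it appears in the prompt."""
    keys = []
    full_len = len(token_ids) - len(token_ids) % CHUNK_TOKENS
    for i in range(0, full_len, CHUNK_TOKENS):
        chunk = str(token_ids[i:i + CHUNK_TOKENS]).encode("utf-8")
        keys.append(hashlib.sha256(chunk).hexdigest())
    return keys

class TieredKVStore:
    """GPU -> CPU -> disk lookup order; hotter tiers are smaller but faster."""
    def __init__(self, gpu_slots: int, cpu_slots: int):
        self.gpu: OrderedDict[str, bytes] = OrderedDict()
        self.cpu: OrderedDict[str, bytes] = OrderedDict()
        self.disk: dict[str, bytes] = {}
        self.gpu_slots, self.cpu_slots = gpu_slots, cpu_slots

    def get(self, key: str) -> bytes | None:
        for tier in (self.gpu, self.cpu, self.disk):
            if key in tier:
                return tier[key]   # hit: prefill for this chunk can be skipped
        return None                # miss: this chunk must be recomputed

    def put(self, key: str, kv_blob: bytes) -> None:
        self.gpu[key] = kv_blob
        if len(self.gpu) > self.gpu_slots:        # spill oldest entry downward
            k, v = self.gpu.popitem(last=False)
            self.cpu[k] = v
        if len(self.cpu) > self.cpu_slots:
            k, v = self.cpu.popitem(last=False)
            self.disk[k] = v
```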