vLLM

The Road to Artificial Superintelligence
36Kr · 2025-09-29 09:33
Core Insights
- The core viewpoint is that AI represents a new technological leap, with the potential to enhance human intelligence and evolve beyond Artificial General Intelligence (AGI) into Artificial Superintelligence (ASI) [1][11][19]
- The increasing adoption of AI Agents in business operations is automating repetitive tasks, improving efficiency, and enhancing decision-making [1][2][16]

Group 1: AI Agent Adoption and Impact
- A PwC survey found that 79% of companies already use AI Agents in some capacity, with 66% reporting productivity improvements and 57% noting cost reductions [1][2]
- Major tech companies are actively developing AI Agents, with products like OpenAI's Agent Mode and Microsoft's Copilot gaining traction [2][3]
- Alibaba Cloud's Bailian platform aims to provide a comprehensive environment for enterprises to develop and deploy AI Agents, integrating all the components needed for effective implementation [2][12]

Group 2: Infrastructure and Model Development
- Alibaba Cloud has repositioned itself as a "full-stack AI service provider," focusing on robust infrastructure and foundational models to support AI Agent deployment [3][19]
- The strength of foundational models such as the Tongyi Qianwen series is crucial to AI Agent performance, with recent evaluations showing competitive advantages over international counterparts [5][6]
- The introduction of multiple new models at the Yunqi Conference demonstrates Alibaba Cloud's commitment to advancing AI capabilities across applications [6][8]

Group 3: Scalability and Reliability
- Scalability is a primary requirement for AI platforms; Alibaba Cloud offers serverless architectures to handle unpredictable traffic and resource demands [7][9]
- High availability and stability are essential for enterprises to trust AI Agents in critical processes; Alibaba Cloud provides low-cost, high-concurrency storage and reliable computing capabilities [7][9]
- Integrated memory management and retrieval systems are vital for AI Agents to evolve and retain knowledge over time, enhancing their productivity [8][9]

Group 4: Development Framework and Business Integration
- Alibaba Cloud's "1+2+7" framework for enterprise-level AI Agents comprises one model service, two development modes, and seven key capabilities to ease integration into business processes [13][14]
- The dual-track approach lets companies prototype quickly with low-code tools and transition to high-code for deeper customization, reducing exploration costs while preserving business continuity [14][15]
- Successful AI Agent deployments in sectors such as finance and recruitment highlight the tangible efficiency gains achieved with Alibaba Cloud's solutions [15][16]

Group 5: Strategic Positioning and Future Outlook
- Alibaba Cloud's leadership in the AI and cloud computing market is underscored by its significant market share and the trust of over 100,000 enterprise customers [18][21]
- AI Agents are seen as a critical step in AI's evolution from theoretical models to practical applications that drive business growth [19][21]
- The combined strategy of models, platforms, and infrastructure positions Alibaba Cloud as a global AI leader, enabling local enterprises to innovate without relying on foreign solutions [21]
From Models to Ecosystem: A Preview of the "Open Source Models and Frameworks" Track at the 2025 Global Machine Learning Technology Conference
AI科技大本营· 2025-09-26 05:49
Core Insights
- The article discusses the growing divide between open-source and closed-source AI models, highlighting that the performance gap has narrowed from 8% to 1.7% as of 2025, indicating that open-source models are catching up [1][12]

Open Source Models and Frameworks
- The 2025 Global Machine Learning Technology Conference will feature a special topic on "Open Source Models and Frameworks," inviting creators and practitioners to share their insights and experiences [1][12]
- Various open-source projects are being developed, including mobile large language model inference, reinforcement learning frameworks, and efficient inference services, aimed at making open-source technology more accessible to developers [2][7]

Key Contributors
- Notable contributors to the open-source projects include:
  - Wang Zhaode, a technical expert from Alibaba Taotian Group, focusing on mobile large language model inference [4][23]
  - Chen Haiquan, an engineer from ByteDance, contributing to the Verl project for flexible and efficient reinforcement learning programming [4][10]
  - Jiang Yong, a senior architect at Dify, involved in the development of open-source tools [4][23]
  - You Kaichao, the core maintainer of vLLM, which provides low-cost large model inference services [4][7]
  - Li Shenggui, a core developer of SGLang, currently a PhD student at Nanyang Technological University [4][23]

Conference Highlights
- The conference will feature discussions on the evolution of AI competition, which now encompasses data, models, systems, and evaluation, with major players like Meta, Google, and Alibaba vying for dominance in the AI ecosystem [12][13]
- Attendees will hear from leading experts, including Lukasz Kaiser, a co-author of the Transformer paper and a contributor to GPT-5, who will share insights into the future of AI technology [12][13]

Event Details
- The conference is set to take place soon, with a focus on the latest technological insights and industry trends, encouraging developers to participate and share their experiences [12][13]
How Were vLLM and SGLang, the Most Popular Open-Source LLM Inference Frameworks, Built?
AI科技大本营· 2025-09-24 02:01
Core Viewpoint
- The article discusses the development stories of vLLM and SGLang, two prominent open-source inference engines for large language models (LLMs), highlighting their innovations, community engagement, and performance metrics

Group 1: LLM Inference Challenges
- The core challenge of LLM inference lies in deploying models with hundreds of billions of parameters under strict constraints of latency, throughput, and cost [3]
- The inference process applies learned knowledge to new data, which requires efficient computation and memory management [2][3]

Group 2: vLLM Development
- vLLM originated from a 2023 paper on PagedAttention, which innovatively applied operating-system memory-management techniques to the KV cache, significantly enhancing throughput [7][8]
- vLLM demonstrated remarkable performance improvements, handling up to 5 times the traffic and increasing throughput by 30 times compared to previous backends [9]
- The project quickly evolved from a research initiative into a community-driven open-source project, amassing over 56,000 stars on GitHub and engaging thousands of developers [9][15]

Group 3: SGLang Development
- SGLang grew out of the paper "SGLang: Efficient Execution of Structured Language Model Programs," featuring RadixAttention for optimized performance [12]
- SGLang retains the KV cache from previous requests to reduce computation during the prefill phase, showing significant performance advantages over traditional inference engines [12]
- Although SGLang's community is smaller than vLLM's, it has over 2,000 participants and has shown rapid iteration and growth [13]

Group 4: Community Engagement
- vLLM has a robust community with over 12,000 participants in issues and pull requests, while SGLang's community is less than half that size [13][15]
- Both projects have faced challenges managing a growing number of issues and pull requests, with vLLM generally responding faster than SGLang [13]

Group 5: Performance Metrics and Comparisons
- vLLM and SGLang have both integrated advanced features like continuous batching and various attention mechanisms, leading to significant performance enhancements [29]
- The competition between the two projects has intensified, with both claiming performance leadership in their respective releases [26]

Group 6: Future Trends and Developments
- As the performance race heats up, both vLLM and SGLang are focusing on reproducible methods and real-world metrics rather than just benchmark results [26]
- The trend indicates a convergence in model architectures and features among leading inference engines, with competition shifting toward factors beyond raw performance [29]

Group 7: Investment and Support
- Both projects have attracted attention from investment firms and open-source foundations, with vLLM receiving support from a16z and SGLang being recognized in the PyTorch ecosystem [31][40]
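The PagedAttention idea behind vLLM can be sketched in a few lines: KV-cache memory is divided into fixed-size blocks, and each sequence keeps a block table mapping its logical token positions to physical blocks, much like virtual-memory pages in an operating system. The toy sketch below is my own simplification, not vLLM's actual code; the class names and block size are illustrative.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative)

class BlockAllocator:
    """Pool of physical KV-cache blocks, handed out on demand."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # ids of unallocated blocks

    def alloc(self):
        return self.free.pop()

class Sequence:
    """One request's logical-to-physical block mapping."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when crossing a block
        # boundary, so memory is committed as generation proceeds
        # instead of being pre-reserved for the maximum length.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

alloc = BlockAllocator(num_blocks=64)
seq = Sequence(alloc)
for _ in range(40):  # generate 40 tokens
    seq.append_token()
print(len(seq.block_table))  # 3 blocks cover 40 tokens (ceil(40/16))
```

Because blocks are fixed-size and allocated lazily, freed blocks from finished sequences can be reused immediately, which is what lets a paged cache pack far more concurrent requests into the same GPU memory than contiguous pre-allocation.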
The Great LLM Open-Source 2.0 Reshuffle: 60 Out, 39 In, AI Coding Runs Wild, TensorFlow Is Dead
36Kr · 2025-09-17 08:57
Core Insights
- Ant Group's open-source team unveiled version 2.0 of the "2025 Large Model Open Source Development Ecosystem Panorama" at the Shanghai Bund Conference, showcasing significant changes in the open-source landscape [2][4][10]

Group 1: Ecosystem Changes
- The updated panorama includes 114 projects, a decrease of 21 from the previous version, with 39 new projects and 60 projects that have exited the stage, including notable ones like TensorFlow, which has been overtaken by PyTorch [4][5]
- The overall trend indicates a significant reshuffling within the ecosystem, with a median project age of only 30 months, highlighting a youthful and rapidly evolving environment [5][10]
- Since the "GPT moment" in October 2022, 62% of the projects have emerged, indicating a dynamic influx of new entrants and exits [5][10]

Group 2: Project Performance
- The ten most active open-source projects center on AI, LLM, Agent, and Data, indicating the primary areas of interest within the ecosystem [7][9]
- The classification framework has evolved from broad categories to more specific segments, including AI Agent, AI Infra, and AI Data, emphasizing the shift toward an "agent-centric" era [10][19]

Group 3: Contributions by Region
- Among 366,521 developers, the US and China together contribute over 55%, with the US leading at 37.41% [10][12]
- The US holds a significant advantage in AI Infra and AI Data, contributing 43.39% and 35.76% respectively, compared with China's 22.03% and 21.5% [12][14]

Group 4: Methodological Evolution
- The project-selection methodology has shifted from a known starting point to a broader approach that captures high-activity projects, raising the threshold for inclusion [15][18]
- The new methodology aligns with Ant Group's goal of informing internal decision-making and guiding the open-source community [15][18]

Group 5: AI Agent Developments
- The AI Agent category has evolved into a structured system with various specialized tools, indicating a transition from chaotic growth to systematic differentiation [19][21]
- AI Coding has expanded its capabilities, covering the entire development lifecycle and supporting multimodal and context-aware functionalities [23][27]

Group 6: Market Trends
- The report predicts significant commercial potential in AI Coding, with new revenue models emerging from subscription services and value-added features [24][27]
- Chatbot applications have peaked and are now stabilizing, with a shift toward integrating knowledge management for long-term productivity [28][30]

Group 7: Infrastructure and Operations
- The Model Serving segment remains a key battleground, with high-performance cloud inference solutions like vLLM and SGLang leading the way [42][45]
- LLMOps is growing rapidly, focusing on full-lifecycle model management with an emphasis on stability and observability [50][52]

Group 8: Data Ecosystem
- The AI Data sector appears stable, with many projects originating from the AI 1.0 era, but faces challenges in innovation and engagement [58][60]
- Data infrastructure is expected to evolve from static repositories to dynamic systems that provide real-time insights for models [60][61]

Group 9: Open Source Dynamics
- A trend toward customized open-source licenses is emerging, allowing more control and flexibility in commercial negotiations [62][63]
- Some projects now operate under restrictive licenses, raising questions about the definition of "open source" [62][63]

Group 10: Competitive Landscape
- The competitive landscape shows a divergence between open-source and closed-source models, with Chinese projects flourishing while Western firms tighten their open-source strategies [67][68]
- MoE architectures and advances in reasoning capabilities are becoming standard features in new models, indicating a shift in focus from scale to reasoning [69][70]
The Great LLM Open-Source 2.0 Reshuffle: 60 Out, 39 In, AI Coding Runs Wild, TensorFlow Is Dead
机器之心· 2025-09-17 04:00
Core Insights
- The article discusses the significant changes in the open-source AI model ecosystem, highlighting a shift toward a more competitive and rapidly evolving landscape, particularly in the AI Agent and Model Serving sectors [4][9][61]

Group 1: Ecosystem Changes
- The latest version of the open-source landscape includes 114 projects, a decrease of 21 from the previous version, with 39 new projects and 60 projects that have disappeared, indicating a significant reshuffling of the ecosystem [7][10]
- The average lifespan of projects in the AI model ecosystem is only 30 months, with 62% of projects emerging after the "GPT moment" of October 2022, showcasing a high turnover rate [10][11]
- TensorFlow has been overtaken by PyTorch, which now dominates the landscape, marking a dramatic shift in competitive dynamics [8]

Group 2: Key Trends
- The article identifies three main areas of focus: AI Coding, Model Serving, and LLMOps, which are emerging as the primary tracks in the evolving landscape [29][61]
- AI Coding has transitioned from merely assisting code writing to becoming a comprehensive lifecycle engine, indicating a significant increase in its capabilities and market potential [43][44]
- The AI Data sector remains relatively stable but is expected to evolve as new challenges arise in the native large-model era, suggesting potential for future growth [82][88]

Group 3: Global Contributions
- The United States and China contribute over 55% of the total developer population in the open-source AI space, with the U.S. leading at 37.41% [17][20]
- The U.S. holds a dominant position in AI Infrastructure and AI Data, with contributions significantly higher than China's [19][23]

Group 4: Licensing Trends
- There is a noticeable trend toward more restrictive open-source licenses, with many new projects adopting custom agreements that give license holders greater control [90][92]
- This shift raises questions about the definition of "open source" in the current competitive environment, as some projects popular on platforms like GitHub are not fully open-source [94]
TensorFlow, Once the King, Is Dead
量子位· 2025-09-15 00:30
Core Viewpoint
- The article discusses the decline of TensorFlow as an open-source framework, contrasting it with the rapid rise of PyTorch and other emerging projects in the AI open-source ecosystem [3][8][54]

Group 1: Decline of TensorFlow
- TensorFlow's community activity has fallen from its peak to its lowest point, even lower than at its inception [3][10]
- Wang Xu, vice-chairman of Ant Group's open-source technology committee, announced TensorFlow's removal from the latest open-source landscape map, indicating its diminishing relevance [6][8]
- The decline of TensorFlow reflects a broader trend in the AI open-source landscape, where project lifecycles are now measured in days rather than years [10][53]

Group 2: Open-Source Project Dynamics
- The latest open-source landscape map (version 2.0) shows significant turnover, with 39 new projects added and 60 existing projects removed, indicating rapid evolution in the ecosystem [17][18]
- Projects that fail to maintain community engagement or lag in iteration speed risk being excluded from the landscape [19][20][21]
- The competitive nature of the AI open-source ecosystem demands continuous innovation and effective community management to sustain project viability [24]

Group 3: New Paradigms in Open Source
- The definition and operational model of open source are evolving, with some high-activity projects not adhering to traditional open-source licenses [26][30]
- The operational attributes of open source are becoming more pronounced, with platforms like GitHub serving as critical channels for product release and community engagement [31]
- New AI open-source projects are increasingly adopting customized licensing terms to balance community benefits with commercial interests, indicating a shift toward a more pragmatic approach to open source [32][33]

Group 4: Competitive Landscape
- The focus of competition in the AI ecosystem has shifted from broad functionality to performance optimization, particularly in model serving and inference efficiency [35][44]
- The decline in activity for agent frameworks suggests a transition from exploratory phases to more practical, performance-driven applications [41][42]
- The emergence of high-performance inference engines highlights the importance of optimizing model serving to reduce operational costs and enhance application viability [43][44]

Group 5: Global Contribution Dynamics
- The global AI open-source landscape follows a "dual center" model, with the U.S. and China as the primary contributors, each excelling in different technological domains [46][49]
- U.S. developers lead in infrastructure contributions, while Chinese developers show strong growth in application innovation, driven by local market demands [51][52]
- The evolving contribution dynamics reflect a shift toward application-driven innovation, with real-world needs shaping the development of AI tools and solutions [50]
Mira Murati's Startup Publishes Its First Long-Form Article, Tackling the Problem of Nondeterminism in LLM Inference
Founder Park· 2025-09-11 07:17
Core Insights
- The article discusses the challenges of achieving reproducibility in large language model (LLM) inference, highlighting that the same input can produce different outputs, and not merely because of the probabilistic sampling process [10][11]
- It introduces the concept of "batch invariance" in LLM inference, emphasizing the need for consistent results regardless of batch size or concurrent requests [35][40]

Group 1
- Thinking Machines Lab, founded by former OpenAI CTO Mira Murati, has launched a blog series called "Connectionism" to share insights on AI research [3][8]
- The blog's first article addresses nondeterminism in LLM inference, explaining that even with the temperature set to 0, results can still vary [10][12]
- The article identifies floating-point non-associativity and concurrency as key factors contributing to the uncertainty in LLM outputs [13][24]

Group 2
- The assumption that "concurrency + floating-point" alone explains nondeterminism is incomplete, as many operations in LLMs can be made deterministic [14][16]
- Understanding how GPU kernel functions are implemented matters, since the lack of synchronization among processing cores can lead to unpredictable results [25][29]
- Most LLM operations do not require atomic addition, which is a common source of nondeterminism, so forward passes can produce run-to-run consistent outputs [32][33]

Group 3
- The results of LLM inference can be affected by batch size and the order of operations, leading to inconsistencies; this motivates the concept of batch invariance [36][40]
- The article outlines strategies to achieve batch invariance in key operations like RMSNorm, matrix multiplication, and attention, ensuring that outputs remain consistent regardless of batch size [42][60][64]
- It concludes with a demonstration of deterministic inference using batch-invariant kernel functions, showing that consistent outputs can be achieved with the right implementation [74][78]
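The floating-point non-associativity at the heart of the argument is easy to see directly. The snippet below is a generic illustration, not code from the blog post: regrouping the same three numbers changes the rounded result, because intermediate rounding depends on the order of operations.

```python
# Floating-point addition is not associative: (a + b) + c and
# a + (b + c) can round differently even though the inputs are equal.
a, b, c = 0.1, 1e16, -1e16

left = (a + b) + c   # 0.1 is absorbed by 1e16 first, then cancelled away
right = a + (b + c)  # 1e16 and -1e16 cancel first, leaving 0.1 intact

print(left, right)   # 0.0 0.1
assert left != right
```

In a GPU reduction, the grouping is decided by how work is split across cores, so a scheduler that splits the same sum differently between runs produces bitwise-different results even though every individual operation is deterministic.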
Thinking Machines Lab Publishes Its First Long-Form Article, Revealing the Truth Behind LLM Inference Nondeterminism
机器之心· 2025-09-11 03:36
Core Viewpoint
- The article discusses the challenges of achieving reproducibility in large language models (LLMs) due to the lack of batch invariance, which leads to nondeterministic outputs even under controlled conditions [10][41][46]

Group 1: Introduction to the Issue
- Thinking Machines Lab, founded by former OpenAI CTO Mira Murati, published its first article addressing nondeterminism in LLM inference [1][3]
- The blog aims to cover a wide range of topics related to their research, including numerical computation and prompt engineering [3]

Group 2: Understanding Nondeterminism
- Reproducibility is a cornerstone of scientific progress, yet obtaining consistent results from LLMs is challenging [10]
- Even with the temperature parameter set to 0, LLM APIs can still produce nondeterministic outputs [11]
- The nondeterminism is attributed to floating-point non-associativity combined with concurrency, which affects the order of operations in GPU computations [13][30]

Group 3: The Root Cause of Nondeterminism
- The article argues that the common assumption linking concurrency and floating-point operations to nondeterminism does not fully explain the issue [14][30]
- Floating-point non-associativity yields different results depending on the order of operations, especially in parallel computations [19][26]
- The actual implementation of kernel functions in LLM serving contributes to the nondeterministic behavior observed [27][30]

Group 4: Batch Invariance
- The lack of batch invariance is identified as the key factor causing nondeterminism in LLM outputs [41][46]
- Changes in batch size can lead to different results for the same input, which is counterintuitive for mathematical functions [43]
- Ensuring that kernel functions are batch invariant is crucial for achieving consistent outputs in LLM inference [46]

Group 5: Solutions for Achieving Determinism
- The article outlines strategies to implement batch invariance in key operations such as RMSNorm, matrix multiplication, and attention [49][60][71]
- By ensuring that operations do not depend on batch size, LLM inference can produce consistent results [46][81]
- The authors provide a demonstration of deterministic inference using their batch-invariant kernel function library [82]

Group 6: Performance Considerations
- Initial tests indicate that while the batch-invariant kernels are not yet fully optimized, they do not cause catastrophic performance declines [89]
- Maintaining performance while achieving deterministic outputs remains an important goal [88]

Group 7: Implications for Reinforcement Learning
- Deterministic inference can enable true on-policy reinforcement learning by ensuring consistent outputs between training and inference [90]
- This consistency is essential for effective training and sampling in reinforcement learning environments [90]

Group 8: Conclusion
- The article advocates a proactive approach to understanding and eliminating the sources of nondeterminism in LLMs, encouraging the community to strive for reproducibility in AI systems [93]
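The batch-invariance strategy described above, fixing the reduction order so it does not depend on how many requests are batched together, can be sketched with a simple row-sum "kernel". This is a hedged illustration of the idea, not Thinking Machines Lab's actual kernel library; `CHUNK` and all function names are my own.

```python
import random

CHUNK = 4  # fixed reduction chunk size, independent of batch size (illustrative)

def fixed_order_sum(row):
    # Reduce the row chunk-by-chunk in one deterministic order. Because
    # the order never depends on anything outside the row itself, the
    # rounded result is identical wherever the row appears.
    total = 0.0
    for start in range(0, len(row), CHUNK):
        partial = 0.0
        for x in row[start:start + CHUNK]:
            partial += x
        total += partial
    return total

def batched_rowsums(rows):
    # A "batch" is just the same per-row reduction applied row by row;
    # nothing about the reduction strategy depends on len(rows).
    return [fixed_order_sum(r) for r in rows]

random.seed(0)
rows = [[random.random() for _ in range(1000)] for _ in range(4)]
alone = batched_rowsums(rows[:1])[0]  # row 0 processed at batch size 1
in_batch = batched_rowsums(rows)[0]   # same row, batch size 4
assert alone == in_batch              # bitwise-identical result
```

A real GPU kernel is batch invariant only if it makes the same guarantee: the split sizes and accumulation order for one request's reduction must be fixed constants, not values derived from the current batch shape.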
The Silicon Valley Professor Riding the AI Wave: An 18-Billion-Yuan Fortune, Still at the Lectern, and Seven AI Startups to His Name
36Kr · 2025-09-02 07:20
Core Insights
- Databricks has achieved a valuation exceeding $100 billion, positioning it among the highest-valued AI unicorns globally and drawing attention to co-founder Ion Stoica's dual role in academia and industry [1][2]
- Stoica has been instrumental in founding and managing several significant research labs at UC Berkeley, contributing to 118 research projects in big data, cloud computing, and AI [1][2]
- Despite his commercial success, Stoica remains committed to education and research, continuing to teach undergraduate courses [2][29]

Group 1: Entrepreneurial Ventures
- Stoica has co-founded or incubated at least seven notable startups, including Databricks, Anyscale (valued at $1 billion), LMArena (valued at $600 million), and Conviva (valued at $300 million) [2][10]
- Databricks, which emerged from the open-source project Spark, has raised a total of $20.8 billion and serves over 60% of Fortune 500 companies [10][12]
- Conviva, founded in 2006, focuses on real-time video stream analysis and has raised $110 million across seven funding rounds [8][12]

Group 2: Research Contributions
- Stoica played a key role in establishing three major labs at UC Berkeley, including the AMP lab, which produced influential open-source projects like Apache Spark and Alluxio [9][10]
- The RISE lab, which Stoica helped create, has contributed over 41 open-source projects, with the distributed execution framework Ray being a notable success that led to the founding of Anyscale [19][20]
- The Sky Computing Lab, established in 2022, has produced 52 projects, including the vLLM inference engine and the LMArena evaluation platform [24][25]

Group 3: Funding and Sponsorship
- Stoica has secured sponsorship from major companies like NVIDIA, Meta, and Google for his labs, allowing extensive research funding [2][28]
- His entrepreneurial ventures also provide financial support for his research, with Stoica investing part of his personal wealth into lab operations [28][29]
- The collaborative nature of his labs has attracted significant industry partnerships, enhancing the practical application of academic research [28][31]

Group 4: Educational Impact
- Stoica has mentored over 80 students, many of whom have gone on to academia or founded their own companies, including several now at Databricks [29][31]
- His commitment to education is evident as he continues to teach and guide students, emphasizing the importance of innovation and exploration in research [29][31]
- Stoica's career demonstrates how academic research can translate into substantial commercial value, particularly in the AI sector [31]
o3-pro Beats Sokoban: Nostalgic Mini-Games Become a New Benchmark for Large Models
量子位· 2025-06-16 04:50
Core Viewpoint
- Classic nostalgic games like Sokoban and Tetris have become benchmarks for evaluating large models, with the o3-pro model recently surpassing previous performance limits in these games [1][2][6]

Group 1: Benchmark Performance
- The o3-pro model successfully completed all levels of Sokoban; the benchmark previously topped out at the sixth level [3][8]
- Compared with the previous state-of-the-art (SOTA) model, o3, o3-pro's performance doubled [3][10]
- Tetris is scored as the number of placed blocks plus ten times the number of cleared lines, accumulated until the game ends [13][22]

Group 2: Game Characteristics and Evaluation
- The Lmgame benchmark includes several games, such as 2048, Candy Crush, Super Mario Bros, and Phoenix Wright, each with unique evaluation criteria [18][24]
- 2048 is evaluated by the total value of merged tiles, while Candy Crush measures the total candies eliminated over a fixed number of rounds [24]
- The evaluation methods do not factor in time, focusing instead on game-specific performance metrics [22][24]

Group 3: Model Development and Support
- The project is developed by the Hao AI Lab at UCSD, which is affiliated with the machine learning systems and NLP labs [28]
- The lab has received funding from Google and NVIDIA, with NVIDIA donating a DGX B200 system to support the research [34]
- The benchmark is open-source, so interested parties can download it and test their own models [23]
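The scoring rules described above are simple enough to state as code. The function names below are my own, not the Lmgame benchmark's actual API; they encode the Tetris rule (one point per placed block, ten per cleared line) and the 2048 rule (total value of merged tiles).

```python
def tetris_score(blocks_placed, lines_cleared):
    # Tetris: placed blocks count once each; cleared lines count
    # ten each, accumulated until the game ends.
    return blocks_placed + 10 * lines_cleared

def score_2048(merged_tile_values):
    # 2048: the score is the total value of all merged tiles.
    return sum(merged_tile_values)

print(tetris_score(25, 4))          # 65
print(score_2048([4, 8, 16, 32]))   # 60
```

Note that neither metric involves elapsed time, consistent with the benchmark's stated focus on game-specific outcomes rather than speed.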