SGLang
LLM-in-Sandbox: Give Large Models a Computer to Unlock General Agentic Capability
机器之心· 2026-01-30 04:25
The capabilities of large models are being unlocked step by step by different paradigms: In-Context Learning showed that models can generalize to new tasks without fine-tuning; Chain-of-Thought improves performance on complex problems by guiding the model to reason step by step; and, more recently, agent frameworks have given models the ability to call tools and interact over multiple turns. What comes next along this line of technical evolution? Recently, researchers from the Gaoling School of Artificial Intelligence at Renmin University of China, Microsoft Research, and Tsinghua University proposed a simple yet effective paradigm: LLM-in-Sandbox, letting the large model freely explore inside a code sandbox (i.e., a virtual computer) to complete tasks. Experiments show that this paradigm is effective not only on coding tasks, but also significantly improves model performance in many non-code domains such as mathematics, physics, chemistry, biomedicine, long-document understanding, and instruction following, all without extra training, while substantially reducing token consumption in long-context settings and keeping inference speed at a comparable level. The researchers have open-sourced LLM-in-Sandbox as a Python package that integrates seamlessly with mainstream inference backends such as vLLM and SGLang. They argue that LLM-in-Sandbox should become the default deployment paradigm for large models, replacing pure LLM inference. 1. Core idea: give the large model a computer. The computer may be the most general-purpose tool humanity has created; almost any task can be accomplished through one. This generality ...
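The article does not include code, but the basic loop behind "let the model explore in a sandbox" can be sketched roughly as follows. This is a minimal illustration under assumptions, not the authors' released package: the model is prompted repeatedly, any code it emits is executed in an isolated process (a stand-in for a real sandbox or VM), and the execution output is fed back until the model gives a final answer. The `run_llm` stub, the `<python>` tag convention, and the `FINAL:` marker are all hypothetical.

```python
import re
import subprocess
import tempfile

def run_in_sandbox(code: str, timeout: int = 30) -> str:
    """Execute model-written Python in a separate process (stand-in for a real sandbox)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        out = subprocess.run(["python", path], capture_output=True, text=True, timeout=timeout)
        return (out.stdout + out.stderr)[-4000:]  # truncate long logs before feeding back
    except subprocess.TimeoutExpired:
        return "[sandbox] execution timed out"

def solve_with_sandbox(task: str, run_llm, max_turns: int = 8) -> str:
    """Let the model alternate between writing code and reading execution results."""
    transcript = (
        f"Task: {task}\n"
        "You may write Python inside <python>...</python> tags; "
        "when you are done, answer with FINAL: <answer>."
    )
    for _ in range(max_turns):
        reply = run_llm(transcript)  # run_llm: any chat backend (vLLM, SGLang, ...)
        transcript += "\n" + reply
        if "FINAL:" in reply:
            return reply.split("FINAL:", 1)[1].strip()
        for code in re.findall(r"<python>(.*?)</python>", reply, flags=re.DOTALL):
            transcript += "\n[sandbox output]\n" + run_in_sandbox(code)
    return "no final answer within turn budget"
```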
Come to this salon for a look at SGLang's frontier practice: ultra-long-context extension, RL post-training frameworks, diffusion language models, and more
机器之心· 2026-01-29 08:12
At this critical juncture where AI is accelerating from the "chat" paradigm toward the agent era of "getting things done", practical exploration of LLM system optimization and deployment calls for deep connections and shared experience among developers. Against this backdrop, the SGLang community, 机器之心, and Zhangjiang Incubator are jointly hosting an offline Meetup, bringing contributors out from behind the screen to share hands-on lessons. On the afternoon of February 6, the SGLang Shanghai Meetup will be held at 1F, 800 Naxian Road, Pudong, Shanghai. The Meetup will feature in-depth talks on the SGLang technical roadmap, ultra-long-context extension, RL post-training frameworks, and diffusion language models, followed by an open discussion session. Developers and researchers are warmly invited to attend and explore new possibilities in LLM system optimization and deployment. Latest agenda: the latest agenda is now announced; scan the registration QR code below to reserve your seat. Venue: 1F. Agenda | 13:30-14:00 Check-in | 14:00-14:30 Talk 1: SGLang roadmap, 张柏舟, SGLang core developer | 14:30-15:00 Talk 2: Omni-infer performance optimization practice for SGL, 郑锦焕, Omni-infer core developer | 15:00-15:30 Talk 3: slime ...
Scoring AI models, and ending up with a unicorn valued at $1.7 billion?
36Kr· 2026-01-07 11:04
Core Insights
- LMArena has secured $150 million in Series A funding, raising its valuation to $1.7 billion, marking a strong start to the new year [1]
- The round was led by Felicis and UC Investments, with participation from Andreessen Horowitz and The House Fund, indicating strong investor confidence in the AI model evaluation sector [3]
Company Background
- LMArena originated from Chatbot Arena, created by the open-source organization LMSYS, whose members come mainly from top universities such as UC Berkeley and Stanford [4]
- The team also developed the open-source inference engine SGLang, which achieved performance comparable to DeepSeek's official report on 96 H100 GPUs [4]
- LMArena's primary focus is evaluating AI models, having established a crowdsourced benchmarking platform during the rise of models like ChatGPT and Claude [6][7]
Evaluation Methodology
- LMArena uses anonymous, user-voted head-to-head comparisons of model responses to keep assessments unbiased [10]
- The platform scores models with an Elo rating system based on the Bradley–Terry model, allowing real-time updates and fair comparisons (a minimal rating-update sketch follows this summary) [10]
- LMArena has become a go-to platform for testing new models, with Gemini 3 Pro currently leading the rankings at a score of 1490 [10][11]
Growth and Future Plans
- Since its $100 million seed round last year, LMArena has expanded rapidly, accumulating 50 million votes across various modalities and evaluating over 400 models [12]
- The newly raised funds will go toward platform operations, user experience, and expanding the technical team [12]
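The Elo-style update grounded in the Bradley–Terry model referenced above can be sketched in a few lines. This is an illustrative implementation of the general technique, not LMArena's actual scoring code; the K-factor and the starting ratings are hypothetical choices.

```python
import math

def expected_win_prob(r_a: float, r_b: float) -> float:
    """Bradley–Terry / Elo expected probability that model A beats model B."""
    return 1.0 / (1.0 + math.pow(10.0, (r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 16.0) -> tuple[float, float]:
    """Update both ratings after one anonymous head-to-head vote."""
    p_a = expected_win_prob(r_a, r_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - p_a)
    return r_a + delta, r_b - delta

# Example: a fresh model at 1000 beats a 1490-rated leader in one battle.
print(elo_update(1000.0, 1490.0, a_won=True))
```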
Scoring AI models, and ending up with a unicorn valued at $1.7 billion???
量子位· 2026-01-07 09:11
Core Insights
- LMArena has secured $150 million in Series A funding, raising its valuation to $1.7 billion, marking a strong start to the new year [1][3]
Group 1: Funding and Valuation
- The funding round was led by Felicis and UC Investments, with participation from Andreessen Horowitz and The House Fund [3]
- The significant investment reflects the attractiveness of the AI model evaluation sector in the current market [4]
Group 2: Company Background
- LMArena originated from Chatbot Arena, created by the open-source organization LMSYS following the emergence of ChatGPT in 2023 [5][4]
- The core team consists of highly educated individuals from top universities such as UC Berkeley, Stanford, UCSD, and CMU [6]
Group 3: Technology and Evaluation Methodology
- LMArena's open-source inference engine, SGLang, has achieved performance comparable to DeepSeek's official report on 96 H100 GPUs [7]
- SGLang has been widely adopted by major companies including xAI, NVIDIA, AMD, Google Cloud, Oracle Cloud, Alibaba Cloud, Meituan, and Tencent Cloud [8]
- LMArena's primary focus is evaluating AI models, which began with the launch of Chatbot Arena, a crowdsourced benchmarking platform [9][10]
Group 4: Evaluation Process
- LMArena employs an evaluation process built on anonymous battles, an Elo-style scoring system, and human-machine collaboration [20]
- Users input questions, and the system randomly matches two models for anonymous responses, letting users vote on answer quality without knowing the model identities [21][22]
- The platform's Elo scoring mechanism updates model rankings based on performance, keeping the evaluation fair and objective [22]
Group 5: Growth and Future Plans
- Since securing $100 million in seed funding, LMArena has rapidly exceeded expectations, accumulating 50 million votes across various modalities and evaluating over 400 models [25]
- The newly raised funds will be used to enhance platform operations, improve user experience, and expand the technical team [25]
SGLang natively supports Ascend: new models launch with one click, no code changes required
量子位· 2025-12-21 14:13
henry, reporting from Aofeisi. 量子位 | WeChat official account QbitAI. As agents keep accelerating on the application side, whether inference systems can bear the real-world load that follows is becoming a focal point for the industry. That was a backdrop raised repeatedly at the SGLang AI Finance "π" meetup, which wrapped up in Hangzhou on December 20. At this event focused on large-model inference efficiency, the agent "vibe" was set aside for a moment; what actually landed on the table were the engineering problems inference systems face under real load: high-concurrency requests, long context windows, multi-turn inference, memory management, and, in concrete financial-agent scenarios, consistent generation. Ascend, as a compute platform, was also mentioned repeatedly in the discussions. Ascend has now entered the main repository as one of SGLang's natively supported backends; as the SGLang inference engine is updated, models such as DeepSeek, Qwen, and GLM can run directly without adjusting model parameters or introducing extra plugins, and system capabilities such as HiCache and Mooncake are being introduced in the corresponding releases. In short, what this SGLang AI Finance π meetup presented was not a scattering of technical points but a clear evolutionary path for inference engineering: from the cache and memory system, to weight updates and reinforcement-learning efficiency, to the synergy of compute platforms and the model ecosystem. Let's look at the specifics. In particular deployment scenarios, such as financial Agen ...
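As a rough illustration of what "launch a model directly, without extra plugins" looks like from the user's side, here is a minimal sketch using SGLang's offline engine API. This is a sketch under assumptions, not instructions from the article: the model path, prompts, and sampling parameters are placeholders, any Ascend-specific flags are omitted because the article does not spell them out, and the exact API surface may differ across SGLang versions.

```python
import sglang as sgl

# Offline-engine usage sketch; the model path below is a placeholder.
llm = sgl.Engine(model_path="Qwen/Qwen2.5-7B-Instruct")

prompts = [
    "Summarize the key risks in this quarterly report in one sentence.",
    "List three factors that drive bond yields.",
]
# Sampling parameters are passed as a dict; output format may vary by version.
outputs = llm.generate(prompts, {"temperature": 0.0, "max_new_tokens": 128})

for prompt, out in zip(prompts, outputs):
    print(prompt, "->", out["text"])
```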
Building a production-grade, cloud-native large-model inference platform on SGLang RBG + Mooncake
AI前线· 2025-12-12 00:40
Core Insights
- The article emphasizes the rapid evolution of large language model (LLM) inference services into core enterprise infrastructure, focusing on balancing performance, stability, and cost when building high-performance inference systems [2]
- It discusses the transition from monolithic to distributed architectures in LLM inference, highlighting the need for external KVCache to alleviate memory pressure and improve performance in high-demand scenarios [2][4]
Distributed KVCache and Mooncake
- Mooncake is introduced as a leading distributed KVCache storage engine designed to provide high throughput and low latency for inference frameworks like SGLang [3]
- The article outlines the challenges of managing distributed KVCache systems in production, which motivate RoleBasedGroup (RBG) for unified management of caching and inference nodes [4]
RoleBasedGroup (RBG) Design and Challenges
- RBG is presented as a Kubernetes-native API aimed at AI inference, facilitating multi-role orchestration to ensure stable, high-performance operations [4][12]
- The article identifies five fundamental challenges in deploying large-model inference services, including the need for strong state management and performance optimization [12][15]
SCOPE Framework
- The SCOPE framework is introduced, focusing on five core capabilities: Stability, Coordination, Orchestration, Performance, and Extensibility, which are essential for managing LLM inference services [16][18]
- RBG's design allows rapid architecture iteration and performance-sensitive operations, addressing the complexities of multi-role dependencies and operational efficiency [15][24]
Benchmark Testing and Performance Metrics
- Benchmark tests show significant improvements in KVCache hit rates and inference performance, with the L3 Mooncake cache achieving a 64.67% hit rate and reducing average TTFT to 2.58 seconds [32][48]
- The article highlights the role of a multi-tier caching architecture in improving performance for applications like multi-turn dialogue and AI agents (a conceptual lookup sketch follows this summary) [44]
Conclusion and Future Outlook
- The integration of RBG and Mooncake is positioned as a transformative approach to building production-grade LLM inference services, emphasizing deep integration of high-performance design with cloud-native operational capabilities [43][44]
- The article concludes with a call for community collaboration to advance this paradigm and lay the foundation for the next generation of AI infrastructure [43]
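The multi-tier caching idea above (a small, fast GPU tier backed by host memory and a remote L3 tier such as Mooncake) can be illustrated with a simple lookup sketch. This is a conceptual illustration only, not RBG or Mooncake code; the class structure, the `DictStore` stand-in, and the eviction policy are all hypothetical.

```python
from collections import OrderedDict
from typing import Optional

class DictStore:
    """Toy stand-in for a remote distributed store such as Mooncake."""
    def __init__(self):
        self._d = {}
    def get(self, key: str) -> Optional[bytes]:
        return self._d.get(key)
    def put(self, key: str, value: bytes) -> None:
        self._d[key] = value

class TieredKVCache:
    """Prefix-keyed KV cache sketch with GPU, host, and remote (L3) tiers."""

    def __init__(self, gpu_capacity: int, host_capacity: int, remote: DictStore):
        self.gpu = OrderedDict()    # fastest, smallest tier
        self.host = OrderedDict()   # larger, slower tier
        self.remote = remote        # network-attached tier
        self.gpu_capacity = gpu_capacity
        self.host_capacity = host_capacity

    def get(self, prefix_hash: str) -> Optional[bytes]:
        if prefix_hash in self.gpu:            # L1 hit: reuse KV directly on device
            self.gpu.move_to_end(prefix_hash)
            return self.gpu[prefix_hash]
        if prefix_hash in self.host:           # L2 hit: promote host -> GPU, then reuse
            kv = self.host.pop(prefix_hash)
            self._put_gpu(prefix_hash, kv)
            return kv
        kv = self.remote.get(prefix_hash)      # L3 hit: fetch over the network
        if kv is not None:
            self._put_gpu(prefix_hash, kv)
        return kv                              # None means a full prefill is required

    def _put_gpu(self, key: str, kv: bytes) -> None:
        self.gpu[key] = kv
        if len(self.gpu) > self.gpu_capacity:  # evict LRU entries downward, not away
            old_key, old_kv = self.gpu.popitem(last=False)
            self.host[old_key] = old_kv
            if len(self.host) > self.host_capacity:
                spilled_key, spilled_kv = self.host.popitem(last=False)
                self.remote.put(spilled_key, spilled_kv)

cache = TieredKVCache(gpu_capacity=2, host_capacity=4, remote=DictStore())
cache._put_gpu("system-prompt-hash", b"kv-bytes")
print(cache.get("system-prompt-hash") is not None)  # -> True (L1 hit)
```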
Z Event | Z Potentials × SGLang NeurIPS Night for Global Frontier Researchers
Z Potentials· 2025-11-26 04:34
Core Insights
- NeurIPS 2025 is set to be a historic event for the future of AI technology, gathering top researchers and engineers in San Diego [1]
- Z Potentials is collaborating with SGLang, a leading open-source inference engine community, to create a unique networking opportunity for frontier researchers [2]
Event Details
- The event will feature prominent researchers from organizations like OpenAI, DeepMind, and Nvidia, focusing on next-generation generative AI and system innovations [1]
- The event is scheduled for December 5, from 6:00 PM to 8:00 PM, near the NeurIPS venue in San Diego [6]
Collaboration and Support
- Z Potentials aims to bridge investment, research, and infrastructure, with SGLang recognized as a standard in the large-model inference field [2]
- Atlas Cloud is providing significant computational support for the event, enabling the gathering of leading researchers [3]
Top large-model minds gather for a hardcore open-source meetup: the SGLang community holds its first Meetup in China
机器之心· 2025-10-28 06:29
Core Insights
- The PyTorch Conference 2025 showcased the vibrant community and significant developments in deep learning, particularly highlighting SGLang's contributions and potential in the industry [1][3][4]
SGLang Overview
- SGLang, an open-source high-performance inference engine for large language models and vision-language models, originated from RadixAttention and is incubated by the non-profit organization LMSYS; it offers low-latency, high-throughput inference across environments ranging from single GPUs to large distributed clusters (a toy prefix-matching sketch of the RadixAttention idea follows this summary) [7][8]
Community Engagement
- The first Meetup in Beijing, co-hosted by SGLang, Meituan, and Amazon Web Services, attracted numerous contributors, developers, and scholars, indicating a strong community presence and development potential [4][8]
Technical Developments
- The Meetup featured technical discussions on SGLang's architecture, including advancements in KV Cache, Piecewise CUDA Graph, and Spec Decoding, aimed at improving efficiency and compatibility [21][22]
- SGLang's quantization strategies were also discussed, focusing on expanding the application range and optimizing model performance [34][35]
Application and Practice
- Various industry applications of SGLang were presented, including its integration with Baidu's Ernie 4.5 model for large-scale deployment and optimization in search scenarios [41][42]
- SGLang's use in WeChat's search function was highlighted, emphasizing the need for high throughput and low latency in user experience [44]
Future Directions
- The SGLang roadmap includes further integration with various hardware and software solutions, aiming to enhance stability and compatibility across platforms [22][35]
- The Specforge framework, developed by the SGLang team, aims to accelerate large language model inference and has been adopted by major companies like Meituan and NVIDIA [57][58]
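RadixAttention, mentioned above as SGLang's starting point, reuses the KV cache of shared prompt prefixes by indexing cached sequences in a radix tree. The sketch below shows only the prefix-matching idea on token IDs; it is a simplified illustration, not SGLang's actual radix cache, and the node structure is hypothetical (a real radix tree also collapses runs of tokens into single edges).

```python
class RadixNode:
    def __init__(self):
        self.children = {}   # next token id -> RadixNode (one token per edge, for clarity)
        self.has_kv = False  # whether KV cache for the prefix ending here is materialized

class PrefixCache:
    """Toy prefix index: how many leading tokens of a request already have cached KV."""

    def __init__(self):
        self.root = RadixNode()

    def insert(self, token_ids: list[int]) -> None:
        node = self.root
        for tok in token_ids:
            node = node.children.setdefault(tok, RadixNode())
            node.has_kv = True

    def longest_cached_prefix(self, token_ids: list[int]) -> int:
        node, matched = self.root, 0
        for tok in token_ids:
            if tok not in node.children or not node.children[tok].has_kv:
                break
            node = node.children[tok]
            matched += 1
        return matched       # only the remaining suffix needs prefill

cache = PrefixCache()
cache.insert([1, 5, 9, 2, 7])                      # e.g., a shared system prompt already processed
print(cache.longest_cached_prefix([1, 5, 9, 4]))   # -> 3 tokens reusable
```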
KTransformers accepted at a top computer-systems conference and partnering with mainstream frameworks: 趋境 and Tsinghua make heterogeneous inference a new paradigm
量子位· 2025-10-22 09:12
Core Insights
- KTransformers, an open-source project developed by Turing Technology and Tsinghua University's KVCache.AI team, focuses on system innovation in the inference phase of large models, enabling efficient operation on diverse hardware architectures with lower computational power [2][4]
Group 1: KTransformers Overview
- KTransformers is a high-performance heterogeneous inference framework that makes optimal use of varied computing resources such as GPUs, CPUs, and memory [2]
- The project paper was accepted at the prestigious SOSP 2025 conference, underscoring its significance in the field of computer systems [2][4]
Group 2: Technical Innovations
- The framework introduces an "Expert Deferral" mechanism for efficient scheduling of experts in Mixture of Experts (MoE) models, reducing computational load without sacrificing model performance (a scheduling sketch follows this summary) [7][13]
- KTransformers achieves nearly 4x speedup on a single Intel Xeon processor compared to traditional PyTorch implementations, significantly enhancing CPU performance in expert computation [12]
- The system dynamically overlaps CPU and GPU loads, increasing model throughput by roughly 1.45x with minimal impact on accuracy [15][16]
Group 3: Collaboration and Ecosystem
- KTransformers has partnered with SGLang, a mainstream inference framework, to integrate full-GPU inference with heterogeneous inference, enhancing the overall architecture for large-model deployment [5][19]
- This collaboration gives developers seamless access to both full-GPU and heterogeneous inference, which is particularly valuable when GPU resources are limited [21]
Group 4: Market Position and Future Directions
- KTransformers has gained significant traction in the developer community, with over 15.2K stars on GitHub, indicating widespread adoption as a foundational framework for large-model inference [24]
- The project aims to democratize AI capabilities beyond elite compute paths and is actively collaborating with various domestic CPU and GPU platforms to promote cost-effective solutions [28][29]
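The "Expert Deferral" mechanism described above takes a subset of each token's routed MoE experts off the critical path so their work can be overlapped with other computation (for example, on the CPU). The sketch below is a conceptual illustration of that scheduling idea under simple assumptions (top-k routing, a fixed deferral budget); it is not KTransformers' implementation, and every name in it is hypothetical.

```python
import numpy as np

def route_with_deferral(router_logits: np.ndarray, top_k: int = 8, defer: int = 2):
    """Split each token's top-k experts into an immediate set and a deferred set.

    The lowest-scoring `defer` experts among the top-k are taken off the critical path;
    their contributions can be computed asynchronously and folded in later.
    """
    order = np.argsort(-router_logits, axis=1)[:, :top_k]  # experts ranked by router score
    immediate = order[:, : top_k - defer]                   # run now, on the accelerator
    deferred = order[:, top_k - defer :]                    # run off the critical path
    return immediate, deferred

# Example: 4 tokens routed over 64 experts.
logits = np.random.randn(4, 64)
now, later = route_with_deferral(logits)
print("immediate experts per token:", now.shape[1], "| deferred:", later.shape[1])
```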
The first open-source framework for 100% reproducible, stable RL training is here: two runs match exactly
量子位· 2025-09-27 01:30
Core Insights
- The article discusses the achievement of the SGLang and slime teams in building a fully reproducible, stable reinforcement learning (RL) training framework on the Qwen3-8B model, addressing the problem of non-deterministic outputs in large language model (LLM) inference [1][2][6]
Group 1: Deterministic Inference
- The SGLang and slime teams developed a deterministic inference solution that integrates batch-invariant operators, CUDA Graph, radix cache, and chunked prefill, keeping performance high while remaining compatible with key features [5][8]
- Batch-invariant operators address the core source of output non-determinism in LLM inference, which arises from varying batch sizes under dynamic batching (a toy numerical sketch follows this summary) [7][8]
- Testing shows an average performance drop of 34.35% for SGLang's solution, significantly better than the 61.5% decline reported by Thinking Machines Lab [5][12]
Group 2: Performance Metrics
- Performance metrics for different inference modes show that deterministic modes yield consistent outputs across batch sizes, with the number of unique outputs significantly reduced [10][11]
- In end-to-end latency, deterministic inference shows a performance drop of 25% to 45%, with specific backend configurations showing improvements in certain cases [12][13]
Group 3: Future Developments
- Future work will focus on optimizing batch-invariant operators to improve performance, particularly for RL inference, and on extending support to mixture-of-experts (MoE) models [16][18]
- The team also aims to improve radix cache functionality and explore tensor parallelism to further strengthen deterministic inference [18]
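Batch-invariant operators, the core idea above, make each request's result independent of how many other requests happen to be batched with it, typically by fixing the floating-point reduction order regardless of batch shape. The sketch below contrasts a reduction whose accumulation order depends on a batching-determined chunk size with a fixed-order version; it is a toy illustration of the numerical issue, not SGLang's kernels.

```python
import numpy as np

def batched_sum(x: np.ndarray, chunk: int) -> float:
    """Sum in chunks whose size depends on batching; float32 rounding can differ per chunk size."""
    partials = [np.float32(x[i : i + chunk].sum()) for i in range(0, len(x), chunk)]
    return float(np.float32(sum(partials)))

def invariant_sum(x: np.ndarray, fixed_chunk: int = 128) -> float:
    """Always reduce with the same fixed chunking, so the result never depends on batch size."""
    return batched_sum(x, fixed_chunk)

x = np.random.RandomState(0).randn(100_000).astype(np.float32)
print(batched_sum(x, 64), batched_sum(x, 256))  # may differ in the last bits
print(invariant_sum(x), invariant_sum(x))       # always bitwise identical
```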