Introduction to LLM serving with SGLang - Philip Kiely and Yineng Zhang, Baseten

AI Engineer · 2025-07-26 17:45

SGLang Overview
- SGLang is an open-source, high-performance serving framework for large language models (LLMs) and vision language models (VLMs) [5]
- SGLang offers day-zero support for new model releases from labs such as Qwen and DeepSeek, and has a strong open-source community [7]
- The project has grown rapidly, from a research paper in December 2023 to nearly 15,000 GitHub stars in 18 months [9]

Usage and Adoption
- Baseten uses SGLang as part of its inference stack for various models [8]
- SGLang is also used by xAI for its Grok models, as well as by inference providers, cloud providers, research labs, universities, and product companies such as Cursor [8]

Performance Optimization
- SGLang's performance can be tuned with server flags and configuration options, such as the CUDA graph settings [20]
- EAGLE-3, a speculative decoding algorithm, can be used to improve performance by increasing the token acceptance rate [28][42][43]
- The default CUDA graph max batch size on L4 GPUs is 8, but it can be adjusted to improve performance [31][36]

Community and Contribution
- The SGLang community is active and welcomes contributions [7][54]
- Developers can get involved by starring the project on GitHub, filing issues, joining the Slack channel, and contributing to the codebase [9][54][55]
- The codebase includes the SGLang runtime, a domain-specific front-end language, and a set of optimized kernels [58]
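
To make the performance tuning points concrete, here is a minimal sketch of how the CUDA graph max batch size and speculative decoding options might be passed when launching an SGLang server. It only assembles the command line and does not start a server. The flag names (`--cuda-graph-max-bs`, `--speculative-algorithm`, `--speculative-draft-model-path`) are assumptions based on SGLang's server arguments and should be checked against `python -m sglang.launch_server --help` for your installed version. The helper `build_launch_command` and the model paths are hypothetical placeholders, not part of SGLang or of the talk.

```python
import shlex
from typing import List, Optional


def build_launch_command(
    model_path: str,
    cuda_graph_max_bs: Optional[int] = None,
    draft_model_path: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for `python -m sglang.launch_server`.

    Hypothetical helper for illustration only. The flag names are
    assumptions; confirm them with `--help` on your SGLang version.
    """
    cmd = ["python", "-m", "sglang.launch_server", "--model-path", model_path]

    # Override the default CUDA graph max batch size (the summary reports
    # a default of 8 on L4 GPUs) when an explicit value is given.
    if cuda_graph_max_bs is not None:
        cmd += ["--cuda-graph-max-bs", str(cuda_graph_max_bs)]

    # Turn on EAGLE-3 speculative decoding only when a draft model is given.
    if draft_model_path is not None:
        cmd += [
            "--speculative-algorithm", "EAGLE3",
            "--speculative-draft-model-path", draft_model_path,
        ]
    return cmd


if __name__ == "__main__":
    # Placeholder paths; substitute a real target model and draft model.
    cmd = build_launch_command(
        "path/to/model",
        cuda_graph_max_bs=32,
        draft_model_path="path/to/eagle3-draft",
    )
    print(shlex.join(cmd))
```

Building the arguments as a list keeps each tuning knob optional, so a baseline launch and a tuned launch can be benchmarked side by side by changing one argument at a time.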