Introduction to LLM serving with SGLang - Philip Kiely and Yineng Zhang, Baseten
AI Engineer · 2025-07-26 17:45
SGLang Overview
- SGLang is an open-source, high-performance serving framework for large language models (LLMs) and vision-language models (VLMs) [5]
- SGLang provides day-zero support for new model releases from labs like Qwen and DeepSeek, and has a strong open-source community [7]
- The project has grown rapidly, from a research paper in December 2023 to nearly 15,000 GitHub stars in 18 months [9]
Usage and Adoption
- Baseten uses SGLang as part of its inference stack for various models [8]
- SGLang is also used by xAI for its Grok models, as well as by inference providers, cloud providers, research labs, universities, and product companies like Cursor [8]
Performance Optimization
- SGLang's performance can be tuned through server flags and configuration options, such as CUDA graph settings [20]
- EAGLE-3, a speculative decoding algorithm, can improve performance by increasing the token acceptance rate [28][42][43]
- The default CUDA graph max batch size on L4 GPUs is eight, but it can be raised to improve performance (a launch-flag sketch follows this summary) [31][36]
Community and Contribution
- The SGLang community is active and welcomes contributions [7][54]
- Developers can get involved by starring the project on GitHub, filing issues, joining the Slack channel, and contributing to the codebase [9][54][55]
- The codebase includes the SGLang runtime, a domain-specific front-end language, and a set of optimized kernels [58]
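To make the flags concrete, here is a minimal launch sketch that raises the CUDA graph max batch size above the L4 default of eight and enables EAGLE-3 speculative decoding. Flag names follow SGLang's documented server arguments, but the model path, draft-model path, and all numeric values are illustrative assumptions to verify against your installed version.

```python
# Minimal SGLang launch sketch. Flag names follow SGLang's documented server
# arguments; the model/draft paths and numeric values are assumptions.
import subprocess

cmd = [
    "python", "-m", "sglang.launch_server",
    "--model-path", "meta-llama/Llama-3.1-8B-Instruct",  # example target model
    "--cuda-graph-max-bs", "32",                         # raise from the L4 default of 8
    "--speculative-algorithm", "EAGLE3",                 # EAGLE-3 speculative decoding
    "--speculative-draft-model-path",
    "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",               # example draft model
    "--speculative-num-steps", "5",                      # draft steps per verification pass
    "--speculative-eagle-topk", "8",                     # branching factor of the draft tree
    "--speculative-num-draft-tokens", "32",              # draft tokens verified per pass
]
subprocess.run(cmd, check=True)
```

The speedup from EAGLE-3 scales with the token acceptance rate: the more draft tokens survive verification per pass, the fewer full forward passes the target model needs per generated token.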
Hot-off-the-press interview notes: just finished interviewing with NVIDIA TRT LLM~
自动驾驶之心 · 2025-06-23 11:34
Core Insights
- The article recounts a recent interview experience with NVIDIA for a position related to LLM inference acceleration, highlighting a rigorous interview process and in-depth technical discussions [1].
Group 1: Interview Process
- The interview consisted of four one-hour rounds, four hours in total, indicating a thorough evaluation process [1].
- The first interviewer focused on the candidate's research work, particularly speculative decoding, and posed a coding challenge that the candidate struggled with due to lack of practice [1].
- The second interviewer was familiar with the candidate's research, engaged in a deeper discussion of speculative decoding, and presented a string-manipulation coding problem [1].
Group 2: Technical Discussions
- The third interviewer, a female group leader, discussed the development directions of speculative decoding in high-batch scenarios and asked about transformer structure, specifically the dimensions of Q and K (see the sketch after this summary) [1].
- The fourth interviewer, the only one to turn on the camera, approached the discussion from a systems perspective, offering valuable insights and confirming understanding during the candidate's presentation [1].
Group 3: Internship Details
- The internship location options include Shanghai, Beijing, or remote work, with the role focused on inference optimization rather than purely research-oriented tasks [1].
- The expected internship salary ranges from 8,000 to 10,000 yuan, reflecting the competitive nature of such positions in the tech industry [1].
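The Q/K dimension question from Group 2 is a common one; the essential constraint is that Q and K must agree on the per-head dimension, while their sequence lengths may differ. The NumPy sketch below illustrates the shapes with made-up sizes; nothing here comes from the interview itself.

```python
# Shape walkthrough for Q and K in multi-head attention.
# All sizes are illustrative assumptions.
import numpy as np

batch, seq_len, d_model, n_heads = 2, 16, 512, 8
d_head = d_model // n_heads  # 64: the dimension Q and K must share

x = np.random.randn(batch, seq_len, d_model)
w_q = np.random.randn(d_model, n_heads * d_head)
w_k = np.random.randn(d_model, n_heads * d_head)

# Project, then split the last dim into heads: (batch, n_heads, seq, d_head)
q = (x @ w_q).reshape(batch, seq_len, n_heads, d_head).transpose(0, 2, 1, 3)
k = (x @ w_k).reshape(batch, seq_len, n_heads, d_head).transpose(0, 2, 1, 3)

# Scores contract only over d_head, so Q and K may come from sequences of
# different lengths but must share the per-head dimension:
scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_head)
print(scores.shape)  # (2, 8, 16, 16) = (batch, n_heads, seq_q, seq_k)
```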