Introduction to LLM serving with SGLang - Philip Kiely and Yineng Zhang, Baseten
AI Engineer · 2025-07-26 17:45
SGLang Overview
- SGLang is an open-source, high-performance serving framework for large language models (LLMs) and large vision-language models (VLMs) [5]
- SGLang supports day-zero releases of new models from labs like Qwen and DeepSeek, and has a strong open-source community [7]
- The project has grown rapidly, from a research paper in December 2023 to nearly 15,000 GitHub stars in 18 months [9]

Usage and Adoption
- Baseten uses SGLang as part of its inference stack for various models [8]
- SGLang is also used by xAI for the Grok models, as well as by inference providers, cloud providers, research labs, universities, and product companies like Cursor [8]

Performance Optimization
- SGLang's performance can be tuned with server flags and configuration options, such as CUDA graph settings [20]
- EAGLE-3, a speculative decoding algorithm, can improve performance by increasing the token acceptance rate [28][42][43] (see the launch sketch after this list)
- The default CUDA graph max batch size on L4 GPUs is eight, but it can be raised to improve performance [31][36]

Community and Contribution
- The SGLang community is active and welcomes contributions [7][54]
- Developers can get involved by starring the project on GitHub, filing issues, joining the Slack channel, and contributing to the codebase [9][54][55]
- The codebase includes the SGLang runtime, a domain-specific front-end language, and a set of optimized kernels [58] (see the front-end sketch after this list)
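The CUDA graph and EAGLE-3 tuning mentioned under Performance Optimization corresponds to SGLang server arguments. The sketch below is not taken from the talk: the model paths and numeric values are placeholders, and the keyword arguments assume the names SGLang's server arguments use (e.g. --cuda-graph-max-bs, --speculative-algorithm) as I understand them.

```python
# Hedged sketch: start an in-process SGLang engine with a larger CUDA graph
# batch size and EAGLE-3 speculative decoding enabled. Paths/values are
# placeholders, not settings recommended in the talk.
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",        # target model (placeholder)
    speculative_algorithm="EAGLE3",                        # enable EAGLE-3 drafting
    speculative_draft_model_path="path/to/eagle3-draft",   # draft head weights (placeholder)
    speculative_num_steps=5,           # draft steps per verification pass
    speculative_eagle_topk=8,          # draft branches kept per step
    speculative_num_draft_tokens=32,   # total draft tokens verified per pass
    cuda_graph_max_bs=32,              # raise from the small default seen on L4 GPUs
)

outputs = llm.generate(
    ["Explain speculative decoding in one sentence."],
    {"temperature": 0.0, "max_new_tokens": 64},
)
print(outputs[0]["text"])
llm.shutdown()
```

The same options can be passed as command-line flags to `python -m sglang.launch_server` when running SGLang as an HTTP server instead of an in-process engine.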
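The codebase bullet also mentions the domain-specific front-end language that ships alongside the runtime. Below is a minimal sketch of that front end, assuming an SGLang server is already running locally; the endpoint URL, function name, and generation parameters are illustrative, not from the talk.

```python
import sglang as sgl

# Point the front end at a running SGLang server (URL is illustrative).
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def answer(s, question):
    # Build the conversation; sgl.gen marks where the runtime should decode.
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("reply", max_tokens=128, temperature=0.2))

state = answer.run(question="What does a CUDA graph buy you at inference time?")
print(state["reply"])
```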
Fresh interview write-up: just finished an interview loop with NVIDIA TRT-LLM~
自动驾驶之心 · 2025-06-23 11:34
Author | 笑渐不闻声渐悄    Editor | 自动驾驶之心    Original link: https://zhuanlan.zhihu.com/p/1918033580103282744

Fresh off the interview: I just finished interviewing with NVIDIA TRT-LLM. My background is in LLM inference acceleration, mainly speculative decoding, and I have a paper accepted at ICLR 2025. Since I want to keep working on inference acceleration, I gave the NVIDIA interview a try to see whether I could build some connections. First, a complaint about the interview format: four interviewers, one hour each, four straight hours back-to-back; by the end I just felt drained... Now a quick rundown of the interview questions.

First interviewer: self-introduction, then a walkthrough of my ICLR 2025 work on speculative decoding. The interviewer asked fairly detailed questions, covering everything from the method's setup to the evaluation, and then I briefly described my NeurIPS 2023 research. The interviewer seemed reasonably satisfied with my research experience, and then gave a coding problem: insert any number of '+' signs into an n-digit number, and finally ...