Data parallelism
AI godfather Hinton reveals the auction of ten years ago for the first time: I had already decided Google would win
36Kr · 2025-12-21 23:25
A "meeting of AI's two gods" has arrived! In a NeurIPS 2025 fireside chat, AI godfather Hinton and Jeff Dean shared the stage, personally recounting "those years" of the AI revolution along with many little-known anecdotes. That much-discussed NeurIPS 2025 conversation has finally been released! AI godfather Hinton and DeepMind chief scientist Jeff Dean, two key figures in the AI world and old friends who collaborated for years, sat down together. On stage, Hinton posed a pointed question: does Google regret publishing the Transformer paper? Jeff Dean's reply was blunt: "No regrets, because it has had an enormous impact on the world." Beyond that, Hinton publicly revealed that his own epiphany about scaling came from a talk by Ilya. Over the nearly hour-long conversation, the two looked back on everything from the early breakthroughs of machine learning to the challenges and opportunities shaping the field today. They also shared some wonderful anecdotes, from the two GPUs running AlexNet in a bedroom to the early days of Google Brain. The AI godfather's scaling epiphany came from Ilya: the conversation opened with an interesting point the two have in common, namely that both Geoff and Jeff are fascinated by backpropagation. Although the paper on this concept appeared in 1986 in Nat ...
Deep teardown, hardcore deconstruction: unveiling how the vLLM inference system achieves high throughput
机器之心 · 2025-10-26 04:03
Core Insights
- The article surveys the rapid development of large-model applications and the push to make inference faster and more efficient, highlighting vLLM as a high-performance inference framework purpose-built for large language models [1][4].

Inference Engine Basics
- The vLLM framework covers the fundamental pipeline: input/output request handling, scheduling, paged attention, and continuous batching [4].
- Its advanced features include chunked prefill, prefix caching, guided decoding, speculative decoding, and prefill/decode disaggregation [4].

Performance Measurement
- Inference-system performance is measured through latency metrics (time to first token, per-iteration latency, and end-to-end latency) and throughput, complemented by GPU roofline models [4].

Architecture and Components
- The LLM engine is vLLM's core module and can deliver high-throughput inference in offline scenarios (a minimal usage sketch follows below) [8].
- Its key components are the engine core, processor, output processor, model executor, and scheduler, each playing a critical role in the inference loop [15][16].

Scheduling Mechanism
- The scheduler prioritizes decode requests over prefill requests, which keeps in-flight generations progressing efficiently [38][39].
- The vLLM V1 scheduler can mix prefill and decode requests within the same step, improving overall efficiency (a toy scheduling sketch follows below) [39].

Advanced Features
- Chunked prefill processes long prompts by splitting them into smaller chunks, preventing a single long prompt from monopolizing a step [57].
- Prefix caching avoids recomputing KV entries for tokens shared across multiple prompts, significantly speeding up prefill requests (sketched below) [69][73].

Guided and Speculative Decoding
- Guided decoding uses a finite state machine to constrain logits according to grammar rules, so only syntactically valid tokens can be sampled (sketched below) [93][95].
- Speculative decoding adds a draft model that quickly proposes candidate tokens, reducing the number of sequential target-model forward passes needed in autoregressive generation (sketched below) [106][110].

Distributed System Deployment
- vLLM can be deployed across multiple GPUs and nodes, using tensor and pipeline parallelism to serve models that exceed a single GPU's memory (sketched below) [146][150].
- The architecture also supports data parallelism and load balancing, so incoming requests are handled efficiently [130][156].
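To make the "high-throughput offline engine" point concrete, here is a minimal offline-inference sketch using vLLM's public LLM entry point. The prompts and the small placeholder model name are illustrative assumptions, not taken from the article.

```python
# Minimal offline-inference sketch with vLLM's LLM engine.
# The model name below is a small placeholder chosen for illustration.
from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "Paged attention helps because",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# The LLM class wraps the engine core, scheduler, and model executor described above.
llm = LLM(model="facebook/opt-125m")

# generate() runs continuous batching internally until every request has finished.
outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```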
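The scheduling and chunked-prefill bullets describe behavior rather than code, so here is a toy sketch in plain Python (not vLLM internals) of a V1-style step: running decode requests are budgeted first, and the remaining token budget is spent on prefills, chunking a long prompt so it cannot monopolize the step. The Request and schedule_step names are made up for illustration.

```python
# Toy sketch of a V1-style scheduler step with a per-step token budget.
from dataclasses import dataclass
from collections import deque

@dataclass
class Request:
    rid: str
    prompt_len: int
    computed: int = 0  # prompt tokens already prefilled

def schedule_step(running, waiting, token_budget=256):
    """Return {request_id: number_of_tokens_to_process_this_step}."""
    plan = {}
    # 1) Every running request decodes exactly one token this step.
    for req in running:
        if token_budget == 0:
            break
        plan[req.rid] = 1
        token_budget -= 1
    # 2) Spend whatever budget is left on (possibly chunked) prefills.
    while waiting and token_budget > 0:
        req = waiting[0]
        chunk = min(req.prompt_len - req.computed, token_budget)
        plan[req.rid] = plan.get(req.rid, 0) + chunk
        req.computed += chunk
        token_budget -= chunk
        if req.computed == req.prompt_len:
            # Prefill finished: the request starts decoding from the next step.
            running.append(waiting.popleft())
        else:
            # Long prompt: only a chunk fit; the rest continues next step.
            break
    return plan

running = [Request("a", prompt_len=8, computed=8)]
waiting = deque([Request("b", prompt_len=1000)])
print(schedule_step(running, waiting))  # -> {'a': 1, 'b': 255}
```

The point of the sketch is that decodes and a chunk of a long prefill share one step, which is the mixing behavior the summary attributes to the V1 scheduler.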
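For the prefix-caching bullet, a toy illustration of hash-based block reuse: prompt tokens are grouped into fixed-size blocks keyed by the chain of block hashes, so requests that share a prefix reuse cached KV blocks instead of recomputing them. This is a simplified model, not vLLM's actual block manager; BLOCK_SIZE and the helper name are assumptions.

```python
# Toy sketch of hash-based prefix caching over fixed-size token blocks.
BLOCK_SIZE = 16
kv_cache = {}  # block_hash -> simulated KV block

def prefill_with_prefix_cache(token_ids):
    reused, computed = 0, 0
    parent = None
    # Only full blocks are cacheable; the partial tail is always computed fresh.
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK_SIZE, BLOCK_SIZE):
        block = tuple(token_ids[i:i + BLOCK_SIZE])
        h = hash((parent, block))         # hash chains encode the whole prefix
        if h in kv_cache:
            reused += BLOCK_SIZE          # cache hit: skip recomputation
        else:
            kv_cache[h] = f"kv-for-{h}"   # cache miss: compute and store KV
            computed += BLOCK_SIZE
        parent = h
    return reused, computed

system_prompt = list(range(64))  # shared prefix across requests
print(prefill_with_prefix_cache(system_prompt + [101, 102, 103]))  # (0, 64)
print(prefill_with_prefix_cache(system_prompt + [201, 202]))       # (64, 0)
```

In vLLM itself this behavior is controlled by the engine's prefix-caching option (enable_prefix_caching), if I recall the argument name correctly.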
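For the guided-decoding bullet, a minimal sketch of FSM-constrained logits masking: at every step, logits for tokens not allowed by the current grammar state are set to negative infinity before sampling. The tiny vocabulary, hand-written state machine, and fake logits are all illustrative assumptions.

```python
# Toy sketch of FSM-constrained (guided) decoding via logits masking.
import math

VOCAB = ["{", "}", '"key"', ":", '"value"', "hello"]

# A tiny hand-written state machine for the "grammar" {"key":"value"}.
FSM = {
    "start": {"{": "open"},
    "open":  {'"key"': "key"},
    "key":   {":": "colon"},
    "colon": {'"value"': "value"},
    "value": {"}": "done"},
}

def constrain(logits, state):
    allowed = FSM.get(state, {})
    return [x if tok in allowed else -math.inf for tok, x in zip(VOCAB, logits)]

def step(logits, state):
    masked = constrain(logits, state)
    tok = VOCAB[max(range(len(VOCAB)), key=lambda i: masked[i])]  # greedy pick
    return tok, FSM[state][tok]

state, out = "start", []
while state != "done":
    fake_logits = [0.1, 0.2, 0.3, 0.4, 0.5, 0.9]  # stand-in for model logits
    tok, state = step(fake_logits, state)
    out.append(tok)
print("".join(out))  # {"key":"value"}
```

Note that "hello" has the highest raw logit but can never be sampled, because no state of the grammar allows it.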
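For the speculative-decoding bullet, a toy sketch of the draft-and-verify loop: a cheap draft proposes a few tokens and the target model checks them, so several tokens can be committed per target pass. The stand-in draft_propose and target_argmax functions are assumptions; real systems verify the whole proposal in one batched forward pass and use a rejection rule that preserves the target distribution.

```python
# Toy sketch of speculative decoding with greedy acceptance.
def draft_propose(prefix, k):
    # Stand-in draft model: cheaply guess the next k tokens.
    return [f"d{len(prefix) + i}" for i in range(k)]

def target_argmax(prefix):
    # Stand-in target model: the "correct" next token for a given prefix.
    return f"d{len(prefix)}" if len(prefix) % 4 != 3 else f"t{len(prefix)}"

def speculative_step(prefix, k=4):
    proposal = draft_propose(prefix, k)
    accepted = []
    for tok in proposal:
        expected = target_argmax(prefix + accepted)
        if tok == expected:
            accepted.append(tok)        # draft agreed with the target
        else:
            accepted.append(expected)   # first mismatch: take the target's token, stop
            break
    return prefix + accepted            # up to k+1 tokens committed per target pass

print(speculative_step(["<bos>"]))      # e.g. ['<bos>', 'd1', 'd2', 't3']
```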
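Finally, for the multi-GPU deployment bullets, a hedged configuration sketch using vLLM's engine arguments. The model name and parallel sizes are placeholders, and this assumes a recent vLLM release in which pipeline parallelism is available from the offline LLM entry point.

```python
# Hedged multi-GPU deployment sketch: tensor parallelism shards each layer
# across GPUs, pipeline parallelism splits the layer stack into stages.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder: any model too large for one GPU
    tensor_parallel_size=4,       # shard every layer's weights across 4 GPUs
    pipeline_parallel_size=2,     # split the layer stack into 2 stages (e.g. across nodes)
)
```

The online server exposes the equivalent flags (e.g. vllm serve <model> --tensor-parallel-size 4 --pipeline-parallel-size 2), and data parallelism is typically achieved by running multiple engine replicas behind a load balancer, which is the combination the last bullet above refers to.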