Core Viewpoint
- The article examines why models like DeepSeek-V3 are cost-effective to serve at scale yet slow and expensive to run locally, framing the question as a trade-off between throughput and latency in AI inference services [2][13].

Group 1: Cost and Performance of AI Models
- DeepSeek-V3 is fast and cheap when deployed at large scale, but running it locally is slow and expensive because GPU utilization stays low [2][13].
- The fundamental trade-off in AI inference services is between high throughput with high latency and low throughput with low latency [2][11].

Group 2: Batch Inference
- Batch inference processes many tokens' activations in a single pass, turning the work into the large matrix multiplications (GEMMs) that GPUs execute efficiently [3][11] (see the first sketch after this summary).
- An inference server receives requests, pre-fills prompts, queues incoming tokens, and runs them through the model in batches to maximize GPU efficiency [4][11] (a minimal batching-loop sketch also follows the summary).

Group 3: GPU Efficiency and Model Design
- Mixture-of-experts (MoE) models need large batch sizes to stay GPU-efficient; with small batches, each expert receives only a few tokens and performs many small, wasteful multiplications [7][11] (see the MoE sketch below).
- Models with long pipelines likewise require large batch sizes to avoid pipeline bubbles, keeping every GPU stage busy throughout inference [8][9].

Group 4: Latency and Throughput Trade-offs
- Increasing the batch size raises latency, because users may wait while enough tokens accumulate to fill a batch, but it significantly improves throughput [11][12].
- The chosen batch size and collection window directly set the balance between throughput and latency, with larger windows helping to avoid pipeline bubbles [9][11].

Group 5: Implications for AI Service Providers
- AI service providers must pick batch sizes large enough to eliminate pipeline bubbles and keep experts saturated, which usually means accepting higher latency in exchange for throughput [11][13].
- Architectures like DeepSeek's do not adapt well to personal use, since a single user cannot supply enough parallel work to run them efficiently [13].
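The batch-inference point can be seen with a toy experiment: multiplying a whole batch of token activations by a weight matrix in one GEMM is far cheaper per token than multiplying one token at a time. The NumPy sketch below is illustrative only; the dimensions, batch size, and timing harness are assumptions, not DeepSeek-V3's actual shapes.

```python
# Minimal sketch: why one large GEMM beats many small matrix-vector products.
# Dimensions and batch size are illustrative assumptions, not real model shapes.
import time
import numpy as np

d_model, batch_size = 4096, 128
weights = np.random.randn(d_model, d_model).astype(np.float32)   # one layer's weight matrix
tokens = np.random.randn(batch_size, d_model).astype(np.float32) # one activation vector per queued token

# One token at a time: the weight matrix is streamed through memory once per token.
start = time.perf_counter()
for token in tokens:
    _ = token @ weights
sequential = time.perf_counter() - start

# All queued tokens at once: a single GEMM reuses the weights across the whole batch.
start = time.perf_counter()
_ = tokens @ weights
batched = time.perf_counter() - start

print(f"sequential: {sequential * 1e3:.1f} ms, batched: {batched * 1e3:.1f} ms")
```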
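The server flow from Group 2 (receive requests, queue tokens, wait out a collection window, then run one batched step) can be sketched as a simple loop. Everything here, including the names `serve_step` and `forward_batch` and the stand-in model call, is a hypothetical illustration of the batching and collection-window idea, not a real serving framework's API.

```python
# Minimal sketch of the collection-window idea: hold incoming requests for up to
# `window_s` seconds (or until `max_batch` is reached), then run one batched pass.
import queue
import time

request_queue: "queue.Queue[str]" = queue.Queue()

def forward_batch(batch: list[str]) -> list[str]:
    # Stand-in for one batched model step over every queued request.
    return [req + "<next-token>" for req in batch]

def serve_step(window_s: float, max_batch: int) -> list[str]:
    batch: list[str] = []
    deadline = time.monotonic() + window_s
    # Larger window / max_batch -> fuller batches (better throughput), but the
    # earliest request in the batch waits longer for its token (worse latency).
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return forward_batch(batch) if batch else []
```

Tuning `window_s` and `max_batch` is exactly the provider-side dial described in Groups 4 and 5: larger values saturate the hardware, smaller values return tokens sooner.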
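The MoE point in Group 3 can be illustrated the same way: because tokens are routed across experts, a small batch leaves each expert with only a handful of rows to multiply. This sketch uses a random top-1 router and made-up shapes purely as assumptions to show why per-expert GEMMs shrink when the overall batch is small.

```python
# Minimal sketch of why MoE layers need large batches: each expert's GEMM only
# covers its share of the routed tokens. Shapes, expert count, and the random
# router are illustrative assumptions, not DeepSeek's architecture.
import numpy as np

n_experts, d_model, d_ff = 8, 1024, 4096
experts = [np.random.randn(d_model, d_ff).astype(np.float32) for _ in range(n_experts)]

def moe_layer(tokens: np.ndarray) -> None:
    # Random top-1 routing as a stand-in for a learned router.
    assignments = np.random.randint(0, n_experts, size=len(tokens))
    for expert_id, weight in enumerate(experts):
        chunk = tokens[assignments == expert_id]
        # With a small batch, `chunk` holds only a few rows, so each expert runs
        # a tiny, inefficient GEMM; a large batch gives every expert enough rows
        # to keep the hardware busy.
        if len(chunk):
            _ = chunk @ weight

moe_layer(np.random.randn(4, d_model).astype(np.float32))     # small batch: experts starved
moe_layer(np.random.randn(4096, d_model).astype(np.float32))  # large batch: experts saturated
```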
Hot topic overseas: Why is DeepSeek cheap to deploy at scale but expensive to run locally?
程序员的那些事·2025-06-09 02:14