DeepSeek Crashed
Xin Lang Cai Jing · 2025-07-03 07:47
Core Insights
- DeepSeek experienced a dramatic 24-hour period, achieving a peak of 30 million daily active users but subsequently suffering multiple service outages, highlighting the AI industry's focus on performance over stability [1][3]
- The incident revealed systemic risks within DeepSeek's infrastructure, prompting a reevaluation of AI companies' approaches to stability as they transition from speed competition to endurance challenges [1][3]

Group 1: Service Outage Details
- The outage was not an isolated incident but a systemic risk, starting with a complete API service interruption that saw a 100% failure rate for developer calls [3]
- The first failure occurred at 10:55 AM, with partial recovery by 11:32 AM, but a more severe crash followed, leading to a total service outage until 4:43 PM [3]
- The economic impact was significant, with one enterprise reporting a 500% increase in customer complaints and direct losses exceeding 2 million yuan [3]

Group 2: Technical Challenges
- DeepSeek attributed the outages to sudden traffic spikes, system upgrades, and infrastructure fluctuations, but three structural issues were identified [5]
- The first issue was a failure in traffic prediction, as user growth surged from zero to 30 million in just seven days, overwhelming server resources [5]
- The second issue was the vulnerability of GPU clusters, which faced severe delays and data loss during peak traffic, leading to system protection mechanisms being triggered [5]
- The third issue stemmed from the open-source model, which increased third-party deployments by 300%, further straining server capacity [6]

Group 3: Recommendations for Stability
- The incident underscored the need for a comprehensive stability assurance system in the AI industry, encompassing both technical and commercial aspects [7]
- Upgrading technical architecture is essential, with examples like GMI Cloud's high-bandwidth GPU interconnects and Meta's software optimizations to improve task scheduling efficiency [7][8]
- Innovations in business models, such as DeepSeek's private deployment option for enterprise clients, can alleviate pressure on public services [8]
- The establishment of industry standards for AI service stability is crucial, with proposed requirements for top service providers to achieve 99.9% availability [8]
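
To put the proposed 99.9% availability requirement in perspective, the arithmetic can be sketched as follows. This is an illustrative calculation, not something taken from the report: the 30-day and 365-day measurement periods and the helper names `downtime_budget` and `window_minutes` are assumptions, and the incident window is taken as the full span from the first failure (10:55 AM) to the end of the total outage (4:43 PM), which includes the partially recovered stretch after 11:32 AM.

```python
from datetime import datetime, timedelta


def downtime_budget(availability: float, period: timedelta) -> timedelta:
    """Maximum downtime allowed within `period` at the given availability (0..1)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return period * (1.0 - availability)


def window_minutes(start: str, end: str) -> float:
    """Minutes between two same-day clock times given as HH:MM (24-hour)."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60


TARGET = 0.999  # the 99.9% availability figure cited in the summary

# Assumed measurement periods; the summary does not state one.
monthly_budget_min = downtime_budget(TARGET, timedelta(days=30)).total_seconds() / 60
yearly_budget_min = downtime_budget(TARGET, timedelta(days=365)).total_seconds() / 60

# First failure to end of total outage, as reported (upper bound on disruption).
incident_min = window_minutes("10:55", "16:43")

print(f"Monthly budget at {TARGET:.1%}: {monthly_budget_min:.1f} min")
print(f"Yearly budget at {TARGET:.1%}: {yearly_budget_min:.1f} min")
print(f"Reported incident window: {incident_min:.0f} min")
print(f"Incident / monthly budget: {incident_min / monthly_budget_min:.1f}x")
```

Under these assumptions, a 99.9% target leaves about 43 minutes of downtime per 30-day month, so the 348-minute window above would exceed that budget roughly eightfold.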