WARP Virtual Private Network (VPN) Service
One Line of Rust Code Takes Down Half of Global Traffic! Cloudflare's Worst Outage in Six Years Teaches Every Engineer a Lesson
程序员的那些事· 2025-11-19 11:30
Core Viewpoint
- Cloudflare experienced a significant outage on November 18, 2025, lasting approximately five and a half hours and affecting numerous popular websites and AI services, including OpenAI's ChatGPT and Shopify [2][3].

Group 1: Incident Overview
- The outage began around 5:20 AM ET, triggered by a mysterious spike in traffic, which Cloudflare identified about an hour and a half later [2][4].
- The incident impacted not only CDN services but also Cloudflare's application services, including its zero trust network access tools [3][4].
- By 8:09 AM ET, Cloudflare had acknowledged the issue and begun implementing fixes; full service restoration was completed by 11:44 AM ET [4][10].

Group 2: Root Cause Analysis
- Experts indicated that the outage was not due to a single point of failure but to a combination of low-probability events, including a database user permission change that led to duplicate data in SQL queries [5][10].
- A significant factor was a malfunction in the bot management system, exacerbated by a configuration change that caused a spike in the size of a feature file and led to software failures across multiple services [10][11].
- The underlying issue was traced back to a change in ClickHouse query behavior, which generated a feature file with excessive entries and ultimately caused HTTP 5xx errors for dependent services [11][12].

Group 3: Impact and Response
- The outage was described as Cloudflare's most severe incident since 2019, with the company's stock price dropping approximately 3% during the event [13][14].
- Cloudflare acknowledged the gravity of the situation, emphasizing the need for improved internal checks and emergency shutdown mechanisms to prevent future occurrences [14].
- The incident raised concerns about the internet's reliance on single service providers, highlighting vulnerabilities in the infrastructure [15].
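The failure chain described under Root Cause Analysis (duplicate rows from a query inflating a feature file past what its consumer can accept) can be illustrated with a minimal Rust sketch. This is not Cloudflare's code: `MAX_FEATURES`, `LoadError`, `dedup_features`, and `load_features` are hypothetical names, and the limit of 200 is an illustrative value that does not come from the summaries above. The point is only that a loader can return an error for an oversized input instead of panicking, and that deduplicating the rows shrinks the same input back under the limit.

```rust
use std::collections::HashSet;

/// Hypothetical cap on how many features the consumer is prepared to hold.
const MAX_FEATURES: usize = 200;

#[derive(Debug, PartialEq)]
enum LoadError {
    TooManyFeatures { found: usize, limit: usize },
}

/// Drop repeated feature names while keeping first-seen order.
/// Duplicate rows are what the summaries say the changed query produced.
fn dedup_features(rows: &[&str]) -> Vec<String> {
    let mut seen = HashSet::new();
    rows.iter()
        .filter(|name| seen.insert(**name))
        .map(|name| name.to_string())
        .collect()
}

/// Validate the entry count and report a recoverable error
/// rather than unwrapping and crashing the caller.
fn load_features(rows: &[&str]) -> Result<Vec<String>, LoadError> {
    if rows.len() > MAX_FEATURES {
        return Err(LoadError::TooManyFeatures {
            found: rows.len(),
            limit: MAX_FEATURES,
        });
    }
    Ok(rows.iter().map(|name| name.to_string()).collect())
}
```

A caller that receives `Err(LoadError::TooManyFeatures { .. })` can log it and keep serving with its previous feature set, which is one way to avoid turning a bad file into HTTP 5xx responses.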
Rust Causes a Disaster: 53 Days After the Rewrite, Cloudflare Makes Its Biggest Mistake in Six Years, and ChatGPT and Claude Go Offline Together
36Kr · 2025-11-19 10:08
Core Insights
- Cloudflare experienced a significant outage lasting approximately five and a half hours, affecting numerous popular websites and AI services, including OpenAI's ChatGPT and Shopify [1][2][12].
- The outage was triggered by an unexpected spike in traffic due to a malfunction in Cloudflare's bot management system; it was not caused by a DDoS attack [9][10][12].
- The incident is noted as Cloudflare's most severe outage since 2019, with the company's stock price dropping by about 3% during the downtime [12][13].

Incident Details
- The outage began around 5:20 AM ET on November 18, when Cloudflare detected abnormal traffic, leading to service disruptions and error messages [2][3].
- Cloudflare's internal services were affected, including its CDN and application services, which protect workloads from malicious traffic [2][3].
- The company confirmed that the root cause was a configuration change that produced an oversized threat management configuration file, leading to software failures across multiple services [9][10][12].

Recovery Process
- Cloudflare's engineering team worked to identify and rectify the issue; the problem was acknowledged and a fix was being implemented by 8:09 AM ET [3][4].
- The recovery process involved monitoring and restoring services, with full restoration completed by 11:44 AM ET [3][4].

Technical Insights
- The malfunction was linked to a specific line of Rust code that was part of a recent rewrite aimed at improving performance and security [4][6].
- The bot management module, which assesses incoming traffic, was particularly affected because a change in database query behavior resulted in duplicate entries in the configuration file [10][11].

Company Response
- Cloudflare acknowledged the severity of the incident and outlined steps to prevent future occurrences, including enhanced validation of internal configuration files and the introduction of global emergency shutdown switches [12][13].
- The company expressed regret over the incident, emphasizing the importance of its services and the impact of the outage on customer trust [12][13].
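The two preventive measures named under Company Response, stricter validation of internal configuration files and a global emergency shutdown switch, can be sketched in Rust as follows. This is a minimal illustration under stated assumptions, not Cloudflare's implementation: `BotModule`, `BOT_MODULE_DISABLED`, `reload`, and `score` are hypothetical names, the scoring logic is a placeholder, and returning `None` so that callers can let traffic through is an assumed policy.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical global switch an operator could flip to bypass the module.
static BOT_MODULE_DISABLED: AtomicBool = AtomicBool::new(false);

struct BotModule {
    /// Last configuration that passed validation.
    active: Vec<String>,
    /// Hypothetical upper bound on configuration entries.
    max_features: usize,
}

impl BotModule {
    fn new(initial: Vec<String>, max_features: usize) -> Self {
        BotModule { active: initial, max_features }
    }

    /// Try to adopt a new configuration. On failure the previous
    /// configuration stays active and the error is returned to the caller.
    fn reload(&mut self, candidate: Vec<String>) -> Result<(), String> {
        if candidate.is_empty() || candidate.len() > self.max_features {
            return Err(format!(
                "rejected config with {} entries (limit {})",
                candidate.len(),
                self.max_features
            ));
        }
        self.active = candidate;
        Ok(())
    }

    /// Placeholder score: how many request features are known to the module.
    /// Returns None when the kill switch is on, so the caller can skip scoring.
    fn score(&self, request_features: &[&str]) -> Option<usize> {
        if BOT_MODULE_DISABLED.load(Ordering::Relaxed) {
            return None;
        }
        let hits = request_features
            .iter()
            .filter(|f| self.active.iter().any(|a| a.as_str() == **f))
            .count();
        Some(hits)
    }
}
```

In this sketch a rejected reload leaves the module serving with its last known good configuration, and setting `BOT_MODULE_DISABLED` to `true` takes the module out of the request path without a redeploy.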