Cloudflare(NET)
Cloudflare Global Outage Takes Down Half the Internet!
猿大侠· 2025-11-21 04:11
Core Points
- A significant outage occurred at Cloudflare on November 18, 2025, affecting major internet services globally, including ChatGPT, X (Twitter), and Spotify [1][13].
- The incident is described as a notable event in the history of internet disasters, warranting detailed documentation [2].

Incident Timeline
- At 19:05, Cloudflare engineers deployed a change related to ClickHouse database access control [5].
- The change took effect at 19:28, initiating the outage [6].
- By 22:24, the team had stopped generating new faulty configurations and rolled back to the previous stable version [7].
- The core outage lasted approximately 3 hours, with full recovery taking about 6 hours [8].

Impact and Scope
- The outage had a global impact, affecting nearly half of internet services, including social media, AI platforms, online tools, and gaming services [13].
- Users experienced various errors, such as 500 errors and "Internal Server Error" messages, particularly noticeable during peak usage hours in China [15].

Technical Details
- The root cause was identified as an internal database permission change that triggered a latent bug, leading to abnormal growth in bot management configuration files and subsequent software crashes across global nodes [8][14].
- The Cloudflare team investigated the issue from 19:32 to 21:05 and identified the core problem by 21:37 [8].

Service Level Agreement (SLA) and Compensation
- Cloudflare has not yet announced a compensation plan, but it offers SLA credits to Business and Enterprise plan customers if availability falls below 99.9%, which could result in a partial refund for the outage duration [19].
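The timeline and the 99.9% SLA threshold above can be checked with a little arithmetic. The sketch below is illustrative only: it assumes a 30-day billing month, treats the core outage as the only downtime in that month, and uses a hypothetical helper named `monthly_availability`. Cloudflare's actual SLA measurement and credit formula may differ.

```python
from datetime import datetime

# Times as reported in the summary above (same calendar day, same time zone).
change_deployed = datetime(2025, 11, 18, 19, 5)
outage_started = datetime(2025, 11, 18, 19, 28)
rollback_done = datetime(2025, 11, 18, 22, 24)

# Delay between deploying the change and the outage starting.
propagation_delay = outage_started - change_deployed

# Core outage window: from first impact to the rollback.
core_outage = rollback_done - outage_started


def monthly_availability(downtime_hours: float, days_in_month: int = 30) -> float:
    """Return availability as a percentage (hypothetical helper, 30-day month assumed)."""
    total_hours = days_in_month * 24
    return 100.0 * (total_hours - downtime_hours) / total_hours


core_hours = core_outage.total_seconds() / 3600
availability = monthly_availability(core_hours)

print(f"Propagation delay: {propagation_delay}")
print(f"Core outage: {core_outage} (~{core_hours:.1f} h)")
print(f"Monthly availability: {availability:.3f}%")
print(f"Below 99.9% threshold: {availability < 99.9}")
```

Under these assumptions, a 99.9% target leaves a downtime budget of about 43 minutes in a 30-day month (0.1% of 720 hours), so an outage of roughly three hours would fall well below it.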
Ramsey Theory Group CEO Dan Herbatschek Shares Six Ways to Prevent Latent Bugs from Crashing Bot Mitigation Systems Following Cloudflare's November 18 Incident
Globenewswire· 2025-11-20 12:50
Core Insights
- The recent Cloudflare outage highlights the operational risk posed by latent defects in core services, particularly during routine configuration changes [2][3]
- Organizations are urged to enhance their configuration governance and resilience planning to prevent similar disruptions in the future [1][11]

Group 1: Incident Overview
- On November 18, Cloudflare experienced a significant outage due to a configuration update that revealed a dormant defect in its bot mitigation service, leading to degraded performance across multiple regions [2]
- The outage affected major digital platforms, disrupting access to various consumer and enterprise services globally [2]

Group 2: Recommendations for Businesses
- **Treat Bot Mitigation as Tier-Zero Infrastructure**: Bot mitigation and related services should be considered core systems, with appropriate service level objectives (SLOs) and executive oversight [4]
- **Require Staged Rollouts for All Configuration Changes**: Implement gradual deployment strategies to minimize risk, utilizing canary regions and rollback triggers [5]
- **Establish Production-Mirroring Pre-Prod Environments**: Create pre-production environments that accurately reflect real-world conditions to test configuration updates [6]
- **Enhance Observability Around Configuration Events**: Improve tracking of configuration changes to enable quick responses to issues [7]
- **Architect for Graceful Degradation**: Design systems to handle failures gracefully, ensuring fallback options are available [8]
- **Strengthen Change Management and Post-Incident Learning**: Implement peer reviews and conduct blameless post-mortems to learn from incidents [9]

Group 3: Questions for Security Providers
- Organizations should inquire about the staging and testing processes for bot mitigation updates, automated safeguards against configuration changes causing outages, and rollback protocols for latent bugs [10][13]
- Emphasis is placed on the importance of resilience, which cannot be outsourced, as customers will not differentiate between vendor outages and the organization's own [11]
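The staged-rollout recommendation (canary regions plus rollback triggers) can be sketched as a small loop. This is a minimal illustration under stated assumptions, not any vendor's deployment system: `staged_rollout`, the `deploy`, `error_rate`, and `rollback` callables, the stage names, and the 1% error threshold are all hypothetical placeholders.

```python
from typing import Callable, Dict, List


def staged_rollout(
    stages: List[str],
    deploy: Callable[[str], None],
    error_rate: Callable[[str], float],
    rollback: Callable[[str], None],
    max_error_rate: float = 0.01,
) -> bool:
    """Deploy stage by stage; roll back every touched stage if one exceeds the threshold.

    The callables and the 1% default are hypothetical stand-ins for whatever
    deployment and monitoring tooling an organization actually uses.
    """
    completed: List[str] = []
    for stage in stages:
        deploy(stage)
        completed.append(stage)
        if error_rate(stage) > max_error_rate:
            # Rollback trigger: undo every stage touched so far, newest first.
            for done in reversed(completed):
                rollback(done)
            return False
    return True


# Toy in-memory demonstration: the new config misbehaves in the second stage.
active: Dict[str, str] = {"canary": "v1", "region-a": "v1", "region-b": "v1"}
observed_errors = {"canary": 0.001, "region-a": 0.20, "region-b": 0.001}

ok = staged_rollout(
    stages=["canary", "region-a", "region-b"],
    deploy=lambda s: active.__setitem__(s, "v2"),
    error_rate=lambda s: observed_errors[s],
    rollback=lambda s: active.__setitem__(s, "v1"),
)
print(ok, active)
```

In this toy run the change never reaches `region-b`, and the two stages that received it are returned to `v1`. A real system would also wait for a soak period before reading the error rate, which is omitted here for brevity.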
One Website's Update Knocked Foreigners Offline En Masse for Six Hours
虎嗅APP· 2025-11-20 10:18
Core Points
- The article discusses a significant outage of Cloudflare that caused widespread internet disruptions for approximately six hours, affecting numerous websites and online services globally [5][6][76].
- Cloudflare is described as an essential internet infrastructure provider, likened to a property management company for websites, responsible for security, speed, and traffic management [35][41].
- The outage was triggered by a misconfiguration during an update, leading to a database overload that caused the system to crash [46][52][76].

Group 1: Incident Overview
- The outage began when users experienced difficulties accessing popular platforms like Twitter and ChatGPT, with many websites displaying Error 500 messages indicating Cloudflare's failure [7][14][16].
- The incident led to a collective outcry from users, highlighting the dependency on Cloudflare for internet access [16][19].
- The outage lasted nearly six hours, with services gradually restored after identifying and reverting to a previous stable configuration [75][76].

Group 2: Cloudflare's Role and Functionality
- Cloudflare operates over 330 data centers worldwide, optimizing website access speed and providing security features such as DDoS protection and web application firewalls [38][41].
- The company's architecture involves a complex database system designed to handle vast amounts of data, which was compromised during the incident due to a permissions adjustment [52][54].
- The misconfiguration led to a chaotic response from the system, where multiple data sources provided conflicting information, overwhelming the database and causing the crash [58][62].

Group 3: Implications and Future Considerations
- The outage underscores the vulnerabilities inherent in relying on a few key infrastructure providers, as disruptions can have far-reaching consequences for businesses and users alike [81][87].
- Previous incidents, such as an AWS outage affecting millions, highlight the potential economic impact of such failures, with losses estimated in the millions per hour [81][82].
- The article calls for infrastructure companies to learn from these incidents to improve their systems and prevent future outages [85][88].
Cloudflare outage rocks stock amid sell-off
Yahoo Finance· 2025-11-19 18:33
Whenever a major cloud provider has an outage, we are reminded how dependent we are on the internet, and that a very small number of companies control the majority of it. The last AWS outage lasted significantly longer than most people, including experts, had expected. “I don't think this was just a ‘stuff happens’ outage. I would have expected a full remediation much faster,” Jake Williams, vice president of research and development at Hunter Strategy, told Wired. In addition to causing numerous servic ...
Tencent Research Institute AI Express 20251120
腾讯研究院· 2025-11-19 16:13
Group 1: Gemini 3 and AI Innovations
- Google officially launched Gemini 3 Pro, achieving a top Elo score of 1501 on the LMArena leaderboard, surpassing GPT-5.1 and Claude Sonnet 4.5 with scores of 37.5% in Humanity's Last Exam and 91.9% in GPQA Diamond [1]
- The introduction of the Deep Think mode enhances reasoning capabilities, achieving a groundbreaking score of 45.1% in the ARC-AGI-2 test, with a pricing model based on context length [1]
- Gemini 3 is positioned as a significant step towards AGI, ranking first in the WebDev Arena with an Elo score of 1487, and features a direct interaction style that rejects flattery, acting as a true thinking partner [1]

Group 2: Antigravity AI IDE
- Google launched Antigravity, an AI-native IDE that integrates AI agents, code editors, and browsers to create a complete workflow from coding to deployment [2]
- The core innovation is a "product-driven" workflow that enhances transparency and control over AI processes, supporting user feedback and approval mechanisms [2]
- Antigravity currently supports Gemini 3.0 Pro, Claude 4.5 Sonnet, and GPT-OSS120B, available for macOS, Windows, and Linux, directly challenging Cursor [2]

Group 3: Manus Browser Operator
- Manus introduced the Browser Operator extension, allowing any browser to be upgraded to an AI browser without downloading a full application [3]
- This extension can read user sessions, automate tasks, and execute operations across tabs, transforming the browser into a "programmable workspace" [3]
- Demonstrations show its capability to automatically search for candidates on LinkedIn, parse job descriptions, analyze networks, and generate job requirement documents [3]

Group 4: Microsoft's Work IQ
- Microsoft unveiled Work IQ at the 2025 Ignite conference, which remembers user styles, preferences, habits, and workflows to recommend suitable AI agents for task completion [4]
- Microsoft 365 Copilot has been upgraded to support voice conversations and image and text capture, and allows Excel to choose between Anthropic and OpenAI reasoning models [4]
- The Agent 365 platform offers unified management, access control, visualization, interoperability, and security features, fully integrating AI agents into Windows [4]

Group 5: Microsoft and Nvidia's Investment in Anthropic
- Nvidia and Microsoft committed to investing $10 billion and $5 billion in Anthropic, respectively, with Anthropic agreeing to purchase $30 billion worth of Azure computing power [5][6]
- The Claude series models, including Claude Sonnet 4.5, Opus 4.1, and Haiku 4.5, will be fully integrated into Azure, making them the only models available on all three major cloud services [6]
- Anthropic will utilize Nvidia's Grace Blackwell and Vera Rubin systems for collaborative design and engineering to optimize model performance and future architecture [6]

Group 6: Cloudflare Outage
- Cloudflare experienced a global service outage for three hours due to an unexpected expansion of its bot management system's feature file, affecting approximately 20% of websites [7]
- Major services like ChatGPT, X, Amazon, and Spotify were down, with Downdetector receiving more than 2.1 million error reports, leading to a 7% drop in Cloudflare's stock price [7]
- The incident highlighted vulnerabilities in AI infrastructure, revealing how complex defense systems designed to combat AI crawlers can inadvertently disrupt top AI service providers [7]

Group 7: Zebra's AI Application
- Zebra's AI application uses a fully AI foreign teacher for one-on-one English lessons, achieving a 98.8% speaking rate in the first three minutes, significantly higher than the 85% rate of human teachers [8]
- The "product-model integration" approach allows the AI to communicate with children at different levels and provide personalized learning paths [8]
- The team has broken traditional workflows, fostering direct collaboration between research and product development to create an AI-native organization aimed at transforming English learning from "foreign language learning" to "native language acquisition" [8]

Group 8: Arm and Nvidia Collaboration
- Arm and Nvidia are deepening their collaboration to promote the Neoverse computing platform through the NVLink Fusion architecture, potentially replicating Grace Blackwell-level performance across the ecosystem [9]
- The Fusion version enables seamless data transfer between Neoverse platforms and Nvidia GPUs using the AMBA CHI C2C protocol, enhancing efficiency for Neoverse-based ASICs or CPUs [9]
- This partnership aims to solidify NVLink's position as the industry standard for AI chip interconnects, with major cloud service providers like AWS, Google, Microsoft, Oracle, and Meta building applications based on Neoverse [9]

Group 9: Andrew Ng on AI Bottlenecks
- Andrew Ng identified the primary bottlenecks for AI as power and semiconductors rather than algorithms, emphasizing the need for sufficient GPUs, data centers, and power to enhance computational capabilities [10]
- AI coding assistants are redefining software production methods, acting as "skill amplifiers" that enable more positions to exceed capability boundaries, shifting competition towards maximizing AI efficiency [10]
- The main obstacle to AI implementation in enterprises is organizational structure and behavioral inertia rather than technology, with AI investment logic evolving from "cost-cutting tools" to "speed tools," driving the economy towards a higher "intelligent density" [11]
Unexpected Global Network Service Outages Are Frequent; China Operations Spared from Downtime
第一财经· 2025-11-19 15:38
Core Viewpoint
- The recent service interruption of Cloudflare, affecting major internet platforms like X and ChatGPT, highlights vulnerabilities in modern automated network infrastructures, raising concerns about the balance between efficiency and security [3][5].

Group 1: Service Interruption Details
- On November 18, Cloudflare experienced a service outage due to issues with an automatically generated configuration file intended to manage security threats, which became too large and caused system crashes [3].
- Cloudflare manages approximately 20% of global internet traffic and protects websites and applications from traffic surges and cyberattacks [3].
- Previous outages, such as the one involving Amazon Web Services, have also caused significant disruptions to numerous popular websites and applications [3][4].

Group 2: Industry Implications
- The frequency of network outages reveals a core contradiction in large-scale network infrastructures, where highly automated systems designed for efficiency may introduce new risks [5].
- Experts suggest that the automation and intelligence in network systems must be carefully evaluated for risks, advocating for more flexible usage strategies [5].
- The future internet landscape is expected to evolve into a complex structure characterized by firewalls, sovereign clouds, and physical isolation, where connection resilience will be more critical than speed [5].

Group 3: Regional Impact
- During the Cloudflare outage, operations in China remained unaffected due to the company's collaboration with local partners like JD Cloud to establish domestic data centers [5].
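The failure mode described above, an automatically generated configuration file that grew too large and crashed the software consuming it, suggests one defensive pattern: validate each new file before accepting it, and keep serving the last known-good version otherwise. The sketch below is a hypothetical illustration of that pattern only. The `ConfigLoader` class, its entry limit, and the list-of-strings format are assumptions and do not describe Cloudflare's actual implementation.

```python
from typing import List, Optional


class ConfigLoader:
    """Keeps the last known-good config and rejects empty or oversized updates.

    Hypothetical sketch: the class, the entry limit, and the list-of-strings
    format are illustrative assumptions, not any vendor's real design.
    """

    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        self.current: Optional[List[str]] = None
        self.rejected_updates = 0

    def apply(self, candidate: List[str]) -> bool:
        """Accept the candidate only if it passes validation; otherwise keep the old one."""
        if not candidate or len(candidate) > self.max_entries:
            # Degrade gracefully: record the rejection instead of crashing.
            self.rejected_updates += 1
            return False
        self.current = list(candidate)
        return True


loader = ConfigLoader(max_entries=4)
first = loader.apply(["rule-a", "rule-b"])        # two entries: within the limit
second = loader.apply(["rule-a", "rule-b"] * 3)   # six entries: over the limit
print(first, second, loader.current, loader.rejected_updates)
```

The design choice here is that a bad update is counted and surfaced to monitoring rather than raised as a fatal error, so traffic continues to be handled with the previous configuration.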
Unexpected Global Network Service Outages Are Frequent; Automated Systems Turn Out to Be a Source of Risk
Di Yi Cai Jing· 2025-11-19 14:33
Core Insights
- The recent service outage of Cloudflare on November 18 raised concerns about the risks associated with highly automated systems that utilize artificial intelligence technology [1][2]
- Cloudflare's outage was attributed to an issue with an automatically generated configuration file intended to manage security threats, which became too large and caused system crashes [1]
- The incident highlights a growing trend in the industry where the pursuit of efficiency and speed through automation may introduce new security risks [2]

Company-Specific Insights
- Cloudflare manages approximately 20% of global internet traffic and protects websites and applications from traffic surges and cyberattacks [1]
- During the Cloudflare outage, major platforms like X and ChatGPT were inaccessible, indicating the significant impact of such service disruptions [1]
- In China, Cloudflare's operations remained unaffected due to partnerships with local providers like JD Cloud, which help maintain service continuity [3]

Industry Insights
- The frequency of network outages exposes a fundamental contradiction in modern large-scale network infrastructure, where automation aimed at improving efficiency may become a source of risk [2]
- Experts suggest that as automation and intelligence in network systems become more prevalent, companies must carefully assess the risks associated with these technologies and adopt more flexible strategies [2]
- The future internet landscape is expected to evolve into a "complex maze" characterized by firewalls, sovereign clouds, and physical isolation, where connection resilience will be more critical than speed [3]
Elle Communications is Agency of Record for FDA-Cleared Neurostimulation Device
Accessnewswire· 2025-11-19 14:00
Core Insights
- NET Recovery has announced national treatments utilizing its FDA-cleared neurostimulation device, the NET Device, to address the addiction crisis, particularly focusing on opioid and stimulant use [1]

Company Overview
- NET Recovery is associated with Elle Communications, a subsidiary of Dolphin (NASDAQ:DLPN) [1]
- The NET Device has shown potential in recent research to significantly reduce both opioid and stimulant use following treatment [1]

Industry Context
- The announcement comes amid record-high overdose deaths and a lack of FDA-approved medications specifically for stimulant addiction [1]
- A peer-reviewed study published in Frontiers in Psychiatry supports the efficacy of the NET Device in reducing substance use [1]
Cloudflare CEO Apologizes for 'Unacceptable' Outage and Explains What Went Wrong
CNET· 2025-11-19 13:45
Core Insights
- Cloudflare experienced a significant outage on Tuesday, affecting access to numerous websites and services, including major platforms like OpenAI and Spotify [1][3][6]
- The outage was attributed to an internal software failure rather than a cyberattack, although it initially raised concerns of a "hyper-scale DDoS attack" [4][5]
- The incident highlights the risks associated with reliance on centralized internet services, as similar outages have occurred with other major providers like Amazon Web Services [12][13]

Company Overview
- Cloudflare is a San Francisco-based cloud services and cybersecurity company, utilized by approximately 20% of all websites [2]
- The company provides essential internet infrastructure alongside other major players like Amazon Web Services and CrowdStrike [2]

Outage Details
- The outage began around 3:30 a.m. PT and lasted for over three hours, with most services returning to normal by 6:30 a.m. PT [3][5][11]
- During the outage, Downdetector received over 2.1 million outage reports, with significant numbers from the US, UK, Japan, and Germany [7][8]

Financial Impact
- The outage could result in direct and indirect losses estimated at between $250 million and $300 million, considering the downtime's impact on various services [13]
- The incident raises concerns about the fragility of the infrastructure that supports AI and other critical services [14]
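The Pacific-time window reported above can be lined up against other accounts of the same incident using fixed UTC offsets. In the sketch below, the 19:28 start time comes from a separate Chinese-language report in this feed and is assumed (the source does not state it) to be China Standard Time, UTC+8. Pacific time on November 18 is standard time, UTC-8.

```python
from datetime import datetime, timedelta, timezone

CST = timezone(timedelta(hours=8), "CST")    # China Standard Time (assumed for 19:28)
PST = timezone(timedelta(hours=-8), "PST")   # Pacific Standard Time in November

# Outage start as reported in the Chinese-language summary.
start_china = datetime(2025, 11, 18, 19, 28, tzinfo=CST)

# Convert the same instant to UTC and to Pacific time.
start_utc = start_china.astimezone(timezone.utc)
start_pacific = start_china.astimezone(PST)

print(start_utc.strftime("%Y-%m-%d %H:%M UTC"))
print(start_pacific.strftime("%Y-%m-%d %H:%M PT"))
```

Under that assumption the start converts to 03:28 Pacific time, which agrees with the "around 3:30 a.m. PT" figure in this report.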
BMW welcomes 'positive signals' in Nexperia dispute
Reuters· 2025-11-19 10:23
Core Viewpoint
- BMW acknowledges "positive signals" regarding the Nexperia dispute but emphasizes that the situation remains volatile [1]

Group 1
- BMW is monitoring the developments in the Nexperia dispute closely [1]
- The company expresses cautious optimism about the resolution of the dispute [1]
- The ongoing volatility indicates potential risks that could affect BMW's operations [1]