Incident: Cloudflare Outage Disrupts Internet Services for Various Websites

Published Date: 2019-07-02

Postmortem Analysis
Timeline 1. The first incident was reported in July 2019 [Article 87325]. 2. The second software failure incident involving Cloudflare happened on June 21, 2022 [Article 129699].
System 1. Cloudflare's internet security and services system [87325, 129699]
Responsible Organization 1. Cloudflare [87325, 129699]
Impacted Organization 1. Many websites and online businesses, including Discord, DoorDash, Fitbit, NordVPN, Peloton, OKX, FTX, and others were impacted by the software failure incident reported in the articles [87325, 129699].
Software Causes 1. A "bad software" update, which Cloudflare subsequently "rolled back" [Article 87325] 2. A network change in some of Cloudflare's data centers [Article 129699]
Non-software Causes 1. The articles identify no hardware or external cause; Cloudflare ruled out an attack in both cases [87325, 129699]. 2. An operational network change in some of Cloudflare's data centers made a portion of the network unavailable [Article 129699]
Impacts 1. Many internet users faced problems accessing websites for about an hour due to a Cloudflare glitch, with some seeing "502 errors" displayed in their browsers [87325]. 2. Services like Downdetector and CoinDesk were affected, with CoinDesk misreporting prices due to bad data received from providers [87325]. 3. Cryptocurrency exchanges such as OKX and FTX were temporarily inaccessible [129699]. 4. Various websites and services relying on Cloudflare experienced disruptions [129699].
Preventions 1. Implementing more robust testing procedures for software updates to catch potential issues before deployment [87325]. 2. Enhancing monitoring and alert systems to quickly identify and respond to network changes that could lead to outages [129699].
Fixes 1. Performing a full post-mortem analysis to understand how the incident occurred and prevent it from happening again [87325]. 2. Rolling back the "bad software" update that caused the issue [87325]. 3. Investigating the network change in data centers that caused a portion of the network to be unavailable and addressing it to prevent future outages [129699].
References 1. Cloudflare company spokesperson [Article 129699] 2. John Graham-Cumming, Cloudflare's chief technology officer [Article 87325]
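The preventions and fixes listed above (testing updates before full deployment, monitoring for failures, and rolling back a bad change) can be illustrated with a minimal sketch of a staged rollout guarded by an error-rate check. This is not Cloudflare's actual process, which the articles do not describe. The names `ErrorRateMonitor`, `staged_rollout`, and `fake_observe`, the stage names, the window size, and the 5% threshold are all hypothetical choices made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ErrorRateMonitor:
    """Tracks the share of 5xx responses among the most recent requests."""
    window_size: int = 100   # hypothetical: number of recent requests kept
    threshold: float = 0.05  # hypothetical: roll back above 5% server errors
    _statuses: deque = field(default_factory=deque)

    def record(self, status_code: int) -> None:
        # Keep only the last `window_size` status codes.
        self._statuses.append(status_code)
        if len(self._statuses) > self.window_size:
            self._statuses.popleft()

    def error_rate(self) -> float:
        if not self._statuses:
            return 0.0
        errors = sum(1 for s in self._statuses if 500 <= s <= 599)
        return errors / len(self._statuses)

    def should_roll_back(self) -> bool:
        return self.error_rate() > self.threshold


def staged_rollout(stages, observe):
    """Apply a change stage by stage and stop at the first unhealthy stage.

    stages:  stage names in rollout order (hypothetical labels).
    observe: callable returning the HTTP status codes seen on a stage
             while the change was live there.
    Returns (stages_completed, stage_rolled_back_or_None).
    """
    completed = []
    for stage in stages:
        monitor = ErrorRateMonitor()
        for status in observe(stage):
            monitor.record(status)
        if monitor.should_roll_back():
            return completed, stage
        completed.append(stage)
    return completed, None


def fake_observe(stage):
    # Simulated traffic: the canary is healthy, the next stage serves 502s.
    if stage == "canary":
        return [200] * 100
    return [200] * 80 + [502] * 20


completed, rolled_back = staged_rollout(["canary", "region-1", "global"], fake_observe)
print(completed, rolled_back)  # ['canary'] region-1
```

In this simulation the change never reaches the `global` stage, because the 20% rate of 502 responses on `region-1` exceeds the threshold and the rollout stops there.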

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization, multiple_organization (a) one_organization: Cloudflare experienced another outage, disrupting services for various websites and online platforms [Article 129699]. (b) multiple_organization: The outage affected a wide range of sites, including Discord, DoorDash, Fitbit, NordVPN, Peloton, OKX, FTX, and more, indicating that multiple organizations were impacted [Article 129699].
Phase (Design/Operation) design, operation (a) The software failure incident mentioned in the articles was attributed to a "bad software" update that had been "rolled back" by Cloudflare, causing the internet wobble and disruptions in accessing websites [87325]. This indicates a failure related to the design phase, where contributing factors introduced during system development or updates led to the incident. (b) The incident was also linked to a network change in some of Cloudflare's data centers, which caused a portion of their network to be unavailable, resulting in difficulties for customers in reaching websites and services relying on Cloudflare [129699]. This aspect points towards a failure related to the operation phase, where contributing factors introduced by the operation of the system led to the outage.
Boundary (Internal/External) within_system (a) within_system: The software failure incident was attributed to a "bad software" update that had been "rolled back" by Cloudflare, as mentioned by the company's chief technology officer [87325]. Additionally, Cloudflare reported that a "network change in some of our data centers" caused a portion of their network to be unavailable, indicating an internal system change that led to the outage [129699].
Nature (Human/Non-human) non-human_actions (a) The software failure incident was attributed to a "bad software" update that had been "rolled back" by Cloudflare, causing the disruption in services [87325]. Additionally, Cloudflare mentioned that the outage was caused by a "network change in some of our data centers" which led to a portion of their network being unavailable [129699]. (b) The incident was not the result of an attack, as Cloudflare denied speculation of a distributed denial of service (DDoS) attack and attributed the issue to a "bad software" update that had been rolled back [87325]. The company spokesperson also clarified that the outage was not due to an attack but rather a network change in some data centers [129699].
Dimension (Hardware/Software) software (a) Hardware: The articles do not attribute the incident to a hardware fault. Early speculation pointed to a distributed denial of service (DDoS) attack, a type of cyber attack that floods a system with traffic [87325], but Cloudflare denied that either outage was the result of an attack and attributed the outage reported in Article 129699 to a "network change in some of our data centers" [87325, 129699]. (b) Software: Cloudflare said the incident reported in Article 87325 was caused by a "bad software" update that was subsequently "rolled back" [87325]. The company also stated that, due to the nature of the incident, customers may have had difficulty reaching websites and services that rely on Cloudflare, indicating a software-related issue [129699].
Objective (Malicious/Non-malicious) non-malicious (a) The software failure incident was non-malicious. Both articles [87325, 129699] mention that the outage experienced by Cloudflare was not the result of a malicious attack. In Article 87325, Cloudflare denied speculation of a distributed denial of service (DDoS) attack and attributed the incident to a "bad software" update that had been rolled back. Similarly, Article 129699 states that the outage was caused by a network change in some of Cloudflare's data centers, which made a portion of their network unavailable.
Intent (Poor/Accidental Decisions) accidental_decisions (a) The software failure incident related to Cloudflare's outage was not due to poor decisions but rather an accidental decision. The incident was caused by a network change in some of Cloudflare's data centers, which led to a portion of their network becoming unavailable [129699]. Cloudflare denied that the outage was the result of an attack and attributed it to an unintentional network change that disrupted services for various websites and online services [129699].
Capability (Incompetence/Accidental) development_incompetence, accidental (a) The software failure incident related to development incompetence is evident in Article 87325, where Cloudflare experienced a glitch that caused internet wobbling and disrupted access to many websites. The incident was attributed to a "bad software" update that had been "rolled back," indicating a failure introduced due to a lack of professional competence in managing software updates [87325]. (b) The software failure incident related to accidental factors is highlighted in Article 129699, where Cloudflare suffered an outage due to a network change in some of its data centers. The company clarified that the outage was not the result of an attack but rather an unintended consequence of a network change, indicating a failure introduced accidentally [129699].
Duration temporary (a) The software failure incident reported in the articles was temporary. The incident caused an outage that lasted for about an hour in the first article [87325] and from late Monday to early Tuesday in the second article [129699]. Both articles mention that the issues were resolved within a relatively short period of time after they were identified.
Behaviour crash, value, other (a) crash: The software failure incident described in the articles can be categorized as a crash. Cloudflare experienced an outage that disrupted services for various websites, including Discord, DoorDash, Fitbit, NordVPN, Peloton, OKX, and FTX [129699]. Users faced problems accessing websites, and some received "502 errors" in their browsers, indicating a failure of the system to perform its intended functions [87325]. (b) omission: The articles do not specifically describe the system omitting to perform its intended functions at particular instances. (c) timing: The incident does not seem to be related to the system performing its intended functions too late or too early. (d) value: The software failure incident did result in incorrect data being provided to users. For example, CoinDesk received bad data from its providers, leading to misreporting of prices [87325]. (e) byzantine: The incident does not exhibit characteristics of the system behaving erroneously with inconsistent responses and interactions. (f) other: The incident could also be categorized as a network error, since Cloudflare attributed it to a "bad software" update, which was then "rolled back" to resolve the issue [87325].

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence unknown The articles do not mention death, physical harm, impacts on basic needs, property loss, or effects on non-human entities resulting from the Cloudflare software failure incidents. The main consequences discussed were service disruptions, website inaccessibility, and incorrect data reporting, which inconvenienced users and businesses [87325, 129699].
Domain information, finance, other (a) The software failure incident affected the information industry as it disrupted services for websites like CoinDesk, a news site specializing in cryptocurrencies [87325]. (h) The incident also impacted the finance industry as cryptocurrency exchanges such as OKX and FTX were temporarily inaccessible due to the Cloudflare outage [129699]. (m) The incident could be related to other industries as well, considering the wide range of affected sites that included Discord, DoorDash, Fitbit, NordVPN, Peloton, and more [129699].

