Incident: Fastly Software Bug Triggers Global Internet Outage.

Published Date: 2021-06-08

Postmortem Analysis
Timeline
1. The software failure incident happened on June 8, 2021 [115152, 115230, 115231, 115354, 115436, 115479, 115776].
System
1. Fastly's network, due to a bad software update [115230, 115231, 115350, 115354, 115356, 115436, 115479, 115776].
Responsible Organization
1. A single Fastly customer triggered the incident by changing their settings, which exposed a bug in Fastly's software update [115230, 115231, 115350, 115354, 115436, 115479, 115776].
Impacted Organization
1. Fastly [115230, 115231, 115354, 115436, 115479, 115776]
2. Websites and apps around the world [115231, 115354, 115776]
3. Large websites with substantial traffic [115350]
4. Amazon Web Services (AWS) [115353]
Software Causes
1. A bad software update, deployed in mid-May, introduced a bug that could be triggered by a customer configuring their service under specific circumstances, ultimately causing 85% of Fastly's network to return errors [115230, 115231, 115354, 115436, 115479, 115776].
2. The bug lay dormant in Fastly's network until June 8, when a valid configuration change by a single customer triggered it and caused the network failure [115354, 115436].
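The dormant-bug pattern described above, a defect shipped weeks earlier that only fires when a customer supplies a particular, perfectly valid setting, can be sketched in miniature. Everything below is hypothetical: the function, the configuration keys, and the exact failure mode are invented for illustration, not taken from Fastly's code.

```python
# Hypothetical sketch of a latent bug exposed by a valid configuration
# change. All names and the failure mode are invented for illustration.

def build_cache_rules(config):
    """Translate a customer's service configuration into cache rules."""
    rules = []
    for path, ttl in config.get("ttl_overrides", {}).items():
        # Latent defect: shipped in an earlier update, this division was
        # never exercised with a TTL of zero, which is a valid setting.
        refresh_per_hour = 3600 / ttl  # ZeroDivisionError when ttl == 0
        rules.append((path, ttl, refresh_per_hour))
    return rules

# The code works for weeks under every configuration actually in use...
print(build_cache_rules({"ttl_overrides": {"/static": 300}}))

# ...until one customer pushes a perfectly valid change that hits the
# untested path, and the service starts returning errors.
try:
    build_cache_rules({"ttl_overrides": {"/live": 0}})
except ZeroDivisionError:
    print("bug triggered: service returns errors")
```

The point of the sketch is that the triggering input is legitimate; the defect lies entirely in the code path that was never exercised before deployment.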
Non-software Causes
1. The incident was triggered by a single Fastly customer changing their settings, a valid configuration change that exposed a bug introduced in a mid-May software update [115436, 115776].
2. By way of comparison, a separate outage at Cloudflare, another CDN, was caused by an error on a physical link between data centers in Newark and Chicago; when that connection failed, traffic overloaded a link between Atlanta and Washington DC and brought the entire system down, taking almost two hours to fix fully [115350, 115353].
Impacts
1. The software failure incident caused a major internet blackout affecting high-profile websites like Amazon, Reddit, the Guardian, and the New York Times [115776].
2. Fastly's network outage led to 85% of its network returning errors, impacting access to online platforms and services across dozens of countries [115230, 115231, 115479].
3. Users received "Error: 503" messages when trying to access sites, including vital services like the UK government's gov.uk web properties [115436].
4. The failure brought down some websites entirely and broke specific sections of other services, such as the servers hosting Twitter's emojis [115356].
5. The outage highlighted the dangers of over-centralization and lack of resilience in internet infrastructure, threatening organizations' digital experiences, revenues, and reputations [115353].
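The "Error: 503" responses mentioned above are HTTP "Service Unavailable" statuses. A minimal sketch of how a monitoring probe might classify such status codes for alerting follows; the labels and code ranges are illustrative assumptions, not Fastly's.

```python
# Minimal sketch of classifying HTTP status codes for alerting during an
# outage like this one. The health labels are invented for illustration.

from http import HTTPStatus

def classify_response(status_code):
    """Map an HTTP status code to a coarse health label."""
    if 200 <= status_code < 400:
        return "healthy"
    if status_code == HTTPStatus.SERVICE_UNAVAILABLE:  # the 503 users saw
        return "edge_unavailable"
    if 500 <= status_code < 600:
        return "server_error"
    return "client_error"

print(classify_response(200))  # healthy
print(classify_response(503))  # edge_unavailable
```

Treating 503 as its own category, distinct from other 5xx codes, is useful here because a CDN edge returning 503 en masse points at the provider rather than at the origin sites behind it.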
Preventions
1. Implementing more rigorous testing procedures to detect bugs in software updates before deployment [115230, 115436, 115479].
2. Enhancing monitoring systems to quickly identify and isolate issues when they occur [115436, 115776].
3. Conducting thorough post-mortem analyses after incidents to understand root causes and improve processes [115230, 115436, 115479].
4. Increasing redundancy and resiliency in the network infrastructure to mitigate the impact of software failures [115152].
5. Anticipating potential failure scenarios and proactively addressing them to prevent service disruptions [115230, 115436].
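Prevention 2 matters because Fastly reportedly detected the disruption within a minute of it occurring. One common building block for that kind of fast detection is a sliding-window error-rate alarm; the sketch below is a generic illustration with invented window and threshold values, not a description of Fastly's actual monitoring.

```python
# Generic sliding-window error-rate alarm, sketched as one way to detect
# a sudden spike in 5xx responses. Window size and threshold are
# illustrative assumptions.

from collections import deque

class ErrorRateMonitor:
    """Fire an alarm when the fraction of 5xx responses in a sliding
    window of recent requests reaches a threshold."""

    def __init__(self, window=100, threshold=0.5):
        self.samples = deque(maxlen=window)  # 1 = error, 0 = success
        self.threshold = threshold

    def record(self, status_code):
        """Record one response; return True if the alarm should fire."""
        self.samples.append(1 if status_code >= 500 else 0)
        error_rate = sum(self.samples) / len(self.samples)
        return error_rate >= self.threshold

monitor = ErrorRateMonitor(window=20, threshold=0.5)
alarmed = False
for code in [200] * 10 + [503] * 10:  # outage begins halfway through
    if monitor.record(code):
        alarmed = True
print("alarm fired:", alarmed)
```

A small window keeps detection latency low at the cost of noisier alarms; in practice such a check would run per point of presence so a regional fault can be isolated quickly.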
Fixes
1. Deploying a bug fix, followed by a permanent fix, across the network [115230, 115436, 115479]
2. Reviewing processes and practices to detect bugs earlier [115230]
3. Disabling the service configuration that triggered the disruptions [115231]
4. Conducting a complete post-mortem of the processes and practices followed during the incident [115436]
References
1. Fastly [115152, 115231, 115350, 115353, 115354, 115356, 115436, 115479, 115776]
2. Cloudflare [115350, 115353]
3. Google [115231]
4. Amazon [115231, 115356, 115479]
5. CNN [115231, 115356]
6. The New York Times [115152, 115231, 115356, 115479]
7. The Guardian [115356, 115479]
8. Reddit [115152, 115479]
9. Spotify [115152]
10. WIRED [115152]
11. Twitch [115356]
12. Hulu [115356]
13. Financial Times [115356]
14. CNET [115436]
15. eBay [115436]
16. Pinterest [115436]
17. Gov.uk [115436]
18. Twitter [115436]
19. BBC [115436]

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization, multiple_organization
(a) The software failure incident occurred at Fastly: a bug triggered by a specific customer configuration change caused a massive outage affecting a significant portion of its network [115230, 115231, 115354, 115436, 115479, 115776].
(b) Similar incidents have happened before at other organizations: the articles note that bad software updates or configuration errors have temporarily brought down parts of larger online platforms such as Google and Amazon in the past [115230, 115231, 115350, 115353, 115356].
Phase (Design/Operation) design, operation
(a) design: The incident was primarily attributed to a bad software update that Fastly applied on May 12, which introduced a bug that could be triggered by a customer configuring their service under specific circumstances [115230]. This highlights how a bug introduced during development can lead to significant network errors and outages.
(b) operation: The outage was triggered when a customer made a legitimate configuration change that exposed the bug in the mid-May update, causing 85% of the network to return errors [115776]. This points to how operational factors, such as customer settings changes, played a role in the failure.
Boundary (Internal/External) within_system
(a) within_system: The failure was primarily caused by a bad software update that Fastly applied on May 12, which introduced a bug that could be triggered by a specific customer configuration under specific circumstances [115230, 115231, 115354, 115436, 115479, 115776]. This internal software bug led to 85% of Fastly's network returning errors, impacting a significant portion of the internet [115152, 115230, 115231, 115354, 115436, 115479, 115776].
(b) outside_system: The incident was not attributed to a malicious attack or any other external cause; it resulted from an internal software issue within Fastly's system, triggered by a customer configuration change [115350].
Nature (Human/Non-human) non-human_actions, human_actions
(a) Non-human actions:
- A bad software update introduced a bug triggered by a specific customer configuration under specific circumstances [115230].
- Fastly identified a service configuration that triggered disruptions across its servers, leading to the outage [115231].
- The bug in the software update was not triggered until one unidentified customer carried out settings changes that caused 85% of the network to return errors [115479].
(b) Human actions:
- The incident was triggered when a customer pushed a valid configuration change that included the specific circumstances that triggered the bug [115231].
- The outage was ultimately caused by a single customer updating their settings, which exposed a bug in a software update issued to customers [115354].
- Fastly acknowledged that a customer changing their settings had exposed a bug in a software update, causing the network to return errors [115776].
Dimension (Hardware/Software) software
(a) Hardware: No article attributes the incident to hardware issues.
(b) Software:
- The incident was primarily attributed to a bad software update that Fastly applied on May 12, which introduced a bug that could be triggered by specific customer configurations [115230].
- Fastly identified a service configuration that triggered disruptions across its servers, indicating a software-related issue [115231].
- A service configuration change by one of Fastly's customers triggered a bug hidden in Fastly's network, introduced by a software update deployment [115436].
- Fastly said the incident was caused by a bug in its software that was triggered when one of its customers changed their settings [115479].
- The major internet blackout was blamed on a software bug exposed when a customer changed their settings, a bug shipped in a software update issued to customers in mid-May [115776].
Objective (Malicious/Non-malicious) non-malicious
(a) The incident was non-malicious: it was caused by a software bug triggered by a specific customer configuration change, which led to 85% of Fastly's network returning errors [115230, 115231, 115354, 115436, 115479, 115776].
(b) There is no evidence or indication in the articles that the failure was the result of a malicious attack [115350, 115356]; it was described as a major internet blackout caused by a non-malicious software bug, introduced by a software update and exposed when a customer changed their settings [115230, 115231, 115354, 115436, 115479, 115776].
Intent (Poor/Accidental Decisions) poor_decisions
(a) poor_decisions: failure due to contributing factors introduced by poor decisions
- The incident was caused by a bad software update that Fastly applied on May 12, which introduced a bug that could be triggered by a customer configuring their service under specific circumstances [115230].
- Fastly acknowledged that it should have anticipated the outage caused by the specific conditions the update created [115230].
- The outage was triggered by a service configuration change by one of Fastly's customers that revealed a bug hidden in Fastly's network, lying dormant since a software update Fastly deployed in mid-May [115436].
- The outage was attributed to a software bug exposed when one of Fastly's customers changed their settings, a bug shipped in a software update issued to customers in mid-May [115776].
(b) accidental_decisions: failure due to contributing factors introduced by mistakes or unintended decisions
- The incident was described as an accidental error caused by a bad software update whose bug was triggered by a specific customer configuration under specific circumstances [115230].
- Fastly said the bug was triggered accidentally by a customer pushing a valid configuration change that included the specific triggering circumstances [115231].
- A single, unnamed Fastly customer inadvertently triggered the bug during a valid configuration change, leading to 85% of the company's network returning errors [115436].
- Fastly stated that the bug was exposed when a customer quite legitimately changed their settings, unintentionally causing the failure [115776].
Capability (Incompetence/Accidental) development_incompetence, accidental
(a) Development incompetence:
- Fastly experienced a major internet blackout due to a software bug exposed when a customer changed their settings, a bug shipped in a software update issued in mid-May [115776].
- Fastly acknowledged that it should have anticipated the outage caused by the bad software update, and that it provides mission-critical services that should be treated with the utmost sensitivity and priority [115230].
- The outage was attributed to a bad software update introduced by Fastly, which caused 85% of its network to return errors [115231].
- The incident was described as a wake-up call about the dangers of over-centralization in internet infrastructure, highlighting the lack of resilience when a single infrastructure provider's error can cause such widespread disruption [115353].
(b) Accidental:
- The outage was triggered by a bug in a bad software update, activated by a customer configuring their service under specific circumstances [115230].
- A service configuration change by one of Fastly's customers triggered a bug that had been lying dormant in Fastly's network since a software update deployment in May [115436].
- Fastly noticed the outage within a minute of it occurring and engineers worked out the cause shortly after, indicating that the triggering of the bug was accidental [115479].
- The incident was described as a rare but not unheard-of goof of the kind that has temporarily brought down parts of even larger online platforms in the past [115231].
Duration temporary
(a) The failure was temporary, not permanent. It was triggered by specific circumstances: a bug introduced in a software update, activated by a customer configuration change. Fastly detected the disruption within minutes, identified and isolated the cause, and disabled the configuration; 95% of the network was operating as normal within 49 minutes of the incident [115231, 115354, 115479, 115776].
(b) The failure was not permanent, i.e., not caused by contributing factors present under all circumstances. It arose only under the specific trigger of the customer configuration change, and Fastly recovered its network within hours once that cause was identified and addressed [115231, 115354, 115479, 115776].
Behaviour crash, omission, value, other
(a) crash: Fastly's network experienced errors that rendered major websites and services inoperable for almost an hour on Tuesday morning [115353].
(b) omission: Fastly's network omitted to perform its intended functions: a bad software update introduced a bug, triggered by a specific customer configuration, that caused 85% of the network to return errors instead of serving content [115230].
(c) timing: There is no indication of a timing failure; the system returned errors rather than performing its functions too late or too early [115354].
(d) value: The system performed its intended functions incorrectly: a bad software update introduced a bug, triggerable by a customer configuring their service under specific circumstances, that led to 85% of Fastly's network returning errors [115230].
(e) byzantine: There is no specific mention of byzantine behavior in the provided articles.
(f) other: The system effectively took down its own network with a bad software update; the disruptions stemmed from a service configuration that Fastly identified and disabled [115231].

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence property, delay, non-human
(a) death: No information in the provided articles indicates that people lost their lives due to the software failure incident [115152, 115230, 115231, 115350, 115353, 115354, 115356, 115436, 115479, 115776].
(b) harm: No information in the provided articles indicates that people were physically harmed due to the software failure incident [115152, 115230, 115231, 115350, 115353, 115354, 115356, 115436, 115479, 115776].
(c) basic: No information in the provided articles indicates that people's access to food or shelter was impacted due to the software failure incident [115152, 115230, 115231, 115350, 115353, 115354, 115356, 115436, 115479, 115776].
(d) property: The incident caused significant disruptions to online platforms and services, including Reddit, Amazon, Twitch, Spotify, Hulu, the Guardian's website, the BBC, the New York Times, CNN, gov.uk, and the White House website, and led to financial losses for companies like Amazon, with potential revenue impacts [115152, 115230, 115231, 115353, 115354, 115356, 115436].
(e) delay: Websites went offline and users received "Error: 503" messages when trying to access sites, delaying access to information and services [115350, 115356, 115436].
(f) non-human: Various online platforms, services, and websites suffered disruptions and outages; specific sections of services, such as Twitter's servers hosting emojis, were also affected [115356].
(g) no_consequence: Not applicable; the incident had significant consequences, including widespread outages affecting numerous countries and major online platforms, and financial losses for companies like Amazon [115152, 115230, 115231, 115350, 115353, 115354, 115356, 115436].
(h) theoretical_consequence: No potential consequences that did not occur were discussed in the provided articles [115152, 115230, 115231, 115350, 115353, 115354, 115356, 115436, 115479, 115776].
(i) other: No consequences beyond options (a) to (h) were mentioned in the articles [115152, 115230, 115231, 115350, 115353, 115354, 115356, 115436, 115479, 115776].
Domain information, entertainment, government
(a) information: The incident affected the production and distribution of information, as major websites and services, including news providers like the Guardian and the New York Times, were rendered inoperable [115353, 115479]; more broadly, Fastly's outage disrupted the distribution of online content and services [115152, 115231, 115350, 115354, 115356, 115436, 115479, 115776].
(b) transportation: Not directly impacted.
(c) natural resources: Not directly impacted.
(d) sales: No sales transactions were directly involved.
(e) construction: Not directly involved.
(f) manufacturing: Not directly impacted.
(g) utilities: The utilities sector (power, gas, steam, water, sewage) was not directly affected.
(h) finance: Not directly involved.
(i) knowledge: The knowledge industry (education, research, space exploration) was not directly impacted.
(j) health: The health industry (healthcare, health insurance, food) was not directly affected.
(k) entertainment: Indirectly impacted, as major websites like Reddit, Amazon, and entertainment platforms experienced outages [115231, 115436].
(l) government: Directly affected, as government websites like gov.uk were rendered inaccessible [115353, 115436].
