Incident: Fastly Content Delivery Network Outage Affects Major Websites

Published Date: 2021-06-08

Postmortem Analysis
Timeline 1. The software failure incident occurred on June 8, 2021 [115269].
System 1. Fastly's content delivery network (CDN) service [115269]
Responsible Organization 1. Fastly [115269]
Impacted Organization 1. UK government's pages [115269] 2. Amazon [115269] 3. Spotify [115269] 4. Reddit [115269] 5. PayPal [115269] 6. Twitch [115269] 7. Stack Overflow [115269] 8. GitHub [115269] 9. Hulu [115269] 10. HBO Max [115269] 11. Quora [115269] 12. Vimeo [115269] 13. Shopify [115269] 14. Stripe [115269] 15. CNN [115269] 16. The Guardian [115269] 17. The New York Times [115269] 18. BBC [115269] 19. Financial Times [115269]
Software Causes 1. The failure was caused by a service configuration at Fastly, a content delivery network (CDN) company, that triggered disruptions across its points of presence (POPs) globally [115269].
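Non-software Causes 1. A software testing expert quoted in the article speculated that the outage was more likely a physical problem or hardware failure than a software issue; Fastly itself attributed the outage to a service configuration [115269].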
Impacts
1. Numerous popular websites, including Amazon, Spotify, Reddit, gov.uk, PayPal, Twitch, Stack Overflow, GitHub, Hulu, HBO Max, Quora, Vimeo, Shopify, Stripe, CNN, The Guardian, The New York Times, BBC, and Financial Times, experienced outages [115269].
2. Users worldwide reported problems accessing web pages and vented their frustration on social media [115269].
3. Passengers trying to fill out locator forms on gov.uk to enter the UK from Portugal and abroad were affected by the outage [115269].
4. Affected websites displayed error messages such as 'Error 503 Service Unavailable' and 'connection failure' [115269].
5. The outage hit streaming sites such as Twitch and Hulu and disrupted the completion of passenger locator forms required by British border officials [115269].
6. Squarespace, Shopify, Vimeo, Imgur, Tidal, Weightwatchers, Kickstarter, and UK chemist Boots were also affected [115269].
7. The Guardian, Reddit, French newspaper Le Monde, and other websites were also hit by the issue [115269].
8. The outage was attributed to Fastly, a content delivery network (CDN) company, which later identified and fixed the service configuration issue [115269].
9. The incident highlighted how heavily major websites rely on content delivery networks like Fastly, and how far-reaching an outage at one provider can be [115269].
Preventions
1. Implementing redundancy and failover mechanisms in the content delivery network (CDN) infrastructure so that, if one component fails, backup systems maintain service [115269].
2. Conducting regular maintenance and testing of the CDN services to identify and address potential issues before they cause widespread outages [115269].
3. Reducing reliance on a single CDN provider by using multiple CDN services or keeping a backup plan in case a specific provider fails [115269].
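The multi-provider idea in prevention 3 can be illustrated with a minimal client-side sketch. It is not how Fastly or any affected site handles failover. The host names, the `fetch` callable, and the choice of which status codes count as "provider unhealthy" are all assumptions made for illustration. The function tries each CDN host in order and falls through to the next one on a connection error or a 502/503/504 response.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: these statuses mean "this provider is unhealthy, try the next".
RETRYABLE_STATUSES = {502, 503, 504}

# Hypothetical fetcher interface: takes a URL, returns (status_code, body).
Fetcher = Callable[[str], Tuple[int, bytes]]


def fetch_with_failover(
    path: str, cdn_hosts: Sequence[str], fetch: Fetcher
) -> Tuple[str, int, bytes]:
    """Try each CDN host in order and return (host, status, body)
    from the first host that does not look unhealthy."""
    errors: List[str] = []
    for host in cdn_hosts:
        url = f"https://{host}{path}"
        try:
            status, body = fetch(url)
        except OSError as exc:  # covers connection failures and timeouts
            errors.append(f"{host}: {exc}")
            continue
        if status in RETRYABLE_STATUSES:
            errors.append(f"{host}: HTTP {status}")
            continue
        return host, status, body
    raise RuntimeError("all CDN hosts failed: " + "; ".join(errors))


if __name__ == "__main__":
    # Stand-in fetcher that simulates the primary provider returning 503.
    def fake_fetch(url: str) -> Tuple[int, bytes]:
        if "cdn-primary" in url:
            return 503, b""
        return 200, b"body { margin: 0 }"

    hosts = ["cdn-primary.example.com", "cdn-backup.example.com"]
    host, status, _ = fetch_with_failover("/static/site.css", hosts, fake_fetch)
    print(f"served by {host} with HTTP {status}")
```

In practice this kind of switch is more often made at the DNS or load-balancer level than in application code; the sketch only shows the decision logic.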
Fixes
1. Fastly identified a service configuration that triggered disruptions across its network and disabled that configuration to fix the issue [115269].
2. Fastly reported that its global network was coming back online after the problem was addressed [115269].
3. Some affected websites may have removed their dependency on Fastly to get back online [115269].
4. A software testing expert quoted in the article speculated that the issue was more likely a physical problem or hardware failure than a software one; this is commentary rather than a fix, and it differs from Fastly's own attribution to a service configuration [115269].
References 1. Fastly - The articles gather information about the software failure incident from Fastly, the content delivery network (CDN) company responsible for the outage [115269].

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization, multiple_organization (a) The software failure incident having happened again at one_organization: - Fastly, the content delivery network (CDN) company responsible for the outage, has faced similar incidents before. This incident caused a massive internet outage affecting numerous popular websites, including the UK government's pages, Amazon, Spotify, Reddit, PayPal, Twitch, and many others [115269]. (b) The software failure incident having happened again at multiple_organization: - The outage caused by Fastly affected not only the UK government and popular websites like Amazon and Spotify but also impacted other services such as Twitch, Reddit, news websites like the BBC, Guardian, CNN, and the New York Times [115269].
Phase (Design/Operation) design (a) The software failure incident in the articles was primarily due to the design phase. The incident was caused by Fastly, a content delivery network (CDN) company, experiencing a service configuration issue that triggered disruptions across its Points of Presence (POPs) globally [115269]. This design-related failure impacted numerous popular websites that relied on Fastly's services, leading to a widespread outage affecting users worldwide. The issue was identified as a problem with Fastly's service configuration, highlighting a failure introduced during the system development or system updates. (b) The software failure incident did not appear to be primarily due to the operation phase or misuse of the system. The outage was attributed to a service configuration issue within Fastly's CDN services, indicating a design-related failure rather than one caused by the operation or misuse of the system [115269].
Boundary (Internal/External) within_system (a) within_system: The software failure incident originated inside Fastly, a content delivery network (CDN) company, where a service configuration issue triggered disruptions across its points of presence (POPs) globally [115269]. This internal issue led to the outage affecting numerous websites that rely on Fastly's services for content delivery. A software testing expert quoted in the article additionally suggested that a physical or hardware failure, rather than a software problem, might be involved [115269]. (b) outside_system: The software failure incident was not primarily caused by factors originating outside the system. The outage resulted from Fastly's internal service configuration issue, which impacted the delivery of content to websites worldwide [115269].
Nature (Human/Non-human) non-human_actions (a) The software failure incident occurred due to non-human actions. The outage was caused by the US firm Fastly, a content delivery network (CDN) company, experiencing a service configuration issue that triggered disruptions across its Points of Presence (POPs) globally [115269]. (b) The software failure incident was not directly attributed to human actions. However, it was mentioned that some of the affected websites may have removed their dependency on Fastly to get back online, indicating potential human intervention to mitigate the impact of the failure [115269].
Dimension (Hardware/Software) hardware (a) The software failure incident occurring due to hardware: - A software testing expert quoted in the article said the issue was probably a physical problem rather than a software-related one, which points to a hardware failure [115269]. - Toby Stephenson, the chief technology officer at Neuways, highlighted the reliance of many big websites on content delivery networks like Fastly, suggesting that hardware failures in such networks can lead to outages [115269]. (b) The software failure incident occurring due to software: - The same expert considered a physical issue or hardware failure more likely, implying that software-related causes were less likely; Fastly itself, however, attributed the outage to a service configuration [115269].
Objective (Malicious/Non-malicious) non-malicious (a) The software failure incident related to the outage of numerous websites, including popular ones like Amazon, Spotify, Reddit, and government websites, was non-malicious. The incident was caused by a failure in the content delivery network (CDN) service provided by Fastly, a company that helps users view website content more quickly. Fastly identified a service configuration issue that triggered disruptions across their Points of Presence (POPs) globally, leading to the outage affecting millions of users accessing various websites [115269]. The outage was not intentional but rather a technical failure within the CDN service that impacted the functioning of multiple websites.
Intent (Poor/Accidental Decisions) accidental_decisions (a) The software failure incident related to the Fastly outage was not due to poor decisions but rather an accidental issue. The outage was caused by a service configuration error at Fastly, a content delivery network (CDN) company, which triggered disruptions across their Points of Presence (POPs) globally [115269]. The issue was not a result of poor decisions but rather an unintended mistake that led to the outage affecting numerous popular websites.
Capability (Incompetence/Accidental) accidental (a) The software failure incident does not seem to be related to development incompetence. A software testing expert quoted in the article suggested a physical problem or hardware failure rather than a software-related one [115269]. (b) The software failure incident was accidental, as it was caused by a service configuration issue at Fastly, a content delivery network (CDN) company, which triggered disruptions across its points of presence (POPs) globally [115269].
Duration temporary The software failure incident described in the articles was temporary. The outage caused by Fastly's service disruption lasted for around an hour before websites started gradually coming back online with slow loading times [115269]. Fastly identified a service configuration issue that triggered disruptions across their network globally and disabled that configuration to bring their global network back online [115269]. Users reported error messages like 'Error 503 Service Unavailable' and 'connection failure' during the outage [115269]. Websites such as the UK government's, Amazon, Spotify, Reddit, PayPal, Twitch, and news websites like BBC, Guardian, CNN, and the New York Times were affected by the temporary software failure incident [115269].
Behaviour crash, omission, other
(a) crash: The software failure incident in the articles can be categorized as a crash. Many websites, including popular ones like Amazon, Spotify, Reddit, gov.uk, PayPal, Twitch, and others, experienced an outage where visitors received error messages like 'Error 503 Service Unavailable' and 'connection failure' [115269].
(b) omission: The software failure incident can also be categorized as an omission. Users reported problems trying to access web pages, with some sites being offline entirely and others showing specific errors like not displaying emojis [115269].
(c) timing: The software failure incident does not seem to be related to timing issues as there is no indication that the system was performing its intended functions too late or too early.
(d) value: The software failure incident does not seem to be related to the system performing its intended functions incorrectly.
(e) byzantine: The software failure incident does not exhibit characteristics of a byzantine failure where the system behaves erroneously with inconsistent responses and interactions.
(f) other: The other behavior exhibited in this software failure incident is a widespread outage caused by a failure in the Fastly content delivery network (CDN), impacting numerous websites globally and preventing users from accessing various online platforms [115269].

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence delay The main consequence of the software failure incident was delay. Users across the globe reported problems trying to access web pages, including popular websites like Amazon, Spotify, Reddit, gov.uk, PayPal, Twitch, and many others, and received error messages such as 'Error 503 Service Unavailable' and 'connection failure' [115269]. Passengers trying to fill out locator forms on gov.uk to enter the UK from Portugal and abroad were also affected, causing frustration and delays in their travel plans [115269].
Domain information, finance, other (a) The failed system was related to the information industry, specifically the production and distribution of information. The outage affected numerous popular websites, including news websites like BBC, The Guardian, CNN, and The New York Times, as well as other information-sharing platforms like Reddit, Stack Overflow, and GitHub [115269]. (h) The finance industry was also impacted. PayPal, a prominent online payment platform, was among the websites experiencing issues during the outage [115269]. (m) The failed system also had implications for industries beyond the options provided. For example, the outage affected e-commerce platforms like Amazon and Shopify, which could be categorized under the retail industry [115269].

