Incident: CDN Outage at Akamai Technologies and Fastly Impacts Internet Services

Published Date: 2021-06-09

Postmortem Analysis
Timeline 1. The software failure incident happened on June 9, 2021 [115237].
System 1. Distributed Denial-of-Service (DDoS) mitigation software of Akamai Technologies, a cloud service provider. An issue with this software caused the outage, which affected around 500 of Akamai's customers [115237].
Responsible Organization 1. Akamai Technologies was responsible for causing the software failure incident that led to the outage affecting websites like Southwest Airlines, United Airlines, Commonwealth Bank of Australia, and the Hong Kong Stock Exchange [115237].
Impacted Organization 1. Southwest Airlines 2. United Airlines 3. Commonwealth Bank of Australia 4. Hong Kong Stock Exchange 5. Reddit 6. CNN 7. Amazon 8. UK government website 9. Various other websites relying on Akamai and Fastly CDNs [CNN Business]. Items 1 to 4 were affected by the Akamai outage; items 5 to 8 were affected by the earlier Fastly outage [115237].
Software Causes 1. The earlier Fastly outage was caused by a software bug that appeared as part of a normal update and briefly took out around 85% of the company's network [115237]. 2. Akamai's outage was caused by an issue with its DDoS mitigation software, which affected around 500 of its customers [115237].
Non-software Causes 1. The article does not identify any non-software causes. Both outages were attributed to software: an issue with Akamai's DDoS mitigation software and a bug that appeared as part of a normal update at Fastly [115237].
Impacts 1. The websites of Southwest Airlines, United Airlines, Commonwealth Bank of Australia, the Hong Kong Stock Exchange, and others went dark temporarily [115237]. 2. Around 500 of Akamai's customers were affected by the issue with its DDoS mitigation software [115237]. 3. In the earlier Fastly outage, around 85% of the company's network was taken out temporarily [115237]. 4. The outage highlighted the fragility of the internet and raised concerns about cyber risks to critical digital infrastructure [115237]. 5. The incident served as a reminder of the concentration in the CDN space, where a small number of major providers could become big targets for cyberattacks [115237]. 6. The incident underscored the importance of CDNs in preventing distributed denial-of-service attacks and maintaining the security and stability of websites [115237].
Preventions 1. Implementing more robust testing procedures to catch software bugs before they cause outages [115237]. 2. Reducing reliance on a single CDN provider by contracting with multiple CDN operators, so that a failure at one provider has less impact [115237]. 3. Antitrust regulation of the CDN industry to promote competition and reduce the risk of a single provider being targeted in an attack [115237].
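The multi-CDN idea in prevention 2 can be illustrated with a minimal Python sketch. It is not taken from the article and does not describe how Akamai, Fastly, or their customers operate. The provider URLs, the `CDN_PROVIDERS` list, and the functions `http_health_check` and `pick_cdn` are all hypothetical names chosen for illustration. The sketch assumes a site keeps an ordered list of CDN base URLs and serves assets from the first one that passes a health check.

```python
from typing import Callable, Iterable, Optional
import urllib.error
import urllib.request

# Hypothetical CDN base URLs in priority order (assumption, not from the article).
CDN_PROVIDERS = [
    "https://cdn-a.example.com",
    "https://cdn-b.example.com",
]


def http_health_check(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if a HEAD request to base_url answers with a status below 500."""
    request = urllib.request.Request(base_url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 500
    except urllib.error.HTTPError as err:
        # 4xx means the provider answered; 5xx is treated as unhealthy.
        return err.code < 500
    except (urllib.error.URLError, OSError, ValueError):
        # Connection failure, timeout, or malformed URL.
        return False


def pick_cdn(
    providers: Iterable[str],
    is_healthy: Callable[[str], bool] = http_health_check,
) -> Optional[str]:
    """Return the first provider, in priority order, that passes the health check."""
    for base_url in providers:
        if is_healthy(base_url):
            return base_url
    return None  # every provider failed the check


if __name__ == "__main__":
    # Simulate an outage at the primary provider without touching the network.
    down = {"https://cdn-a.example.com"}
    print(pick_cdn(CDN_PROVIDERS, is_healthy=lambda url: url not in down))
```

Passing the health check as a parameter keeps the selection logic testable offline. A real deployment would more likely switch providers at the DNS or load-balancer level, which this sketch does not attempt to model.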
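Fixes The article does not describe a specific technical fix; it reports only that Akamai resolved the issue within four hours [115237]. The longer-term remedies it discusses are: 1. Antitrust regulation of the CDN industry to promote competition and reduce reliance on a few major providers [115237]. 2. Promoting the growth of more CDN alternatives to diversify the market and reduce the risk of widespread outages [115237].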
References 1. Nick Merrill, research fellow at UC Berkeley’s Center for Long-Term Cybersecurity [115237] 2. David Vaskevitch, former Microsoft Chief Technology Officer [115237] 3. Doug Madory, director of internet analysis at network analytics firm Kentik [115237] 4. Nick Rockwell, senior vice president of engineering and infrastructure at Fastly [115237]

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization, multiple_organization (a) one_organization: The article notes that, for the second time in 10 days, a large part of the internet briefly broke because of an outage at a company most people have probably never heard of, and it describes the Akamai incident as similar to the recent outage at Fastly, a comparable company [115237]. (b) multiple_organization: The article describes the Akamai outage as nearly identical to the earlier Fastly outage, which affected major sites including Reddit, CNN, Amazon, and a UK government website. Similar incidents have therefore occurred at more than one CDN provider [115237].
Phase (Design/Operation) design, operation (a) design: Fastly experienced an outage due to a software bug that appeared as part of a normal update and briefly took out around 85% of the company's network [115237]. This indicates contributing factors introduced during system development or system updates. (b) operation: Akamai stated that around 500 of its customers were affected by an issue with its DDoS mitigation software that caused the outage [115237]. This points to contributing factors introduced by the operation of the system.
Boundary (Internal/External) within_system (a) within_system: The Fastly and Akamai failures were caused by factors internal to each provider's system. Fastly's outage was due to a software bug that appeared as part of a normal update and temporarily took out around 85% of the company's network [115237]. Akamai's outage was due to an issue with its DDoS mitigation software, which affected around 500 of its customers [115237]. (b) outside_system: The articles do not report any contributing factors originating from outside the system.
Nature (Human/Non-human) non-human_actions (a) non-human actions: Both outages were attributed to software faults. At Fastly, a software bug that appeared as part of a normal update briefly took out around 85% of the company's network [115237]. At Akamai, an issue with its DDoS mitigation software caused the outage and affected around 500 of its customers [115237]. (b) human actions: The articles do not state that either incident was directly caused by human actions [115237].
Dimension (Hardware/Software) software (a) The contributing factors reported in the articles originate in software. The outage at Akamai was caused by an issue with its DDoS mitigation software [115237], and Fastly experienced a software bug that took out a significant portion of its network [115237]. Both incidents show how software issues can lead to widespread outages for the websites and services that rely on these providers.
Objective (Malicious/Non-malicious) non-malicious (a) malicious: The articles do not attribute either outage to an attack. They do discuss the possibility that content delivery networks (CDNs) such as Fastly and Akamai, as centralized points on the internet, could be targeted by cybercriminals or government actors [115237]. (b) non-malicious: The outages at Fastly and Akamai were caused by faults in their own software. Fastly's outage was due to a software bug that appeared during a normal update and affected around 85% of its network. Akamai's outage was due to an issue with its DDoS mitigation software and affected around 500 of its customers [115237]. Both failures arose from unintentional factors within the systems.
Intent (Poor/Accidental Decisions) accidental_decisions (a) poor_decisions: The articles do not attribute the failure to poor decisions. They do highlight the risk created by concentration in the CDN space, where a small number of major providers could become big targets for attacks, and the resulting concerns about the internet's digital infrastructure [115237]. (b) accidental_decisions: Fastly's outage was caused by a software bug that appeared as part of a normal update and briefly took out around 85% of the company's network. Akamai's outage was caused by an issue with its DDoS mitigation software and affected around 500 of its customers. These incidents were described as accidental decisions or mistakes that led to the failures [115237].
Capability (Incompetence/Accidental) accidental (a) development_incompetence: The articles do not attribute either outage to development incompetence [115237]. (b) accidental: Both failures were accidental in nature. Fastly's software bug appeared as part of a normal update and briefly took out around 85% of its network, and Akamai's outage was caused by an issue with its DDoS mitigation software [115237].
Duration temporary Both the Fastly and Akamai outages were short-lived: Fastly restored its service quickly, and Akamai resolved the issue within four hours. Each was caused by a specific contributing factor, a software bug in Fastly's case and an issue with the DDoS mitigation software in Akamai's. The articles emphasize that occasional failures and outages are inevitable in technology, and that the measure of success is how quickly major internet firms recover from such rare outages [115237].
Behaviour crash, omission, other (a) crash: The articles describe a crash in which the system lost state and did not perform its intended functions. Both Fastly and Akamai experienced outages in which significant portions of their networks went down, affecting numerous websites [115237]. (b) omission: The systems also omitted to perform their intended functions at certain instances. For example, Fastly's software bug briefly took out around 85% of the company's network, impacting a large number of customers [115237]. (c) timing: The articles do not describe a timing failure, in which the system performs its intended functions too early or too late. (d) value: The incident did not involve the system performing its intended functions incorrectly. (e) byzantine: The incident did not involve the system behaving erroneously with inconsistent responses and interactions. (f) other: The articles also raise the risk that the internet's heavy reliance on just a few CDNs could make them the target of an attack, which could significantly disrupt internet services [115237].

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence property, delay, non-human, theoretical_consequence (property) The outage at Akamai Technologies caused the websites of companies such as Southwest Airlines, United Airlines, Commonwealth Bank of Australia, and the Hong Kong Stock Exchange to go dark [115237]. This could have led to financial losses for these companies and disrupted online services for their customers.
Domain transportation, finance, government (a) transportation: The Akamai outage affected the websites of Southwest Airlines and United Airlines [115237]. (h) finance: The websites of the Commonwealth Bank of Australia and the Hong Kong Stock Exchange, a critical component of Hong Kong's financial infrastructure, went down because of the issue at Akamai Technologies [115237]. (l) government: A UK government website was among the sites affected by the earlier Fastly outage [115237].

Sources
