Incident: Amazon Web Services (AWS) Suffers Third Outage in a Month

Published Date: 2021-12-22

Postmortem Analysis
Timeline 1. The software failure incident of Amazon's AWS outage occurred on December 22, 2021 [Article 122030].
System 1. Amazon Web Services (AWS) [Article 122030]
Responsible Organization 1. Amazon Web Services (AWS) - The software failure incident was caused by AWS experiencing a power outage at a data center in Northern Virginia, leading to connectivity issues and disrupting various online services [122030].
Impacted Organization 1. Online giants such as Slack and Epic Games were impacted by the AWS outage [122030].
Software Causes 1. A glitch in automated software caused unexpected behavior that overwhelmed AWS networking devices and affected computer systems on the East Coast [Article 122030]. 2. Internal engineering work incorrectly moved more traffic than expected onto parts of the AWS backbone, causing network congestion that affected connectivity [Article 122030].
Non-software Causes 1. Power outage at a data center in Northern Virginia [Article 122030] 2. Malfunctioning network devices [Article 122030]
Impacts
1. The software failure incident at Amazon Web Services (AWS) led to the disruption of a wide range of online services critical to everyday life, affecting companies like Slack and Epic Games [Article 122030].
2. The outage highlighted the vulnerabilities of an increasingly interconnected web, showcasing how a single failure in a high-profile provider can have huge implications on countless organizations and users [Article 122030].
3. The incident underscored the risks associated with consolidating the Internet's capabilities into a few major providers, as a single glitch can lead to wide-ranging ripple effects, impacting thousands of companies and millions of users [Article 122030].
4. Users experienced connectivity issues during the outage, with services like video-streaming service Hulu and the investment site Fidelity being affected [Article 122030].
Preventions 1. Implementing more robust backup systems and redundant software to handle unexpected glitches and failures [122030]. 2. Conducting thorough testing of changes before deployment and closely monitoring systems afterward to quickly identify and address any issues [122030]. 3. Ensuring automatic ways to back out changes in case of problems to prevent widespread disruptions [122030]. 4. Investing in adequate infrastructure and resources to accommodate growth and prevent system overload [122030].
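The third prevention, backing out a change automatically when problems appear, can be illustrated with a minimal sketch. It does not reflect AWS tooling; the function `deploy_with_rollback`, the in-memory `config`, and the 1,000-connection capacity rule are all hypothetical names and values chosen for illustration.

```python
from typing import Callable


def deploy_with_rollback(
    apply_change: Callable[[], None],
    revert_change: Callable[[], None],
    is_healthy: Callable[[], bool],
    checks: int = 3,
) -> bool:
    """Apply a change, then revert it automatically if a health check fails.

    Returns True if the change was kept, False if it was rolled back.
    """
    apply_change()
    for _ in range(checks):
        if not is_healthy():
            revert_change()  # back out without waiting for a human
            return False
    return True


# Hypothetical in-memory "system": one setting plus a health rule.
config = {"max_connections": 100}


def raise_limit() -> None:
    config["max_connections"] = 100_000


def restore_limit() -> None:
    config["max_connections"] = 100


def within_capacity() -> bool:
    # Assumed rule for illustration: devices cope with up to 1,000 connections.
    return config["max_connections"] <= 1_000


kept = deploy_with_rollback(raise_limit, restore_limit, within_capacity)
```

In this sketch the change exceeds the assumed capacity, so the health check fails and the original setting is restored. A real deployment would likely space the checks out over time and watch several signals rather than one.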
Fixes 1. Implementing more robust backup systems and redundant software to ensure failover mechanisms are in place [122030]. 2. Conducting thorough testing of changes before deployment and closely monitoring systems afterward to quickly identify and address any issues [122030]. 3. Investing in adequate infrastructure and resources to accommodate growth and prevent future outages [122030].
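The failover idea in the first fix can be sketched as trying redundant backends in order until one responds. This is only an illustration under stated assumptions: `call_with_failover`, `primary_region`, and `secondary_region` are hypothetical stand-ins, not AWS APIs, and the primary is hard-coded to fail.

```python
from typing import Callable, List, Sequence


class AllBackendsFailed(Exception):
    """Raised when every redundant backend has failed."""


def call_with_failover(backends: Sequence[Callable[[], str]]) -> str:
    """Return the first successful response, trying backends in order."""
    errors: List[Exception] = []
    for backend in backends:
        try:
            return backend()
        except ConnectionError as exc:
            errors.append(exc)  # remember the failure and try the next one
    raise AllBackendsFailed(f"{len(errors)} backend(s) failed")


def primary_region() -> str:
    # Hypothetical primary that is down, as during an outage.
    raise ConnectionError("primary region unreachable")


def secondary_region() -> str:
    # Hypothetical standby hosted elsewhere.
    return "served from secondary"


result = call_with_failover([primary_region, secondary_region])
```

The standby only helps if it can carry the load, which is the concern about inadequate backup systems raised in the article.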
References 1. Amazon Web Services status page [Article 122030] 2. AWS engineers' postmortem report [Article 122030] 3. Company statements from Amazon [Article 122030] 4. Users on Downdetector [Article 122030]

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization, multiple_organization (a) The software failure incident having happened again at one_organization: The article reports that Amazon Web Services (AWS) suffered its third outage in a month, with previous outages occurring two weeks ago and last week [122030]. These incidents highlight the recurring nature of software failures within the same organization, AWS. (b) The software failure incident having happened again at multiple_organization: The article mentions that AWS is not the only cloud provider facing challenges, as more companies are considering using multiple cloud systems simultaneously due to the potential risks associated with relying on a single provider [122030]. This indicates that software failure incidents are not unique to AWS but are also a concern for other cloud service providers.
Phase (Design/Operation) operation The incident appears to relate to the operation phase rather than the design phase. The outages were attributed to power outages at data centers, malfunctioning network devices, and network congestion, all of which are operational problems in the day-to-day running of the cloud services rather than problems stemming from design or development [122030].
Boundary (Internal/External) within_system (a) The software failure incident related to the Amazon Web Services (AWS) outage can be categorized as within_system. The articles mention that the outages were caused by glitches in automated software, network congestion due to internal engineering decisions, and power issues at data centers [122030]. These factors are internal to the AWS system and infrastructure, leading to service disruptions within the system itself.
Nature (Human/Non-human) non-human_actions (a) The software failure incident occurring due to non-human actions: The outages reported in the article are primarily attributed to non-human factors such as power outages at data centers, malfunctioning network devices, glitches in automated software, and unexpected network congestion. These factors led to connectivity issues and service disruptions for the online services and platforms that rely on Amazon Web Services [122030].
Dimension (Hardware/Software) hardware (a) The software failure incident occurring due to hardware: The software failure incident reported in the article was attributed to a power outage at a data center in Northern Virginia, which triggered connectivity issues and disrupted a wide range of online services provided by Amazon Web Services (AWS) [Article 122030]. This power outage was a hardware-related issue that led to the failure of the software services hosted on the affected servers. (b) The software failure incident occurring due to software: The article does not specifically mention any software-related contributing factors that led to the software failure incident. It primarily focuses on the hardware-related issue of a power outage at the data center causing connectivity issues and service disruptions. Therefore, there is no direct information provided in the articles about software-related contributing factors to the software failure incident.
Objective (Malicious/Non-malicious) non-malicious (a) The software failure incident related to the Amazon Web Services (AWS) outage in Northern Virginia was non-malicious. It was triggered by a power outage at a data center, which caused connectivity issues, disrupted a wide range of online services, and highlighted the vulnerabilities of an interconnected web [Article 122030]. There is no indication in the article that the outage was caused by malicious intent or actions.
Intent (Poor/Accidental Decisions) accidental_decisions (a) The articles do not provide specific information indicating that the software failure incident was due to poor decisions. However, they do mention that the recent AWS outages were caused by glitches in automated software, unexpected behavior overwhelming networking devices, network congestion due to internal engineering errors, and data center power issues [122030]. These incidents suggest that the failures were more likely due to technical issues rather than poor decisions.
Capability (Incompetence/Accidental) accidental (a) The software failure incident occurring due to development incompetence: The articles do not specifically mention the software failure incident occurring due to development incompetence. Therefore, it is unknown. (b) The software failure incident occurring accidentally: The articles mention that the recent AWS outages, including the one on Wednesday, were attributed to various technical issues such as a glitch in automated software, network congestion, and data center power issues. These incidents were not explicitly linked to intentional actions but rather described as unexpected events that overwhelmed the systems [122030].
Duration temporary The software failure incident reported in the articles is temporary. The incidents mentioned in the articles describe specific circumstances that led to the outages, such as a glitch in automated software overwhelming networking devices, network congestion due to internal engineering errors, and data center power issues triggering connectivity problems. These incidents were not permanent failures but rather temporary disruptions that were resolved within a certain timeframe [Article 122030].
Behaviour omission, value, other
(a) crash: The articles mention that the AWS outages resulted in disruptions to a wide range of online services, such as work chat rooms of Slack and the gaming store of Epic Games, due to a power outage at a data center in Northern Virginia [Article 122030].
(b) omission: The outages caused by malfunctioning network devices knocked offline Amazon’s Ring doorbells and Roomba vacuums, indicating an omission in performing their intended functions [Article 122030].
(c) timing: The article does not specifically mention any failures related to timing.
(d) value: The outages caused by glitches in automated software and network congestion led to unexpected behavior and incorrect movement of traffic, resulting in the system performing its intended functions incorrectly [Article 122030].
(e) byzantine: The articles do not mention any failures related to the system behaving erroneously with inconsistent responses and interactions.
(f) other: The other behavior observed in the software failure incident is the potential inadequacy of some backup systems to handle the task, as suggested by Steven Bellovin, a computer science professor at Columbia University [Article 122030].

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence property, non-human (d) property: People's material goods, money, or data were impacted due to the software failure. The AWS outages disrupted a wide range of online services critical to everyday life, affecting thousands of companies and millions of users, with users reporting trouble accessing sites such as Hulu and Fidelity [Article 122030]. In previous incidents, the outages also knocked devices such as Amazon's Ring doorbells and Roomba vacuums offline [Article 122030]. These disruptions indicate that people's material goods, money, and data were impacted by the software failure.
Domain information, finance, entertainment
(a) The software failure incident reported in the articles affected various online services critical to everyday life, such as work chat rooms, gaming stores, video-streaming services, investment sites, and more. This indicates that the failed system was intended to support the production and distribution of information [Article 122030].
(b) The articles do not specifically mention any impact on transportation systems due to the software failure incident.
(c) The articles do not specifically mention any impact on natural resources extraction due to the software failure incident.
(d) The articles do not specifically mention any impact on sales transactions due to the software failure incident.
(e) The articles do not specifically mention any impact on construction activities due to the software failure incident.
(f) The articles do not specifically mention any impact on manufacturing processes due to the software failure incident.
(g) The articles do not specifically mention any impact on utilities services (power, gas, steam, water, sewage) due to the software failure incident.
(h) The software failure incident did not directly involve manipulating or moving money for profit, but it did affect financial services like investment sites [Article 122030].
(i) The articles do not specifically mention any impact on knowledge-related activities (education, research, space exploration) due to the software failure incident.
(j) The articles do not specifically mention any impact on the health industry (healthcare, health insurance, food) due to the software failure incident.
(k) The software failure incident affected entertainment services like gaming stores, video-streaming services, and potentially other entertainment platforms [Article 122030].
(l) The articles do not specifically mention any impact on government-related services (politics, defense, justice, taxes, public services) due to the software failure incident.
(m) The failed system was related to the cloud-computing industry, which falls under the broader category of technology and IT services, not covered in the provided options [Article 122030].
