Incident: Amazon Web Services Outage Impacts Multiple Major Websites and Apps

Published Date: 2011-04-21

Postmortem Analysis
Timeline
1. The software failure incident mentioned in Article 107398 happened on 2020-11-25.
2. The software failure incident mentioned in Article 5289 happened on 2011-04-21.
3. The software failure incident mentioned in Article 122627 happened on 2021-12-08.
4. The software failure incident mentioned in Article 121931 happened on 2021-12-07.
5. The software failure incident mentioned in Article 107682 happened on 2020-11-27.
System 1. Amazon Web Services (AWS) [107398, 5289, 122627, 121931, 107682]
Responsible Organization
1. Amazon, as the operator of AWS. The article also notes that an earlier massive AWS outage was caused by an employee mistyping a command [107398].
2. Amazon, whose network issues led to an outage affecting multiple major websites and apps [122627].
3. Amazon, where a networking event at the northern Virginia site triggered a capacity shortage that impaired new EBS volume creation and recovery [5289].
4. Amazon; the root cause of the outage was related to issues with the application programming interface (API) [121931].
5. Amazon, whose AWS cloud service suffered a large-scale outage [107682].
Impacted Organization
1. The Ring security camera service, iRobot’s Roomba vacuum cleaner app, services from design technology firm Autodesk, and the publishing systems of news outlets such as The Washington Post [107398].
2. Quora, Reddit, and other customers relying on Amazon EC2 services [5289].
3. Websites and apps for Disney Plus, Robinhood, Barclays, Slack, Ring, Prime Music, Alexa, Chime, Ticketmaster, Google, McDonald's, Venmo, Cash App, My Social Security, iRobot, Kindle, Instacart, GoDaddy, Associated Press, Coinbase, Capital One, Roku, IMDb, Paramount+, Canvas, and other services [122627].
4. Flickr, Adobe, Roombas, Rokus, Ring doorbells, and other smart household appliances [107682].
Software Causes 1. The software cause of the failure incident in Article 107398 was related to a major outage in Amazon Web Services (AWS) in its eastern U.S. operations, affecting various applications and services relying on AWS, such as Ring security camera service, iRobot’s Roomba vacuum cleaner app, and services from design technology firm Autodesk [107398]. 2. The software cause of the failure incident in Article 5289 was a partial failure at Amazon Web Services' cloud-computing infrastructure, specifically impacting the Elastic Compute Cloud (EC2) service at Amazon's northern Virginia site, leading to delays and errors when connecting to servers over a network [5289]. 3. The software cause of the failure incident in Article 122627 was network issues that led to an outage affecting multiple major websites and apps, including Disney Plus, Robinhood, Barclays, Slack, Ring, Prime Music, Alexa, and Chime services, due to problems with Amazon Web Services (AWS) [122627]. 4. The software cause of the failure incident in Article 121931 was network issues that led to a widespread outage of Amazon and its services, impacting websites and software services, including Amazon Music, Prime Video, Alexa, Ring, and various other services relying on Amazon Web Services (AWS) [121931]. 5. The software cause of the failure incident in Article 107682 was a large-scale outage on Amazon's cloud service, specifically impacting the U.S. East-1 region, causing smart household appliances like Roombas, Rokus, and Ring doorbells to malfunction, as well as affecting websites and services like Flickr, Adobe, and the Washington Post [107682].
Non-software Causes 1. The failure incident in Article 107398 was caused by a major outage in Amazon's cloud computing service, specifically in its eastern U.S. operations, affecting various services and applications [107398]. 2. The failure incident in Article 5289 was caused by a partial failure at Amazon Web Services' cloud-computing infrastructure, particularly the Elastic Compute Cloud (EC2) service at Amazon's northern Virginia site [5289]. 3. The failure incident in Article 122627 was due to network issues that led to an outage affecting multiple major websites and apps, including Amazon's own operations and resources running from the US-East Region (Virginia) [122627]. 4. The failure incident in Article 121931 was caused by network issues that led to an outage affecting Amazon and all its services, including Amazon Music, Prime Video, Alexa, Ring, and Amazon Web Services [121931]. 5. The failure incident in Article 107682 was caused by a large-scale outage on Amazon's cloud service, impacting various smart household appliances like Roombas, Rokus, and Ring doorbells, as well as websites and services [107682].
Impacts
1. The AWS outage in its eastern U.S. operations disrupted services for Web-connected security cameras, software applications for product design, and news publishing systems [107398].
2. Affected companies and services included the Amazon-owned Ring security camera service, iRobot's Roomba vacuum cleaner app, services from design technology firm Autodesk, the publishing systems of news outlets like The Washington Post, Netflix, Kellogg's, Airbnb, Roku, Shipt delivery service, Flickr, and more [107398].
3. Users were unable to log in, watch videos, activate new accounts, process orders, or access services such as streaming media, delivery apps, and photo storage [107398].
4. The outage disrupted Amazon's own warehouse activities, delivery services, and internal systems, leaving workers idle; some resorted to activities like singing karaoke [122627, 121931].
5. Smart household appliances like Roombas, Rokus, and Ring doorbells malfunctioned during the outage [107682].
6. The outage impaired Amazon's ability to post updates on its Service Health Dashboard [107682].
7. Users were frustrated at being unable to set up new devices like Roombas, schedule cleanings, or operate their smart appliances [107682].
8. The outage raised concerns about potential impacts on Black Friday sales [107682].
Preventions
1. Implementing redundancy and failover mechanisms in the AWS infrastructure to ensure high availability and minimize the impact of outages [107682].
2. Conducting regular training and implementing strict controls to prevent human errors, such as the mistyped command blamed for an earlier AWS outage [107398].
3. Enhancing monitoring and alerting systems to identify and address issues before they escalate [122627].
4. Diversifying data center locations and spreading operations across multiple regions to reduce the impact of localized outages [107398].
5. Investing in robust testing and quality assurance processes to detect and address potential issues in software applications before they impact users [107398].
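One way to picture the "strict controls" in item 2 is a guard that rejects an operator command whose scope is implausibly large. The sketch below is illustrative only: the articles do not describe the actual command or Amazon's tooling, so `check_capacity_removal`, `UnsafeCommandError`, and the 5% `max_fraction` limit are all hypothetical names and values.

```python
class UnsafeCommandError(ValueError):
    """Raised when a requested capacity change exceeds the allowed limit."""


def check_capacity_removal(total_servers: int, requested: int,
                           max_fraction: float = 0.05) -> int:
    """Return `requested` if it is within the allowed share of the fleet.

    `max_fraction` is a hypothetical policy limit, not a value taken
    from the articles.
    """
    if total_servers <= 0:
        raise ValueError("total_servers must be positive")
    if requested <= 0:
        raise UnsafeCommandError("requested must be a positive server count")
    limit = int(total_servers * max_fraction)
    if requested > limit:
        # A mistyped count (for example an extra digit) lands here
        # instead of being executed against the fleet.
        raise UnsafeCommandError(
            f"refusing to remove {requested} of {total_servers} servers; "
            f"limit is {limit}"
        )
    return requested
```

With a fleet of 1000 servers, a request to remove 40 passes, while a request to remove 400 is refused and would need an explicit override outside this function.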
Fixes
1. Increasing redundancy in the system to ensure continuity of services during outages [107682].
2. Implementing measures to prevent similar incidents from occurring in the future [107682].
3. Monitoring and addressing network issues promptly to minimize downtime [122627].
4. Adding capacity to affected regions to speed up recovery processes [5289].
5. Providing regular updates and communication to customers during outages [107398, 5289, 122627].
6. Conducting post-mortem analysis to understand the root cause of the problem and prevent recurrence [5289].
7. Enhancing control planes for critical services like Elastic Block Storage to avoid inundation during outages [5289].
8. Investing in robust infrastructure and network devices to ensure stability and reliability of services [122627].
9. Implementing failover mechanisms to redirect traffic and maintain service availability [unknown].
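The failover idea in item 9 can be sketched from the point of view of a customer application that depends on a single region. This is a minimal illustration under stated assumptions, not a description of how AWS or any affected company implements failover: `call_with_failover`, `RegionUnavailable`, and `fake_request` are hypothetical, and `us-west-2` is used only as an example of a second region.

```python
from typing import Callable, List, Sequence, Tuple


class RegionUnavailable(Exception):
    """Raised by a (hypothetical) regional client that cannot serve a request."""


def call_with_failover(regions: Sequence[str],
                       request: Callable[[str], str]) -> Tuple[str, str]:
    """Try each region in order; return (region, response) from the first success."""
    errors: List[str] = []
    for region in regions:
        try:
            return region, request(region)
        except RegionUnavailable as exc:
            # Record the failure and fall through to the next region.
            errors.append(f"{region}: {exc}")
    raise RuntimeError("all regions failed: " + "; ".join(errors))


def fake_request(region: str) -> str:
    """Stand-in for a real service call; simulates us-east-1 being down."""
    if region == "us-east-1":
        raise RegionUnavailable("simulated outage")
    return f"ok from {region}"
```

Calling `call_with_failover(["us-east-1", "us-west-2"], fake_request)` skips the simulated failed region and returns the response from the second one. A real deployment would also need data replicated to the fallback region, which this sketch does not address.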
References 1. Amazon spokesperson Mary Camarata [107398] 2. Amazon's AWS status page [107398] 3. Amazon spokesperson Richard Rocha [122627] 4. Amazon's health dashboard for Amazon Web Service [121931] 5. Twitter users [107682] 6. iRobot [107682]

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization, multiple_organization (a) The software failure incident having happened again at one_organization: - Article 107682 reports a large-scale outage on Amazon's cloud service, causing widespread havoc on websites and software services, including Roombas, Rokus, Ring doorbells, and other smart household appliances to stop functioning. This incident is similar to a previous outage mentioned in Article 107398, where Amazon's AWS suffered a major outage in its eastern U.S. operations, impacting services like Ring security camera system and iRobot's Roomba vacuum cleaner app. This indicates a recurring issue with Amazon's cloud services [107682, 107398]. (b) The software failure incident having happened again at multiple_organization: - Article 5289 discusses a partial failure at Amazon Web Services' cloud-computing infrastructure that brought down some Internet operations, including the websites of Quora and Reddit. This incident is similar to the outages reported in Article 107398, where AWS disruptions affected services like Roku, Shipt delivery, and Flickr. Additionally, Article 121931 reports another outage on Amazon Web Services that took down major websites and apps for various companies like Disney Plus, Robinhood, Barclays, and Slack. These incidents show that AWS outages have impacted multiple organizations and services [5289, 107398, 121931].
Phase (Design/Operation) design (a) In the incident reported in Article 107398, a major outage on Amazon Web Services (AWS) impacted various software applications and services, including the Ring security camera service, iRobot's Roomba vacuum cleaner app, services from design technology firm Autodesk, and the publishing systems of news outlets like The Washington Post. The outage was attributed to a failure in the AWS cloud computing service, affecting data delivery and user authorization processes [107398]. (b) The incident reported in Article 107682 highlighted how the outage of AWS caused smart household appliances like Roombas, Rokus, and Ring doorbells to malfunction. Users experienced issues with setting up their devices, scheduling cleanings, and accessing specific functionalities due to the AWS outage. The operation of these smart appliances was impacted by the failure of the AWS service [107682].
Boundary (Internal/External) within_system (a) The software failure incident reported in the articles can be categorized as within_system. The incidents were caused by issues within the Amazon Web Services (AWS) cloud infrastructure itself, leading to outages that affected various services and websites relying on AWS [107398, 5289, 122627, 121931, 107682]. The failures were related to problems such as networking events triggering re-mirroring of data volumes, capacity shortages in specific availability zones, control plane issues, and network device issues within the AWS infrastructure. These internal issues within the AWS system led to widespread disruptions for customers and users relying on AWS services.
Nature (Human/Non-human) non-human_actions, human_actions
(a) The software failure incidents occurring due to non-human actions are as follows:
- Article 5289 reports a partial failure of the AWS cloud-computing infrastructure at Amazon's northern Virginia site. A networking event triggered re-mirroring of EBS volumes, which led to a shortage of capacity, impaired new EBS volume creation, and caused control plane issues [5289].
- Article 122627 describes an outage affecting multiple major websites and apps that was caused by network device issues within AWS. The outage was resolved after Amazon worked towards recovery of the impaired services [122627].
(b) The software failure incidents occurring due to human actions are as follows:
- Article 107398 reports a major AWS outage limited to the data centers in Northern Virginia and notes that an earlier massive AWS outage was caused by human error, when an employee mistyped a command [107398].
- Article 121931 discusses an outage that took down multiple major websites and apps, including Amazon's own services, due to network issues within AWS. The outage was likely related to issues with the application programming interface (API), which may indicate a human error in configuration, although the article does not confirm this [121931].
Dimension (Hardware/Software) hardware, software (a) The software failure incident occurring due to hardware: - Article 107398 mentions that the AWS outage was limited to the collection of data centers in Northern Virginia, indicating a hardware-related issue affecting the region [107398]. - Article 122627 reports that Amazon workers mentioned the company's warehouses were at a standstill during the outage, suggesting hardware-related issues impacting operations [122627]. (b) The software failure incident occurring due to software: - Article 107398 highlights that the AWS outage affected various software applications and services, such as Ring security camera system, iRobot's Roomba vacuum cleaner app, and services from design technology firm Autodesk, indicating a software-related failure [107398]. - Article 107682 mentions that the outage caused smart household appliances like Roombas, Rokus, and Ring doorbells to stop functioning, pointing towards a software-related issue impacting these devices [107682].
Objective (Malicious/Non-malicious) non-malicious
(a) The software failure incidents reported in the articles do not indicate any malicious intent behind the failures. Instead, they highlight non-malicious factors contributing to the failures.
- Article 107398 reports a major outage in Amazon's cloud computing service and recalls an earlier massive AWS outage caused by human error, when an employee mistyped a command [107398].
- Article 5289 reports a partial failure at Amazon Web Services' cloud-computing infrastructure, specifically the Elastic Compute Cloud (EC2) service, due to a networking event triggering re-mirroring of EBS volumes, causing a shortage of capacity in one of the availability zones [5289].
- Article 122627 discusses network issues that led to an outage affecting multiple major websites and apps, with Amazon workers struggling to resolve the AWS outage, indicating a non-malicious technical issue [122627].
- Article 121931 describes a global outage affecting Amazon's services, including Amazon Web Services, leading to widespread disruptions, with no mention of malicious intent behind the outage [121931].
- Article 107682 highlights a large-scale outage on Amazon's cloud service that caused various smart appliances to malfunction, with no indication of malicious intent [107682].
Intent (Poor/Accidental Decisions) accidental_decisions (a) poor_decisions: The software failure incidents reported in the articles were not explicitly attributed to poor decisions. The incidents were mainly caused by technical issues, outages, and network problems within Amazon Web Services (AWS) leading to widespread disruptions in various services and applications. The failures were not directly linked to poor decisions made by individuals or the company. (b) accidental_decisions: The software failure incidents reported in the articles were primarily caused by accidental decisions or mistakes. For example, in Article 5289, an AWS outage was triggered by a "networking event" that led to problems with data mirroring, impacting new EBS volume creation. Additionally, in Article 107398, a massive AWS outage two years prior was caused by human error when an employee mistyped a command, leading to significant portions of the system going down. These incidents point to failures resulting from accidental decisions or mistakes rather than intentional actions ([5289, 107398]).
Capability (Incompetence/Accidental) development_incompetence
(a) The software failure incident occurring due to development_incompetence:
- Article 107398 reports a major outage of Amazon Web Services (AWS) in its eastern U.S. operations, impacting various services and applications. The article also recalls an earlier massive AWS outage caused by human error, when an employee mistyped a command and took down a significant portion of the system [107398].
- Article 107682 describes a large-scale outage on Amazon's cloud service that caused widespread havoc on websites and software services. The outage impacted the US East-1 region and resulted in various smart household appliances like Roombas and Ring doorbells malfunctioning. The issue also affected Amazon's ability to post updates to its own Service Health Dashboard [107682].
(b) The software failure incident occurring accidentally:
- Article 5289 discusses a partial failure at Amazon Web Services' cloud-computing infrastructure that brought down some Internet operations, including the websites of Quora and Reddit. The problems were attributed to a "networking event" that triggered re-mirroring of EBS volumes, leading to a shortage of capacity in one of the US-EAST-1 Availability Zones [5289].
- Article 122627 reports on an outage that took down multiple major websites and apps due to network issues on Amazon Web Services. The outage affected services like Disney Plus, Robinhood, Barclays, and Slack, and was resolved after network device issues were addressed [122627].
Duration temporary (a) The software failure incidents reported in the articles were temporary. - In Article 107398, it is mentioned that Amazon's cloud computing service suffered a major outage, but Amazon was actively working to resolve the issue and expected full recovery to take a few hours [107398]. - In Article 5289, Amazon experienced a partial failure in its cloud-computing infrastructure, and Amazon was making progress in resolving the problems but was still having troubles [5289]. - In Article 122627, Amazon resolved network issues that led to an outage, and the company was actively working towards recovery of any impaired services [122627]. - In Article 121931, Amazon and all its services came back online after crashing, and the company stated they had identified the root cause of the problem and were working to fix it [121931]. - In Article 107682, the outage of Amazon Web Services caused various smart appliances to malfunction, but by early Thursday morning, service had been widely restored [107682].
Behaviour crash, omission, timing, value, other
(a) crash: The articles describe crashes in the reported incidents. In Article 107398, the AWS outage caused services like the Ring security camera system, iRobot's Roomba vacuum cleaner app, and Autodesk's applications to fail. In Article 107682, the outage caused Roombas, Rokus, Ring doorbells, and other smart household appliances to stop functioning [107398, 107682].
(b) omission: The incidents also involved omissions, where the system failed to perform its intended functions at certain instances. In Article 107398, the outage affected new account activation and the mobile app for the streaming media service Roku, as well as the Target-owned Shipt delivery service [107398].
(c) timing: In Article 5289, the outage caused delays and errors when connecting to servers, impacting services like Quora and Reddit. This indicates a timing issue, where the system performed its functions but too late [5289].
(d) value: The incidents also involved the system performing its intended functions incorrectly. In Article 107682, the outage caused issues with Roombas, Ring doorbells, and other smart appliances, leading to incorrect functioning of these devices [107682].
(e) byzantine: There is no specific mention in the articles of the incidents exhibiting byzantine behavior.
(f) other: The behavior of the incidents can also be categorized as other, as they involved a combination of crashes, omissions, timing issues, and incorrect functioning of the system. The incidents led to widespread havoc on websites and software services, as described in the articles [107398, 107682].

IoT System Layer

Layer Option Rationale
Perception processing_unit, network_communication (a) sensor: The incidents reported in the articles do not specifically mention any failures related to sensors in the cyber physical system. (b) actuator: The articles do not provide information about failures related to actuators in the cyber physical system. (c) processing_unit: The software failure incidents reported in the articles are related to the processing unit of the cyber physical system. For example, in Article 107398, an outage on Amazon Web Services impacted various software applications and services, affecting companies like iRobot, Ring, and Autodesk, which rely on AWS for their operations. (d) network_communication: The software failure incidents reported in the articles are also related to network communication errors. For instance, in Article 122627, Amazon experienced network issues that led to an outage affecting multiple major websites and apps due to problems with Amazon Web Services, impacting services like Disney Plus, Robinhood, Barclays, and Slack. (e) embedded_software: The articles do not specifically mention any failures related to embedded software in the cyber physical system.
Communication unknown The software failure incidents reported in the articles do not specifically mention whether the failures were related to the communication layer of the cyber physical system that failed. Therefore, it is unknown whether the failures were at the link_level or connectivity_level.
Application TRUE The software failure incidents reported in the articles were related to the application layer of the cyber physical system that failed due to contributing factors introduced by bugs, operating system errors, unhandled exceptions, and incorrect usage. This is evident from the following information: 1. In Article 107398, it is mentioned that Amazon's cloud computing service suffered a major outage, impacting services for Web-connected security cameras, software applications for designing products, and various other applications. The outage was caused by a root cause that Amazon identified and was working to resolve, indicating issues at the application layer [107398]. 2. Article 107682 reports a large-scale outage on Amazon's cloud service that caused smart household appliances like Roombas and Ring doorbells to malfunction. The outage impacted the US East-1 region and affected the ability to post updates to Amazon's Service Health Dashboard, pointing towards issues at the application layer [107682]. Therefore, based on the information from the articles, it can be concluded that the software failure incidents were related to the application layer of the cyber physical system that failed due to bugs, errors, and other contributing factors.

Other Details

Category Option Rationale
Consequence property, delay, non-human
(a) death: People lost their lives due to the software failure - There were no reports of people losing their lives due to the software failure incidents described in the articles [107398, 5289, 122627, 121931, 107682].
(b) harm: People were physically harmed due to the software failure - There were no reports of people being physically harmed due to the software failure incidents described in the articles [107398, 5289, 122627, 121931, 107682].
(c) basic: People's access to food or shelter was impacted because of the software failure - There were no reports of people's access to food or shelter being impacted due to the software failure incidents described in the articles [107398, 5289, 122627, 121931, 107682].
(d) property: People's material goods, money, or data was impacted due to the software failure - The software failure incidents impacted various services and applications, including security cameras, vacuum cleaner apps, design software, news outlets, delivery services, media streaming services, and more, affecting businesses and users [107398, 5289, 122627, 121931, 107682].
(e) delay: People had to postpone an activity due to the software failure - Users experienced delays and disruptions in using services such as security cameras, vacuum cleaners, media streaming, delivery services, and more due to the software failures [107398, 5289, 122627, 121931, 107682].
(f) non-human: Non-human entities were impacted due to the software failure - Non-human entities such as smart household appliances like Roombas, Rokus, Ring doorbells, and other smart devices were impacted by the software failures, causing malfunctions and disruptions in their functionality [107398, 107682].
(g) no_consequence: There were no real observed consequences of the software failure - The software failures described in the articles had observable consequences on various services, applications, and devices, impacting users and businesses [107398, 5289, 122627, 121931, 107682].
(h) theoretical_consequence: There were potential consequences discussed of the software failure that did not occur - The articles did not discuss potential consequences of the software failures that did not occur [107398, 5289, 122627, 121931, 107682].
(i) other: Was there consequence(s) of the software failure not described in the (a to h) options? What is the other consequence(s)? - There were no other consequences of the software failure incidents mentioned in the articles [107398, 5289, 122627, 121931, 107682].
Domain information, utilities, finance, entertainment, government
(a) The failed system supported the information industry: the outage affected services for Web-connected security cameras, software applications for designing products, and the publishing systems of news outlets like The Washington Post. The Wall Street Journal and the Chicago Tribune also reported that the outage affected their operations [107398].
(g) The failed system was also classified as impacting utilities, on the basis that the Target-owned Shipt delivery service had to manage capacity during the outage [107398].
(k) The entertainment industry was affected, with services like Roku experiencing issues with account activation and the mobile app due to the AWS outage [107398].
(l) The government sector was affected, with My Social Security among the services impacted by the AWS outage [122627].
(m) The finance industry was affected, with services like Venmo, Cash App, and Capital One experiencing outages due to the AWS outage [122627].
