Incident: Amazon Web Services Outage Impacts Multiple Services Globally

Published Date: 2021-12-07

Postmortem Analysis
Timeline
1. The software failure incident happened on December 7, 2021 [Article 121987, Article 122437, Article 122641].
System
1. Amazon Web Services (AWS) [121987, 122437, 122641]
2. Application Programming Interface (API) [121987, 122437, 122641]
3. Network devices [121987, 122641]
4. Streaming platforms like Netflix and Disney+ [122641]
5. Robinhood trading app [122641]
6. Amazon.com Inc's e-commerce website [122641]
7. Ring security cameras [121987, 122437, 122641]
8. Mobile banking app Chime [121987, 122437, 122641]
9. Robot vacuum cleaner maker iRobot [121987, 122437, 122641]
Responsible Organization
1. Amazon Web Services (AWS): the outage originated within Amazon's own systems, where network device and application programming interface (API) issues caused the failure [121987, 122437, 122641].
Impacted Organization
1. Amazon services, including its website, Prime Video, and applications that use Amazon Web Services (AWS) [Article 121987, Article 122437, Article 122641]
2. Ring security cameras [Article 121987, Article 122437, Article 122641]
3. Mobile banking app Chime [Article 121987, Article 122437, Article 122641]
4. Robot vacuum cleaner maker iRobot [Article 121987, Article 122437, Article 122641]
5. Delivery operations [Article 121987, Article 122437]
6. Amazon Music [Article 122437]
7. Alexa [Article 122437]
8. Kindle [Article 122437]
9. Instacart [Article 122437]
10. Venmo [Article 122437]
11. GoDaddy [Article 122437]
12. Associated Press [Article 122437]
13. Coinbase [Article 122437]
14. CashApp [Article 122437]
15. CapitalOne [Article 122437]
16. Roku [Article 122437]
17. Netflix [Article 122641]
18. Disney+ [Article 122641]
19. Robinhood [Article 122641]
Software Causes
1. The outage was related to issues with the application programming interface (API) [121987, 122437, 122641].
2. The outage was also linked to problems with network devices [121987, 122641].
Non-software Causes
1. Network device issues were a cause of the outage [121987, 122437, 122641].
Impacts
1. Several Amazon services, including its website, Prime Video, and applications that use Amazon Web Services (AWS), went down for thousands of users, impacting delivery operations and causing disruptions in warehouse operations [Article 121987].
2. The outage affected various other services and platforms such as Ring security cameras, mobile banking app Chime, robot vacuum cleaner maker iRobot, Internet Movie Database (IMDb), language learning provider Duolingo, dating site Tinder, and presale tickets for Adele's performances in Las Vegas [Article 121987].
3. The outage also impacted streaming platforms like Netflix and Disney+, trading app Robinhood, and e-commerce website Amazon.com Inc, affecting consumers shopping ahead of Christmas [Article 122641].
4. Users experienced issues with Amazon Music, which some consumers pay $16 a month to access [Article 122437].
5. The outage caused frustration among users, with some expressing their disappointment on social media platforms like Twitter [Article 122437].
Preventions
1. Implementing redundancy and failover mechanisms in the network devices to prevent widespread outages like the one experienced by Amazon [121987, 122437, 122641].
2. Conducting regular testing and monitoring of APIs to identify and address potential issues before they lead to service disruptions [121987, 122437, 122641].
3. Diversifying service providers or utilizing multiple cloud service providers to reduce dependency on a single provider like AWS [121987, 122437, 122641].
4. Enhancing communication and coordination with delivery partners and other stakeholders to mitigate the impact of outages on operations [121987, 122437].
5. Improving incident response and recovery procedures to minimize downtime and ensure faster restoration of services [121987, 122437, 122641].
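The failover and multi-provider ideas in the preventions above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the provider names and the two stand-in functions are hypothetical, and nothing here reflects an AWS API or any mechanism described in the articles.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


class AllProvidersFailed(Exception):
    """Raised when every configured provider raised an error."""


def call_with_failover(
    providers: Sequence[Tuple[str, Callable[[], T]]]
) -> Tuple[str, T]:
    """Try each (name, call) pair in order and return the first success."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call()
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins for real service clients (assumptions for illustration).
def primary_region() -> str:
    raise ConnectionError("primary API unavailable")


def secondary_provider() -> str:
    return "ok"


used, result = call_with_failover(
    [("primary", primary_region), ("secondary", secondary_provider)]
)
print(used, result)
```

A caller that depends on a single provider has nothing to fall back to when the call raises. Listing a second, independent provider lets the request succeed, at the cost of keeping both integrations working.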
Fixes
1. Resolving network device issues and issues related to application programming interface (API) [121987, 122437, 122641]
2. Working towards recovery of any impaired services [121987]
3. Identifying the root cause of the problem and actively working towards recovery [122437]
4. Full recovery across services [122641]
References
1. Amazon's service health dashboard [121987, 122437]
2. Downdetector [121987, 122437, 122641]
3. Spokespersons from Amazon [121987, 122437]
4. Social media pages of affected companies like Ring, Chime, iRobot [121987, 122437, 122641]
5. Bloomberg [122437]
6. Twitter [122437]
7. Users reporting issues [121987, 122641]
8. Analytics firm Kentik [122641]
9. Reuters [122641]

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization, multiple_organization
(a) The software failure incident having happened again at one_organization:
- Amazon has experienced 27 outages over the past 12 months related to its services [Article 121987, Article 122641].
- In July, Amazon's online stores service was disrupted for nearly two hours; at the peak of the disruption, more than 38,000 user reports indicated issues [Article 121987, Article 122437].
(b) The software failure incident having happened again at multiple_organization:
- In June, websites including Reddit, Amazon, CNN, PayPal, Spotify, Al Jazeera Media Network, and the New York Times were hit by a widespread hour-long outage linked to U.S.-based content delivery network provider Fastly Inc, a smaller rival of AWS [Article 121987, Article 122641].
- The outage has impacted a wide variety of service providers worldwide, among them iRobot, Chime, CashApp, CapitalOne, GoDaddy, the Associated Press, Instacart, Kindle, and Roku [Article 122437].
Phase (Design/Operation) design, operation
(a) The software failure incident related to the design phase:
- The software failure incident affecting Amazon's services, including its website, Prime Video, and applications using Amazon Web Services (AWS), was attributed to problems related to the application programming interface (API), which is a set of protocols for building and integrating application software [121987, 122437].
- The outage was linked to network devices and issues with the API, indicating a design-related problem in the system development [122641].
(b) The software failure incident related to the operation phase:
- The outage affected delivery operations, with Amazon's warehouse operations using AWS experiencing disruptions, leading to delivery delays and standstills for Amazon drivers [121987].
- The outage impacted delivery service partners, leaving vans idle and disrupting communication with delivery drivers, highlighting operational challenges during the incident [122437].
Boundary (Internal/External) within_system, outside_system
(a) within_system: The software failure incident related to the Amazon outage was primarily due to issues within the system. Amazon reported that the outage was likely caused by problems related to the application programming interface (API) [121987, 122437, 122641]. Additionally, the outage was linked to network device issues within the US-East-1 Region [121987, 122437]. Amazon mentioned that they had identified the root cause of the problem within their system and were actively working towards recovery [122437]. The outage affected various Amazon services, including Prime Video, Amazon Music, Ring, Alexa, and more, all of which are part of Amazon's ecosystem [121987, 122437, 122641].
(b) outside_system: The software failure incident also had contributing factors originating from outside the system. For example, the outage impacted a wide variety of service providers worldwide, indicating external dependencies on Amazon Web Services (AWS) [122437]. The outage disrupted streaming platforms like Netflix and Disney+, as well as apps like Robinhood, which rely on AWS for their services [122641]. Additionally, the outage affected delivery operations, indicating external dependencies on AWS for Amazon's warehouse operations [121987].
Nature (Human/Non-human) non-human_actions
(a) The software failure incident occurring due to non-human actions:
- The software failure incident affecting Amazon's cloud services, streaming platforms like Netflix and Disney+, Robinhood, and various apps was related to network devices and linked to application programming interface (API) issues [122641].
- Amazon experienced an outage likely due to problems related to API, which is a set of protocols for building and integrating application software [121987].
- The outage was probably due to issues related to API in the US-East-1 Region [121987].
(b) The software failure incident occurring due to human actions:
- There is no specific mention of the software failure incident being caused by human actions in the provided articles.
Dimension (Hardware/Software) hardware, software
(a) The software failure incident occurring due to hardware:
- The outage at Amazon's cloud services on Tuesday was related to network devices, indicating a hardware issue [Article 122641].
- Amazon experienced API and console issues in the US-East-1 Region, which were likely due to problems related to hardware components [Article 121987].
(b) The software failure incident occurring due to software:
- The outage at Amazon's cloud services was linked to application programming interface (API), which is a set of protocols for building and integrating application software [Article 122641].
- Amazon reported API and console issues in the US-East-1 Region, suggesting software-related problems [Article 121987].
Objective (Malicious/Non-malicious) non-malicious (a) The software failure incident reported in the articles appears to be non-malicious. The incidents were related to network devices and issues with application programming interfaces (APIs) [121987, 122437, 122641]. There is no indication in the articles that the failures were caused by malicious intent or actions aimed at harming the system. Instead, the incidents seem to be technical in nature, affecting various services and platforms that rely on Amazon Web Services (AWS) due to issues with APIs and network devices.
Intent (Poor/Accidental Decisions) poor_decisions
(a) poor_decisions:
- The software failure incident related to Amazon's cloud services outage was likely due to issues related to application programming interface (API), which is a set of protocols for building and integrating application software [121987, 122437, 122641].
- The outage was linked to network devices and API problems, indicating that decisions related to the API implementation or network configuration may have contributed to the failure [121987, 122641].
- The outage affected various services and platforms, including Amazon's website, Prime Video, Ring security cameras, mobile banking app Chime, robot vacuum cleaner maker iRobot, Netflix, Disney+, and more, showcasing the widespread impact of the failure [121987, 122437, 122641].
(b) accidental_decisions:
- The articles do not explicitly mention any accidental decisions or unintended mistakes that directly led to the software failure incident.
Capability (Incompetence/Accidental) accidental
(a) The articles do not provide specific information indicating the software failure incident was due to development incompetence.
(b) The software failure incidents reported in the articles were accidental in nature. The incidents were attributed to issues related to network devices and application programming interface (API), which are technical factors rather than intentional actions or incompetence [121987, 122437, 122641].
Duration temporary
(a) The software failure incident was temporary as it caused disruptions for a certain period but was not permanent. The incident affected various Amazon services, including the website, Prime Video, and applications using Amazon Web Services (AWS) [121987, 122437]. Users reported issues with accessing these services, and Amazon acknowledged the problem related to network devices and API protocols. The outage lasted for several hours, impacting delivery operations and various other services that rely on AWS [121987, 122437]. Amazon stated that they were actively working towards recovery but did not provide an estimated time for full restoration [122437].
(b) The software failure incident was not permanent as it was caused by specific circumstances related to network devices and API protocols. The outage affected multiple services and platforms, including Amazon's e-commerce website, streaming platforms like Netflix and Disney+, and apps like Robinhood [122641]. Users experienced disruptions in accessing these services, and Amazon mentioned that many services had already recovered, indicating that the issue was being addressed [122641]. The outage was linked to network devices and API problems, which were being worked on for full recovery across services [122641].
Behaviour crash, omission, value, other
(a) crash:
- The articles report instances of Amazon's services, including its website, Prime Video, and applications using AWS, going down for thousands of users, indicating a system crash [121987, 122437, 122641].
- Users reported issues with Amazon, Prime Video, and other services, suggesting a crash in the system [122641].
(b) omission:
- The outage caused delivery operations to be affected, with Amazon's warehouse operations using AWS experiencing disruptions, leaving facilities at a standstill [121987].
- Three delivery service partners mentioned that an Amazon app used to communicate with delivery drivers was down, leading to idle vans with no communication from the company [122437].
(c) timing:
- The outage occurred during Amazon's critical Christmas shopping season, potentially creating lasting log-jams at a time when there was already a critical crunch on the supply chain [122437].
- The crash happened just 18 days before Christmas, impacting users trying to purchase gifts [122437].
(d) value:
- The outage disrupted streaming platforms like Netflix and Disney+, indicating a failure in providing the correct service [122641].
(e) byzantine:
- There is no specific mention of the system behaving erroneously with inconsistent responses and interactions in the articles.
(f) other:
- The outage was linked to issues related to application programming interface (API), which is a set of protocols for building and integrating application software [121987, 122437, 122641].
- The outage was related to network devices, further contributing to the failure incident [122641].

IoT System Layer

Layer Option Rationale
Perception processing_unit, network_communication
(a) sensor: Failure due to contributing factors introduced by sensor error:
- The articles do not mention any specific sensor errors contributing to the software failure incidents. Therefore, it is unknown whether the failures were related to sensor errors.
(b) actuator: Failure due to contributing factors introduced by actuator error:
- The articles do not mention any specific actuator errors contributing to the software failure incidents. Therefore, it is unknown whether the failures were related to actuator errors.
(c) processing_unit: Failure due to contributing factors introduced by processing error:
- The software failure incidents reported in the articles were related to issues with Amazon Web Services (AWS) and its associated services. The failures were attributed to problems related to the application programming interface (API) and network devices, indicating processing errors at the processing unit level [121987, 122437, 122641].
(d) network_communication: Failure due to contributing factors introduced by network communication error:
- The articles highlight that the software failures were linked to network devices and issues related to network communication, indicating that the failures were indeed related to network communication errors [121987, 122437, 122641].
(e) embedded_software: Failure due to contributing factors introduced by embedded software error:
- The articles do not mention any specific embedded software errors contributing to the software failure incidents. Therefore, it is unknown whether the failures were related to embedded software errors.
Communication connectivity_level The software failure incident reported in the news articles was related to the connectivity level of the cyber physical system that failed. The incident was attributed to issues related to network devices and the application programming interface (API), which is a set of protocols for building and integrating application software [121987, 122437, 122641]. This indicates that the failure was due to contributing factors introduced by the network layer rather than the physical layer of the system.
Application TRUE The software failure incidents reported in the provided articles were related to issues with the application layer of the cyber physical system. This was indicated by problems related to the application programming interface (API), which is a set of protocols for building and integrating application software [121987, 122437, 122641]. The failures were linked to network devices and API issues, which are typical of application layer failures caused by bugs, errors, and incorrect usage within the software system.

Other Details

Category Option Rationale
Consequence delay, non-human
(a) death: People lost their lives due to the software failure
- No information about any deaths caused by the software failure incident was mentioned in the articles [121987, 122437, 122641].
(b) harm: People were physically harmed due to the software failure
- No information about physical harm to individuals due to the software failure incident was provided in the articles [121987, 122437, 122641].
(c) basic: People's access to food or shelter was impacted because of the software failure
- The software failure incident did not directly impact people's access to food or shelter [121987, 122437, 122641].
(d) property: People's material goods, money, or data was impacted due to the software failure
- The software failure incident affected various services and platforms, disrupting operations and causing inconvenience to users, but there was no specific mention of significant property loss or financial impact on individuals [121987, 122437, 122641].
(e) delay: People had to postpone an activity due to the software failure
- The software failure incident led to delays in various services, including delivery operations, online shopping, and streaming services, impacting users' ability to carry out activities as usual [121987, 122437, 122641].
(f) non-human: Non-human entities were impacted due to the software failure
- Various non-human entities were affected by the software failure incident, including Amazon's warehouse operations, Ring security cameras, mobile banking app Chime, robot vacuum cleaner maker iRobot, and other services relying on Amazon Web Services (AWS) [121987, 122437, 122641].
(g) no_consequence: There were no real observed consequences of the software failure
- The software failure incident had real observed consequences, including service disruptions, delays, and operational issues for various platforms and services [121987, 122437, 122641].
(h) theoretical_consequence: There were potential consequences discussed of the software failure that did not occur
- There were no specific potential consequences discussed in the articles that did not occur as a result of the software failure incident [121987, 122437, 122641].
(i) other: Was there consequence(s) of the software failure not described in the (a to h) options? What is the other consequence(s)?
- No other specific consequences beyond the options provided were mentioned in the articles [121987, 122437, 122641].
Domain information, transportation, sales, finance, entertainment, other
(a) The software failure incident affected the entertainment industry as it disrupted services like Amazon Prime Video, Netflix, Disney+, and IMDb, impacting users' ability to stream movies and TV shows [121987, 122437, 122641].
(b) The transportation industry was impacted as the outage affected Amazon's delivery operations, leaving vans idle and disrupting the communication with delivery drivers, potentially causing log-jams in the supply chain during the critical Christmas shopping season [122437].
(d) The sales industry was affected as the outage disrupted Amazon's e-commerce website, impacting users' ability to make purchases online ahead of Christmas [122641].
(h) The finance industry was also impacted as the outage affected services like the mobile banking app Chime, potentially disrupting financial transactions for users [121987, 122437, 122641].
(m) In addition to the industries mentioned above, the software failure incident also affected other industries such as technology (iRobot), media (IMDb), trading (Robinhood), and content delivery (Netflix), showcasing the widespread impact of the outage across various sectors [121987, 122437, 122641].
