Incident: Amazon Web Services Outage Impacts Customers and Amazon Operations

Published Date: 2021-12-07

Postmortem Analysis
Timeline 1. The software failure incident happened on December 7, 2021 [122029].
System 1. Amazon Web Services (AWS) in its eastern U.S. operations [122029]
Responsible Organization 1. Amazon's cloud computing unit, Amazon Web Services (AWS) [122029]
Impacted Organization
1. Companies and customers using Amazon Web Services [122029]
2. Smartsheet, a provider of collaboration software [122029]
3. Asana, a provider of project management services [122029]
4. Amazon's vast warehouse operations [122029]
5. Amazon's Ring home security business [122029]
Software Causes 1. The article does not identify a specific software fault; it attributes the outage to an impairment of several network devices that affected AWS data centers in the eastern United States [122029].
Non-software Causes 1. Impairment of several network devices [122029]
Impacts
1. Companies and customers were unable to use web services provided by Amazon's cloud computing unit [122029].
2. Problems in AWS data centers in the eastern United States disrupted customer services, including Smartsheet's collaboration software and Asana's project management services [122029].
3. Amazon's own warehouse operations, which rely on AWS, had their computer systems disrupted [122029].
4. Amazon's Ring home security business had problems with its app and live camera views, degrading the customer experience [122029].
Preventions
1. Implementing robust network redundancy and failover mechanisms so that an impairment of a few network devices cannot cause a widespread outage [122029].
2. Thoroughly testing and validating capacity additions and system changes; the article notes that an earlier AWS outage was triggered by a relatively small capacity addition that exposed an operating system configuration error and overwhelmed the network of servers [122029].
3. Hardening monitoring and incident response technology so that it stays available during an outage and status updates reach customers promptly [122029].
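The first prevention above can be illustrated with a minimal client-side failover sketch. This is not how AWS or any affected company implements redundancy; the function call_with_failover, the exception AllEndpointsFailed, and the two stand-in region functions are hypothetical names introduced only for illustration, and a ConnectionError is assumed to represent an impaired network path.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class AllEndpointsFailed(Exception):
    """Raised when every configured endpoint has failed (hypothetical name)."""


def call_with_failover(endpoints: Sequence[Callable[[], T]]) -> T:
    """Try each endpoint in order and return the first successful result."""
    errors = []
    for endpoint in endpoints:
        try:
            return endpoint()
        except ConnectionError as exc:
            # Assumption: a ConnectionError means this path is impaired,
            # so record it and move on to the next endpoint.
            errors.append(exc)
    raise AllEndpointsFailed(f"{len(errors)} endpoint(s) failed")


def impaired_region() -> str:
    # Hypothetical stand-in for a region behind impaired network devices.
    raise ConnectionError("primary region unreachable")


def healthy_region() -> str:
    # Hypothetical stand-in for a standby region that still responds.
    return "served by standby region"


result = call_with_failover([impaired_region, healthy_region])
```

In this sketch the request is answered by the standby endpoint once the primary raises; if every endpoint fails, the caller receives a single AllEndpointsFailed error instead of hanging. Real deployments would also need health checks, timeouts, and data replication between regions, none of which are shown here.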
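Fixes 1. The article does not describe a specific fix; by late afternoon some issues had been resolved and the company was still "working towards full recovery across services" [122029]. Better network device redundancy and failover would guard against a recurrence of the impairment that caused the outage [122029].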
References
1. Amazon Web Services health dashboard [122029]
2. Spokesperson Richard Rocha [122029]
3. Warehouse worker (anonymous) [122029]
4. Ring spokeswoman Emma Daniels [122029]
5. Amazon spokeswoman Kristin Brown [122029]

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization (a) one_organization: The article mentions that a year earlier AWS experienced a major outage that took down large swaths of the Web, including Ring, iRobot, and The Washington Post [122029]. AWS has therefore had a similar incident before within the same organization. (b) multiple_organization: The article does not describe similar incidents at other organizations, so it is unknown whether they have occurred elsewhere.
Phase (Design/Operation) design (a) The incident is attributed to the design phase. It was caused by an impairment of several network devices in Amazon's cloud-computing technology, which took Internet-connected services offline and caused an outage in AWS data centers in the eastern United States [122029]. The article does not say how the devices became impaired, so the attribution to a design flaw introduced during development or a system update is an inference rather than a stated finding.
Boundary (Internal/External) within_system (a) within_system: The software failure incident reported in the article was primarily within the system. The outage was attributed to technical problems within Amazon's cloud-computing technology in its eastern U.S. operations, specifically an impairment of several network devices within the AWS data centers [122029]. The issues extended to monitoring and incident response technology, causing delays in providing updates and affecting various services provided by AWS customers, including collaboration software and project management services. Additionally, Amazon's own Ring home security business experienced problems with its app and camera connections, all related to the AWS outage [122029].
Nature (Human/Non-human) non-human_actions (a) The incident was primarily due to non-human actions. The root cause was identified as an impairment of several network devices in the AWS data centers in the eastern United States [122029]. The impairment also affected monitoring and incident response technology, which delayed status updates, and it disrupted services offered by AWS customers such as Smartsheet and Asana [122029]. (b) Human actions were not mentioned as contributing factors. The outage was attributed to technical problems and impaired network devices within AWS data centers [122029].
Dimension (Hardware/Software) hardware (a) hardware: The article states that the outage was caused by an impairment of several network devices, indicating a hardware-related issue [122029]. (b) software: The article does not mention any software-related contributing factors to the outage.
Objective (Malicious/Non-malicious) non-malicious (a) The software failure incident reported in Article 122029 was non-malicious. The outage experienced by Amazon's cloud computing unit was due to technical problems in its eastern U.S. operations, specifically an impairment of several network devices. The root cause of the issue was not attributed to any malicious intent but rather to technical issues within the system [122029].
Intent (Poor/Accidental Decisions) accidental_decisions (a) poor_decisions: The article does not attribute the December 7 outage in the eastern U.S. to poor decisions. It describes technical problems, specifically an impairment of several network devices, that affected AWS data centers, delayed status updates, and disrupted services of AWS customers such as Smartsheet and Asana [122029]. (b) accidental_decisions: The incident is more consistent with an accidental fault or mistake, since the article points to technical issues and impaired network devices and does not highlight any specific poor decision as the root cause [122029].
Capability (Incompetence/Accidental) development_incompetence (a) development_incompetence: The evidence for this option comes from the earlier outage, not the December 7 incident. The article reports that a year earlier a relatively small addition of capacity triggered a series of errors because of an operating system configuration issue, which overwhelmed Amazon's network of servers [122029]. (b) accidental: The article does not provide information indicating that the December 7 incident was accidental.
Duration temporary (a) The software failure incident was temporary. The outage affected Amazon's cloud computing services in its eastern U.S. operations, causing significant technical problems and taking chunks of Internet-connected services offline. By late afternoon some issues had been resolved, and the company was still "working towards full recovery across services" [122029].
Behaviour crash, other (a) crash: The incident can be categorized as a crash. AWS suffered significant technical problems in its eastern U.S. operations, which took chunks of Internet-connected services offline [122029]. (b) omission: The article does not describe the system omitting to perform its intended functions at particular instances. (c) timing: The article does not describe the system performing its intended functions correctly but too late or too early. (d) value: The article does not describe the system performing its intended functions incorrectly. (e) byzantine: The article does not describe erroneous behaviour with inconsistent responses and interactions. (f) other: No distinct additional behaviour is described; the observed behaviour matches the crash category, with services unable to perform their intended functions.

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence property, delay, non-human The consequences of the software failure incident were primarily related to property and delay: it affected people's material goods, money, and data, and it delayed various services and operations. 1. Property: The failure disrupted Amazon's vast warehouse operations, which also use AWS; computer systems were disrupted at one Midwest warehouse [122029]. 2. Delay: Services from companies such as Smartsheet and Asana were unavailable during the AWS outage, delaying collaboration and project management work [122029].
Domain information, finance, other (a) The software failure incident affected the information industry as it disrupted various online services and platforms used for collaboration, project management, and home security, which rely on Amazon Web Services (AWS) for their operations [122029]. (h) Additionally, the incident impacted the finance industry indirectly as disruptions in AWS services could have affected financial institutions or services relying on AWS for their infrastructure [122029]. (m) The software failure incident could also be related to other industries not explicitly mentioned in the options, such as e-commerce, technology, and cloud computing, given the widespread impact on various sectors relying on AWS for their digital operations [122029].
