Incident: Amazon Web Services Outage: Data Loss and Service Disruptions.

Published Date: 2011-04-25

Postmortem Analysis
Timeline
1. The first incident occurred in April 2011 [5256].
2. The second incident occurred on a Friday night in June 2012 [12493].
System
1. Amazon's cloud services, Amazon Web Services (AWS) [5256, 12493]
Responsible Organization
1. April 2011: a networking glitch at Amazon Web Services initiated a cascade of problems that led to the outage [5256].
2. June 2012: a lightning storm in Virginia cut power to the Amazon Web Services center, which led to the outage [12493].
Impacted Organization
1. Chartbeat, a company that monitors the online presence of websites including CNNMoney, was affected by the April 2011 incident [5256].
2. Websites such as the New York Times, Foursquare, ProPublica, Reddit, Quora, and Hootsuite were also affected by the April 2011 incident [5256].
3. Well-known sites such as Netflix, Pinterest, and Instagram were affected by the June 2012 incident [12493].
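Software Causes
1. A networking glitch kicked off a cascade of problems in the April 2011 incident [5256].
2. In the June 2012 incident, a lightning storm caused a power failure at the Amazon Web Services center in Northern Virginia; this trigger is itself a non-software cause [12493].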
Non-software Causes
1. A lightning strike in Virginia caused a power failure at the Amazon Web Services center [12493].
2. The backup generator at the data center failed [12493].
Impacts
1. Many commercial websites using Amazon's cloud services experienced outages, including well-known sites like the New York Times, Foursquare, ProPublica, Reddit, Quora, Hootsuite, Netflix, Pinterest, and Instagram [5256, 12493].
2. Historical data was missing for some websites, and some customers' data storage had not yet been restored [5256].
3. Customers and businesses relying on Amazon Web Services faced disruptions in accessing data storage and computation services [12493].
4. The interruption highlighted the risks associated with cloud computing and the potential impact on businesses and consumers [12493].
5. The incident renewed scrutiny of companies' dependence on cloud computing and raised questions about the reliability of cloud services [12493].
6. Some customers, such as small start-ups, reported system failures and had to work to restore service, which affected their operations and potentially their own customers [12493].
Preventions
1. Implementing more robust redundancy features and disaster recovery mechanisms within the cloud infrastructure could have prevented the outage [12493].
2. Conducting regular stress tests and simulations could identify and address vulnerabilities before they lead to a widespread outage [5256, 12493].
3. Establishing clear communication channels and protocols would allow customers to be informed promptly about the nature of the issue, the steps being taken to resolve it, and any potential impact on their data or services [5256, 12493].
4. Diversifying data storage locations across different geographic regions would minimize the impact of localized incidents such as lightning strikes or power failures [12493].
5. Investing in more reliable backup power systems would help ensure continuous operation during power outages at data centers [12493].
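The geographic diversification described in the preventions can be illustrated with a minimal Python sketch. It assumes data is already replicated to more than one region and shows only the read path: try each region in order and fall back when one is unavailable. All names here (read_with_failover, RegionUnavailable, fake_fetch, and the region labels) are hypothetical and are not part of any AWS API or of the approach described in the articles.

```python
from typing import Callable, Dict, Iterable


class RegionUnavailable(Exception):
    """Raised by a hypothetical regional store when it cannot serve requests."""


def read_with_failover(
    key: str,
    regions: Iterable[str],
    fetch: Callable[[str, str], bytes],
) -> bytes:
    """Try each region in order and return the first successful read."""
    errors: Dict[str, Exception] = {}
    for region in regions:
        try:
            return fetch(region, key)
        except RegionUnavailable as exc:
            # Remember the failure and move on to the next replica.
            errors[region] = exc
    raise RegionUnavailable(f"all regions failed for {key!r}: {sorted(errors)}")


# Illustrative in-memory stand-in for replicated storage (assumption, not a real service).
REPLICAS = {
    "us-east": {"report": b"v1"},
    "us-west": {"report": b"v1"},
}
DOWN = {"us-east"}  # simulate a localized outage in one region


def fake_fetch(region: str, key: str) -> bytes:
    """Hypothetical fetch function: fails for regions marked as down."""
    if region in DOWN:
        raise RegionUnavailable(region)
    return REPLICAS[region][key]


# The first region is down, so the read is served from the second one.
result = read_with_failover("report", ["us-east", "us-west"], fake_fetch)
```

The sketch deliberately leaves out replication, consistency, and retry timing; it only shows that a client with a second region to fall back on can keep reading when one region loses power.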
Fixes
1. Implementing more robust redundancy features in the cloud infrastructure to ensure data is backed up and can be quickly restored after failures [12493].
2. Conducting a detailed post mortem analysis to identify the root causes of the incident and implementing preventive measures to avoid similar failures in the future [5256, 12493].
3. Enhancing communication with customers during outages to provide timely updates about the status of their data and services [5256, 12493].
References
1. Amazon's cloud services customers, such as Chartbeat, the New York Times, Foursquare, ProPublica, Reddit, Quora, and Hootsuite [5256]
2. Amazon Web Services customers, including Netflix, Pinterest, Instagram, Intercontinental Hotels, Fox Entertainment, Unilever, Spotify, 187 government agencies, and small start-ups [12493]

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization, multiple_organization (a) one_organization: Amazon experienced this kind of failure twice. In April 2011, a networking glitch caused a cascade of problems that crashed Amazon Web Services and affected more than 70 sites [5256]. In June 2012, a lightning storm caused a power outage at the Amazon Web Services center in Virginia, disrupting services for well-known sites like Netflix, Pinterest, and Instagram [12493]. (b) multiple_organization: The June 2012 incident also highlighted the risks that businesses and consumers face as they increasingly rely on cloud services, and it affected various companies beyond Amazon itself [12493].
Phase (Design/Operation) design (a) The software failure incidents mentioned in the articles were primarily due to issues related to the design and development phases of the systems. In Article 5256, it is mentioned that Amazon's cloud services crashed due to a networking glitch, which kicked off a cascade of problems. Amazon promised to conduct a detailed post mortem to dig deeply into the root causes of the event [5256]. Similarly, in Article 12493, it is highlighted that a lightning storm caused the power to fail at the Amazon Web Services center in Northern Virginia, leading to a disruption in service. The data center's backup generator also failed for reasons Amazon was still unsure of, indicating a design or development flaw in the backup system [12493]. These incidents point towards failures introduced during the design and development phases of the systems.
Boundary (Internal/External) within_system (a) within_system: The software failure incidents reported in the articles were primarily within the system. In both Article 5256 and Article 12493, the failures were attributed to issues within Amazon's cloud computing service, specifically Amazon Web Services (AWS). The incidents were caused by factors such as a networking glitch in Article 5256 and a lightning storm causing power failure and backup generator failure in Article 12493. These internal system issues led to the disruption of services for various websites and companies relying on AWS for data storage and computation [5256, 12493].
Nature (Human/Non-human) non-human_actions (a) The software failure incidents reported in the articles were primarily due to non-human actions. In Article 5256, the failure was attributed to a networking glitch that kicked off a cascade of problems in Amazon's cloud services, affecting over 70 sites [5256]. Similarly, in Article 12493, the failure was caused by a lightning storm that led to a power failure at the Amazon Web Services center in Virginia, impacting well-known sites like Netflix, Pinterest, and Instagram [12493]. (b) There is no specific mention of the software failure incidents being caused by human actions in the articles.
Dimension (Hardware/Software) hardware, unknown (a) The software failure incidents reported in the articles were primarily due to hardware issues. In Article 5256, it is mentioned that a networking glitch caused a cascade of problems in Amazon's cloud services, leading to the outage of more than 70 sites, including popular ones like the New York Times, Foursquare, and Reddit. Additionally, in Article 12493, it is highlighted that a lightning storm in Virginia caused the power to fail at an Amazon Web Services center, leading to the disruption of services for well-known sites like Netflix, Pinterest, and Instagram. The failure of the data center's backup generator further exacerbated the situation, emphasizing the hardware-related nature of the incident [5256, 12493]. (b) While hardware issues were the primary contributing factors to the software failure incidents discussed in the articles, there is no explicit mention of software-related factors causing the failures. The incidents were mainly attributed to hardware failures such as networking glitches and power outages at Amazon's data centers, rather than software bugs or faults [unknown].
Objective (Malicious/Non-malicious) non-malicious (a) The software failure incidents reported in the articles are non-malicious. In Article 5256 and Article 12493, the failures were caused by a networking glitch and a lightning storm, respectively, affecting Amazon's cloud computing service. There is no indication in the articles that these incidents were caused by malicious intent or actions by individuals. Instead, they were unforeseen events that disrupted service for various websites and companies relying on Amazon Web Services [5256, 12493].
Intent (Poor/Accidental Decisions) accidental_decisions (a) The software failure incidents reported in the articles were not explicitly attributed to poor decisions. However, the incidents were caused by factors such as a networking glitch and a lightning storm that led to power failures at Amazon's cloud computing service centers. These incidents were more accidental in nature rather than being directly linked to poor decisions [5256, 12493].
Capability (Incompetence/Accidental) accidental (a) The software failure incidents reported in the articles were not directly attributed to development incompetence. The failures were mainly due to external factors such as a lightning storm causing power failure at Amazon's data center in Virginia [12493]. (b) The software failure incidents in the articles were accidental in nature. For example, the interruption in Amazon's cloud computing service was caused by a lightning storm that led to power failure at the data center, and the backup generator also failed for reasons Amazon was unsure of [12493].
Duration temporary (a) The software failure incidents described in the articles were temporary. In both incidents, Amazon Web Services (AWS) experienced disruptions, caused by a networking glitch in one case and a lightning storm that cut power in the other. These incidents made services unavailable for hours, affecting well-known websites like Netflix, Pinterest, Instagram, the New York Times, Foursquare, ProPublica, Reddit, Quora, and Hootsuite. In both cases, Amazon worked to restore service to affected customers and said it would share more details about the events in the following days [5256, 12493].
Behaviour crash, omission, other (a) crash: - Article 5256 reports that Amazon's cloud services crashed, taking down more than 70 sites, including popular ones like the New York Times, Foursquare, and Reddit. A networking glitch led to a cascade of problems, resulting in the system crash [5256]. (b) omission: - Article 12493 mentions that on Friday night, lightning in Virginia took out part of Amazon's cloud computing service, affecting well-known sites like Netflix, Pinterest, and Instagram. These sites were not accessible for hours, indicating an omission in performing their intended functions [12493]. (c) timing: - There is no specific mention of a timing-related failure in the provided articles. (d) value: - There is no specific mention of a value-related failure in the provided articles. (e) byzantine: - There is no specific mention of a byzantine-related failure in the provided articles. (f) other: - The remaining behavior described in the articles can be categorized as a system failure due to external factors: a lightning storm caused a power failure at the Amazon Web Services center in Northern Virginia, which disrupted service for various websites and services relying on Amazon's cloud infrastructure [12493].

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence property, delay, non-human, theoretical_consequence (a) death: There is no mention of any deaths resulting from the software failure incidents reported in the articles [5256, 12493]. (b) harm: There is no mention of any physical harm to individuals resulting from the software failure incidents reported in the articles [5256, 12493]. (c) basic: There is no mention of people's access to food or shelter being impacted due to the software failure incidents reported in the articles [5256, 12493]. (d) property: The software failure incidents did impact people's material goods, money, or data. For example, the outages made well-known sites like Netflix, Pinterest, Instagram, the New York Times, Foursquare, ProPublica, Reddit, Quora, and Hootsuite inaccessible for hours [5256, 12493], and historical data was missing for some websites [5256]. (e) delay: People did have to postpone activities due to the software failure incidents. For instance, some customers experienced service disruptions and had to wait for the restoration of services [5256, 12493]. (f) non-human: Non-human entities were impacted due to the software failure incidents. Various websites and online services were affected, leading to disruptions in their operations [5256, 12493]. (g) no_consequence: There were observed consequences of the software failure incidents, such as service disruptions and data loss, so the option of 'no_consequence' does not apply [5256, 12493]. (h) theoretical_consequence: The articles discuss potential consequences, such as the impact on businesses and consumers relying on cloud services, their exposure to risk, and disruptions in cloud computing services [12493]. (i) other: There are no other specific consequences mentioned in the articles beyond those covered in the options (a) to (h) [5256, 12493].
Domain information, utilities (a) The software failure incidents affected the information industry, specifically websites and online services that rely on Amazon's cloud computing services, such as the New York Times, Foursquare, ProPublica, Reddit, Quora, Hootsuite, Netflix, Pinterest, Instagram, and Chartbeat [5256, 12493]. (g) Utilities were also involved: a lightning storm and the subsequent backup generator failure cut the power supply at the Amazon Web Services center in Northern Virginia, taking thousands of computer servers offline [12493]. (m) The software failure incidents relate to the cloud computing industry, which could be considered a separate category from the options provided [5256, 12493].
