Incident: Amazon Outage: Power Failure Causes Cloud Software Bugs and Delays

Published Date: 2012-07-03

Postmortem Analysis
Timeline 1. The software failure incident happened on a Friday night [13063]. 2. The article was published on 2012-07-03, which places the incident most likely on Friday night, June 29, 2012.
System 1. Backup generators at one of the data centers failed to provide stable voltage as they were brought into service. 2. Elastic Load Balancers (ELB) software had a bug that caused the service to get overwhelmed. 3. Relational Database Service had a bug that prevented a "small number" of databases from recovering properly. 4. Elastic Block Store services ran into bottlenecks during the restoration process. [13063]
Responsible Organization 1. Amazon was responsible for the incident: the power outage occurred at one of its Northern Virginia data centers, and bugs in its own cloud software prolonged the recovery [13063].
Impacted Organization 1. Netflix 2. Instagram 3. Pinterest [13063]
Software Causes 1. A bug in Amazon's Elastic Load Balancers (ELB) software overwhelmed the service, causing delays in processing requests and impacting internet traffic redirection [13063]. 2. Another bug in Amazon's Relational Database Service prevented a small number of databases from recovering properly after the power outage, requiring manual intervention to restart failover systems [13063].
Non-software Causes 1. A 20-minute power outage at a single Northern Virginia data center, triggered by a large voltage spike on the grid [13063]. 2. Diesel-powered generators at one of the data centers failed to provide stable voltage when brought into service [13063]. 3. Battery backups at the data center started to fail after the power outage [13063]. 4. The abrupt power outage left the data center dark, and even after the backup generators restored power, rebooting the affected servers took time [13063].
Impacts 1. The software failure incident led to a significant outage that affected popular websites like Netflix, Instagram, and Pinterest [13063]. 2. The failure of the backup generators and bugs in Amazon's cloud software caused delays in rebooting affected servers, impacting the availability of services for customers [13063]. 3. The Elastic Load Balancers (ELB) software bug caused the service to get overwhelmed, leading to delays in processing requests and affecting internet traffic redirection for customers [13063]. 4. A bug in Amazon's Relational Database Service prevented a small number of databases from recovering properly after the power outage, requiring manual intervention to restart failover systems [13063]. 5. Bottlenecks were encountered in restoring services like the Elastic Block Store due to the software failure incident, highlighting challenges in recovering from power failures in cloud environments [13063].
Preventions 1. Regularly testing and ensuring the reliability of backup power systems, such as generators, to prevent failures during power outages [13063]. 2. Conducting thorough testing and quality assurance of cloud software, like Elastic Load Balancers (ELB) and Relational Database Service, to identify and address bugs before they impact services during critical situations [13063]. 3. Implementing robust failover systems and procedures to ensure quick recovery and minimal downtime for services affected by power outages or other incidents [13063].
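The third prevention, building robust failover into the systems that depend on a single data center, can be illustrated with a minimal sketch. This is not Amazon's or its customers' actual code; the endpoint names, health path, and timeout are hypothetical, and the point is only that a client able to retry against a redundant endpoint in another availability zone can keep working when one data center goes dark.

    import urllib.request
    import urllib.error

    # Hypothetical redundant endpoints in two availability zones.
    ENDPOINTS = [
        "https://service-us-east-1a.example.com",  # primary zone
        "https://service-us-east-1b.example.com",  # standby zone
    ]

    def fetch_with_failover(path, timeout=2.0):
        """Try each endpoint in order and return the first successful response body."""
        last_error = None
        for base in ENDPOINTS:
            try:
                with urllib.request.urlopen(base + path, timeout=timeout) as resp:
                    if resp.status == 200:
                        return resp.read()
            except (urllib.error.URLError, OSError) as exc:
                last_error = exc  # endpoint unreachable or too slow; try the next one
        raise RuntimeError("all endpoints failed: %s" % last_error)

    # Usage: fetch_with_failover("/status") falls back to the standby zone
    # automatically if the primary zone's data center is offline.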
Fixes 1. Repair and retest the failing generators, and replace them if necessary [13063]. 2. Identify and fix bugs in the Elastic Load Balancers (ELB) software to prevent service overload during outages [13063]. 3. Address bugs in the Relational Database Service to ensure proper recovery from power outages [13063]. 4. Resolve bottlenecks in restoring services like the Elastic Block Store to improve recovery efficiency [13063].
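Fix 2, preventing a service from being overwhelmed during an outage, corresponds to a well-known load-shedding pattern: bound the request backlog and reject excess work quickly instead of letting every request queue up and time out. The sketch below only illustrates that pattern under assumed names and limits; it is not a description of Amazon's actual ELB fix.

    import queue
    import threading

    work_queue = queue.Queue(maxsize=100)  # hypothetical backlog limit

    def handle(request):
        pass  # placeholder for the real request-processing logic

    def worker():
        # Drain the bounded queue; anything beyond the limit was already rejected.
        while True:
            request = work_queue.get()
            handle(request)
            work_queue.task_done()

    def submit(request):
        """Accept a request if there is capacity, otherwise fail fast."""
        try:
            work_queue.put_nowait(request)
            return "accepted"
        except queue.Full:
            return "rejected: server busy"  # quick rejection instead of a long stall

    threading.Thread(target=worker, daemon=True).start()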
References 1. Amazon's detailed explanation [13063]

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization The software failure incident at Amazon, as reported in Article 13063, involved issues with its cloud software, specifically bugs in its Elastic Load Balancers (ELB) software and Relational Database Service. These bugs caused service disruptions during the power outage. The incident highlights the challenges Amazon faced in keeping its services running during unexpected events such as power outages. The article does not mention similar incidents at other organizations or involving their products and services, so whether the failure has recurred across multiple organizations is unknown.
Phase (Design/Operation) design (a) The software failure incident in the article was primarily related to the design phase. Amazon experienced a major outage affecting popular websites like Netflix, Instagram, and Pinterest due to a 20-minute power outage at a data center in Northern Virginia. The incident was exacerbated by bugs in their cloud software, specifically in the Elastic Load Balancers (ELB) software and the Relational Database Service, which caused issues in redirecting internet traffic and recovering databases properly after the power outage. These bugs were contributing factors introduced during the system development phase [13063]. (b) There is no specific information in the article indicating that the software failure incident was related to the operation phase or misuse of the system.
Boundary (Internal/External) within_system (a) The software failure incident described in the article was primarily within the system. Amazon's outage was caused by a series of internal issues such as the failure of backup generators, bugs in their cloud software (Elastic Load Balancers and Relational Database Service), and bottlenecks in restoring services like Elastic Block Store [13063]. These internal factors contributed to the outage affecting popular websites like Netflix, Instagram, and Pinterest.
Nature (Human/Non-human) non-human_actions (a) The software failure incident was primarily caused by non-human actions, specifically a 20-minute power outage at a single Northern Virginia data center. The failure started with a large voltage spike on the grid, which led to problems with the backup power generators at the data center: the generators failed to provide stable voltage when brought into service, and subsequent power fluctuations and battery backup failures followed. Additionally, bugs in Amazon's cloud software, such as the Elastic Load Balancers (ELB) software and the Relational Database Service, exacerbated the situation by causing service disruptions and database recovery issues [13063]. (b) While the article does not specifically mention any human actions contributing to the software failure incident, it does highlight Amazon's response and its efforts to improve its services and processes following the outage. The company said it would spend time improving its understanding of the event and making further changes to enhance its services, and that it would repair, retest, and potentially replace the failing generators, which had been tested just six weeks prior to the incident [13063].
Dimension (Hardware/Software) hardware, software (a) The software failure incident occurred due to hardware issues, specifically a 20-minute power outage at a single Northern Virginia data center. The generators at the data center failed to provide stable voltage when brought into service, leading to a cascading failure of systems [13063]. (b) The software failure incident also had contributing factors originating in software. Bugs in Amazon's Elastic Load Balancers (ELB) software caused the service to get overwhelmed, leading to delays in processing requests. Additionally, a bug in Amazon's Relational Database Service prevented some databases from recovering properly after the power outage. These software bugs compounded the impact of the hardware failure [13063].
Objective (Malicious/Non-malicious) non-malicious (a) The software failure incident described in the article was non-malicious. The incident was primarily caused by a 20-minute power outage at a single Northern Virginia data center, which led to a chain of events affecting various services and systems within Amazon's infrastructure. The failure was attributed to issues with backup power generators, battery backups, and bugs in Amazon's cloud software, such as Elastic Load Balancers (ELB) and Relational Database Service. There is no indication in the article that the failure was a result of malicious intent [13063].
Intent (Poor/Accidental Decisions) poor_decisions, accidental_decisions The software failure incident at Amazon's data center involved a combination of poor decisions and accidental factors: (a) poor_decisions: The incident exposed poor decisions in the design and implementation of Amazon's cloud software; for example, a bug in the Elastic Load Balancers (ELB) software caused the service to become overwhelmed, affecting customers who needed it to redirect internet traffic [13063]. (b) accidental_decisions: The failure also involved unintended events rather than deliberate choices, such as the generators failing to provide stable voltage when brought into service and the data center's battery backups starting to fail after the power outage [13063].
Capability (Incompetence/Accidental) development_incompetence, accidental (a) The software failure incident related to development incompetence is evident in the article as Amazon faced issues with their cloud software and services during the outage. Bugs in their Elastic Load Balancers (ELB) software caused the service to get overwhelmed, leading to delays in processing requests. Additionally, a bug in Amazon's Relational Database Service prevented some databases from recovering properly. These issues highlight the impact of software bugs on the overall system performance and customer experience [13063]. (b) The accidental nature of the software failure incident is also apparent in the article. The outage was triggered by a 20-minute power outage at a single data center in Northern Virginia, caused by a large voltage spike on the grid. The failure of the backup generators to provide stable voltage, along with subsequent power fluctuations, led to the data center going dark. These unplanned events resulted in a cascading failure affecting various services and systems, emphasizing the accidental nature of the incident [13063].
Duration temporary The software failure incident described in the article was temporary. It was triggered by a 20-minute power outage at a single Northern Virginia data center, which led to a series of problems with backup power, generators, battery backups, and software bugs. The incident lasted about three hours in total, during which Amazon technicians had to reboot affected servers and address various issues in the company's cloud software [13063].
Behaviour crash, omission, value, other (a) crash: The software failure incident involved a crash, as the data center's battery backups started to fail and the data center went dark, taking the systems it hosted offline [13063]. (b) omission: The incident also involved omission, as a bug in Amazon's Elastic Load Balancers (ELB) software caused the service to become overwhelmed, leading to requests taking a very long time to complete [13063]. (c) timing: The article does not specifically describe timing-related failures. (d) value: The incident involved a value-related failure, as a bug in Amazon's Relational Database Service kept a "small number" of databases from recovering properly from the power outage [13063]. (e) byzantine: The incident did not exhibit byzantine behaviour. (f) other: The incident also involved systems losing state and failing to perform their intended functions during recovery [13063].

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence property, delay, non-human (d) property: People's material goods, money, or data were impacted by the software failure. The incident at Amazon's data center caused popular websites like Netflix, Instagram, and Pinterest to go down, cutting off users' access to those services [13063]. The failure of Amazon's Elastic Load Balancers (ELB) software overwhelmed the service and delayed the processing of requests for customers who relied on it to redirect internet traffic, and a bug in Amazon's Relational Database Service prevented a "small number" of databases from recovering properly from the power outage, requiring manual intervention to restart failover systems [13063]. (e) delay: The outage also introduced delays: requests took a very long time to complete, and technicians needed roughly three hours to reboot the affected servers [13063]. (f) non-human: The direct consequences fell on software systems and online services rather than on people's physical safety [13063].
Domain information, utilities (a) The software failure incident affected the information industry, as it disrupted popular websites like Netflix, Instagram, and Pinterest [13063]. (g) The incident was also related to the utilities industry, as it was triggered by a power outage at a data center in Northern Virginia [13063]. The remaining categories, (b)-(f) and (h)-(m), are not mentioned in the article.
