Incident: Twitter Outage: Global Service Disruption Due to Capacity Issues

Published Date: 2016-01-19

Postmortem Analysis
Timeline 1. The software failure incident occurred on a Tuesday morning, as reported in Article 39579. 2. The article was published on 2016-01-19, the day of the outage. 3. The incident therefore occurred on January 19, 2016 [39579].
System 1. Twitter's API (application programming interface) [39579] 2. Twitter's image handling service [39579] 3. Twitter's home timelines service [39579]
Responsible Organization 1. Twitter's own infrastructure and architecture limitations [39579]
Impacted Organization 1. Users worldwide were impacted by the Twitter outage [39579].
Software Causes 1. The software causes of the Twitter outage included internal errors in the service and its application programming interface (API), which led to service disruptions and performance issues [39579].
Non-software Causes 1. Over capacity issues leading to network failure during major events [39579] 2. Architecture limitations preventing easy expansion of server capacity [39579]
Impacts 1. Twitter experienced a total outage followed by serious access problems lasting over an hour, affecting users worldwide [39579]. 2. Access to the service failed over the web, mobile, and API, with error messages indicating the network was "over capacity" and suffering an "internal error" [39579]. 3. The majority of the service returned to normality by 10:00 am GMT, but the company's image handling service and home timelines still suffered issues [39579]. 4. Twitter's own status board confirmed the outage, with four of the five public APIs down, causing a "service disruption" [39579]. 5. Users were sporadically able to access the service, but the site's status fluctuated, and several services remained down more than two hours after the outage began [39579]. 6. The company communicated about the outage through a tweet from its @support account, which was later emailed to the Guardian due to Twitter being down [39579]. 7. Twitter's architecture limitations prevented easy expansion of capacity by adding servers, leading to frequent collapses under user load during major events [39579].
Preventions 1. Implementing a more scalable architecture that allows for easy expansion of capacity during peak usage times could have prevented the software failure incident [39579]. 2. Conducting regular load testing and capacity planning to ensure the system can handle spikes in traffic without crashing could have helped prevent the outage (a minimal load-test sketch follows this list) [39579]. 3. Improving the error handling mechanisms within the software to provide more informative and actionable error messages to users could have helped mitigate the impact of the failure incident [39579].
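
As an illustration of item 2, the following is a minimal load-test sketch in Python. The endpoint, concurrency level, and request count are hypothetical placeholders, not details from the incident; a real exercise would target a staging replica of the service at far higher volumes and track latency as well as error rate.

```python
import concurrent.futures
import urllib.error
import urllib.request

# All values below are hypothetical placeholders for illustration only.
TARGET_URL = "https://staging.example.com/health"  # assumed staging endpoint
CONCURRENCY = 20       # simultaneous workers
TOTAL_REQUESTS = 200   # total requests to issue

def hit(url: str) -> int:
    """Issue one GET and return the HTTP status code (0 for transport failure)."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # e.g. 503 if the service reports it is over capacity
    except OSError:      # URLError, timeouts, connection resets
        return 0

def main() -> None:
    # Fan the requests out across a thread pool to simulate concurrent users.
    with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        codes = list(pool.map(hit, [TARGET_URL] * TOTAL_REQUESTS))
    failures = sum(1 for code in codes if code == 0 or code >= 500)
    print(f"{TOTAL_REQUESTS} requests, {failures} failed "
          f"({failures / TOTAL_REQUESTS:.1%} error rate)")

if __name__ == "__main__":
    main()
```

Running this repeatedly while raising CONCURRENCY gives a rough picture of the load at which the error rate starts to climb, which is the kind of signal capacity planning needs before a major event rather than during one.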
Fixes 1. Implementing a more scalable architecture to handle increased user load during peak times could help prevent similar outages in the future [39579]. 2. Conducting thorough load testing and capacity planning to ensure the system can handle spikes in traffic without crashing [39579]. 3. Improving the error handling mechanisms to provide more informative and user-friendly error messages to users when issues occur (a handler sketch follows this list) [39579].
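
To illustrate item 3, here is a minimal sketch, assuming a plain Python http.server backend, of an endpoint that sheds load with a structured, actionable error rather than a bare failure. MAX_IN_FLIGHT and the 30-second retry hint are invented for the example.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

MAX_IN_FLIGHT = 100   # hypothetical capacity ceiling for the example
_in_flight = 0        # naive counter standing in for real load tracking
_lock = threading.Lock()

class CapacityAwareHandler(BaseHTTPRequestHandler):
    """Returns a structured, actionable error instead of a bare failure."""

    def do_GET(self):
        global _in_flight
        with _lock:
            over = _in_flight >= MAX_IN_FLIGHT
            if not over:
                _in_flight += 1
        if over:
            # Tell the client what happened and when to retry.
            payload = {
                "error": "over_capacity",
                "message": "The service is temporarily over capacity; "
                           "please retry after the indicated delay.",
                "retry_after_seconds": 30,
            }
            self._send(503, payload, extra={"Retry-After": "30"})
            return
        try:
            self._send(200, {"status": "ok"})
        finally:
            with _lock:
                _in_flight -= 1

    def _send(self, code, payload, extra=None):
        body = json.dumps(payload).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        for name, value in (extra or {}).items():
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    ThreadingHTTPServer(("127.0.0.1", 8080), CapacityAwareHandler).serve_forever()
```

The design point is that a client receiving this response knows what went wrong ("over_capacity") and what to do next (retry after 30 seconds), which is more informative than the generic "over capacity" and "internal error" messages users saw during the outage [39579].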
References 1. Twitter's own status board [39579] 2. Twitter's developer-facing monitoring [39579] 3. Twitter's @support account [39579]

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization (a) Similar failures have happened before within the same organization: the article notes that in the early days of the service, Twitter outages were common enough that the company’s “over capacity” error message gained a nickname, the fail whale [39579]. (b) The article does not mention similar incidents at other organizations, so there is no information to suggest that this failure has occurred at multiple organizations.
Phase (Design/Operation) design, operation (a) The design contribution: Twitter's architecture at the time prevented easy expansion of capacity by adding servers to its back end, so the service frequently collapsed under the weight of its users during major events. This design limitation contributed to the outage experienced by users [39579]. (b) The operational contribution: access to Twitter failed over the web, mobile, and its API, with error messages indicating the network was over capacity and suffering internal errors, and the company's developer-facing monitoring confirmed that several public APIs were down, indicating operational issues affecting the service [39579].
Boundary (Internal/External) within_system (a) The software failure incident with Twitter experiencing a total outage and serious access problems was primarily within the system. The article mentions that Twitter's own status board confirmed the outage, and the company's developer-facing monitoring indicated that four of the five public APIs were down, suffering a "service disruption" [39579]. This indicates that the failure originated from within the system itself, affecting various components like APIs and services provided by Twitter.
Nature (Human/Non-human) non-human_actions (a) The software failure incident occurring due to non-human actions: - The article describes how Twitter experienced a total outage followed by serious access problems, with error messages indicating the network was "over capacity" and suffering an "internal error" [39579]. - The company's own status board confirmed the outage, and the developer-facing monitoring indicated that four of the five public APIs were down, suffering a "service disruption" [39579]. - The service's architecture was mentioned as a factor contributing to the failure, as it prevented easy expansion of capacity by simply adding servers to the back end, leading to collapses under the weight of users during major events [39579]. (b) The software failure incident occurring due to human actions: - The article does not provide specific information indicating that the software failure incident was directly caused by human actions.
Dimension (Hardware/Software) software (a) The software failure incident related to hardware: The article does not mention any specific hardware-related issues contributing to the Twitter outage. It primarily focuses on the service's architecture and capacity challenges that led to the outage. (b) The software failure incident related to software: The article highlights that the Twitter outage was primarily caused by issues within the software itself. Users experienced error messages indicating the network was "over capacity" and suffering from "internal error." Additionally, the company's APIs were down, with some upgraded to "performance issues." This indicates that the software components of Twitter were experiencing failures leading to the outage [39579].
Objective (Malicious/Non-malicious) non-malicious (a) The software failure incident reported in Article 39579 was non-malicious. The outage experienced by Twitter was due to internal errors and overcapacity issues within the system, leading to serious access problems for users worldwide. There is no indication in the article that the failure was caused by malicious intent or actions aimed at harming the system. The incident was attributed to technical issues and limitations in the service's architecture that prevented easy expansion of capacity [39579].
Intent (Poor/Accidental Decisions) unknown (a) The outage was not explicitly attributed to poor decisions. The article describes a total outage followed by serious access problems lasting over an hour, with error messages indicating the network was over capacity and suffering internal errors, and cites the company's architecture, which prevented easy expansion of capacity by adding servers to the back end, as a contributing factor [39579]. (b) Nor was the outage explicitly attributed to accidental decisions; the article did not mention specific mistakes or unintended decisions that led to it [39579].
Capability (Incompetence/Accidental) development_incompetence (a) The software failure incident related to development incompetence is evident in the article as it mentions how Twitter's architecture prevented the company from easily expanding capacity by adding servers to its back end, leading to frequent collapses under the weight of its users during major events. This limitation in the architecture points towards a lack of professional competence in designing a scalable system to handle user load efficiently [39579]. (b) The software failure incident related to accidental factors is highlighted in the article when it mentions that the service began failing over the web, mobile, and API with error messages indicating the network was "over capacity" and suffering an "internal error." These issues seem to have occurred unexpectedly, indicating accidental contributing factors leading to the outage [39579].
Duration temporary (a) The software failure incident described in the article was temporary. The article mentions that Twitter experienced a total outage followed by serious access problems lasting over an hour. The access to the service began failing at 8:20 am GMT, and by 10:00 am, the majority of the service had returned to some semblance of normality. However, Twitter continued to sporadically fail throughout the day, indicating that the failure was not permanent but rather temporary [39579].
Behaviour crash (a) crash: The incident was a crash; Twitter suffered a total outage followed by serious access problems lasting over an hour, with the service failing over the web, mobile, and its API [39579]. (b) omission: The article does not describe a failure in which the system omitted to perform its intended functions at particular instances. (c) timing: The article does not describe the system performing its intended functions correctly but too late or too early. (d) value: The article does not describe the system performing its intended functions incorrectly. (e) byzantine: The article does not describe erroneous behaviour with inconsistent responses and interactions. (f) other: Not applicable; the behaviour is best categorized as a crash, with the system losing its state and failing to perform its intended functions for users worldwide for over an hour [39579].

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence delay (a) unknown (b) unknown (c) unknown (d) unknown (e) Users had to delay their activities due to the Twitter outage [39579]. (f) unknown (g) unknown (h) The article mentions that Twitter suffered a total outage followed by serious access problems lasting over an hour, impacting users' ability to access the service [39579]. (i) unknown
Domain information (a) The failed system in this incident was intended to support the information industry. The article mentions that Twitter, the system that experienced the outage, is a platform for sharing information and communication [39579].

Sources

1. Article 39579: news report on the Twitter outage (the Guardian), published 2016-01-19 [39579]