Incident: Facebook Outage Caused by Configuration Change Impacting Multiple Services

Published Date: 2015-01-27

Postmortem Analysis
Timeline 1. The software failure incident described in the article occurred on January 27, 2015 [Article 32754].
System 1. Facebook's configuration systems [32754] 2. Facebook's server [32754] 3. Facebook's database cluster [32754]
Responsible Organization 1. Facebook engineers were responsible for causing the software failure incidents described in the article [32754]. 2. The incidents were caused by server errors, network maintenance issues, and configuration system changes introduced by Facebook [32754].
Impacted Organization 1. Users of Facebook and Instagram, as well as other services like Tinder, AOL messenger, and Hipchat that rely on Facebook for logins were impacted by the software failure incident [32754]. 2. Users of Skype were also affected by the outage incident [32754]. 3. Brands like Nestle's KitKat took advantage of the outage to make jokes and poke fun at Facebook's expense [32754].
Software Causes 1. The outage on 27 January 2015 was caused by Facebook attempting to change something within its systems which went wrong, not a cyber attack as widely reported [32754]. 2. The outage on 1 August 2014 was caused by another server error [32754]. 3. The outage on 19 June 2014 was due to an issue that prevented people from posting to Facebook for a brief period of time, with the cause not elaborated upon by Facebook [32754]. 4. The issue on 21 October 2013 was a "read-only error" that prevented users from posting status updates for more than four hours [32754]. 5. The disruption in 2010 was due to a fiendishly complex networking problem caused by Facebook's engineers, which was resolved by turning the site off and then on again [32754]. 6. In 2007, Facebook was purposefully taken offline by its engineers to fix a bug identified earlier that day [32754].
Non-software Causes 1. The outage on 27 January 2015 was caused by Facebook attempting to change something within its systems which went wrong, not a cyber attack as widely reported [32754]. 2. The outage on 1 August 2014 was caused by another server error [32754]. 3. The outage on 21 October 2013 was a "read-only error" caused by Facebook's engineers during network maintenance [32754].
Impacts 1. The 27 January 2015 incident caused a 50-minute outage on Facebook and Instagram, impacting other services like Tinder, AOL messenger, and Hipchat that rely on Facebook for logins [32754]. 2. Another outage, on 1 August 2014, lasted one hour and 40 minutes and affected Facebook's site, apps, and services using its login system, including Skype [32754]. 3. A 31-minute outage in June 2014 led users to seek refuge on other social networks like Twitter and Google+ as they were unable to post on Facebook [32754]. 4. In October 2013, a "read-only error" caused by Facebook's engineers prevented users from posting status updates for more than four hours, affecting at least 3,587 other sites [32754]. 5. In 2010, a two-hour disruption caused by a networking problem introduced by Facebook's engineers was resolved by turning the site off and then on again [32754].
Preventions 1. Implementing a more thorough testing process before making changes to the system could have prevented the software failure incident [32754]. 2. Having better monitoring and alert systems in place to quickly identify and address any issues that arise during system changes could have prevented the software failure incident [32754]. 3. Conducting a detailed analysis of potential risks and impacts before implementing changes to the system could have prevented the software failure incident [32754].
Fixes 1. Turning the site off and then on again [32754] 2. Fixing a server error [32754] 3. Resolving a read-only error [32754]
References 1. Facebook spokesperson [Article 32754] 2. Data from IT management firm Compuware [Article 32754]
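
The Preventions entries above (testing a change before it is applied, and monitoring while it rolls out) can be illustrated with a minimal sketch. This is not Facebook's actual process, which the article does not describe. The function names (`validate`, `staged_rollout`), the stage fractions, and the `apply` and `healthy` callbacks are all hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, List, Sequence

Config = Dict[str, str]


def validate(config: Config, required_keys: List[str]) -> List[str]:
    """Return a list of problems found before the change touches any host."""
    problems = []
    for key in required_keys:
        if not config.get(key):
            problems.append(f"missing or empty key: {key}")
    return problems


def staged_rollout(
    hosts: List[str],
    old: Config,
    new: Config,
    required_keys: List[str],
    apply: Callable[[str, Config], None],   # hypothetical: pushes a config to a host
    healthy: Callable[[str], bool],         # hypothetical: post-change health probe
    stages: Sequence[float] = (0.01, 0.1, 1.0),  # assumed fractions of the fleet
) -> bool:
    """Apply `new` to growing fractions of hosts; roll back on the first failure."""
    if validate(new, required_keys):
        return False  # reject the change before any host is touched
    done: List[str] = []
    for fraction in stages:
        # Always cover at least one host so the first stage acts as a canary.
        target = max(1, int(len(hosts) * fraction))
        for host in hosts[len(done):target]:
            apply(host, new)
            done.append(host)
            if not healthy(host):
                # Restore the previous config on every host changed so far.
                for changed in done:
                    apply(changed, old)
                return False
    return True
```

The design choice in this sketch is that a bad change is caught either by validation (no host affected) or by the health probe on a small first stage (few hosts affected), rather than reaching the whole fleet at once.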

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization, multiple_organization (a) The software failure incident having happened again at one_organization: 1. Facebook experienced multiple outages over the years due to various issues such as server errors, configuration system changes, and read-only errors caused by its engineers [32754]. 2. In 2010, Facebook suffered a disruption due to a networking problem caused by its engineers, which was resolved by turning the site off and then on again [32754]. (b) The software failure incident having happened again at multiple_organization: 1. The outage that affected Facebook also impacted other services like Instagram, Tinder, AOL messenger, and Hipchat, which rely on Facebook for logins [32754]. 2. Skype was one of the high-profile services affected by Facebook's outage in August 2014, indicating that the incident had repercussions on other organizations as well [32754].
Phase (Design/Operation) design, operation (a) The software failure incident related to the design phase can be seen in the article [32754] where on 27 January 2015, Facebook experienced a 50-minute outage caused by a change introduced within its systems that affected the configuration systems. This change led to trouble accessing Facebook and Instagram, impacting other services like Tinder, AOL messenger, and Hipchat that rely on Facebook for logins. This outage was not a result of a cyber attack but a consequence of a change made during system development. (b) The software failure incident related to the operation phase is evident in the article [32754] where on 21 October 2013, Facebook faced a "read-only error" that prevented users from posting status updates for more than four hours. This issue, not a complete outage but a disruption in operation, was caused by a network maintenance problem introduced during the operation of the system.
Boundary (Internal/External) within_system (a) within_system: - The software failure incident on 27 January 2015 was caused by Facebook attempting to change something within its systems which went wrong, not a cyber attack as widely reported [32754]. - On 21 October 2013, Facebook experienced a "read-only error" caused by its engineers during network maintenance, preventing users from posting status updates for more than four hours [32754]. - In 2010, Facebook suffered a disruption due to a fiendishly complex networking problem caused by its engineers, which was resolved by turning the site off and then on again [32754]. (b) outside_system: - The article does not provide specific incidents or details indicating software failure incidents caused by contributing factors originating from outside the system.
Nature (Human/Non-human) non-human_actions, human_actions (a) The software failure incident occurring due to non-human actions: - The outage on 27 January 2015 was caused by Facebook attempting to change something within its systems which went wrong, not a cyber attack as widely reported. It was stated that this was not the result of a third-party attack but instead occurred after a change that affected the configuration systems [32754]. - The outage on 1 August 2014 was caused by another server error, affecting Facebook's site, apps, and sites and services that use its login system [32754]. - The outage on 19 June 2014 was the longest outage in four years for Facebook, lasting 31 minutes. The issue prevented people from posting to Facebook for a brief period of time, and the cause was not elaborated upon in the statement provided by Facebook [32754]. - The outage on 21 October 2013 was a "read-only error" caused by Facebook's engineers during network maintenance, preventing users from posting status updates for more than four hours [32754]. - In 2010, Facebook suffered a two-hour disruption due to a fiendishly complex networking problem caused by its engineers. The solution was to turn the site off and then on again [32754]. (b) The software failure incident occurring due to human actions: - In 2010, Facebook engineers were to blame for a disruption caused by a complex networking problem [32754]. - On 31 July 2007, Facebook was purposefully taken offline by its engineers to fix a bug identified earlier that day [32754].
Dimension (Hardware/Software) hardware, software (a) The software failure incident occurring due to hardware: - In the incident on 24 September 2010, Facebook suffered a disruption due to a fiendishly complex networking problem caused by its engineers, specifically a runaway condition at a "database cluster" of computer servers [Article 32754]. - The issue required Facebook to stop all traffic to the affected database cluster, essentially turning off the site to address the hardware-related problem. (b) The software failure incident occurring due to software: - The outage on 27 January 2015 was caused by Facebook attempting to change something within its systems, which went wrong. This was not a cyber attack but a change that affected the configuration systems, indicating a software-related issue [Article 32754]. - Similarly, on 1 August 2014, another outage was caused by a server error, affecting Facebook's site, apps, and services that use its login system, pointing to a software-related problem [Article 32754].
Objective (Malicious/Non-malicious) non-malicious (a) The software failure incident mentioned in the articles was non-malicious. For example, the outage on 27 January 2015 was caused by Facebook attempting to change something within its systems which went wrong, not a cyber attack as widely reported [32754]. Additionally, the outage on 1 August 2014 was caused by another server error, not a malicious attack [32754]. (b) The software failure incidents were not due to malicious intent but rather technical issues or errors introduced during system changes or maintenance.
Intent (Poor/Accidental Decisions) poor_decisions, accidental_decisions (a) poor_decisions: - The software failure incident on 27 January 2015 was caused by Facebook attempting to change something within its systems which went wrong, not a cyber attack as widely reported. This indicates a failure due to a decision to make changes that led to the outage [32754]. - On 21 October 2013, Facebook engineers were to blame for a "read-only error" that prevented users from posting status updates for more than four hours. This issue occurred while performing network maintenance, indicating a decision that led to the problem [32754]. (b) accidental_decisions: - The software failure incident on 1 August 2014 was caused by another server error, affecting Facebook's site, apps, and services that use its login system. This suggests a failure due to an accidental decision or mistake that led to the server error [32754]. - In the incident on 24 September 2010, Facebook suffered a disruption due to a complex networking problem caused by its engineers. The solution to the issue was simple - turning the site off and then on again, indicating an accidental decision that led to the disruption [32754].
Capability (Incompetence/Accidental) development_incompetence, accidental (a) The software failure incident occurring due to development incompetence: - In the incident on 21 October 2013, Facebook's engineers were to blame for a "read-only error" that prevented users from posting status updates for more than four hours [32754]. - In the incident on 24 September 2010, Facebook suffered a disruption due to a fiendishly complex networking problem caused by its engineers, which was resolved by turning the site off and then on again [32754]. (b) The software failure incident occurring accidentally: - The outage on 27 January 2015 was caused by Facebook attempting to change something within its systems which went wrong, not a cyber attack as widely reported. This change affected the configuration systems, leading to the outage [32754]. - The incident on 1 August 2014 was caused by another server error, affecting Facebook's site, apps, and sites and services that use its login system [32754].
Duration temporary (a) The articles provide information about temporary software failure incidents, where the duration of the failure was not permanent. For example, on 27 January 2015, Facebook experienced a 50-minute outage due to a change in its systems that affected its configuration systems [32754]. Similarly, on 1 August 2014, there was another outage caused by a server error, lasting one hour and 40 minutes [32754]. These incidents highlight temporary failures that were resolved within a specific timeframe.
Behaviour crash, omission, other (a) crash: - The article mentions a Facebook outage in 2010 where the solution was to turn off and on the site due to a networking problem [Article 32754]. - There was an outage in 2013 where users were unable to post status updates for more than four hours due to a "read-only error" [Article 32754]. (b) omission: - In 2014, Facebook experienced an outage caused by a server error that affected the site, apps, and services using its login system [Article 32754]. - In 2015, an outage occurred when Facebook attempted to change something within its systems, affecting Facebook, Instagram, Tinder, AOL messenger, and Hipchat [Article 32754]. (c) timing: - There is no specific mention of a timing-related failure in the provided article. (d) value: - In 2010, Facebook suffered a two-hour disruption due to a complex networking problem caused by its engineers [Article 32754]. (e) byzantine: - There is no specific mention of a byzantine-related failure in the provided article. (f) other: - In 2007, Facebook was taken offline purposefully by its engineers to fix a bug identified earlier that day [Article 32754].

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence property, delay, non-human (a) death: There is no mention of any deaths resulting from the software failure incidents reported in the articles [32754]. (b) harm: There is no mention of physical harm to individuals resulting from the software failure incidents reported in the articles [32754]. (c) basic: There is no mention of people's access to food or shelter being impacted due to the software failure incidents reported in the articles [32754]. (d) property: The software failure incidents did impact people's material goods, money, or data. For example, in the outage on 1 August 2014, Skype was one of the high-profile services made unavailable for some users who were unable to log in during the outage [32754]. (e) delay: People had to postpone activities due to the software failure incidents. For instance, during the outage on 19 June 2014, users sought refuge on other social networks like Twitter and Google+ when Facebook was down for 31 minutes [32754]. (f) non-human: Non-human entities were impacted due to the software failure incidents. For example, other services like Instagram, Tinder, AOL messenger, Hipchat, and Skype were affected as they rely on Facebook for logins [32754]. (g) no_consequence: There were observed consequences of the software failure incidents reported in the articles [32754]. (h) theoretical_consequence: There were no theoretical consequences discussed in the articles [32754]. (i) other: There were no other consequences described in the articles beyond the impact on services, logins, and user activities.
Domain information, utilities, entertainment (a) The failed system in the incident was related to the information industry as it impacted social media platforms like Facebook and Instagram, which are major platforms for the production and distribution of information [32754]. (g) The incident also affected other services like Tinder, AOL messenger, and Hipchat, which rely on Facebook for logins, indicating a connection to the utilities industry, specifically in terms of online services and communication platforms [32754]. (k) The outage led users to seek refuge on other social networks like Twitter and Google+, highlighting the impact on the entertainment industry as users turned to alternative platforms for leisure and social interaction [32754].
