Incident: Facebook Outage: Global Backbone Network Disconnect Incident and Impact

Published Date: 2021-10-05

Postmortem Analysis
Timeline 1. The software failure incident occurred on Monday, 4 October 2021, the day before the article was published on Tuesday, 5 October 2021 [119718].
System 1. The system that manages Facebook's global backbone network capacity [119718] 2. The audit tool designed to prevent mistakes in maintenance commands [119718]
Responsible Organization 1. Facebook's engineers issued a command that unintentionally disconnected Facebook data centers from the rest of the world during routine maintenance, leading to the cascade of problems and the outage [119718]. 2. The audit tool within Facebook's systems encountered a bug and failed to stop the command that caused the outage, contributing to the software failure incident [119718].
Impacted Organization 1. Facebook 2. Instagram 3. WhatsApp 4. Employees of the company [119718]
Software Causes 1. An error during routine maintenance of Facebook's network of data centers set off a cascade of problems that led to the outage [Article 119718]. 2. A command issued during that maintenance, within the company’s global backbone network, unintentionally disconnected Facebook data centers from the rest of the world [Article 119718]. 3. The audit tool designed to prevent such mistakes had a bug and failed to stop the command that caused the outage [Article 119718].
Non-software Causes 1. The error during routine maintenance of Facebook's network of data centers caused a cascade of problems, leading to the outage [Article 119718]. 2. An unintentional disconnection of Facebook data centers from the rest of the world occurred due to a command issued by engineers [Article 119718]. 3. The outage was triggered by a system managing Facebook's global backbone network capacity, involving tens of thousands of miles of fiber-optic cables [Article 119718]. 4. The audit tool designed to prevent mistakes encountered a bug and failed to stop the command that caused the outage [Article 119718]. 5. Physical and system security measures in place at the data centers delayed engineers from accessing and working on the servers promptly [Article 119718].
Impacts 1. The software failure incident caused a global outage that took down Facebook, Instagram, and WhatsApp for more than six hours, affecting billions of users [Article 119718]. 2. The outage blocked employees from accessing internal tools, impacting their ability to investigate and repair the issue [Article 119718]. 3. Engineers had difficulty accessing the data centers to debug and restart the systems due to physical and system security measures in place [Article 119718]. 4. Despite concerns about a surge in traffic causing further crashes, access to Facebook's services returned relatively quickly due to drills the company had run to prepare for such situations [Article 119718].
Preventions 1. Implementing more robust auditing mechanisms to catch and prevent unintended commands during routine maintenance [119718]. 2. Conducting thorough testing and validation of audit tools to ensure they are functioning properly and effectively stopping erroneous commands [119718]. 3. Enhancing security protocols to allow quicker access for engineers to debug and restart systems in case of failures [119718]. 4. Continuously conducting drills and simulations to prepare for potential outages and ensure quick recovery of services [119718].
Fixes 1. Implementing stricter controls and checks on commands issued during routine maintenance to prevent unintentional disconnections like the one that caused the outage [119718]. 2. Enhancing the audit tool to ensure it effectively detects and stops commands that could lead to network disruptions [119718]. 3. Conducting regular drills and simulations to prepare for potential outages and ensuring quick recovery processes are in place [119718].
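The audit check described under Preventions and Fixes can be illustrated with a minimal sketch. It is not Facebook's tool: the article gives no implementation details, so every name here (MaintenanceCommand, AuditRejected, audit, min_remaining, the link labels) is hypothetical. The sketch assumes a command declares up front which backbone links it will disable, and that the audit fails closed, rejecting any command that names an unknown link or would leave too few links up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaintenanceCommand:
    """Hypothetical maintenance command and the backbone links it would disable."""
    name: str
    links_to_disable: frozenset


class AuditRejected(Exception):
    """Raised when the audit refuses to let a command run."""


def audit(command, active_links, min_remaining=1):
    """Return the links that would stay up, or raise AuditRejected.

    Fails closed: an unknown link or too few remaining links blocks the command.
    """
    unknown = command.links_to_disable - active_links
    if unknown:
        raise AuditRejected(f"{command.name}: unknown links {sorted(unknown)}")
    remaining = active_links - command.links_to_disable
    if len(remaining) < min_remaining:
        raise AuditRejected(
            f"{command.name}: would leave {len(remaining)} link(s), "
            f"need at least {min_remaining}"
        )
    return remaining


# Illustrative data only: three made-up link labels.
active = frozenset({"link-a", "link-b", "link-c"})
safe = MaintenanceCommand("drain-one", frozenset({"link-a"}))
unsafe = MaintenanceCommand("drain-all", active)

print(sorted(audit(safe, active)))
try:
    audit(unsafe, active)
except AuditRejected as err:
    print("blocked:", err)
```

The design choice worth noting is that the check raises on any doubt rather than returning a warning, so a bug in the caller cannot silently ignore the result. Testing such a guard against a command that disables every link is the kind of validation the Preventions section calls for.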
References 1. Facebook's blogpost by Santosh Janardhan, vice-president of engineering [Article 119718] 2. Downdetector, a web monitoring firm [Article 119718]

Software Taxonomy of Faults

Category Option Rationale
Recurring unknown (a) The article does not mention a similar incident having happened before at Facebook, so it is unknown whether the global outage of Facebook, Instagram, and WhatsApp, caused by an error during routine maintenance of its network of data centers, is a recurrence within the organization. (b) The article does not provide information about a similar incident happening at other organizations or with their products and services.
Phase (Design/Operation) design (a) The software failure incident in the article was primarily due to a design issue. Facebook mentioned that an error during routine maintenance of its network of data centers caused a cascade of problems that took down its platforms for more than six hours. The global outage began when the company’s engineers issued a command that unintentionally disconnected Facebook data centers from the rest of the world. This error originated within the company’s "global backbone" of fiber-optic cables and data centers, indicating a design flaw in the system's architecture [Article 119718]. (b) The software failure incident was not primarily due to operation factors or misuse of the system. The outage was triggered by a command issued during routine maintenance that unintentionally disconnected Facebook data centers globally. The company also mentioned that the audit tool, designed to prevent mistakes, encountered a bug and failed to stop the command that caused the outage, indicating a design flaw rather than an operational issue [Article 119718].
Boundary (Internal/External) within_system (a) within_system: The software failure incident reported in the article was primarily caused by an error during routine maintenance of Facebook's network of data centers. The incident originated within the company's "global backbone" of fiber-optic cables and data centers. An unintentional command issued during maintenance disconnected Facebook data centers globally, leading to the outage affecting Facebook, Instagram, and WhatsApp. Additionally, the audit tool designed to prevent mistakes encountered a bug, which failed to stop the command that caused the outage, further highlighting internal system issues [119718]. (b) outside_system: There is no specific mention in the article of contributing factors originating from outside the system that directly led to the software failure incident.
Nature (Human/Non-human) non-human_actions, human_actions (a) The software failure incident occurred due to non-human actions, specifically an error during routine maintenance of Facebook's network of data centers. The incident was triggered by a system managing the global backbone network capacity, where a command issued unintentionally disconnected Facebook data centers globally. This error was not caused by malicious activity but rather by a mistake during maintenance procedures [Article 119718]. (b) The software failure incident also involved human actions. Facebook's engineers issued a command during routine maintenance that led to the unintentional disconnection of data centers from the rest of the world. Additionally, the audit tool designed to prevent mistakes encountered a bug and failed to stop the command that caused the outage, indicating a human error in the oversight of the audit tool's functionality [Article 119718].
Dimension (Hardware/Software) hardware, software (a) The software failure incident reported in Article 119718 was primarily due to hardware-related factors. Facebook mentioned that an error during routine maintenance of its network of data centers caused a cascade of problems that took down its platforms for more than six hours. The incident began when engineers issued a command that unintentionally disconnected Facebook data centers from the rest of the world, originating within the company’s "global backbone" of fiber-optic cables and data centers. The outage was triggered by a system managing the global backbone network capacity, which effectively disconnected Facebook data centers globally [119718]. (b) The software failure incident also had contributing factors originating in software. Facebook stated that the audit tool designed to prevent mistakes encountered a bug and failed to stop the command that caused the outage. This software-related issue led to the failure of the audit tool to prevent the unintentional disconnection of Facebook data centers from the global network during routine maintenance, contributing to the overall software failure incident [119718].
Objective (Malicious/Non-malicious) non-malicious (a) The software failure incident related to the Facebook outage was non-malicious. Facebook explicitly stated that the outage was not caused by malicious activity [Article 119718]. The incident was attributed to an error during routine maintenance that had unintended consequences, causing a cascade of problems that took down Facebook, Instagram, and WhatsApp for billions of users. The error originated in the system managing the company's global backbone network capacity, where a command issued during maintenance disconnected Facebook data centers globally. Additionally, the audit tool designed to prevent mistakes encountered a bug and failed to stop the command that caused the outage. The company's response to the incident focused on debugging and restarting the systems, learning from the failure, and improving to prevent such events in the future. (b) There is no indication in the article that the software failure incident was malicious. The outage was attributed to technical errors and unintended consequences during routine maintenance, rather than any intentional action to harm the system.
Intent (Poor/Accidental Decisions) accidental_decisions (a) The software failure incident described in the article was primarily due to accidental decisions. Facebook's vice-president of engineering, Santosh Janardhan, mentioned that the global outage affecting Facebook, Instagram, and WhatsApp was caused by an error during routine maintenance that led to unintentional disconnection of Facebook data centers from the rest of the world. The incident originated within the company's global backbone network and was triggered by a command issued to assess global backbone capacity, which unintentionally took down all connections in the backbone network [Article 119718]. Additionally, the audit tool designed to prevent mistakes encountered a bug and failed to stop the command that caused the outage, indicating that the failure was not due to deliberate poor decisions but rather accidental factors.
Capability (Incompetence/Accidental) development_incompetence, accidental (a) The software failure incident can be attributed to development incompetence as it was caused by an error during routine maintenance of Facebook's network of data centers. The incident occurred when engineers issued a command that unintentionally disconnected Facebook data centers from the rest of the world, originating within the company’s global backbone of fiber-optic cables and data centers [Article 119718]. (b) Additionally, the incident can also be categorized as accidental, as the error that triggered the outage was unintentional. The command issued during routine maintenance was meant to assess the availability of global backbone capacity but ended up taking down all the connections in the backbone network, effectively disconnecting Facebook data centers globally. This unintended consequence led to the cascade of problems that took down Facebook, Instagram, and WhatsApp for billions of users [Article 119718].
Duration temporary (a) The software failure incident described in the article was temporary. An error during routine maintenance of Facebook's network of data centers led to a cascade of problems that took down Facebook, Instagram, and WhatsApp for more than six hours. The outage was triggered by a command issued during maintenance that unintentionally disconnected Facebook data centers from the rest of the world. It was not permanent: the company restored network connectivity to the data centers, and access to its services then returned relatively quickly, which the article attributes to drills the company had previously run to prepare for such situations [Article 119718].
Behaviour crash (a) crash: The software failure incident in the article can be categorized as a crash. Facebook, Instagram, and WhatsApp went down for more than six hours due to an error during routine maintenance that caused a cascade of problems, leading to a complete outage of the platforms for billions of users [Article 119718]. (b) omission: The incident does not seem to be primarily related to omission, as it was triggered by an error during routine maintenance that unintentionally disconnected Facebook data centers from the rest of the world, rather than omitting to perform its intended functions [Article 119718]. (c) timing: The software failure incident is not related to timing issues, as there is no indication that the system performed its intended functions too late or too early. The outage was a result of an error during routine maintenance that caused a cascade of problems, leading to the platforms going down for more than six hours [Article 119718]. (d) value: The failure is not directly related to the system performing its intended functions incorrectly. Instead, it was caused by an error during routine maintenance that disconnected Facebook data centers globally, leading to the outage [Article 119718]. (e) byzantine: The incident does not align with a byzantine failure, which involves inconsistent responses and interactions. The outage in this case was triggered by an unintentional disconnection of data centers during routine maintenance, rather than erratic or inconsistent behavior of the system [Article 119718]. (f) other: The behavior of the software failure incident can be categorized as a crash, where the system lost its state and failed to perform any of its intended functions, leading to a complete outage of Facebook, Instagram, and WhatsApp for billions of users [Article 119718].

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence property, delay The consequence of the software failure incident described in the article was primarily related to the delay and property aspects: - delay: The software failure incident caused a global outage that took down Facebook, Instagram, and WhatsApp for more than six hours, impacting billions of users [119718]. - property: The outage resulted in users losing access to the platforms, including WhatsApp with over 2 billion users, and employees being blocked from internal tools. Additionally, the outage affected the company's ability to investigate and repair the issue promptly [119718].
Domain information (a) The software failure incident affected the information industry as it disrupted the services of Facebook, Instagram, and WhatsApp, which are platforms used for the production and distribution of information to billions of users worldwide [Article 119718].
