Incident Details

Incident: Facebook Outage: Global Services Down Due to Configuration Change

Published Date: 2021-10-05

Postmortem Analysis
Timeline	1. The software failure incident affecting Facebook and its platforms, including Instagram, WhatsApp, and Messenger, occurred on October 4, 2021 [119715].
System	1. Configuration change to the backbone routers that coordinate network traffic between the company’s data centres [119715] 2. Domain Name System (DNS) and Border Gateway Protocol (BGP) [119715]
Responsible Organization	1. Facebook - The software failure incident was caused by a configuration change to the backbone routers that coordinate network traffic between Facebook's data centers, leading to a cascading effect that brought down all Facebook services [119715].
Impacted Organization	1. Facebook 2. Instagram 3. WhatsApp 4. Messenger 5. Businesses and individuals relying on Facebook services 6. Cloudflare 7. Users worldwide 8. Mark Zuckerberg's personal wealth [119715]
Software Causes	1. A configuration change to the backbone routers that coordinate network traffic between Facebook's data centers had a cascading effect, bringing all Facebook services to a halt [119715]. 2. Facebook accidentally sent an update to a deep-level routing protocol on the internet (BGP) that essentially stated that it no longer had any servers, so Facebook and its services could not be reached [119715].
Non-software Causes	1. Configuration change to the backbone routers that coordinate network traffic between the company’s data centres [119715] 2. Deep-level routing protocol update that affected the internet's Domain Name System (DNS) and Border Gateway Protocol (BGP) [119715]
Impacts	1. Facebook, Instagram, WhatsApp, and Messenger went down globally for close to six hours, impacting billions of users who rely on these platforms for communication and business purposes [119715]. 2. The outage affected not only external services but also Facebook's internal systems, causing disruptions for employees who were locked out of offices and unable to access their own internal communications platform [119715]. 3. The outage led to a drop in Facebook's share price by 4.9%, resulting in CEO Mark Zuckerberg's personal wealth decreasing by $6 billion [119715].
Preventions	1. Implementing stricter change control processes to prevent accidental configuration changes like the one that caused the outage [119715]. 2. Diversifying network infrastructure to avoid a single point of failure for a vast number of online services [119715]. 3. Enhancing monitoring and alerting systems to quickly detect and respond to issues before they escalate [unknown]. 4. Conducting regular audits and testing of critical systems to identify and address potential vulnerabilities [unknown].
Fixes	1. Implement stricter change control processes to prevent accidental configuration changes like the one that caused the outage at Facebook [119715]. 2. Enhance monitoring and alerting systems to quickly detect and respond to network issues or anomalies that could lead to widespread service disruptions [119715]. 3. Diversify infrastructure and avoid single points of failure to minimize the impact of outages on a global scale [119715].
References	1. Facebook's statement on the cause of the outage [119715] 2. Cloudflare's detailed explanation of what happened during the outage [119715] 3. Guardian's UK technology editor, Alex Hern's insights on the outage [119715]
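
The first prevention listed above (stricter change control) could take the form of a pre-flight check that refuses a routing change if it would withdraw critical prefixes or too large a share of all announced prefixes. The sketch below is a minimal illustration under stated assumptions, not Facebook's actual tooling: the function name validate_change, the 0.5 threshold, and the documentation-range prefixes are all hypothetical.

```python
def validate_change(current, proposed, critical, max_withdraw_fraction=0.5):
    """Return reasons to block a routing change; an empty list means it may proceed.

    Hypothetical helper: `current` and `proposed` are the sets of prefixes
    announced before and after the change, `critical` is the subset that must
    never be withdrawn (for example, the prefixes hosting DNS nameservers).
    """
    current, proposed, critical = set(current), set(proposed), set(critical)
    withdrawn = current - proposed
    problems = []

    # Rule 1: never withdraw a prefix marked as critical.
    lost = sorted(withdrawn & critical)
    if lost:
        problems.append("withdraws critical prefixes: " + ", ".join(lost))

    # Rule 2: block changes that withdraw too large a share of all prefixes.
    if current and len(withdrawn) / len(current) > max_withdraw_fraction:
        problems.append(
            f"withdraws {len(withdrawn)} of {len(current)} announced prefixes"
        )
    return problems


# Example prefixes come from the reserved documentation ranges (assumed data).
current = {"192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"}
critical = {"192.0.2.0/24"}

for reason in validate_change(current, set(), critical):
    print("BLOCKED:", reason)
```

A check of this kind only helps if it runs outside the system being changed, so that a faulty change cannot also disable the safeguard.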

Software Taxonomy of Faults

Category	Option	Rationale
Recurring	one_organization, multiple_organization	(a) The software failure incident having happened again at one_organization: The article mentions that Facebook had its own previous internet outage issues, similar to the recent outage that occurred on Monday and Tuesday [119715]. (b) The software failure incident having happened again at multiple_organization: The article references other outages, including the Cloudflare outage in 2020 and the Fastly outage in June, indicating that similar incidents have occurred at other organizations as well [119715].
Phase (Design/Operation)	design, operation	(a) The software failure incident related to the design phase was due to a configuration change to the backbone routers that coordinate network traffic between Facebook's data centers. This change had a cascading effect, bringing all Facebook services to a halt [119715]. (b) The software failure incident related to the operation phase was exacerbated by the fact that Facebook's own internal systems are run from the same place, making it difficult for employees to diagnose and resolve the problem. Additionally, employees were reportedly unable to access their own communications platform and office due to the security pass system being caught up in the outage [119715].
Boundary (Internal/External)	within_system, outside_system	(a) within_system: The software failure incident was primarily caused by a configuration change to the backbone routers within Facebook's system, which had a cascading effect on all Facebook services [119715]. Additionally, Facebook's own internal systems were affected by the outage, making it challenging for employees to diagnose and resolve the problem [119715]. (b) outside_system: The software failure incident was also influenced by external factors related to the internet infrastructure. Cloudflare explained that the incident involved the Domain Name System (DNS) and Border Gateway Protocol (BGP), which are essential components of the internet's infrastructure [119715]. Facebook accidentally sent an update to a deep-level routing protocol on the internet, causing a disruption in the routing paths to Facebook and all services it runs [119715].
Nature (Human/Non-human)	non-human_actions, human_actions	(a) The software failure incident occurring due to non-human actions: The Facebook outage was caused by a configuration change to the backbone routers that coordinate network traffic between the company’s data centers. This change had a cascading effect, bringing all Facebook services to a halt [119715]. (b) The software failure incident occurring due to human actions: The outage was a result of a configuration change made to the backbone routers, which was likely a human error or oversight. Additionally, Facebook staff were reportedly unable to access their own communications platform, Workplace, due to the outage, indicating the impact of human actions on the incident [119715].
Dimension (Hardware/Software)	hardware, software	(a) The software failure incident occurring due to hardware: - The outage of Facebook and its platforms like Instagram, WhatsApp, and Messenger was caused by a configuration change to the backbone routers that coordinate network traffic between the company’s data centers [119715]. - Facebook sent an update to a deep-level routing protocol on the internet that essentially stated that they no longer had any servers, leading to a disruption in the routing paths to Facebook and its services [119715]. (b) The software failure incident occurring due to software: - The outage was primarily caused by a configuration change to the backbone routers, which is a software-related issue [119715]. - The incident involved sending updates to routing protocols that affected the availability of Facebook and its associated services, indicating a software-related error [119715].
Objective (Malicious/Non-malicious)	non-malicious	(a) The software failure incident related to the Facebook outage was non-malicious. The outage was caused by a configuration change to the backbone routers that coordinate network traffic between the company’s data centers, which had a cascading effect, bringing all Facebook services to a halt. This was not an intentional act to harm the system but rather a mistake that led to the outage [119715].
Intent (Poor/Accidental Decisions)	poor_decisions, accidental_decisions	(a) poor_decisions: The software failure incident involving Facebook going down globally for close to six hours was primarily attributed to a poor decision related to a configuration change to the backbone routers that coordinate network traffic between the company’s data centers. This configuration change had a cascading effect, bringing all Facebook services to a halt [119715]. (b) accidental_decisions: The incident also involved accidental decisions or mistakes, as Facebook accidentally sent an update to a deep-level routing protocol on the internet that essentially stated that they no longer had any servers. This unintentional update disrupted the routing paths not only for Facebook but for everything Facebook runs, leading to the outage [119715].
Capability (Incompetence/Accidental)	development_incompetence, accidental	(a) The software failure incident occurring due to development incompetence: The Facebook outage was attributed to a configuration change to the backbone routers that coordinate network traffic between the company’s data centers. This change had a cascading effect, bringing all Facebook services to a halt. Additionally, Facebook accidentally sent an update to a deep-level routing protocol on the internet that disrupted the paths to Facebook and everything it runs, causing the outage [119715]. (b) The software failure incident occurring accidentally: The outage was also described as accidental, as Facebook essentially told the Border Gateway Protocol (BGP) through a series of updates that the paths to Facebook no longer existed, affecting not just Facebook but all of its services. This accidental disruption led to people being unable to reach Facebook due to the inability to find the path to access it [119715].
Duration	temporary	(a) The software failure incident in this case was temporary. The outage of Facebook and its associated platforms, including Instagram, WhatsApp, and Messenger, lasted close to six hours globally [119715]. The outage was caused by a configuration change to the backbone routers that coordinate network traffic between the company’s data centers, leading to a cascading effect that brought all Facebook services to a halt. The outage was eventually resolved by sending a technical team to manually reset the servers where the problem originated [119715].
Behaviour	crash, omission, timing, value, other	(a) crash: The software failure incident in this case can be categorized as a crash. The outage of Facebook and its associated platforms, including Instagram, WhatsApp, and Messenger, resulted in a complete halt of services for close to six hours. This was caused by a configuration change to the backbone routers that coordinate network traffic between the company’s data centers, leading to a cascading effect that brought all Facebook services to a halt [119715]. (b) omission: The software failure incident can also be categorized as an omission. During the outage, the affected services, including Facebook's internal systems, were not performing their intended functions. Employees were reportedly locked out of offices and unable to access their own internal communications platform [119715]. (c) timing: The software failure incident can be associated with timing issues. The outage lasted for close to six hours, with services beginning to be restored after more than five hours. The duration and severity of the outage meant that systems were being brought back to full capacity slowly [119715]. (d) value: The software failure incident can be linked to value issues. The outage resulted in Facebook and its associated services not performing their intended functions correctly. Users were unable to access the platforms, and businesses that rely on Facebook for various purposes were impacted [119715]. (e) byzantine: The software failure incident does not align with a byzantine behavior as described in the articles. (f) other: The software failure incident can be categorized as a combination of crash, omission, timing, and value issues, as described in the articles. The incident involved a system crash leading to a complete halt of services, omission of intended functions, timing issues in restoring services slowly, and incorrect performance of functions during the outage [119715].
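
The Boundary and Capability rationales above describe how BGP updates told the internet that the paths to Facebook no longer existed, after which DNS lookups could not succeed either. The toy model below sketches that dependency. It is not real BGP or DNS: the class RoutingTable, the function resolve, the hostname, and the documentation-range addresses are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RoutingTable:
    """Toy stand-in for the set of prefixes currently announced to the internet."""
    announced: set = field(default_factory=set)

    def announce(self, prefix):
        self.announced.add(prefix)

    def withdraw(self, prefix):
        self.announced.discard(prefix)

    def reachable(self, prefix):
        return prefix in self.announced


def resolve(name, zone, nameserver_prefix, table):
    """Return the address for `name`, or None if the nameserver cannot be reached.

    The zone data is intact the whole time; only the path to the nameserver
    that serves it can disappear.
    """
    if not table.reachable(nameserver_prefix):
        return None
    return zone.get(name)


# Hypothetical data: a documentation-range prefix and a made-up hostname.
NS_PREFIX = "192.0.2.0/24"
zone = {"service.example": "198.51.100.10"}

table = RoutingTable()
table.announce(NS_PREFIX)
before = resolve("service.example", zone, NS_PREFIX, table)

# The faulty update withdraws the route to the nameservers.
table.withdraw(NS_PREFIX)
after = resolve("service.example", zone, NS_PREFIX, table)

print("before withdrawal:", before)
print("after withdrawal:", after)
```

In this model the servers and their records never change; the lookup fails only because the route to the nameserver is gone, which mirrors the description that people could not find the path to reach Facebook.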

IoT System Layer

Layer	Option	Rationale
Perception	None	None
Communication	None	None
Application	None	None

Other Details

Category	Option	Rationale
Consequence	delay	(a) death: People lost their lives due to the software failure (b) harm: People were physically harmed due to the software failure (c) basic: People's access to food or shelter was impacted because of the software failure (d) property: People's material goods, money, or data were impacted due to the software failure (e) delay: People had to postpone an activity due to the software failure (f) non-human: Non-human entities were impacted due to the software failure (g) no_consequence: There were no real observed consequences of the software failure (h) theoretical_consequence: There were potential consequences discussed of the software failure that did not occur (i) other: There were consequences of the software failure not described in options (a) to (h). The articles do not mention any consequences related to death, physical harm, impact on access to food or shelter, impact on material goods, money, or data, or any other direct consequences on individuals or non-human entities. The main consequence discussed is the disruption of services and the impact on businesses and users relying on Facebook's platforms. The outage caused inconvenience, delays in communication and access to services, and financial implications for businesses and Facebook itself. There were no reports of any direct harm or casualties resulting from the software failure incident.
Domain	information, finance	(a) The software failure incident affected the information industry as Facebook, Instagram, WhatsApp, and Messenger, which went down globally, are platforms primarily used for the production and distribution of information [119715]. (h) The finance industry was indirectly impacted by the software failure incident as the outage prompted Facebook's share price to drop 4.9% on Monday, causing founder and CEO Mark Zuckerberg’s personal wealth to drop $6bn [119715]. (m) The software failure incident also had implications beyond the industries listed, as it affected billions of people who rely on Facebook for various purposes, including connecting with friends and family, businesses using it for online sales, and communication through services like WhatsApp [119715].

Sources

Facebook outage: what went wrong and why did it take so long to fix after social platform went down? - The Guardian - Published on: 2021-10-05
Article ID: 119715
