Incident: Facebook Outage: Global Unavailability for 30 Minutes in 2014

Published Date: 2014-06-19

Postmortem Analysis
Timeline 1. Facebook became unavailable worldwide for more than 30 minutes on a Thursday morning [27389]. Estimation: the article reports the incident on a Thursday morning and was published on 2014-06-19, itself a Thursday, so the incident most likely occurred on the morning of Thursday, June 19, 2014.
System The systems that failed in the Facebook outage were: 1. The Facebook website 2. The Facebook smartphone and tablet apps [27389]
Responsible Organization 1. The software failure incident that caused Facebook to be unavailable worldwide for more than 30 minutes was not attributed to a specific entity in the provided article [27389].
Impacted Organization 1. Users of Facebook worldwide were impacted by the software failure incident [27389].
Software Causes 1. Unknown
Non-software Causes 1. Unknown. The article identifies no non-software cause; the outage was attributed only to an issue that prevented people from posting to Facebook for a brief period of time [27389]. The remaining points the article raises are effects of the outage rather than causes: 2. Referral traffic from Facebook to publishers like The Guardian dropped significantly [27389]. 3. Users turned to other social networks like Twitter and Google+ for information [27389]. 4. Brands attempted "rapid response" publicity stunts, such as Nestlé making a joke about KitKats and breaks [27389]. 5. Malaysian telco Digi posted a cat meme [27389].
Impacts 1. Facebook was unavailable worldwide for more than 30 minutes, affecting both the website and the company's smartphone and tablet apps, leading users to complain and seek information on other social networks [27389]. 2. Publishers experienced a significant drop in referral traffic from Facebook during the outage, impacting their website traffic [27389]. 3. Users turned to alternative social networks like Twitter and Google+ during the Facebook outage, causing a shift in traffic patterns [27389]. 4. Brands took advantage of the outage to engage in "rapid response" publicity stunts on social media platforms like Twitter [27389].
Preventions 1. Implementing robust error handling mechanisms to prevent cascading failures like the one experienced in 2010 [27389]. 2. Conducting regular stress testing and performance monitoring to identify and address potential issues before they lead to outages [27389]. 3. Implementing redundancy and failover mechanisms to ensure service availability even in the event of a failure in one part of the system [27389].
Fixes 1. Implementing robust error checking software to prevent similar incidents in the future [27389]. 2. Conducting a thorough investigation to identify the root cause of the outage and implementing measures to address it effectively [27389]. 3. Enhancing the system's redundancy and failover mechanisms to ensure minimal downtime during such incidents [27389].
References 1. Facebook statement regarding the issue [27389] 2. Referral traffic data from The Guardian [27389]
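The redundancy and failover measure listed under Preventions can be illustrated with a minimal Python sketch. It is not Facebook's implementation, since the article discloses no technical details of the outage. The function `call_with_failover` and the `primary` and `replica` backends are hypothetical names introduced only for illustration.

```python
from typing import Callable, Sequence


def call_with_failover(backends: Sequence[Callable[[], str]]) -> str:
    """Try each backend in order and return the first successful result."""
    errors = []
    for backend in backends:
        try:
            return backend()
        except Exception as exc:  # a real system would catch narrower error types
            errors.append(exc)
    # Every backend failed: surface a single error instead of failing silently.
    raise RuntimeError(f"all {len(errors)} backends failed")


def primary() -> str:
    # Hypothetical primary backend that is currently down.
    raise ConnectionError("primary unavailable")


def replica() -> str:
    # Hypothetical standby backend that can still serve requests.
    return "served by replica"


if __name__ == "__main__":
    print(call_with_failover([primary, replica]))
```

The design choice here is to keep serving from a standby when one component fails, and to raise one explicit error only when every option is exhausted.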

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization (a) The software failure incident having happened again at one_organization: The article mentions a previous outage in 2010 when an error in error checking software brought down Facebook's main database for two and a half hours. This incident was described as the worst outage in more than four years at that time. The problem was caused by "an unfortunate handling of an error condition" [27389]. (b) The software failure incident having happened again at multiple_organization: The article does not provide specific information about similar incidents happening at other organizations or with their products and services.
Phase (Design/Operation) design (a) Design: the article's only evidence of a design-phase fault comes from the earlier 2010 outage, which was caused by "an unfortunate handling of an error condition" in error checking software [27389]. An automated system for verifying configuration values ended up causing more damage than it fixed, which points to a fault introduced during system development. The article does not say whether the 2014 outage had a similar origin. (b) Operation: the article does not describe any operation or use of the system that contributed to the failure. It reports that users turned to other social networks and searched for information about what had happened to Facebook [27389], but that behavior was a reaction to the outage, not a cause of it.
Boundary (Internal/External) within_system According to the article [27389], the failure originated within the system. The outage was caused by an issue within Facebook's system that prevented people from posting, which left the site and apps unavailable for more than 30 minutes. Facebook resolved the issue quickly and issued a statement apologizing for the inconvenience. The article mentions no external contributing factors.
Nature (Human/Non-human) non-human_actions (a) The software failure incident occurring due to non-human actions: The Facebook outage in 2014 was attributed to a technical issue that prevented people from posting to Facebook for a brief period of time. The outage lasted for more than 30 minutes, affecting both the website and the company's smartphone and tablet apps. Facebook issued a statement acknowledging the issue and mentioned that they resolved it quickly, although the exact cause of the outage was not specified [27389]. (b) The software failure incident occurring due to human actions: There is no specific mention in the provided article about the software failure incident being caused by human actions. The outage experienced by Facebook in 2014 was primarily described as a technical issue that led to the platform being unavailable for a brief period of time [27389].
Dimension (Hardware/Software) software (a) Hardware: the article does not attribute either outage to hardware. The 2010 outage it recalls, which brought down Facebook's main database for two and a half hours, was caused by "an unfortunate handling of an error condition" in error checking software: an automated system for verifying configuration values ended up causing more damage than it fixed [27389]. That is a software fault, not a hardware one. (b) Software: the 2014 outage, in which Facebook was unavailable worldwide for more than 30 minutes, affected both the website and the company's smartphone and tablet apps. Facebook acknowledged the issue and said it was resolved quickly but did not specify the exact cause, so classifying it as a software problem is an inference rather than a confirmed finding [27389].
Objective (Malicious/Non-malicious) non-malicious (a) The software failure incident reported in Article 27389 was non-malicious. The outage experienced by Facebook was attributed to an issue that prevented people from posting, leading to the site being unavailable for more than 30 minutes. Facebook issued a statement acknowledging the problem and stating that it was quickly resolved, indicating that the failure was not caused by malicious intent [27389].
Intent (Poor/Accidental Decisions) unknown The article does not say whether the 2014 Facebook outage resulted from poor decisions or accidental decisions. The outage was attributed to an issue that prevented people from posting to Facebook for a brief period of time, and Facebook did not disclose the exact cause [27389]. The intent category is therefore unknown.
Capability (Incompetence/Accidental) development_incompetence, accidental (a) Development incompetence: the supporting evidence comes from the 2010 outage, which was caused by "an unfortunate handling of an error condition" in error checking software [27389]. An automated system for verifying configuration values ended up causing more damage than it fixed, which suggests that error conditions were handled poorly during development. (b) Accidental: in 2014, an issue prevented people from posting for a brief period of time, leading to the outage. The exact cause was not disclosed, and nothing in the article suggests a deliberate action, so the failure is treated as accidental [27389].
Duration temporary (a) The software failure incident in the article was temporary. Facebook was unavailable worldwide for more than 30 minutes, marking the longest outage on the site in four years. The site collapsed at 8:53 am BST and remained unavailable until 9:24 am BST, when it started working as normal [27389]. This indicates that the failure was temporary and not permanent.
Behaviour crash (a) crash: The incident is a crash. Facebook was unavailable worldwide for more than 30 minutes, with both the website and the company's smartphone and tablet apps affected. Users were shown an error message indicating that something went wrong, and the site remained unavailable until the issue was fixed [27389]. (b) omission: The article does not describe the system omitting to perform its intended functions at particular instances. (c) timing: The article does not describe the system performing its intended functions correctly but too late or too early. (d) value: The article does not describe the system performing its intended functions incorrectly. (e) byzantine: The article does not describe erroneous behavior with inconsistent responses and interactions. (f) other: The article describes no behavior beyond the crash noted in (a) [27389].
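The Duration row gives the outage window as 8:53 am to 9:24 am BST. The short Python sketch below checks the arithmetic behind "more than 30 minutes"; `outage_minutes` is a hypothetical helper name, and both times are assumed to fall on the same day in the same time zone.

```python
from datetime import datetime


def outage_minutes(start: str, end: str) -> int:
    """Return whole minutes between two same-day HH:MM clock times."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    # Times reported in the article, both in BST.
    print(outage_minutes("08:53", "09:24"))  # 31
```

The result of 31 minutes agrees with the article's description of an outage lasting more than 30 minutes.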

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence unknown The article does not mention any direct consequences such as death, physical harm, impact on basic needs, or property loss due to the Facebook outage. Users experienced inconvenience, and publishers saw a decline in referral traffic from Facebook. Some brands attempted to capitalize on the situation with publicity stunts, and users turned to other social networks for information [27389].
Domain information, entertainment (a) Information: the incident affected Facebook, a social networking platform primarily focused on the production and distribution of information. Users were unable to post during the outage, which significantly disrupted the flow of information on the platform [27389]. (k) Entertainment: the outage prompted users to turn to other social networks like Twitter and Google+ for information, which suggests that social media, counted here as part of the entertainment industry, was also affected [27389].
