Incident: Google Outage: Global Services Down for Five Minutes.

Published Date: 2013-08-19

Postmortem Analysis
Timeline 1. The software failure incident happened on August 16, 2013 [21093].
System 1. Google Search 2. YouTube 3. Google Drive 4. Gmail 5. Google Apps Dashboard The above systems/components failed during the software failure incident reported in Article #21093.
Responsible Organization 1. The entity responsible for causing the software failure incident that led to the outage of Google services, including Google Search, YouTube, and Google Drive, remains unknown as Google did not disclose the specific cause of the outage [21093].
Impacted Organization 1. Internet traffic around the world dropped by 40% during the outage, impacting all users relying on Google services [21093]. 2. Google users, including those trying to access Gmail, Google Search, YouTube, and Google Drive, were unable to use these services during the outage [21093].
Software Causes 1. The software cause of the failure incident was not specifically disclosed by Google, as the company did not provide details on what caused the outage [21093].
Non-software Causes 1. Unknown
Impacts 1. A 40% drop in global Internet traffic was observed during the five-minute outage of Google services, affecting users worldwide [21093]. 2. All Google services, including Google Search, YouTube, and Google Drive, were unavailable for those five minutes, disrupting users' access to these platforms [21093]. 3. The outage caused error messages and unexpected behavior for a significant subset of Gmail users, impacting their ability to access the service [21093]. 4. The blackout cost Google an estimated £330,000, highlighting the financial impact of such software failures [21093]. 5. The incident raised concerns about the over-reliance on Google services and the need to diversify and make Google less vital in users' lives to mitigate future disruptions [21093].
Preventions 1. Implementing robust redundancy and failover systems to ensure that if one service or component goes down, there are backups in place to seamlessly take over and prevent a widespread outage [21093]. 2. Conducting regular and thorough testing of the system to identify and address any potential vulnerabilities or weaknesses that could lead to a failure [21093]. 3. Improving communication and transparency with users by providing timely updates and explanations during outages to manage expectations and reduce speculation [21093].
Fixes 1. Implementing robust redundancy and failover systems to ensure high availability of Google's services [21093]. 2. Conducting a thorough root cause analysis to identify the underlying issue that caused the outage and implementing preventive measures to avoid similar incidents in the future [21093]. 3. Enhancing monitoring and alerting systems to quickly detect and respond to any service disruptions or anomalies [21093].
References 1. GoSquared developer Simon Tabor [21093] 2. Google's Apps Dashboard [21093] 3. Phil Dearson, head of strategy for ad agency Tribal Worldwide [21093]
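The monitoring-and-alerting fix listed above can be illustrated with a sliding-window error-rate check. This is only a minimal sketch: Google did not disclose how the outage was detected, so the class name `ErrorRateMonitor`, the 60-second window, and the 50% threshold are all assumptions chosen to mirror the 50 to 70 per cent error rate reported on the Apps Dashboard [21093].

```python
from collections import deque


class ErrorRateMonitor:
    """Hypothetical sliding-window error-rate alert (not Google's actual system)."""

    def __init__(self, window_seconds=60, threshold=0.5):
        self.window_seconds = window_seconds  # assumed window length
        self.threshold = threshold            # assumed alert threshold
        self.events = deque()                 # (timestamp, is_error) pairs
        self.errors = 0                       # errors currently inside the window

    def record(self, timestamp, is_error):
        """Record one request outcome and evict outcomes older than the window."""
        self.events.append((timestamp, is_error))
        if is_error:
            self.errors += 1
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_error = self.events.popleft()
            if old_error:
                self.errors -= 1

    def error_rate(self):
        """Fraction of requests in the window that failed."""
        return self.errors / len(self.events) if self.events else 0.0

    def should_alert(self):
        """True when the windowed error rate reaches the threshold."""
        return self.error_rate() >= self.threshold


# Simulated traffic: one healthy minute, then a minute with 60% failures.
monitor = ErrorRateMonitor(window_seconds=60, threshold=0.5)
for t in range(60):
    monitor.record(t, is_error=False)
healthy_alert = monitor.should_alert()
for t in range(60, 120):
    monitor.record(t, is_error=(t % 10 < 6))
outage_alert = monitor.should_alert()
print(healthy_alert, outage_alert, round(monitor.error_rate(), 2))
```

A short window keeps detection fast for an outage measured in minutes, at the cost of more noise during low-traffic periods; a production system would likely combine several windows.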

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization (a) The software failure incident having happened again at one_organization: - Google had experienced a similar blackout in 2009, indicating that such incidents have occurred within the same organization before [21093]. (b) The software failure incident having happened again at multiple_organization: - The article does not mention any other organizations experiencing a similar incident, so there is no information provided about the software failure happening at multiple organizations [21093].
Phase (Design/Operation) design, operation (a) The software failure incident related to the design phase: The article mentions that Google suffered an outage affecting all its services, including Google Search, YouTube, and Google Drive. The outage was significant, causing a 40% drop in web traffic globally. The exact cause of the outage was not disclosed by Google, indicating a potential failure related to contributing factors introduced during system development or updates [21093]. (b) The software failure incident related to the operation phase: The article reports that Google's Apps Dashboard showed all its services were hit by the outage, affecting a significant subset of users. Users were able to access Gmail but experienced error messages and unexpected behavior. This indicates a failure related to contributing factors introduced during the operation or misuse of the system [21093].
Boundary (Internal/External) within_system (a) within_system: The software failure incident at Google, where all its services went down for five minutes, appears to have originated within the system, although Google did not disclose the specific cause. The outage affected Google Search, YouTube, Google Drive, and other services, and the Google Apps Dashboard confirmed that all services were hit [21093]. (b) outside_system: The article does not mention any contributing factors originating from outside the system [21093].
Nature (Human/Non-human) non-human_actions (a) The software failure incident occurring due to non-human actions: - The article mentions that Google suffered an outage where all its services, including Google Search, YouTube, and Google Drive, were unavailable for five minutes [21093]. - The outage caused a 40% drop in Internet traffic globally, as reported by Web analytics firm GoSquared [21093]. - Google's Apps Dashboard indicated that there was a problem with Gmail affecting a significant subset of users, with error messages and unexpected behavior being observed [21093]. - The outage lasted for a brief period, with service being mostly restored within minutes [21093]. (b) The software failure incident occurring due to human actions: - The article does not provide specific information indicating that the software failure incident was caused by human actions.
Dimension (Hardware/Software) software (a) The software failure incident related to hardware: - The article does not provide specific information attributing the outage to hardware-related issues. It mainly focuses on the impact of the outage, the percentage drop in web traffic, and the services affected during the downtime [21093]. (b) The software failure incident related to software: - The article does not explicitly attribute the outage to software, but the Apps Dashboard note describing a Gmail problem that caused error messages and unexpected behavior for users suggests a software issue affecting Google's services, including Google Search, YouTube, and Google Drive [21093].
Objective (Malicious/Non-malicious) non-malicious (a) The software failure incident reported in Article 21093 was non-malicious. Google suffered an outage that lasted for five minutes, causing a significant drop in web traffic by 40%. The outage affected all of Google's services, including Google Search, YouTube, and Google Drive. The company did not provide specific details on the cause of the outage, but it was not attributed to any malicious activity. The incident was described as a blackout that surprised many, with previous similar incidents occurring in the past, such as in 2009 and with Google Drive a few months prior to this outage. The outage was estimated to have cost Google around £330,000, highlighting the impact of such non-malicious software failures on a large scale [21093].
Intent (Poor/Accidental Decisions) unknown (a) The software failure incident related to the Google outage in the article was not explicitly attributed to poor decisions. The cause of the outage was not disclosed by Google, and there was no mention of poor decisions leading to the failure [21093]. (b) The software failure incident related to the Google outage in the article was not explicitly attributed to accidental decisions or mistakes. The cause of the outage was not disclosed by Google, and there was no mention of accidental decisions leading to the failure [21093].
Capability (Incompetence/Accidental) accidental (a) The software failure incident related to development incompetence is not explicitly mentioned in the provided article [21093]. (b) The software failure incident was accidental as Google suffered an outage late on a Friday night, causing a 40% drop in Internet traffic for five minutes. The outage affected all Google services, including Google Search, YouTube, and Google Drive. The company did not provide details on the cause of the outage, and it was not intentional. This accidental failure led to errors and unexpected behavior for users accessing Gmail during the outage [21093].
Duration temporary (a) The software failure incident in the article was temporary. It lasted for only five minutes, during which all of Google's services, including Google Search, YouTube, and Google Drive, were unavailable [21093]. The outage caused a significant drop in web traffic, and Google's Apps Dashboard confirmed the issue, stating that between 15:51 and 15:52 PDT, 50 per cent to 70 per cent of requests to Google received errors. The service was mostly restored one minute later and entirely restored after four minutes. This indicates a temporary disruption rather than a permanent failure.
Behaviour crash, omission, timing, value (a) crash: The software failure incident described in the article can be categorized as a crash. Google's services, including Google Search, YouTube, and Google Drive, were all unavailable for five minutes, indicating a failure of the system to perform any of its intended functions [21093]. (b) omission: The incident also involved omission as a type of behavior. The Apps Dashboard noted that there was a problem with Gmail affecting a significant subset of users, who were able to access Gmail but were seeing error messages and/or other unexpected behavior, indicating an omission in performing the intended functions [21093]. (c) timing: The timing of the software failure incident is evident in the article. It mentions that between 15:51 and 15:52 PDT, 50 per cent to 70 per cent of requests to Google received errors, with service mostly restored one minute later and entirely restored after four minutes. This indicates a timing issue where the system performed its intended functions but with delays [21093]. (d) value: The incident also involved a failure related to value. Phil Dearson estimated that the blackout had cost Google around £330,000, indicating that the system performed its intended functions incorrectly in terms of the value it provided [21093]. (e) byzantine: There is no specific mention of the software failure incident exhibiting behavior related to a byzantine failure in the provided article. (f) other: The behavior of the software failure incident can be categorized as a combination of crash, omission, timing, and value-related failures, as described in the article [21093].

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence property, delay, non-human, theoretical_consequence (a) death: There is no mention of any deaths resulting from the software failure incident reported in the articles [21093]. (b) harm: There is no mention of any physical harm to individuals resulting from the software failure incident reported in the articles [21093]. (c) basic: There is no mention of people's access to food or shelter being impacted due to the software failure incident reported in the articles [21093]. (d) property: The article mentions an estimate by Phil Dearson, head of strategy for ad agency Tribal Worldwide, that the blackout had cost Google around £330,000, indicating an impact on financial resources [21093]. (e) delay: The outage of Google services for five minutes could have potentially caused delays in accessing information or completing tasks for users during that time [21093]. (f) non-human: The outage affected Google's services such as Google Search, YouTube, and Google Drive, impacting the functionality of these platforms [21093]. (g) no_consequence: The outage of Google services, although significant, did not result in any major observed consequences beyond the temporary unavailability of services [21093]. (h) theoretical_consequence: There were discussions about the potential consequences of being overly reliant on Google services and the need to consider alternatives to reduce dependence on a single provider [21093]. (i) other: There is no mention of any other specific consequences resulting from the software failure incident reported in the articles [21093].
Domain information (a) The software failure incident reported in the articles affected various Google services, including Google Search, YouTube, and Google Drive, which are crucial for the production and distribution of information [21093]. The outage caused a significant drop in web traffic, highlighting the reliance of internet users on these services for accessing information. (b) The transportation industry was not directly impacted by the software failure incident reported in the articles. (c) The natural resources industry was not directly impacted by the software failure incident reported in the articles. (d) The sales industry was not directly impacted by the software failure incident reported in the articles. (e) The construction industry was not directly impacted by the software failure incident reported in the articles. (f) The manufacturing industry was not directly impacted by the software failure incident reported in the articles. (g) The utilities industry was not directly impacted by the software failure incident reported in the articles. (h) The finance industry was not directly impacted by the software failure incident reported in the articles. (i) The software failure incident did not directly impact the knowledge industry, which includes education, research, and space exploration. (j) The software failure incident did not directly impact the health industry, which includes healthcare, health insurance, and food industries. (k) The entertainment industry was not directly impacted by the software failure incident reported in the articles. (l) The government sector was not directly impacted by the software failure incident reported in the articles. (m) The software failure incident was related to the technology industry, specifically affecting Google's services, which fall under the broader category of the technology sector [21093].
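The property consequence above rests on Phil Dearson's estimate of about £330,000 for the five-minute blackout [21093]. As a rough worked example, the per-minute and per-second figures follow from simple division. The even spread of cost across the outage is an assumption made here for illustration, not something the article states.

```python
# Figures reported in the article [21093].
ESTIMATED_COST_GBP = 330_000   # Phil Dearson's estimate
OUTAGE_MINUTES = 5             # reported outage duration

# Assumption: the cost accrued evenly over the outage.
cost_per_minute = ESTIMATED_COST_GBP / OUTAGE_MINUTES
cost_per_second = cost_per_minute / 60

print(f"Cost per minute: £{cost_per_minute:,.0f}")
print(f"Cost per second: £{cost_per_second:,.0f}")
```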

Sources
