Incident: Google Worldwide Outage: Authentication System Failure Impacts Services and Users

Published Date: 2020-12-14

Postmortem Analysis
Timeline:
1. The software failure incident happened on December 14, 2020 [Article 109152, Article 108541].
System:
1. Google's authentication tools [109152, 108541]
2. Google Suite services including Gmail, Google Calendar, Chat, Meet, Docs, Sheets, and Slides [108541]
3. Google Home and Nest smart home services [108541]
4. Google Assistant [108541]
5. Google Classroom [108541]
Responsible Organization:
1. Google [109183, 109152, 108541]
Impacted Organization:
1. Google users, including Gmail users and users of other Google services like Google Drive, Google Sheets, Maps, YouTube, Google Calendar, Google Home, and Nest [109183, 109152, 108541]
2. ProtonMail users who were trying to send emails to Gmail addresses [109183]
3. Businesses and workplaces relying on Google Suite for email communication, intra-office messaging, and work-related tasks [108541]
4. Workers using Slack for communication [108541]
5. Users of Google's Smart Home services, including Google Home smart speakers and Nest thermostats and smoke alarms [108541]
6. Parents using Nest indoor security cameras as smart baby monitors [108541]
7. Schoolchildren using Google Classroom for educational purposes [108541]
Software Causes:
1. An internal storage quota issue affected Google's authentication system, causing high error rates and making services such as Gmail and Google Calendar unavailable to users trying to log in [Article 109152, Article 108541].
Non-software Causes:
1. Internal storage quota issue causing authentication system outage [109152]
2. Failure in the company's authentication tools managing user logins [108541]
Impacts:
1. Users experienced error messages, high latency, and unexpected behavior when trying to send emails through Gmail. Some senders received bounceback messages, and emails sent to Gmail addresses were permanently bounced [109183].
2. The outage affected hundreds of millions of people worldwide, disrupting Google-owned apps and websites such as Gmail, Google Drive, Google Sheets, Maps, YouTube, and the main search engine [109152].
3. The outage caused chaos for remote workers and disrupted workplace communication through Google Suite services like Gmail, Google Calendar, Chat, Meet, Docs, Sheets, and Slides [108541].
4. Users were unable to access Google services that required logging in, such as Gmail and Google Calendar, while third-party services using Google's authentication platform were affected when users tried to sign in or out [108541].
5. Google's smart home services, including Google Home, Nest thermostats, and security cameras, were rendered inaccessible, impacting users' ability to control their home environment and monitor security [108541].
6. The outage highlighted the risks of digital concentration, where a single company's failure can disrupt a significant portion of online activity, leading to concerns about dependency on external technology stacks [108541].
Preventions:
1. Implementing better storage management practices so that internal storage quotas are regularly monitored and adjusted before they run out, as happened in this outage [109152].
2. Conducting thorough testing and quality assurance procedures to identify and address potential vulnerabilities in the authentication system that could lead to system crashes [108541].
3. Reducing reliance on a single company for critical services by having backup systems or alternative providers in place to mitigate the impact of a widespread outage [108541].
4. Investing in robust redundancy and fail-safe mechanisms to ensure that essential services can still operate even in the event of a system failure [108541].
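The first prevention, regular monitoring of storage quotas, can be illustrated with a minimal sketch. It is not Google's tooling: the service names, the byte counts, and the 25% and 10% thresholds are all hypothetical values chosen for illustration. The idea is only that a quota check raises an alert while headroom remains, rather than after the quota is exhausted.

```python
from dataclasses import dataclass


@dataclass
class QuotaReading:
    service: str      # hypothetical service name
    used_bytes: int   # storage currently consumed
    limit_bytes: int  # storage quota currently granted


def headroom_fraction(reading: QuotaReading) -> float:
    """Return the unused share of the quota, between 0.0 and 1.0."""
    if reading.limit_bytes <= 0:
        # A missing or zero quota is treated as having no headroom at all.
        return 0.0
    free = max(reading.limit_bytes - reading.used_bytes, 0)
    return free / reading.limit_bytes


def classify(reading: QuotaReading,
             warn_below: float = 0.25,
             page_below: float = 0.10) -> str:
    """Map headroom to an alert level; thresholds are illustrative assumptions."""
    headroom = headroom_fraction(reading)
    if headroom < page_below:
        return "page"
    if headroom < warn_below:
        return "warn"
    return "ok"


# Hypothetical readings, not data from the incident.
readings = [
    QuotaReading("auth-backend", used_bytes=950, limit_bytes=1000),
    QuotaReading("session-store", used_bytes=800, limit_bytes=1000),
    QuotaReading("audit-log", used_bytes=100, limit_bytes=1000),
]
alerts = {r.service: classify(r) for r in readings}
print(alerts)
```

In this sketch a zero or missing quota is deliberately classified as "page" rather than ignored, so that a misreported limit is surfaced instead of silently passing the check.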
Fixes:
1. Allocating more storage space to the services that handle authentication to prevent similar issues in the future [109152].
2. Conducting a thorough follow-up review to ensure the problem cannot recur in the future [109152].
3. Increasing investment in reliability, testing, and quality assurance to match the growing dependency on technology [108541].
References:
1. Google's status dashboard [109183]
2. Encrypted email service ProtonMail [109183]
3. Jake Moore, cybersecurity specialist at ESET [109152]
4. DownDetector outage tracker site [109152]
5. Google spokesperson [109152]
6. Adam Leon Smith, a fellow of BCS, the Chartered Institute for IT [108541]

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization, multiple_organization
(a) one_organization: The incident has recurred within the same organization.
- Google experienced a worldwide outage affecting services like Gmail, Google Calendar, and YouTube due to an authentication system failure [Article 108541].
- Prior to this incident, Google had faced another outage affecting Gmail, YouTube, Google Drive, Google Hangouts, and Google Meet [Article 109183].
(b) multiple_organization: Similar incidents have occurred at other organizations.
- Facebook, a rival tech firm, also faced a mass outage affecting Messenger, Facebook, and Instagram [Article 109152].
- Amazon experienced a significant failure at its Virginia data center, impacting various services and websites relying on AWS [Article 108541].
Phase (Design/Operation) design, operation
(a) design:
- The incident was caused by a failure in Google's authentication tools, which manage how users log in to services run by both Google and third-party developers. This failure was due to an internal storage quota issue: the system did not allocate enough storage space to the services handling authentication, leading to a crash [Article 108541].
- The outage affected Google services like Gmail, Google Drive, Google Sheets, Maps, and YouTube due to an internal storage quota issue that impacted the authentication system [Article 109152].
(b) operation:
- Users were unable to access Google services like Gmail, Google Calendar, and YouTube because of the failure in the company's authentication tools. This affected the operation of services that require users to log in [Article 108541].
- Users experienced issues with logging in to Google services like Gmail, Google Drive, and Google Hangouts, impacting their ability to operate these services effectively [Article 109183].
Boundary (Internal/External) within_system
(a) within_system: The software failure incident was primarily caused by an internal storage quota issue within Google's authentication tools, leading to high error rates and the inability of users to access Google services that require logging in [Article 109152]. The root cause of the error was the failure of internal tools to allocate enough storage space to handle authentication services, resulting in a system crash [Article 108541].
(b) outside_system: The software failure incident was not primarily caused by factors originating from outside the system.
Nature (Human/Non-human) non-human_actions
(a) non-human_actions:
- Article 109152 reports that the Google outage affecting services like Gmail, Google Calendar, and YouTube was caused by a failure in the company's authentication tools, specifically an internal storage quota issue. This issue led to high error rates during the outage period, impacting services that require users to log in [109152].
- Article 108541 also attributes the outage to a failure in the company's authentication tools, specifically an internal storage quota issue. The system crashed when the storage filled up, which affected services like Gmail and Google Calendar that rely on logging in [108541].
(b) human_actions:
- The articles do not specifically mention human actions as a cause of the software failure incident.
Dimension (Hardware/Software) hardware, software
(a) hardware:
- The outage affecting Google services, including Gmail, Google Calendar, and YouTube, was caused by a failure in the company's authentication tools, specifically an internal storage quota issue [Article 108541].
- The outage at Google was due to an internal storage quota issue that caused an authentication system outage for approximately 45 minutes [Article 108541].
(b) software:
- The disruption in Gmail, including error messages, high latency, and unexpected behavior, was reported to be on Google's side, impacting all email traffic [Article 109183].
- The outage affecting Google services was caused by a failure in the company's authentication tools, which manage how users log in to services run by both Google and third-party developers [Article 108541].
Objective (Malicious/Non-malicious) non-malicious
(a) non-malicious: The software failure incident reported in the articles was non-malicious. It was caused by internal technical issues within Google's authentication system, specifically an internal storage quota issue that led to high error rates and affected services requiring users to log in [109152, 108541]. There is no indication in the articles that the failure was due to malicious intent or actions by individuals seeking to harm the system.
Intent (Poor/Accidental Decisions) accidental_decisions
(a) poor_decisions: The software failure incident related to the Google outage was not primarily due to poor decisions but rather an internal storage quota issue that caused an authentication system outage [109152, 108541].
(b) accidental_decisions: The software failure incident was primarily due to an accidental decision related to an internal storage quota issue that led to the authentication system outage affecting Google services [109152, 108541].
Capability (Incompetence/Accidental) development_incompetence
(a) development_incompetence:
- Article 108541 reports that the Google outage was caused by a failure in the company's authentication tools, specifically an internal storage quota issue. The root cause was attributed to the internal tools failing to allocate enough storage space to the services that handle authentication, leading to a system crash [108541].
(b) accidental:
- Article 108541 mentions that the outage was caused by an internal storage quota issue, which led to high error rates during the period. The spokesperson said that the authentication system issue was resolved and that a thorough follow-up review would be conducted to prevent a recurrence, indicating that the issue was accidental rather than intentional [108541].
Duration temporary
(a) permanent: The software failure incident was not permanent. It was caused by a failure in Google's authentication tools due to an internal storage quota issue [Article 108541]. The outage affected various Google services, including Gmail, Google Calendar, and YouTube, but the issue was resolved within a few hours, and all services were restored [Article 108541].
(b) temporary: The software failure incident was temporary. The authentication system issue itself lasted approximately 45 minutes before it was resolved, and all services were subsequently restored [Article 108541].
Behaviour crash, omission, other
(a) crash: The software failure incident described in the articles can be categorized as a crash. The incident led to a widespread outage affecting various Google services, including Gmail, Google Calendar, YouTube, and more [Article 108541]. Users were unable to access services like Gmail and Google Calendar entirely due to the failure in the company's authentication tools [Article 108541].
(b) omission: The software failure incident can also be categorized as an omission. Users experienced failures where the system omitted to perform its intended functions at an instance, such as being unable to access Google services that require logging in, like Gmail and Google Calendar [Article 108541].
(c) timing: The software failure incident does not align with a timing failure, where the system performs its intended functions but too late or too early.
(d) value: The software failure incident does not align with a value failure, where the system performs its intended functions incorrectly.
(e) byzantine: The software failure incident does not align with a byzantine failure, where the system behaves erroneously with inconsistent responses and interactions.
(f) other: The software failure incident can be described as a widespread outage that impacted various Google services due to a failure in the company's authentication tools, leading to users being unable to access services that require logging in [Article 108541].

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence property, delay, non-human, other
(a) death: People lost their lives due to the software failure - There is no mention of any deaths resulting from the software failure incident in the provided articles.
(b) harm: People were physically harmed due to the software failure - There is no mention of any physical harm to individuals due to the software failure incident in the provided articles.
(c) basic: People's access to food or shelter was impacted because of the software failure - There is no mention of people's access to food or shelter being impacted by the software failure incident in the provided articles.
(d) property: People's material goods, money, or data was impacted due to the software failure - The software failure incident caused disruptions in various Google services, affecting millions of users worldwide, including Gmail, Google Drive, Google Sheets, Maps, YouTube, and more [109152]. Users experienced issues such as being unable to access their emails, documents, and other online services during the outage.
(e) delay: People had to postpone an activity due to the software failure - The software failure incident led to significant disruptions in various Google services, impacting users' ability to access and use services like Gmail, Google Calendar, YouTube, and Google Home [108541]. Workplaces relying on Google Suite for communication and collaboration faced disruptions, and individuals using Google's smart home services were unable to control devices like smart lights and thermostats.
(f) non-human: Non-human entities were impacted due to the software failure - Non-human entities such as Google's smart home devices, including Google Home smart speakers, Nest thermostats, and security cameras, were affected by the software failure incident [108541]. Users were unable to control their smart home devices or access features like streaming footage from security cameras during the outage.
(g) no_consequence: There were no real observed consequences of the software failure - The software failure incident resulted in significant disruptions to various Google services, impacting users' ability to access and use essential online tools and services [108541, 109152].
(h) theoretical_consequence: There were potential consequences discussed of the software failure that did not occur - The articles do not mention any potential consequences discussed that did not occur as a result of the software failure incident.
(i) other: Was there consequence(s) of the software failure not described in the (a to h) options? What is the other consequence(s)? - The software failure incident caused financial losses for Google, with YouTube estimated to have lost approximately £1.3 million in ad revenue during the outage [109152]. Additionally, the outage highlighted the risks of digital concentration, where an outage at a single company like Google can disrupt a substantial portion of online activity, impacting businesses and individuals reliant on Google services [108541].
Domain information, finance, knowledge
(a) The software failure incident affected the information industry, specifically impacting email services like Gmail. Users experienced issues with sending and receiving emails, including bounceback messages and permanent bouncing of emails sent to Gmail users [Article 109183].
(h) The finance industry was indirectly impacted by the software failure incident as it disrupted various Google services, including Google Calendar, which is commonly used for scheduling financial meetings and events [Article 108541].
(m) The software failure incident also had implications for the education industry. Google's Classroom service, which is utilized by schools for online learning and collaboration, failed during the outage, leading to disruptions in educational activities [Article 108541].

