Incident: Google Cloud Outage: Global Disruption Due to a Configuration Change

Published Date: 2019-06-07

Postmortem Analysis
Timeline
1. The software failure incident happened on a Sunday, as mentioned in the article [85617].
2. The article was published on 2019-06-07.
3. Estimated timeline: the incident occurred on Sunday, June 2, 2019.
System
1. Google Cloud's automation software
2. Network control jobs
3. Google's network control plane
4. Management traffic
5. Tools competing over the congested network
6. Automation software that deschedules jobs during maintenance [85617]
Responsible Organization 1. Google Cloud [85617]
Impacted Organization
1. YouTube
2. Shopify and other third-party services
3. Snapchat
4. Gmail
5. Google search
6. Google Cloud, including its network, control plane, management traffic, endpoints, and services
7. Google Cloud customers and users
8. Google engineers and their administrative tooling
9. Facebook (mentioned only for comparison, not impacted by this incident) [85617]
Software Causes
1. A cascading combination of two misconfigurations and a software bug, triggered during a routine configuration change initiated by Google Cloud [85617].
Non-software Causes
1. The trigger was a routine maintenance event: a configuration change initiated by Google at 2:45 pm ET on a Sunday, intended for only a few servers in one geographic region [85617]. The misconfigurations and software bug it set off are software causes.
Impacts
1. YouTube lost 2.5 percent of its views in a single hour [85617].
2. Shopify stores were shut down [85617].
3. Snapchat experienced a blackout [85617].
4. Millions of people could not access their Gmail accounts [85617].
5. Google search experienced a barely perceptible slowdown in returning results [85617].
Preventions
1. Implementing appropriate safeguards before re-enabling the automation software that deschedules jobs during maintenance, so a localized maintenance event cannot become a global incident [85617].
2. Lengthening the amount of time systems stay in "fail static" mode, giving engineers more time to fix problems before customers feel the impact [85617].
3. Enhancing network capacity planning and management so the network can withstand stress without collapsing as it did in this incident [85617].
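Prevention 1 above can be illustrated with a small sketch. The names, regions, and thresholds below are hypothetical, not Google's actual implementation: the idea is a guard that refuses to deschedule network control jobs outside the intended maintenance scope, so a change aimed at a few servers in one region cannot cascade into a global descheduling.

```python
# Hypothetical safeguard sketch: constrain the blast radius of a deschedule
# request to the regions the maintenance event was intended for.

INTENDED_REGIONS = {"us-east1"}  # illustrative: the maintenance targeted one region
MAX_REGIONS = 1                  # illustrative: never deschedule across many regions

def safe_to_deschedule(jobs):
    """Return the jobs that may be descheduled; raise if scope is exceeded."""
    regions = {job["region"] for job in jobs}
    if not regions <= INTENDED_REGIONS:
        unexpected = sorted(regions - INTENDED_REGIONS)
        raise ValueError(f"deschedule request touches unintended regions: {unexpected}")
    if len(regions) > MAX_REGIONS:
        raise ValueError("deschedule request spans too many regions")
    return jobs
```

A request confined to the intended region passes; one that would touch a second region is rejected before any control jobs are taken offline.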
Fixes
1. Implement appropriate safeguards to prevent similar incidents, such as lengthening the time systems stay in "fail static" mode [85617].
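The "fail static" behavior referenced in the fixes can be sketched as follows (hypothetical names; the article does not describe Google's implementation): when the control plane becomes unreachable, the data plane keeps serving its last known-good configuration for a grace window instead of withdrawing capacity immediately, which is the window the remediation lengthens.

```python
import time

# Illustrative grace period; lengthening this was one of the stated remediations.
FAIL_STATIC_WINDOW_S = 15 * 60

class DataPlane:
    """Keeps serving the last known-good config when the control plane is down."""

    def __init__(self, initial_config):
        self.config = initial_config          # last known-good configuration
        self.last_update = time.monotonic()   # when the control plane last checked in

    def on_control_plane_update(self, new_config):
        self.config = new_config
        self.last_update = time.monotonic()

    def effective_config(self):
        age = time.monotonic() - self.last_update
        if age <= FAIL_STATIC_WINDOW_S:
            # "Fail static": keep forwarding with stale-but-valid state.
            return self.config
        raise RuntimeError("fail-static window exceeded; withdrawing capacity")
```

A longer window gives engineers more time to repair the control plane before customers feel the impact; the trade-off is serving staler configuration in the meantime.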
References
1. Google Cloud - the article draws its account of the incident from Google Cloud, which suffered a prolonged outage affecting services such as YouTube, Shopify, Snapchat, and Gmail [85617].

Software Taxonomy of Faults

Category Option Rationale
Recurring multiple_organization (a) one_organization: The article does not report a similar incident happening again within the same organization. (b) multiple_organization: The article mentions a similar incident at Facebook, which experienced hours of downtime attributed to a "server configuration change that triggered a cascading series of issues" [85617].
Phase (Design/Operation) design, operation (a) The software failure incident described in the article was primarily due to a combination of misconfigurations and a software bug that occurred during a routine configuration change initiated by Google Cloud engineers [85617]. (b) The operation of the system also played a role in the failure incident as Google's automation software descheduled network control jobs in multiple locations, leading to network congestion and triaging of traffic to preserve latency-sensitive flows [85617].
Boundary (Internal/External) within_system The software failure incident reported in the article [85617] was primarily within_system. The root cause of the outage was a combination of two misconfigurations and a software bug that occurred during a routine configuration change initiated by Google within their own system. The article mentions that Google's automation software descheduled network control jobs in multiple locations, leading to internet-wide gridlock [85617]. Additionally, the article highlights how Google's engineers were hampered in debugging the problem due to failure of tools competing over the congested network, which further emphasizes that the failure was within the system [85617].
Nature (Human/Non-human) non-human_actions, human_actions (a) The software failure incident occurred due to non-human actions, specifically a combination of misconfigurations and a software bug. Google initiated a routine configuration change which led to a cascading combination of two misconfigurations and a software bug, causing network control jobs to be descheduled in multiple locations, resulting in internet-wide gridlock [85617]. (b) The software failure incident was also influenced by human actions. Google's engineers were aware of the problem within two minutes but faced challenges in debugging due to the failure of tools competing over the congested network. The fog of war, as described by an expert, made it difficult for Google to formulate a diagnosis promptly, leading to delays in identifying the impact and communicating with customers. Additionally, the company took steps to prevent similar incidents in the future by adjusting its automation software and lengthening the time systems stay in "fail static" mode, indicating a human response to the failure incident [85617].
Dimension (Hardware/Software) software (a) The software failure incident was due to contributing factors originating in software: the root cause was a combination of misconfigurations and a software bug that occurred during a routine configuration change initiated by Google [85617]. (b) No hardware fault is reported in the article. The descheduling of network control jobs reduced the usable capacity of Google's network, but the contributing factors originated in software, not hardware [85617].
Objective (Malicious/Non-malicious) non-malicious (a) The software failure incident described in the article was non-malicious. The root cause of the outage was explained as a combination of misconfigurations and a software bug that occurred during a routine configuration change initiated by Google [85617]. The incident was not attributed to any malicious activity or intent to harm the system.
Intent (Poor/Accidental Decisions) poor_decisions, accidental_decisions (a) Poor decisions contributed: the routine configuration change was rolled out with automation that lacked safeguards, so a maintenance event intended for a few servers in one geographic region descheduled network control jobs in multiple locations, causing internet-wide gridlock [85617]. (b) Accidental decisions and unintended consequences were also involved: Google's engineers were aware of the problem within two minutes but struggled to debug it because their own tools were competing over the congested network, and the scope and scale of the outage made it difficult to identify the impact precisely and communicate accurately with customers, delaying diagnosis and resolution [85617].
Capability (Incompetence/Accidental) development_incompetence, accidental (a) Development incompetence: the outage stemmed from misconfigurations and a software bug introduced during a routine configuration change, suggesting gaps in validation and safeguards around the change process [85617]. Despite engineers being aware of the problem within two minutes, debugging was significantly hampered because their own tools competed over the congested network, delaying diagnosis and resolution [85617]. (b) Accidental: the outage was not caused by attackers but by a routine maintenance event that went awry through a cascading combination of errors, congesting the network and disrupting services including YouTube, Shopify, Snapchat, and Gmail. Google engineers responded quickly but faced unforeseen consequences of the initial maintenance event [85617].
Duration temporary The software failure incident was temporary. The outage began at 2:45 pm ET on a Sunday; the network started to recover at 6:19 pm ET, and business as usual resumed by 7:10 pm ET [85617]. It was caused by a cascading combination of two misconfigurations and a software bug, leading to network congestion that affected services including YouTube, Shopify, Snapchat, and Gmail.
Behaviour other (a) crash: The system did not lose state and stop performing all of its intended functions; the incident was a prolonged degradation caused by misconfigurations and a software bug [85617]. (b) omission: The system did not simply omit its intended functions at particular instances; descheduled network control jobs caused internet-wide gridlock and broad service disruption [85617]. (c) timing: The failure was not a matter of correct functions performed too late or too early [85617]. (d) value: The system did not perform its intended functions incorrectly [85617]. (e) byzantine: The system did not behave erroneously with inconsistent responses and interactions [85617]. (f) other: The behaviour is best described as a network outage: misconfigurations and a software bug in Google Cloud caused network congestion that disrupted services including YouTube, Shopify, Snapchat, and Gmail for users globally.
The incident highlights the importance of proper configuration management and software testing to prevent such widespread outages [85617].

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence property (d) property: People's material goods, money, or data was impacted due to the software failure. The software failure incident described in the article resulted in disruptions to various services such as YouTube, Shopify, Snapchat, and Gmail due to a Google Cloud outage. Google Cloud lost nearly a third of its traffic, leading to issues for third-party services like Shopify. YouTube lost 2.5 percent of views in a single hour, and one percent of Gmail users ran into issues. These disruptions indicate that people's access to digital services and data was impacted, highlighting the property-related consequences of the software failure incident [85617].
Domain information (a) The software failure incident reported in the article was related to the information industry. The incident affected various services like YouTube, Shopify, Snapchat, and Gmail, which are all part of the information industry [85617].

Sources
