Published Date: 2012-12-11
| Postmortem Analysis | |
|---|---|
| Timeline | 1. The Fastly outage occurred on Tuesday morning, June 8, 2021, as stated directly in the article [Article 115362]. |
| System | 1. Fastly's service configuration [Article 115235] 2. Google's Sync servers [Article 55174] |
| Responsible Organization | 1. Fastly [Article 115235] 2. Google [Article 55174] |
| Impacted Organization | 1. Gov.uk domain and a string of publishers and other websites [115362] 2. Google's Gmail service and Chrome browser [55174] 3. Websites and apps around the world due to Fastly's content delivery network failure [115235] |
| Software Causes | 1. Fastly identified a service configuration issue that triggered disruptions across its servers, leading to a widespread failure [Article 115235]. 2. Google misconfigured its load-balancing servers, affecting Sync and other Google services, causing Chrome crashes [Article 55174]. |
| Non-software Causes | 1. Lack of resilience in critical government services [Article 115362] 2. Service configuration triggering disruptions across Fastly's servers [Article 115235] |
| Impacts | 1. The software failure incident caused a 45-minute internet outage that knocked out the Gov.uk domain and several other websites, affecting services like booking Covid tests and online government services [Article 115362]. 2. The incident led to a widespread failure of websites and apps around the world due to Fastly's content delivery network disruption, impacting various online platforms like Google and Amazon [Article 115235]. |
| Preventions | 1. Implementing a more robust and thorough testing process for software updates to catch potential issues before deployment could have prevented the software failure incident [115235]. 2. Having a backup or failover system in place to quickly switch over in case of a failure could have minimized the impact of the incident and allowed services to remain online [115362]. |
| Fixes | 1. Fastly identified a service configuration that triggered disruptions across its servers and disabled that configuration, which fixed the software failure incident [Article 115235]. 2. Google engineer Tim Steele confirmed that the crashes affecting Chrome users were due to a problem with Google's Sync servers, which kicked off an error on the browser. Fixing the issue with the Sync servers would address the software failure incident [Article 55174]. |
| References | 1. Fastly [Article 115362, Article 115235] 2. Google engineer Tim Steele [Article 55174] |
| Category | Option | Rationale |
|---|---|---|
| Recurring | one_organization, multiple_organization | (a) The software failure incident having happened again at one_organization: - Article 55174 reports on a software failure incident involving Google's Gmail service going down due to misconfigured load-balancing servers, which also affected Google's Chrome browser. This incident highlights a recurring issue within Google's services [55174]. (b) The software failure incident having happened again at multiple_organization: - Article 115235 discusses a widespread failure that occurred after Fastly, a major content delivery network, reported a service configuration issue that led to disruptions across its servers. This incident demonstrates that similar software failures have occurred at other organizations beyond Fastly, such as Google and Amazon in the past [115235]. |
| Phase (Design/Operation) | design, operation | (a) The software failure incident related to the design phase can be seen in Article 115235, where Fastly reported a widespread failure due to a service configuration that triggered disruptions across its servers. This disruption was caused by a bad software update, indicating a failure introduced during the system development or update phase. (b) The software failure incident related to the operation phase can be observed in Article 55174, where Google's Chrome browser crashed due to a problem with Google's Sync servers. This issue affected Chrome users who were using the Sync service, causing the browser to abruptly shut down on the desktop. This failure was a result of contributing factors introduced by the operation or misuse of the system. |
| Boundary (Internal/External) | within_system | (a) The software failure incident related to the Fastly outage can be categorized as within_system. The incident was caused by a service configuration error within Fastly's network, specifically a bad software update that triggered disruptions across its servers [Article 115235]. This indicates that the contributing factors leading to the failure originated from within the system itself. |
| Nature (Human/Non-human) | non-human_actions | (a) The software failure incident occurring due to non-human actions: - The incident involving Fastly's widespread failure was caused by a service configuration issue triggered by a bad software update, leading to disruptions across its servers [Article 115235]. - Google's Chrome browser crashing was attributed to a problem with Google's Sync servers, which overwhelmed a backend service causing Chrome to abruptly shut down on the desktop. This issue was not directly caused by human actions but rather by a backend service error [Article 55174]. (b) The software failure incident occurring due to human actions: - There is no specific mention in the articles about the software failure incidents being directly caused by human actions. |
| Dimension (Hardware/Software) | hardware, software | (a) The software failure incident occurring due to hardware: - Article 115235 reports that a service configuration issue triggered disruptions across Fastly's servers. The failure manifested across Fastly's server infrastructure, but the article attributes the trigger to a bad software update that took down Fastly's own network, not to a hardware fault [115235]. (b) The software failure incident occurring due to software: - Article 55174 reports on Google's Gmail service outage and Chrome browser crashes, which were caused by a misconfiguration of Google's load-balancing servers affecting the Sync service. This misconfiguration led to errors in the browser, causing Chrome to abruptly shut down, indicating a software-related failure [55174]. |
| Objective (Malicious/Non-malicious) | malicious, non-malicious | (a) The articles provide information on software failure incidents that can be categorized as malicious: 1. Article 115362 discusses how hostile states like Russia have engaged in sophisticated hacking campaigns against the West, including the use of relatively obscure but widely available software to exploit vulnerabilities. It mentions incidents such as Russian state-sponsored hackers penetrating the Orion IT network management tool made by SolarWinds to steal secrets from US federal agencies. Additionally, it references the WannaCry computer virus believed to be orchestrated by North Korea, which significantly affected parts of the NHS in 2017 [115362]. (b) The articles also touch upon non-malicious software failure incidents: 1. Article 55174 reports on a non-malicious software failure incident involving Google's Chrome browser and Gmail service. The issue was attributed to Google misconfiguring its load-balancing servers, specifically affecting the Sync service that caused Chrome to crash. This incident was described as a rare but not unheard of goof that temporarily brought down parts of even larger online platforms [55174]. These examples illustrate both malicious and non-malicious software failure incidents reported in the articles. |
| Intent (Poor/Accidental Decisions) | accidental_decisions | (a) poor_decisions: The software failure incident related to the Fastly outage was not due to poor decisions but rather a bad software update that Fastly implemented, which triggered disruptions across its servers [Article 115235]. (b) accidental_decisions: The software failure incident related to Google's Chrome browser crashing was due to accidental decisions or mistakes made by Google when they misconfigured their load-balancing servers, affecting the Sync service and causing widespread crashes in the browser [Article 55174]. |
| Capability (Incompetence/Accidental) | development_incompetence, accidental | (a) The software failure incident occurring due to development incompetence: - Article 115362 discusses the internet outage that affected Gov.uk and other websites, attributing the issue to a lack of resilience in critical government services. It mentions how technology that is rapidly becoming fundamental to operations often lacks sufficient resilience, indicating a potential failure due to contributing factors introduced due to a lack of professional competence by humans or development organizations. (b) The software failure incident occurring accidentally: - Article 55174 reports on Google's Gmail service outage and Chrome browser crashes, which were caused by Google misconfiguring its load-balancing servers. This misconfiguration led to widespread issues affecting Sync and other Google services, resulting in crashes. This incident highlights a failure due to contributing factors introduced accidentally. |
| Duration | temporary | (a) The software failure incident mentioned in the articles was temporary. In Article 115362, it is stated that the internet outage caused by Fastly lasted for about 45 minutes before the issue was fixed. Additionally, in Article 115235, it is mentioned that Fastly reported a widespread failure but had identified a service configuration issue and disabled it, resolving the disruptions across its servers. These incidents indicate that the software failure was not permanent but rather temporary in nature [115362, 115235]. |
| Behaviour | crash, omission, timing, other | (a) crash: - Article 115235 reports that Fastly experienced a widespread failure due to a bad software update, which essentially took down its own network, resulting in a crash of the system [115235]. (b) omission: - Article 115362 mentions an internet outage that affected the Gov.uk domain and other websites, causing users to struggle to access services like booking Covid tests online. This indicates an omission in the system's performance as it failed to deliver the intended services [115362]. (c) timing: - Article 115362 discusses the internet outage that lasted for 45 minutes, impacting critical government services and other websites. This delay in service availability suggests a timing issue where the system was not performing its intended functions at the right time [115362]. (d) value: - There is no specific mention of a failure due to the system performing its intended functions incorrectly in the provided articles. (e) byzantine: - There is no specific mention of a failure due to the system behaving erroneously with inconsistent responses and interactions in the provided articles. (f) other: - Article 55174 describes a situation where Google's Chrome browser crashed due to a problem with Google's Sync servers, which caused the browser to abruptly shut down. This behavior could be categorized as a system failure due to a synchronization issue impacting the browser's performance, falling under the "other" category [55174]. |
| Layer | Option | Rationale |
|---|---|---|
| Perception | None | None |
| Communication | None | None |
| Application | None | None |
| Category | Option | Rationale |
|---|---|---|
| Consequence | property, delay, non-human, theoretical_consequence | (a) death: People lost their lives due to the software failure - There is no mention of any deaths resulting from the software failure incident reported in the articles. (b) harm: People were physically harmed due to the software failure - There is no mention of any physical harm to individuals resulting from the software failure incident reported in the articles. (c) basic: People's access to food or shelter was impacted because of the software failure - There is no mention of people's access to food or shelter being impacted due to the software failure incident reported in the articles. (d) property: People's material goods, money, or data was impacted due to the software failure - The software failure incident caused disruptions to websites and apps around the world, impacting users' access to online services [Article 115235]. (e) delay: People had to postpone an activity due to the software failure - The software failure incident resulted in a 45-minute internet outage, affecting the Gov.uk domain and other websites, which would have caused delays for users trying to access online services during that time [Article 115362]. (f) non-human: Non-human entities were impacted due to the software failure - The software failure incident affected websites and apps globally, indicating an impact on non-human entities such as servers and network infrastructure [Article 115235]. (g) no_consequence: There were no real observed consequences of the software failure - The software failure incident did have consequences, including disruptions to online services and websites [Article 115235, Article 115362]. (h) theoretical_consequence: There were potential consequences discussed of the software failure that did not occur - The articles discuss potential consequences such as the lack of resilience in critical government services, the vulnerability to cyber threats, and the need for enhanced homeland security measures, but these consequences did not directly result from this specific software failure incident [Article 115362]. (i) other: Was there consequence(s) of the software failure not described in the (a to h) options? What is the other consequence(s)? - There are no other consequences mentioned in the articles beyond those covered in options (a) to (h). |
| Domain | information, government | (a) The software failure incident affected the production and distribution of information as it caused a widespread disruption in various websites and apps due to a major content delivery network failure [Article 115235]. (l) The failed system was related to the government sector as it impacted critical government services, including the Gov.uk domain, and other websites. The incident highlighted a lack of resilience in delivering government services online [Article 115362]. |
Article ID: 115362
Article ID: 55174
Article ID: 115235