Incident: 911 Outage in Multiple States Due to Preventable Software Error

Published Date: 2014-10-20

Postmortem Analysis
Timeline 1. The software failure incident occurred on April 9 [30824], which places it in April 2014.
System 1. Intrado's automated system responsible for assigning unique identifying codes to incoming 911 calls [30824]
Responsible Organization 1. Intrado, a Colorado-based company, maintained the system whose software error caused the 911 outage affecting millions of people across seven states [30824].
Impacted Organization 1. Emergency services in Washington and parts of North Carolina, South Carolina, Pennsylvania, California, Minnesota, and Florida were impacted by the software failure incident [30824].
Software Causes 1. A system maintained by a third-party contractor, Intrado, assigned unique identifying codes to incoming 911 calls and maxed out at a pre-set limit of 40 million calls. Once the limit was reached, the system stopped accepting new calls, which caused a bottleneck and cascading failures in the 911 infrastructure [30824].
Non-software Causes 1. Alarms went off an hour into the breakdown, but the alerts were categorized as "low level" incidents and were not flagged for human attention, so the warnings went unnoticed until it was too late [30824]. 2. Intrado did not fully understand the significance and breadth of the problem until later in the incident, which delayed human intervention [30824].
Impacts 1. Emergency services went dark for more than 11 million people across seven states: all of Washington and parts of North Carolina, South Carolina, Pennsylvania, California, Minnesota, and Florida [30824]. 2. The outage affected 81 call dispatch centers, rendering emergency services inoperable in the affected areas [30824]. 3. In Washington state alone, 4,500 calls to 911 failed to go through during a six-hour period [30824]. 4. The software error produced a bottleneck and a series of cascading failures in the 911 infrastructure [30824].
Preventions 1. Implementing proper monitoring and alerting systems to ensure that warnings and alerts are promptly flagged for human intervention [30824]. 2. Conducting regular checks and evaluations of critical system components, such as the counter cap, to prevent blockages and bottlenecks [30824]. 3. Increasing the counter cap limit to avoid maxing out and causing the software to stop accepting new calls [30824]. 4. Developing guidelines and protocols to address and mitigate future outages caused by software errors [30824].
Fixes 1. Increasing the counter cap and checking it weekly to prevent blockages from occurring [30824]. 2. Creating a new alarm for when the number of successful calls drops below a certain percentage [30824]. 3. Developing a set of guidelines to help deal with future outages [30824].
References 1. Federal Communications Commission (FCC) [30824] 2. Intrado, a Colorado-based company [30824] 3. Dave Danner, chairman of the Washington state utility commission [30824]
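
To illustrate the counter-cap failure mode listed under Software Causes and the first item under Fixes, here is a minimal Python sketch. It is not Intrado's code: the class name `CappedIdAllocator`, the `warn_ratio` parameter, and the tiny cap are hypothetical, chosen only to show how a fixed ceiling turns into refused calls and how a headroom check could surface the problem before the ceiling is hit.

```python
class CapacityExhausted(Exception):
    """Raised when every ID up to the cap has already been issued."""


class CappedIdAllocator:
    """Hypothetical allocator that hands out sequential IDs up to a fixed cap."""

    def __init__(self, cap, warn_ratio=0.8):
        if cap <= 0:
            raise ValueError("cap must be positive")
        if not 0 < warn_ratio <= 1:
            raise ValueError("warn_ratio must be in (0, 1]")
        self.cap = cap
        self.warn_ratio = warn_ratio  # assumed threshold for an early warning
        self.issued = 0

    def next_id(self):
        """Return the next ID, or refuse once the cap has been reached."""
        if self.issued >= self.cap:
            # This is the failure mode: new calls can no longer be tagged.
            raise CapacityExhausted(f"cap of {self.cap} IDs reached")
        self.issued += 1
        return self.issued

    def utilization(self):
        """Fraction of the cap that has been consumed so far."""
        return self.issued / self.cap

    def needs_attention(self):
        """True once utilization crosses the warning threshold."""
        return self.utilization() >= self.warn_ratio


if __name__ == "__main__":
    allocator = CappedIdAllocator(cap=5, warn_ratio=0.75)
    for _ in range(6):
        try:
            call_id = allocator.next_id()
            print("assigned", call_id, "warn:", allocator.needs_attention())
        except CapacityExhausted as exc:
            print("call refused:", exc)
```

A periodic job could call `needs_attention()` and notify an operator well before `next_id()` starts raising, which is one way to read the "check it weekly" fix; the right threshold and schedule would depend on actual call volumes.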

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization (a) The software failure incident happened again at one_organization: The 911 outage was not an isolated event. The article notes that in 2014 alone there were four major outages affecting entire states or multiple states, including incidents in Hawaii and Vermont [30824]. (b) The software failure incident happened again at multiple_organization: The article gives no specific information about similar incidents at other organizations or with their products and services, so this is unknown.
Phase (Design/Operation) design (a) The software failure incident in the article was primarily due to a design issue. The incident was caused by a software error related to the system's design, specifically a coding error in the software responsible for assigning unique identifying codes to incoming 911 calls. This design flaw led to the system maxing out at a pre-set limit of 40 million calls, causing a bottleneck and subsequent failures in the 911 infrastructure [30824]. (b) The software failure incident was not primarily due to operation issues or misuse of the system. The article does not mention any operational errors or misuse of the system as contributing factors to the outage. Instead, the focus is on the preventable software error in the design phase that led to the disruption of 911 services across multiple states [30824].
Boundary (Internal/External) within_system, outside_system The software failure incident related to the 911 outage was primarily within_system. The failure was caused by an entirely preventable software error within the system maintained by a third-party contractor, Intrado. The software responsible for assigning unique identifying codes to incoming calls maxed out at a pre-set limit, leading to a bottleneck and subsequent failures in the 911 infrastructure [30824].
Nature (Human/Non-human) non-human_actions, human_actions (a) The software failure incident in the article was primarily due to non-human actions. The outage affecting 911 services for over 11 million people across seven states was caused by an entirely preventable software error. The incident was attributed to a software glitch in the system maintained by a third-party contractor, Intrado, where the software responsible for assigning unique identifying codes to incoming calls maxed out at a pre-set limit, leading to a bottleneck and subsequent failures in the 911 infrastructure [30824]. (b) Human actions also played a role in the software failure incident. The article mentions that despite alarm bells going off an hour into the breakdown, the warnings were not noticed until it was too late. The server categorizing the alerts as "low level" incidents failed to flag them for human attention. Additionally, the report highlighted that Intrado was not able to fully understand the significance and breadth of the problem until later in the incident, indicating a lack of timely human intervention and response to the emerging issues [30824].
Dimension (Hardware/Software) software (a) The software failure incident described in the article was due to contributing factors that originated in software. The Federal Communications Commission's report highlighted that an entirely preventable software error was responsible for causing the 911 service outage affecting millions of people across seven states. The software error occurred in a system maintained by a third-party contractor, Intrado, where the software responsible for assigning unique identifying codes to incoming calls maxed out at a pre-set limit, leading to a bottleneck and subsequent failures in the 911 infrastructure [30824].
Objective (Malicious/Non-malicious) non-malicious (a) The software failure incident described in the article was non-malicious. The outage that affected 911 services in multiple states was attributed to an entirely preventable software error, a coding mistake in the system maintained by a third-party contractor, Intrado [30824]. The software reached a pre-set limit for assigning unique identifying codes to incoming calls, leading to a bottleneck and subsequent failures in the 911 infrastructure. The failure was not intentional; it was the consequence of a technical flaw in the software system.
Intent (Poor/Accidental Decisions) poor_decisions (a) The software failure incident was primarily due to poor decisions. The Federal Communications Commission's report highlighted that the outage affecting 911 services for millions of people across seven states was caused by an entirely preventable software error. The incident was attributed to a coding error in the software maintained by a third-party contractor, Intrado, which led to the system maxing out at a pre-set limit of 40 million calls, causing a bottleneck and cascading failures in the 911 infrastructure [30824]. Additionally, the report mentioned that warnings about the issue were not noticed in time, and the significance of the problem was not fully understood until hours into the breakdown, indicating a lack of proactive monitoring and response to the software issue [30824].
Capability (Incompetence/Accidental) development_incompetence, accidental (a) The software failure incident in the article was due to development incompetence. The Federal Communications Commission's report highlighted that the outage affecting 911 services in seven states was caused by an entirely preventable software error. The incident was attributed to a coding error in the system maintained by a third-party contractor, Intrado, where the software responsible for assigning unique identifying codes to incoming calls maxed out at a pre-set limit, leading to a bottleneck and subsequent failures in the 911 infrastructure [30824]. (b) The software failure incident was accidental in nature. The article mentions that the software error causing the outage was not intentional but rather a result of the software system maxing out at a pre-set limit due to a coding error. The incident was not a deliberate act but a consequence of the software failing to handle the volume of calls, leading to a breakdown in the 911 services across multiple states [30824].
Duration temporary (a) The software failure incident described in the article was temporary. The outage lasted for six hours, during which emergency services went dark for more than 11 million people across seven states. The incident affected 81 call dispatch centers, rendering emergency services inoperable in all of Washington and parts of North Carolina, South Carolina, Pennsylvania, California, Minnesota, and Florida [30824]. (b) The software failure incident was caused by a preventable software error that occurred due to a specific circumstance - the software responsible for assigning unique identifying codes to incoming calls maxed out at a pre-set limit of 40 million calls. This specific circumstance led to the failure of the routing system to accept new calls, causing a bottleneck and cascading failures in the 911 infrastructure [30824].
Behaviour crash, omission, value (a) crash: The incident can be categorized as a crash. A software error caused 911 service to drop for more than 11 million people across seven states, rendering emergency services inoperable in multiple states [30824]. (b) omission: The incident can also be categorized as an omission. The software responsible for assigning unique identifying codes to incoming 911 calls maxed out at a pre-set limit, so the system stopped accepting new calls and a bottleneck formed in the 911 infrastructure [30824]. (c) timing: The incident does not fit the timing category, because the issue was not that the system performed its intended functions too late or too early. (d) value: The incident can be categorized as a value failure. The system performed its intended function incorrectly by maxing out at a pre-set limit for assigning unique identifying codes, which led it to reject new calls and disrupted 911 service [30824]. (e) byzantine: The incident does not fit the byzantine category, because the article does not mention inconsistent responses or interactions in the system's behaviour. (f) other: No additional behaviour applies; the incident is covered by the crash, omission, and value categories above.
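
The omission behaviour above (new calls not being accepted) and the "low level" alerts noted under Nature suggest an alarm keyed to call outcomes rather than to individual error messages, similar in spirit to the success-percentage alarm listed under Fixes. The following Python sketch shows one possible shape for such a check. The class name `SuccessRateAlarm`, the window size, and the threshold are assumptions for illustration and do not describe the alarm Intrado actually built.

```python
from collections import deque


class SuccessRateAlarm:
    """Hypothetical sliding-window alarm over recent call outcomes."""

    def __init__(self, window_size, min_success_ratio):
        if window_size <= 0:
            raise ValueError("window_size must be positive")
        if not 0 <= min_success_ratio <= 1:
            raise ValueError("min_success_ratio must be in [0, 1]")
        # deque with maxlen drops the oldest outcome automatically.
        self.outcomes = deque(maxlen=window_size)
        self.min_success_ratio = min_success_ratio

    def record(self, succeeded):
        """Record whether one call was routed successfully."""
        self.outcomes.append(bool(succeeded))

    def success_ratio(self):
        """Share of successful calls in the window, or None if it is empty."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def should_page_human(self):
        """True when a full window falls below the required success ratio."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough data yet to judge
        return self.success_ratio() < self.min_success_ratio


if __name__ == "__main__":
    alarm = SuccessRateAlarm(window_size=4, min_success_ratio=0.5)
    for outcome in (True, True, True, True, False, False, False):
        alarm.record(outcome)
        print(outcome, alarm.success_ratio(), alarm.should_page_human())
```

Because the check looks at the proportion of calls that fail, it fires regardless of how any single underlying alert was classified, which is the gap the incident exposed.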

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence non-human, other (a) death: People lost their lives due to the software failure - The article does not mention any deaths resulting from the incident. [30824] (b) harm: People were physically harmed due to the software failure - The article does not mention any physical harm to individuals. [30824] (c) basic: People's access to food or shelter was impacted because of the software failure - The article does not indicate any impact on access to food or shelter. [30824] (d) property: People's material goods, money, or data was impacted due to the software failure - The article does not mention any impact on material goods, money, or data. [30824] (e) delay: People had to postpone an activity due to the software failure - The article does not mention any specific activities that people had to postpone because of the outage. [30824] (f) non-human: Non-human entities were impacted due to the software failure - The incident primarily affected the 911 emergency services and call dispatch centers, impairing their ability to receive and respond to emergency calls. [30824] (g) no_consequence: There were no real observed consequences of the software failure - The incident had significant consequences, including rendering emergency services inoperable for millions of people across multiple states. [30824] (h) theoretical_consequence: There were potential consequences discussed of the software failure that did not occur - The article does not mention any discussed consequences that did not occur. [30824] (i) other: Was there consequence(s) of the software failure not described in the (a to h) options? What is the other consequence(s)? - The primary consequence was the disruption of 911 services for over 11 million people across seven states, leading to failed emergency calls and the inability to reach help during the outage. The incident also highlighted vulnerabilities in the transition to Internet Protocol-supported technologies for emergency services.
Domain information (a) The failed system supported the information industry. The software error that caused the 911 service outage occurred in a system maintained by a third-party contractor, Intrado, which operates a routing service that directs 911 calls to the most appropriate public safety answering point (PSAP) [30824].
