Incident: National Weather Service Internet Outage Impacts Forecasting and Communication

Published Date: 2021-03-30

Postmortem Analysis
Timeline 1. The software failure incident happened on March 30, 2021. [112079]
System 1. National Weather Service's Internet infrastructure system failed, leading to a major, systemwide Internet outage [112079]. 2. NWS Chat system failed multiple times, impacting real-time communication during severe weather events [112079].
Responsible Organization 1. The National Weather Service struggled to address systemic, long-standing issues with its information technology infrastructure, which led to the software failure incident [112079].
Impacted Organization 1. The National Weather Service (NWS) was impacted by the software failure incident reported in the news article [112079].
Software Causes 1. Bandwidth shortage leading to proposed and implemented limits on data download [112079] 2. Launch of a radar website that functioned inadequately [112079] 3. Flood at the data center in Silver Spring, Md., impacting access to key ocean buoy observations [112079] 4. Multiple outages to NWS Chat, impacting critical information dissemination during severe weather events [112079]
Non-software Causes 1. Bandwidth shortage leading to proposed limits on data download for users [112079]. 2. Flood at the data center in Silver Spring, Md., causing access issues to key ocean buoy observations [112079].
Impacts 1. The software failure incident at the National Weather Service led to a major, systemwide Internet failure, making forecasts and warnings inaccessible to the public and limiting data available to meteorologists [112079]. 2. The outage resulted in the Weather Service's flagship website, weather.gov, being down, cutting off access to forecasts and warnings [112079]. 3. Meteorologists and Weather Service constituents took to Twitter to complain about the outage, highlighting chronic issues with the agency's Internet services and expressing frustration over the impact on public safety and response capabilities [112079]. 4. The failure of the NWS Chat system, a critical communication tool during severe weather events, led to unreliable communication and forced some offices to consider alternative platforms like Slack [112079]. 5. The Weather Service faced challenges in maintaining stable and reliable information dissemination infrastructure, impacting its ability to fulfill its mission of protecting life and property [112079].
Preventions 1. Implementing a more robust and reliable information dissemination infrastructure to prevent systemic failures like the one experienced by the National Weather Service [112079]. 2. Upgrading the data server and network architecture to handle increasing demands from users and prevent bandwidth shortages [112079]. 3. Transitioning critical applications to cloud service providers like Amazon Web Services, Microsoft, or Google Cloud to improve stability and scalability of the system [112079].
Fixes 1. Upgrading data server and network architecture using congressionally appropriated funds [112079]. 2. Transitioning applications to cloud service providers like Amazon Web Services, Microsoft, and Google Cloud [112079].
References 1. National Weather Service's central operations center 2. Meteorologists and Weather Service constituents on Twitter 3. Weather Service office in Birmingham, Ala. 4. Warning coordination meteorologist John De Block 5. Weather Service director of public affairs Susan Buchanan 6. Meteorologist James Spann 7. Meteorologist Josh Johnson 8. Systems analyst Daryl Herzmann 9. Senior lecturer in meteorology Troy Kimmel 10. Former acting head of the National Oceanic and Atmospheric Administration Neil Jacobs

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization (a) Recurrence at one_organization: The National Weather Service has encountered numerous problems with its Internet services in recent months, including a bandwidth shortage, an inadequately functioning radar website, a flood at its data center, and multiple outages to NWS Chat [112079]. Problems with the stability and reliability of its information dissemination infrastructure date back to at least 2013, indicating a systemic problem within the organization's information technology infrastructure. (b) Recurrence at multiple_organization: The article does not provide specific information about similar software failure incidents happening at other organizations.
Phase (Design/Operation) design, operation (a) The software failure incident related to the design phase can be seen in the article where it mentions the systemic, long-standing issues with the National Weather Service's information technology infrastructure. The agency has struggled to address these issues as demands for its services have increased over time [112079]. (b) The software failure incident related to the operation phase is evident in the repeated problems the National Weather Service encountered with its Internet services, including bandwidth shortages, inadequate website functionality, and outages to critical communication programs like NWS Chat. These issues impacted the operation and reliability of the Weather Service's information dissemination infrastructure [112079].
Boundary (Internal/External) within_system (a) The software failure incident, a major Internet outage at the National Weather Service, originated primarily within the system. The article describes systemic, long-standing issues with the agency's information technology infrastructure, including problems with Internet services, a bandwidth shortage, an inadequately functioning radar website, a flood at its data center, and outages in critical communication programs like NWS Chat [112079]. These issues point to shortcomings within the Weather Service's own systems and infrastructure that led to the software failure incident.
Nature (Human/Non-human) non-human_actions, human_actions (a) The software failure incident occurring due to non-human actions: The software failure incident at the National Weather Service was primarily due to systemic issues with its information technology infrastructure, including a major, systemwide Internet failure that made forecasts and warnings inaccessible to the public [112079]. The outage was caused by failures in the agency's networks, which disrupted product dissemination and data reception, left websites inoperable, and cut off access to critical communication channels like NWS Chat. Additionally, a bandwidth shortage, inadequate radar website functionality, and a flood at the data center in Maryland also contributed to the software failure incident [112079]. (b) The software failure incident occurring due to human actions: Human actions also contributed to the failure. For example, the agency struggled to address long-standing issues with its information technology infrastructure despite increasing demands for its services [112079]. Additionally, the decision by the Weather Service office in Birmingham to switch to an external program like Slack for communication, prompted by the unreliability of NWS Chat, was rebuked by higher-ups, indicating that human decisions shaped the response to the software failure incident [112079].
Dimension (Hardware/Software) software (a) The software failure incident occurring due to hardware: The software failure incident reported in the article was not primarily attributed to hardware issues. However, the article mentions a hardware-related incident in which the Weather Service's headquarters in Silver Spring experienced a ruptured water pipe on March 9, causing significant flooding and affecting a data center. This incident led to the stoppage of some NWS data, including data from ocean buoys used for detecting seismic events [112079]. (b) The software failure incident occurring due to software: The software failure incident reported in the article was primarily attributed to software issues. The National Weather Service experienced a major systemwide Internet failure, making its forecasts and warnings inaccessible to the public and limiting data available to meteorologists. The outage was due to systemic, long-standing issues with the agency's information technology infrastructure, including repeated problems with Internet services, bandwidth shortages, inadequately functioning radar websites, and outages in critical information conveyance programs like NWS Chat [112079].
Objective (Malicious/Non-malicious) non-malicious (a) The software failure incident reported in the article does not appear to be malicious. There is no indication in the article that the failures were caused by intentional actions to harm the system. (b) The software failure incident can be categorized as non-malicious. The failures were attributed to systemic issues with the National Weather Service's information technology infrastructure, including Internet failures, a bandwidth shortage, an inadequately functioning radar website, a flood at its data center, and outages in critical communication systems like NWS Chat. These issues were described as chronic and long-standing, impacting the agency's ability to provide forecasts and warnings to the public and meteorologists and to fulfill its mission of protecting life and property [112079].
Intent (Poor/Accidental Decisions) poor_decisions (a) The major Internet outage and systemwide failures at the National Weather Service can be attributed to poor decisions. The incident highlighted systemic, long-standing issues with the agency's information technology infrastructure, which it has struggled to address despite increasing demands for its services [112079]. The problems with stability and reliability of the Weather Service's information dissemination infrastructure date back to at least 2013, indicating a lack of proactive decision-making to address these issues [112079]. Additionally, the agency faced issues such as a bandwidth shortage, an inadequately functioning radar website, a flood at its data center, and multiple outages to critical communication programs, all of which point to poor decisions in managing and maintaining its IT systems [112079].
Capability (Incompetence/Accidental) development_incompetence, accidental (a) The software failure incident related to development incompetence is evident in the article. The National Weather Service experienced a major systemwide Internet failure due to systemic, long-standing issues with its information technology infrastructure [112079]. The article highlights problems with stability and reliability dating back to at least 2013, indicating a lack of professional competence in addressing and resolving these issues promptly. Additionally, the Weather Service encountered repeated problems with its Internet services, including bandwidth shortages, inadequate website functionality, and outages in critical communication programs like NWS Chat, impacting its ability to fulfill its mission [112079]. (b) The software failure incident related to accidental factors is also present in the article. For example, the Weather Service faced challenges such as a flood at its data center in Silver Spring, Maryland, due to a ruptured water pipe, causing significant flooding and affecting data flow, including data from ocean buoys used for detecting seismic events [112079]. This accidental event contributed to the software failure incident by disrupting critical data transmission, showcasing how unforeseen events can lead to system failures.
Duration temporary The software failure incident reported in the article was temporary. The incident involved a major, systemwide Internet failure at the National Weather Service, which impacted the distribution of NWS products, including forecasts and warnings, making them inaccessible to the public [112079]. The outage caused failures nationwide, including inoperable websites and no access to NWS Chat, and limited the data available to meteorologists for making forecasts. The incident was eventually resolved, indicating that the failure was temporary.
Behaviour crash, omission, value, other (a) crash: The software failure incident described in the article can be categorized as a crash. The National Weather Service experienced a major systemwide Internet failure, leading to its flagship website, weather.gov, being down and cutting off access to forecasts and warnings [112079]. (b) omission: The software failure incident can also be categorized as an omission. The outage limited the data available to meteorologists, impacting their ability to make forecasts and fulfill the agency's mission of protecting life and property [112079]. (c) timing: The software failure incident does not seem to be related to timing issues, where the system performs its intended functions but at the wrong time. (d) value: The software failure incident can be related to a value failure, as the system was not performing its intended functions correctly, leading to the inaccessibility of forecasts and warnings to the public [112079]. (e) byzantine: The software failure incident does not exhibit characteristics of a byzantine failure, where the system behaves erroneously with inconsistent responses and interactions. (f) other: The software failure incident can be categorized as a systemwide failure impacting the distribution of NWS products, including inoperable websites, loss of contact with networks, and no access to critical information channels like NWS Chat [112079].

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence no_consequence (a) death: The article does not mention anyone losing their life as a direct consequence of the software failure incident [112079].
Domain information, government (a) The failed system was intended to support the information industry, specifically the National Weather Service's information dissemination infrastructure. The system failure impacted the distribution of NWS products, forecasts, warnings, and access to critical data for meteorologists [112079]. (h) The software failure incident also affected the government sector, as the National Weather Service is a government agency responsible for providing weather forecasts, warnings, and information to protect life and property. The outage highlighted systemic issues with the agency's information technology infrastructure, impacting its ability to fulfill its mission [112079]. (m) The software failure incident could also be related to the utilities industry indirectly, as accurate weather forecasts and warnings are crucial for utilities such as power companies to prepare for and respond to severe weather events that could impact their services [112079].

