Incident: Metro's Delayed Software Upgrade Leads to Fatal Smoke Emergency

Published Date: 2015-02-16

Postmortem Analysis
Timeline
1. The software failure incident happened on January 12, 2015 [33558].
System
1. Metro's decade-old computerized process for handling smoke emergencies in tunnels [33558]
2. Metro's train-control center software dating back to 2002 [33558]
Responsible Organization
1. Metro: The software failure incident was caused by Metro's outdated computer software for handling smoke emergencies in tunnels, which was not able to quickly identify the origin point of the smoke and coordinate the work of ventilation fans, exacerbating the tunnel calamity near the L'Enfant Plaza station [33558].
Impacted Organization
1. Metro's train-control center in Landover, Md. [33558]
Software Causes
1. The software system for handling smoke emergencies in tunnels at Metro was outdated and needed modernization, but the planned upgrades had not advanced beyond the paperwork stage [33558].
2. Metro's computer software for train controllers in Landover, Md., dated back to 2002 and needed a complete replacement to improve safety, customer satisfaction, and operational efficiency [33558].
3. The software system was unable to quickly identify the origin point of the smoke during the Jan. 12 incident, leading to the activation of ventilation fans at cross-purposes, exacerbating the situation [33558].
4. The lack of coordination in the software system caused the platform fans to pull smoke toward the train while the shaft fans, pulling from the other direction, helped cause the smoke to settle over the train, impacting passenger safety [33558].
Non-software Causes
1. Malfunction involving a bundle of power cables attached to the electrified third rail in the tunnel, causing tremendous heat and smoke [33558].
2. Activation of ventilation fans at cross-purposes, pulling smoke toward the train instead of pushing it away [33558].
3. Delay in halting train traffic despite a tunnel smoke detector being activated [33558].
4. Failure to shut off the train's air-intake system, allowing smoke to enter the train cars [33558].
5. Activation of fans in exhaust mode, pulling smoke towards the train instead of creating air circulation in the tunnel [33558].
Impacts
1. The software failure incident led to a fatal crisis on Jan. 12, 2015, where noxious fumes enveloped a Metro train underground, resulting in one rider's death from smoke inhalation and many passengers being left gasping for air [33558].
2. The incident highlighted Metro's inability to quickly identify the origin point of the smoke, leading to the activation of ventilation fans at cross-purposes, exacerbating the situation by pulling smoke toward the train instead of pushing it away [33558].
3. The outdated computer software at Metro's train-control center, dating back to 2002, was not adequately serving the controllers, impacting their ability to handle emergencies efficiently [33558].
4. The software failure incident caused passengers to be trapped in a smoke-filled tunnel, with the train's air-intake system pulling smoke into the cars, leading to passengers inhaling smoke and vapor [33558].
5. The incident prompted changes in Metro's procedures, such as allowing train operators to turn off air-intake systems in smoke emergencies without waiting for permission and requiring trains to return to the station in such situations [33558].
6. The failure of the software to coordinate the ventilation fans properly resulted in smoke settling over the train, causing passengers to wait for more than 30 minutes for rescuers to arrive and lead them back to the station on foot [33558].
Preventions
1. Implementing the planned upgrades for the computerized process for handling smoke emergencies in tunnels, which would have helped Metro pinpoint the location of smoke and coordinate the work of ventilation fans [33558].
2. Advancing the software replacement project to make the rail system safer for customers and employees, providing improved customer satisfaction through more reliable and efficient operation [33558].
3. Coordinating the activation of ventilation fans so that they work together during emergencies; the failure to do so exacerbated the tunnel calamity on Jan. 12 [33558].
Fixes
1. Implement a complete software replacement to modernize Metro's train-control center software, which is currently outdated and inadequate for dealing with emergencies like smoke incidents [33558].
2. Develop a software system that provides real-time, accurate, and detailed information to help Metro quickly identify the origin point of smoke in tunnels and coordinate the work of ventilation fans effectively during emergencies [33558].
3. Divide the rail system into clearly defined zones in the software to enable controllers to pinpoint the source of smoke quickly and automate the coordination of fan operations [33558].
References
1. National Transportation Safety Board (NTSB): The NTSB provided substantial disclosures about the Jan. 12 incident, indicating Metro's inability to quickly identify the origin point of the smoke and the issues with the ventilation fans [33558].
2. Metro documents: Documents prepared for a possible contract for new computer software for Metro’s train-control center in Landover, Md., indicated the need for modernizing the system for dealing with smoke in tunnels and the planned upgrades for computer software [33558].
3. Christopher A. Hart: The acting chairman of the NTSB issued "urgent recommendations" concerning Metro’s procedures for dealing with smoke in tunnels and provided new details about the Jan. 12 incident [33558].
4. D.C. firefighters: Lt. Stephen Kuhn from the D.C. firefighters provided information about the conditions when they reached the rail cars during the incident [33558].
5. Bill Bundens: A mechanical engineer who observed smoke pouring from a ventilation shaft near the incident site and called 911, providing details about the activation of the fans and the smoke situation [33558].
6. Jeffrey Todd: A Navy pilot who was on the smoke-filled train and decided to walk through the tunnel to safety, providing a firsthand account of the conditions and actions taken during the incident [33558].
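
Illustrative sketch of the zone-based fix listed under Fixes (dividing the rail system into zones so that fan operation can be coordinated automatically). This is not Metro's design and is not ventilation-engineering guidance. It assumes a single linear stretch of tunnel split into numbered zones, and a simplified push-pull rule: fans on the train's side of the smoke supply air, and all other fans exhaust, so the airflow moves smoke away from the train. The names `Fan`, `FanMode`, `plan_fan_modes`, and the fan identifiers are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class FanMode(Enum):
    SUPPLY = "supply"    # blow fresh air into the tunnel
    EXHAUST = "exhaust"  # pull air and smoke out of the tunnel


@dataclass(frozen=True)
class Fan:
    name: str  # hypothetical identifier
    zone: int  # hypothetical zone index along one linear stretch of tunnel


def plan_fan_modes(fans: List[Fan], smoke_zone: int, train_zone: int) -> Dict[str, FanMode]:
    """Assign each fan a mode so that air flows from the train toward the smoke.

    Assumed rule (a simplification): fans on the train's side of the smoke
    run in SUPPLY mode; fans in the smoke zone or beyond it run in EXHAUST mode.
    """
    if smoke_zone == train_zone:
        # No direction can be inferred; leave the decision to a human controller.
        raise ValueError("smoke and train share a zone; no safe direction can be inferred")

    train_side = 1 if train_zone > smoke_zone else -1
    plan: Dict[str, FanMode] = {}
    for fan in fans:
        if fan.zone == smoke_zone:
            fan_side = 0
        else:
            fan_side = 1 if fan.zone > smoke_zone else -1
        plan[fan.name] = FanMode.SUPPLY if fan_side == train_side else FanMode.EXHAUST
    return plan


if __name__ == "__main__":
    fans = [Fan("platform_fan", 0), Fan("shaft_fan_a", 2), Fan("shaft_fan_b", 4)]
    for name, mode in plan_fan_modes(fans, smoke_zone=3, train_zone=1).items():
        print(name, mode.value)
```

Under this rule no two fans pull against each other across the train, which is the cross-purpose behaviour described in the Software Causes above. A real system would also have to account for tunnel geometry, fan capacity, and detector reliability, none of which is modelled here.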

Software Taxonomy of Faults

Category Option Rationale
Recurring unknown (a) The software failure incident having happened again at one_organization: The article does not mention any specific instance of a similar software failure incident happening again within the same organization (Metro) or with its products and services. Therefore, there is no evidence of a repeated software failure incident within the same organization in the provided articles. (b) The software failure incident having happened again at multiple_organization: The article does not provide information about a similar software failure incident happening again at other organizations or with their products and services. Hence, there is no mention of a repeated software failure incident at multiple organizations in the given articles.
Phase (Design/Operation) design, operation (a) The article mentions that Metro had planned upgrades for its computerized process for handling smoke emergencies in tunnels to help pinpoint the location of smoke and coordinate the work of ventilation fans. However, these upgrades had not advanced beyond the paperwork stage, indicating a failure in the design phase of the software system [33558]. (b) The National Transportation Safety Board (NTSB) disclosed that the tunnel calamity on Jan. 12 was exacerbated by Metro's inability to quickly identify the origin point of the smoke, leading to the activation of tunnel ventilation fans at cross-purposes, pulling the smoke toward the train instead of pushing it away. This failure to quickly and effectively operate the ventilation system contributed to the incident [33558].
Boundary (Internal/External) within_system (a) The software failure incident related to the Metro's train-control center and its handling of smoke emergencies in tunnels was primarily within the system. The incident was exacerbated by Metro's outdated computer software that hindered the quick identification of the origin point of the smoke and coordination of ventilation fans [33558]. The National Transportation Safety Board highlighted that Metro's inability to modernize its software for dealing with smoke emergencies within the tunnels contributed to the crisis [33558]. The software replacement was deemed necessary to enhance safety for customers and employees, improve customer satisfaction, and provide real-time, accurate data for better operation during emergencies [33558]. The incident also led to urgent recommendations from the NTSB for Metro to improve its procedures for dealing with smoke in tunnels, which would require software enhancements within the system [33558].
Nature (Human/Non-human) non-human_actions, human_actions (a) The software failure incident occurring due to non-human actions: The software failure incident in the Metro tunnel calamity on Jan. 12 was exacerbated by Metro's inability to quickly identify the origin point of the smoke. This was due to the outdated computer software system that was not able to pinpoint the location of the smoke and coordinate the work of ventilation fans effectively. The malfunction involving a bundle of power cables in the tunnel caused tremendous heat and smoke, leading to the crisis. The software system was not able to provide real-time, accurate data to assist in managing the emergency situation [33558]. (b) The software failure incident occurring due to human actions: The planned upgrades for the computerized process for handling smoke emergencies in tunnels, which could have helped Metro pinpoint the location of smoke and coordinate ventilation fans, had not advanced beyond the paperwork stage before the disaster occurred. Metro had been aware for months that its train controllers were not adequately served by the outdated computer software system. The slow progress in completing the software improvements and the lack of urgency in addressing the system's deficiencies contributed to the software failure incident during the Jan. 12 incident at L'Enfant Plaza station [33558].
Dimension (Hardware/Software) software (a) The articles do not provide information about a software failure incident occurring due to contributing factors that originate in hardware. (b) The software failure incident mentioned in the articles is related to the inadequacy of Metro's computer software for handling smoke emergencies in tunnels. The software was outdated, dating back to 2002, and had not been adequately modernized despite the agency's awareness of the need for improvements. The National Transportation Safety Board (NTSB) highlighted that Metro's inability to quickly identify the origin point of the smoke during the Jan. 12 incident exacerbated the situation. The software was not able to provide real-time, accurate information needed for effective emergency response, leading to confusion in activating ventilation fans and worsening the smoke situation inside the tunnel [33558].
Objective (Malicious/Non-malicious) non-malicious The software failure incident discussed in the articles is categorized as non-malicious. The incident was related to the failure of Metro's computerized process for handling smoke emergencies in tunnels, which was outdated and not able to quickly identify the origin point of smoke during the Jan. 12 incident near the L'Enfant Plaza station [33558]. The failure was attributed to the lack of modernization and upgrades in the software system, as well as the slow progress in completing the necessary improvements to the software [33558]. The incident was exacerbated by Metro's inability to coordinate the work of ventilation fans effectively due to the outdated software, leading to the fans pulling smoke toward the train instead of pushing it away [33558]. The incident highlighted the critical need for a software overhaul to enhance safety, efficiency, and reliability in emergency situations [33558].
Intent (Poor/Accidental Decisions) poor_decisions The intent of the software failure incident can be attributed to poor decisions made by Metro in handling the upgrade of their computerized process for handling smoke emergencies in tunnels. The article [33558] highlights that Metro had known for months that their system for dealing with smoke in tunnels needed to be modernized, but they moved at a less-than-urgent pace in trying to complete the improvements. The failure to upgrade the software in a timely manner contributed to the exacerbation of the tunnel calamity on Jan. 12, as Metro's inability to quickly identify the origin point of the smoke led to the activation of ventilation fans at cross-purposes, pulling the smoke toward the train instead of pushing it away. This delay in upgrading the software system reflects poor decision-making on Metro's part, which ultimately played a role in the incident.
Capability (Incompetence/Accidental) development_incompetence (a) The software failure incident occurring due to development incompetence: The incident described in the articles highlights a software failure incident that occurred due to development incompetence. Metro had been aware for months that their train-control center software was outdated and not adequately serving the controllers. Despite recognizing the need for modernization and improvements in dealing with emergencies like smoke in tunnels, Metro had not taken urgent action to complete the necessary upgrades. The National Transportation Safety Board (NTSB) pointed out that the tunnel calamity on Jan. 12 was exacerbated by Metro's inability to quickly identify the origin point of the smoke, leading to the activation of ventilation fans at cross-purposes, which worsened the situation [33558]. (b) The software failure incident occurring accidentally: The incident described in the articles does not specifically mention the software failure incident as occurring accidentally. The focus is more on the lack of urgency in upgrading the software and the consequences of the outdated system in dealing with emergencies like the smoke incident in the tunnel. The failure seems to be attributed more to development incompetence and lack of timely action rather than being accidental.
Duration temporary The software failure incident is classified as temporary: it arose from specific circumstances, namely that Metro's train-control center software, which dates back to 2002, had not received the modernization needed to handle smoke emergencies in tunnels. Documents prepared for a possible contract for new computer software indicated that Metro knew its system needed to be modernized, but the upgrades had not advanced beyond the paperwork stage [33558]. On Jan. 12, the smoke calamity near the L'Enfant Plaza station was exacerbated by Metro's inability to quickly identify the origin point of the smoke with the outdated software [33558].
Behaviour crash, omission, timing, value, other
(a) crash: The software failure incident in the Metro system can be categorized as a crash. The incident involved a malfunction in the software system that led to a failure in coordinating the work of ventilation fans during an emergency situation, resulting in the exacerbation of a tunnel calamity [33558].
(b) omission: The software failure incident can also be attributed to omission. The system failed to quickly identify the origin point of smoke in the tunnel, leading to the activation of ventilation fans at cross-purposes, which pulled the smoke toward the train instead of pushing it away, worsening the situation [33558].
(c) timing: The timing of the software failure incident can be considered a factor in the overall failure. The system, although functioning, did not act in a timely manner to address the emergency situation, causing delays in response and exacerbating the crisis [33558].
(d) value: The software failure incident can be linked to a failure in value. The system, despite its intended functions, did not provide the necessary value in terms of safety and efficiency during the emergency. The outdated software system was not able to ensure the safety of customers and employees in a critical situation [33558].
(e) byzantine: The software failure incident does not align with a byzantine behavior as described in the options. The failure was more related to coordination issues and inefficiencies rather than inconsistent responses or interactions within the system [33558].
(f) other: The other behavior exhibited by the software failure incident could be categorized as a failure in system coordination. The lack of coordination between the ventilation fans due to the software malfunction led to a situation where the fans worked against each other, causing the smoke to settle over the train and passengers, instead of being cleared efficiently [33558].

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence death, harm (a) death: One woman died from smoke inhalation in the January 12 incident in the Metro tunnel near the L'Enfant Plaza station, a consequence of the software failure incident [33558]. (b) harm: Many other passengers were left gasping for air after inhaling smoke while trapped on the train [33558].
Domain transportation The failed software system mentioned in the articles was intended to support the transportation industry. The software was related to Metro's train-control center in Landover, Md., and was crucial for handling smoke emergencies in tunnels, coordinating the work of ventilation fans, and ensuring the safety of passengers in the event of emergencies like the underground fire incident at L'Enfant Plaza station [33558].
