Incident: State-of-the-Art Signal System Failure on Long Island Rail Road

Published Date: 2011-09-30

Postmortem Analysis
Timeline
1. The incident happened on a Thursday, as mentioned in the article [7656].
2. The article was published on 2011-09-30, a Friday.
3. Estimated timeline: the incident occurred on a Thursday before publication, most likely 2011-09-29, the day before the article appeared. This is consistent with service having nearly returned to normal by Friday morning [7656].
System
1. State-of-the-art computerized signal system installed at the Long Island Rail Road's hub in Jamaica, Queens [7656]
2. Microprocessors and fail-safe features within the signal system [7656]
3. Backup system that failed [7656]
4. Triage software that was supposed to diagnose the troubles [7656]
5. Software program supplied by Ansaldo that gave false readings on whether the equipment was functioning [7656]
Responsible Organization
1. Ansaldo, the manufacturer of the signal system, which supplied the software program that gave false readings during the incident [7656].
Impacted Organization
1. Long Island Rail Road passengers [7656]
2. Long Island Rail Road Commuter Council [7656]
Software Causes
1. The software program supplied by Ansaldo gave false readings on whether the equipment was functioning, causing confusion and hindering the efforts to revive the signals [7656].
Non-software Causes
1. A bolt of lightning struck the signal system and fried its microchips, triggering the failure [7656].
2. The backup system also failed, contributing to the incident [7656].
Impacts
1. The failure paralyzed the Long Island Rail Road for hours, significantly disrupting service and leaving passengers stranded [7656].
2. Passengers faced delays and cancellations, and the failure of the high-tech signal system led to frustration and a loss of confidence in the railroad [7656].
3. Workers had to set switches manually using old-fashioned methods, showing that the railroad still depended on outdated techniques when the modern system failed [7656].
4. The incident raised questions about the effectiveness and reliability of a state-of-the-art system that was supposed to prevent such failures, highlighting vulnerabilities that remained despite the upgrade [7656].
Preventions
1. Proper testing and validation of the software program supplied by Ansaldo to ensure it accurately detects equipment functionality [7656].
2. Implementing redundant backup systems that can seamlessly take over in case of primary system failures caused by lightning strikes or power surges [7656].
3. Conducting thorough risk assessments to identify and address potential vulnerabilities in the signal system design, such as susceptibility to lightning strikes, to prevent similar incidents in the future [7656].
Fixes
1. Conduct a thorough review of the software system by an independent consultant to identify the root cause of the failure and potential vulnerabilities [7656].
2. Collaborate with the system's manufacturer, Ansaldo, to address the software program's shortcomings and ensure it functions as intended [7656].
3. Implement additional safeguards or redundancies in the software to protect against lightning strikes and power surges [7656].
References
1. Helena E. Williams, the president of the Long Island Rail Road [7656]
2. Maureen Michaels, former chairwoman and a member of the Long Island Rail Road Commuter Council [7656]
3. Ansaldo, the manufacturer of the signal system [7656]
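The date estimate in the Timeline entry can be checked with a few lines of Python. This is a minimal sketch, assuming the incident fell on the Thursday immediately preceding the publication date; the helper name `most_recent_weekday_before` is hypothetical and does not come from the article.

```python
from datetime import date, timedelta

THURSDAY = 3  # date.weekday() numbering: Monday=0 ... Sunday=6


def most_recent_weekday_before(day: date, weekday: int) -> date:
    """Return the latest date strictly before `day` that falls on `weekday`."""
    # Days to step back; a result of 0 means `day` itself is that weekday,
    # so step back a full week to stay strictly before it.
    days_back = (day.weekday() - weekday) % 7 or 7
    return day - timedelta(days=days_back)


published = date(2011, 9, 30)  # publication date given in the record
# Assumption: the incident was the Thursday right before publication.
incident_estimate = most_recent_weekday_before(published, THURSDAY)
print(incident_estimate.isoformat())  # 2011-09-29
```

If the article had appeared more than a week after the incident, this estimate would be off by a multiple of seven days, so it should be read as the latest plausible date rather than a confirmed one.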

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization (a) The incident was a recurrence at the Long Island Rail Road. The article mentions a similar incident in August 2010, when the old World War I-era contraption caught fire and disrupted service. The new $56 million state-of-the-art computerized signal system, installed to replace the old one, failed in turn when a bolt of lightning struck and fried its microchips, disrupting service once again [7656]. (b) The article does not provide information about similar incidents at other organizations or with their products and services.
Phase (Design/Operation) design, operation (a) The Long Island Rail Road incident can be attributed in part to the design phase. The article mentions that the high-tech new signal system, with its microprocessors and fail-safe features, was supposed to prevent incidents like this one. Instead, the system failed when a bolt of lightning struck and fried the microchips, and the backup system and triage software failed as well. The president of the railroad expressed concern that the software they had invested in did not work as intended, which points to a design-phase failure [7656]. (b) The incident can also be linked to the operation phase. The article describes how railroad workers were unable to revive the signals because a software program gave false readings on whether the equipment was functioning. This operational problem contributed to the shutdown of the entire network and left passengers stranded for hours [7656].
Boundary (Internal/External) within_system (a) within_system: The software failure incident was primarily attributed to issues within the system itself. The article mentions that the high-tech new signal system, with its microprocessors and fail-safe features, was supposed to prevent failures like the one that occurred. However, the backup system failed, as did the triage software that was supposed to diagnose the troubles. The software program supplied by Ansaldo gave false readings on whether the equipment was functioning, further complicating the situation [7656]. (b) outside_system: The incident was triggered by an external factor, namely a bolt of lightning that struck and fried the microchips of the signal system. This external event led to the failure of the supposedly impermeable microchips and subsequently impacted the backup system and the triage software within the system [7656].
Nature (Human/Non-human) non-human_actions, human_actions (a) The incident was caused by non-human actions, specifically a bolt of lightning that struck and fried the microchips of the state-of-the-art signal system at the Long Island Rail Road [7656]. (b) Human actions were also involved: when the backup system and the triage software that was supposed to diagnose the troubles failed to function as intended, workers had to intervene manually, setting switches with mallets in the old-fashioned way [7656].
Dimension (Hardware/Software) hardware, software (a) The Long Island Rail Road incident involved hardware issues. The article mentions that a bolt of lightning struck and fried the microchips of the state-of-the-art computerized signal system, causing the system to fail. The backup system failed as well, which also points to a hardware-related issue [7656]. (b) The incident also involved software issues. The article mentions that the triage software that was supposed to diagnose the troubles did not work as intended and gave false readings on whether the equipment was functioning properly [7656]. This indicates that a software failure contributed to the overall incident.
Objective (Malicious/Non-malicious) non-malicious (a) The incident described in the article was non-malicious. The failure was attributed to a bolt of lightning that struck and fried the microchips of the state-of-the-art signal system, with the backup system and triage software failing as well [7656]. It was caused not by malicious intent but by a natural event that exposed the system's vulnerabilities.
Intent (Poor/Accidental Decisions) poor_decisions (a) Factors pointing to poor decisions: - The incident involved a state-of-the-art computerized signal system installed at the Long Island Rail Road, which was supposed to ensure the reliability of train service [7656]. - The new system was designed to protect against lightning strikes and power surges, but it failed when a bolt of lightning struck and fried the microchips, paralyzing the system [7656]. - The railroad's president expressed concern that the software they had paid millions for did not work as intended, suggesting that investing in a system that failed during a critical situation was a poor decision [7656]. (b) Factors pointing to accidental decisions: - The failure was triggered by a bolt of lightning that struck and fried the microchips of the supposedly impermeable system [7656]. - The backup system also failed, along with the triage software that was supposed to diagnose the issues, a series of unintended consequences that compounded the incident [7656].
Capability (Incompetence/Accidental) development_incompetence, accidental (a) The incident at the Long Island Rail Road was attributed in part to development incompetence. The president of the railroad expressed concern that the software they had paid millions for did not work as intended, which suggests a lack of professional competence in the development of the system [7656]. (b) The incident was also influenced by accidental factors: the bolt of lightning that struck and fried the microchips was an unforeseen event that contributed to the failure of the system [7656].
Duration temporary The failure described in the article was temporary. A lightning strike fried the microchips of the state-of-the-art signal system, and the backup system and triage software failed as well. Workers had to resort to manual methods to set switches, and the railroad was paralyzed for hours. By Friday morning, however, service had nearly returned to normal, which shows the failure was temporary [7656].
Behaviour crash, omission, timing, value, other (a) crash: The signal system failed outright, paralyzing the railroad for hours. Workers had to resort to manual methods to set switches, indicating that the system lost state and stopped performing its intended functions [7656]. (b) omission: The backup system failed, as did the triage software that was supposed to diagnose the troubles. The omission of these critical functions contributed to the overall failure of the system [7656]. (c) timing: The failure occurred at the start of the Thursday evening rush, making it nearly impossible to run trains safely between multiple stations and causing significant disruptions in service [7656]. (d) value: The software program supplied by Ansaldo gave false readings on whether the equipment was functioning. This incorrect output made it harder for railroad workers to revive the signals [7656]. (e) byzantine: The incident did not exhibit the characteristics of a byzantine failure. It involved crashes, omissions, timing issues, and incorrect output rather than inconsistent responses and interactions. (f) other: The railroad's president expressed frustration that the software did not work as intended despite the significant investment made in the system. This breakdown in communication and expectations added to the overall impact of the incident [7656].

IoT System Layer

Layer Option Rationale
Perception sensor, embedded_software (a) The failure was related to the perception layer of the cyber-physical system, with contributing factors introduced by sensor error. The article mentions that a bolt of lightning struck and fried the microchips of the high-tech signal system, causing the failure. This is classified as a sensor error because the sensing hardware (the microchips) was directly affected by the lightning strike [7656]. (e) The failure was also related to the embedded software of the cyber-physical system. The article mentions that the software program supplied by Ansaldo gave false readings on whether the equipment was functioning, which hindered the efforts to revive the signals. This indicates an error in the embedded software used in the system [7656].
Communication unknown The article does not explicitly attribute the incident to the communication layer of the cyber-physical system. It focuses on the failure of the state-of-the-art computerized signal system at the Long Island Rail Road after a lightning strike fried the microchips, with the backup system and triage software failing as well. It gives no detail on whether the failure was related to the communication layer at the link_level or the connectivity_level [7656].
Application TRUE The incident was related to the application layer of the cyber-physical system. The article mentions that the new state-of-the-art computerized signal system, which included microprocessors and fail-safe features, failed when a bolt of lightning struck and fried the microchips. The triage software that was supposed to diagnose the troubles also failed, forcing workers to intervene manually and set switches with mallets in the old-fashioned way. Because the software did not function as intended, the failure is attributed to the application layer [7656].

Other Details

Category Option Rationale
Consequence delay, non-human, theoretical_consequence, other (a) death: The article reports no loss of life due to the incident [7656]. (b) harm: The article does not mention any physical harm to individuals due to the incident [7656]. (c) basic: There is no indication that people's access to food or shelter was affected [7656]. (d) property: According to the article, the incident did not directly affect people's material goods, money, or data [7656]. (e) delay: The incident caused significant delays and disruptions in Long Island Rail Road service, leaving passengers stranded for hours [7656]. (f) non-human: The incident affected the functioning of the signal system and related equipment on the Long Island Rail Road [7656]. (g) no_consequence: Not applicable; the incident had real, observed consequences in the form of service disruptions and delays [7656]. (h) theoretical_consequence: The article discusses potential consequences of the incident, such as the system not working as intended despite the investment in modern technology [7656]. (i) other: Beyond delays and disruptions, the incident led to frustration among passengers, a loss of confidence in the railroad, and the suspension of service; no further consequences are mentioned [7656].
Domain transportation (a) The failed system was intended to support the transportation industry, specifically the Long Island Rail Road, one of the nation's largest commuter railroads [7656].
