Published Date: 2016-08-08
| Postmortem Analysis | |
|---|---|
| Timeline | 1. The software failure incident involving Delta Air Lines occurred on Monday, August 8, 2016 [46631, 47393, 47391, 46607]. |
| System | 1. Delta Air Lines' computer systems failed due to a power outage at one of its Atlanta facilities, leading to the grounding of flights and widespread cancellations and delays [46631, 47393, 47391, 46607]. 2. The failure involved a piece of electrical equipment, specifically a failed switchgear, which is similar to a circuit-breaker box in a house [46631]. 3. The outage affected Delta's Operations and Customer Center at Atlanta's Hartsfield-Jackson, which monitors Delta's global fleet, crews, and passengers [46607]. 4. The outage caused static check-in lanes at airports, gate agents writing boarding passes by hand, and incorrect flight information displayed on departure boards and smartphone apps [46607]. 5. Delta's contingency plans, including a second command center nearby and uninterruptible power supply systems, were unable to overcome the outage [46607]. 6. The outage did not affect standard air traffic control systems, which were operational and ready to handle flights already in the air [46607]. |
| Responsible Organization | 1. Delta Air Lines [46631, 47393, 47391, 46607] |
| Impacted Organization | 1. Delta Air Lines [46631, 47393, 47391, 46607] |
| Software Causes | 1. The failure incident at Delta Air Lines was caused by a power outage that led to the shutdown of its computer systems worldwide, resulting in hundreds of canceled and delayed flights [46631, 47393, 47391]. 2. The outage at Delta's Operations and Customer Center in Atlanta disrupted the monitoring of Delta's global fleet, crews, and passengers, affecting key computers or servers and leading to a crisis situation [46607]. 3. The failure was not specifically attributed to software issues but rather to the complexity of the airline's computer systems, which are built in layers with some components being 30 years old, making them difficult and expensive to update [46607]. |
| Non-software Causes | 1. Power outage at one of Delta's facilities in Atlanta [46631, 47393] 2. Failure of a piece of electrical equipment at Delta's Atlanta facility [46631] 3. Internal computer glitch or power outage at Delta's Atlanta hub [47391] 4. Failure of a switchgear, a heavy-duty version of a circuit breaker panel [46607] |
| Impacts | 1. The software failure incident at Delta Air Lines led to the grounding of flights, cancellations, and delays, affecting thousands of passengers globally [46631, 47393, 47391, 46607]. 2. Passengers experienced chaos and confusion at airports, with some stranded overnight, facing long delays, and having to deal with manual check-ins and boarding passes [46631, 47393, 47391, 46607]. 3. Delta had to cancel hundreds of flights, with significant financial implications, including potential costs in the tens of millions of dollars [46631, 47393, 47391, 46607]. 4. The incident disrupted Delta's operations, leading to a significant number of cancellations and delays in various countries, impacting both domestic and international flights [46631, 47393, 47391, 46607]. 5. The failure of the computer systems affected Delta's mission control center, leading to operational crises and challenges in managing the global fleet, crews, and passengers [46607]. 6. The outage caused inconvenience to passengers, with some missing connections, facing long delays, and experiencing uncertainty about their travel plans [46631, 47393, 47391, 46607]. 7. The incident highlighted the vulnerabilities in the airline industry's complex computer systems, which are often built in layers and rely on outdated technology, making it challenging to ensure foolproof backups and redundancy [47391, 46607]. 8. The failure of the computer systems raised concerns about the need for airlines to invest in modern, integrated computer systems to prevent similar incidents in the future [47391]. 9. The impacts of the software failure incident extended beyond Delta, with previous incidents at other airlines like Southwest and United Airlines also causing significant disruptions and financial losses [46631, 47393, 47391, 46607]. |
| Preventions | 1. Implementing foolproof backup systems for computer networks and power sources to ensure redundancy and continuity of operations [47391]. 2. Investing in new, modern computer systems to replace outdated and patchwork networks that are prone to failures [47391]. 3. Regularly updating and maintaining the complex layers of computer systems used by airlines to prevent failures caused by outdated components [46607]. 4. Conducting thorough testing and ensuring compatibility when making changes or updates to the system to avoid disruptions to the steady state of operations [46607]. |
| Fixes | 1. Implement foolproof backups for computer systems and electric power sources to prevent incidents like power outages or internal computer glitches from causing widespread disruptions [47391]. 2. Invest in new computer systems to replace outdated and patchwork networks currently in place in the airline industry [47391]. 3. Ensure redundancy built on top of redundancy in computer systems to maintain operations even in the event of failures [47391]. 4. Regularly update and maintain complex computer systems used by airlines, especially those built in layers with subcomponents and data feeds from diverse sources [46607]. 5. Conduct thorough investigations into the root causes of software failures to identify any changes that may have disrupted the steady state of the systems, such as updates pushed through the system [46607]. |
| References | 1. Delta Air Lines [46631, 47393, 47391, 46607] 2. Southwest Airlines [46631, 47393, 47391, 46607] 3. United Airlines [46631, 47393, 47391, 46607] 4. American Airlines [46631, 47393, 47391, 46607] 5. Georgia Power [46631, 47391, 46607] 6. Travel Technology Consulting [46631] 7. Sabre [46631] 8. Amadeus [46631] 9. Travelport [46631] 10. Atmosphere Research Group [46631] 11. FareCompare.com [47391] 12. Embry-Riddle Aeronautical University [47391, 46607] 13. NASA [46607] 14. The Dallas Morning News [46607] |
| Category | Option | Rationale |
|---|---|---|
| Recurring | one_organization, multiple_organization | (a) The software failure incident having happened again at one_organization: - Delta Air Lines experienced a major software failure incident due to a power outage that grounded flights and led to cancellations and delays [46631]. - The incident at Delta was not an isolated case, as other airlines like Southwest Airlines and United Airlines had also faced similar computer system failures in the past [46631]. - Delta's computer system failure was particularly damaging as it affected its Operations and Customer Center, which monitors the global fleet, crews, and passengers, causing widespread disruptions [46607]. (b) The software failure incident having happened again at multiple_organization: - The articles mention that Southwest Airlines and United Airlines had also experienced significant computer system failures in the past, with Southwest canceling thousands of flights due to a router failure and United facing systemwide computer problems [47393, 47391]. - The incidents at Delta, Southwest, and United highlight the challenges faced by airlines in maintaining complex computer systems and the potential for widespread disruptions when such systems fail [46631, 47393, 47391]. |
| Phase (Design/Operation) | design, operation | (a) The software failure incident occurring due to the development phases: - The incident involving Delta Air Lines' computer outage was attributed to a failure in the system's design and development phases. The article highlights that the complexity of the system, which is built in layers with subcomponents and data feeds from diverse sources, contributed to the failure. Some parts of the system were noted to be 30 years old, making them expensive and challenging to update. The failure was especially damaging as it affected Delta's Operations and Customer Center, which monitors the global fleet, crews, and passengers, indicating a design flaw in the system's critical components [46607]. (b) The software failure incident occurring due to the operation phases: - The incident involving Delta Air Lines' computer outage was also influenced by factors related to the operation of the system. The outage disrupted the airline's mission control center, leading to static check-in lanes, handwritten boarding passes, and incorrect flight information displayed to passengers. The failure in the operation of the system resulted in delays, cancellations, and chaos at airports, showcasing operational challenges faced during the incident [46607]. |
| Boundary (Internal/External) | within_system | (a) The software failure incident reported in the articles is primarily within the system. The incidents at Delta Air Lines were caused by internal factors such as a power failure that led to the shutdown of the airline's computer systems, resulting in widespread flight cancellations and delays [46631, 47393, 47391, 46607]. The failures were attributed to issues with the airline's computer systems, including a failure of a piece of electrical equipment at one of its facilities, a computer outage affecting the mission control center, and a systemwide computer outage that affected flights globally. These incidents highlight the critical role of internal system components in causing software failures within the organization. |
| Nature (Human/Non-human) | non-human_actions, human_actions | (a) The software failure incident occurring due to non-human actions: - The software failure incident at Delta Air Lines was caused by a power failure that grounded flights and led to cancellations and delays. The failure of a piece of electrical equipment at one of its Atlanta facilities shut down its computer systems worldwide, resulting in hundreds of canceled and delayed flights [46631]. - Delta experienced a computer outage that affected flights scheduled for the morning, which crippled its mission control center, leading to static check-in lanes, handwritten boarding passes, and incorrect flight status information displayed on departure boards and smartphone apps. The outage was not due to a complete blackout but rather the failure of key computers or servers, impacting Delta's global fleet, crews, and passengers [46607]. (b) The software failure incident occurring due to human actions: - The article mentions that Delta's computer system failure was due to a power outage or an internal computer glitch, and aviation and computer specialists stated that computer systems and their electric power sources should have foolproof backups, indicating a potential lack of proper redundancy or backup systems [47391]. - Passengers expressed surprise and disappointment at Delta's computer outage, questioning why one of the world's largest corporations had not found a more reliable way to keep operations running smoothly after a computer glitch. The consequences of such failures were highlighted, emphasizing the need for redundancy and reliable systems to ensure operations continue even in the event of a failure [47393]. |
| Dimension (Hardware/Software) | hardware, software | (a) The software failure incident occurring due to hardware: - The software failure incident at Delta Air Lines was attributed to a power failure that occurred at one of its Atlanta facilities, which led to a shutdown of its computer systems worldwide [46631]. - Delta Air Lines experienced a computer outage due to a power outage that affected its mission control center, disrupting its global fleet operations [46607]. - Georgia Power, the utility at Delta's Atlanta hub, mentioned that the issue was related to a failed switchgear, a hardware component similar to a circuit breaker panel [47391]. (b) The software failure incident occurring due to software: - The failure of a piece of electrical equipment at Delta Air Lines' Atlanta facility led to a cascade of hundreds of canceled and delayed flights, indicating a software failure incident [46631]. - The outage in Atlanta that affected Delta's mission control center was described as a computer outage, suggesting a software-related issue [46607]. - The article highlights that the complexity of airline computer systems, built in layers with diverse sources, some of which are 30 years old, can contribute to software failures [46607]. |
| Objective (Malicious/Non-malicious) | non-malicious | (a) The articles do not mention any indication of malicious intent behind the software failure incident reported by Delta Air Lines. The incidents were attributed to a power outage, internal computer glitches, and failures in the airline's computer systems, which led to widespread delays and cancellations [46631, 47393, 47391, 46607]. (b) The software failure incidents reported by Delta Air Lines, Southwest Airlines, United Airlines, and other carriers were non-malicious in nature. These incidents were primarily caused by technical issues such as power outages, router failures, computer glitches, and system failures rather than intentional actions to harm the systems [46631, 47393, 47391, 46607]. |
| Intent (Poor/Accidental Decisions) | poor_decisions, accidental_decisions | (a) The software failure incident involving Delta Air Lines was primarily due to poor decisions made in the design and implementation of their computer systems. The incident was caused by a failure of a piece of electrical equipment at one of Delta's facilities, which led to a cascade of hundreds of canceled and delayed flights [46631]. The article highlights that the airline had backup systems in place, but they were not triggered as they should have been when the primary system failed. This failure to activate the backup systems points to a lack of robust contingency planning and poor decision-making in ensuring system redundancy [46631]. Additionally, the article mentions that Delta's computer system failure was especially damaging because it affected the Operations and Customer Center, which monitors Delta's global fleet, crews, and passengers. The outage in this critical control center led to widespread disruptions, indicating a failure in ensuring the resilience and redundancy of essential operational systems [46607]. (b) The software failure incident can also be attributed to accidental decisions or unintended mistakes. The article mentions that the outage in Delta's system was initially thought to be a power outage, but Georgia Power clarified that it was a problem specific to Delta and not a city-wide issue [47393]. This misunderstanding could be seen as an accidental decision based on incomplete information. Furthermore, the article discusses how the failure of the computer system at Delta's mission control center disrupted operations, leading to manual check-ins and flight delays. The unintended consequences of the system failure resulted in chaos at airports and inconvenience for passengers [46607]. |
| Capability (Incompetence/Accidental) | development_incompetence, accidental | (a) The articles provide information about the software failure incident related to development incompetence. The incident at Delta Air Lines was attributed to a failure of a piece of electrical equipment at one of its Atlanta facilities, which led to a shutdown of its computer systems worldwide, causing hundreds of canceled and delayed flights [46631]. The incident was described as a failure of the system's complexity, with layers of computer systems built over decades, some parts being 30 years old, making it difficult and expensive to update them [46607]. The outage affected Delta's Operations and Customer Center, which monitors the global fleet, crews, and passengers, leading to a crisis when it went offline [46607]. (b) The articles also suggest that the software failure incident was accidental. The outage at Delta Air Lines was initially reported as a power outage that affected its computers, leading to widespread delays and cancellations [47393]. Georgia Power, the utility at Delta's Atlanta hub, mentioned that the outage was likely due to a failed switchgear, a heavy-duty version of a circuit breaker panel, suggesting a hardware issue rather than a software problem [46607]. The outage was not intentional and was not caused by external factors but rather an internal issue within Delta's systems. |
| Duration | temporary | The software failure incident reported in the articles was temporary. It was caused by specific circumstances, such as a power outage or internal computer glitch, rather than by contributing factors present under all circumstances. The incident was described as a computer outage that affected flights scheduled for a specific morning, leading to cancellations and delays [46607]. Delta's chief executive mentioned that employees were working around the clock to restore service, indicating efforts were being made to resolve the issue [47393]. |
| Behaviour | crash, omission, other | (a) crash: The software failure incident in the Delta Air Lines case can be categorized as a crash. The incident led to a complete shutdown of the airline's computer systems worldwide, resulting in hundreds of canceled and delayed flights. The failure was caused by a piece of electrical equipment failure at one of Delta's facilities, leading to a cascade of flight disruptions [46631]. (b) omission: The software failure incident can also be categorized as an omission. Passengers were affected by the system's failure to perform its intended functions, such as issuing boarding passes, processing flight plans, and providing accurate flight information. The outage caused delays, cancellations, and confusion among passengers due to the system's inability to function properly [47393, 46607]. (c) timing: The timing of the software failure incident can be considered a factor in the overall failure. The outage occurred early in the morning, disrupting flights scheduled for that day. The incident caused delays and cancellations, impacting passengers who were expecting to travel on Monday morning. The timing of the failure led to significant disruptions in Delta's operations [46607]. (d) value: The software failure incident did not specifically indicate a failure due to the system performing its intended functions incorrectly (value). The focus was more on the system's complete shutdown, leading to flight disruptions and operational challenges [46631, 47393, 47391, 46607]. (e) byzantine: The software failure incident did not exhibit characteristics of a byzantine failure, which involves inconsistent responses and interactions within the system. The Delta Air Lines incident was primarily characterized by a widespread system outage that affected various aspects of the airline's operations, leading to flight cancellations and delays [46631, 47393, 47391, 46607]. (f) other: The behavior of the software failure incident can be described as a systemwide outage that crippled Delta's mission control center, leading to static check-in lanes, handwritten boarding passes, and incorrect flight information displayed on departure boards and smartphone apps. The failure was attributed to the complexity of the system, outdated components, and the critical nature of the affected systems. The incident highlighted the challenges airlines face in maintaining and updating their intricate computer systems [46607]. |
| Layer | Option | Rationale |
|---|---|---|
| Perception | None | None |
| Communication | None | None |
| Application | None | None |
| Category | Option | Rationale |
|---|---|---|
| Consequence | delay | The consequence of the software failure incident described in the articles is primarily related to delays experienced by passengers due to flight cancellations and disruptions in airline operations. Passengers were stranded in airports, flights were canceled, and delays occurred as a result of the software failure incidents at Delta Air Lines [46631, 47393, 47391, 46607]. This aligns with option (e) delay. |
| Domain | information, transportation | (a) The failed system was intended to support the information industry, specifically in the context of airline operations and global flight management systems. The articles mention how Delta Air Lines' computer systems, which are crucial for managing flights, crews, and passengers globally, experienced a major failure, leading to widespread cancellations and delays [46631, 47393, 47391, 46607]. (b) The transportation industry was impacted by the software failure incident, particularly in the context of airline operations. The failure of Delta Air Lines' computer systems disrupted the movement of people and flights, resulting in numerous cancellations and delays across various locations [46631, 47393, 47391, 46607]. (m) The software failure incident was not directly related to any other industry beyond the information and transportation sectors, as the focus of the articles was on the impact on airline operations and global flight management systems [46631, 47393, 47391, 46607]. |
Article ID: 46631
Article ID: 47393
Article ID: 47391
Article ID: 46607