Incident: British Airways IT Failure: Power Surge or Infrastructure Weakness?

Published Date: 2017-05-30

Postmortem Analysis
Timeline 1. The British Airways IT failure was reported in an article published on May 30, 2017 [Article 59063]. 2. The incident itself is estimated to have occurred around the weekend before May 30, 2017.
System 1. The power system, including surge protection, the uninterruptible power supply (UPS), and a quality earthing system [59063]. British Airways attributed the failure to a power surge that supposedly rendered the back-up system ineffective. Experts questioned this explanation and suggested that bad design or other factors beyond a power surge may have been involved [59063].
Responsible Organization 1. British Airways' claim of a "power surge" as the cause of the catastrophic IT failure has been questioned by experts [59063]. 2. Data centre designers have pointed out that the lack of resilience in data centres to deal with common problems, such as power surges, could have contributed to the incident [59063]. 3. The electricity companies SSE and UK Power Networks, which provide energy to the area where BA's data centre is located, denied there had been a power surge, suggesting a discrepancy in the reported cause of the failure [59063]. 4. Experts highlighted the importance of proper infrastructure and maintenance practices in preventing power-related issues that could lead to data centre outages [59063].
Impacted Organization 1. British Airways [59063]
Software Causes 1. Unknown
Non-software Causes 1. Lack of resilience in data centres to deal with common problems such as power surges [59063]. 2. Inadequate infrastructure to ensure continuous power supply to IT equipment [59063]. 3. Potential issues with the rebooting of crucial databases or executing procedures like server restoration after a power outage [59063]. 4. Outdated infrastructure in the airline industry, including data centres, which may not be equipped to handle modern challenges like power surges [59063].
Impacts 1. Flight cancellations and delays: British Airways cancelled and delayed flights, causing inconvenience to passengers [Article 59063]. 2. Data center downtime: The data center and its backup system were rendered ineffective, disrupting the airline's operations [Article 59063]. 3. Reputation damage: The incident likely damaged British Airways' reputation, as customers faced disruption and experts questioned the company's power-surge explanation [Article 59063].
Preventions 1. Implementing proper surge protection measures, including surge protection devices and uninterruptible power supplies, to safeguard against power surges [59063]. 2. Regularly testing and maintaining critical systems such as databases and servers to ensure they can handle power restoration procedures without issues [59063]. 3. Investing in updated infrastructure and ensuring data centers are equipped with modern technology to prevent failures due to outdated systems [59063].
Fixes 1. Implement surge protection and uninterruptible power supply (UPS) systems in the data centre to safeguard against power surges [59063]. 2. Conduct regular testing and maintenance of crucial databases and servers to ensure they can be rebooted and restored effectively [59063]. 3. Invest in updating and modernizing outdated infrastructure to prevent future incidents related to power problems and system failures [59063].
References 1. Data centre designers, such as James Wilman from Future-tech and Andy Hirst from Sudlows [Article 59063] 2. Electricity companies SSE and UK Power Networks [Article 59063] 3. Matthew Bloch, managing director of Bytemark Hosting [Article 59063] 4. Barry Elliott, director of Capitoline consultants [Article 59063]
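The restoration concern raised under Non-software Causes, Preventions, and Fixes (rebooting crucial databases and restoring servers after power returns) can be illustrated with a minimal sketch. It assumes, purely for illustration, that services must start only after the services they depend on. The service names and the `restart_order` helper are hypothetical and are not taken from the article or from British Airways' systems.

```python
from graphlib import TopologicalSorter

# Hypothetical services mapped to the services they depend on.
# These names are illustrative assumptions, not details from the article.
DEPENDS_ON = {
    "storage": set(),
    "database": {"storage"},
    "message_bus": {"storage"},
    "booking_app": {"database", "message_bus"},
}


def restart_order(depends_on):
    """Return a start-up order in which each service follows its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


order = restart_order(DEPENDS_ON)
print(order)
```

A restoration drill could walk this order and check each service's health before starting the next one, which is one way to rehearse a power-restoration procedure before a real outage.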

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization (a) one_organization: The article mentions that the airline industry, including British Airways, is known for running outdated infrastructure long after standards have improved. It also notes that passenger booking systems used by multiple airlines were revealed to be vulnerable to hackers [59063]. (b) multiple_organization: The article does not provide specific information about similar incidents at other organizations, so it is unknown whether similar incidents have occurred elsewhere.
Phase (Design/Operation) design, operation (a) The article mentions experts questioning British Airways' claim that the catastrophic IT failure was solely due to a "power surge." Data center designers highlighted that a power surge should not have been able to bring down a data center and its backup system, suggesting potential issues with the design of the system or the presence of additional factors beyond just a power surge [Article 59063]. (b) The article discusses the importance of testing procedures like rebooting crucial databases and restoring servers after a power outage. Matthew Bloch, the managing director of Bytemark Hosting, raised concerns about the procedures followed when the power was turned back on, indicating potential operational factors contributing to the software failure incident [Article 59063].
Boundary (Internal/External) within_system, outside_system (a) within_system: The incident appears to have been influenced by factors originating within the system itself. Experts questioned British Airways' claim that the catastrophic IT failure was caused by a "power surge" [59063]. Data centre designers pointed to the design and resilience of the data centre infrastructure, noting that surge protection, an uninterruptible power supply, and a quality earthing system should have protected against power surges [59063]. The article also discusses the importance of testing procedures such as rebooting crucial databases and restoring servers, which points to potential internal vulnerabilities that could have contributed to the failure [59063]. (b) outside_system: The article also suggests that factors external to the system may have played a role. The electricity companies supplying the area where BA's data centre is located denied that there had been a power surge, calling BA's explanation into question [59063]. This leaves a discrepancy between BA's own account of the cause and external verification.
Nature (Human/Non-human) non-human_actions (a) The software failure incident occurring due to non-human actions: The article mentions that British Airways' catastrophic IT failure was initially attributed to a "power surge" by the company's chief executive. However, experts questioned this explanation, stating that a power surge should not be able to bring down a data centre and its backup systems. Data centre designers highlighted the importance of surge protection, uninterruptible power supply (UPS), and a quality earthing system to protect against power surges. Additionally, it was noted that the real problem might have occurred when the power was turned back on after the outage, raising questions about the testing and procedures related to crucial databases and servers [Article 59063]. (b) The software failure incident occurring due to human actions: The article does not provide direct evidence or mention of human actions contributing to the software failure incident. Therefore, it is unknown if human actions played a role in the British Airways IT failure incident reported in the article.
Dimension (Hardware/Software) hardware (a) The software failure incident in the British Airways case was initially attributed to a "power surge" by the company's chief executive [Article 59063]. However, experts questioned this explanation, stating that a power surge should not be able to bring down a data centre and its backup systems. They highlighted the importance of surge protection, uninterruptible power supply (UPS), and a quality earthing system to protect against power surges, indicating a hardware-related issue in the infrastructure design [Article 59063]. (b) The incident also raised concerns about the resilience of data centres and the infrastructure required to ensure continuous operation of IT equipment. Experts mentioned that failures in the infrastructure, such as power problems, are common causes of data centre outages, indicating a software-related issue in terms of the systems and procedures in place to handle power restoration and database reboots [Article 59063].
Objective (Malicious/Non-malicious) non-malicious (a) The article gives no indication of malicious intent behind the failure. The incident is primarily attributed to issues related to power surges, lack of resilience in data centers, outdated infrastructure, and potential failures in testing and rebooting crucial systems [59063]. These factors point to a non-malicious failure.
Intent (Poor/Accidental Decisions) poor_decisions (a) The incident as reported in the article appears most consistent with poor_decisions. British Airways' chief executive attributed it to a power surge, a claim that experts questioned. Data center designers said that proper protective measures, such as surge protection and an uninterruptible power supply, should have prevented such an incident. The article also highlights the lack of resilience in data centers to common problems like power surges, indicating a potential lack of investment in infrastructure maintenance and upgrades [59063].
Capability (Incompetence/Accidental) accidental (a) The article does not attribute the incident to development incompetence on the part of individuals or the development organization. Its focus is on potential issues related to power surges, infrastructure resilience, and outdated systems. (b) The article points to the possibility that the incident was accidental. Experts raised concerns about the claim that a power surge caused the catastrophic IT failure at British Airways, and the article discusses the lack of resilience in data centers, the importance of proper testing procedures, and the challenges of outdated infrastructure. These factors suggest that the incident may have been accidental rather than the result of development incompetence [59063].
Duration unknown The article does not provide specific information about whether the software failure incident was permanent or temporary.
Behaviour crash (a) crash: The software failure incident in the British Airways case resulted in a catastrophic IT failure that brought down the data center and its backup system, rendering them ineffective [Article 59063]. (b) omission: The incident highlighted the lack of resilience in many data centers to deal with common problems, indicating an omission in ensuring infrastructure to prevent power outages [Article 59063]. (c) timing: The issue was not just the power surge itself but also the consequences of turning the power back on, raising questions about the timing of crucial database reboots and server restoration procedures [Article 59063]. (d) value: The failure was not directly attributed to the system performing its intended functions incorrectly but rather to the inability of the infrastructure to handle power surges effectively [Article 59063]. (e) byzantine: The incident did not exhibit characteristics of a byzantine failure where the system behaves erroneously with inconsistent responses and interactions. Instead, it was primarily a result of the power surge and infrastructure vulnerabilities [Article 59063]. (f) other: The incident also shed light on the issue of outdated infrastructure in the airline industry and other sectors, indicating a broader problem beyond just the immediate software failure incident [Article 59063].

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence delay, theoretical_consequence The consequences discussed in the article relate primarily to people's travel plans and the airline's operations. The article does not mention any direct harm, death, or significant property loss. The main observed consequences are delays and disruption for passengers, as British Airways had to cancel flights and suffered significant IT downtime [59063]. The incident led to inconvenience for passengers, financial losses for the airline, and reputational damage. The article also discusses broader potential consequences, such as the lack of resilience in data centers to common problems and the need for better infrastructure to prevent similar incidents.
Domain transportation (a) The failed system was related to the airline industry, specifically British Airways, as mentioned in the article [59063]. The incident caused a catastrophic IT failure within the airline's operations, leading to significant disruptions in their services. (b) The article does not mention any direct connection to the transportation industry beyond the specific case of British Airways. (c) The incident is not related to the extraction of natural resources. (d) The software failure incident does not pertain to sales transactions. (e) There is no indication that the failure is linked to the construction industry. (f) The software failure incident is not associated with manufacturing processes. (g) The article discusses power-related issues, but it does not directly involve utilities services beyond the power surge aspect. (h) The incident does not involve financial transactions or systems. (i) The failure is not related to knowledge-based activities like education or research. (j) The software failure incident is not connected to the health industry. (k) The incident does not involve the entertainment industry. (l) The failed system is not directly related to government operations or services. (m) The software failure incident is not explicitly linked to any other industry beyond the airline industry discussed in the article.
