Incident: Power Surge at British Airways Data Centre Causes Chaos and Delays

Published Date: 2017-05-31

Postmortem Analysis
Timeline 1. The incident that caused chaos for British Airways began on Saturday morning [59059]. 2. The article was published on 2017-05-31. 3. The incident therefore occurred in May 2017, most likely on Saturday 27 May 2017, the last Saturday before publication.
System The systems that failed in the British Airways incident were: 1. The electrical power supply at the UK data centre, where a loss of power was followed by an uncontrolled return of power and a resulting power surge [Article 59059]. 2. The IT systems hosted at that data centre, which the surge took out [Article 59059].
Responsible Organization 1. British Airways, at whose data centre an "uncontrolled return of power" following an outage physically damaged servers [59059].
Impacted Organization 1. Passengers - Approximately 75,000 passengers were affected as flights were cancelled [Article 59059]. 2. British Airways (BA) - The airline faced significant disruptions and financial implications due to the IT shutdown [Article 59059].
Software Causes 1. unknown
Non-software Causes 1. The IT shutdown at British Airways was caused by an "uncontrolled return of power" following an outage that physically damaged servers at its data centre [59059]. 2. The initial power outage and subsequent surge were attributed to a loss of power to the UK data centre, compounded by the uncontrolled return of power, causing a power surge that took out the IT systems [59059]. 3. The airline mentioned that the incident was not an IT failure and had nothing to do with the outsourcing of IT; it was an electrical power supply interruption [59059].
Impacts 1. About 75,000 passengers were affected as flights were cancelled, leading to chaos for British Airways [Article 59059]. 2. Passengers were left without their luggage for an extended period of time. 3. BA faced potential compensation costs exceeding £100m. 4. BA's parent company, IAG, saw shares initially fall by about 4% in the first day of trading in London after the outage occurred. 5. Travellers had to spend the night sleeping on yoga mats spread on terminal floors after all flights leaving Heathrow and Gatwick were cancelled.
Preventions 1. Proper maintenance and monitoring of the electrical power supply to prevent interruptions and power surges could have prevented the software failure incident [59059]. 2. Retaining a sufficient number of dedicated and loyal IT staff instead of cutting jobs and outsourcing the work to India may have prevented the incident, as suggested by the GMB union [59059].
Fixes 1. Conducting an exhaustive investigation to determine the exact circumstances of the power outage and power surge that caused the IT systems to fail, ensuring it never happens again [59059]. 2. Implementing measures to prevent uncontrolled power returns and power surges that can damage servers and IT systems [59059]. 3. Reviewing and potentially enhancing the electrical power supply infrastructure to prevent future interruptions that could lead to IT failures [59059].
References 1. British Airways statement 2. GMB union 3. BA's chief executive, Alex Cruz 4. Experts 5. BA's parent company, IAG 6. Passengers affected by the incident 7. Heathrow Airport 8. Gatwick Airport 9. BA's customer relations department

Software Taxonomy of Faults

Category Option Rationale
Recurring unknown (a) Recurrence at one_organization: The article does not mention any earlier incident of a similar nature at British Airways or involving its products and services, so there is no indication that this was a recurring failure within the same organization [59059]. (b) Recurrence at multiple_organization: The article does not describe similar incidents at other organizations or involving their products and services, so there is no indication of a recurring failure across multiple organizations [59059].
Phase (Design/Operation) design (a) The software failure incident related to the design phase: The article mentions that the IT shutdown that caused chaos for British Airways was due to an "uncontrolled return of power" following an outage that physically damaged servers at its data centre. The airline stated that there was a loss of power to the UK data centre which was compounded by the uncontrolled return of power, causing a power surge that took out their IT systems. This indicates that the failure was due to contributing factors introduced by the system development or infrastructure design, specifically related to power supply and system resilience [Article 59059]. (b) The software failure incident related to the operation phase: The article does not provide specific information indicating that the failure was due to contributing factors introduced by the operation or misuse of the system. Therefore, it is unknown if the failure was directly related to operational issues [Article 59059].
Boundary (Internal/External) within_system (a) The software failure incident involving the IT shutdown at British Airways was primarily within the system. The incident was caused by an "uncontrolled return of power" following an outage that physically damaged servers at the data centre [Article 59059]. The airline clarified that it was not an IT failure and had nothing to do with outsourcing of IT; instead, it was an electrical power supply interruption that led to the failure [Article 59059]. The airline mentioned that the exact circumstances of the incident needed to be investigated to ensure it does not happen again [Article 59059].
Nature (Human/Non-human) non-human_actions, human_actions (a) The software failure incident was attributed to an "uncontrolled return of power" following an outage that physically damaged servers at the data centre, causing a power surge that took out the IT systems [Article 59059]. (b) The airline denied that the failure was due to outsourcing of IT or an IT failure, stating that it was an electrical power supply interruption. The GMB union accused British Airways of contributing to the issue by cutting the jobs of IT staff and contracting the work to India, implying human actions played a role in the incident [Article 59059].
Dimension (Hardware/Software) hardware (a) The incident was attributed to hardware issues. British Airways stated that the IT shutdown was caused by an "uncontrolled return of power" following an outage that physically damaged servers at its data centre. The loss of power to the UK data centre, compounded by the uncontrolled return of power, caused a power surge that took out its IT systems [59059]. The airline emphasized that it was not an IT failure and had nothing to do with outsourcing of IT; rather, it was an electrical power supply interruption. (b) The article does not mention any contributing factors originating in software.
Objective (Malicious/Non-malicious) non-malicious (a) The incident was non-malicious. It was attributed to an "uncontrolled return of power" following an outage that physically damaged servers at the data centre, causing a power surge that took out the IT systems [Article 59059]. The airline emphasized that it was not an IT failure and had nothing to do with the outsourcing of IT, but rather an electrical power supply interruption. (b) The article reports no malicious activity; the failure resulted from power-related issues.
Intent (Poor/Accidental Decisions) poor_decisions, accidental_decisions British Airways attributed the incident not to an IT failure or to the outsourcing of IT, but to an "uncontrolled return of power" following an outage that physically damaged servers at its data centre [59059]. The article nonetheless points to both poor and accidental decisions that may have contributed: 1. Poor Decisions: The GMB union accused British Airways of greed, suggesting that the issue could have been prevented if the airline had not cut the jobs of IT staff and contracted the work to India [59059]. This implies that the decision to cut jobs and outsource IT work may have played a role. 2. Accidental Decisions: The exact cause of the initial power outage and subsequent surge had not been revealed at the time of publication. British Airways said that there was a loss of power to the UK data centre, compounded by the uncontrolled return of power, causing a power surge that took out its IT systems [59059]. This suggests that the incident may have resulted from unintended consequences or accidental decisions related to power management. Both poor decisions (outsourcing and cutting IT jobs) and accidental decisions (uncontrolled return of power) could therefore have been contributing factors.
Capability (Incompetence/Accidental) accidental (a) The software failure incident does not seem to be directly attributed to development incompetence based on the information provided in the article [59059]. The incident was described as an "uncontrolled return of power" following an outage that physically damaged servers at the data centre, leading to a power surge that took out the IT systems. The airline emphasized that it was not an IT failure and had nothing to do with the outsourcing of IT, but rather an electrical power supply interruption. (b) The software failure incident appears to be accidental in nature based on the information provided in the article [59059]. The power outage and subsequent surge that caused the IT systems to fail were described as unexpected events that led to the disruption of services, affecting around 75,000 passengers. The airline stated that they need to find out why the power outage occurred and are conducting an exhaustive investigation to prevent such incidents from happening again.
Duration temporary The incident reported in Article 59059 was temporary. The failure was attributed to an "uncontrolled return of power" following an outage that physically damaged servers at British Airways' data centre. Flights were cancelled, affecting about 75,000 passengers; the airline was unable to resume a full schedule until Tuesday, and many passengers were still without their luggage. The airline said that the loss of power and its uncontrolled return caused a power surge that took out its IT systems, emphasizing that it was not an IT failure but an electrical power supply interruption. It also said it was conducting an exhaustive investigation to prevent such incidents in the future. Because a full schedule resumed within days, the failure was temporary and caused by specific circumstances [59059].
Behaviour omission, other (a) crash: The incident led to flights being cancelled, affecting about 75,000 passengers. The airline was unable to resume a full schedule until Tuesday, and many passengers were left without their luggage [59059]. (b) omission: The systems failed to deliver their expected service: flights were cancelled, causing chaos for passengers, and the processing of delayed bags was held up, with the airline admitting that it may take some time to complete the process [59059]. (c) timing: The incident disrupted passengers' travel schedules. Passengers had to spend the night at terminals after all flights leaving Heathrow and Gatwick were cancelled [59059]. (d) value: The article does not mention the system performing its intended functions incorrectly [59059]. (e) byzantine: The article does not describe inconsistent responses or interactions [59059]. (f) other: The incident was attributed to an "uncontrolled return of power" following an outage that physically damaged servers at the data centre, causing a power surge that took out the IT systems. The airline classified it not as an IT failure but as an electrical power supply interruption [59059].

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence property (d) property: People's material goods, money, or data were impacted by the incident. About 75,000 passengers were affected, flights were cancelled, and many passengers were left without their luggage. The airline faced potential compensation costs of over £100m, and its parent company, IAG, saw shares initially fall by about 4% in the first day of trading in London after the outage occurred [Article 59059].
Domain transportation (a) The failed system was intended to support the transportation industry. The software failure incident affected British Airways, leading to flight cancellations and chaos for about 75,000 passengers [Article 59059].
