Incident: O2 Network Outage: Massive Failure Due to Faulty Software Upgrade

Published Date: 2012-07-12

Postmortem Analysis
Timeline 1. The software failure incident at O2 began on a Wednesday and lasted about 24 hours, with the network coming back online on Thursday afternoon [13165]. Given the article's publication date of 2012-07-12, the incident occurred in July 2012.
System 1. Faulty software upgrade to one of O2's core systems, specifically the home location register [13165].
Responsible Organization 1. O2, whose faulty software upgrade to one of its core systems caused the software failure incident [13165].
Impacted Organization 1. O2 customers, approximately 7.6 million of them, were impacted by the software failure incident as they experienced a 24-hour blackout across the UK and Ireland [13165].
Software Causes 1. The software failure incident at O2 was caused by a faulty software upgrade to one of O2's core systems, specifically in the home location register, which is a database of customers linked to their telephone numbers on the network [13165].
Non-software Causes 1. The faulty upgrade was carried out in preparation for the London Olympics [13165]. 2. The problem was believed to be centered on O2's home location register, a database of customers linked to their telephone numbers on the network [13165].
Impacts 1. Almost 8 million O2 customers were affected by the 24-hour blackout, unable to receive calls or text messages [13165]. 2. O2 saw roughly 200,000 failed call attempts every half an hour and a drop of about a third in phone calls and text messages to O2 customers during the outage [13165]. 3. O2's reputation was damaged, with its chief executive admitting the outage was humiliating for the company [13165]. 4. O2 is likely to have to pay compensation to the affected customers and is facing an investigation by telecoms regulator Ofcom into the cause of the downtime [13165]. 5. The outage occurred at an inconvenient time for O2 as mobile networks were competing to attract new customers in the booming smartphone market [13165].
Preventions 1. Implementing thorough testing procedures for software upgrades before deployment could have prevented the software failure incident [13165]. 2. Conducting a risk assessment prior to implementing the software upgrade to identify potential issues and mitigate them in advance could have helped prevent the outage [13165]. 3. Ensuring redundancy and failover mechanisms in critical systems to minimize the impact of any software failures could have been a preventive measure [13165].
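The preventive measures above (testing an upgrade before deployment and keeping a fallback) can be illustrated with a minimal sketch. It assumes a toy home location register modeled as a dictionary from subscriber ID to phone number. The names `verify_registrations`, `deploy_with_rollback`, `good_upgrade`, and `faulty_upgrade` are hypothetical and are not taken from the article or from any real HLR product; the article does not describe how O2's upgrade was actually tested or deployed.

```python
from typing import Callable, Dict

# Toy stand-in for a home location register (HLR): subscriber ID -> phone number.
# All identifiers and values here are illustrative assumptions.
Hlr = Dict[str, str]


def verify_registrations(before: Hlr, after: Hlr) -> bool:
    """Return True only if every subscriber still resolves to the same number."""
    return all(after.get(sub_id) == number for sub_id, number in before.items())


def deploy_with_rollback(live: Hlr, upgrade: Callable[[Hlr], Hlr]) -> Hlr:
    """Apply `upgrade` to a copy of the live data and keep it only if it verifies."""
    candidate = upgrade(dict(live))  # never mutate the live register directly
    if verify_registrations(live, candidate):
        return candidate  # upgrade preserved every registration, so promote it
    return live  # otherwise fall back to the known-good register


def good_upgrade(hlr: Hlr) -> Hlr:
    # Hypothetical upgrade that preserves every record.
    return dict(hlr)


def faulty_upgrade(hlr: Hlr) -> Hlr:
    # Hypothetical faulty upgrade that loses one subscriber's record.
    broken = dict(hlr)
    broken.pop(next(iter(broken)))
    return broken
```

In this sketch the check runs against a copy, so a faulty upgrade never replaces the live register; a real deployment would verify a staged subset of subscribers rather than the whole database.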
Fixes 1. Conduct a thorough investigation into the faulty software upgrade that caused the outage to understand the root cause and prevent similar incidents in the future [13165]. 2. Implement more robust testing procedures for software upgrades to ensure they do not disrupt core systems and services [13165]. 3. Enhance monitoring systems to quickly detect and respond to any anomalies or failures in the network to minimize downtime [13165]. 4. Develop a comprehensive disaster recovery plan to mitigate the impact of future network failures and ensure a quicker restoration of services [13165].
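The monitoring fix above can be sketched as a simple threshold check on failed call attempts per half-hour window. This is a minimal illustration, not O2's monitoring system: the function names, the threshold, and the requirement of two consecutive windows are all assumptions. Only the figure of roughly 200,000 failed attempts per half hour comes from the article; the other counts in the usage example are made up.

```python
from typing import List


def windows_over_threshold(failed_per_window: List[int], threshold: int) -> List[int]:
    """Return indexes of windows whose failed-call count exceeds `threshold`."""
    return [i for i, count in enumerate(failed_per_window) if count > threshold]


def should_alert(failed_per_window: List[int], threshold: int, consecutive: int = 2) -> bool:
    """Alert once `consecutive` back-to-back windows all exceed `threshold`."""
    run = 0  # length of the current streak of over-threshold windows
    for count in failed_per_window:
        run = run + 1 if count > threshold else 0
        if run >= consecutive:
            return True
    return False


# Usage: hypothetical baseline counts followed by outage-level counts.
counts = [1200, 900, 200000, 200000]
print(windows_over_threshold(counts, 50000))
print(should_alert(counts, 50000))
```

Requiring consecutive windows is one way to avoid alerting on a single noisy spike; the right threshold would depend on the network's normal failure rate, which the article does not give.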
References 1. Chief executive, Ronan Dunne 2. O2 customers 3. Telecoms regulator Ofcom 4. Operators of other mobile networks 5. France Telecom's Orange 6. Spokesman for Ofcom 7. Rival mobile networks 8. O2's core systems 9. O2's home location register

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization, multiple_organization (a) The article reports that the network meltdown at O2 followed similar IT failures at NatWest and RBS. Those failures occurred at other organizations, so the article itself does not document an earlier software failure incident at O2 [13165]. (b) The article also mentions that France Telecom's Orange suffered a nine-hour network collapse the previous week and offered compensation to its customers, indicating that other organizations have faced similar software failure incidents [13165].
Phase (Design/Operation) design, operation (a) The software failure incident at O2 was attributed to a faulty software upgrade in preparation for the London Olympics, which caused a network meltdown. The upgrade was related to one of O2's core systems, specifically the home location register, which is a database of customers linked to their telephone numbers on the network. This design-related failure led to the inability to correctly register handsets to some customers over the 24-hour period, resulting in the widespread outage across the UK and Ireland [13165]. (b) The operation-related factors contributing to the software failure incident at O2 included the impact on customers due to the lack of service during the blackout. O2 customers were unable to receive calls or text messages, with roughly 200,000 failed call attempts every half an hour on Wednesday evening. The outage affected approximately 7.6 million customers, leading to dropped phone calls and failed text messages. The operation of the network was disrupted, causing inconvenience and frustration among users, ultimately requiring compensation and an investigation by the telecoms regulator Ofcom [13165].
Boundary (Internal/External) within_system (a) The software failure incident related to the O2 network blackout was primarily within the system. The outage was caused by a faulty software upgrade to one of O2's core systems, specifically the home location register, which is a database of customers linked to their telephone numbers on the network. This internal software issue led to the network being unable to correctly register handsets to some customers, resulting in the widespread blackout affecting approximately 7.6 million customers [13165].
Nature (Human/Non-human) non-human_actions, human_actions (a) The software failure incident at O2 was caused by a faulty software upgrade to one of O2's core systems, specifically in the home location register, which is a database of customers linked to their telephone numbers on the network. This faulty upgrade led to the network being unable to correctly register handsets to some customers over the 24-hour period, resulting in the widespread outage across the UK and Ireland. This issue was a non-human action that contributed to the failure [13165]. (b) In response to the outage, O2's chief executive, Ronan Dunne, admitted that the massive outage was humiliating for the company, indicating the impact of human actions in the incident. Additionally, O2 faced an investigation by telecoms regulator Ofcom into what caused the downtime, and Dunne mentioned that O2 would offer some form of compensation to the affected customers, showing the involvement of human actions in addressing the aftermath of the failure [13165].
Dimension (Hardware/Software) software (a) The software failure incident at O2 was primarily attributed to a faulty software upgrade in one of O2's core systems, specifically in the home location register, which is a database of customers linked to their telephone numbers on the network. This faulty upgrade caused the network to be unable to correctly register handsets to some customers, leading to the widespread outage across the UK and Ireland. The article mentions that if the problem had affected critical hardware like a telephone mast, the outage would have been confined to a smaller geographic area, indicating that the issue originated in the software rather than hardware [13165]. (b) The software failure incident at O2 was directly linked to a faulty software upgrade in one of O2's core systems, specifically in the home location register. This software issue prevented the correct registration of handsets to some customers, resulting in the 24-hour blackout experienced by almost 8 million customers. The article highlights that the IT failure was caused by a faulty software upgrade, indicating that the contributing factors that led to the incident originated in the software rather than hardware [13165].
Objective (Malicious/Non-malicious) non-malicious (a) The software failure incident at O2 was non-malicious. The outage was caused by a faulty software upgrade in preparation for the London Olympics, which led to the network blackout affecting almost 8 million customers [13165]. The issue was related to a core system software upgrade that impacted the home location register, preventing correct handset registration for some customers. This indicates that the failure was not due to malicious intent but rather a technical error during the upgrade process.
Intent (Poor/Accidental Decisions) poor_decisions (a) The software failure incident at O2 was primarily due to poor decisions related to a faulty software upgrade. The outage was caused by a faulty upgrade in preparation for the London Olympics, which led to the network meltdown affecting almost 8 million customers [13165]. The faulty upgrade to one of O2's core systems, specifically the home location register, resulted in the inability to correctly register handsets to some customers over the 24-hour period, causing widespread disruption [13165]. This indicates that the failure was a result of poor decisions made during the upgrade process.
Capability (Incompetence/Accidental) development_incompetence, accidental (a) The software failure incident at O2 was attributed to a faulty software upgrade in preparation for the London Olympics, which caused a network blackout affecting almost 8 million customers. This indicates a failure due to development incompetence, as the upgrade introduced contributing factors that led to the outage [13165]. (b) The outage at O2 was described as an "embarrassing" network failure that resulted in a 24-hour blackout across the UK and Ireland. The incident was characterized by O2's chief executive as humiliating for the company, suggesting that the failure was accidental in nature, possibly not intended or foreseen by the organization [13165].
Duration temporary The software failure incident reported in Article 13165 was temporary. The outage lasted for 24 hours, starting at lunchtime on Wednesday and ending on Thursday afternoon. O2 deployed hundreds of engineers to solve the problem, and the network eventually came back online after the 24-hour blackout. The outage was caused by a faulty software upgrade in preparation for the London Olympics, affecting O2's core systems and leading to issues with registering handsets to some customers over the 24-hour period [13165].
Behaviour crash, omission, value (a) crash: The software failure incident in Article 13165 resulted in a network blackout of about 24 hours, during which O2 customers were unable to receive calls or text messages. The outage was described as a "massive outage" and a "network meltdown" caused by a faulty software upgrade in preparation for the London Olympics. The system lost its state and did not perform its intended functions, leading to a complete crash of the network [13165]. (b) omission: The software failure incident also led the system to omit its intended functions. Customers were affected by the lack of service, with roughly a third of O2's customers potentially impacted by the outage. Phone calls and text messages to O2 customers dropped by about a third during the 24-hour blackout, indicating an omission of normal communication services [13165]. (c) timing: The article does not describe the system performing its intended functions too early or too late. The timing it discusses is circumstantial: the outage occurred while O2 was preparing for the London Olympics and mobile networks were competing to attract new customers in the smartphone market, at a time when the network was handling 125 million phone calls a day [13165]. (d) value: The software failure incident also resulted in the system performing its intended functions incorrectly. O2's network was unable to correctly register handsets to some customers over the 24-hour period due to a faulty software upgrade in one of its core systems. This incorrect performance led to a blackout affecting approximately 7.6 million customers, requiring O2 to offer compensation for the service they did not receive [13165]. (e) byzantine: The software failure incident did not exhibit behavior indicative of a byzantine failure, which involves inconsistent responses and interactions. The incident was a network blackout caused by a faulty software upgrade, leading to a loss of service for O2 customers. The problem centered on the home location register and the inability to register handsets correctly, rather than on erratic or inconsistent behavior [13165]. (f) other: The software failure incident in Article 13165 did not exhibit behavior falling under the "other" category. The primary issues were crash, omission, and value failures caused by a faulty software upgrade in O2's core systems, leading to a significant network blackout and disruption for millions of customers [13165].

IoT System Layer

Layer Option Rationale
Perception processing_unit, network_communication The software failure incident at O2 was related to the network communication layer of the cyber physical system that failed. The outage was caused by a faulty software upgrade to one of O2's core systems, specifically the home location register, which is a database of customers linked to their telephone numbers on the network. This issue prevented correct registration of handsets to some customers over the 24-hour period, leading to the widespread blackout across the UK and Ireland [13165].
Communication connectivity_level The software failure incident at O2 was related to the communication layer of the cyber physical system that failed. The outage was caused by a faulty software upgrade to one of O2's core systems, specifically in the home location register, which is a database of customers linked to their telephone numbers on the network. This issue prevented correct registration of handsets to some customers over the 24-hour period, indicating a failure at the communication layer of the system [13165].
Application TRUE The software failure incident reported in Article 13165 was related to a faulty software upgrade in one of O2's core systems. The problem was specifically identified as a faulty upgrade in preparation for the London Olympics, which caused the network meltdown. The issue was centered around O2's home location register, which is a database of customers linked to their telephone numbers on the network. This indicates that the failure was indeed related to the application layer of the cyber physical system, as it involved a software upgrade causing issues with customer registration and connectivity [13165].

Other Details

Category Option Rationale
Consequence property, delay, theoretical_consequence The consequence of the software failure incident reported in Article 13165 was primarily related to property and delay. The software failure led to a 24-hour blackout across the UK and Ireland, impacting almost 8 million O2 customers. During this blackout, customers were unable to receive calls or text messages, so communications were delayed or blocked and customers did not receive the service they were paying for [13165]. Additionally, the outage embarrassed the company and its chief executive, Ronan Dunne, who publicly apologized for the disruption in service [13165]. The incident also led to an investigation by the telecoms regulator Ofcom, indicating potential regulatory consequences [13165]. Compensation was mentioned as a form of redress for the affected customers, suggesting financial implications [13165]. The outage was attributed to a faulty software upgrade in one of O2's core systems, highlighting the impact of software failures on service reliability and customer experience [13165].
Domain unknown The software failure incident reported in Article 13165 is related to the telecommunications industry, specifically affecting O2, the UK's second-largest mobile network. The incident caused a 24-hour blackout across the UK and Ireland, impacting almost 8 million customers. The outage was attributed to a faulty software upgrade in one of O2's core systems, particularly in the home location register, which is a database of customers linked to their telephone numbers on the network. This failure disrupted services such as calls and text messages, highlighting the critical role of software systems in supporting the telecommunications sector [13165].

Sources
