Incident: O2 Network Outage Caused by Ericsson Software Glitch.

Published Date: 2018-12-06

Postmortem Analysis
Timeline 1. The software failure incident happened on December 6, 2018, as indicated by the articles and their publication dates [78847, 79622, 79621, 79665].
System 1. Serving GPRS Support Node - Mobility Management Entity (SGSN-MME) nodes [79621] 2. Ericsson's software [78871, 79308]
Responsible Organization 1. Ericsson - Ericsson's faulty software was identified as the root cause of the software failure incident that affected millions of smartphone users in Britain, Japan, and other countries [79622, 79621]. 2. O2 - O2 experienced the network outage due to a failure in systems operated by Ericsson, its equipment supplier [78847, 79621].
Impacted Organization 1. O2 network customers, including 25 million O2 subscribers and customers of Tesco Mobile, Sky Mobile, GiffGaff, and Lycamobile [78847, 78871, 79228, 78878, 79308]. 2. Softbank customers in Japan [79622]. 3. Businesses relying on O2 services, such as gig economy workers, taxi firms, couriers, food delivery services, and Transport for London's electronic timetable service at bus stops [78847, 78871, 79228, 79308].
Software Causes 1. The software glitch causing the failure incident was identified as an issue in certain nodes in the core network, resulting in network disturbances for some customers [Article 79621]. 2. Ericsson confirmed that the disruption was caused by a problem with its software, specifically an expired certificate in the software versions installed with affected customers [Article 79621]. 3. The outage was attributed to out-of-date software licenses in Ericsson's systems, affecting O2 after implementing the latest version of the supplier's systems [Article 78871].
Non-software Causes 1. The outage was caused by expired certificates in the software versions installed with the affected customers [79621]. 2. The problem stemmed from issues at exchanges for the high-speed wireless LTE network [79622].
Impacts 1. Millions of O2 customers were left unable to access online data services, use apps, or make calls, causing immense stress and difficulty for users, including those with health conditions like epilepsy and diabetes [78847]. 2. Various individuals, such as a heating engineer, a plumber, and a cleaner, faced challenges in their work because they rely heavily on mobile data for their businesses and personal safety monitoring systems [78847]. 3. The outage affected not only O2 customers but also users of Tesco Mobile, Sky Mobile, GiffGaff, and Lycamobile, totaling around 32 million customers [78847, 79308]. 4. The disruption led to issues with electronic bus timetable updates, affecting commuters and transportation services [78847, 79308]. 5. Smart meters were impacted, leading to the cancellation of installations and affecting services that rely on O2 data [78847]. 6. Mobile payment services like Apple Pay and Google Pay were blocked, impacting users at shop tills and on transport networks [78847]. 7. The outage caused frustration, financial losses, and inconvenience to individuals and businesses, with reports of increased Uber fares, disrupted home visit services, and challenges in navigating without GPS [78847]. 8. The glitch had a significant impact on various sectors, including taxi firms, couriers, food delivery services, and businesses relying on mobile data for operations [78847]. 9. The outage highlighted the vulnerability of the nation's communications system and the reliance of Britons on smartphone access to the internet [78847]. 10. O2 customers were offered compensation in the form of phone credit, discounts, and refunds for the inconvenience caused by the software failure incident [78983, 79665, 78878].
Preventions 1. Regularly renewing software licenses and certificates before they expire could have prevented the network disturbances caused by the expired certificate [Article 79621]. 2. Conducting thorough audits and reviews of software systems to ensure proper management and functionality could have helped prevent the outage [Article 78871]. 3. Implementing robust testing procedures for software updates, so that potential issues are caught before they reach customers, could have prevented the disruption [Article 78878].
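The first prevention above amounts to tracking certificate expiry dates and raising a warning well before they lapse. The following is a minimal sketch of such a check in Python, not a description of Ericsson's or O2's actual tooling. The inventory, node names, dates, and the 30-day warning window are all hypothetical and chosen only for illustration; the timestamp format is the "notAfter" format returned by Python's ssl.getpeercert().

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the number of days left before a certificate expires.

    not_after uses the format found in ssl.getpeercert() output,
    for example "Mar  1 00:00:00 2030 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def classify(not_after: str, now: datetime, warn_days: int = 30) -> str:
    """Label a certificate as 'expired', 'expiring_soon', or 'ok'."""
    remaining = days_until_expiry(not_after, now)
    if remaining <= 0:
        return "expired"
    if remaining <= warn_days:
        return "expiring_soon"
    return "ok"


# Hypothetical inventory: node name -> certificate notAfter timestamp.
# These names and dates are illustrative assumptions, not incident data.
INVENTORY = {
    "node-a": "Jan  1 00:00:00 2030 GMT",
    "node-b": "Mar  1 00:00:00 2030 GMT",
    "node-c": "Dec  1 00:00:00 2030 GMT",
}


def audit(inventory: dict, now: datetime, warn_days: int = 30) -> dict:
    """Classify every certificate in the inventory as of the given time."""
    return {name: classify(ts, now, warn_days) for name, ts in inventory.items()}


if __name__ == "__main__":
    # A fixed "now" keeps the example reproducible.
    as_of = datetime(2030, 2, 10, tzinfo=timezone.utc)
    for node, status in audit(INVENTORY, as_of).items():
        print(node, status)
```

Run on a schedule against a real inventory, a check like this would flag a certificate weeks before it expires rather than at the moment service fails. The warning window is a policy choice and would need to be longer than the time required to roll out a renewed certificate.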
Fixes 1. Decommissioning the faulty software that caused the issues [79621]. 2. Conducting a full audit and review with Ericsson to understand what went wrong and prevent future occurrences [78871]. 3. Restoring the network services fully by working closely with Ericsson to resolve the problem [79621]. 4. Offering compensation to customers affected by the outage, such as crediting two days of monthly airtime subscription charges for monthly customers and providing discounts for prepaid customers [78983, 79665]. 5. Monitoring service performance closely to ensure stability and conducting a review to fully understand the cause of the issue [78878]. 6. Implementing measures to limit the impact of software failures and restore services as quickly as possible [79621].
References 1. Article 78847 2. Article 79622 3. Article 78983 4. Article 79228 5. Article 79621 6. Article 79665 7. Article 78878 8. Article 79308 9. Article 78871

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization, multiple_organization (a) The software failure incident having happened again at O2: - O2 experienced a software glitch causing smartphone users to lose internet access, leading to a day-long data network collapse [Article 79228]. - O2 planned to compensate customers with phone credit due to the outage [Article 79665]. - O2 offered customers up to two days' credit in compensation as a goodwill gesture after a UK-wide outage [Article 78878]. (b) The software failure incident having happened again at multiple organizations: - The software glitch affected multiple mobile providers including O2, Tesco Mobile, and Sky Mobile, impacting over 30 million customers [Article 79308]. - Ericsson, the equipment supplier for O2 and Softbank, confirmed that a software problem caused disruptions for operators in multiple markets [Article 79622]. - Ericsson identified an issue in certain nodes in the core network, causing network disturbances for some customers of various mobile companies [Article 79621].
Phase (Design/Operation) design, operation (a) The software failure incident occurring due to the development phases: - The software failure incident affecting O2 and other mobile operators was attributed to faulty software developed by Ericsson, the equipment supplier for O2 ([78847], [79622]). - Ericsson confirmed that the disruption was caused by a problem with its software, specifically expired certificates in the software versions installed with affected customers ([79622]). - O2's chief executive mentioned that the issue was related to the Ericsson software and that a full audit would be conducted to understand what went wrong ([78878]). - The technical fault caused a UK-wide outage that impacted over 30 million customers and was unlikely to be fixed until the following morning ([79308]). (b) The software failure incident occurring due to the operation phases: - The outage affected O2's entire network, preventing customers from getting online on their phones, and later, customers were unable to make calls due to the volume of demand on the network ([79308]). - O2 initially stated that voice calls were still operational, but later admitted that customers could not make calls due to the network strain ([79308]). - The outage disrupted services such as Transport for London's live updates of bus arrival times, which rely on O2's network for data updates ([79308]). - O2's equipment supplier, Ericsson, acknowledged that its software caused the problem, impacting not only O2 but also other mobile operators in multiple markets ([78878], [79622]).
Boundary (Internal/External) within_system (a) within_system: The software failure incident was primarily caused by a problem with the software operated by Ericsson, the equipment supplier for O2. Ericsson confirmed that the disruption affecting O2 and other operators in multiple markets was due to a software fault [79622]. An initial analysis indicated that expired certificates in the software versions installed with affected customers were to blame for the outage [79621]. O2's chief executive mentioned that the issue was related to the Ericsson software and that both companies were working to restore services [78878]. (b) outside_system: The software failure incident was not attributed to factors originating from outside the system. The outage was acknowledged to be a result of a problem with the software within the core network nodes, specifically the Serving GPRS Support Node - Mobility Management Entity (SGSN-MME) [79621]. The outage was not linked to illegal activity or hacking [78847].
Nature (Human/Non-human) non-human_actions, human_actions (a) The software failure incident occurring due to non-human actions: - The software failure incident affecting O2 and other mobile providers was caused by a technical fault in the network, specifically due to a problem with Ericsson's software [Article 79308]. - Ericsson confirmed that the issue was related to certain nodes in the core network, resulting in network disturbances for customers, and the faulty software causing the problem was being decommissioned [Article 79621]. - The outage was attributed to an expired certificate in the software versions installed with affected customers, leading to network disturbances for millions of users in multiple markets [Article 79621]. (b) The software failure incident occurring due to human actions: - The outage was blamed on a failure in systems operated by O2’s equipment supplier, Ericsson, indicating a potential human error in managing the software systems [Article 78847]. - O2's chief executive mentioned that they were working with Ericsson to identify the issue, suggesting a collaborative effort to understand the root cause, which could involve human actions in managing the software [Article 79308]. - O2's CEO expressed apologies for letting customers down, indicating a sense of responsibility for the incident, which could imply human involvement in the chain of events leading to the software failure [Article 79308].
Dimension (Hardware/Software) software (a) The software failure incident occurring due to hardware: - There is no specific mention in the articles about the software failure incident being caused by hardware issues. The primary cause of the outage was attributed to a software glitch in Ericsson's equipment, leading to disruptions in the network services [79622, 79621]. (b) The software failure incident occurring due to software: - The software failure incident was primarily caused by a software glitch in Ericsson's equipment, affecting millions of smartphone users in Britain, Japan, and other countries, leading to network disturbances and outages on 4G networks [79622, 79621]. - Ericsson confirmed that the problem was related to faulty software, specifically an issue in certain nodes in the core network, resulting in network disturbances for customers [79621]. - O2 and Japan's Softbank reported outages on their 4G networks due to problems with Ericsson's software, with Ericsson identifying an issue in specific software versions installed with affected customers [79621]. - O2 planned to compensate customers for the software glitch by offering phone credit and other forms of compensation [78983, 79665]. - The outage affected various services and customers, including those using O2's network and its subsidiary services like Tesco Mobile, Sky Mobile, Giffgaff, and Lycamobile, all of which rely on O2's platform [79308].
Objective (Malicious/Non-malicious) non-malicious (a) The software failure incident was non-malicious. The incident was caused by a technical fault in the software provided by Ericsson, O2's equipment supplier. Ericsson confirmed that the problem was due to faulty software and an expired certificate in the software versions installed with affected customers [79621]. O2's chief executive, Mark Evans, mentioned that the company was confident in fully restoring the service by the next morning and acknowledged the issue as a technical fault [79308]. Additionally, O2 announced compensation plans for affected customers, such as crediting two days of monthly airtime subscription charges and offering discounts, indicating a goodwill gesture rather than punitive measures typically associated with malicious intent [78983, 79665].
Intent (Poor/Accidental Decisions) accidental_decisions (a) The software failure incident was primarily due to accidental decisions rather than poor decisions. The incident was caused by a technical fault in the software provided by Ericsson, O2's equipment supplier. Ericsson confirmed that the issue was with their software, specifically an expired certificate in the software versions installed with affected customers [79621]. O2's chief executive, Mark Evans, mentioned that the company was confident the service would be fully restored by the next morning and expressed apologies for letting customers down [79308]. Additionally, O2 announced compensation plans for affected customers, including crediting two days of monthly airtime subscription charges for monthly customers and offering a 10% discount and credit for prepaid customers [78983]. This response indicates that the failure was not a result of deliberate poor decisions but rather an unintended consequence of a technical fault.
Capability (Incompetence/Accidental) development_incompetence, accidental (a) The software failure incident occurring due to development_incompetence: - The outage was blamed on a failure in systems operated by O2’s equipment supplier, Swedish firm Ericsson [78847]. - Ericsson admitted that its software had caused the problem, mentioning that an initial analysis indicated that expired certificates in the software versions installed with the affected customers were to blame [79621]. - O2's chief executive mentioned that they were confident they knew what the issue was and were working with Ericsson to identify and fix it [79308]. (b) The software failure incident occurring accidentally: - Ericsson mentioned that an initial root cause analysis indicated that the main issue was an expired certificate in the software versions installed with the affected customers, suggesting an accidental introduction of the problem [79621]. - O2's equipment supplier, Ericsson, apologized for the inconvenience caused by the faulty software and mentioned that it was being decommissioned, indicating an accidental nature of the issue [79621].
Duration temporary The software failure incident related to the O2 network outage was temporary. The outage began on Thursday, affecting millions of customers, and services like 3G and 4G were restored by the following day [78847, 79621, 79308]. The outage was caused by a software glitch in Ericsson's equipment, specifically due to an expired certificate in the software versions installed with the affected customers [79621]. O2's chief executive mentioned that they were confident the service would be fully restored by Friday morning [79308]. The outage impacted various services, including live bus updates and the ability to make or receive calls [79308].
Behaviour crash, omission, other (a) crash: The software failure incident in the articles can be categorized as a crash behavior. This is evident from the reports of the network outage causing the 3G and 4G services to go down, leaving customers unable to access the internet or use apps that require an internet connection [Article 78878]. Additionally, the outage affected services like live bus updates, indicating a system crash where the network lost its state and failed to perform its intended functions [Article 79308]. (b) omission: The software failure incident can also be classified as an omission behavior. This is seen in the reports of customers being unable to make or receive phone calls, indicating an omission in the system's intended functions [Article 79308]. (c) timing: The software failure incident does not align with a timing behavior as there are no reports of the system performing its intended functions but at incorrect times. (d) value: The software failure incident does not align with a value behavior as there are no reports of the system performing its intended functions incorrectly. (e) byzantine: The software failure incident does not align with a byzantine behavior as there are no reports of inconsistent responses or interactions from the system. (f) other: The disruption to dependent services such as live bus updates, together with the inability of customers to access the internet or make calls, does not fit neatly into the remaining categories; it is closest to the crash behavior described in (a), where the network lost its state and failed to perform its intended functions [Article 79308].

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence property, delay, non-human (a) death: People lost their lives due to the software failure - There were no reports of any deaths resulting from the software failure incident. [78847, 79622, 78983, 78871, 79228, 79621, 79665, 78878, 79308] (b) harm: People were physically harmed due to the software failure - There were no reports of physical harm to individuals due to the software failure incident. [78847, 79622, 78983, 78871, 79228, 79621, 79665, 78878, 79308] (c) basic: People's access to food or shelter was impacted because of the software failure - There were no reports of people's access to food or shelter being impacted by the software failure incident. [78847, 79622, 78983, 78871, 79228, 79621, 79665, 78878, 79308] (d) property: People's material goods, money, or data was impacted due to the software failure - The software failure impacted millions of O2 customers, affecting their ability to access online data services, apps, and make calls. This caused inconvenience and financial losses for some individuals, such as a heating engineer and plumber who couldn't reach customers, and a cleaner who relied on her watch connected to her phone for health monitoring. [78847, 79622, 78983, 78871, 79228, 79621, 79665, 78878, 79308] (e) delay: People had to postpone an activity due to the software failure - The software failure caused delays and disruptions in various activities such as work, communication, navigation, and health monitoring for individuals and businesses relying on O2 services. For example, a heating engineer couldn't reach customers, a cleaner couldn't monitor health data, and commuters faced issues with electronic bus timetables. [78847, 79622, 78983, 78871, 79228, 79621, 79665, 78878, 79308] (f) non-human: Non-human entities were impacted due to the software failure - Non-human entities such as smart meters, electronic bus timetable services, and traffic apps were affected by the software failure incident. For example, smart meters relying on O2 data services were impacted, and electronic bus timetable updates failed. [78847, 79622, 78983, 78871, 79228, 79621, 79665, 78878, 79308] (g) no_consequence: There were no real observed consequences of the software failure - The software failure incident had significant consequences on O2 customers, affecting their ability to access data services, apps, and make calls. The outage caused inconvenience, financial losses, and disruptions in various activities. [78847, 79622, 78983, 78871, 79228, 79621, 79665, 78878, 79308] (h) theoretical_consequence: There were potential consequences discussed of the software failure that did not occur - The articles did not mention any potential consequences discussed that did not occur. [78847, 79622, 78983, 78871, 79228, 79621, 79665, 78878, 79308] (i) other: Was there consequence(s) of the software failure not described in the (a to h) options? What is the other consequence(s)? - The articles did not mention any other specific consequences of the software failure incident beyond those related to financial losses, inconvenience, and disruptions in services. [78847, 79622, 78983, 78871, 79228, 79621, 79665, 78878, 79308]
Domain information, transportation, health, government (a) The software failure incident affected the production and distribution of information as it disrupted mobile data services, internet access, and apps for millions of customers, impacting their ability to stay connected and access online services [78847, 79622, 78983, 78871, 79228, 79665, 78878, 79308]. (b) The transportation industry was also impacted as the outage affected services like electronic bus timetable updates, satnav services for taxi firms, couriers, and food delivery services, and caused disruptions for commuters relying on traffic apps and electronic bus timetables [78847, 79622, 79308]. (l) The government sector was affected as well, with services like Transport for London's electronic bus timetable updates being disrupted due to the software failure incident [79308]. (m) Other industries impacted by the software failure incident included healthcare, with individuals like Amy-Jayne Toulson, who suffers from epilepsy and relies on a watch connected to her mobile phone for seizure monitoring, facing challenges due to the outage [78847].
