Incident: O2 4G Data Network Outage Caused by Software Glitch

Published Date: 2018-12-07

Postmortem Analysis
Timeline 1. The software failure incident happened on December 6, 2018 [79227].
System 1. Faulty software - Ericsson and O2 issued a joint apology mentioning that the disruption was caused by faulty software [79227].
Responsible Organization 1. Ericsson - The software failure incident was caused by faulty software supplied by Ericsson, as mentioned in the joint apology issued by O2 and Ericsson [79227].
Impacted Organization 1. O2 network users, including customers of O2, Sky, Giffgaff, and Lycamobile [79227] 2. Businesses relying on O2 network services, such as bus timetable information services and companies like plumbing businesses and home visit providers [79227] 3. Softbank network in Japan, affecting services on Y!Mobile [79227] 4. Transport for London's electronic timetable service at bus stops, which stopped working due to the O2 network outage [79227]
Software Causes 1. The software failure incident was caused by "faulty software", as stated by Ericsson UK boss Marielle Lindgren and O2 boss Mark Evans in their joint apology [79227]. 2. The main issue identified in the root cause analysis was an "expired certificate in the software versions installed with these customers", as stated by Ericsson president Börje Ekholm [79227].
Non-software Causes 1. An expired certificate in the software versions installed with the customers [79227].
Impacts 1. Texting issues: Some users had problems sending texts, including error messages and duplicate sends, which disrupted communication [79227]. 2. Business disruptions: Many businesses were affected by the network outage, with employees losing work time and productivity [79227]. 3. Loss of essential services: Services such as bus timetable information were affected, impacting commuters and travelers [79227]. 4. Financial losses: Customers faced financial losses, such as a taxi driver losing out on fares and individuals incurring bank charges because they could not access their money [79227]. 5. Personal risk: Individuals with critical needs, such as an insulin-dependent diabetic, were put at risk because they could not contact anyone in an emergency [79227].
Preventions 1. Regularly updating and renewing certificates in the software to prevent issues like expired certificates causing disruptions [79227]. 2. Implementing more robust testing procedures to catch software glitches before they impact millions of customers [79227]. 3. Enhancing communication and coordination between the mobile operator (O2) and the network equipment supplier (Ericsson) to address and resolve issues promptly [79227].
Fixes 1. Decommissioning the faulty software that caused the issues [79227] 2. Conducting a complete and comprehensive root cause analysis to identify the main issue, such as an expired certificate in the software versions installed with customers [79227]
References 1. O2 website [79227] 2. Users reporting issues to BBC [79227] 3. O2 spokeswoman [79227] 4. Ericsson UK boss Marielle Lindgren [79227] 5. O2 boss Mark Evans [79227] 6. Consumer expert Helen Dewdney [79227] 7. Ericsson president Börje Ekholm [79227] 8. Japan's Softbank network [79227] 9. Tom Morrod at market research firm IHS Markit [79227] 10. O2 customers (Allison Rose-Mannall, Lynsey Greaves, Luke Stagg, Mischa Bittar, Omeran Amirat) [79227]
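
The first prevention listed above (renewing certificates before they expire) is commonly backed by an automated expiry check. The following is a minimal sketch of such a check in Python, not a description of how Ericsson or O2 manage certificates. The node names, the expiry dates, and the 30-day warning window are hypothetical values chosen for illustration; only `ssl.cert_time_to_seconds` and the `notAfter` string format it parses come from the Python standard library.

```python
import ssl
from datetime import datetime, timedelta, timezone

# Assumed renewal lead time; the source does not specify one.
WARN_WINDOW = timedelta(days=30)


def expiry_status(not_after: str, now: datetime,
                  warn_window: timedelta = WARN_WINDOW) -> str:
    """Classify a certificate from its notAfter string.

    `not_after` uses the format returned by ssl.getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    if now >= expires:
        return "expired"
    if expires - now <= warn_window:
        return "renew_soon"
    return "ok"


# Hypothetical inventory of installed certificates (names and dates invented).
inventory = {
    "node-a": "Jan  1 00:00:00 2030 GMT",
    "node-b": "Jun 15 12:00:00 2025 GMT",
    "node-c": "Jan  1 00:00:00 2020 GMT",
}

# A fixed "current time" keeps the example deterministic.
now = datetime(2025, 6, 1, tzinfo=timezone.utc)
report = {name: expiry_status(na, now) for name, na in inventory.items()}

for name, status in report.items():
    print(f"{name}: {status}")
```

Running a check like this on a schedule, and alerting on anything other than "ok", gives operators time to renew a certificate before it lapses in production.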

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization, multiple_organization (a) The software failure incident having happened again at one_organization: The incident involving the O2 4G data network outage was not the first time such an issue occurred. O2 had previously experienced a similar software glitch that caused disruption to its data services. The joint apology issued by O2 and Ericsson acknowledged that the faulty software was the root cause of the issues and mentioned that the software responsible for the problem was being decommissioned [79227]. (b) The software failure incident having happened again at multiple_organization: Apart from O2, the software failure incident also affected Japan's Softbank network, leading to disruptions in services provided by Y!Mobile. This indicates that the software issue was not limited to a single organization but had implications for other networks as well [79227].
Phase (Design/Operation) design, operation (a) The software failure incident was attributed to a design issue caused by faulty software. O2 and Ericsson issued a joint apology acknowledging that the disruption was due to "faulty software" that needed to be decommissioned [79227]. (b) The software failure incident also had operational impacts: some users reported problems with texting, such as error messages when sending texts and duplicate sends. These difficulties could be considered an operational issue related to the use of the system [79227].
Boundary (Internal/External) within_system (a) The software failure incident reported in the articles was primarily within the system. The outage was caused by a "faulty software" issue that affected the data services of O2 and other networks using O2 infrastructure [79227]. The root cause of the disruption was identified as an expired certificate in the software versions installed with the customers, indicating an internal software issue [79227]. Additionally, the joint apology issued by O2 and Ericsson acknowledged the software glitch as the cause of the disruption [79227]. The software issue led to network disruption for customers in multiple countries, further emphasizing that the failure originated within the system [79227].
Nature (Human/Non-human) non-human_actions (a) The software failure incident occurring due to non-human actions: - The software failure incident on the O2 network was attributed to a "faulty software" issue, specifically an expired certificate in the software versions installed with the customers [79227]. - Ericsson president Börje Ekholm mentioned that the main issue causing the disruption was an expired certificate in the software versions installed with the customers, indicating a non-human action as the root cause [79227]. (b) The software failure incident occurring due to human actions: - The articles do not provide any information suggesting that the software failure incident was caused by human actions.
Dimension (Hardware/Software) software (a) The articles do not attribute the software failure incident to hardware issues. (b) The software failure incident was identified as a software glitch. O2 and Ericsson issued a joint apology acknowledging that the disruption was caused by "faulty software" [79227]. Ericsson's president stated that the main issue was an expired certificate in the software versions installed with the customers, indicating a software-related root cause [79227].
Objective (Malicious/Non-malicious) non-malicious (a) The software failure incident reported in the articles was non-malicious. The incident was attributed to a "faulty software" issue caused by an expired certificate in the software versions installed with the customers [79227]. The joint apology issued by O2 and Ericsson acknowledged the software glitch as the cause of the disruption, and they were working on decommissioning the faulty software [79227]. The article also mentions that Ericsson was conducting a comprehensive root cause analysis to address the issue [79227].
Intent (Poor/Accidental Decisions) poor_decisions (a) The software failure incident reported in the articles was primarily due to poor decisions. The joint apology issued by O2 and Ericsson mentioned that the disruption was caused by "faulty software" that was being decommissioned [79227]. Additionally, Ericsson president Börje Ekholm stated that the main issue was an expired certificate in the software versions installed with these customers, indicating a failure related to poor decisions in managing software certificates [79227].
Capability (Incompetence/Accidental) development_incompetence, accidental (a) The software failure incident was primarily attributed to a development incompetence issue. O2 and Ericsson issued a joint apology acknowledging that the disruption was caused by "faulty software" that needed to be decommissioned [79227]. This indicates that the failure was a result of errors or issues introduced during the development process, possibly due to a lack of professional competence in ensuring the software's reliability and stability. (b) Additionally, the incident was also linked to accidental factors. Ericsson president Börje Ekholm mentioned that the main issue was an expired certificate in the software versions installed with customers, indicating an accidental oversight that led to the disruption [79227]. This suggests that accidental factors, such as overlooking the expiration of a certificate, played a role in the software failure incident.
Duration temporary (a) The software failure incident reported in the articles was temporary. The O2 4G data network was affected by a software glitch that disrupted service for smartphone users for about a day. The incident started around 05:30 GMT on Thursday and the 4G network was restored by early Friday [79227]. The issue was attributed to faulty software, which was being decommissioned; the specific fault was an expired certificate in the software versions installed with the customers [79227]. Customers experienced issues with texting, such as error messages and duplicate sends, indicating a temporary disruption in the service [79227].
Behaviour crash, omission, other (a) crash: The software failure incident can be categorized as a crash, as the O2 data network experienced a day-long outage that disrupted service for smartphone users. The 4G network was affected from about 05:30 GMT on Thursday, and the slower 3G data service was also impacted [79227]. (b) omission: Users reported issues with texting, such as error messages, duplicate sends, and texts shown as not sent even though the recipients had received them. This indicates an omission in the system's intended function of sending texts successfully without errors [79227]. (c) timing: There is no specific mention of timing-related failures in the articles provided. (d) value: The software failure incident did not involve the system performing its intended functions incorrectly. (e) byzantine: The software failure incident did not exhibit inconsistent responses or interactions. (f) other: The other behavior observed in this incident is the software glitch that disrupted the O2 data network. The joint apology issued by O2 and Ericsson named "faulty software", which was being decommissioned, as the root cause of the issues [79227].

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence property, delay (d) property: People's material goods, money, or data were impacted by the software failure - O2 customers had problems sending text messages, with some experiencing error messages and duplicate sends [79227]. - Customers were advised that they could claim for any out-of-pocket expenses resulting from being without their phones, such as a refund for the time they were without phone use and consequential losses due to breach of contract [79227]. - A taxi driver might be able to prove they lost out on fares due to the shutdown, indicating a financial impact on individuals [79227]. - Omeran Amirat, an O2 customer, said he had bids on eBay that he could not complete because the network was down, affecting his ability to buy Christmas presents [79227].
Domain information, transportation, utilities, government (a) The software failure incident affected the production and distribution of information as it disrupted services such as bus timetable information [79227]. (b) The transportation industry was impacted by the software failure incident as services like Transport for London's electronic timetable service at bus stops stopped working [79227]. (g) The utilities industry was also affected by the software failure incident as O2 provides services for networks like Sky, Giffgaff, and Lycamobile, which have millions of users relying on these services [79227]. (l) The government sector was impacted by the software failure incident as services like Transport for London's electronic timetable service at bus stops, which are essential public services, were disrupted [79227].
