Incident: T-Mobile Nationwide Outage Caused by IP Traffic Issue

Published Date: 2020-06-17

Postmortem Analysis
Timeline 1. The software failure incident on T-Mobile's network occurred on a Monday, starting shortly after 9 a.m. PT and ending more than 12 hours later [101094]. Given the publication date of 2020-06-17, the incident happened on a Monday in June 2020.
System 1. Fiber circuit leased from a third-party provider in the Southeast, which failed [101094] 2. Redundancy that failed to handle the fiber circuit failure, leading to an "overload" situation [101094] 3. IP Multimedia Subsystem (IMS) core network that supports VoLTE calls [101094]
Responsible Organization 1. T-Mobile: CEO Mike Sievert attributed the incident to an "IP traffic related issue that has created significant capacity issues in the network core," and the redundancy in T-Mobile's network failed to handle the triggering failure [101094]. 2. The third-party provider in the Southeast from which T-Mobile leases the fiber circuit whose failure triggered the "overload" situation in the network core [101094].
Impacted Organization 1. T-Mobile customers [101094] 2. Google Fi users [101094] 3. Metro prepaid brand customers [101094] 4. Mint Mobile customers [101094] 5. Simple Mobile customers [101094]
Software Causes 1. The software failure incident was caused by an "IP traffic related issue that has created significant capacity issues in the network core" [101094]. 2. After a leased fiber circuit failed, the redundancy meant to handle the failure did not work, creating an "overload" situation that spread as an IP traffic storm across the IMS core network [101094].
Non-software Causes 1. The outage was triggered by the failure of a fiber circuit that T-Mobile leases from a third-party provider in the Southeast [101094]. 2. The resulting "overload" produced what T-Mobile CEO Mike Sievert described as an "IP traffic related issue that has created significant capacity issues in the network core" [101094].
Impacts 1. The software failure incident at T-Mobile caused widespread issues across the country, impairing the ability to make calls and send text messages for more than 12 hours [101094]. 2. Customers lost calls and texts, while data services worked normally [101094]. 3. T-Mobile trended on Twitter, with #TMobiledown holding the top spot on the site's US Trending Topics for several hours [101094]. 4. T-Mobile CEO Mike Sievert publicly confirmed that the outage was caused by an "IP traffic related issue that has created significant capacity issues in the network core" [101094]. 5. The failure of a fiber circuit that T-Mobile leases from a third-party provider in the Southeast led to an "overload" situation in the network core [101094]. 6. VoLTE calls and text services were affected, and the carrier worked through the night to resolve the issue [101094]. 7. T-Mobile recommended that users switch to apps such as FaceTime, WhatsApp, and iMessage as alternatives to traditional SMS text messages and voice calls [101094]. 8. Other carriers, including AT&T and Verizon, reported normal network operations, apart from problems texting or calling T-Mobile phones [101094].
Preventions 1. Implementing more robust redundancy measures for critical network components to prevent overload situations like the one that occurred due to the failure of redundancies in the IMS core network [101094]. 2. Conducting regular audits and stress tests on leased third-party infrastructure, such as the fiber circuit in the Southeast, to ensure its reliability and performance under various conditions [101094]. 3. Enhancing network monitoring and alert systems to quickly identify and address IP traffic-related issues before they escalate into widespread outages [101094].
Fixes 1. Implementing better redundancy measures to ensure network stability in case of fiber circuit failures [101094]. 2. Conducting a thorough review of the network core to identify and address any potential capacity issues that could lead to similar incidents in the future [101094]. 3. Enhancing communication channels with customers during outages to provide timely updates and alternative communication methods [101094].
References 1. Neville Ray, T-Mobile's president of technology [101094] 2. Mike Sievert, T-Mobile CEO [101094] 3. FCC Chairman Ajit Pai [101094]
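
The third item under Preventions (enhanced monitoring and alerting for IP traffic issues) can be illustrated with a minimal sketch. It assumes a single traffic counter and a fixed capacity figure; the class name, parameters, and numbers are hypothetical and do not describe T-Mobile's actual tooling. The monitor keeps a sliding window of recent traffic samples and flags an overload once utilization crosses a configurable fraction of capacity.

```python
from collections import deque


class TrafficRateMonitor:
    """Sliding-window monitor that flags overload (illustrative, hypothetical)."""

    def __init__(self, capacity_per_sec: float, window_sec: float = 10.0,
                 threshold: float = 0.8):
        self.capacity_per_sec = capacity_per_sec  # assumed capacity of the core
        self.window_sec = window_sec              # length of the sliding window
        self.threshold = threshold                # alert at this fraction of capacity
        self._samples = deque()                   # (timestamp, request_count) pairs

    def record(self, timestamp: float, count: int = 1) -> None:
        """Add a traffic sample and drop samples that left the window."""
        self._samples.append((timestamp, count))
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Samples at or before (now - window_sec) no longer count.
        while self._samples and self._samples[0][0] <= now - self.window_sec:
            self._samples.popleft()

    def utilization(self, now: float) -> float:
        """Fraction of window capacity consumed by recent traffic."""
        self._evict(now)
        total = sum(count for _, count in self._samples)
        return total / (self.capacity_per_sec * self.window_sec)

    def is_overloaded(self, now: float) -> bool:
        return self.utilization(now) >= self.threshold


if __name__ == "__main__":
    monitor = TrafficRateMonitor(capacity_per_sec=100, window_sec=10, threshold=0.8)
    for t in range(10):
        monitor.record(float(t), count=50)   # steady load at half capacity
    print(monitor.is_overloaded(9.0))
    monitor.record(9.5, count=400)           # sudden burst, e.g. rerouted traffic
    print(monitor.is_overloaded(9.5))
```

Alerting below full capacity (here at 80 percent) is a design choice meant to leave time to shed or reroute traffic before the core saturates; the right threshold would depend on the real network.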

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization, multiple_organization (a) The software failure incident having happened again at one_organization: - T-Mobile experienced a software failure incident that impacted the ability to make calls and send text messages, lasting for more than 12 hours [101094]. - T-Mobile's CEO mentioned that the outage was caused by an "IP traffic related issue that has created significant capacity issues in the network core" [101094]. - T-Mobile had a previous outage where the redundancy failed, resulting in an "overload" situation that affected VoLTE calls and text services [101094]. (b) The software failure incident having happened again at multiple_organization: - Downdetector.com noted issues with all major wireless carriers, including AT&T, Verizon, T-Mobile, and Sprint [101094]. - AT&T and Verizon confirmed that their networks were operating normally, but there were issues when trying to text or call a T-Mobile phone [101094]. - Verizon criticized Downdetector for spreading false reports about its network performance, emphasizing that their network was not experiencing outages [101094].
Phase (Design/Operation) design, operation (a) The software failure incident related to the design phase stemmed from an "IP traffic related issue that has created significant capacity issues in the network core," as T-Mobile CEO Mike Sievert wrote in a blog post [101094]. The issue arose from an "overload" situation that followed the failure of a fiber circuit that T-Mobile leases from a third-party provider in the Southeast; the redundancies set up to handle such failures did not work in this case [101094]. (b) The software failure incident related to the operation phase was evident in the outage experienced by T-Mobile customers, who could not make calls or send text messages for more than 12 hours. The outage was attributed to an "IP traffic storm" that spread across the IMS core network supporting VoLTE calls, disrupting the operation of the network [101094].
Boundary (Internal/External) within_system, outside_system (a) The software failure incident related to the T-Mobile outage was primarily within the system. The outage was caused by an "IP traffic related issue that has created significant capacity issues in the network core" [101094]. It was triggered by the failure of a fiber circuit that T-Mobile leases from a third-party provider in the Southeast, and the redundancy within T-Mobile's system failed, leading to an "overload" situation [101094]. The failure was attributed not to a DDoS attack but to internal network issues and failures within T-Mobile's infrastructure.
Nature (Human/Non-human) non-human_actions, human_actions (a) The software failure incident was primarily caused by a non-human action: the failure of a fiber circuit that T-Mobile leases from a third-party provider in the Southeast. This failure triggered an overload situation in the network core, leading to significant capacity issues that affected calls and texts [101094]. (b) Human actions were involved in the response to the incident. T-Mobile CEO Mike Sievert said that the outage was caused by an "IP traffic related issue" and assured customers that hundreds of engineers and vendor partner staff were working to resolve it. T-Mobile's president of technology, Neville Ray, provided updates on the situation and recommended alternative communication methods to users [101094].
Dimension (Hardware/Software) hardware, software (a) The software failure incident related to hardware: - The outage was triggered by the failure of a fiber circuit that T-Mobile leases from a third-party provider in the Southeast, leading to an "overload" situation in the network core [101094]. - T-Mobile had redundancies set up to handle such failures, but in this case the redundancy failed, exacerbating the problem [101094]. (b) The software failure incident related to software: - T-Mobile CEO Mike Sievert said that the outage was caused by an "IP traffic related issue that has created significant capacity issues in the network core" [101094]. - The overload resulted in an IP traffic storm across the IMS core network supporting VoLTE calls, indicating a software-related issue [101094].
Objective (Malicious/Non-malicious) non-malicious (a) The software failure incident in this case was non-malicious. The outage experienced by T-Mobile was caused by an "IP traffic related issue that has created significant capacity issues in the network core" [101094]. The issue stemmed from the failure of a fiber circuit that T-Mobile leases from a third-party provider in the Southeast, which led to an "overload" situation in the network core [101094]. T-Mobile CEO Mike Sievert also stated that the outage was not the result of a distributed denial-of-service (DDoS) attack [101094]. The failure was therefore caused by technical issues and failures in the network infrastructure, not by malicious intent to harm the system.
Intent (Poor/Accidental Decisions) accidental_decisions (a) The software failure incident related to the T-Mobile outage was not primarily due to poor decisions. It followed the failure of a fiber circuit that T-Mobile leases from a third-party provider in the Southeast. The redundancy set up to handle such failures did not work, leading to an "overload" situation that caused significant capacity issues across the network core [101094]. The incident was attributed to an "IP traffic related issue" that created capacity problems in the network core, not to poor decisions [101094].
Capability (Incompetence/Accidental) accidental (a) The article does not explicitly mention development incompetence. It is therefore unknown whether the T-Mobile outage was due to a lack of professional competence on the part of individuals or the development organization. (b) Accidental factors are evident in the article. The outage at T-Mobile was caused by an "IP traffic related issue that has created significant capacity issues in the network core" [101094]. This issue stemmed from the failure of a fiber circuit that T-Mobile leases from a third-party provider in the Southeast, which led to an overload situation affecting the network's core infrastructure. The redundancy set up to handle such failures did not work in this case, resulting in the outage.
Duration temporary (a) The software failure incident in this case was temporary. The T-Mobile service outage affecting calls and texts started shortly after 9 a.m. PT on Monday and ended more than 12 hours later, at 10:03 p.m. PT the same day [101094]. The outage was caused by an "IP traffic related issue that has created significant capacity issues in the network core" and was triggered by the failure of a fiber circuit that T-Mobile leases from a third-party provider in the Southeast. The redundancy failed to handle the failure, leading to an "overload" situation in the network core supporting VoLTE calls [101094].
Behaviour crash, omission, value, other (a) crash: The software failure incident in the T-Mobile outage can be categorized as a crash. The incident led to a widespread issue impacting the ability to make calls and send text messages for more than 12 hours, indicating a failure of the system to perform its intended functions [101094]. (b) omission: The software failure incident can also be categorized as an omission. Users reported that calls and texts were not working, while data services appeared to be working normally. This indicates an omission in performing the intended functions of calls and texts [101094]. (c) timing: The software failure incident does not align with a timing failure as there is no indication that the system performed its intended functions too late or too early [101094]. (d) value: The software failure incident can be categorized as a value failure. The outage caused the system to perform its intended functions incorrectly, leading to significant capacity issues in the network core and affecting the ability to make calls and send text messages [101094]. (e) byzantine: The software failure incident does not align with a byzantine failure as there is no mention of inconsistent responses or interactions from the system [101094]. (f) other: The other behavior observed in the software failure incident is a redundancy failure. Despite having redundancies set up to handle issues like the fiber circuit failure, the redundancy failed in this case, leading to an overload situation and contributing to the network core capacity issues [101094].
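
As a quick consistency check on the Duration row, the two times of day it quotes can be subtracted directly. This is a small worked calculation, not part of the original record; it treats 9:00 a.m. as the earliest possible start, since the source only says "shortly after 9 a.m.", so the result is an upper bound on the outage length.

```python
from datetime import timedelta

# Times of day in PT, taken from the Duration row.
earliest_start = timedelta(hours=9)        # "shortly after 9 a.m.", so the real start is slightly later
end = timedelta(hours=22, minutes=3)       # 10:03 p.m. the same day

max_duration = end - earliest_start
print(max_duration)  # 13:03:00
```

An upper bound of 13 hours 3 minutes is consistent with the "more than 12 hours" figure used throughout the record.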

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence delay, theoretical_consequence The consequence of the software failure incident reported in the article was primarily the disruption of communication services, specifically calls and texts. Users were unable to make calls or send text messages reliably during the T-Mobile outage [101094]. This caused inconvenience and frustration for customers who relied on these services. The article also discussed potential consequences such as network capacity issues and the need for network redundancies to handle such failures [101094]. The outage did not lead to any reported physical harm, deaths, impact on basic needs, or significant property loss.
Domain unknown (a) The failed system in this incident was related to the telecommunications industry, specifically impacting T-Mobile's wireless network services for making calls and sending text messages [101094].

Sources
