Incident: Telstra Nationwide Network Outage: Congestion Causes Mobile Service Disruption

Published Date: 2016-03-17

Postmortem Analysis
Timeline 1. The incident covered here is Telstra's second major outage in as many months; the article reporting it [42115] was published on 2016-03-17, which places the incident in March 2016. 2. The problem was first identified at 6 p.m., customers started to be reconnected within 2 hours, and full service returned by 10 p.m. [42115]. 3. The earlier, separate outage occurred on February 8, 2016 [42115].
System 1. Network infrastructure: a problem caused a significant number of customers to be disconnected from the network, and congestion followed when their devices all tried to reconnect automatically at the same time [42115]. 2. Mobile node (earlier outage only): in the separate February 8 outage, an "embarrassing human error" knocked one of Telstra's mobile nodes offline [42115].
Responsible Organization 1. Telstra [42115]
Impacted Organization 1. Telstra's mobile customers: roughly 8 million people, about half of Telstra's mobile customer base [42115].
Software Causes 1. The article does not identify a specific software cause [42115]; based on the information provided, it remains unknown.
Non-software Causes 1. Network congestion: a problem disconnected a significant number of customers, whose devices then all tried to reconnect automatically at the same time [42115]. 2. For context, the earlier February 8 outage was caused by an "embarrassing human error" that knocked one of Telstra's mobile nodes offline; Telstra described the later outage as "unrelated" to it [42115].
Impacts 1. Roughly 8 million people, about half of Telstra's mobile customers, were unable to make calls because of congestion on the network [42115]. 2. A significant number of customers were disconnected from the network, and their simultaneous reconnection attempts caused the congestion [42115]. 3. The outage was nationwide and was Telstra's second major outage in as many months, with potential effects on its reputation and customer trust [42115].
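The reconnect-driven congestion described in this record can be illustrated with a toy model. The sketch below is not Telstra's mechanism and uses made-up figures: the device count, the per-second capacity, and the function name `peak_attempts_per_second` are all assumptions chosen for illustration. It only shows why many devices retrying at the same instant produce a far higher peak load than the same devices retrying at random moments within a window.

```python
import random
from collections import Counter


def peak_attempts_per_second(num_devices, spread_seconds, seed=0):
    """Return the largest number of reconnect attempts landing in any one second.

    Each device picks a whole-second delay in [0, spread_seconds].
    A spread of 0 means every device retries at the same instant.
    """
    rng = random.Random(seed)
    attempts = Counter(rng.randint(0, spread_seconds) for _ in range(num_devices))
    return max(attempts.values())


# Hypothetical figures, for illustration only.
DEVICES = 100_000
CAPACITY_PER_SECOND = 5_000

synchronized = peak_attempts_per_second(DEVICES, spread_seconds=0)
staggered = peak_attempts_per_second(DEVICES, spread_seconds=59)

print("synchronized peak:", synchronized,
      "over capacity:", synchronized > CAPACITY_PER_SECOND)
print("staggered peak:", staggered,
      "over capacity:", staggered > CAPACITY_PER_SECOND)
```

In this model the synchronized case puts every attempt into a single second, while spreading the same attempts over a minute keeps the peak well under the assumed capacity.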
Preventions 1. Implementing better network monitoring and alert systems to quickly identify and address congestion issues before they escalate [42115]. 2. Conducting regular audits and testing of the network infrastructure to catch any potential vulnerabilities or weaknesses that could lead to outages [42115]. 3. Enhancing redundancy and failover mechanisms within the network to ensure that a single point of failure, such as a mobile node going offline, does not cause widespread disruptions [42115].
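As an illustration of the monitoring and alerting idea in item 1, the following is a minimal sketch of a sliding-window rate alert. It assumes a hypothetical event feed (for example, disconnect events with timestamps in seconds, arriving in non-decreasing order); the class name `RateAlert`, the window, and the threshold are invented for this example and do not describe any Telstra system.

```python
from collections import deque


class RateAlert:
    """Flag when more than `threshold` events fall within the last `window_seconds`."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._times = deque()

    def record(self, timestamp):
        """Record one event at `timestamp` (seconds); return True if the alert fires.

        Timestamps are assumed to arrive in non-decreasing order.
        """
        self._times.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop events that are no longer inside the window.
        while self._times and self._times[0] <= cutoff:
            self._times.popleft()
        return len(self._times) > self.threshold


# Hypothetical usage: alert when more than 3 events occur within 10 seconds.
alert = RateAlert(window_seconds=10, threshold=3)
results = [alert.record(t) for t in (0, 1, 2, 3, 20)]
print(results)
```

A real deployment would feed this from network telemetry and tune the window and threshold to normal traffic; the sketch only shows the windowing logic.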
Fixes 1. Implementing better network monitoring and alert systems to quickly identify and address issues [42115]. 2. Conducting a thorough investigation to determine the root cause of the failure and implementing measures to prevent similar incidents in the future [42115]. 3. Enhancing network capacity and redundancy to handle sudden spikes in traffic and prevent congestion issues [42115].
References 1. Telstra CEO Andy Penn [42115] 2. Telstra's official statements [42115]

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization (a) one_organization: The article reports that Telstra suffered a massive outage for the second time in as many months. The first outage, on February 8, was caused by a human error that knocked one of the company's mobile nodes offline. The second outage, which affected roughly 8 million people, was due to a problem that disconnected a significant number of customers from the network and caused congestion [42115]. (b) multiple_organization: The article does not mention similar incidents at other organizations or with their products and services, so it is unknown whether similar failures have occurred elsewhere.
Phase (Design/Operation) design, operation (a) design: The rationale given for the design phase is the earlier February 8 outage, in which a human error knocked one of the company's mobile nodes offline and congestion followed when users tried to reconnect [42115]. (b) operation: Telstra said that a problem caused a significant number of customers to be disconnected from the network, and congestion followed as their devices all tried to reconnect automatically at the same time [42115].
Boundary (Internal/External) within_system (a) within_system: The outage on Telstra's network was caused by a problem within the system that disconnected a significant number of customers, whose devices then reconnected automatically at the same time and congested the network [42115]. The CEO said the cause was "unrelated" to the previous incident caused by human error, which the record takes as an indication that the issue originated within the system itself.
Nature (Human/Non-human) non-human_actions, human_actions (a) non-human_actions: The article says the outage was caused by a problem that disconnected a significant number of customers from the network, leading to congestion as their devices all reconnected automatically at the same time. This points to contributing factors introduced without human participation [42115]. (b) human_actions: The article reports a previous major outage on Telstra's network caused by an "embarrassing human error" that knocked one of the company's mobile nodes offline, resulting in massive congestion when users attempted to reconnect. This points to contributing factors introduced by human actions [42115].
Dimension (Hardware/Software) hardware, software (a) hardware: The February 8 outage was caused by an "embarrassing human error" that knocked one of Telstra's mobile nodes offline, leading to massive congestion when users attempted to reconnect [42115]. (b) software: The later outage, which affected roughly 8 million people, was due to a problem that disconnected a significant number of customers from the network, causing congestion as their devices all tried to reconnect automatically at the same time [42115].
Objective (Malicious/Non-malicious) non-malicious (a) The software failure incident related to the Telstra outage mentioned in Article 42115 was non-malicious. The outage was caused by a problem that triggered a significant number of customers to be disconnected from the network, leading to congestion when they all attempted to reconnect simultaneously. Telstra CEO Andy Penn stated that the cause of the outage was unrelated to a previous incident caused by a human error [42115].
Intent (Poor/Accidental Decisions) accidental_decisions (a) The software failure incident related to Telstra's network outage was not explicitly attributed to poor decisions. The February 8 outage was caused by an "embarrassing human error" that knocked one of the company's mobile nodes offline, leading to massive congestion when users attempted to reconnect [42115]. Telstra CEO Andy Penn mentioned that the cause of the recent outage was "unrelated" to the previous problems, indicating that poor decisions were not the primary contributing factor to the incident.
Capability (Incompetence/Accidental) development_incompetence, accidental (a) development_incompetence: The article reports that the earlier major outage was caused by what the company called an "embarrassing human error", which took one of the mobile nodes offline and led to network congestion and disconnections [42115]. The record treats this as a lapse in professional competence in managing the network infrastructure. (b) accidental: Telstra described the later outage as "unrelated" to the earlier one caused by human error and was still investigating how the service disruption occurred, which suggests the incident was accidental rather than intentional [42115].
Duration temporary (a) The software failure incident described in the article was temporary. It mentions that the problem affecting roughly 8 million people, half of Telstra's mobile customers, was due to congestion on the network. The issue was first identified at 6 p.m. and customers started to be reconnected within 2 hours, with full service returning by 10 p.m. This indicates that the failure was not permanent but rather a temporary disruption in service [42115].
Behaviour crash
(a) crash: The article describes a network failure at Telstra in which congestion left roughly 8 million people unable to make calls. The problem was first identified at 6 p.m., customers started to be reconnected within 2 hours, and full service returned by 10 p.m. The record classifies this as a crash: customers were disconnected and the system was not performing its intended functions [42115].
(b) omission: The article does not specifically mention the system omitting to perform its intended functions at a particular instance.
(c) timing: The article does not indicate any timing-related failure in which the system performed its intended functions too late or too early.
(d) value: The article does not mention the system performing its intended functions incorrectly.
(e) byzantine: The article does not describe inconsistent responses or interactions.
(f) other: The article describes no behaviour beyond the crash noted in (a) [42115].

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence delay, non-human, theoretical_consequence
(a) death: The article does not mention any deaths resulting from the incident [42115].
(b) harm: The article does not mention any physical harm to individuals [42115].
(c) basic: The incident did not affect people's access to food or shelter [42115].
(d) property: The article does not report any direct impact on people's material goods, money, or data [42115].
(e) delay: Customers experienced a disruption to their mobile service, as they were unable to make calls while the network was congested [42115].
(f) non-human: The incident affected Telstra's network infrastructure and services, which are non-human entities [42115].
(g) no_consequence: This option does not apply, because the service disruption noted under (e) was a real, observed consequence; beyond that inconvenience to customers, the article mentions no other significant observed consequences [42115].
(h) theoretical_consequence: The article discusses potential consequences, such as the effect on customer trust and the expectation of free data days after future network outages, but these had not materialized at the time of reporting [42115].
(i) other: The article does not mention any consequences beyond those covered in options (a) to (h) [42115].
Domain unknown (a) The software failure incident reported in Article 42115 is related to the telecommunications industry, specifically affecting Telstra, Australia's biggest mobile communications company. The incident caused a massive outage that impacted roughly 8 million people, half of Telstra's mobile customers, who were unable to make calls due to network congestion [42115]. The outage was a significant disruption for a company that prides itself on offering world-class technology and the best telecommunication networks [42115].

Sources
