Incident: DNS Malfunction at Plusnet HQ Causes Service Outages

Published Date: 2015-09-10

Postmortem Analysis
Timeline 1. The software failure incident at Plusnet happened on September 10, 2015, as mentioned in Article 51408.
System 1. Domain Name System (DNS) - Plusnet experienced a DNS malfunction at their headquarters, leading to intermittent service outages for users [51408].
Responsible Organization 1. Plusnet, whose DNS servers malfunctioned, was responsible for the software failure incident [51408].
Impacted Organization 1. Plusnet users [51408]
Software Causes 1. The software cause of the failure incident was a DNS malfunction at Plusnet's HQ, leading to intermittent service outages for users [51408].
Non-software Causes 1. DNS malfunction at the internet service provider’s HQ [51408]
Impacts 1. Users experienced intermittent service outages, leaving them unable to access websites from their broadband connection [51408]. 2. Users were nominally connected to the internet but unable to resolve URLs to working web pages [51408]. 3. The outage affected Plusnet's phone and broadband services as well as their website [51408]. 4. Customers faced inconvenience and disruption due to the software failure incident [51408].
Preventions 1. Implementing redundancy and failover mechanisms in the DNS infrastructure to ensure continuous service availability [51408]. 2. Regularly monitoring and maintaining the DNS servers to detect and address any potential issues before they escalate [51408]. 3. Conducting thorough testing and quality assurance checks on DNS configurations and updates to prevent unintended malfunctions [51408].
Fixes 1. Affected users could switch their DNS server to Google Public DNS as a short-term workaround until Plusnet restored its own DNS service [51408].
References 1. Plusnet's official statement [51408]
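The redundancy and failover prevention listed above can be illustrated with a minimal Python sketch. It assumes a hypothetical `Resolver` abstraction (any callable that maps a hostname to a list of IP strings) and tries each resolver in order; none of these names come from the article, and this is not a description of Plusnet's actual infrastructure.

```python
import socket
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical abstraction for illustration: a resolver is any callable that
# maps a hostname to a list of IP strings, or raises an exception on failure.
Resolver = Callable[[str], List[str]]


def system_resolver(hostname: str) -> List[str]:
    """Resolve using whatever DNS servers the operating system is configured with."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    return sorted({info[4][0] for info in infos})


def resolve_with_failover(
    hostname: str, resolvers: Iterable[Tuple[str, Resolver]]
) -> Tuple[Optional[str], List[str], List[str]]:
    """Try each named resolver in order and return the first non-empty answer.

    Returns (name_of_resolver_used, addresses, names_of_failed_resolvers).
    If every resolver fails, the name is None and the address list is empty.
    """
    failed: List[str] = []
    for name, resolver in resolvers:
        try:
            addresses = resolver(hostname)
        except Exception:
            # Any failure (timeout, lookup error) moves on to the next resolver.
            failed.append(name)
            continue
        if addresses:
            return name, addresses, failed
        failed.append(name)
    return None, [], failed
```

The list of failed resolver names doubles as a simple monitoring signal: a non-empty list means a resolver needs attention even though users were still served by a fallback.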

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization (a) The software failure incident related to DNS malfunction at Plusnet's HQ has happened again within the same organization. The article mentions that Plusnet users have been experiencing intermittent service outages due to a DNS malfunction at the internet service provider’s HQ, leaving them unable to access websites from their broadband connection. Plusnet apologized for the inconvenience caused and mentioned that they have experienced an outage affecting their phone and broadband service as well as their website [51408]. (b) There is no information in the provided article indicating that a similar incident has happened before or again at other organizations or with their products and services.
Phase (Design/Operation) design, operation (a) The software failure incident in this case was primarily due to a DNS malfunction at Plusnet's HQ, which affected the domain name system (DNS) functionality. This malfunction was likely a result of issues introduced during system development, system updates, or procedures to operate or maintain the system. The outage left users unable to access websites due to the DNS server being down, which disrupted the conversion of web addresses into IP addresses necessary for connecting to servers [51408]. (b) Additionally, the software failure incident can also be attributed to factors related to the operation of the system. The outage impacted Plusnet's phone and broadband services, as well as their website, indicating that the failure was affecting the operational aspects of the services provided by the company. The need for users to switch to Google Public DNS as a short-term fix also points to issues related to the operation or misuse of the system [51408].
Boundary (Internal/External) within_system (a) within_system: The software failure incident reported in Article 51408 was within the system. The issue was specifically related to a DNS malfunction at Plusnet's HQ, affecting their phone and broadband services as well as their website. The malfunction of the domain name system (DNS) within Plusnet's infrastructure led to users being unable to access websites from their broadband connection. Plusnet acknowledged the outage and worked to resolve the issue internally, eventually restoring their services [51408].
Nature (Human/Non-human) non-human_actions (a) The software failure incident in this case was due to a DNS malfunction at Plusnet's HQ, which is a non-human action as it was not introduced by human participation [51408]. The malfunction of the domain name system (DNS) caused users to experience intermittent service outages, making it impossible for their machines to resolve URLs to working web pages. This non-human factor disrupted the phone and broadband services as well as the Plusnet website, leading to inconvenience for customers.
Dimension (Hardware/Software) software (a) The software failure incident reported in Article 51408 was not due to hardware issues but rather a DNS malfunction at the internet service provider's HQ. The issue was specifically related to the domain name system (DNS) technology, which converts web addresses into IP addresses. This malfunction in the DNS server caused users to experience intermittent service outages, making them unable to access websites from their broadband connection. Users could work around the outage by switching to Google's public DNS servers, which suggests that the fault lay in Plusnet's DNS service rather than in hardware [51408].
Objective (Malicious/Non-malicious) non-malicious (a) The software failure incident reported in Article 51408 was non-malicious. The issue was identified as a DNS malfunction at the internet service provider's HQ, specifically affecting Plusnet users. The outage was caused by a technical problem with the domain name system (DNS), which converts web addresses into IP addresses. Plusnet acknowledged the problem and worked to resolve it, with services being restored after some time. There is no indication in the article that the failure was due to malicious intent or actions by individuals [51408].
Intent (Poor/Accidental Decisions) accidental_decisions (a) The software failure incident at Plusnet was not directly attributed to poor decisions. Instead, it was caused by a DNS malfunction at the internet service provider's HQ, leading to intermittent service outages for users [51408]. The incident was described as an outage affecting phone and broadband services as well as the website, with the company working to resolve the issue and apologizing to customers for the inconvenience caused. There is no indication in the article that poor decisions were a contributing factor to the software failure incident.
Capability (Incompetence/Accidental) unknown (a) The software failure incident reported in Article 51408 was not attributed to development incompetence. The issue was specifically related to a DNS malfunction at the internet service provider's HQ, which caused intermittent service outages for Plusnet users. The outage was due to the failure of the domain name system, which is a critical component for translating web addresses into IP addresses. The incident was described as an outage affecting phone and broadband services, with efforts being made to resolve the issue promptly [51408]. (b) The software failure incident in Article 51408 was not described as accidental. The outage was clearly linked to a DNS malfunction at Plusnet's headquarters, leading to service disruptions for users. The company acknowledged the issue and worked to restore services, indicating a technical problem rather than an accidental failure [51408].
Duration temporary (a) The software failure incident described in the article was temporary. Plusnet users experienced intermittent service outages due to a DNS malfunction at the internet service provider’s HQ. The company worked to resolve the issue, and services were restored on Thursday afternoon, indicating that the failure was not permanent [51408].
Behaviour crash (a) crash: The software failure incident described in the article can be categorized as a crash. Users experienced intermittent service outages due to a DNS malfunction at the internet service provider's HQ, leaving them unable to access websites from their broadband connection. This indicates that the DNS service lost its state and stopped performing its intended function [51408].
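Several rationales above refer to DNS converting web addresses into IP addresses and to Google Public DNS as a short-term workaround. The following is a minimal sketch, using only the Python standard library, of what such a lookup looks like on the wire when sent to a specific server. The function names are hypothetical, `8.8.8.8` is Google Public DNS's primary IPv4 address, and the sketch deliberately omits details a real resolver needs (response ID validation, retries, TCP fallback, record types other than A).

```python
import socket
import struct

GOOGLE_PUBLIC_DNS = "8.8.8.8"  # Google Public DNS primary IPv4 address


def build_a_query(hostname: str, query_id: int = 0x1234) -> bytes:
    """Encode a standard recursive DNS query for the A record of hostname."""
    # Header: ID, flags (0x0100 = recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length; a zero byte ends the name.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A record), QCLASS=1 (Internet).
    return header + qname + struct.pack("!HH", 1, 1)


def _skip_name(message: bytes, offset: int) -> int:
    """Return the offset just past an encoded (possibly compressed) name."""
    while True:
        length = message[offset]
        if (length & 0xC0) == 0xC0:  # two-byte compression pointer ends the name
            return offset + 2
        if length == 0:  # root label ends the name
            return offset + 1
        offset += 1 + length


def parse_a_records(message: bytes) -> list:
    """Extract IPv4 addresses from the answer section of a DNS response."""
    _, _, qdcount, ancount, _, _ = struct.unpack("!HHHHHH", message[:12])
    offset = 12
    for _ in range(qdcount):
        offset = _skip_name(message, offset) + 4  # also skip QTYPE and QCLASS
    addresses = []
    for _ in range(ancount):
        offset = _skip_name(message, offset)
        rtype, rclass, _ttl, rdlength = struct.unpack(
            "!HHIH", message[offset:offset + 10]
        )
        offset += 10
        if rtype == 1 and rclass == 1 and rdlength == 4:
            addresses.append(socket.inet_ntoa(message[offset:offset + 4]))
        offset += rdlength
    return addresses


def resolve_via(hostname: str, server: str = GOOGLE_PUBLIC_DNS,
                timeout: float = 3.0) -> list:
    """Send the query over UDP port 53 to the given server and parse the reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_a_query(hostname), (server, 53))
        response, _ = sock.recvfrom(512)
    return parse_a_records(response)
```

Calling `resolve_via("example.com")` requires network access. Pointing `server` at a different resolver is the programmatic equivalent of the manual DNS switch that users applied during the outage.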

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence property (d) property: People's material goods, money, or data were impacted due to the software failure. The software failure incident at Plusnet resulted in users experiencing intermittent service outages due to a DNS malfunction, leaving them unable to access websites from their broadband connection [51408]. This outage affected the phone and broadband service as well as the Plusnet website, impacting users' ability to connect to the internet and access online resources. Additionally, the malfunction disrupted the normal functioning of the domain name system, which converts web addresses into IP addresses, thereby affecting users' ability to navigate the internet [51408].
Domain information, utilities (a) The software failure incident reported in Article 51408 affected the production and distribution of information as Plusnet users were unable to access websites from their broadband connection due to a DNS malfunction at the internet service provider’s HQ. This issue disrupted the flow of information over the internet for the affected users [51408]. (g) Additionally, the incident impacted utilities in the sense that Plusnet's phone and broadband services were affected by the outage. The article does not report any disruption to power, gas, steam, water, or sewage services [51408]. (m) The failed system belongs to the telecommunications industry, which would fall under the "other" category because it is not explicitly listed among the provided options. The incident specifically affected Plusnet's telecommunications services [51408].
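The symptom recorded in this report, where users were nominally connected but could not resolve web addresses, can be told apart from a full loss of connectivity with two simple probes. The sketch below is an illustration under stated assumptions: the probe targets (`8.8.8.8` on TCP port 53 and the hostname `example.com`) and the diagnosis labels are choices made here, not details from the article.

```python
import socket


def can_reach_ip(ip: str = "8.8.8.8", port: int = 53, timeout: float = 3.0) -> bool:
    """Check raw connectivity by opening a TCP connection to a literal IP."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def can_resolve(hostname: str = "example.com") -> bool:
    """Check name resolution through the operating system's configured DNS."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def diagnose(ip_reachable: bool, name_resolves: bool) -> str:
    """Classify the two probe results into a coarse diagnosis label."""
    if ip_reachable and name_resolves:
        return "ok"
    if ip_reachable and not name_resolves:
        # Connected, but names cannot be resolved: the pattern in this incident.
        return "dns_failure"
    if not ip_reachable and not name_resolves:
        return "no_connectivity"
    # Names resolve (possibly from a cache) but the probe IP is unreachable.
    return "inconclusive"
```

A run on a live connection would be `diagnose(can_reach_ip(), can_resolve())`; a result of `"dns_failure"` points at the configured DNS servers rather than at the broadband link itself.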

Sources
