Incident: BBC Website Outage: Major Network Problem Causes Total Outage

Published Date: 2011-03-29

Postmortem Analysis
Timeline 1. The software failure incident happened on the evening of March 28, 2011, lasting for almost an hour until just before midnight [4721].
System 1. Routing system: a routing problem in the BBC's network led to the outage of the BBC website [4721].
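Responsible Organization 1. The BBC: the outage was attributed to a major network problem within its own systems, reported as a routing issue possibly caused by a software configuration or hardware problem [4721].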
Impacted Organization 1. BBC's website users [4721]
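Software Causes 1. A possible software configuration problem affecting routing; the article does not confirm whether the underlying cause was software configuration or hardware [4721].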
Non-software Causes 1. Network problem: The incident was attributed to a major network problem that caused a total outage of all BBC websites [4721]. 2. Possible hardware problem: Nevali, a metadata magician at the BBC, said the routing issue could stem from either a software configuration or a hardware problem [4721].
Impacts 1. The BBC's website, bbc.co.uk, and all other BBC websites were offline for almost an hour, leaving millions of users unable to access them during that time [4721]. 2. Users experienced difficulties accessing the World Cup and Wimbledon live streaming, and some received a 500 internal server error when trying to access the site [4721]. 3. The incident led to speculation and conspiracy theories on social media platforms like Twitter, with users attributing the outage to various causes, including potential attacks by groups like Anonymous [4721].
Preventions 1. Implementing robust network monitoring and alerting systems to quickly identify and address routing issues or hardware problems that may lead to outages [4721]. 2. Conducting regular software configuration audits to ensure proper settings and configurations are in place to prevent unexpected failures [4721]. 3. Performing thorough testing, including stress testing and load testing, to identify and address potential network and server issues before they impact users [4721].
Fixes 1. Conduct a thorough investigation to identify the root cause of the network problem that led to the outage, whether it was a software configuration issue or a hardware problem [4721]. 2. Implement measures to prevent similar network failures in the future, such as enhancing network redundancy, improving software configurations, or upgrading hardware components if necessary [4721].
References 1. Nevali, described as a "metadata magician at the BBC" [4721] 2. GaryDelaney [4721] 3. Peter Horrocks, director of the BBC World Service [4721] 4. Steve Herrmann, editor of the BBC news website [4721]
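The first Preventions item above (monitoring and alerting) can be illustrated with a minimal sketch. It is not the BBC's tooling, which the article does not describe. The function names `probe`, `is_failure`, and `should_alert`, the 5-second timeout, and the rule of alerting after three consecutive failed probes are all assumptions made for illustration. A probe counts as failed when the server returns a 5xx status (such as the 500 error users reported) or does not answer at all (as a routing failure might cause).

```python
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Return the HTTP status code for url, or None if nothing answered."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 500.
        return exc.code
    except (urllib.error.URLError, OSError):
        # No answer at all: DNS, routing, connection, or timeout failure.
        return None


def is_failure(status):
    """Treat 'no answer' and any 5xx status as a failed probe."""
    return status is None or status >= 500


def should_alert(statuses, threshold=3):
    """Return True once the most recent `threshold` probes have all failed.

    The threshold of 3 is an illustrative assumption; it keeps a single
    dropped request from paging anyone.
    """
    if len(statuses) < threshold:
        return False
    return all(is_failure(s) for s in statuses[-threshold:])
```

In use, a scheduler would call `probe` at a fixed interval, append each result to a list, and raise an alert when `should_alert` returns True. Requiring consecutive failures is one common way to trade a short detection delay for fewer false alarms.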

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization (a) The software failure incident has happened again at one_organization: The article mentions that in 2009, similar problems appeared to afflict the BBC website, caused by a network failure that slowed down access to the site and prevented some people from visiting its home page. This indicates that the BBC website has experienced similar software failure incidents in the past [4721]. (b) The software failure incident has happened again at multiple_organization: There is no specific mention in the article about similar incidents happening at other organizations or with their products and services. Therefore, it is unknown if the software failure incident has occurred at multiple organizations.
Phase (Design/Operation) design (a) The software failure incident described in the article is classified under the design phase. It was attributed to a major network problem: the editor of the BBC news website said a message from the BBC's technical support teams read 'Total outage of all BBC websites' [4721]. The article reports no operator error or misuse of the system, so the failure is taken to stem from contributing factors introduced by system development or updates rather than by operation of the system.
Boundary (Internal/External) within_system The software failure incident reported in Article 4721 was primarily within_system. The incident was attributed to a major network problem within the BBC's system, causing a total outage of all BBC websites for almost an hour. The BBC's technical support teams confirmed the total outage, indicating an internal issue with the network that affected the accessibility of the BBC News website [4721].
Nature (Human/Non-human) non-human_actions, human_actions (a) The software failure incident was attributed to a network problem, specifically a major network problem that caused a total outage of all BBC websites. Steve Herrmann, the editor of the BBC news website, mentioned in a blog post that they received a message from the BBC's technical support teams indicating the total outage. It was clarified that it was not a DoS (Denial of Service) attack but rather a routing issue or a software configuration/hardware problem [4721]. (b) Human actions were also speculated to be a possible cause of the software failure incident. There were conspiracy theories circulating on Twitter linking the outage to budget cuts at the BBC and suggesting potential attacks by groups like Anonymous. Additionally, the mention of technical problems with the BBC website by Peter Horrocks, director of the BBC World Service, and the investigation into the incident indicate human involvement in the failure [4721].
Dimension (Hardware/Software) hardware, software (a) A hardware problem was named as one possible cause. Nevali, a metadata magician at the BBC, said the outage was not a DoS attack but a routing issue, potentially caused by a software configuration or hardware problem [4721]. (b) A software cause is equally possible under the same statement. Steve Herrmann, the editor of the BBC news website, described the incident as a major network problem and said the technical support teams had reported a total outage of all BBC websites; a software configuration fault in the network would be consistent with that report [4721].
Objective (Malicious/Non-malicious) non-malicious (a) The article gives no indication that the outage of the BBC website was caused by intentional activity aimed at harming the system. Individuals within the BBC attributed it to a major network problem: a routing issue stemming from a software configuration or hardware problem, not a DoS attack [4721]. (b) The incident therefore appears non-malicious, stemming from technical issues rather than a deliberate attempt to disrupt the BBC website. BBC representatives acknowledged the technical difficulties and apologized for the inconvenience caused to users [4721].
Intent (Poor/Accidental Decisions) poor_decisions (a) The software failure incident related to poor decisions can be inferred from the article as there were rumors circulating that the BBC website outage might have been due to budget cuts affecting the online services. The article mentions that the BBC had confirmed it would cut its online budget by 25%, which fueled some outlandish rumors about the outage [4721]. This indicates that the incident could have been influenced by poor decisions related to budget cuts impacting the technical infrastructure. (b) The software failure incident related to accidental decisions can be seen in the article where it is mentioned that the outage was due to a major network problem. Steve Herrmann, editor of the BBC news website, mentioned that they received a message from the BBC's technical support teams stating 'Total outage of all BBC websites' and that it was a major network problem [4721]. This suggests that the failure was accidental and not intentional.
Capability (Incompetence/Accidental) accidental (a) The article does not point to development incompetence as the cause. The incident was attributed to a major network problem, specifically a routing issue or a software configuration/hardware problem [4721]. (b) The major network problem led to a total outage of all BBC websites for almost an hour. The outage was not intentional but an accidental failure that affected users' access to the BBC News website [4721].
Duration temporary The software failure incident reported in Article 4721 was temporary. The BBC's website went offline for almost an hour before being restored [4721]. The incident was described as a major network problem, and the BBC's technical support teams reported a total outage of all BBC websites during that time [4721]. The temporary nature of the failure is evident from the fact that the site was back online after the outage, indicating that the issue was resolved within a relatively short period.
Behaviour crash, omission, value, other (a) The software failure incident described in Article 4721 can be categorized as a crash. The BBC website went offline for almost an hour, resulting in a total outage of all BBC websites. During this time, the system lost its state and was not performing any of its intended functions [4721]. (b) The incident can also be linked to omission. Users reported not being able to access the BBC website and receiving a 500 internal server error instead, indicating that the system omitted to perform its intended functions at that instance [4721]. (c) There is no specific mention of timing-related issues in the software failure incident described in the article. (d) The failure can be associated with a value issue as well. Some users experienced problems accessing the World Cup and Wimbledon live streaming, indicating that the system was performing its intended functions incorrectly for those users [4721]. (e) The incident does not align with a byzantine behavior where the system behaves erroneously with inconsistent responses and interactions. (f) The other behavior exhibited by the system in this incident is related to a major network problem. The software failure was attributed to a network failure that caused the outage, indicating a network-related issue impacting the system's performance [4721].

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence delay The consequence of the software failure incident reported in Article 4721 was primarily categorized under option (e) delay. The incident resulted in the BBC's website going offline for almost an hour, causing inconvenience to users who were unable to access the site during that time. This delay in accessing the website was the main consequence observed from the software failure incident [4721].
Domain information, entertainment (a) The failed system in this incident was related to the information industry, specifically the BBC's online platform, bbc.co.uk, which provides news, entertainment, and various other information services to users [4721]. The incident caused the BBC website to go offline for almost an hour due to a major network problem, impacting users' access to the site and resulting in a total outage of all BBC websites [4721].

