Incident: BlackBerry Outage: Redundant Systems Failure Causes Global Service Disruption

Published Date: 2011-10-12

Postmortem Analysis
Timeline 1. The software failure incident began three days before the article was published on October 12, 2011, which places the start of the outage around October 9, 2011 [8231].
System 1. Core switch at RIM failed 2. Redundant systems at RIM failed 3. Failover systems at RIM did not function as expected [8231]
Responsible Organization 1. Research in Motion (RIM) - The software failure incident was primarily caused by a failure of one of RIM's core switches, followed by the failure of redundant systems [8231].
Impacted Organization 1. RIM customers in Europe [8231] 2. BlackBerry users around the world [8231]
Software Causes 1. Failure of one of RIM's core switches. 2. Redundant systems failing to function as expected despite regular testing. 3. Throttling of service in the impacted area, leading to a backup of mail in other regions. 4. Backlog of messages causing a cascading outage for BlackBerry users worldwide. [8231]
Non-software Causes 1. The failure of one of RIM's core switches [8231] 2. Redundant systems failing to function as expected [8231]
Impacts 1. Multi-day outages for RIM customers in Europe, with complaints of mail delays and inaccessible service on BlackBerry devices globally [8231]. 2. Failure of one of RIM's core switches and of its redundant systems, leading to a significant backup of mail and service interruptions for customers [8231]. 3. Throttling of service in the impacted area to stabilize it, resulting in a backup of mail in other regions addressed to RIM's European customers [8231]. 4. Delays and service interruptions of varying severity across the customer base [8231]. 5. Recovery effort focused on clearing the email backlog, particularly in Europe and for anyone sending email to Europe [8231].
Preventions 1. Implementing more robust redundancy systems and failover mechanisms to ensure that redundant systems function as expected in case of a core switch failure [8231]. 2. Conducting more frequent and rigorous testing of failover systems to identify and address any potential issues before they lead to service outages [8231]. 3. Improving communication and transparency with customers by providing timely updates and information about the status of the incident and the steps being taken to resolve it [8231].
Fixes 1. Implementing a more robust failover system to ensure redundancy in case of core switch failures [8231] 2. Conducting thorough testing of failover systems to ensure they function as expected in real-world scenarios [8231] 3. Clearing out the backlog of email messages to restore service for affected customers [8231]
References 1. Research in Motion CTO for Software David Yach [8231] 2. RIM's official statements and press conference [8231]
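The Preventions and Fixes entries call for failover testing that reflects real-world conditions. A minimal sketch of such a drill follows, assuming a simple primary/standby pair. The names Switch, FailoverRouter and failover_drill are hypothetical and do not describe RIM's actual infrastructure, which the article does not detail.

```python
from dataclasses import dataclass


@dataclass
class Switch:
    """Hypothetical stand-in for a core switch."""
    name: str
    healthy: bool = True

    def forward(self, message: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{message} via {self.name}"


class FailoverRouter:
    """Routes through the primary and falls back to the standby on failure."""

    def __init__(self, primary: Switch, standby: Switch):
        self.primary = primary
        self.standby = standby

    def send(self, message: str) -> str:
        try:
            return self.primary.forward(message)
        except ConnectionError:
            # If the standby is also down, the error propagates as an outage.
            return self.standby.forward(message)


def failover_drill(router: FailoverRouter) -> bool:
    """Take the primary down on purpose and check that traffic still flows."""
    was_healthy = router.primary.healthy
    router.primary.healthy = False
    try:
        router.send("drill-probe")
        return True
    except ConnectionError:
        return False
    finally:
        # Restore the primary to its pre-drill state.
        router.primary.healthy = was_healthy
```

A drill of this kind only exercises the failure modes it simulates. The article notes that the failover had been tested regularly and still did not function as expected, so a passing drill is evidence of readiness, not proof.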

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization (a) The software failure incident has happened again at one_organization: The article [8231] reports that Research in Motion (RIM) experienced a significant software failure incident with its BlackBerry services. This incident involved a failure of one of RIM's core switches, followed by the failure of redundant systems, leading to a major outage affecting customers in Europe and the Americas. Despite regular testing of failover systems, the failover did not function as expected, resulting in a significant backlog of mail and service interruptions for customers. RIM is working to restore service and clear out the backlog of email messages. (b) The software failure incident has happened again at multiple_organization: There is no information in the provided article to suggest that a similar software failure incident has happened at other organizations or with their products and services.
Phase (Design/Operation) design, operation (a) The software failure incident in the article was related to the design phase. The root cause of the initial European BlackBerry e-mail service outage was described as a failure of one of RIM's core switches, which then cascaded into a more significant issue when the redundant systems also failed to function as expected despite regular testing of failover systems [8231]. (b) The software failure incident in the article was also related to the operation phase. RIM responded to the outage by throttling service in the impacted area to stabilize service, which resulted in a backup of mail in other regions trying to reach RIM's European customers. This operational decision led to delays and service interruptions for many customers [8231].
Boundary (Internal/External) within_system (a) within_system: The software failure incident involving BlackBerry services was primarily within the system. The root cause was identified as a failure of one of RIM's core switches, which then cascaded into a more significant issue when the redundant systems also failed to function as expected. This internal failure led to a backlog of mail and service interruptions for customers [8231]. (b) outside_system: There is no evidence in the article to suggest that the software failure incident was caused by contributing factors originating from outside the system. The focus of the incident was on internal system failures and the efforts to restore service within the company's infrastructure [8231].
Nature (Human/Non-human) non-human_actions (a) The software failure incident in the article was primarily due to non-human actions. The root cause of the initial European BlackBerry e-mail service outage was described as a failure of one of RIM's core switches, followed by the failure of redundant systems despite regular testing of failover systems [8231]. (b) Human actions were not mentioned as contributing factors to the software failure incident reported in the article.
Dimension (Hardware/Software) hardware (a) The software failure incident was primarily attributed to hardware issues. The initial outage was described as a failure of one of RIM's core switches, indicating a hardware-related problem. Additionally, the redundant systems also failed to function as expected, further emphasizing the hardware aspect of the failure [8231]. (b) While the incident involved software components such as failover systems, the root cause and primary contributing factors were related to hardware issues, specifically the failure of a core switch and of the redundant systems [8231].
Objective (Malicious/Non-malicious) non-malicious (a) The software failure incident described in the article does not indicate any malicious intent. There is no evidence of a hack or security breach involved in the outage. The root cause was identified as a failure of one of RIM's core switches, followed by a failure of redundant systems, leading to a significant backlog of mail. RIM's CTO for Software mentioned that the failover did not function as expected, despite regular testing of failover systems. The company is focused on restoring service and clearing out the backlog of email messages without dropping any messages [8231]. (b) The software failure incident is categorized as non-malicious.
Intent (Poor/Accidental Decisions) unknown (a) The software failure incident described in the article does not explicitly mention poor decisions as a contributing factor. The root cause of the initial European BlackBerry e-mail service outage was identified as a failure of one of RIM's core switches, followed by the failure of redundant systems. The article highlights that the failover systems did not function as expected, despite regular testing. This indicates a technical failure rather than poor decisions as the primary cause of the incident [8231]. (b) The incident does not provide specific details indicating that the failure was due to accidental decisions or mistakes. The focus of the article is on technical issues such as the failure of core switches and redundant systems, as well as the challenges faced in restoring service and clearing out the backlog of emails. Therefore, the software failure incident appears to be more related to technical failures rather than accidental decisions [8231].
Capability (Incompetence/Accidental) accidental (a) The software failure incident in Article 8231 was not explicitly attributed to development incompetence. The root cause of the initial European BlackBerry e-mail service outage was described as a failure of one of RIM's core switches, with subsequent issues arising from the failure of redundant systems. The CTO mentioned that the failover did not function as expected, despite regular testing of failover systems. This indicates a technical failure rather than incompetence in development [8231]. (b) The software failure incident in Article 8231 was more aligned with an accidental failure rather than intentional. The article mentions that there was no evidence of a hack or security breach involved in the outage. The issues seemed to have stemmed from technical failures within RIM's systems, such as the core switch failure and subsequent problems with redundant systems. The CTO mentioned that the failover did not work as expected, leading to a significant backlog of mail. This points towards accidental technical failures rather than intentional actions [8231].
Duration temporary (a) The software failure incident described in the article was temporary. It was not a permanent failure as the article mentions that RIM was working "around the clock" to try and restore service [8231]. Additionally, the article states that RIM was focusing on clearing out the backlog of email, indicating efforts to resolve the issue and restore normal service [8231].
Behaviour crash, other (a) crash: The software failure incident described in the article can be categorized as a crash. It mentions that the initial outage was caused by a failure of one of RIM's core switches, which led to a cascading effect when the redundant systems also failed to function as expected, resulting in a significant backup of mail and service interruptions for customers [8231]. (b) omission: The incident does not specifically mention a failure due to the system omitting to perform its intended functions at an instance(s). (c) timing: The incident does not describe a failure due to the system performing its intended functions correctly, but too late or too early. (d) value: The incident does not indicate a failure due to the system performing its intended functions incorrectly. (e) byzantine: The incident does not suggest a failure due to the system behaving erroneously with inconsistent responses and interactions. (f) other: The behavior of the software failure incident can be categorized as a cascading effect where the initial failure of a core switch led to the failure of redundant systems, causing a significant backlog of mail and service interruptions for customers. This cascading effect is a notable aspect of the incident [8231].
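The cascading behaviour described in the Behaviour row, where throttling in one region lets mail pile up elsewhere, can be illustrated with a toy queue model. This is a sketch under stated assumptions: the function simulate_backlog and all message rates below are invented for illustration and are not figures from the article.

```python
def simulate_backlog(arrivals_per_tick, capacity_per_tick):
    """Return the queue length after each tick of a single delivery queue."""
    backlog = 0
    history = []
    for arrivals, capacity in zip(arrivals_per_tick, capacity_per_tick):
        # Undelivered messages carry over; the queue never goes negative.
        backlog = max(0, backlog + arrivals - capacity)
        history.append(backlog)
    return history


# Illustrative numbers only: steady inbound mail, with delivery capacity
# throttled during ticks 2 and 3 and raised above the arrival rate afterwards.
arrivals = [100] * 6
capacity = [100, 100, 20, 20, 150, 150]
print(simulate_backlog(arrivals, capacity))  # [0, 0, 80, 160, 110, 60]
```

In this model the backlog keeps draining only while capacity exceeds the arrival rate, which is consistent with the article's account of a recovery effort focused on clearing queued email after service stabilized.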

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence delay, theoretical_consequence (a) unknown (b) unknown (c) unknown (d) unknown (e) Customers experienced mail delays and inaccessibility on their BlackBerry devices due to the software failure incident [8231]. (f) unknown (g) Customers experienced delays and service interruptions due to the software failure incident, but there were no reports of any severe consequences such as death or physical harm [8231]. (h) The article mentions that RIM was working to clear out the backlog of emails caused by the software failure incident, indicating a potential theoretical consequence of email loss if the issue was not resolved [8231]. (i) unknown
Domain information (a) The software failure incident reported in the article is related to the information industry. Specifically, it affected BlackBerry users who rely on the BlackBerry email service provided by Research in Motion (RIM). The failure disrupted the production and distribution of information through email services, impacting users in Europe and the Americas [8231].
