Incident: BlackBerry Global Network Outage: System Crash and Backup Failure

Published Date: 2011-10-13

Postmortem Analysis
Timeline 1. The BlackBerry service outage lasted nearly four days, starting on a Monday at about 11 am BST [8277]. 2. The article was published on October 13, 2011 [8277]. 3. Estimation: The incident likely started on Monday, October 10, 2011.
System 1. Dual redundant high-capacity core switch designed to protect the infrastructure failed. 2. Backup switch did not function as intended. 3. Systems crash in RIM's Slough network center caused a "ripple effect" to the worldwide system. 4. Hardware and software upgrade for the system serving customers in Europe, the Middle East, Africa, and India failed. 5. The attempt to revert to the backup system also failed. [Article 8277]
Responsible Organization 1. RIM's Slough network center experienced a systems crash which caused a "ripple effect" leading to the worldwide system failure [8277]. 2. A dual redundant high-capacity core switch designed to protect the infrastructure failed, causing outages and delays for customers in various regions [8277]. 3. The backup switch did not function as intended, leading to a backlog of data in the system [8277].
Impacted Organization 1. BlackBerry customers worldwide were impacted by the software failure incident [8277].
Software Causes 1. A systems crash in RIM's Slough network center caused a "ripple effect" to the worldwide system, leading to outages and delays for customers in various regions [8277]. 2. A dual redundant high-capacity core switch designed to protect the infrastructure failed, causing a cascade failure in the system [8277]. 3. The backup switch did not function as intended, leading to a backlog of data in the system [8277]. 4. An attempted hardware and software upgrade for the system serving customers in Europe, the Middle East, Africa, and India failed, resulting in the outage [8277].
Non-software Causes 1. Failure of the attempted hardware and software upgrade during maintenance at RIM's network operations centre in Slough, Berkshire [8277]. 2. Failure of the dual redundant high-capacity core switch designed to protect the infrastructure [8277]. 3. Backup switch not functioning as intended during the system failure [8277].
Impacts 1. Delays in delivering emails and instant messages due to the backlog that built up during the near four-day outage of the BlackBerry service [8277]. 2. Millions of users did not receive emails or messages and had limited web surfing capability [8277]. 3. The outage affected customers in Europe, the Middle East, Africa, India, Brazil, Chile, Argentina, North America, and South America [8277]. 4. RIM may have to pay up to $100 million in compensation to its 70 million users worldwide [8277]. 5. The disruption led to discussions among UK carriers regarding whether RIM's terms and conditions exclude their corporate customers from any compensation [8277].
Preventions 1. Implementing thorough testing procedures for hardware and software upgrades to ensure they are successful before deployment could have prevented the software failure incident [8277]. 2. Regularly testing and verifying the functionality of backup systems to ensure they can seamlessly take over in case of a primary system failure could have prevented the software failure incident [8277]. 3. Conducting a comprehensive risk assessment and implementing additional redundancy measures in critical infrastructure components to prevent cascade failures in the system could have prevented the software failure incident [8277].
Fixes 1. Implementing a more robust and reliable backup system to prevent similar failures in the future [8277]. 2. Conducting a thorough root cause analysis to identify why the backup system failed to function as intended during the incident [8277]. 3. Enhancing the system's infrastructure to ensure quicker restoration in case of catastrophic failures [8277].
References 1. Joint chief executive Mike Lazaridis [Article 8277] 2. RIM executive involved in fixing the problem [Article 8277] 3. Analyst Malik Saadi from Informa [Article 8277] 4. Co-chief executive Jim Balsillie [Article 8277]
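Prevention item 2 above (regularly verifying that the backup can take over from the primary) can be illustrated with a small failover drill. This is a minimal sketch and not RIM's actual procedure: the Switch class and the route and failover_drill functions are hypothetical names, and the in-memory model only stands in for real network equipment.

```python
from dataclasses import dataclass


@dataclass
class Switch:
    """Hypothetical stand-in for a core switch; not a real device API."""
    name: str
    healthy: bool = True

    def forward(self, message: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} delivered {message}"


def route(message: str, primary: Switch, backup: Switch) -> str:
    """Try the primary first and fall back to the backup on failure."""
    try:
        return primary.forward(message)
    except ConnectionError:
        return backup.forward(message)


def failover_drill(primary: Switch, backup: Switch) -> bool:
    """Take the primary down on purpose and check the backup carries traffic."""
    was_healthy = primary.healthy
    primary.healthy = False
    try:
        route("drill-probe", primary, backup)
        return True
    except ConnectionError:
        return False
    finally:
        primary.healthy = was_healthy  # always restore the primary's state


if __name__ == "__main__":
    print(failover_drill(Switch("primary"), Switch("backup")))
    print(failover_drill(Switch("primary"), Switch("backup", healthy=False)))
```

The drill returns False when the backup cannot carry traffic, which is the condition the incident exposed only under a real failure. Running such a check on a schedule is one way to surface a non-functional backup before it is needed.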

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization a) The software failure incident at Research In Motion (RIM), the company behind BlackBerry, was not the first time such an incident had occurred. The article mentions that the outage was the first for 18 months and the largest in the company's history, indicating that previous incidents had taken place [8277]. b) The article does not provide specific information about similar incidents happening at other organizations or with their products and services. Therefore, it is unknown if similar incidents have occurred elsewhere based on the provided articles.
Phase (Design/Operation) design, operation (a) The software failure incident in the BlackBerry services was primarily attributed to a design failure. The incident was caused by a systems crash in RIM's Slough network center, which had a ripple effect on the worldwide system due to a dual redundant high-capacity core switch failure. The backup switch also did not function as intended, leading to a backlog of data in the system [8277]. (b) The operation of the system, specifically the attempt at a hardware and software upgrade by engineers at RIM's network operations center in Slough, Berkshire, also played a role in the failure. The update failed, and when the company tried to revert to a backup system, that also failed, resulting in millions of users not receiving emails or messages and experiencing limited web surfing capability [8277].
Boundary (Internal/External) within_system (a) The software failure incident with BlackBerry services was primarily within the system. The failure was caused by a systems crash in RIM's Slough network center, which led to a "ripple effect" affecting the worldwide system. The incident was further exacerbated by a dual redundant high-capacity core switch failure within RIM's infrastructure, causing outages and delays for customers in various regions [Article 8277]. The backup switch also failed to function as intended, resulting in a backlog of data in the system. RIM is conducting a root cause analysis to uncover why the system took longer to restore than expected, indicating an internal system issue. (b) The software failure incident did not have significant contributing factors originating from outside the system. While the incident affected millions of users globally, there is no mention in the article of external factors such as external attacks or third-party interference playing a significant role in the failure. The primary focus of the analysis and response from RIM was on internal system failures and infrastructure issues [Article 8277].
Nature (Human/Non-human) non-human_actions, human_actions (a) The software failure incident occurred due to non-human actions, specifically a systems crash in RIM's Slough network center that caused a "ripple effect" to the worldwide system. The failure was caused by a dual redundant high-capacity core switch designed to protect the infrastructure that failed, leading to outages and delays for customers in various regions. Additionally, the backup switch did not function as intended, causing a backlog of data in the system [8277]. (b) The software failure incident also involved human actions, as engineers at RIM's network operations center in Slough, Berkshire, attempted a hardware and software upgrade for the system serving customers in Europe, the Middle East, Africa, and India. However, the update failed, and when the company tried to revert to a backup system, that also failed, resulting in millions of users not receiving emails or messages and experiencing limited web surfing capability. Co-chief executive Jim Balsillie mentioned that the carriers understood the complexity of the systems, and when such incidents happen, everyone pulls together to serve the customers [8277].
Dimension (Hardware/Software) hardware, software (a) The software failure incident in the BlackBerry services was primarily attributed to hardware issues. The incident was caused by a systems crash in RIM's Slough network center, where a dual redundant high-capacity core switch designed to protect the infrastructure failed, leading to outages and delays for customers in various regions. Additionally, the backup switch did not function as intended, resulting in a backlog of data in the system [8277]. (b) The software failure incident also had contributing factors originating in software. The problems began when engineers attempted a hardware and software upgrade for the system serving customers in certain regions, which ultimately failed. When the company tried to revert to a backup system, that also failed, leaving millions of users unable to receive emails or messages and with limited web surfing capability. The incident highlighted the importance of software reliability and the need for thorough testing and working backup systems [8277].
Objective (Malicious/Non-malicious) non-malicious (a) The software failure incident described in the articles does not indicate any malicious intent. The failure was attributed to technical issues such as a systems crash in RIM's network center, a failed redundant switch, and a backup system that did not function as intended. The incident was characterized by a series of technical failures during a hardware and software upgrade attempt, leading to disruptions in services for millions of users globally [8277]. (b) The software failure incident was non-malicious in nature, stemming from technical issues and system failures rather than any deliberate attempt to harm the system. The failure was primarily attributed to hardware and software upgrade problems, backup system failures, and a cascade effect in the system due to a dual redundant switch failure [8277].
Intent (Poor/Accidental Decisions) poor_decisions (a) The software failure incident involving BlackBerry services was primarily due to poor decisions. The incident was triggered when engineers attempted a hardware and software upgrade that ultimately failed, leading to a cascade failure in the system. The backup systems also did not function as intended, causing a backlog of data in the system. Additionally, the failure was exacerbated by the fact that the backup switch designed to protect the infrastructure failed, leading to outages and delays for customers across various regions [8277].
Capability (Incompetence/Accidental) development_incompetence, accidental (a) The software failure incident in the BlackBerry services was primarily attributed to a development incompetence factor. The incident was caused by a systems crash in RIM's Slough network center, which led to a "ripple effect" affecting the worldwide system. The failure was specifically linked to a dual redundant high-capacity core switch designed to protect the infrastructure that failed, causing outages and delays for customers in various regions. Additionally, the backup switch did not function as intended, leading to a backlog of data in the system [8277]. (b) The accidental aspect of the failure can be seen in the attempted hardware and software upgrade by engineers at RIM's network operations center in Slough, Berkshire. The update failed, and when the company tried to revert to a backup system, that also failed, resulting in millions of users not receiving emails or messages and experiencing limited web surfing capability. This accidental chain of events contributed to the widespread disruption of services [8277].
Duration temporary (a) The software failure incident described in the articles was temporary. The BlackBerry services were disrupted for nearly four days, causing delays in delivering emails and instant messages due to a backlog that built up during the outage [8277]. The outage began when engineers attempted a hardware and software upgrade, which failed, leading to a cascade failure in the system. The company tried to revert to a backup system, but that also failed, resulting in millions of users not receiving emails or messages and having limited web surfing capability. Over that period the outage affected regions across Europe, the Middle East, Africa, India, Brazil, Chile, Argentina, North America, and South America [8277]. The company was working to clear the backlog of emails and messages and pledged that none of the messages sent during the disruption would be deleted [8277].
Behaviour crash, omission, value, other (a) crash: The software failure incident in the articles can be categorized as a crash. The incident involved a systems crash in RIM's Slough network center, which caused a "ripple effect" to the worldwide system, leading to outages and delays for customers in various regions [Article 8277]. (b) omission: The software failure incident can also be related to omission as the update attempted by engineers at RIM's network operations center failed, and when they tried to revert to a backup system, that also failed. This resulted in millions of users not receiving emails or messages and experiencing limited web surfing capability [Article 8277]. (c) timing: The timing of the software failure incident can be considered as a factor in the overall failure. The incident involved delays in delivering emails and instant messages due to the backlog that built up during the near four-day outage of the service. Additionally, the restoration of the system took longer than expected, leading to uncertainty about the estimated time of full recovery worldwide [Article 8277]. (d) value: The software failure incident can be linked to a failure in value as the system was not performing its intended functions correctly. Users were not receiving emails or messages, and there were limitations in web surfing capability during the outage period [Article 8277]. (e) byzantine: The software failure incident does not exhibit characteristics of a byzantine failure where the system behaves erroneously with inconsistent responses and interactions. The incident primarily involved a systems crash and subsequent failures in the backup systems, leading to outages and delays for users [Article 8277]. (f) other: The other behavior exhibited in the software failure incident is the failure of the redundant high-capacity core switch designed to protect the infrastructure. This failure caused outages and delays for customers in various regions and led to a cascade failure in the system, highlighting a critical point of failure in the system's design [Article 8277].
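The timing rationale above notes that deliveries stayed delayed because a backlog built up during the outage. A rough way to see why a backlog outlasts the failure itself is simple queue arithmetic: messages accumulate at the arrival rate while nothing is delivered, and once service resumes only the spare capacity works the queue down. The sketch below assumes constant rates, and every figure in it is hypothetical; the article gives no throughput numbers.

```python
def backlog_after_outage(arrival_rate: float, outage_hours: float) -> float:
    """Messages queued while nothing is delivered (service rate is zero)."""
    return arrival_rate * outage_hours


def drain_hours(backlog: float, arrival_rate: float, service_rate: float) -> float:
    """Hours to clear the backlog once service resumes.

    New messages keep arriving, so only the spare capacity
    (service_rate - arrival_rate) works down the queue.
    """
    spare = service_rate - arrival_rate
    if spare <= 0:
        raise ValueError("service rate must exceed arrival rate to drain the queue")
    return backlog / spare


if __name__ == "__main__":
    # Hypothetical figures chosen only for illustration.
    arrival = 1_000_000.0   # messages per hour
    service = 1_500_000.0   # messages per hour after restoration
    queued = backlog_after_outage(arrival, outage_hours=12)
    print(queued)                                 # 12000000.0
    print(drain_hours(queued, arrival, service))  # 24.0
```

With these made-up rates, a 12-hour stoppage takes 24 hours to clear, so users keep seeing delays well after the underlying fault is fixed.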

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence property, delay, theoretical_consequence (d) Property: People's material goods, money, or data were impacted due to the software failure. The software failure incident with BlackBerry services caused disruptions in delivering emails and instant messages, leading to a backlog of data in the system. This backlog resulted from a systems crash in RIM's network center, causing outages and delays for customers in various regions. The failure of a dual redundant high-capacity core switch and the backup system not functioning as intended led to the accumulation of data in the system [8277]. Analysts estimated that RIM might have to pay up to $100 million in compensation to its 70 million users worldwide, based on a $5 per month fee for the services, due to the service outage [8277].
Domain information, finance, other (a) The failed system was related to the information industry as it affected the production and distribution of information. The BlackBerry services disruption caused delays in delivering emails and instant messages due to a backlog that built up during the outage [Article 8277]. (h) The failed system also impacted the finance industry as there were discussions about potential compensation to customers, with an analyst estimating that RIM may have to pay up to $100 million in compensation to its 70 million users worldwide [Article 8277]. (m) The failed system could also be categorized under the "other" industry as it involved the technology and telecommunications sector, specifically affecting the services provided by Research In Motion (RIM) related to mobile communication devices like the BlackBerry [Article 8277].
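The compensation figures quoted above ($100 million, 70 million users, a $5 per month fee) can be put side by side with some quick arithmetic. This is only a sanity check under stated assumptions: a 30-day month and an outage rounded up to four days are assumptions made here, and the article does not say how the analyst derived the estimate.

```python
USERS = 70_000_000           # from the article
MONTHLY_FEE_USD = 5.0        # from the article
ESTIMATE_USD = 100_000_000   # analyst's upper estimate, from the article
DAYS_PER_MONTH = 30          # assumption for pro-rating
OUTAGE_DAYS = 4              # assumption: "nearly four days", rounded up


def per_user(estimate: float, users: int) -> float:
    """Compensation per user if the estimate were split evenly."""
    return estimate / users


def pro_rata_refund(users: int, monthly_fee: float,
                    outage_days: float, days_per_month: int) -> float:
    """Total refund if every user got back the fee for the outage days only."""
    return users * monthly_fee * outage_days / days_per_month


if __name__ == "__main__":
    print(round(per_user(ESTIMATE_USD, USERS), 2))  # 1.43
    refund = pro_rata_refund(USERS, MONTHLY_FEE_USD, OUTAGE_DAYS, DAYS_PER_MONTH)
    print(round(refund / 1_000_000, 1))             # 46.7
```

Under these assumptions, an even split of $100 million is about $1.43 per user, while a strict pro-rata refund for four days would total roughly $46.7 million, so the quoted upper bound corresponds to more than a day-for-day refund of the service fee.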

