Incident: BlackBerry Network Outage: Global Service Disruption and Core Switch Failure

Published Date: 2011-10-12

Postmortem Analysis
Timeline
1. The software failure incident happened in October 2011 [8322, 8266, 8535, 8527, 8341].
System
1. Core switch failure in RIM's infrastructure [8322, 8266, 8535]
2. Backup system failure [8527]
3. Relay software failure [8266]
Responsible Organization
1. Research in Motion (RIM): The software failure incident was caused by a core switch failure in RIM's infrastructure, leading to a massive network outage affecting BlackBerry users worldwide [8322, 8266, 8535, 8527, 8341].
Impacted Organization
1. BlackBerry users worldwide were impacted by the software failure incident [8322, 8266, 8535, 8527, 8341].
Software Causes
1. Systemic failures in the "Relay" software that directs traffic within the NOCs [8266].
2. A core switch failure in RIM's infrastructure caused message delays and outages of e-mail, messaging, and web services [8322, 8266, 8535, 8341].
3. A hardware error in the BlackBerry service's infrastructure led to a network disruption [8527].
Non-software Causes
1. Hardware error in the BlackBerry service's infrastructure [8527]
2. Core switch failure within RIM's infrastructure [8535]
3. Backup system not working as intended [8527]
4. Failure of the failover system to function as previously tested [8535]
Impacts
1. The software failure incident caused a three-day outage for BlackBerry users worldwide, affecting almost every one of its 70 million users and disrupting business and personal communications [8266, 8535].
2. The outage led to frustrations among BlackBerry users, with many expressing anger and disappointment on social media platforms like Twitter [8535].
3. The outage resulted in a massive backlog of data, causing delays in message delivery and internet access for users [8535].
4. The outage impacted BlackBerry's reputation and customer trust, potentially leading to customers considering switching to other platforms like iPhone or Android [8527].
5. The outage highlighted RIM's struggles in the smartphone market, with investors calling for management changes and the stock price declining significantly [8527].
6. The outage occurred at a critical time for RIM, coinciding with increased competition from other smartphone manufacturers like Apple and Google [8527].
7. The outage prompted RIM's co-CEO to issue a public apology to customers, acknowledging the failure to provide reliable real-time communications and expressing disappointment in the company's performance [8341].
Preventions
1. Rewriting the core networking software to handle the rapid growth of the smartphone market after 2005, instead of only increasing the number of servers running the existing software, could have prevented the outage [8266].
2. Completing the planned work to make the Egham center a full backup center could have allowed data transfer to shift seamlessly away from the Slough NOC when it ran into problems [8266].
3. Conducting thorough and regular testing of failover systems and processes, to confirm that they function as expected during failures, could have minimized the impact of the outage [8535].
4. Addressing the fundamental issues with RIM's own software, such as the "Relay" that directs traffic within the NOCs, could have prevented the outage [8266].
5. Having multiple backup systems and locations, so that no single point of failure exists, could have mitigated the impact of the outage [8535].
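Prevention 3 above calls for regular failover testing. The following is a minimal sketch of what such a drill could look like, not a description of RIM's systems: the `Switch`, `FailoverPair`, and `run_failover_drill` names are hypothetical and exist only for illustration. The drill takes the primary down on purpose and checks that a probe message still gets through the backup.

```python
from dataclasses import dataclass


@dataclass
class Switch:
    """Hypothetical stand-in for a core switch; not RIM's actual equipment."""
    name: str
    healthy: bool = True

    def forward(self, message: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} delivered {message}"


class FailoverPair:
    """Routes through the primary and falls back to the backup on failure."""

    def __init__(self, primary: Switch, backup: Switch) -> None:
        self.primary = primary
        self.backup = backup

    def send(self, message: str) -> str:
        try:
            return self.primary.forward(message)
        except ConnectionError:
            # Failover path: this is the branch a drill must exercise.
            return self.backup.forward(message)


def run_failover_drill(pair: FailoverPair) -> bool:
    """Take the primary down on purpose and report whether traffic still flows."""
    was_healthy = pair.primary.healthy
    pair.primary.healthy = False
    try:
        pair.send("drill-probe")
        return True
    except ConnectionError:
        return False
    finally:
        # Restore the primary's original state once the drill is over.
        pair.primary.healthy = was_healthy
```

A drill like this only has value if it runs on a schedule and against the real failover path; a backup that passed once in a test can still fail later, which is the situation the articles describe.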
Fixes
1. Implementing more efficient core networking software to handle the rapid growth in the smartphone market, instead of just adding more servers [8266].
2. Completing the backup system at the Egham center to ensure failover in case of issues at the Slough NOC [8266].
3. Addressing the problems in the "Relay", the fundamental piece of RIM's software that directs traffic within the NOCs [8266].
4. Enhancing communication with customers during outages to manage expectations and prevent customer loss [8322, 8535].
5. Ensuring multiple backups and disaster recovery plans to avoid single points of failure [8535].
6. Continuously testing failover systems and processes to minimize service impact on customers [8535].
7. Resolving hardware errors and ensuring backup systems work as intended to prevent cascading problems [8527].
8. Making sure that e-mails are not lost and all messages are eventually delivered to customers [8527].
9. Providing a clear timeline for service restoration and keeping customers informed about progress [8535].
10. Working with vendors to fix specific errors and improve network reliability [8527].
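Fix 9 asks for a clear restoration timeline, and the backlog listed under Impacts is what drives that timeline: queued messages clear only as fast as delivery capacity exceeds new incoming traffic. A rough sketch of that arithmetic follows. The function name and every number in it are hypothetical illustrations, not figures from the articles.

```python
import math


def backlog_drain_hours(backlog: float, arrival_per_hour: float,
                        capacity_per_hour: float) -> float:
    """Estimate hours until a message backlog is cleared.

    All three inputs are hypothetical planning figures, not RIM data.
    """
    if min(backlog, arrival_per_hour, capacity_per_hour) < 0:
        raise ValueError("inputs must be non-negative")
    spare = capacity_per_hour - arrival_per_hour
    if spare <= 0:
        # New traffic arrives at least as fast as it is delivered,
        # so an existing backlog never shrinks.
        return 0.0 if backlog == 0 else math.inf
    return backlog / spare


# Illustrative numbers only: 300M queued, 40M/h arriving, 50M/h deliverable.
print(backlog_drain_hours(300e6, 40e6, 50e6))  # 30.0
```

The estimate shows why service can stay degraded well after the underlying fault is repaired: with little spare capacity, even a modest backlog takes many hours to drain.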
References
1. Research in Motion (RIM) official statements and press releases [8322, 8266, 8535, 8527, 8341]
2. Interviews with RIM executives and employees [8322, 8266, 8535, 8527, 8341]
3. Industry insiders and former RIM staff [8266]
4. Twitter accounts of RIM and BlackBerryHelp [8322]
5. Analysts and industry experts [8535, 8527]
6. Customers and users affected by the outage [8535, 8341]

Software Taxonomy of Faults

Category Option Rationale
Category: Recurring
Option: one_organization, multiple_organization
Rationale:
(a) The software failure incident having happened again at one_organization:
- Research in Motion (RIM) faced a major service outage in 2005, which was blamed on a component of RIM's systems experiencing a service interruption [8266].
- RIM experienced another major service outage in 2007, attributed to a "process error" [8527].
- RIM faced further outages in February 2008, December 2009, and March 2010, with disruptions in both the U.S. and internationally [8527].
(b) The software failure incident having happened again at multiple_organization:
- Samsung and Nokia were mentioned as unveiling rival services to BlackBerry Messenger (BBM) [8322].
- Apple released iMessage as a rival messaging service to BBM [8322].
- BlackBerry's outage prompted users to consider switching to other platforms like iPhone and Android [8341].
Category: Phase (Design/Operation)
Option: design, operation
Rationale:
(a) The software failure incident occurring due to the development phases:
- The BlackBerry outage was caused by a "core switch failure in RIM's infrastructure" [8322].
- RIM experienced its worst-ever outage due to a "core switch failure" at their network operations center [8266].
- The outage was caused by an "extremely critical issue" on the BlackBerry network, leading to service delays [8535].
- RIM's BlackBerry service infrastructure "suffered a hardware error" that cascaded into a larger problem [8527].
- The outages were caused by a "core switch failure within RIM's infrastructure" [8341].
(b) The software failure incident occurring due to the operation phases:
- The outage affected BlackBerry users globally, disrupting business and personal communications [8535].
- RIM faced outcry from users displeased with the outage, with many considering switching to other platforms [8341].
- The outage primarily affected text messaging and Internet access, causing frustrations for users who rely on BlackBerry smartphones [8535].
- RIM's co-CEO acknowledged that customers "expect better" from the company and that the service issues have not been fully resolved yet [8341].
- The outage led to angry customers expressing their frustrations on social media platforms like Twitter [8527].
Category: Boundary (Internal/External)
Option: within_system
Rationale:
(a) within_system:
- The software failure incident was primarily caused by internal factors within the system. Research in Motion (RIM) experienced a core switch failure in its infrastructure, leading to messaging and browsing delays for BlackBerry users [8535].
- RIM's founder, Mike Lazaridis, mentioned that the BlackBerry service's infrastructure suffered a hardware error, which cascaded into a larger problem when the backup system did not work as intended [8527].
- The outage was attributed to a "core switch failure within RIM's infrastructure," and the failover to a backup switch did not function as expected, causing a large backlog of data [8535].
- RIM's internal software component called the "Relay" was identified as a fundamental piece that failed, leading to traffic direction issues within the network operations centers [8266].
(b) outside_system:
- The software failure incident was not primarily caused by external factors outside the system. The outage was mainly due to internal failures within RIM's infrastructure and network operations centers [unknown].
Category: Nature (Human/Non-human)
Option: non-human_actions, human_actions
Rationale:
(a) The software failure incident occurring due to non-human actions:
- The software failure incident in the BlackBerry network was caused by a "core switch failure in RIM's infrastructure" [8322].
- RIM experienced its worst-ever outage lasting three days due to a "core switch failure" at their network operations center [8266].
- An "extremely critical issue" on the BlackBerry network caused the outage, leading to massive frustrations for users [8535].
- RIM's BlackBerry service faced a hardware error that cascaded into a network disruption, with a backup system not working as intended [8527].
- The outage was caused by a "core switch failure within RIM's infrastructure" and the failover to a backup switch did not function properly [8341].
(b) The software failure incident occurring due to human actions:
- Former RIM staff and industry insiders mentioned that RIM's approach to its system and network setup contributed to the outage, indicating that the company had been storing up problems for years [8266].
- RIM's founder acknowledged that the company did not deliver on its goal of providing reliable real-time communications, indicating a failure on their part [8341].
Category: Dimension (Hardware/Software)
Option: hardware, software
Rationale:
(a) The software failure incident occurring due to hardware:
- The outage that affected BlackBerry users was caused by a "core switch failure in RIM's infrastructure" [8322].
- RIM's founder, Mike Lazaridis, mentioned that the BlackBerry service's infrastructure "suffered a hardware error" which led to the problem cascading [8527].
- The outage was attributed to a "core switch failure within RIM's infrastructure", after which the failover to a backup switch did not function properly [8341].
(b) The software failure incident occurring due to software:
- The outage was described as a "core switch failure" by RIM [8266].
- RIM mentioned that the messaging and browsing delays were caused by a core switch failure within their infrastructure, indicating a software-related issue [8535].
- The real point of failure was identified as a fundamental piece of RIM's own software called the "Relay" which directs traffic within each of the four NOCs [8266].
Category: Objective (Malicious/Non-malicious)
Option: non-malicious
Rationale:
(a) malicious:
- There is no indication in the provided articles that the software failure incident was malicious or caused by contributing factors introduced by humans with the intent to harm the system. The articles primarily focus on technical failures, infrastructure issues, and system breakdowns rather than intentional malicious actions [8322, 8266, 8535, 8527, 8341].
(b) non-malicious:
- The software failure incidents reported in the articles were non-malicious in nature. They were attributed to technical issues such as core switch failures, hardware errors, backup system failures, and system-wide breakdowns rather than intentional malicious acts [8322, 8266, 8535, 8527, 8341].
Category: Intent (Poor/Accidental Decisions)
Option: poor_decisions
Rationale:
(a) poor_decisions: Failure due to contributing factors introduced by poor decisions.
The BlackBerry outage was primarily attributed to poor decisions made by Research in Motion (RIM). According to Article 8266, industry insiders and former RIM staff said that RIM grew in popularity too quickly and got complacent over the iPhone. Instead of rewriting its core networking software to handle the rapid growth in the smartphone market efficiently, RIM opted to increase the number of servers running the software without addressing the underlying issues. This decision let problems accumulate over the years, ultimately resulting in the outage. The former staffer highlighted that the outage in 2011 was similar to one in 2005, indicating a lack of improvement despite past incidents [8266].
The outage also revealed that RIM's approach to its system, including its private network and the Relay software, was not adequately designed to handle the increasing demands and complexity of the BlackBerry service. The failure of the Relay software, which directs traffic within RIM's network operations centers, was a critical factor in the outage. The former RIM staffer mentioned that the Relay had reached a melting point, indicating a failure in the core software infrastructure [8266].
Furthermore, the outage highlighted RIM's lack of preparedness regarding backup systems. The failover system to the backup location in Egham, Surrey, had not been completed as planned, leaving the network vulnerable to failures in the primary location. This lack of redundancy reflected poor decision-making in ensuring network reliability and continuity [8266].
Additionally, Article 8341 mentioned that RIM faced outcry from users displeased with the outage, with many expressing intentions to switch to other platforms. This user dissatisfaction and potential loss of customers can also be attributed to poor decisions made by RIM in managing the situation and maintaining customer satisfaction [8341].
In summary, the software failure incident was primarily driven by poor decisions made by RIM in managing its network infrastructure, handling rapid growth, and implementing effective backup systems.
Category: Capability (Incompetence/Accidental)
Option: development_incompetence, accidental
Rationale:
(a) The software failure incident occurring due to development incompetence:
- The outage was attributed to a "core switch failure in RIM's infrastructure" [8322].
- RIM faced criticism for its approach to system development, with former staff mentioning that the company grew too quickly and did not rewrite core networking software to handle the smartphone market boom efficiently [8266].
- RIM's own software called the "Relay" was identified as a fundamental piece that directed traffic within the network operations centers [8266].
(b) The software failure incident occurring accidentally:
- RIM experienced a hardware error in its infrastructure that led to the outage, which then cascaded due to a backup system failure [8527].
- The outage was described as an "extremely critical issue" on the BlackBerry network, with RIM working to understand why the backup system did not function as intended [8535].
- RIM's founder mentioned that the company is working with vendors to fix the specific error that occurred [8527].
Category: Duration
Option: temporary
Rationale:
The software failure incident related to the BlackBerry service outage was temporary. The outage lasted for several days, starting on Monday and affecting millions of users globally [8266]. The outage began in Europe, the Middle East, and Africa, then spread to South America, Asia, and eventually to the United States and Canada [8535]. The outage primarily impacted text messaging and Internet access, leaving some voice calling services operational [8527]. The outage was caused by a core switch failure within RIM's infrastructure, and the failover to a backup switch did not function as expected [8535]. RIM worked around the clock to fix the problem and clear the backlog of data generated by the outage [8535]. RIM's co-CEO, Mike Lazaridis, issued a video apology to customers, acknowledging the service outages and stating that it was too soon to say the issue was fully resolved [8341].
Category: Behaviour
Option: crash, value, other
Rationale:
(a) crash: Failure due to system losing state and not performing any of its intended functions
- The BlackBerry service outage caused a significant disruption in the network, impacting millions of users globally and leading to frustrations for those relying on the smartphones for communication [8535].
- RIM's BlackBerry service experienced its worst-ever outage lasting three days, affecting almost every one of its users due to a hardware error in the infrastructure and a backup system failure [8527].
(b) omission: Failure due to system omitting to perform its intended functions at an instance(s)
- The outage primarily affected text messaging and Internet access from the mobile phones, leaving some voice calling services operational [8535].
- RIM faced outcry from users displeased with the outage, including many who took to Twitter to express their frustrations and potential plans to switch to another platform [8341].
(c) timing: Failure due to system performing its intended functions correctly, but too late or too early
- The outage caused delays in messaging and browsing for BlackBerry users in various regions, leading to a large backlog of data that needed to be cleared [8535].
- RIM's co-CEO Mike Lazaridis acknowledged the failure to provide reliable real-time communications, indicating a timing issue in delivering on the goal of real-time communication [8341].
(d) value: Failure due to system performing its intended functions incorrectly
- The outage was caused by a core switch failure within RIM's infrastructure, leading to messaging and browsing delays for users [8535].
- RIM's BlackBerry service infrastructure suffered a hardware error, causing a cascade of problems and a failure of the backup system to work as intended [8527].
(e) byzantine: Failure due to system behaving erroneously with inconsistent responses and interactions
- The outage spread to various continents, impacting users globally and causing disruptions in communication services [8535].
- RIM's outage prompted outcry from users, with many expressing frustration and potential plans to switch to other platforms, indicating inconsistent responses from the system [8341].
(f) other: Failure due to system behaving in a way not described in the (a to e) options
- The outage led to a massive backlog of data waiting to be delivered, requiring a restoration process that posed potential risks if data was lost, impacting the reliability and trustworthiness of the service [8266].
- RIM's approach to system setup and network management was criticized for storing up problems over the years, leading to the outage being expected due to the company's handling of its system growth and demands [8266].

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Category: Consequence
Option: property, delay, theoretical_consequence
Rationale:
The consequence of the software failure incident described in the articles is as follows:
(d) property: People's material goods, money, or data was impacted due to the software failure.
- The software failure incident caused disruptions in BlackBerry services, affecting millions of users globally and impacting their ability to communicate effectively [8322, 8266, 8535].
- Users were unable to access email, BBM (BlackBerry Messenger), and Internet services, leading to frustrations and disruptions in both personal and business communications [8322, 8535].
- RIM faced criticism and backlash from users who were displeased with the outage, with many expressing their frustration on social media platforms like Twitter [8341].
- The outage also had financial implications for RIM, as the company's stock value dropped significantly following the incident [8527].
(e) delay: People had to postpone an activity due to the software failure.
- The software failure incident caused delays in message deliveries, with a large backlog of data being generated and users having to wait for services to be restored [8266, 8535].
- RIM acknowledged that there was a delay in clearing the backlog of data and restoring normal service levels for users [8341].
(h) theoretical_consequence: There were potential consequences discussed of the software failure that did not occur.
- There were discussions about the potential long-term impact of the outage on RIM's reputation and customer loyalty, with concerns raised about customers potentially switching to other smartphone platforms due to the repeated service disruptions [8266, 8527].
- Industry analysts and insiders highlighted the challenges RIM faced in addressing the outage and the need to quickly resolve the issues to prevent further damage to the company's standing in the smartphone market [8527].
Category: Domain
Option: information, finance, government
Rationale:
(a) The software failure incident related to the production and distribution of information.
- The BlackBerry service outage affected millions of users globally, disrupting text messaging, Internet access, and email services provided by Research in Motion's BlackBerry network [8535].
- The outage caused frustrations for users who rely on these smartphones for business and personal communications, highlighting the impact on information exchange and communication channels [8535].
(h) The failed system was intended to support the finance industry.
- BlackBerry's service outage impacted users who heavily rely on BlackBerry devices for secure messaging, particularly in sensitive sectors like banking and government, due to the encryption and security features provided by BlackBerry services [8266].
- The outage raised concerns about the reliability and trustworthiness of BlackBerry services, especially in industries where secure communication is crucial [8266].
(l) The software failure incident was related to the government sector.
- BlackBerry's outage affected users in various regions, including Europe, the Middle East, Africa, India, Brazil, Chile, Argentina, the United States, and Canada, indicating a widespread impact on government agencies and officials who use BlackBerry devices for secure communication [8322].
- The outage disrupted essential communications for users in government roles, highlighting the significance of BlackBerry services in governmental operations [8322].

Sources
