Recurring |
one_organization |
(a) The software failure incident at Research In Motion (RIM), the company behind BlackBerry, was not the first such incident at the company. The article mentions that the outage was the first in 18 months and the largest in the company's history, indicating that previous incidents had taken place [8277].
(b) The article does not provide specific information about similar incidents at other organizations or with their products and services. It is therefore unknown, based on the provided article, whether similar incidents have occurred elsewhere. |
Phase (Design/Operation) |
design, operation |
(a) The software failure incident in the BlackBerry services was primarily attributed to a design failure. The incident was caused by a systems crash in RIM's Slough network center, which had a ripple effect on the worldwide system after a dual redundant high-capacity core switch failed. The backup switch also did not function as intended, leading to a backlog of data in the system [8277].
(b) The operation of the system, specifically the attempt at a hardware and software upgrade by engineers at RIM's network operations center in Slough, Berkshire, also played a role in the failure. The update failed, and when the company tried to revert to a backup system, that also failed, resulting in millions of users not receiving emails or messages and experiencing limited web surfing capability [8277]. |
Boundary (Internal/External) |
within_system |
(a) The software failure incident with BlackBerry services was primarily within the system. The failure was caused by a systems crash in RIM's Slough network center, which led to a "ripple effect" affecting the worldwide system. The incident was further exacerbated by the failure of a dual redundant high-capacity core switch within RIM's infrastructure, causing outages and delays for customers in various regions [Article 8277]. The backup switch also failed to function as intended, resulting in a backlog of data in the system. RIM is conducting a root cause analysis to uncover why the system took longer to restore than expected, indicating an internal system issue.
(b) The software failure incident did not have significant contributing factors originating from outside the system. While the incident affected millions of users globally, there is no mention in the article of external factors such as external attacks or third-party interference playing a significant role in the failure. The primary focus of the analysis and response from RIM was on internal system failures and infrastructure issues [Article 8277]. |
Nature (Human/Non-human) |
non-human_actions, human_actions |
(a) The software failure incident occurred due to non-human actions, specifically a systems crash in RIM's Slough network center that caused a "ripple effect" to the worldwide system. The failure occurred when a dual redundant high-capacity core switch, designed to protect the infrastructure, failed, leading to outages and delays for customers in various regions. Additionally, the backup switch did not function as intended, causing a backlog of data in the system [8277].
(b) The software failure incident also involved human actions, as engineers at RIM's network operations center in Slough, Berkshire, attempted a hardware and software upgrade for the system serving customers in Europe, the Middle East, Africa, and India. However, the update failed, and when the company tried to revert to a backup system, that also failed, resulting in millions of users not receiving emails or messages and experiencing limited web surfing capability. Co-chief executive Jim Balsillie mentioned that the carriers understood the complexity of the systems, and when such incidents happen, everyone pulls together to serve the customers [8277]. |
Dimension (Hardware/Software) |
hardware, software |
(a) The software failure incident in the BlackBerry services was primarily attributed to hardware issues. The incident was caused by a systems crash in RIM's Slough network center, where a dual redundant high-capacity core switch designed to protect the infrastructure failed, leading to outages and delays for customers in various regions. Additionally, the backup switch did not function as intended, resulting in a backlog of data in the system [8277].
(b) The software failure incident also had contributing factors originating in software. The problems began when engineers attempted a hardware and software upgrade for the system serving customers in certain regions, which ultimately failed. When the company tried to revert to a backup system, that also failed, causing millions of users to experience disruptions in receiving emails, messages, and limited web surfing capability. The incident highlighted the importance of software reliability and the need for thorough testing and backup systems in place [8277]. |
Objective (Malicious/Non-malicious) |
non-malicious |
(a) The software failure incident described in the article does not indicate any malicious intent. The failure was attributed to technical issues such as a systems crash in RIM's network center, a failed redundant switch, and a backup system that did not function as intended. The incident was characterized by a series of technical failures during a hardware and software upgrade attempt, leading to disruptions in services for millions of users globally [8277].
(b) The software failure incident was non-malicious in nature, stemming from technical issues and system failures rather than any deliberate attempt to harm the system. The failure was primarily attributed to hardware and software upgrade problems, backup system failures, and a cascade effect in the system due to a dual redundant switch failure [8277]. |
Intent (Poor/Accidental Decisions) |
poor_decisions |
(a) The software failure incident involving BlackBerry services was primarily due to poor decisions. The incident was triggered when engineers attempted a hardware and software upgrade that ultimately failed, leading to a cascade failure in the system. The backup systems also did not function as intended, causing a backlog of data in the system. Additionally, the failure was exacerbated by the fact that the backup switch designed to protect the infrastructure failed, leading to outages and delays for customers across various regions [8277]. |
Capability (Incompetence/Accidental) |
development_incompetence, accidental |
(a) The software failure incident in the BlackBerry services was primarily attributed to development incompetence. The incident was caused by a systems crash in RIM's Slough network center, which led to a "ripple effect" affecting the worldwide system. The failure was specifically linked to a dual redundant high-capacity core switch, designed to protect the infrastructure, that failed, causing outages and delays for customers in various regions. Additionally, the backup switch did not function as intended, leading to a backlog of data in the system [8277].
(b) The accidental aspect of the failure can be seen in the attempted hardware and software upgrade by engineers at RIM's network operations center in Slough, Berkshire. The update failed, and when the company tried to revert to a backup system, that also failed, resulting in millions of users not receiving emails or messages and experiencing limited web surfing capability. This accidental chain of events contributed to the widespread disruption of services [8277]. |
Duration |
temporary |
(a) The software failure incident described in the article was temporary. BlackBerry services were disrupted for nearly four days, and delivery of emails and instant messages was delayed by the backlog that built up during the outage [8277]. The outage began when engineers attempted a hardware and software upgrade, which failed, leading to a cascade failure in the system. The company tried to revert to a backup system, but that also failed, leaving millions of users without emails or messages and with limited web surfing capability. The outage, elsewhere put at three days, affected Europe, the Middle East, Africa, India, North America, and South America (including Brazil, Chile, and Argentina) [8277]. The company was working to clear the backlog of emails and messages and pledged that none of the messages sent during the disruption would be deleted [8277]. |
Behaviour |
crash, omission, value, other |
(a) crash: The software failure incident in the article can be categorized as a crash. The incident involved a systems crash in RIM's Slough network center, which caused a "ripple effect" to the worldwide system, leading to outages and delays for customers in various regions [Article 8277].
(b) omission: The software failure incident can also be related to omission as the update attempted by engineers at RIM's network operations center failed, and when they tried to revert to a backup system, that also failed. This resulted in millions of users not receiving emails or messages and experiencing limited web surfing capability [Article 8277].
(c) timing: The timing of the software failure incident can be considered as a factor in the overall failure. The incident involved delays in delivering emails and instant messages due to the backlog that built up during the near four-day outage of the service. Additionally, the restoration of the system took longer than expected, leading to uncertainty about the estimated time of full recovery worldwide [Article 8277].
(d) value: The software failure incident can be linked to a failure in value as the system was not performing its intended functions correctly. Users were not receiving emails or messages, and there were limitations in web surfing capability during the outage period [Article 8277].
(e) byzantine: The software failure incident does not exhibit characteristics of a byzantine failure where the system behaves erroneously with inconsistent responses and interactions. The incident primarily involved a systems crash and subsequent failures in the backup systems, leading to outages and delays for users [Article 8277].
(f) other: The other behavior exhibited in the software failure incident is the failure of the redundant high-capacity core switch designed to protect the infrastructure. This failure caused outages and delays for customers in various regions and led to a cascade failure in the system, highlighting a critical point of failure in the system's design [Article 8277]. |