Incident: Hotmail Service Bug Caused Inbox Disappearance for Users.

Published Date: 2011-01-06

Postmortem Analysis
Timeline 1. The software failure incident happened in January 2011.
System 1. Automated script used for testing the service [3922]
Responsible Organization 1. Microsoft: the incident was caused by an error in an automated script that Microsoft used to test the service for errors in everyday usage [3922].
Impacted Organization 1. Windows Live Hotmail users [3922]
Software Causes 1. The software failure incident was caused by an error with an automated script used by Microsoft to test the service for errors in everyday usage [3922]. 2. The script was meant to clean up after itself once it had created test accounts, but this cleanup failed and spread from the test group to real user accounts [3922].
Non-software Causes 1. Lack of proper segregation between testing accounts and real user accounts [3922] 2. Human error in the execution of the automated script [3922]
Impacts 1. Some Windows Live Hotmail users were left without access to new messages and entire folders for days [3922]. 2. Impacted users saw empty mailboxes and received a 'Welcome to Hotmail' message when logging in [3922]. 3. Messages sent to affected users during the incident would bounce back to the senders as if the account was shut down [3922]. 4. 16,035 out of the 17,355 affected users had their accounts fixed a day after the company began addressing the issue, while the remaining 1,320 took an additional three days to get sorted out [3922].
Preventions 1. Implement stricter access controls and permissions for automated testing scripts to prevent unauthorized access to real user accounts [3922]. 2. Conduct thorough testing and validation of automated scripts before deployment to ensure they function as intended and do not have unintended consequences [3922]. 3. Enhance monitoring and alert systems to quickly detect and respond to anomalies or unauthorized activities in the system, such as the script jumping to real user accounts [3922]. 4. Implement regular audits and reviews of automated testing processes to identify and address potential vulnerabilities or errors that could lead to incidents like the one experienced with the Hotmail bug [3922].
Fixes 1. Splitting up service testing accounts from normal user accounts to prevent testing scripts from affecting real user data [3922]. 2. Adding a service status to support forums and bug reporting tools to improve communication and transparency during incidents [3922].
References 1. Windows Team Blog [3922] 2. Mike Schackwitz of the Hotmail team [3922]
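
The separation of test accounts listed under Fixes, together with the access controls listed under Preventions, can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the `Account` type, the `is_test` flag, the `TEST_DOMAIN` suffix, and `cleanup_test_accounts` are assumptions made for illustration, not Microsoft's actual implementation, which the article [3922] does not describe.

```python
from dataclasses import dataclass

# Hypothetical reserved domain for accounts created by the test script.
TEST_DOMAIN = "@test.invalid"


@dataclass(frozen=True)
class Account:
    address: str
    is_test: bool  # assumed flag, set only when the test harness creates the account


def is_safe_to_delete(account: Account) -> bool:
    """Require two independent markers before treating an account as a test account."""
    return account.is_test and account.address.endswith(TEST_DOMAIN)


def cleanup_test_accounts(candidates, directory):
    """Remove only verified test accounts from `directory` (a dict keyed by address).

    Returns (removed, refused) so that refused deletions can raise an alert.
    """
    removed, refused = [], []
    for account in candidates:
        if is_safe_to_delete(account):
            directory.pop(account.address, None)
            removed.append(account.address)
        else:
            # Never touch an account that is not provably a test account.
            refused.append(account.address)
    return removed, refused


directory = {
    "probe1@test.invalid": "mailbox-001",
    "alice@example.com": "mailbox-002",
}
candidates = [
    Account("probe1@test.invalid", is_test=True),
    Account("alice@example.com", is_test=False),  # real user wrongly in the cleanup list
]
removed, refused = cleanup_test_accounts(candidates, directory)
```

In this sketch a real account that ends up in the cleanup list by mistake is refused rather than removed, and the `refused` list gives monitoring something concrete to alert on.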

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization (a) The software failure incident having happened again at one_organization: The incident mentioned in Article 3922 occurred with Windows Live Hotmail, which is a service provided by Microsoft. This incident involved a bug in an automated script used for testing the service, which led to some users losing access to their new messages and folders. To prevent such incidents from happening again, Microsoft mentioned that they are making changes to their testing procedures by splitting up service testing accounts from normal user accounts [3922]. (b) The software failure incident having happened again at multiple_organization: There is no information in the provided article about similar incidents happening at other organizations or with their products and services.
Phase (Design/Operation) design (a) The software failure incident in Article 3922 was related to the design phase. The issue stemmed from an error with an automated script used by Microsoft to test the service for errors in everyday usage. The script's function was to clean its tracks after creating test accounts, but in this instance, it jumped to real user accounts, causing the problem. This indicates a failure due to contributing factors introduced by system development or procedures to operate the system [3922]. (b) The software failure incident in Article 3922 was not directly related to the operation phase or misuse of the system. The issue was caused by an error in the automated script used for testing, which led to certain users losing access to new messages and folders. There was no mention of user misuse or operational errors contributing to the failure [3922].
Boundary (Internal/External) within_system (a) The software failure incident described in Article 3922 was within_system. The issue stemmed from an error with an automated script used by Microsoft to test the Hotmail service. The script, which was supposed to create test accounts and clean up after itself, mistakenly affected real user accounts, causing the users to lose access to new messages and entire folders. The bug impacted 17,355 users, with 16,035 having their accounts fixed within a day and the remaining 1,320 taking three more days to resolve the issue. To prevent such incidents in the future, Microsoft is implementing measures like separating service testing accounts from normal user accounts and enhancing service status visibility on support forums and bug reporting tools [3922].
Nature (Human/Non-human) non-human_actions, human_actions (a) The software failure incident in this case was primarily due to non-human actions. The incident was caused by an error with an automated script used by Microsoft to test the service for errors. The script's function was to clean its tracks after creating test accounts, but it mistakenly affected real user accounts, leading to the issue where some users lost access to new messages and folders [3922]. (b) Human actions were also involved in addressing the software failure incident. The Hotmail team had to work on fixing the accounts of the impacted users. Additionally, measures were taken by Microsoft to prevent such incidents in the future, such as splitting up service testing accounts from normal user accounts and adding service status to support forums and bug reporting tools [3922].
Dimension (Hardware/Software) software (a) The software failure incident reported in Article 3922 was not attributed to hardware issues. The incident was explained to have stemmed from an error with an automated script used by Microsoft to test the service for errors in everyday usage. The problem occurred when the script, which was supposed to create test accounts and clean up after itself, mistakenly affected real user accounts, leading to the issue where some users were unable to access new messages and folders in Windows Live Hotmail [3922]. (b) The software failure incident in Article 3922 was clearly attributed to software factors. Specifically, the issue was caused by an error in the automated script used for testing the Hotmail service. This software bug led to the disruption in access for a group of users, impacting their ability to receive new messages and access folders. The bug affected a significant number of users and required several days to rectify, highlighting the software-related nature of the failure incident [3922].
Objective (Malicious/Non-malicious) non-malicious (a) The software failure incident described in Article 3922 was non-malicious. The incident was caused by a service bug that stemmed from an error with an automated script used by Microsoft to test the service for errors in everyday usage. The bug resulted in some Windows Live Hotmail users being unable to access new messages and entire folders for days. The issue was explained by Mike Schackwitz of the Hotmail team, who mentioned that the problem occurred when the testing script jumped from test accounts to real user accounts, impacting 17,355 users. The data of the impacted users was not deleted, but their inbox location in the directory servers was removed, leading to empty mailboxes and bounce-back messages for those who were affected [3922].
Intent (Poor/Accidental Decisions) poor_decisions, accidental_decisions (a) The software failure incident described in Article 3922 was primarily due to poor_decisions. The incident was caused by an error with an automated script used by Microsoft to test the Hotmail service. The script was supposed to create test accounts and clean up after itself, but it mistakenly affected real user accounts instead. This poor decision in the implementation of the script led to 17,355 users being impacted, with some experiencing issues accessing their emails and folders. To prevent such incidents in the future, Microsoft decided to separate the testing accounts from normal user accounts and implement additional measures to enhance service testing and monitoring [3922].
Capability (Incompetence/Accidental) accidental (a) The software failure incident in Article 3922 was not explicitly attributed to development incompetence. The issue stemmed from an error with an automated script used for testing the service, which inadvertently affected real user accounts instead of test accounts. This indicates a failure in the testing process rather than development incompetence. (b) The software failure incident in Article 3922 was categorized as accidental. The problem was caused by an error in the automated script used for testing, which led to the unintended impact on real user accounts. This was not a deliberate action but a mistake that occurred during the testing process, resulting in the disruption experienced by the affected users.
Duration temporary (a) The software failure incident described in the article was temporary. It lasted for a specific duration during which a group of Windows Live Hotmail users were affected by a service bug that left them without access to new messages and entire folders. The issue was caused by an error with an automated script used by Microsoft to test the service, which inadvertently affected real user accounts. The impact was mitigated over time as the company worked on fixing the problem, with affected accounts being gradually restored. For example, 16,035 users had their accounts fixed a day after the company started addressing the issue, while the remaining 1,320 users took an additional three days to have their accounts sorted out [3922].
Behaviour omission, value, other (a) crash: The software failure incident described in the article was not a crash. The issue stemmed from an error with an automated script that led to certain users losing access to new messages and folders, rather than the system losing state and not performing any of its intended functions [3922]. (b) omission: The software failure incident can be categorized as an omission. The error with the automated script caused the system to omit performing its intended functions correctly for a group of Windows Live Hotmail users, resulting in them not having access to new messages and entire folders for days [3922]. (c) timing: The software failure incident was not related to timing issues. The issue was not about the system performing its intended functions too late or too early, but rather about certain functions not being performed correctly due to the script error [3922]. (d) value: The software failure incident can be classified as a value failure. The system performed its intended functions incorrectly for the affected users, leading to their inbox location in the directory servers being removed and causing their accounts not to match up with Hotmail's database [3922]. (e) byzantine: The software failure incident was not a byzantine failure. There were no mentions of inconsistent responses or interactions by the system in the article [3922]. (f) other: The software failure incident can be described as a bug in the automated script that caused certain users to lose access to their new messages and folders. This behavior falls under the "other" category as it was a specific issue with the script's function that led to the omission of performing the system's intended functions correctly for the affected users [3922].
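
The behaviour described above, in which the mailbox data survived but the directory entry pointing to it was removed, can be sketched in Python. This is an illustrative model only: `directory`, `storage`, `open_inbox`, and `deliver` are hypothetical names, and the article [3922] does not describe Hotmail's architecture beyond the inbox location being removed from the directory servers.

```python
# Hypothetical two-tier model: a directory maps an address to a mailbox
# location, and storage holds the messages kept at that location.
directory = {"alice@example.com": "store-7/mbox-42"}
storage = {"store-7/mbox-42": ["old message 1", "old message 2"]}


def open_inbox(address):
    """Return what the user sees at login; a missing entry looks like a new account."""
    location = directory.get(address)
    if location is None:
        return ["Welcome to Hotmail"]
    return storage[location]


def deliver(address, message):
    """Deliver a message, or report a bounce when the address has no directory entry."""
    location = directory.get(address)
    if location is None:
        return "bounced"
    storage[location].append(message)
    return "delivered"


# Simulate the faulty cleanup: the directory entry is removed, the data is not.
lost_location = directory.pop("alice@example.com")

inbox_during_incident = open_inbox("alice@example.com")
delivery_during_incident = deliver("alice@example.com", "new message")

# Recovery restores the pointer; the original messages are still in storage.
directory["alice@example.com"] = lost_location
inbox_after_fix = open_inbox("alice@example.com")
```

Under these assumptions, a missing directory entry accounts for the reported symptoms: the login looks like a new account, incoming mail bounces, and the old messages reappear once the entry is restored.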

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence property, delay The consequence of the software failure incident described in the article was primarily related to the impact on the affected users' access to their email accounts. Specifically, the incident resulted in users being unable to access new messages and entire folders in their Windows Live Hotmail accounts for several days. The issue was caused by a bug in an automated testing script that mistakenly affected real user accounts instead of test accounts. As a result, impacted users experienced empty mailboxes and received a 'Welcome to Hotmail' message when logging in. Additionally, any messages sent to the affected accounts during the incident bounced back to the senders as if the accounts were shut down. The incident affected a total of 17,355 users: 16,035 had their accounts fixed a day after the company began addressing the issue, while the remaining 1,320 required an additional three days [3922].
Domain information (a) The failed system in this incident was related to the information industry, specifically the email service provided by Windows Live Hotmail [3922].

Sources
