Incident: Leap-Year Bug Causes Windows Azure Outage.

Published Date: 2012-03-01

Postmortem Analysis
Timeline 1. The software failure incident occurred on February 29, 2012, the leap day that triggered the bug [10177].
System 1. Windows Azure system 2. Point-of-sale terminals in New Zealand supermarkets
Responsible Organization 1. Microsoft - The Windows Azure outage was caused by a bug in its software that was triggered by the Feb. 29 leap-year date [10177].
Impacted Organization 1. Windows Azure customers were impacted by the software failure incident [10177]. 2. Point-of-sale terminals in New Zealand supermarkets were also affected by leap-year bugs [10177].
Software Causes 1. The Windows Azure outage was caused by a software bug triggered by the Feb. 29 leap-year date that prevented systems from calculating the correct time [10177]. 2. The bug involved a "cert issue" which hampered functions that inspect digital certificates used for authentication, leading to critical systems being unable to communicate [10177]. 3. One possibility is that the certificates Azure relied on allotted years consisting of only 365 days, rather than the 366 days needed for leap years, causing the cloud platform to shut down [10177].
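The article does not show the code that failed, so the following is only an illustrative sketch of how date arithmetic of this kind can go wrong on Feb. 29. It assumes a validity period computed in one of two naive ways; the helper names `naive_same_day_next_year`, `naive_365_days`, and `describe` are hypothetical and are not taken from Azure or from the article.

```python
from datetime import date, timedelta


def naive_same_day_next_year(valid_from: date) -> date:
    """Hypothetical helper: 'one year later' by bumping only the year field."""
    # Raises ValueError for Feb 29, because the following year has no Feb 29.
    return valid_from.replace(year=valid_from.year + 1)


def naive_365_days(valid_from: date) -> date:
    """Hypothetical helper: treats every year as exactly 365 days long."""
    return valid_from + timedelta(days=365)


def describe(valid_from: date) -> str:
    """Report what each naive calculation yields for a given start date."""
    try:
        bumped = naive_same_day_next_year(valid_from).isoformat()
    except ValueError as exc:
        bumped = f"error ({exc})"
    return f"{valid_from}: year+1 -> {bumped}, +365d -> {naive_365_days(valid_from)}"


if __name__ == "__main__":
    for start in (date(2012, 1, 1), date(2012, 2, 28), date(2012, 2, 29)):
        print(describe(start))
```

In this sketch the year-bump approach works on every day except Feb. 29, when it produces a date that does not exist, while the 365-day approach silently comes up one day short whenever the period spans a leap day. Either behavior would be invisible in testing that never exercises leap-year dates.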
Non-software Causes 1. The only non-software trigger was the calendar itself: the arrival of the Feb. 29 leap-year date exposed software bugs in Windows Azure and in point-of-sale terminals in New Zealand supermarkets [10177].
Impacts 1. The software failure incident caused a Windows Azure outage that left some customers in the dark for more than 12 hours, impacting their ability to access services [10177]. 2. Point-of-sale terminals in New Zealand supermarkets were also affected by leap-year bugs, potentially disrupting their operations [10177]. 3. The glitch involving digital certificates affected the functioning of critical systems, potentially hindering communication between trusted nodes [10177].
Preventions 1. Testing and validating date-handling code against leap-year dates such as Feb. 29 before release [10177]. 2. Implementing robust error handling so that an unexpected date failure degrades gracefully instead of halting critical systems [10177]. 3. Regularly reviewing and maintaining digital certificates, including how their validity periods are calculated across leap years [10177].
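As a minimal sketch of the first prevention, the Python below pairs a leap-safe "N years later" calculation with boundary checks of the kind a regression test could pin down. The function names `add_years` and `check_leap_day_cases` are hypothetical, and clamping Feb. 29 to Feb. 28 is an assumed policy, not something the article prescribes.

```python
import calendar
from datetime import date


def add_years(start: date, years: int) -> date:
    """Hypothetical leap-safe 'N years later': clamp Feb 29 to Feb 28 when needed."""
    target_year = start.year + years
    # monthrange returns (weekday of first day, number of days in month).
    last_day = calendar.monthrange(target_year, start.month)[1]
    return start.replace(year=target_year, day=min(start.day, last_day))


def check_leap_day_cases() -> None:
    """Boundary cases a leap-year regression test might pin down."""
    assert add_years(date(2012, 2, 29), 1) == date(2013, 2, 28)  # clamped
    assert add_years(date(2012, 2, 29), 4) == date(2016, 2, 29)  # leap to leap
    assert add_years(date(2011, 2, 28), 1) == date(2012, 2, 28)  # into a leap year
    assert add_years(date(2012, 12, 31), 1) == date(2013, 12, 31)  # year end


if __name__ == "__main__":
    check_leap_day_cases()
    print("leap-day cases pass")
```

Whether Feb. 29 plus one year should mean Feb. 28 or March 1 is a policy decision; the point of the checks is that the choice is made explicitly and exercised before a real leap day arrives.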
Fixes 1. Microsoft put a fix in place for the leap-year bug that restored service to most customers around 3 a.m. PST on Wednesday, a little more than nine hours after it became aware of the issue [10177].
References 1. Microsoft Azure lead engineer Bill Laing 2. Software developer Marsh Ray 3. The Daily WTF blog 4. Flickr staff member yflickerboy

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization, multiple_organization (a) Within one organization: the article describes a leap-year failure on Microsoft's Windows Azure platform, where a software bug triggered by the leap-year date prevented systems from calculating the correct time [10177]. (b) Across multiple organizations: the article also mentions that point-of-sale terminals in New Zealand supermarkets were affected by leap-year bugs, indicating that similar incidents have occurred beyond Microsoft's Azure platform [10177].
Phase (Design/Operation) design, operation (a) The software failure incident related to the design phase can be attributed to a software bug triggered by the Feb. 29 leap-year date that prevented systems from calculating the correct time. This bug was a result of a leap-year issue with the certificates Azure relied on, causing critical systems to be unable to communicate properly [10177]. (b) The software failure incident related to the operation phase can be seen in the outage that left some customers in the dark for more than 12 hours. This was due to the software bug triggered by the leap-year date, impacting the operation and availability of the Azure cloud platform [10177].
Boundary (Internal/External) within_system (a) The software failure incident related to the Windows Azure outage was within the system. The incident was caused by a software bug triggered by the leap-year date, which prevented systems from calculating the correct time [10177]. The bug affected the system's ability to inspect digital certificates used for authentication, leading to critical systems being unable to communicate [10177].
Nature (Human/Non-human) non-human_actions (a) The software failure incident in the Azure outage was due to a software bug triggered by the Feb. 29 leap-year date, which is a non-human action [10177]. The bug prevented systems from calculating the correct time, leading to the outage. Additionally, the leap-year bug involved a "cert issue," which likely hampered functions that inspect digital certificates used for authentication, further contributing to the failure. (b) The article does not provide specific information about the software failure incident being directly caused by human actions.
Dimension (Hardware/Software) software (a) The article does not attribute the failure to hardware; the bug originated in the software itself [10177]. (b) The incident was caused by a software bug, triggered by the Feb. 29 leap-year date, that disrupted time calculation and the handling of digital certificates. It led to the outage and prevented critical systems from communicating [10177].
Objective (Malicious/Non-malicious) non-malicious (a) The software failure incident was non-malicious. It was caused by a software bug, triggered by the leap-year date, that prevented systems from calculating the correct time and affected Windows Azure and point-of-sale terminals in New Zealand supermarkets. The bug involved a "cert issue" that hampered functions that inspect digital certificates used for authentication, leaving critical systems unable to communicate and resulting in the outage. The incident was a technical glitch in date calculation and certificate validation rather than the result of malicious intent [10177].
Intent (Poor/Accidental Decisions) poor_decisions (a) The software failure incident related to the Windows Azure outage was primarily caused by a software bug triggered by the leap-year date, which prevented systems from calculating the correct time. This bug was likely a result of poor decisions in the coding or implementation of the system, as it led to critical systems being unable to communicate due to issues with inspecting digital certificates [10177]. The incident highlights the complexity and potential pitfalls in handling date calculations in software systems, emphasizing the importance of robust coding practices and thorough testing to prevent such failures.
Capability (Incompetence/Accidental) development_incompetence (a) The software failure incident related to the leap-year bug in Windows Azure and the point-of-sale terminals in New Zealand supermarkets can be attributed to development incompetence. The bug was triggered by the Feb. 29 leap-year date, which prevented systems from calculating the correct time. This issue arose due to a software bug that was not able to handle the leap-year date correctly, leading to a significant outage affecting customers [10177]. (b) The software failure incident can also be considered accidental as it was triggered by a leap-year bug that occurred due to the specific date of February 29. This unexpected event caused disruptions in the systems' ability to calculate the correct time, leading to the outage experienced by Windows Azure and the point-of-sale terminals in New Zealand supermarkets [10177].
Duration temporary (a) The software failure incident reported in the news article was temporary. It was caused by a software bug triggered by the Feb. 29 leap-year date that prevented systems from calculating the correct time. Microsoft was able to put a fix in place that restored service to most customers around 3 a.m. PST on Wednesday, a little more than nine hours after becoming aware of the issue [10177].
Behaviour crash (a) The software failure incident described in Article 10177 can be categorized as a crash. The Windows Azure outage was caused by a software bug triggered by the leap-year date, which prevented systems from calculating the correct time. This resulted in the system losing state and not performing its intended functions, leading to a significant outage for some customers [10177].

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence delay (e) delay: The article does not mention death, physical harm, loss of access to food or shelter, property loss, or impacts on non-human entities. The main consequence discussed is delay: the leap-year bug left some Windows Azure customers without service for more than 12 hours and affected point-of-sale terminals in New Zealand supermarkets [10177].
Domain information, utilities, other (a) The software failure incident affected the production and distribution of information, because it impacted Microsoft's Windows Azure cloud platform, which provides cloud services to customers [10177]. (g) The utilities label is an inference: the outage disrupted services hosted on Azure, which could include systems supporting power, gas, water, and other essential services for customers relying on the platform [10177]. (m) The incident also falls under "other" because it involved point-of-sale terminals in New Zealand supermarkets, which are crucial for the sales and retail industry [10177].

