Incident: Leap Second Glitch Causes Major Software Failures in Various Systems

Published Date: 2012-07-02

Postmortem Analysis
Timeline
1. The software failure incident involving the Amadeus airline reservation system and other high-profile websites occurred on June 30, 2012 [Article 13163].
2. In July 2022, Google, Microsoft, Meta, and Amazon launched a public effort to scrap the leap second because of the havoc it could cause to systems powering the internet [Article 131435].
System
1. Linux kernel's hrtimer subsystem [13163]
2. Cassandra database running on Reddit's servers [13059]
3. Tomcat servers on Gawker [13059]
4. Hadoop and ElasticSearch servers [13163]
Responsible Organization
1. The leap second added to the world's atomic clocks caused the software failure incident [32853, 37326, 131435].
2. The Linux kernel glitch, specifically the hrtimer subsystem, was a key factor in the software failure incident [32853, 13059].
3. The Network Time Protocol (NTP) used by systems to synchronize with atomic clocks also played a role in the software failure incident [13059].
4. Amadeus IT Group, the company responsible for the software that failed, was involved in the incident [32853, 13163].
Impacted Organization
1. Amadeus airline reservation system [Article 13163]
2. Reddit, Mozilla, Gawker, LinkedIn, FourSquare, Yelp [Article 13163]
3. Google, Microsoft, Meta, Amazon [Article 131435]
Software Causes
1. A glitch in the Linux kernel's hrtimer subsystem, which got confused by the leap second added to the world's atomic clocks, led to hyperactivity on servers and CPU overload [#13163].
2. On systems running Cassandra, an open-source database built with Java, Java processes failed to pause, got caught in spinning loops, and consumed CPU power, leading to server issues [#13059].
3. Some servers locked up when they received the Network Time Protocol (NTP) announcement of the upcoming leap second [#13059].
Non-software Causes
1. The failure incident was caused by the insertion of a leap second into the world's atomic clocks to keep them in sync with the Earth's rotation, which is gradually slowing down due to tidal effects from the moon [Article 37326].
2. The Earth's rotation slowing down by about two-thousandths of a second per day was a contributing factor to the failure incident [Article 37326].
3. The discrepancy between atomic time and the mathematically calculated time of Earth's date was another non-software cause of the failure incident [Article 37326].
4. The Earth's rotation slowing down due to the tidal pull from the moon was mentioned as a factor affecting the leap second adjustment and causing problems with computing systems [Article 13163].
Impacts
1. The software failure incident caused long delays in Qantas Airways flights in Brisbane, Perth, and Melbourne, with flight attendants having to check passengers in by hand [32853].
2. The leap second glitch in 2012 led to disruptions in the Amadeus airline reservation system for more than two hours, affecting over 3 million bookings and 1 billion transactions per day for major airlines worldwide [13163].
3. The leap second glitch in 2012 affected high-profile websites such as Mozilla, Reddit, Gawker, LinkedIn, FourSquare, and Yelp, causing outages and confusion among web servers around the world [13163].
4. The leap second glitch in 2012 caused problems with computing systems that plug into atomic clocks, leading to issues with Linux subsystems, Java processes, and CPU overloads on servers [13059].
5. The leap second glitch in 2012 resulted in Reddit being inoperable for about 30 to 40 minutes and entirely offline for about an hour and a half, with similar issues experienced by other web outfits like Gawker and Mozilla [13059].
Preventions
1. Implementing the "leap smear" technique pioneered by Google, which adds the extra second in small increments over a period of time to avoid sudden disruptions [Article 37326].
2. Updating systems to include the patch for the hrtimer glitch in the Linux kernel, which caused problems with the leap second adjustment [Article 13059].
3. Developing more agile computing systems that can handle the occasional leap second adjustments without overloading CPUs or causing system errors [Article 13163].
Fixes
1. Implementing a method like "leap smear", where the extra second is added gradually over a period of time instead of all at once, could fix the software failure incident caused by leap seconds [131435].
2. Updating systems to include patches that address glitches related to leap seconds, such as the hrtimer glitch in the Linux kernel, could help prevent similar incidents in the future [13059].
3. Developing more reliable and robust systems that can handle leap seconds without disruption, for example by improving the ability to detect and address bugs in advance, could also help [13163].
4. Abolishing leap seconds altogether, as has been proposed, would avoid the challenges of adjusting systems for them [131435].
References
1. Interviews with system administrators and engineers familiar with the problem [#13059]
2. Linux kernel mailing list posts [#13059]
3. Statements from company representatives (e.g., Amadeus spokesperson) [#13163]
4. Bug reports and discussions from companies affected by the software failure incident (e.g., Mozilla, Reddit) [#13059]
5. Statements from tech giants and key agencies (e.g., Google, NIST, BIPM) [#131435]
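
The "leap smear" listed under Preventions and Fixes can be illustrated with a minimal Python sketch. The articles only say that the extra second is added in small increments over a period of time; the 24-hour window, the linear ramp, and the names smear_fraction and smeared_elapsed are assumptions made here for illustration, not details of Google's implementation.

```python
def smear_fraction(elapsed: float, window: float = 86_400.0) -> float:
    """Fraction of the leap second absorbed after `elapsed` seconds of the window.

    Assumption: a linear ramp over a 24-hour window (hypothetical values).
    """
    if window <= 0:
        raise ValueError("window must be positive")
    # Clamp so the offset is 0 before the window and exactly 1 second after it.
    return min(1.0, max(0.0, elapsed / window))


def smeared_elapsed(elapsed: float, window: float = 86_400.0) -> float:
    """Seconds a smeared clock has advanced after `elapsed` real seconds.

    The smeared clock runs slightly slow during the window, so the extra
    second is absorbed gradually instead of being inserted in one step.
    """
    return elapsed - smear_fraction(elapsed, window)
```

Under these assumptions, a clock that follows smeared_elapsed keeps moving forward throughout the window and ends exactly one second behind an unsmeared count, so it never has to show a 61st second or step backward.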

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization, multiple_organization (a) The software failure incident having happened again at one_organization: - The Amadeus airline reservation system experienced a disruption due to the "leap second" glitch in 2012, causing delays and manual check-ins for Qantas flights [Article 13163]. - Amadeus faced a similar issue in January of the same year, resulting in an outage and disruptions to their services [Article 13163]. (b) The software failure incident having happened again at multiple_organization: - Other companies such as Mozilla, Reddit, Gawker, LinkedIn, FourSquare, Yelp, and Amazon's cloud services also faced issues due to the "leap second" glitch in 2012, causing outages and system confusion [Article 13163]. - In 2015, companies like Reddit, Foursquare, and Yelp experienced glitches due to the leap second affecting the underlying Linux operating system [Article 32853]. - Google, Microsoft, Meta, and Amazon launched a public effort in 2022 to scrap the leap second due to the havoc it could cause to systems powering the internet [Article 131435].
Phase (Design/Operation) design (a) The software failure incident related to the development phase can be seen in the incident caused by the leap second adjustment in the Linux kernel. The glitch in the Linux kernel's hrtimer subsystem, which was patched in March by a Linux kernel hacker named John Stultz, caused problems for systems like Reddit, Mozilla, and others [#13163]. This incident highlights a failure due to contributing factors introduced during system development. (b) The software failure incident related to the operation phase can be observed in the leap second glitch that affected Reddit, Mozilla, and other web services. The glitch caused servers to lock up, CPUs to overload, and applications to malfunction due to the leap second added to the world's atomic clocks, impacting the operation of these systems [#13059]. This incident showcases a failure due to contributing factors introduced during the operation or misuse of the system.
Boundary (Internal/External) within_system, outside_system (a) within_system: The software failure incident related to the leap second glitch experienced by various companies like Reddit, Mozilla, Gawker, LinkedIn, FourSquare, and Yelp was primarily caused by issues within the system. The leap second addition caused problems with computing systems that were not agile enough to handle the extra second, leading to hyperactivity on servers, CPU overloads, and system errors [Article 13163]. The glitch was traced back to a Linux kernel issue, specifically with the hrtimer subsystem, which got confused by the time change and caused the servers to lock up [Article 13059]. (b) outside_system: The leap second glitch itself was caused by external factors related to the Earth's rotation and the need to keep atomic clocks in sync with the planet's rotation. The addition of the leap second was necessary to prevent drifting away from accurate timekeeping due to the Earth's gradual slowing down [Article 37326]. The decision to add leap seconds is made by timekeeping authorities to ensure coordination between atomic time and Earth time, which is influenced by external factors like the Earth's rotation speed [Article 131435].
Nature (Human/Non-human) non-human_actions, human_actions (a) The software failure incident occurring due to non-human actions: - The software failure incident related to the leap second glitch in 2012 was caused by the addition of an extra second to the world's atomic clocks to keep them in sync with the Earth's rotation. This caused problems with computing systems that plug into these clocks but aren't agile enough to deal with the extra second [Article 13059]. - The leap second glitch in 2012 affected various web outfits, including Reddit, Gawker Media, and Mozilla, due to a glitch in the Linux kernel, particularly the hrtimer subsystem, which got confused by the time change and caused hyperactivity on servers, overloading the CPUs [Article 13059]. - The leap second glitch in 2012 led to issues with systems running Java apps such as Hadoop and ElasticSearch, causing servers to run into problems due to the leap second happening at midnight GMT [Article 13163]. (b) The software failure incident occurring due to human actions: - The leap second glitch in 2012 was exacerbated by the fact that some versions of Linux had not been updated to include a patch for the hrtimer glitch, which was fixed in the Linux kernel in March by a Linux kernel hacker named John Stultz [Article 13059]. - Some systems, such as Linux, use the Network Time Protocol (NTP) to plug into the world's atomic clocks and check the time. On Friday before the leap second in 2012, NTP began warning servers about the upcoming leap second, causing at least some Opera servers to lock up when they received the announcement [Article 13059].
Dimension (Hardware/Software) hardware, software (a) The software failure incident occurring due to hardware: - The incident involving the leap second glitch in 2012 caused problems with computing systems that plug into atomic clocks but aren't agile enough to deal with the extra second, leading to issues with servers and CPUs [Article 13059]. - The leap second glitch in 2012 affected web servers worldwide, including Reddit, Mozilla, Gawker, LinkedIn, FourSquare, and Yelp, due to a glitch in the Linux kernel and the hrtimer subsystem, causing hyperactivity on servers and CPU overload [Article 13163]. (b) The software failure incident occurring due to software: - The leap second glitch in 2012 was traced to a glitch in the Linux kernel, particularly the hrtimer subsystem, which got confused by the time change and caused hyperactivity on servers, leading to CPU overload [Article 13059]. - The leap second glitch in 2012 affected web servers like Reddit, Mozilla, Gawker, LinkedIn, FourSquare, and Yelp, due to issues with Linux and Java-based systems, such as Cassandra, Hadoop, and Tomcat, which failed to accommodate the leap second properly, causing CPU overload and system crashes [Article 13163].
Objective (Malicious/Non-malicious) non-malicious (a) The articles discuss software failure incidents related to leap seconds being added to the world's atomic clocks, causing problems with computing systems. These incidents were non-malicious in nature, as they were caused by the need to keep atomic clocks in sync with the Earth's rotation, leading to issues with various systems such as Reddit, Mozilla, Gawker, and others [32853, 37326, 131435, 13163, 13059]. (b) The software failures were not due to malicious intent but rather a consequence of the complexities involved in adjusting time systems to account for the Earth's rotation and the need to synchronize atomic clocks with the planet's rotation. The leap second adjustments caused glitches in various systems, leading to outages and disruptions, highlighting the challenges of dealing with time adjustments in computing systems.
Intent (Poor/Accidental Decisions) accidental_decisions (a) The intent of the software failure incident: - The software failure incident, in which the leap second glitch in the Linux kernel caused system errors and CPU overloads, was not due to poor decisions but rather to contributing factors introduced by the leap second adjustment and the way systems handled the extra second [32853, 37326, 131435]. - The leap second glitch in the Linux kernel was a result of the hrtimer subsystem getting confused by the time change, leading to hyperactivity on servers and CPU overload, impacting various web services like Reddit, Mozilla, and Gawker [13163, 13059]. (b) The intent of the software failure incident: - The leap second glitch that affected various web services like Reddit, Mozilla, and Gawker was an unintended consequence of the leap second adjustment and the systems' inability to properly accommodate the extra second, leading to system errors and CPU overloads [13163, 13059]. - The problems the leap second caused for computing systems were accidental, arising from the complexities of syncing up Earth time with atomic time and the challenges this poses to software systems [37326, 131435].
Capability (Incompetence/Accidental) accidental (a) The software failure incident occurring due to development incompetence: - The leap second glitch in 2012 caused problems with computing systems that plug into atomic clocks but aren't agile enough to deal with the extra second, leading to system errors and CPU overloads [Article 13163]. - The Linux kernel glitch related to the leap second in 2012 caused hyperactivity on servers, leading to CPU overload and system failures [Article 13059]. (b) The software failure incident occurring accidentally: - The leap second glitch in 2012 was introduced as a necessary adjustment to keep atomic clocks in sync with the Earth's rotation, but it caused problems with computer systems powering the internet due to the added time [Article 37326]. - The leap second glitch in 2012 was not intentional but was a result of the Earth's rotation slowing down, requiring the adjustment to keep atomic time in sync with Earth time [Article 13163].
Duration temporary (a) The software failure incident was temporary: - The software failure incident related to the leap second glitch was temporary and lasted for more than two hours on Sunday [Article 13163]. - Reddit, Mozilla, Gawker, LinkedIn, FourSquare, and Yelp were affected by the leap second glitch, causing confusion in web servers around the world [Article 13163]. - The leap second glitch caused problems with computing systems that plug into atomic clocks but are not agile enough to handle the extra second [Article 13059]. - The glitch in the Linux kernel, specifically the hrtimer subsystem, caused hyperactivity on servers, leading to CPU overload [Article 13059]. - Reddit experienced issues with its Cassandra servers, which failed to pause Java processes, resulting in CPU power consumption and server reboots to resolve the problem [Article 13059]. - Other systems like Hadoop and Tomcat also faced issues due to the leap second glitch, indicating a widespread impact on various platforms [Article 13059]. - Some servers started locking up before the leap second arrived when NTP began warning about the upcoming leap second, causing operational disruptions [Article 13059]. (b) The software failure incident was permanent: - There is no information in the articles to suggest that the software failure incident was permanent.
Behaviour crash, omission, value, other (a) crash: The software failure incident described in the articles can be categorized as a crash. The incident involved systems losing state and not performing their intended functions, leading to disruptions in services. For example, the Amadeus airline reservation system experienced disruptions for more than two hours, and high-profile websites like Mozilla, Reddit, Gawker, LinkedIn, FourSquare, and Yelp were affected by the "leap second" glitch, causing confusion and delays [Article 13163]. (b) omission: The software failure incident can also be categorized as an omission. The system omitted to perform its intended functions at an instance(s), particularly related to the leap second adjustment. Some systems experienced issues where the computer clock showed 60 seconds instead of rolling over to the next minute, causing the computer to see a leap second as time going backward, leading to system errors and CPU overloads [Article 37326]. (c) timing: The software failure incident can be categorized as a timing issue. The incident was related to the timing of the leap second adjustment and how systems handled the additional second. The leap second caused problems with computer systems that were not synchronized properly, leading to issues such as system errors and CPU overloads due to the discrepancy in timing between Earth time and atomic time [Article 131435]. (d) value: The software failure incident can be categorized as a value issue. The incident involved the system performing its intended functions incorrectly, particularly in handling the leap second adjustment. The incorrect handling of the leap second led to disruptions in services, crashes, and system errors, impacting various websites and services [Article 13059]. (e) byzantine: The software failure incident does not align with a byzantine behavior, which involves systems behaving erroneously with inconsistent responses and interactions. The incident primarily focused on the challenges related to the leap second adjustment and the impact on computer systems, rather than inconsistent or conflicting behaviors within the systems. (f) other: The software failure incident can be categorized as a glitch or anomaly. The incident involved unexpected disruptions and errors in the system due to the leap second adjustment, causing system failures, crashes, and delays in services. The unexpected behavior of the systems in response to the leap second can be considered as a glitch or anomaly in the software operation [Article 32853].
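
The clock behaviour described under (b) omission in the Behaviour row can be reproduced in a small Python illustration, which is not taken from the articles. It relies only on the standard library: POSIX-style timestamps count every day as exactly 86,400 seconds, so calendar.timegm gives the leap second 23:59:60 and the following midnight the same value, and datetime rejects a seconds field of 60. The helper name can_represent_leap_second is hypothetical.

```python
import calendar
from datetime import datetime

# 2012-06-30 23:59:60 UTC is the inserted leap second;
# 2012-07-01 00:00:00 UTC is the second that follows it.
leap_second = calendar.timegm((2012, 6, 30, 23, 59, 60, 0, 0, 0))
next_midnight = calendar.timegm((2012, 7, 1, 0, 0, 0, 0, 0, 0))

# Two distinct instants collapse onto one timestamp, so a clock that labels
# both of them has to repeat a value or step back.
timestamps_collide = leap_second == next_midnight


def can_represent_leap_second() -> bool:
    """Report whether datetime accepts a seconds field of 60."""
    try:
        datetime(2012, 6, 30, 23, 59, 60)
    except ValueError:
        return False
    return True
```

Software that assumes timestamps always increase, or that a minute never has a 61st second, can misbehave at exactly this point, which is consistent with the errors the articles describe.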

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence property, delay, non-human, theoretical_consequence (a) death: People lost their lives due to the software failure - No information about people losing their lives due to the software failure was mentioned in the articles. (b) harm: People were physically harmed due to the software failure - No information about people being physically harmed due to the software failure was mentioned in the articles. (c) basic: People's access to food or shelter was impacted because of the software failure - No information about people's access to food or shelter being impacted due to the software failure was mentioned in the articles. (d) property: People's material goods, money, or data was impacted due to the software failure - The software failure incident caused delays and disruptions in various services, such as airline reservations, flight check-ins, and online services like Reddit, Mozilla, LinkedIn, and Yelp. These disruptions could have potentially impacted businesses and users financially [Article 13163]. (e) delay: People had to postpone an activity due to the software failure - The software failure incident led to delays in airline flights, manual check-ins, and disruptions in online services, causing inconvenience and operational delays [Article 32853, Article 13163]. (f) non-human: Non-human entities were impacted due to the software failure - The software failure incident affected various systems, including airline reservation systems, online services, and computer servers, causing disruptions and glitches in their operations [Article 32853, Article 13163]. (g) no_consequence: There were no real observed consequences of the software failure - The software failure incidents described in the articles had observable consequences, such as delays, disruptions, and glitches in various systems and services. (h) theoretical_consequence: There were potential consequences discussed of the software failure that did not occur - The articles discussed potential consequences of the software failure related to the leap second adjustments causing problems with computer systems, internet outages, and the need for system adjustments to handle leap seconds. These potential consequences were addressed through measures like "leap smear" to mitigate the impact of leap seconds on computer clocks [Article 37326, Article 131435]. (i) other: Was there consequence(s) of the software failure not described in the (a to h) options? What is the other consequence(s)? - No other specific consequences of the software failure were mentioned in the articles.
Domain information, transportation, utilities (a) The failed system was intended to support the information industry, specifically affecting websites and services related to information dissemination and online platforms [13163]. (b) The transportation industry was indirectly impacted by the software failure incident as Qantas Airways experienced delays due to the system crash, affecting passengers traveling to and from various locations [32853]. (g) The utilities industry was indirectly affected as the software failure incident impacted systems powering the internet, potentially causing internet outages [37326]. (m) The software failure incident was not directly related to any other industry mentioned in the options.

Sources
