Incident: Japanese New Year Tweet Storm Crashes Twitter Worldwide

Published Date: 2014-09-02

Postmortem Analysis
Timeline 1. The incident occurred as the year 2012 arrived in Japan, when Twitter's entire service crashed worldwide [29695]. The incident therefore dates to the turn of the year into January 2012.
System 1. Twitter's entire service, which crashed worldwide under the synchronized tweets sent as the year 2012 arrived in Japan [29695].
Responsible Organization 1. The synchronized tweets from tens of thousands of Japanese users caused the software failure incident on Twitter during the New Year in 2012 [29695].
Impacted Organization 1. Twitter [29695]
Software Causes 1. Twitter's system could not handle the massive wave of synchronized Japanese tweets at the 2012 New Year, which crashed the entire service worldwide [29695].
Non-software Causes 1. Synchronized tweets from tens of thousands of Japanese users at exactly midnight during the New Year [29695]. 2. Real-time nature of the site where people expect instant sending and receiving at all times [29695]. 3. Massive amounts of real traffic across the globe, with 240 million users generating about 5,700 tweets per second [29695].
Impacts 1. Twitter's entire service crashed worldwide as the year 2012 arrived in Japan, under the massive wave of synchronized Japanese tweets [29695]. 2. The incident led Twitter to recognize that it needed a better way to handle such synchronized events, prompting the development of a new software framework for stress testing and monitoring [29695]. 3. The incident resulted in new monitoring tools and stress testing methods meant to prevent crashes during high-traffic events like the Japanese New Year tweets [29695]. 4. The incident prompted Twitter to rebuild its site using the Scala programming language and to expand into data centers in other parts of the world to better serve countries like Japan [29695].
Preventions 1. Implementing a robust stress testing framework: Twitter's new stress testing framework lets engineers mimic events like the Japanese New Year tweet storm and run synthetic tests on a large scale to ensure the site can handle such traffic spikes [29695]. 2. Continuous monitoring and scaling: the monitoring tools integrated into the framework let Twitter's engineers track test results second by second and scale the tests back as needed, keeping the site stable during peak usage periods [29695]. 3. Using Scala: rebuilding the site in the Scala programming language played a significant role in improving the site's performance and stability, which helps prevent future software failure incidents [29695].
Fixes 1. Building a new system, known as a software "framework", that mimics events like a Japanese New Year tweet storm and runs these synthetic events on the thousands of computers that drive the live site [29695]. 2. Conducting stress testing on a massive scale, using new monitoring tools to track the results of the tests second by second and scale them back as needed [29695]. 3. Rebuilding the site using the Scala programming language [29695]. 4. Expanding into data centers in other parts of the world to serve foreign countries like Japan with dedicated local machines [29695].
References 1. Mazdak Hashemi, Twitter's director of site reliability engineering [29695] 2. Raffi Krikorian, one of Twitter's lead engineers [29695] 3. Ali Alzabarah, who works alongside Hashemi [29695] 4. Adrian Cockcroft, a technology fellow with Battery Ventures [29695]
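The stress-testing approach described under Preventions and Fixes (ramp up synthetic load, watch the results second by second, scale back as needed) can be sketched in Python. This is a minimal illustration, not Twitter's framework: `run_ramp`, `fake_service`, the `probe` hook, and all rates, budgets, and capacities are hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    second: int
    offered_rate: int   # synthetic requests sent during this second
    error_rate: float   # fraction of those requests that failed


def run_ramp(
    probe: Callable[[int], float],
    start_rate: int,
    step: int,
    max_rate: int,
    error_budget: float,
    duration_s: int,
) -> List[Sample]:
    """Raise synthetic load once per second; back off when errors exceed the budget.

    `probe(rate)` is a hypothetical hook: it sends `rate` synthetic requests
    to the system under test and returns the observed failure fraction.
    """
    samples: List[Sample] = []
    rate = start_rate
    for second in range(duration_s):
        error_rate = probe(rate)
        samples.append(Sample(second, rate, error_rate))
        if error_rate > error_budget:
            # Scale the test back so the synthetic load does not take the site down.
            rate = max(start_rate, rate // 2)
        else:
            rate = min(max_rate, rate + step)
    return samples


def fake_service(capacity: int) -> Callable[[int], float]:
    """Stand-in for a real service: requests beyond `capacity` per second fail."""
    def probe(rate: int) -> float:
        return max(0, rate - capacity) / rate if rate else 0.0
    return probe


if __name__ == "__main__":
    for sample in run_ramp(fake_service(capacity=8_000), start_rate=1_000,
                           step=2_000, max_rate=20_000,
                           error_budget=0.05, duration_s=8):
        print(sample)
```

Against the stand-in service with an assumed capacity of 8,000 requests per second, the ramp climbs until errors exceed the 5% budget at 9,000, halves the load, and then climbs again. A real probe would send traffic to the system under test and read its error metrics instead.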

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization, multiple_organization (a) The software failure incident related to Twitter crashing due to synchronized Japanese tweets happened again within the same organization. After the incident in 2012 when the Japanese New Year tweets brought down Twitter's service worldwide, Twitter's engineers, led by Mazdak Hashemi, worked on building a new stress testing framework to prevent such failures in the future. The new system proved successful, and Twitter managed to stay up during subsequent events such as the wave of Japanese tweets timed to a particular moment in the television airing of an animated movie [29695]. (b) Handling massive traffic and stress testing at scale, the problem Twitter faced, is a common challenge for other online services as well. Adrian Cockcroft, a technology fellow with Battery Ventures, mentioned that as services grow to enormous scale, off-the-shelf testing products fail, and companies need to synthesize traffic patterns that actually matter. He highlighted that Netflix, another company dealing with high online traffic, has open-sourced tools for testing its site, similar to Twitter's approach of sharing software creations with the larger community [29695].
Phase (Design/Operation) design (a) The software failure incident related to the design phase can be seen in the article where it mentions how Twitter experienced a crash during the New Year in Japan in 2012 due to the synchronized tweets from Japanese users. This incident prompted Twitter's director of site reliability engineering to work on building a new system or software framework to handle such events in the future [29695]. (b) The software failure incident related to the operation phase is evident in the same article when it discusses the stress testing conducted by Twitter's engineering team to mimic events like the Japanese New Year tweet storm and run synthetic creations on the live site. This testing was crucial due to the real-time nature of Twitter's service, where users expect instant sending and receiving of tweets at all times, making it essential to ensure the system could handle such massive traffic without crashing [29695].
Boundary (Internal/External) within_system (a) within_system: The software failure incident mentioned in the article was primarily due to factors originating from within the system. Specifically, the failure occurred when Twitter's service crashed worldwide as a result of the synchronized tweets from Japan during the New Year in 2012 [29695]. This incident prompted Twitter's engineers to develop a new stress testing framework to mimic and handle such massive events within the system to prevent future failures. The stress testing framework included new monitoring tools to track the results of tests and scale them back as needed, ultimately ensuring the site stayed up during subsequent events like the Japanese tweet storm at the arrival of a particular moment in a television airing [29695].
Nature (Human/Non-human) non-human_actions, human_actions (a) The software failure incident related to non-human actions occurred when the synchronized tweets from Japan at the arrival of the New Year in 2012 caused Twitter's entire service to crash worldwide [29695]. This incident was not due to human actions but rather the massive influx of tweets at exactly midnight from tens of thousands of Japanese users, overwhelming the system. (b) The software failure incident related to human actions involved the response from Twitter's lead engineers, particularly Raffi Krikorian, who urged the director of site reliability engineering, Mazdak Hashemi, to find a better way to handle the next wave of synchronized Japanese tweets after the 2012 New Year crash [29695]. This incident highlights the importance of human intervention and decision-making in addressing and preventing software failures.
Dimension (Hardware/Software) software (a) The software failure incident mentioned in the article was not due to hardware issues but rather due to the overwhelming synchronized tweets from Japanese users causing Twitter's service to crash [29695]. (b) The software failure incident was attributed to the software itself, as the synchronized tweets from Japanese users during the New Year in 2012 caused Twitter's service to crash globally. This incident prompted Twitter's engineers to develop a new software framework for stress testing and monitoring to prevent such failures in the future [29695].
Objective (Malicious/Non-malicious) non-malicious (a) The software failure incident related to the Japanese New Year tweets crashing Twitter's entire service in 2012 was non-malicious. It was caused by the massive wave of synchronized tweets from Japanese users overwhelming the system, leading to a crash [29695]. The incident prompted Twitter's engineers to develop a new stress testing framework to simulate and handle such large-scale events in the future to prevent similar failures.
Intent (Poor/Accidental Decisions) (a) The software failure incident related to the Japanese New Year tweets crashing Twitter's entire service in 2012 was not due to poor decisions but rather due to the overwhelming synchronized tweets from Japan causing the site to go down [29695]. The incident prompted Twitter's lead engineers to find a better way to handle such events in the future, leading to the development of a new stress testing framework to mimic and handle massive traffic spikes like the Japanese New Year tweet storm. This incident was more about the challenge of handling unexpected high traffic loads rather than poor decisions.
Capability (Incompetence/Accidental) accidental (a) The article gives no evidence of development incompetence. (b) The article points to accidental factors: Japanese users tweeted in sync at the arrival of the New Year in 2012, and Twitter's entire service crashed worldwide. The crash was not intentional but rather a result of the massive wave of synchronized tweets overwhelming the system [29695].
Duration temporary The software failure incident mentioned in the article was temporary. It occurred specifically during the arrival of the New Year in Japan in 2012 when the synchronized tweets from Japanese users caused Twitter's entire service to crash worldwide [29695]. This incident prompted Twitter's engineers to develop a new stress testing framework to ensure the site could handle similar events in the future. The new system proved successful as it helped the site stay up during subsequent New Year events and even when the Japanese set a new tweets-per-second record during the airing of an animated movie [29695].
Behaviour crash, timing, other (a) crash: The software failure incident described in the article was a crash. Specifically, on the arrival of the year 2012 in Japan, the synchronized tweets from the Japanese users caused Twitter's entire service to crash worldwide [29695]. (b) omission: There is no specific mention of a failure due to omission in the provided article. (c) timing: The software failure incident was related to timing. The crash occurred when the Japanese users tweeted at exactly midnight, causing a massive influx of synchronized tweets that overwhelmed Twitter's service [29695]. (d) value: There is no indication of a failure due to the system performing its intended functions incorrectly in the provided article. (e) byzantine: The software failure incident does not align with a byzantine failure, which involves inconsistent responses and interactions. (f) other: The behavior of the software failure incident can be categorized as a unique case of overwhelming demand due to synchronized events, specifically the New Year tweets from Japanese users, which stressed the system beyond its capacity [29695].
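The timing behaviour described above can be illustrated with a small queueing sketch: the same number of requests is harmless when spread evenly but overwhelms a fixed-capacity server when it all arrives in one second. The 5,700 per second figure is the average rate cited in the article; the `simulate` function, the capacity, the queue limit, and the shape of the burst are illustrative assumptions, not measurements of Twitter's system.

```python
from typing import List, Tuple


def simulate(arrivals: List[int], capacity: int, queue_limit: int) -> Tuple[int, int]:
    """Return (served, dropped) for per-second arrivals at a fixed-capacity server.

    Each second the server handles up to `capacity` requests, up to
    `queue_limit` more wait for the next second, and the rest are dropped.
    """
    backlog = 0
    served = 0
    dropped = 0
    for arriving in arrivals:
        pending = backlog + arriving
        handled = min(pending, capacity)
        served += handled
        pending -= handled
        backlog = min(pending, queue_limit)
        dropped += pending - backlog
    served += backlog  # whatever is still queued drains once arrivals stop
    return served, dropped


if __name__ == "__main__":
    average_rate = 5_700   # tweets per second, the average cited in the article
    capacity = 10_000      # hypothetical requests handled per second
    queue_limit = 20_000   # hypothetical number of requests allowed to wait

    steady = [average_rate] * 60                 # one minute of evenly spread traffic
    burst = [average_rate * 60] + [0] * 59       # the same total, all in one second

    print("steady:", simulate(steady, capacity, queue_limit))
    print("burst: ", simulate(burst, capacity, queue_limit))
```

Under these assumed numbers the evenly spread minute is served in full, while the synchronized burst drops most of its requests even though the total volume is identical. That is the sense in which the failure is one of timing rather than of overall traffic level.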

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence delay The primary consequence of the 2012 New Year crash was a delay [29695]. The synchronized tweets from Japan overwhelmed the system and Twitter's entire service crashed worldwide, so users could not access or use the platform until service was restored. The article highlights the efforts made by Twitter's engineering team to prevent a similar incident in the future, emphasizing the importance of stress testing and building a new software framework to handle such high-traffic events effectively.
Domain information (a) The failed system was related to the information industry, specifically Twitter's mini-messaging service, which experienced a crash due to the synchronized tweets from Japanese users during the New Year in 2012 [29695]. The incident led to the development of a new stress testing framework to prevent such failures in the future.
